cs.LG: 182 papers today
Large language models (35 papers)
【1】Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models
Link: https://arxiv.org/abs/2604.08527
Authors: Feng Luo, Yu-Neng Chuang, Guanchu Wang, Zicheng Xu, Xiaotian Han, Tianyi Zhang, Vladimir Braverman
Abstract: On-policy distillation (OPD) trains student models under their own induced distribution while leveraging supervision from stronger teachers. We identify a failure mode of OPD: as training progresses, on-policy rollouts can undergo abrupt length inflation, causing truncated trajectories to dominate the training data. This truncation collapse coincides with abrupt repetition saturation and induces biased gradient signals, leading to severe training instability and sharp degradation in validation performance. We attribute this problem to the interaction between student-induced data collection and the distillation objective, which implicitly favors long and repetitive rollouts. To address this issue, we propose StableOPD, a stabilized OPD framework that combines a reference-based divergence constraint with rollout mixture distillation. Together, these mitigate repetition-induced length inflation and further stabilize OPD training. Across multiple math reasoning datasets, our approach prevents truncation collapse, stabilizes training dynamics, and improves performance by 7.2% on average.
【2】SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions
Link: https://arxiv.org/abs/2604.08477
Authors: Ashima Suvarna, Kendrick Phan, Mehrab Beikzadeh, Hritik Bansal, Saadia Gabriel
Comments: 23 pages, 4 figures
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has significantly improved large language model (LLM) reasoning in formal domains such as mathematics and code. Despite these advancements, LLMs still struggle with general reasoning tasks requiring capabilities such as causal inference and temporal understanding. Extending RLVR to general reasoning is fundamentally constrained by the lack of high-quality, verifiable training data that spans diverse reasoning skills. To address this challenge, we propose SUPERNOVA, a data curation framework for RLVR aimed at enhancing general reasoning. Our key insight is that instruction-tuning datasets containing expert-annotated ground truth encode rich reasoning patterns that can be systematically adapted for RLVR. To study this, we conduct 100+ controlled RL experiments to analyze how data design choices impact downstream reasoning performance. In particular, we investigate three key factors: (i) source task selection, (ii) task mixing strategies, and (iii) synthetic interventions for improving data quality. Our analysis reveals that source task selection is non-trivial and has a significant impact on downstream reasoning performance. Moreover, selecting tasks based on their performance for individual target tasks outperforms strategies based on overall average performance. Finally, models trained on SUPERNOVA outperform strong baselines (e.g., Qwen3.5) on challenging reasoning benchmarks including BBEH, Zebralogic, and MMLU-Pro. In particular, training on SUPERNOVA yields relative improvements of up to 52.8% on BBEH across model sizes, demonstrating the effectiveness of principled data curation for RLVR. Our findings provide practical insights for curating human-annotated resources to extend RLVR to general reasoning. The code and data are available at https://github.com/asuvarna31/supernova.
【3】Dead Weights, Live Signals: Feedforward Graphs of Frozen Language Models
Link: https://arxiv.org/abs/2604.08335
Authors: Marcus Armstrong, Navid Ayoobi, Arjun Mukherjee
Abstract: We present a feedforward graph architecture in which heterogeneous frozen large language models serve as computational nodes, communicating through a shared continuous latent space via learned linear projections. Building on recent work demonstrating geometric compatibility between independently trained LLM latent spaces \cite{armstrong2026thinking}, we extend this finding from static two-model steering to end-to-end trainable multi-node graphs, where projection matrices are optimized jointly via backpropagation through residual stream injection hooks. Three small frozen models (Llama-3.2-1B, Qwen2.5-1.5B, Gemma-2-2B) encode the input into a shared latent space whose aggregate signal is injected into two larger frozen models (Phi-3-mini, Mistral-7B), whose representations feed a lightweight cross-attention output node. With only 17.6M trainable parameters against approximately 12B frozen, the architecture achieves 87.3% on ARC-Challenge, 82.8% on OpenBookQA, and 67.2% on MMLU, outperforming the best single constituent model by 11.4, 6.2, and 1.2 percentage points respectively, and outperforming parameter-matched learned classifiers on frozen single models by 9.1, 5.2, and 6.7 points. Gradient flow through multiple frozen model boundaries is empirically verified to be tractable, and the output node develops selective routing behavior across layer-2 nodes without explicit supervision.
【4】Lost in the Hype: Revealing and Dissecting the Performance Degradation of Medical Multimodal Large Language Models in Image Classification
Link: https://arxiv.org/abs/2604.08333
Authors: Xun Zhu, Fanbin Mo, Xi Chen, Kaili Zheng, Shaoshuai Yang, Yiming Shi, Jian Gao, Miao Li, Ji Wu
Abstract: The rise of multimodal large language models (MLLMs) has sparked an unprecedented wave of applications in the field of medical imaging analysis. However, as one of the earliest and most fundamental tasks integrated into this paradigm, medical image classification reveals a sobering reality: state-of-the-art medical MLLMs consistently underperform compared to traditional deep learning models, despite their overwhelming advantages in pre-training data and model parameters. This paradox prompts a critical rethinking: where exactly does the performance degradation originate? In this paper, we conduct extensive experiments on 14 open-source medical MLLMs across three representative image classification datasets. Moving beyond superficial performance benchmarking, we employ feature probing to track the information flow of visual features module-by-module and layer-by-layer throughout the entire MLLM pipeline, enabling explicit visualization of where and how classification signals are distorted, diluted, or overridden. As the first attempt to dissect classification performance degradation in medical MLLMs, our findings reveal four failure modes: 1) quality limitation in visual representation, 2) fidelity loss in connector projection, 3) comprehension deficit in LLM reasoning, and 4) misalignment of semantic mapping. Meanwhile, we introduce quantitative scores that characterize the healthiness of feature evolution, enabling principled comparisons across diverse MLLMs and datasets. Furthermore, we provide insightful discussions centered on the critical barriers that prevent current medical MLLMs from fulfilling their promised clinical potential. We hope that our work provokes rethinking within the community, highlighting that the road from high expectations to clinically deployable MLLMs remains long and winding.
【5】Small Vision-Language Models are Smart Compressors for Long Video Understanding
Link: https://arxiv.org/abs/2604.08120
Authors: Junjie Fei, Jun Chen, Zechun Liu, Yunyang Xiong, Chong Zhou, Wei Wen, Junlin Han, Mingchen Zhuge, Saksham Suri, Qi Qian, Shuming Liu, Lemeng Wu, Raghuraman Krishnamoorthi, Vikas Chandra, Mohamed Elhoseiny, Chenchen Zhu
Comments: Project page and demo are available at https://FeiElysia.github.io/tempo-page/
Abstract: Adapting Multimodal Large Language Models (MLLMs) for hour-long videos is bottlenecked by context limits. Dense visual streams saturate token budgets and exacerbate the lost-in-the-middle phenomenon. Existing heuristics, like sparse sampling or uniform pooling, blindly sacrifice fidelity by discarding decisive moments and wasting bandwidth on irrelevant backgrounds. We propose Tempo, an efficient query-aware framework compressing long videos for downstream understanding. Tempo leverages a Small Vision-Language Model (SVLM) as a local temporal compressor, casting token reduction as an early cross-modal distillation process to generate compact, intent-aligned representations in a single forward pass. To enforce strict budgets without breaking causality, we introduce Adaptive Token Allocation (ATA). Exploiting the SVLM's zero-shot relevance prior and semantic front-loading, ATA acts as a training-free $O(1)$ dynamic router. It allocates dense bandwidth to query-critical segments while compressing redundancies into minimal temporal anchors to maintain the global storyline. Extensive experiments show our 6B architecture achieves state-of-the-art performance with aggressive dynamic compression (0.5-16 tokens/frame). On the extreme-long LVBench (4101s), Tempo scores 52.3 under a strict 8K visual budget, outperforming GPT-4o and Gemini 1.5 Pro. Scaling to 2048 frames reaches 53.7. Crucially, Tempo compresses hour-long videos substantially below theoretical limits, proving true long-form video understanding relies on intent-driven efficiency rather than greedily padded context windows.
【6】Initialisation Determines the Basin: Efficient Codebook Optimisation for Extreme LLM Quantization
Link: https://arxiv.org/abs/2604.08118
Authors: Ian W. Kennedy, Nafise Sadat Moosavi
Comments: 9 pages (+ references and appendix). Under review at ACL Rolling Review
Abstract: Additive quantization enables extreme LLM compression with O(1) lookup-table dequantization, making it attractive for edge deployment. Yet at 2-bit precision, it often fails catastrophically, even with extensive search and finetuning. We show that the dominant bottleneck is codebook initialisation. Greedy sequential initialisation frequently places the model in poor optimisation regions that subsequent beam search and PV-tuning struggle to overcome. We analyse this behaviour through the representational ratio $\rho = N/KM$, which characterises the relationship between weight groups and codebook capacity, and propose OA-EM, an output-aware EM initialisation method using Hessian-weighted Mahalanobis distance. Across compression rates, search budgets, and three architectures (Llama 3.2 3B, Llama 3.1 8B, Qwen 2.5 3B), OA-EM consistently produces better solutions after PV-tuning and dominates the quality-compute frontier. The severity of the bottleneck scales with $\rho$: moderate at 3 bpp but extreme at 2 bpp, where poor initialisation can degrade perplexity by orders of magnitude. More broadly, our results highlight the importance of optimisation geometry in compressed model spaces, where initialisation can dominate subsequent search and fine-tuning.
【7】LINE: LLM-based Iterative Neuron Explanations for Vision Models
Link: https://arxiv.org/abs/2604.08039
Authors: Vladimir Zaigrajew, Michał Piechota, Gaspar Sekula, Przemysław Biecek
Abstract: Interpreting the concepts encoded by individual neurons in deep neural networks is a crucial step towards understanding their complex decision-making processes and ensuring AI safety. Despite recent progress in neuron labeling, existing methods often limit the search space to predefined concept vocabularies or produce overly specific descriptions that fail to capture higher-order, global concepts. We introduce LINE, a novel, training-free iterative approach tailored for open-vocabulary concept labeling in vision models. Operating in a strictly black-box setting, LINE leverages a large language model and a text-to-image generator to iteratively propose and refine concepts in a closed loop, guided by activation history. We demonstrate that LINE achieves state-of-the-art performance across multiple model architectures, yielding AUC improvements of up to 0.18 on ImageNet and 0.05 on Places365, while discovering, on average, 29% of new concepts missed by massive predefined vocabularies. Beyond identifying the top concept, LINE provides a complete generation history, which enables polysemanticity evaluation and produces supporting visual explanations that rival gradient-dependent activation maximization methods.
【8】Wiring the 'Why': A Unified Taxonomy and Survey of Abductive Reasoning in LLMs
Link: https://arxiv.org/abs/2604.08016
Authors: Moein Salimi, Shaygan Adim, Danial Parnian, Nima Alighardashi, Mahdi Jafari Siavoshani, Mohammad Hossein Rohban
Abstract: Regardless of its foundational role in human discovery and sense-making, abductive reasoning--the inference of the most plausible explanation for an observation--has been relatively underexplored in Large Language Models (LLMs). Despite the rapid advancement of LLMs, the exploration of abductive reasoning and its diverse facets has thus far been disjointed rather than cohesive. This paper presents the first survey of abductive reasoning in LLMs, tracing its trajectory from philosophical foundations to contemporary AI implementations. To address the widespread conceptual confusion and disjointed task definitions prevalent in the field, we establish a unified two-stage definition that formally categorizes prior work. This definition disentangles abduction into Hypothesis Generation, where models bridge epistemic gaps to produce candidate explanations, and Hypothesis Selection, where the generated candidates are evaluated and the most plausible explanation is chosen. Building upon this foundation, we present a comprehensive taxonomy of the literature, categorizing prior work based on its abductive tasks, datasets, underlying methodologies, and evaluation strategies. In order to ground our framework empirically, we conduct a compact benchmark study of current LLMs on abductive tasks, together with targeted comparative analyses across model sizes, model families, evaluation styles, and the distinct generation-versus-selection task typologies. Moreover, by synthesizing recent empirical results, we examine how LLM performance on abductive reasoning relates to deductive and inductive tasks, providing insights into their broader reasoning capabilities. Our analysis reveals critical gaps in current approaches--from static benchmark design and narrow domain coverage to narrow training frameworks and limited mechanistic understanding of abductive processes...
【9】A Decomposition Perspective to Long-context Reasoning for LLMs
Link: https://arxiv.org/abs/2604.07981
Authors: Yanling Xiao, Huaibing Xie, Guoliang Zhao, Shihan Dou, Shaolei Wang, Yiting Liu, Nantao Zheng, Cheng Zhang, Pluto Zhou, Zhisong Zhang, Lemao Liu
Abstract: Long-context reasoning is essential for complex real-world applications, yet remains a significant challenge for Large Language Models (LLMs). Despite the rapid evolution in long-context reasoning, current research often overlooks the internal complexity of the long-context reasoning task itself. In this paper, we move beyond this holistic view and decompose long-context reasoning into a set of fundamental atomic skills, and we then automatically synthesize a suite of pseudo datasets, each explicitly targeting a specific atomic skill. Our empirical analysis confirms that proficiency in these atomic skills is strongly correlated with general long-text reasoning performance. Building on this insight, we employ reinforcement learning on these pseudo datasets to sharpen the model's atomic skills, in the hope of boosting its general long-context reasoning ability. Extensive experiments across multiple benchmarks demonstrate the effectiveness of our approach: it outperforms a strong baseline by an average margin of 7.7% (improving from 46.3% to 54.0%) across Loogle, Loong, LongBench-v2, BrowscompLong, Ruler-qa2, and MRCR.
【10】Rethinking Data Mixing from the Perspective of Large Language Models
Link: https://arxiv.org/abs/2604.07963
Authors: Yuanjian Xu, Tianze Sun, Changwei Xu, XinLong Zhao, Jianing Hao, Ran Chen, Yang Liu, Ruijie Xu, Stephen Chen, Guang Zhang
Abstract: Data mixing strategy is essential for large language model (LLM) training. Empirical evidence shows that inappropriate strategies can significantly reduce generalization. Although recent methods have improved empirical performance, several fundamental questions remain open: what constitutes a domain, whether human and model perceptions of domains are aligned, and how domain weighting influences generalization. We address these questions by establishing formal connections between gradient dynamics and domain distributions, offering a theoretical framework that clarifies the role of domains in training dynamics. Building on this analysis, we introduce DoGraph, a reweighting framework that formulates data scheduling as a graph-constrained optimization problem. Extensive experiments on GPT-2 models of varying scales demonstrate that DoGraph consistently achieves competitive performance.
【11】Rethinking Residual Errors in Compensation-based LLM Quantization
Link: https://arxiv.org/abs/2604.07955
Authors: Shuaiting Li, Juncan Deng, Kedong Xu, Rongtao Deng, Hong Gu, Minghan Jiang, Haibin Shen, Kejie Huang
Comments: ICLR'26 camera ready
Abstract: Methods based on weight compensation, which iteratively apply quantization and weight compensation to minimize the output error, have recently demonstrated remarkable success in quantizing Large Language Models (LLMs). The representative work, GPTQ, introduces several key techniques that make such iterative methods practical for LLMs with billions of parameters. GPTAQ extends this approach by introducing an asymmetric calibration process that aligns the output of each quantized layer with its full-precision counterpart, incorporating a residual error into the weight compensation framework. In this work, we revisit the formulation of the residual error. We identify a sub-optimal calibration objective in existing methods: during the intra-layer calibration process, they align the quantized output with the output from compensated weights, rather than the true output from the original full-precision model. Therefore, we redefine the objective to precisely align the quantized model's output with the original output of the full-precision model at each step. We then reveal that the residual error originates not only from the output difference of the preceding layer but also from the discrepancy between the compensated and original weights within each layer, which we name the 'compensation-aware error'. By inheriting the neuron decomposition technique from GPTAQ, we can efficiently incorporate this compensation-aware error into the weight update process. Extensive experiments on various LLMs and quantization settings demonstrate that our proposed enhancements integrate seamlessly with both GPTQ and GPTAQ, significantly improving their quantization performance. Our code is publicly available at https://github.com/list0830/ResComp.
【12】Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning
Link: https://arxiv.org/abs/2604.07941
Authors: Shiwan Zhao, Zhihu Wang, Xuyang Zhao, Jiaming Zhou, Caiyue Xu, Chenfei Liu, Liting Zhang, Yuhang Jia, Yanzhe Zhang, Hualong Yu, Zichen Xu, Qicheng Li, Yong Qin
Comments: 38 pages, 1 figure, 8 tables
Abstract: Post-training has become central to turning pretrained large language models (LLMs) into aligned and deployable systems. Recent progress spans supervised fine-tuning (SFT), preference optimization, reinforcement learning (RL), process supervision, verifier-guided methods, distillation, and multi-stage pipelines. Yet these methods are often discussed in fragmented ways, organized by labels or objective families rather than by the behavioral bottlenecks they address. This survey argues that LLM post-training is best understood as structured intervention on model behavior. We organize the field first by trajectory provenance, which defines two primary learning regimes: off-policy learning on externally supplied trajectories, and on-policy learning on learner-generated rollouts. We then interpret methods through two recurring roles -- effective support expansion, which makes useful behaviors more reachable, and policy reshaping, which improves behavior within already reachable regions -- together with a complementary systems-level role, behavioral consolidation, which preserves, transfers, and amortizes behavior across stages and model transitions. This perspective yields a unified reading of major paradigms. SFT may serve either support expansion or policy reshaping, whereas preference-based methods are usually off-policy reshaping. On-policy RL often improves behavior on learner-generated states, though under stronger guidance it can also make hard-to-reach reasoning paths reachable. Distillation is often best understood as consolidation rather than only compression, and hybrid pipelines emerge as coordinated multi-stage compositions. Overall, the framework helps diagnose post-training bottlenecks and reason about stage composition, suggesting that progress in LLM post-training increasingly depends on coordinated system design rather than any single dominant objective.
【13】Dynamic Attentional Context Scoping: Agent-Triggered Focus Sessions for Isolated Per-Agent Steering in Multi-Agent LLM Orchestration
Link: https://arxiv.org/abs/2604.07911
Authors: Nickson Patel
Comments: 15 pages, 4 figures, preprint
Abstract: Multi-agent LLM orchestration systems suffer from context pollution: when N concurrent agents compete for the orchestrator's context window, each agent's task state, partial outputs, and pending questions contaminate the steering interactions of every other agent, degrading decision quality. We introduce Dynamic Attentional Context Scoping (DACS), a mechanism in which the orchestrator operates in two asymmetric modes. In Registry mode it holds only lightweight per-agent status summaries (<=200 tokens each), remaining responsive to all agents and the user. When an agent emits a SteeringRequest, the orchestrator enters Focus(a_i) mode, injecting the full context of agent a_i while compressing all other agents to their registry entries. Context isolation is agent-triggered, asymmetric, and deterministic: the context window contains exactly F(a_i) + R_{-i} during steering, eliminating cross-agent contamination without requiring context compression or retrieval. We evaluate DACS across four experimental phases totalling 200 trials: Phase 1 tests N in {3,5,10} (60 trials); Phase 2 tests agent heterogeneity and adversarial dependencies (60 trials); Phase 3 tests decision density up to D=15 (40 trials); Phase 4 uses autonomous LLM agents for free-form questions (40 trials, Claude Haiku 4.5). Across all 8 synthetic scenarios, DACS achieves 90.0--98.4% steering accuracy versus 21.0--60.0% for a flat-context baseline (p < 0.0001 throughout), with wrong-agent contamination falling from 28--57% to 0--14% and context efficiency ratios of up to 3.53x. The accuracy advantage grows with N and D; keyword matching is validated by LLM-as-judge across all phases (mean kappa=0.909). In Phase 4, DACS outperforms the flat-context baseline by +17.2pp at N=3 (p=0.0023) and +20.4pp at N=5 (p=0.0008), and two independent judges confirm that the advantage grows with N.
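The two modes described in this abstract can be illustrated with a minimal sketch. It is not the author's implementation: the `Agent` class, the `build_context` helper, the bracketed labels, and the whitespace-based token truncation are all assumptions made for illustration. It only shows the stated invariant that, during steering, the context holds the focused agent's full context plus the registry entries of every other agent.

```python
from dataclasses import dataclass

SUMMARY_TOKEN_LIMIT = 200  # per-agent registry budget stated in the abstract


@dataclass
class Agent:
    name: str
    summary: str       # lightweight status summary (registry entry)
    full_context: str  # full task state, injected only in Focus mode


def registry_entry(agent: Agent) -> str:
    # Assumption: whitespace-separated words stand in for real tokens.
    words = agent.summary.split()[:SUMMARY_TOKEN_LIMIT]
    return f"[{agent.name}] " + " ".join(words)


def build_context(agents, focus=None) -> str:
    """Registry mode when focus is None, otherwise Focus(a_i) mode.

    In Focus mode the result is F(a_i) + R_{-i}: the focused agent's full
    context plus the registry entries of all other agents.
    """
    parts = []
    for agent in agents:
        if focus is not None and agent.name == focus:
            parts.append(f"[{agent.name}: FOCUS] {agent.full_context}")
        else:
            parts.append(registry_entry(agent))
    return "\n".join(parts)
```

Calling `build_context(agents)` gives the Registry-mode window, and `build_context(agents, focus="a2")` gives the window used while steering the hypothetical agent `a2`.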
【14】Bit-by-Bit: Progressive QAT Strategy with Outlier Channel Splitting for Stable Low-Bit LLMs
Link: https://arxiv.org/abs/2604.07888
Authors: Binxing Xu, Hao Gu, Lujun Li, Hao Wang, Bei Liu, Jiacheng Liu, Qiyuan Zhu, Xintong Yang, Chao Li, Sirui Han, Yike Guo
Abstract: Training LLMs at ultra-low precision remains a formidable challenge. Direct low-bit QAT often suffers from convergence instability and substantial training costs, exacerbated by quantization noise from heavy-tailed outlier channels and error accumulation across layers. To address these issues, we present Bit-by-Bit, a progressive QAT framework with outlier channel splitting. Our approach integrates three key components: (1) block-wise progressive training that reduces precision stage by stage, ensuring stable initialization for low-bit optimization; (2) a nested structure of integer quantization grids to enable a "train once, deploy any precision" paradigm, allowing a single model to support multiple bit-widths without retraining; (3) rounding-aware outlier channel splitting, which mitigates quantization error while acting as an identity transform that preserves the quantized outputs. Furthermore, we follow microscaling groups with E4M3 scales, capturing dynamic activation ranges in alignment with OCP/NVIDIA standards. To address the lack of efficient 2-bit kernels, we developed custom operators for both W2A2 and W2A16 configurations, achieving up to 11$\times$ speedup over BF16. Under W2A2 settings, Bit-by-Bit significantly outperforms baselines like BitDistiller and EfficientQAT on both Llama2/3, achieving a loss of only 2.25 WikiText2 PPL compared to full-precision models.
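As a rough illustration of why channel splitting can act as an identity transform, the sketch below shows the generic idea in full precision: the largest-magnitude weight is halved and its input channel duplicated, so the dot product is unchanged while the weight range shrinks. This is not the paper's rounding-aware variant, whose details are not given in the abstract; the function names are hypothetical.

```python
def split_outlier_channel(weights, inputs):
    """Duplicate the largest-magnitude input channel and halve its weight.

    Generic channel splitting only; the rounding-aware handling described
    in the abstract is not modeled here.
    """
    k = max(range(len(weights)), key=lambda j: abs(weights[j]))
    half = weights[k] / 2.0
    new_weights = weights[:k] + [half, half] + weights[k + 1:]
    new_inputs = inputs[:k] + [inputs[k], inputs[k]] + inputs[k + 1:]
    return new_weights, new_inputs


def dot(w, x):
    # Output of a single linear neuron without bias.
    return sum(wi * xi for wi, xi in zip(w, x))
```

After the split, the largest absolute weight is smaller, so a fixed quantization grid can cover the remaining range with a finer step, while `dot` returns the same value before and after the transform.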
【15】GRASS: Gradient-based Adaptive Layer-wise Importance Sampling for Memory-efficient Large Language Model Fine-tuning
标题:GRASS:基于梯度的自适应逐层重要性采样,用于内存高效的大型语言模型微调
链接:https://arxiv.org/abs/2604.07808
作者:Kaiyuan Tian,Yu Tang,Gongqingjian Jiang,Baihui Liu,Yifu Gao,Xialin Su,Linbo Qiao,Dongsheng Li
备注:Accepted by ACL 2026 Findings
摘要:大型语言模型的全参数微调受到大量GPU内存需求的限制。低秩自适应方法通过仅更新参数的子集来缓解这一挑战。然而,这些方法通常限制了模型的表达能力,并且比全参数微调产生更低的性能。分层微调方法已经成为一种替代方案,通过静态层重要性采样策略实现内存高效的训练。然而,这些方法忽略了跨任务和训练阶段的层重要性的变化,导致下游任务的性能不佳。为了解决这些限制,我们提出了GRASS,一个基于梯度的自适应逐层重要性采样框架。GRASS利用平均梯度范数作为用于估计层重要性的任务感知和训练阶段感知度量。此外,GRASS通过自适应训练策略自适应地调整层采样概率。我们还引入了一个逐层优化器状态卸载机制,该机制重叠了计算和通信,以进一步减少内存使用,同时保持相当的训练吞吐量。在多个模型和基准测试中进行的大量实验表明,GRASS始终优于最先进的方法,平均精度提高了4.38个点,内存使用量减少了19.97%。
摘要:Full-parameter fine-tuning of large language models is constrained by substantial GPU memory requirements. Low-rank adaptation methods mitigate this challenge by updating only a subset of parameters. However, these approaches often limit model expressiveness and yield lower performance than full-parameter fine-tuning. Layer-wise fine-tuning methods have emerged as an alternative, enabling memory-efficient training through static layer importance sampling strategies. However, these methods overlook variations in layer importance across tasks and training stages, resulting in suboptimal performance on downstream tasks. To address these limitations, we propose GRASS, a gradient-based adaptive layer-wise importance sampling framework. GRASS utilizes mean gradient norms as a task-aware and training-stage-aware metric for estimating layer importance. Furthermore, GRASS adaptively adjusts layer sampling probabilities through an adaptive training strategy. We also introduce a layer-wise optimizer state offloading mechanism that overlaps computation and communication to further reduce memory usage while maintaining comparable training throughput. Extensive experiments across multiple models and benchmarks demonstrate that GRASS consistently outperforms state-of-the-art methods, achieving an average accuracy improvement of up to 4.38 points and reducing memory usage by up to 19.97\%.
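The idea of using mean gradient norms as a layer-importance signal can be sketched as follows. This is a hedged illustration only: normalizing the norms into probabilities and drawing layers without replacement are assumptions made for the example, and the names `layer_sampling_probs` and `sample_layers` are hypothetical, not taken from the paper.

```python
import random


def layer_sampling_probs(mean_grad_norms):
    """Turn per-layer mean gradient norms into sampling probabilities.

    Assumption: probabilities are proportional to the norms.
    """
    total = sum(mean_grad_norms)
    if total <= 0:
        # Fall back to uniform sampling when no gradient signal is available.
        return [1.0 / len(mean_grad_norms)] * len(mean_grad_norms)
    return [g / total for g in mean_grad_norms]


def sample_layers(probs, k, rng):
    """Draw k distinct layer indices, weighted by probs."""
    remaining = list(range(len(probs)))
    chosen = []
    for _ in range(min(k, len(remaining))):
        # A tiny floor keeps the draw valid even if some probabilities are zero.
        weights = [probs[i] + 1e-12 for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return sorted(chosen)


probs = layer_sampling_probs([1.0, 3.0, 4.0])
active_layers = sample_layers(probs, k=2, rng=random.Random(0))
```

Recomputing `probs` periodically during training would make the sampling both task-aware and training-stage-aware, which is the property the abstract emphasizes.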
【16】MIPT-SSM: Scaling Language Models with $O(1)$ Inference Cache via Phase Transitions
标题:MIPT-SSM:通过相变以$O(1)$推理缓存扩展语言模型
链接:https://arxiv.org/abs/2604.07716
作者:Yasong Fan
备注:6 pages, 8 tables
摘要:我们提出了MIPT-SSM,一种建立在测量诱导相变(MIPT)物理基础上的神经序列架构。其核心思想是一个可学习的测量率$p_{t}\in(0,1)$,它在两种机制之间路由计算:波相$(p_{t}\rightarrow0)$,其中信息以分布式复相位干涉的形式传播;粒子相$(p_{t}\rightarrow1)$,其中状态坍缩到当前词元上,从而实现精确的局部存储。可以证明这两种机制在单一线性算子中不相容,这是序列建模中少数的“不可行定理”(no-go theorem)之一,而$p_{t}$正是我们绕开它的方式。模型预计在临界序列长度$N^{*}\approx1024$处出现相变,此时信息密度比$N/D$越过1,这与我们观察到的内存缩放规律一致。在AG News(四分类)上,MIPT的准确率为0.905,而Transformer为0.736(+16.6%),在3个随机种子上保持稳定。在$N=8192$时,MIPT需要810 MB,而Transformer需要34,651 MB,内存减少了42.8倍。在精确召回(“大海捞针”)任务上,我们的因果稀疏KV缓存达到0.968的准确率。值得注意的是,在缓存容量不受限的情况下,$p_{t}$门自主学会仅存储单个关键词元(平均使用$1.0/512$个槽位),过滤掉所有噪声并实现99.8%的稀疏率。在语言建模(WikiText-103,31M参数)上,采用$K=64$缓存的MIPT-LM达到PPL 92.1,而Transformer为90.5(差距:1.8%),同时推理KV缓存从$O(N)$缩小到$O(64)$。
摘要:We present MIPT-SSM, a neural sequence architecture built on the physics of Measurement-Induced Phase Transitions (MIPT). The central idea is a learned measurement rate $p_{t}\in(0,1)$ that routes computation between two regimes: wave phase $(p_{t}\rightarrow0)$, where information propagates as distributed complex-phase interference; and particle phase $(p_{t}\rightarrow1)$, where the state collapses onto the current token, enabling precise local storage. These two regimes are provably incompatible in a single linear operator (one of the few "no-go theorems" in sequence modeling), and $p_{t}$ is our way around it. The model is predicted to exhibit a phase transition at critical sequence length $N^{*}\approx1024$, where the information density ratio $N/D$ crosses unity, consistent with our memory scaling observations. On AG News (four-class classification), MIPT achieves 0.905 accuracy versus Transformer's 0.736 (+16.6%), stable across 3 seeds. At $N=8192$, MIPT requires 810 MB versus Transformer's 34,651 MB, a 42.8x memory reduction. On exact-recall ("needle-in-a-haystack"), our causal sparse KV cache achieves 0.968 accuracy. Remarkably, under unbounded cache capacity, the $p_{t}$ gate autonomously learns to store only the single critical token (averaging $1.0/512$ slots used), filtering out all noise and achieving a 99.8% sparsity rate. On language modeling (WikiText-103, 31M parameters), MIPT-LM with $K=64$ cache reaches PPL 92.1 versus Transformer's 90.5 (gap: 1.8%), while inference KV cache shrinks from $O(N)$ to $O(64)$.
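The memory and perplexity ratios quoted in the abstract can be re-derived from its own figures. The short Python check below uses only numbers stated above; the final line, comparing cache entries at $N=8192$ against $K=64$, is an illustrative reading of the $O(N)$ versus $O(64)$ claim rather than a reported result.

```python
# Figures taken from the abstract.
transformer_mb = 34651
mipt_mb = 810
mipt_ppl = 92.1
transformer_ppl = 90.5

# Memory reduction factor at N = 8192.
memory_ratio = transformer_mb / mipt_mb

# Relative perplexity gap, in percent of the Transformer baseline.
ppl_gap_percent = (mipt_ppl - transformer_ppl) / transformer_ppl * 100

# Illustrative only: cache entries kept at N = 8192 versus a fixed K = 64.
cache_entry_ratio = 8192 // 64

print(round(memory_ratio, 1), round(ppl_gap_percent, 1), cache_entry_ratio)
```

Rounded to one decimal place, the first two values agree with the 42.8x and 1.8% figures in the abstract.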
【17】Reinforcement Learning with LLM-Guided Action Spaces for Synthesizable Lead Optimization
标题:利用LLM引导的动作空间进行强化学习,以实现可合成的先导化合物优化
链接:https://arxiv.org/abs/2604.07669
作者:Tao Li,Kaiyuan Hou,Tuan Vinh,Monika Raj,Zhichun Guo,Carl Yang
摘要:药物发现中的先导化合物优化需要在改善治疗特性的同时,确保所提出的分子修饰对应于可行的合成路线。现有的方法要么优先考虑性质得分而不强制保证可合成性,要么依赖于在大型反应网络上进行昂贵的枚举,而直接应用大型语言模型(LLM)经常会产生化学上无效的结构。我们介绍MolReAct,一个将先导化合物优化表述为马尔可夫决策过程的框架,其动作空间受合成约束,由经过验证的反应模板定义。一个工具增强的LLM代理充当动态反应环境,调用专门的化学分析工具来识别反应位点,并根据匹配的模板提出有化学依据的转化。通过组相对策略优化(GRPO)训练的策略模型在这些受约束的动作中进行选择,以最大化多步反应轨迹上的长期oracle奖励。基于SMILES的缓存机制进一步将端到端优化时间减少了约43%。在来自Therapeutic Data Commons的13个性质优化任务和一个基于结构的对接任务中,MolReAct的平均Top-10得分为0.563,相对最强的可合成基线提升10.4%,并在14个任务中的10个上取得最佳样本效率。消融实验证实,工具增强的反应建议和轨迹级策略优化带来了互补的增益。通过将每一步都建立在经过验证的反应模板上,MolReAct生成的分子既改善了性质,又各自附带明确的合成路径。
摘要:Lead optimization in drug discovery requires improving therapeutic properties while ensuring that proposed molecular modifications correspond to feasible synthetic routes. Existing approaches either prioritize property scores without enforcing synthesizability, or rely on expensive enumeration over large reaction networks, while direct application of Large Language Models (LLMs) frequently produces chemically invalid structures. We introduce MolReAct, a framework that formulates lead optimization as a Markov Decision Process over a synthesis-constrained action space defined by validated reaction templates. A tool-augmented LLM agent serves as a dynamic reaction environment that invokes specialized chemical analysis tools to identify reactive sites and propose chemically grounded transformations from matched templates. A policy model trained via Group Relative Policy Optimization (GRPO) selects among these constrained actions to maximize long-term oracle reward across multi-step reaction trajectories. A SMILES-based caching mechanism further reduces end-to-end optimization time by approximately 43%. Across 13 property optimization tasks from the Therapeutic Data Commons and one structure-based docking task, MolReAct achieves an average Top-10 score of 0.563, outperforming the strongest synthesizable baseline by 10.4% in relative improvement, and attains the best sample efficiency on 10 of 14 tasks. Ablations confirm that both tool-augmented reaction proposals and trajectory-level policy optimization contribute complementary gains. By grounding every step in validated reaction templates, MolReAct produces molecules that are property-improved and each accompanied by an explicit synthetic pathway.
【18】SAGE: Sign-Adaptive Gradient for Memory-Efficient LLM Optimization
标题:SAGE:用于内存高效LLM优化的符号自适应梯度
链接:https://arxiv.org/abs/2604.07663
作者:Wooin Lee,Hyun-Tae Kim
备注:Accepted to Findings of the Association for Computational Linguistics: ACL 2026. 13 pages, 4 figures, 4 tables
摘要:AdamW优化器虽然是LLM预训练的标准,但它是一个关键的内存瓶颈,消耗的优化器状态相当于模型大小的两倍。虽然像SinkGD这样的轻状态优化器试图解决这个问题,但我们发现了嵌入层的困境:这些方法无法处理嵌入固有的稀疏,高方差梯度,迫使混合设计恢复到AdamW并部分否定内存增益。我们提出了SAGE(符号自适应优化),一种新的优化器,解决了这个困境,取代AdamW在这个混合结构。SAGE结合了狮子风格的更新方向与一个新的,内存高效的$O(d)$自适应规模。这个尺度充当了一个“安全阻尼器”,可证明以1.0为界,它比现有方法更有效地驯服了高方差维度。这种卓越的稳定性使SAGE能够实现更好的收敛。在高达1.3B参数的Llama模型上,我们基于SAGE的混合实现了新的最先进的困惑,优于所有基线,包括SinkGD混合,同时显着减少优化器状态内存。
摘要:The AdamW optimizer, while standard for LLM pretraining, is a critical memory bottleneck, consuming optimizer states equivalent to twice the model's size. Although light-state optimizers like SinkGD attempt to address this issue, we identify the embedding layer dilemma: these methods fail to handle the sparse, high-variance gradients inherent to embeddings, forcing a hybrid design that reverts to AdamW and partially negates the memory gains. We propose SAGE (Sign Adaptive GradiEnt), a novel optimizer that resolves this dilemma by replacing AdamW in this hybrid structure. SAGE combines a Lion-style update direction with a new, memory-efficient $O(d)$ adaptive scale. This scale acts as a "safe damper," provably bounded by 1.0, which tames high-variance dimensions more effectively than existing methods. This superior stability allows SAGE to achieve better convergence. On Llama models up to 1.3B parameters, our SAGE-based hybrid achieves new state-of-the-art perplexity, outperforming all baselines, including SinkGD hybrid, while significantly reducing optimizer state memory.
【19】Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs
标题:监护人作为顾问:为值得信赖的LLM推进下一代监护人模型
链接:https://arxiv.org/abs/2604.07655
作者:Yue Huang,Haomin Zhuang,Jiayi Ye,Han Bao,Yanbo Wang,Hang Hua,Siyuan Wu,Pin-Yu Chen,Xiangliang Zhang
摘要:硬门安全检查器经常过度拒绝并与供应商的模型规范不一致;流行的分类法还忽视了鲁棒性和诚实性,产生了纸面上更安全但实用性较差的系统。这项工作引入了Guardian-as-an-Advisor(GaaA),这是一种软门控管道,其中监护人预测二元风险标签加上简洁的解释,并将此建议预先添加到原始查询中以进行重新推理,保持基础模型在其原始规范下运行。为了支持训练和评估,构建了GuardSet,一个208k+的多域数据集,将有害和无害的案例与有针对性的鲁棒性和诚实切片统一起来。GuardAdvisor通过SFT进行训练,然后进行RL,以加强标签解释的一致性。GuardAdvisor在实现咨询工作流的同时获得了具有竞争力的检测准确性;当用于增强输入时,响应优于未增强的提示。一项延迟研究表明,顾问推理使用了低于5%的基本模型计算,并且在实际有害输入率下仅增加了2-10%的端到端开销。总体而言,GaaA引导模型符合模型规范,在保持安全性的同时减少过度拒绝。
摘要:Hard-gated safety checkers often over-refuse and misalign with a vendor's model spec; prevailing taxonomies also neglect robustness and honesty, yielding safer-on-paper yet less useful systems. This work introduces Guardian-as-an-Advisor (GaaA), a soft-gating pipeline where a guardian predicts a binary risk label plus a concise explanation and prepends this advice to the original query for re-inference, keeping the base model operating under its original spec. To support training and evaluation, GuardSet is constructed, a 208k+ multi-domain dataset unifying harmful and harmless cases with targeted robustness and honesty slices. GuardAdvisor is trained via SFT followed by RL to enforce label-explanation consistency. GuardAdvisor attains competitive detection accuracy while enabling the advisory workflow; when used to augment inputs, responses improve over unaugmented prompts. A latency study shows advisor inference uses below 5% of base-model compute and adds only 2-10% end-to-end overhead under realistic harmful-input rates. Overall, GaaA steers models to comply with the model spec, maintaining safety while reducing over-refusal.
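The soft-gating flow described above (predict a risk label and a short explanation, then prepend that advice to the original query for re-inference) can be sketched in Python. Everything here is hypothetical: `toy_guardian` is a keyword stand-in for a trained guardian model, and the advice format is an assumption made for the example, not the paper's prompt template.

```python
from dataclasses import dataclass


@dataclass
class Advice:
    risky: bool        # binary risk label
    explanation: str   # concise rationale for the label


def toy_guardian(query: str) -> Advice:
    """Hypothetical stand-in for the guardian model (keyword match only)."""
    if "explosive" in query.lower():
        return Advice(True, "query asks about hazardous material")
    return Advice(False, "no risk indicators found")


def build_augmented_prompt(query: str, advice: Advice) -> str:
    """Prepend the advice to the unchanged query instead of blocking it."""
    label = "risky" if advice.risky else "safe"
    header = f"[Guardian advice] label={label}; reason: {advice.explanation}"
    return f"{header}\n\n{query}"


query = "How do I bake sourdough bread?"
prompt = build_augmented_prompt(query, toy_guardian(query))
```

The design point is that the base model still sees the full query and decides how to respond under its own spec; the guardian only contributes context, in contrast to a hard gate that would refuse outright.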
【20】Blink: CPU-Free LLM Inference by Delegating the Serving Stack to GPU and SmartNIC
标题:Blink:通过将服务栈委托给GPU和SmartNIC实现无CPU的LLM推理
链接:https://arxiv.org/abs/2604.07609
作者:Mohammad Siavashi,Mariano Scazzariello,Gerald Q. Maguire,Dejan Kostić,Marco Chiesa
摘要:大型语言模型(LLM)推理正在迅速成为核心数据中心服务,但当前的服务堆栈将主机CPU保持在编排和令牌级控制的关键路径上。这使得LLM性能对CPU干扰敏感,破坏了应用程序托管并迫使运营商保留CPU余量,使大量容量未被利用。 我们引入了Blink,这是一种端到端的服务架构,通过在SmartNIC和GPU之间重新分配责任,将主机CPU从稳态推理路径中移除。Blink将请求处理卸载到SmartNIC,SmartNIC通过RDMA将输入直接传递到GPU内存,并将主机驱动的调度替换为持久的GPU内核,该内核执行缓存、调度和KV缓存管理,而无需CPU参与。 通过与TensorRT-LLM、vLLM和SGLang进行对比评估,Blink即使在隔离状态下也优于所有基线,可将预饱和P99 TTFT降低高达8.47$\times$,将P99 TPOT降低高达3.40$\times$,将解码吞吐量提高高达2.1$\times$,并将每个令牌的能量降低高达48.6$\%$。在CPU干扰下,Blink保持稳定的性能,而现有系统的性能下降高达两个数量级。
摘要:Large Language Model (LLM) inference is rapidly becoming a core datacenter service, yet current serving stacks keep the host CPU on the critical path for orchestration and token-level control. This makes LLM performance sensitive to CPU interference, undermining application colocation and forcing operators to reserve CPU headroom, leaving substantial capacity unutilized. We introduce Blink, an end-to-end serving architecture that removes the host CPU from the steady-state inference path by redistributing responsibilities across a SmartNIC and a GPU. Blink offloads request handling to the SmartNIC, which delivers inputs directly into GPU memory via RDMA, and replaces host-driven scheduling with a persistent GPU kernel that performs batching, scheduling, and KV-cache management without CPU involvement. Evaluated against TensorRT-LLM, vLLM, and SGLang, Blink outperforms all baselines even in isolation, reducing pre-saturation P99 TTFT by up to 8.47$\times$ and P99 TPOT by up to 3.40$\times$, improving decode throughput by up to 2.1$\times$, and reducing energy per token by up to 48.6$\%$. Under CPU interference, Blink maintains stable performance, while existing systems degrade by up to two orders of magnitude.
【21】CAMO: A Class-Aware Minority-Optimized Ensemble for Robust Language Model Evaluation on Imbalanced Data
标题:CAMO:一种面向不平衡数据上鲁棒语言模型评估的类感知少数类优化集成方法
链接:https://arxiv.org/abs/2604.07583
作者:Mohamed Ehab,Ali Hamdi,Khaled Shaban
摘要:类别不平衡严重阻碍了现实世界的分类任务,因为传统集成方法偏向多数类,从而降低了少数类的表现和整体F1分数。我们针对不平衡问题提出了一种集成技术,称为CAMO(Class-Aware Minority-Optimized)。通过一个结合投票分布、置信度校准和模型间不确定性的分层过程,CAMO动态提升代表性不足的类别,同时保留并放大少数类预测。我们在两个高度不平衡的特定领域基准上验证了CAMO:DIAR-AI/Emotion数据集和三分类的BEA 2025数据集。我们在zero-shot和微调设置下,使用八种不同的语言模型(三种LLM和五种SLM)与七种成熟的集成算法进行对比。在微调后的模型上,CAMO始终获得最高的严格宏F1分数,树立了新的基准。它的优势与模型适配相辅相成,表明最佳集成选择取决于模型属性。这表明CAMO是一个可靠的、领域中立的不平衡分类框架。
摘要:Real-world categorization is severely hampered by class imbalance because traditional ensembles favor majority classes, which lowers minority performance and overall F1-score. We provide a unique ensemble technique for imbalanced problems called CAMO (Class-Aware Minority-Optimized). Through a hierarchical procedure that incorporates vote distributions, confidence calibration, and inter-model uncertainty, CAMO dynamically boosts underrepresented classes while preserving and amplifying minority forecasts. We verify CAMO on two highly unbalanced, domain-specific benchmarks: the DIAR-AI/Emotion dataset and the ternary BEA 2025 dataset. We benchmark against seven proven ensemble algorithms using eight different language models (three LLMs and five SLMs) under zero-shot and fine-tuned settings. With refined models, CAMO consistently earns the greatest strict macro F1-score, setting a new benchmark. Its benefit works in concert with model adaptation, showing that the best ensemble choice depends on model properties. This proves that CAMO is a reliable, domain-neutral framework for unbalanced categorization.
【22】Learning is Forgetting: LLM Training As Lossy Compression
标题:学习是遗忘:LLM训练作为有损压缩
链接:https://arxiv.org/abs/2604.07569
作者:Henry C. Conklin,Tom Hosking,Tan Yi-Chern,Julian Gold,Jonathan D. Cohen,Thomas L. Griffiths,Max Bartolo,Seraphina Goldfarb-Tarrant
备注:12 page core paper, 16 page Appendix - A shorter version with fewer visuals appears at ICLR 2026
摘要:尽管大型语言模型(LLM)越来越流行,但我们对它们的表示空间是如何构造的仍然有有限的了解。这限制了我们解释它们如何学习和学习什么,或将它们与人类学习联系起来的能力。我们认为LLM最好被视为有损压缩的一个实例,在这种情况下,通过训练,它们通过只保留与其目标相关的训练数据中的信息来学习。我们在模型中显示了预训练结果,这些模型被最佳压缩用于下一个序列预测,接近压缩的信息瓶颈界限。在一系列开放权重模型中,每个模型的压缩方式都不同,这可能是由于所使用的数据和训练配方的差异。然而,即使在不同的LLM系列中,模型压缩的最优性以及其中存在的信息也可以预测各种基准的下游性能,让我们直接将代表性结构与关于模型性能的可操作见解联系起来。在一般情况下,这里提出的工作为这些模型如何学习提供了一个统一的信息理论框架,可以大规模部署。
摘要:Despite the increasing prevalence of large language models (LLMs), we still have a limited understanding of how their representational spaces are structured. This limits our ability to interpret how and what they learn or relate them to learning in humans. We argue LLMs are best seen as an instance of lossy compression, where over training they learn by retaining only information in their training data relevant to their objective(s). We show pre-training results in models that are optimally compressed for next-sequence prediction, approaching the Information Bottleneck bound on compression. Across an array of open weights models, each compresses differently, likely due to differences in the data and training recipes used. However, even across different families of LLMs, the optimality of a model's compression, and the information present in it, can predict downstream performance across a wide array of benchmarks, letting us directly link representational structure to actionable insights about model performance. In the general case, the work presented here offers a unified Information-Theoretic framing for how these models learn that is deployable at scale.
【23】Reasoning-Based Refinement of Unsupervised Text Clusters with LLMs
标题:使用LLM对无监督文本集群进行基于推理的细化
链接:https://arxiv.org/abs/2604.07562
作者:Tunazzina Islam
备注:Accepted to the Findings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
摘要:无监督方法被广泛用于从大型文本集合中归纳潜在语义结构,但它们的输出通常包含不连贯、冗余或基础差的聚类,这些聚类在没有标记数据的情况下难以验证。本文提出了一个基于推理的改进框架,该框架利用大语言模型(LLM)作为语义判断器,而不是嵌入生成器,来验证和重构任意无监督聚类算法的输出,该框架引入了三个推理阶段:(i)一致性验证,LLM评估聚类摘要是否得到其成员文本的支持;(ii)冗余判定,其中候选聚类基于语义重叠被合并或拒绝;以及(iii)标签基础,其中聚类以完全无监督的方式被分配可解释的标签。这种设计从结构验证中学习表示,并减轻了仅嵌入方法的常见故障模式。我们从两个具有不同交互模型的平台上评估了真实世界社交媒体语料库的框架,证明了经典主题模型和最近基于表示的基线在聚类一致性和人类对齐标签质量方面的一致改进。人类评估显示与LLM生成的标签高度一致,尽管没有黄金标准注释。我们进一步在匹配的时间和体积条件下进行鲁棒性分析,以评估跨平台稳定性。除了经验上的收获,我们的研究结果表明,基于LLM的推理可以作为一种通用机制,用于验证和改进无监督的语义结构,从而在没有监督的情况下对大型文本集进行更可靠和更可解释的分析。
摘要:Unsupervised methods are widely used to induce latent semantic structure from large text collections, yet their outputs often contain incoherent, redundant, or poorly grounded clusters that are difficult to validate without labeled data. We propose a reasoning-based refinement framework that leverages large language models (LLMs) not as embedding generators, but as semantic judges that validate and restructure the outputs of arbitrary unsupervised clustering algorithms. Our framework introduces three reasoning stages: (i) coherence verification, where LLMs assess whether cluster summaries are supported by their member texts; (ii) redundancy adjudication, where candidate clusters are merged or rejected based on semantic overlap; and (iii) label grounding, where clusters are assigned interpretable labels in a fully unsupervised manner. This design decouples representation learning from structural validation and mitigates common failure modes of embedding-only approaches. We evaluate the framework on real-world social media corpora from two platforms with distinct interaction models, demonstrating consistent improvements in cluster coherence and human-aligned labeling quality over classical topic models and recent representation-based baselines. Human evaluation shows strong agreement with LLM-generated labels, despite the absence of gold-standard annotations. We further conduct robustness analyses under matched temporal and volume conditions to assess cross-platform stability. Beyond empirical gains, our results suggest that LLM-based reasoning can serve as a general mechanism for validating and refining unsupervised semantic structure, enabling more reliable and interpretable analyses of large text collections without supervision.
【24】From LLM to Silicon: RL-Driven ASIC Architecture Exploration for On-Device AI Inference
标题:从LLM到硅:RL驱动的ASIC架构探索设备上人工智能推理
链接:https://arxiv.org/abs/2604.07526
作者:Ravindra Ganti,Steve Xu
备注:25 pages, 12 figures, 21 tables
摘要:我们提出了一个RL驱动的编译器,它联合优化了ASIC架构,内存层次结构和工作负载分区,用于3 nm到28 nm的AI推理。设计空间被制定为一个单一的马尔可夫决策过程与混合离散连续的行动和一个统一的功率性能面积(PPA)的目标。具有混合专家门控的Soft Actor-Critic(SAC)探索了网状拓扑、每核微架构和操作符放置的联合空间。我们在两个工作负载上进行了验证,Llama 3.1 8B FP 16(高性能模式,3 nm时每秒29809个令牌)和SmolVLM(低功耗模式,所有节点均小于13 mW,10 MHz)。在7个过程节点中,RL自动适应网格大小和每瓦片配置,包括异构FETCH,VLEN和内存分配,而无需特定于节点的手动重新调整。
摘要:We present an RL-driven compiler that jointly optimizes ASIC architecture, memory hierarchy, and workload partitioning for AI inference across 3nm to 28nm. The design space is formulated as a single Markov Decision Process with mixed discrete-continuous actions and a unified Power-Performance-Area (PPA) objective. Soft Actor-Critic (SAC) with Mixture-of-Experts gating explores the joint space of mesh topology, per-core microarchitecture, and operator placement. We validate on two workloads, Llama 3.1 8B FP16 (high-performance mode, 29809 tokens per second at 3nm) and SmolVLM (low-power mode, less than 13 mW at all nodes, 10 MHz). Across 7 process nodes, the RL automatically adapts mesh sizes and per-tile configurations, including heterogeneous FETCH, VLEN, and memory allocation without node-specific manual retuning.
【25】Triage: Routing Software Engineering Tasks to Cost-Effective LLM Tiers via Code Quality Signals
标题:分类:通过代码质量信号将软件工程任务路由到经济高效的LLM层级
链接:https://arxiv.org/abs/2604.07494
作者:Lech Madeyski
备注:5 pages, 1 figure
摘要:内容:AI编码代理将每个任务路由到单个前沿大型语言模型(LLM),即使许多任务都是常规任务,也要支付额外的推理成本。 目的:我们提出了Triage,一个框架,使用代码健康指标-软件可维护性的指标-作为路由信号,将每个任务分配给最便宜的模型层,其输出通过相同的验证门作为昂贵的模型。 方法:分类定义了三个功能层(轻型、标准、重型-镜像,例如,Haiku,Sonnet,Opus),并根据预先计算的代码健康子因素和任务元数据路由任务。我们设计了一个评估,比较了SWE-Bench Lite(三个模型层的300个任务)上的三种路由策略:启发式阈值,经过训练的ML分类器和完美的后见之明预言。 结果如下:我们分析得出了两个可证伪的条件下,依赖于层的不对称性(中等LLM受益于干净的代码,而前沿模型没有)产生成本效益的路由:健康代码的轻层通过率必须超过层间成本比,代码健康必须区分所需的模型层,至少有一个小的效果大小($\hat{p} \geq 0.56$)。 结论:Triage将诊断代码质量度量转换为可操作的模型选择信号。我们提出了一个严格的评估协议,以测试成本-质量的权衡,并确定哪些代码的健康子因素驱动路由决策。
摘要:Context: AI coding agents route every task to a single frontier large language model (LLM), paying premium inference cost even when many tasks are routine. Objectives: We propose Triage, a framework that uses code health metrics -- indicators of software maintainability -- as a routing signal to assign each task to the cheapest model tier whose output passes the same verification gate as the expensive model. Methods: Triage defines three capability tiers (light, standard, heavy -- mirroring, e.g., Haiku, Sonnet, Opus) and routes tasks based on pre-computed code health sub-factors and task metadata. We design an evaluation comparing three routing policies on SWE-bench Lite (300 tasks across three model tiers): heuristic thresholds, a trained ML classifier, and a perfect-hindsight oracle. Results: We analytically derived two falsifiable conditions under which the tier-dependent asymmetry (medium LLMs benefit from clean code while frontier models do not) yields cost-effective routing: the light-tier pass rate on healthy code must exceed the inter-tier cost ratio, and code health must discriminate the required model tier with at least a small effect size ($\hat{p} \geq 0.56$). Conclusion: Triage transforms a diagnostic code quality metric into an actionable model-selection signal. We present a rigorous evaluation protocol to test the cost--quality trade-off and identify which code health sub-factors drive routing decisions.
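A heuristic-threshold router of the kind the abstract lists, together with its first cost-effectiveness condition, might look like the Python sketch below. The 1-to-10 code health scale, the threshold values, and the reading of "inter-tier cost ratio" as light-tier cost divided by heavy-tier cost are all assumptions made for illustration.

```python
def light_tier_is_cost_effective(light_pass_rate, light_cost, heavy_cost):
    """Condition from the abstract: the light-tier pass rate on healthy code
    must exceed the inter-tier cost ratio (assumed here to be light / heavy)."""
    return light_pass_rate > light_cost / heavy_cost


def route(code_health, light_threshold=8.0, standard_threshold=5.0):
    """Heuristic-threshold routing on a hypothetical 1-10 code health score."""
    if code_health >= light_threshold:
        return "light"      # healthy code: cheapest tier
    if code_health >= standard_threshold:
        return "standard"
    return "heavy"          # unhealthy code: most capable tier


tiers = [route(score) for score in (9.2, 6.0, 3.1)]
worth_it = light_tier_is_cost_effective(light_pass_rate=0.4,
                                        light_cost=1.0, heavy_cost=15.0)
```

Under these made-up prices the light tier only needs to pass more than about 6.7% of healthy-code tasks for routing to pay off, which shows why the condition is easy to test empirically once real pass rates and prices are known.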
【26】Fast Heterogeneous Serving: Scalable Mixed-Scale LLM Allocation for SLO-Constrained Inference
标题:快速异类服务:用于SLO约束推理的可扩展混合规模LLM分配
链接:https://arxiv.org/abs/2604.07472
作者:Jiaming Cheng,Duong Tung Nguyen
摘要:大规模部署大型语言模型(LLM)推理需要联合选择基础模型,配置异构GPU,配置并行性,并在严格的延迟,准确性和预算限制下分配工作负载。精确混合整数线性规划(MILP)方法保证最优性,但规模差。我们提出了两个约束感知的启发式算法:一个贪婪的启发式(GH)的单遍分配,和自适应贪婪启发式(AGH),通过多开始建设,基于重定位的本地搜索,和GPU合并,以增强GH。三个约束感知机制-TP感知可行性选择,每有效覆盖率的成本排名和TP升级-确保在紧耦合的内存,延迟,错误和预算约束下的可行性。在使用Azure LLM推理跟踪(2025)校准的工作负载上,两种算法都能在一秒内生成可行的解决方案,AGH接近最佳成本,同时在大规模实例上实现超过260倍的加速。在高达1.5倍参数膨胀的样本外压力测试下,AGH保持受控的SLO违规和稳定的成本,而精确求解器给出的放置方案则急剧恶化。
摘要:Deploying large language model (LLM) inference at scale requires jointly selecting base models, provisioning heterogeneous GPUs, configuring parallelism, and distributing workloads under tight latency, accuracy, and budget constraints. Exact mixed-integer linear programming (MILP) approaches guarantee optimality but scale poorly. We propose two constraint-aware heuristics: a Greedy Heuristic (GH) for single-pass allocation, and an Adaptive Greedy Heuristic (AGH) that enhances GH via multi-start construction, relocate-based local search, and GPU consolidation. Three constraint-aware mechanisms -- TP-aware feasibility selection, cost-per-effective-coverage ranking, and TP upgrade -- ensure feasibility under tightly coupled memory, delay, error, and budget constraints. On workloads calibrated with the Azure LLM Inference Trace (2025), both heuristics produce feasible solutions in under one second, with AGH closely approaching optimal cost while achieving over 260x speedup on large-scale instances. Under out-of-sample stress tests with up to 1.5x parameter inflation, AGH maintains controlled SLO violations and stable cost, whereas the exact solver's placement degrades sharply.
【27】Multimodal Large Language Models for Multi-Subject In-Context Image Generation
标题:用于多主题上下文图像生成的多模态大语言模型
链接:https://arxiv.org/abs/2604.07422
作者:Yucheng Zhou,Dubing Chen,Huan Zheng,Jianbing Shen
备注:ACL 2026
摘要:文本到图像(T2I)生成的最新进展已经使得能够从描述进行视觉上连贯的图像合成,但是生成包含多个给定主题的图像仍然具有挑战性。随着引用标识数量的增加,现有方法往往遭受主题丢失和语义漂移。为了解决这个问题,我们提出了MUSIC,这是第一个专门为\textbf{MU}lti-\textbf{S}ubject \textbf{I}n-\textbf{C}ontext图像生成而设计的MLLM。为了克服数据稀缺性,我们引入了一个自动和可扩展的数据生成管道,消除了手动注释的需要。此外,我们通过视觉思维链(CoT)机制,引导逐步推理从主题图像到语义和生成,提高了模型对多主题语义关系的理解。为了减轻身份纠缠和管理视觉复杂性,我们开发了一种新的语义驱动的空间布局规划方法,并证明了其测试时的可扩展性。通过在训练期间合并复杂的主题图像,我们提高了模型的链式推理能力。此外,我们还策划了MSIC,这是一个为多主题上下文生成量身定制的新基准。实验结果表明,MUSIC在多主题和单主题场景中均显著优于其他方法。
摘要:Recent advances in text-to-image (T2I) generation have enabled visually coherent image synthesis from descriptions, but generating images containing multiple given subjects remains challenging. As the number of reference identities increases, existing methods often suffer from subject missing and semantic drift. To address this problem, we propose MUSIC, the first MLLM specifically designed for \textbf{MU}lti-\textbf{S}ubject \textbf{I}n-\textbf{C}ontext image generation. To overcome the data scarcity, we introduce an automatic and scalable data generation pipeline that eliminates the need for manual annotation. Furthermore, we enhance the model's understanding of multi-subject semantic relationships through a vision chain-of-thought (CoT) mechanism, guiding step-by-step reasoning from subject images to semantics and generation. To mitigate identity entanglement and manage visual complexity, we develop a novel semantics-driven spatial layout planning method and demonstrate its test-time scalability. By incorporating complex subject images during training, we improve the model's capacity for chained reasoning. In addition, we curate MSIC, a new benchmark tailored for multi-subject in-context generation. Experimental results demonstrate that MUSIC significantly surpasses other methods in both multi- and single-subject scenarios.
【28】SHIELD: A Segmented Hierarchical Memory Architecture for Energy-Efficient LLM Inference on Edge NPUs
标题:SHIELD:一种分段分层内存架构,用于边缘NPU上的节能LLM推理
链接:https://arxiv.org/abs/2604.07396
作者:Jintao Zhang,Xuanyao Fong
摘要:边缘神经处理单元(NPU)上的大型语言模型(LLM)推理从根本上受到有限的片上存储器容量的限制。虽然高密度嵌入式DRAM(eDRAM)对于存储激活数据是有吸引力的,但是其周期性刷新消耗大量能量。以前的工作主要集中在减少片外流量或优化持久性键值(KV)缓存的刷新,而瞬态和错误弹性查询和注意力输出(QO)激活在很大程度上被忽视。我们提出了SHIELD,一个生命周期感知分段eDRAM架构,共同利用时间驻留和bfloat 16(BF 16)激活位级灵敏度。SHIELD将符号和指数字段与尾数隔离,禁用瞬态QO尾数的刷新,并将松弛刷新应用于持久KV尾数。在多个LLM和推理场景中,SHIELD将eDRAM刷新能耗相对于标准刷新基准降低了35%,同时保持了WikiText-2、PIQA和ARC-Easy的准确性。
摘要:Large Language Model (LLM) inference on edge Neural Processing Units (NPUs) is fundamentally constrained by limited on-chip memory capacity. Although high-density embedded DRAM (eDRAM) is attractive for storing activation workspaces, its periodic refresh consumes substantial energy. Prior work has primarily focused on reducing off-chip traffic or optimizing refresh for persistent Key-Value (KV) caches, while transient and error-resilient Query and Attention Output (QO) activations are largely overlooked. We propose SHIELD, a lifecycle-aware segmented eDRAM architecture that jointly exploits temporal residency and bit-level sensitivity in bfloat16 (BF16) activations. SHIELD isolates the sign and exponent fields from the mantissa, disables refresh for transient QO mantissas, and applies relaxed refresh to persistent KV mantissas. Across multiple LLMs and inference scenarios, SHIELD reduces eDRAM refresh energy by 35% relative to a standard-refresh baseline while preserving accuracy on WikiText-2, PIQA, and ARC-Easy.
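SHIELD's segmentation relies on the bfloat16 layout: 1 sign bit, 8 exponent bits, and 7 mantissa bits. The Python sketch below shows that split for a single value. It obtains the bfloat16 pattern by truncating a float32 (no rounding), which is a simplification, and the helper name `bf16_fields` is hypothetical.

```python
import struct


def bf16_fields(value):
    """Split a number's bfloat16 pattern into (sign, exponent, mantissa)."""
    # Reinterpret the value as a 32-bit IEEE-754 float.
    bits32 = struct.unpack(">I", struct.pack(">f", value))[0]
    # bfloat16 keeps the top 16 bits of float32 (truncation, no rounding).
    bits16 = bits32 >> 16
    sign = bits16 >> 15               # 1 bit
    exponent = (bits16 >> 7) & 0xFF   # 8 bits, biased by 127
    mantissa = bits16 & 0x7F          # 7 bits
    return sign, exponent, mantissa


print(bf16_fields(1.0))    # (0, 127, 0)
print(bf16_fields(-2.5))   # (1, 128, 32)
```

An error in the sign or exponent changes a value's sign or order of magnitude, whereas a mantissa error changes it by less than a factor of two, which is consistent with the abstract's choice to protect the former fields and relax or disable refresh for the latter.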
【29】Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference
标题:通量注意力:上下文感知混合注意力,以实现高效的LLM推理
链接:https://arxiv.org/abs/2604.07394
作者:Quantong Qiu,Zhiyi Hong,Yi Yang,Haitian Wang,Kebin Liu,Qingqing Dang,Juntao Li,Min Zhang
摘要:标准注意力机制的二次计算复杂度在长上下文场景中为LLM带来了严重的可扩展性瓶颈。虽然结合全注意力(FA)和稀疏注意力(SA)的混合注意力机制提供了一个潜在的解决方案,现有的方法通常依赖于静态分配比率,无法适应不同任务的可变检索需求。此外,头级动态稀疏性经常引入严重的计算负载不平衡和同步长尾,这阻碍了自回归解码期间的硬件加速。为了弥合这一差距,我们引入了Flux Attention,一个上下文感知的框架,在层的级别动态优化注意力计算。通过将轻量级层路由器集成到冻结的预训练LLM中,所提出的方法基于输入上下文自适应地将每个层路由到FA或SA。这种逐层路由保留了高保真信息检索,同时确保连续的内存访问,将理论上的计算减少转化为实际的挂钟加速。作为一种参数高效的方法,我们的框架只需要在8$\times$A800 GPU上进行12小时的训练。跨多个长上下文和数学推理基准的广泛实验表明,与基线模型相比,Flux Attention在性能和推理速度之间实现了卓越的权衡,在预填充和解码阶段的速度提高分别高达$2.8\times$和$2.0\times$。
摘要:The quadratic computational complexity of standard attention mechanisms presents a severe scalability bottleneck for LLMs in long-context scenarios. While hybrid attention mechanisms combining Full Attention (FA) and Sparse Attention (SA) offer a potential solution, existing methods typically rely on static allocation ratios that fail to accommodate the variable retrieval demands of different tasks. Furthermore, head-level dynamic sparsity often introduces severe computational load imbalance and synchronization long-tails, which hinder hardware acceleration during autoregressive decoding. To bridge this gap, we introduce Flux Attention, a context-aware framework that dynamically optimizes attention computation at the layer level. By integrating a lightweight Layer Router into frozen pretrained LLMs, the proposed method adaptively routes each layer to FA or SA based on the input context. This layer-wise routing preserves high-fidelity information retrieval while ensuring contiguous memory access, translating theoretical computational reductions into practical wall-clock speedups. As a parameter-efficient approach, our framework requires only 12 hours of training on 8$\times$A800 GPUs. Extensive experiments across multiple long-context and mathematical reasoning benchmarks demonstrate that Flux Attention achieves a superior trade-off between performance and inference speed compared with baseline models, with speed improvements of up to $2.8\times$ and $2.0\times$ in the prefill and decode stages.
【30】Playing DOOM with 1.3M Parameters: Specialized Small Models vs Large Language Models for Real-Time Game Control
标题:使用1.3M参数玩DOOM:用于实时游戏控制的专用小型模型与大型语言模型
链接:https://arxiv.org/abs/2604.07385
作者:David Golchinfar,Daryoush Vaziri,Alexander Marquardt
备注:17 pages, 3 figures, 3 tables. Code and model weights available at https://github.com/VAGOsolutions/SauerkrautLM-Doom-MultiVec
摘要:我们介绍了SauerkrautLM-Doom-MultiVec,这是一个130万参数的模型,可以实时播放经典的第一人称射击游戏DOOM,其性能优于高达92,000倍大小的大型语言模型,包括Nemotron-120 B,Qwen3.5- 27 B和GPT-4 o-mini。我们的模型将ModernBERT编码器与哈希嵌入,深度感知令牌表示和注意力池分类头相结合,以每个决策31 ms的速度从ASCII帧表示中选择游戏动作。它只接受了31,000次人类游戏演示的训练,在defend_the_center场景中,它在10集中实现了178个碎片(每集17.8个),超过了所有测试的LLM的总和(总共13个碎片)。所有代理都接收等效的输入:ASCII帧和深度图。尽管参数比Nemotron-120 B少了92,000倍,但我们的模型是唯一一个主动与敌人交战而不是纯粹躲避敌人的代理。这些结果表明,在领域适当的数据上训练的小型特定于任务的模型可以在实时控制任务中以推理成本的一小部分决定性地优于通用LLM,并具有在消费者硬件上的部署能力。
摘要:We present SauerkrautLM-Doom-MultiVec, a 1.3 million parameter model that plays the classic first-person shooter DOOM in real time, outperforming large language models up to 92,000x its size, including Nemotron-120B, Qwen3.5-27B, and GPT-4o-mini. Our model combines a ModernBERT encoder with hash embeddings, depth-aware token representations, and an attention pooling classification head to select game actions from ASCII frame representations at 31ms per decision. Trained on just 31,000 human gameplay demonstrations, it achieves 178 frags in 10 episodes (17.8 per episode) in the defend_the_center scenario, more than all tested LLMs combined (13 frags total). All agents receive equivalent input: ASCII frames and depth maps. Despite having 92,000x fewer parameters than Nemotron-120B, our model is the only agent that actively engages enemies rather than purely evading them. These results demonstrate that small, task-specific models trained on domain-appropriate data can decisively outperform general-purpose LLMs at real-time control tasks, at a fraction of the inference cost, with deployment capability on consumer hardware.
【31】Latent Structure of Affective Representations in Large Language Models
标题:大型语言模型中情感表示的潜在结构
链接:https://arxiv.org/abs/2604.07382
作者:Benjamin J. Choi,Melanie Weber
摘要:大型语言模型(LLM)中潜在表示的几何结构是一个活跃的研究领域,部分原因是其对模型透明度和人工智能安全的影响。现有的文献主要集中在一般的几何和拓扑性质的学习表示,但由于缺乏地面真相潜在的几何,验证这些方法的研究结果是具有挑战性的。情绪处理为探索表征几何提供了一个有趣的测试平台,因为情绪表现出分类组织和连续的情感维度,这在心理学文献中已经确立。此外,理解这些表示具有安全相关性。在这项工作中,我们调查的潜在结构的情感表示在LLM使用几何数据分析工具。我们提出了三个主要发现。首先,我们表明,LLM学习连贯的潜在表征的情感情绪,与广泛使用的效价-唤醒模型从心理学。其次,我们发现,这些表示表现出非线性的几何结构,但可以很好地近似线性,提供经验支持的线性表示假设模型透明的方法。第三,我们证明了学习的潜在表征空间可以用来量化情感处理任务中的不确定性。我们的研究结果表明,LLM获得情感表示与几何结构平行建立的模型的人类情感,模型的可解释性和安全性的实际影响。
摘要:The geometric structure of latent representations in large language models (LLMs) is an active area of research, driven in part by its implications for model transparency and AI safety. Existing literature has focused mainly on general geometric and topological properties of the learnt representations, but due to a lack of ground-truth latent geometry, validating the findings of such approaches is challenging. Emotion processing provides an intriguing testbed for probing representational geometry, as emotions exhibit both categorical organization and continuous affective dimensions, which are well-established in the psychology literature. Moreover, understanding such representations carries safety relevance. In this work, we investigate the latent structure of affective representations in LLMs using geometric data analysis tools. We present three main findings. First, we show that LLMs learn coherent latent representations of affective emotions that align with widely used valence-arousal models from psychology. Second, we find that these representations exhibit nonlinear geometric structure that can nonetheless be well-approximated linearly, providing empirical support for the linear representation hypothesis commonly assumed in model transparency methods. Third, we demonstrate that the learned latent representation space can be leveraged to quantify uncertainty in emotion processing tasks. Our findings suggest that LLMs acquire affective representations with geometric structure paralleling established models of human emotion, with practical implications for model interpretability and safety.
【32】The Role of Emotional Stimuli and Intensity in Shaping Large Language Model Behavior
标题:情绪刺激和强度在塑造大型语言模型行为中的作用
链接:https://arxiv.org/abs/2604.07369
作者:Ameen Patel,Felix Lee,Kyle Liang,Joseph Thomas
摘要:情感提示-在提示工程中使用特定的情感措辞-在提高大型语言模型(LLM)的性能,真实性和责任感方面表现出越来越大的希望。然而,这些研究仅限于单一类型的积极情绪刺激,并没有考虑不同程度的情绪强度在他们的分析。在本文中,我们探讨了四种不同的情绪-喜悦,鼓励,愤怒和不安全感-在情绪提示和评估他们的准确性,奉承,和毒性。我们使用GPT-4 o mini开发了一个生成提示的管道,以创建一套LLM和人类生成的提示,这些提示在四种情绪中具有不同的强度。然后,我们编译一个提示的“黄金数据集”,其中人类和模型标签对齐。我们对LLM行为的实证评估表明,积极的情绪刺激会导致更准确和毒性更小的结果,但也会增加奉承行为。
摘要:Emotional prompting - the use of specific emotional diction in prompt engineering - has shown increasing promise in improving large language model (LLM) performance, truthfulness, and responsibility. However these studies have been limited to single types of positive emotional stimuli and have not considered varying degrees of emotion intensity in their analyses. In this paper, we explore the effects of four distinct emotions - joy, encouragement, anger, and insecurity - in emotional prompting and evaluate them on accuracy, sycophancy, and toxicity. We develop a prompt-generation pipeline with GPT-4o mini to create a suite of LLM and human-generated prompts with varying intensities across the four emotions. Then, we compile a "Gold Dataset" of prompts where human and model labels align. Our empirical evaluation on LLM behavior suggests that positive emotional stimuli lead to more accurate and less toxic results, but also increase sycophantic behavior.
【33】Benchmark Shadows: Data Alignment, Parameter Footprints, and Generalization in Large Language Models
标题:基准阴影:大型语言模型中的数据对齐、参数足迹和概括
链接:https://arxiv.org/abs/2604.07363
作者:Hongjian Zou,Yidan Wang,Qi Ding,Yixuan Liao,Xiaoxin Chen
备注:28 pages, 26 figures, 8 tables
摘要:大型语言模型通常会获得强大的基准收益,而不会在更广泛的能力方面进行相应的改进。我们假设这种差异来自于数据分布引起的训练机制的差异。为了研究这一点,我们设计了受控数据干预措施,在固定的训练设置下隔离分布效应。我们发现,基准对齐的数据改善了狭窄的评估指标,同时限制了更广泛的代表性发展,而覆盖范围扩大的数据导致更多的分布参数适应和更好的泛化。我们进一步介绍了参数空间诊断的基础上,频谱和秩分析,揭示了这些制度的不同结构的签名。在不同的开源模型家族中也观察到类似的模式,包括作为关键案例研究的多模式模型,这表明这些影响超出了受控环境。提示重复的案例研究表明,并不是所有的数据工件诱导政权转移。这些结果表明,基准性能本身是不足以表征模型的能力,并强调数据分布在塑造学习动态的重要性。
摘要:Large language models often achieve strong benchmark gains without corresponding improvements in broader capability. We hypothesize that this discrepancy arises from differences in training regimes induced by data distribution. To investigate this, we design controlled data interventions that isolate distributional effects under fixed training settings. We find that benchmark-aligned data improves narrow evaluation metrics while limiting broader representational development, whereas coverage-expanding data leads to more distributed parameter adaptation and better generalization. We further introduce parameter-space diagnostics based on spectral and rank analyses, which reveal distinct structural signatures of these regimes. Similar patterns are observed across diverse open-source model families, including multimodal models as a key case study, suggesting that these effects extend beyond controlled settings. A case study on prompt repetition shows that not all data artifacts induce regime shifts. These results indicate that benchmark performance alone is insufficient to characterize model capability, and highlight the importance of data distribution in shaping learning dynamics.
【34】LLM-Generated Fault Scenarios for Evaluating Perception-Driven Lane Following in Autonomous Edge Systems
标题:用于评估自主边缘系统中感知驱动车道跟随的LLM生成的故障场景
链接:https://arxiv.org/abs/2604.07362
作者:Faezeh Pasandideh,Achim Rettberg
摘要:在边缘设备上部署自主视觉系统面临着一个关键挑战:资源限制阻碍了全面安全测试的实时和可预测执行。现有的验证方法依赖于静态数据集或手动故障注入,无法捕获现实世界部署中遇到的各种环境危害。为了解决这个问题,我们引入了一个解耦的离线-在线故障注入框架。该架构将验证过程分为两个不同的阶段:计算密集型离线阶段和轻量级在线阶段。在离线阶段,我们采用大语言模型(LLM)从语义上生成结构化的故障场景,并采用潜在扩散模型(LDM)合成高保真的传感器退化。这些复杂的故障动态被提炼为预先计算的查找表,使边缘设备能够执行实时故障感知推理,而无需在本地运行繁重的AI模型。我们在460个故障场景中的ResNet18车道跟随模型上广泛验证了这个框架。结果表明,虽然该模型在干净数据上实现了约0.85的基线R^2,但我们生成的故障暴露出显著的鲁棒性下降,RMSE增加了高达99%,在雾条件下,误差在0.10以内的定位准确率下降到低至31.0%,这表明仅用正常数据进行评估对于现实世界的边缘AI部署是不充分的。
摘要:Deploying autonomous vision systems on edge devices faces a critical challenge: resource constraints prevent real-time and predictable execution of comprehensive safety tests. Existing validation methods depend on static datasets or manual fault injection, failing to capture the diverse environmental hazards encountered in real-world deployment. To address this, we introduce a decoupled offline-online fault injection framework. This architecture separates the validation process into two distinct phases: a computationally intensive Offline Phase and a lightweight Online Phase. In the offline phase, we employ Large Language Models (LLMs) to semantically generate structured fault scenarios and Latent Diffusion Models (LDMs) to synthesize high-fidelity sensor degradations. These complex fault dynamics are distilled into a pre-computed lookup table, enabling the edge device to perform real-time fault-aware inference without running heavy AI models locally. We extensively validated this framework on a ResNet18 lane-following model across 460 fault scenarios. Results show that while the model achieves a baseline R^2 of approximately 0.85 on clean data, our generated faults expose significant robustness degradation, with RMSE increasing by up to 99% and within-0.10 localization accuracy dropping to as low as 31.0% under fog conditions, demonstrating the inadequacy of normal-data evaluation for real-world edge AI deployment.
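The three metrics quoted in this abstract (R^2, RMSE, and within-0.10 accuracy) can be sketched as below. This is a minimal illustration and not the authors' evaluation code: the function name `lane_metrics` is hypothetical, and reading "within-0.10 accuracy" as the fraction of predictions whose absolute error is at most 0.10 is an assumption.

```python
import numpy as np

def lane_metrics(y_true, y_pred, tol=0.10):
    """Regression metrics for a lane-following predictor (illustrative only).

    Assumption: 'within-tol accuracy' is the share of samples with |error| <= tol.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    rmse = float(np.sqrt(np.mean(err ** 2)))                 # root-mean-square error
    ss_res = float(np.sum(err ** 2))                         # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                               # coefficient of determination
    within = float(np.mean(np.abs(err) <= tol))              # share of predictions inside the tolerance
    return {"rmse": rmse, "r2": r2, "within_tol": within}
```

Computing the same dictionary on clean and on fault-injected inputs would give the kind of relative RMSE increase the abstract reports.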
【35】BLEG: LLM Functions as Powerful fMRI Graph-Enhancer for Brain Network Analysis
标题:BLEG:LLM作为脑网络分析的强大fMRI图形增强器
链接:https://arxiv.org/abs/2604.07361
作者:Rui Dong,Zitong Wang,Jiaxing Li,Weihuang Zheng,Youyong Kong
摘要:图神经网络(GNN)已被广泛用于基于预处理的功能磁共振成像(fMRI)数据的各种脑网络分析任务。然而,它们的性能受到限制,由于高特征稀疏性和固有的局限性领域知识内的单峰神经图。与此同时,大型语言模型(LLM)已经表现出强大的表示能力。将LLM与GNN相结合为大脑网络分析提供了一个有前途的方向。虽然LLM和MLLM已经出现在神经科学中,但LLM与基于图形的数据的集成仍然未被探索。在这项工作中,我们处理这些问题,结合LLM的强大的代表性和泛化能力。考虑到直接调整LLM的巨大成本,我们将LLM作为增强器来提高GNN在下游任务上的性能。我们的方法,即BLEG,可以分为三个阶段。我们首先提示LLM获得增强的文本的fMRI图数据,然后我们设计了一个LLM-LM指令调整方法,以获得增强的文本表示在相对较低的成本。GNN被一起训练用于粗化对齐。最后,我们在GNN之后为给定的下游任务微调适配器。LM和GNN logits之间的对齐损失旨在进一步增强GNN的表示。在不同数据集上的大量实验证实了BLEG的优越性。
摘要:Graph Neural Networks (GNNs) have been widely used in diverse brain network analysis tasks based on preprocessed functional magnetic resonance imaging (fMRI) data. However, their performances are constrained due to high feature sparsity and inherent limitations of domain knowledge within uni-modal neurographs. Meanwhile, large language models (LLMs) have demonstrated powerful representation capabilities. Combining LLMs with GNNs presents a promising direction for brain network analysis. While LLMs and MLLMs have emerged in neuroscience, integration of LLMs with graph-based data remains unexplored. In this work, we deal with these issues by incorporating LLM's powerful representation and generalization capabilities. Considering great cost for directly tuning LLMs, we instead function LLM as enhancer to boost GNN's performance on downstream tasks. Our method, namely BLEG, can be divided into three stages. We firstly prompt LLM to get augmented texts for fMRI graph data, then we design a LLM-LM instruction tuning method to get enhanced textual representations at a relatively lower cost. GNN is trained together for coarsened alignment. Finally we finetune an adapter after GNN for given downstream tasks. Alignment loss between LM and GNN logits is designed to further enhance GNN's representation. Extensive experiments on different datasets confirmed BLEG's superiority.
Graph相关(图学习|图神经网络|图优化等)(7篇)
【1】Adversarial Label Invariant Graph Data Augmentations for Out-of-Distribution Generalization
标题:用于分布外概括的对抗标签不变图数据增强
链接:https://arxiv.org/abs/2604.08404
作者:Simon Zhang,Ryan P. DeMilt,Kun Jin,Cathy H. Xia
备注:21 pages, 3 figures, accepted at ICML SCIS 2023
摘要:当表征学习遇到分布偏移时,会发生分布外泛化(OoD)。在实践中,当训练和测试数据来自不同的环境时,这种情况经常发生。协变量移位是一种仅发生在输入数据中的分布移位,而概念分布保持不变。我们提出了RIA - Regularization for Invariance with Adversarial training,这是一种在协变量移位下进行OoD推广的新方法。受$Q$-learning的启发,它对训练数据环境进行了对抗性探索。这些新的环境是由对抗性标签不变数据增强引起的,这些数据增强可以防止崩溃到分布中训练的学习者。它与许多现有的OoD推广方法协变量移位,可以制定为约束优化问题。我们开发了一个交替梯度下降上升算法来解决这个问题,并进行了广泛的实验OoD图分类的各种合成和自然分布的变化。我们证明,我们的方法可以实现高精度相比,OoD基线。
摘要:Out-of-distribution (OoD) generalization occurs when representation learning encounters a distribution shift. This occurs frequently in practice when training and testing data come from different environments. Covariate shift is a type of distribution shift that occurs only in the input data, while the concept distribution stays invariant. We propose RIA - Regularization for Invariance with Adversarial training, a new method for OoD generalization under covariate shift. Motivated by an analogy to $Q$-learning, it performs an adversarial exploration for training data environments. These new environments are induced by adversarial label invariant data augmentations that prevent a collapse to an in-distribution trained learner. It works with many existing OoD generalization methods for covariate shift that can be formulated as constrained optimization problems. We develop an alternating gradient descent-ascent algorithm to solve the problem, and perform extensive experiments on OoD graph classification for various kinds of synthetic and natural distribution shifts. We demonstrate that our method can achieve high accuracy compared with OoD baselines.
【2】Equivariant Efficient Joint Discrete and Continuous MeanFlow for Molecular Graph Generation
标题:等变有效联合离散与连续MeanFlow生成分子图
链接:https://arxiv.org/abs/2604.08189
作者:Rongjian Xu,Teng Pang,Zhiqiang Dong,Guoqiang Wu
摘要:图结构数据共同包含离散拓扑和连续几何,这对生成建模提出了根本性的挑战,由于异构分布,不兼容的噪声动力学,以及需要等变归纳偏差。现有的流匹配方法生成的图形通常解耦结构的几何形状,缺乏同步的跨域动力学,并依赖于迭代采样,往往导致物理上不一致的分子构象和缓慢的采样。为了解决这些限制,我们提出了等变平均流(EQUIMF),这是一个统一的SE(3)-等变生成框架,通过同步平均流动态联合建模离散和连续组件。EQUIMF引入了一个统一的时间桥和平均速度更新,结构和几何之间的相互调节,使高效的几步生成,同时保持物理一致性。此外,我们开发了一种新的离散MeanFlow制定一个简单而有效的参数化,以支持高效的生成离散图结构。大量的实验表明,EQUIMF始终优于以前的扩散和流匹配方法的生成质量,物理有效性和采样效率。
摘要:Graph-structured data jointly contain discrete topology and continuous geometry, which poses fundamental challenges for generative modeling due to heterogeneous distributions, incompatible noise dynamics, and the need for equivariant inductive biases. Existing flow-matching approaches for graph generation typically decouple structure from geometry, lack synchronized cross-domain dynamics, and rely on iterative sampling, often resulting in physically inconsistent molecular conformations and slow sampling. To address these limitations, we propose Equivariant MeanFlow (EQUIMF), a unified SE(3)-equivariant generative framework that jointly models discrete and continuous components through synchronized MeanFlow dynamics. EQUIMF introduces a unified time bridge and average-velocity updates with mutual conditioning between structure and geometry, enabling efficient few-step generation while preserving physical consistency. Moreover, we develop a novel discrete MeanFlow formulation with a simple yet effective parameterization to support efficient generation over discrete graph structures. Extensive experiments demonstrate that EQUIMF consistently outperforms prior diffusion and flow-matching methods in generation quality, physical validity, and sampling efficiency.
【3】SEARL: Joint Optimization of Policy and Tool Graph Memory for Self-Evolving Agents
标题:SEARL:自进化代理的策略和工具图记忆的联合优化
链接:https://arxiv.org/abs/2604.07791
作者:Xinshun Feng,Xinhao Song,Lijun Li,Gongshen Liu,Jing Shao
备注:ACL 2026
摘要:具有可验证奖励的强化学习(RLVR)的最新进展在单轮推理任务中表现出了巨大的潜力。随着范式向自我进化的代理学习的转变,模型越来越多地被期望通过综合工具或积累外显经验来从轨迹中学习。然而,主流方法通常依赖于大规模LLM或多代理框架,这阻碍了它们在资源受限环境中的部署。基于结果的奖励的固有稀疏性也构成了一个重大挑战,因为代理通常只在完成任务后才会收到反馈。为了解决这些局限性,我们引入了一个基于工具记忆的自进化代理框架SEARL。与直接利用交互经验的方法不同,我们的方法构建了一个结构化的经验记忆,将计划与执行相结合。这提供了一种新颖的状态抽象,便于在类似的上下文中进行泛化,例如工具重用。因此,代理从历史数据中提取明确的知识,同时利用轨迹间的相关性来加密奖励信号。我们评估了我们关于知识推理和数学任务的框架,证明了它在实现更实用、更高效的学习方面的有效性。
摘要:Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have demonstrated significant potential in single-turn reasoning tasks. With the paradigm shift toward self-evolving agentic learning, models are increasingly expected to learn from trajectories by synthesizing tools or accumulating explicit experiences. However, prevailing methods typically rely on large-scale LLMs or multi-agent frameworks, which hinder their deployment in resource-constrained environments. The inherent sparsity of outcome-based rewards also poses a substantial challenge, as agents typically receive feedback only upon completion of tasks. To address these limitations, we introduce a Tool-Memory based self-evolving agentic framework SEARL. Unlike approaches that directly utilize interaction experiences, our method constructs a structured experience memory that integrates planning with execution. This provides a novel state abstraction that facilitates generalization across analogous contexts, such as tool reuse. Consequently, agents extract explicit knowledge from historical data while leveraging inter-trajectory correlations to densify reward signals. We evaluate our framework on knowledge reasoning and mathematics tasks, demonstrating its effectiveness in achieving more practical and efficient learning.
【4】Toward Generalizable Graph Learning for 3D Engineering AI: Explainable Workflows for CAE Mode Shape Classification and CFD Field Prediction
标题:迈向3D工程人工智能的可推广图形学习:CAE模式形状分类和计算流体力学场预测的可解释工作流程
链接:https://arxiv.org/abs/2604.07781
作者:Tong Duy Son,Kohta Sugiura,Marc Brughmans,Andrey Hense,Zhihao Liu,Amirthalakshmi Veeraraghavan,Ajinkya Bhave,Jay Masters,Paolo di Carlo,Theo Geluk
摘要:汽车工程开发越来越依赖于异构3D数据,包括有限元(FE)模型、白车身(BiW)表示、CAD几何形状和CFD网格。与此同时,工程团队面临着缩短开发周期、提高性能和加速创新的压力。尽管人工智能(AI)在该领域得到了越来越多的探索,但当前的许多方法仍然是特定于任务的,难以解释,并且难以在开发阶段重复使用。本文提出了一个实用的3D工程AI图学习框架,其中异构的工程资产被转换为物理感知的图表示,并由图神经网络(GNNs)处理。该框架旨在支持分类和预测任务。该框架在两个汽车应用中得到了验证:CAE振动模态分类和CFD空气动力场预测。对于CAE振动模式分类,区域感知的BiW图支持在标签稀缺性下跨车辆和FE变体的可解释模式分类。对于CFD空气动力学场预测,物理学信息的替代预测跨空气动力学车身形状变体的压力和壁面剪应力(WSS),同时保持对称性的下采样以较低的计算成本保持准确性。该框架还概述了数据生成指南,可以帮助工程师确定接下来收集哪些额外的模拟或标签是有价值的。这些结果展示了一个实用且可重复使用的工程AI工作流程,可提供更值得信赖的CAE和CFD决策支持。
摘要:Automotive engineering development increasingly relies on heterogeneous 3D data, including finite element (FE) models, body-in-white (BiW) representations, CAD geometry, and CFD meshes. At the same time, engineering teams face growing pressure to shorten development cycles, improve performance and accelerate innovation. Although artificial intelligence (AI) is increasingly explored in this domain, many current methods remain task-specific, difficult to interpret, and hard to reuse across development stages. This paper presents a practical graph learning framework for 3D engineering AI, in which heterogeneous engineering assets are converted into physics-aware graph representations and processed by Graph Neural Networks (GNNs). The framework is designed to support both classification and prediction tasks. The framework is validated on two automotive applications: CAE vibration mode shape classification and CFD aerodynamic field prediction. For CAE vibration mode classification, a region-aware BiW graph supports explainable mode classification across vehicle and FE variants under label scarcity. For CFD aerodynamic field prediction, a physics-informed surrogate predicts pressure and wall shear stress (WSS) across aerodynamic body shape variants, while symmetry preserving down sampling retains accuracy with lower computational cost. The framework also outlines data generation guidance that can help engineers identify which additional simulations or labels are valuable to collect next. These results demonstrate a practical and reusable engineering AI workflow for more trustworthy CAE and CFD decision support.
【5】Cluster Attention for Graph Machine Learning
标题:图形机器学习的集群注意力
链接:https://arxiv.org/abs/2604.07492
作者:Oleg Platonov,Liudmila Prokhorenkova
摘要:消息传递神经网络最近已经成为图形机器学习任务中最流行的方法;然而,它们的接收域受到消息传递层数量的限制。为了增加感受野,已经提出了具有全局注意力的图Transformers;然而,全局注意力没有考虑图拓扑,因此缺乏基于图结构的归纳偏差,这对于图机器学习任务通常非常重要。在这项工作中,我们提出了一种替代方法:集群注意力(CLATT)。我们用现成的图社区检测算法将图节点划分为簇,并让每个节点参与每个簇中的所有其他节点。CLATT提供了大的感受野,同时仍然具有很强的基于图形结构的归纳偏差。我们表明,用CLATT增强消息传递神经网络或图Transformers显着提高了它们在广泛的图数据集上的性能,包括最近推出的代表图机器学习现实应用的GraphLand基准测试数据集。
摘要:Message Passing Neural Networks have recently become the most popular approach to graph machine learning tasks; however, their receptive field is limited by the number of message passing layers. To increase the receptive field, Graph Transformers with global attention have been proposed; however, global attention does not take into account the graph topology and thus lacks graph-structure-based inductive biases, which are typically very important for graph machine learning tasks. In this work, we propose an alternative approach: cluster attention (CLATT). We divide graph nodes into clusters with off-the-shelf graph community detection algorithms and let each node attend to all other nodes in each cluster. CLATT provides large receptive fields while still having strong graph-structure-based inductive biases. We show that augmenting Message Passing Neural Networks or Graph Transformers with CLATT significantly improves their performance on a wide range of graph datasets including datasets from the recently introduced GraphLand benchmark representing real-world applications of graph machine learning.
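The core idea of cluster attention, restricting attention to nodes that share a community label, can be sketched as a masked softmax. This is a simplified illustration under stated assumptions, not the CLATT implementation: it uses a single head with identity query/key/value projections, lets each node also attend to itself so no row is empty, and takes the cluster labels as given (in the paper they come from an off-the-shelf community detection algorithm).

```python
import numpy as np

def cluster_attention(x, clusters):
    """Single-head attention restricted to nodes with the same cluster id.

    x: (n, d) node features; clusters: (n,) integer cluster labels.
    Assumption: identity Q/K/V projections stand in for learned ones.
    """
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)                     # (n, n) scaled dot-product scores
    same = clusters[:, None] == clusters[None, :]     # True where nodes i and j share a cluster
    scores = np.where(same, scores, -np.inf)          # mask out cross-cluster pairs
    scores = scores - scores.max(axis=1, keepdims=True)  # stabilise the softmax
    weights = np.exp(scores)                          # masked entries become exactly 0
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x, weights
```

In a full model the output would typically be combined with a message-passing or Transformer layer rather than used alone.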
【6】A Graph Foundation Model for Wireless Resource Allocation
标题:无线资源分配的图基础模型
链接:https://arxiv.org/abs/2604.07390
作者:Yucheng Sheng,Jiacheng Wang,Le Liang,Hao Ye,Shi Jin
摘要:现代无线网络的大幅密集化需要审慎的资源分配,以减轻严重的相互干扰。然而,对于需要快速响应的实时应用而言,经典迭代算法的计算开销仍然过高。虽然最近基于深度学习的方法显示出了希望,但它们通常作为特定任务的求解器,缺乏在无需昂贵的重新训练的情况下适应不同目标和场景的灵活性。为了解决这些局限性,我们提出了一种基于预训练和微调范式的资源分配图基础模型(GFM-RA),以提取统一的表示,从而能够快速适应不同的目标和场景。具体来说,我们引入了一个带有偏置投影器的干扰感知Transformer架构,将干扰拓扑注入全局注意力机制。此外,我们还开发了一种混合自监督预训练策略,该策略将掩蔽边预测与无负样本的师生对比学习相结合,使模型能够从大量未标记数据集中捕获可迁移的结构表示。大量的实验表明,该框架实现了最先进的性能,并能随着模型容量的增加而有效扩展。至关重要的是,利用其统一的表示,基础模型表现出卓越的样本效率,能够在分布外(OOD)场景下对多样化和无监督的下游目标进行稳健的Few-Shot适应。这些结果证明了预训练基础模型在自适应无线资源分配方面的前景,并为未来基于可泛化学习的无线优化研究提供了坚实的基础。
摘要:The aggressive densification of modern wireless networks necessitates judicious resource allocation to mitigate severe mutual interference. However, classical iterative algorithms remain computationally prohibitive for real-time applications requiring rapid responsiveness. While recent deep learning-based methods show promise, they typically function as task-specific solvers lacking the flexibility to adapt to different objectives and scenarios without expensive retraining. To address these limitations, we propose a graph foundation model for resource allocation (GFM-RA) based on a pre-training and fine-tuning paradigm to extract unified representations, thereby enabling rapid adaptation to different objectives and scenarios. Specifically, we introduce an interference-aware Transformer architecture with a bias projector that injects interference topologies into global attention mechanisms. Furthermore, we develop a hybrid self-supervised pre-training strategy that synergizes masked edge prediction with negative-free Teacher-Student contrastive learning, enabling the model to capture transferable structural representations from massive unlabeled datasets. Extensive experiments demonstrate that the proposed framework achieves state-of-the-art performance and scales effectively with increased model capacity. Crucially, leveraging its unified representations, the foundation model exhibits exceptional sample efficiency, enabling robust few-shot adaptation to diverse and unsupervised downstream objectives in out-of-distribution (OOD) scenarios. These results demonstrate the promise of pre-trained foundation models for adaptable wireless resource allocation and provide a strong foundation for future research on generalizable learning-based wireless optimization.
【7】Intensity Dot Product Graphs
标题:强度点积图
链接:https://arxiv.org/abs/2604.07810
作者:Giulio Valentino Dalla Riva,Matteo Dalla Riva
摘要:潜在位置随机图模型通常将节点集视为固定的,一旦样本大小被选择,而基于图子和随机测度的构造允许更多的随机性,但代价是较弱的几何可解释性。我们介绍了强度点积图(IDPG),扩展随机点积图取代固定的潜在位置的集合与泊松点过程的欧氏潜在空间。这产生了一个模型,随机节点人口,RDPG风格的点积亲和力,和人口水平的强度,连续的潜在结构,有限的观察图。我们定义的热图和期望运营商的概率矩阵的连续类似物,证明了光谱的一致性结果连接邻接奇异值的运营商频谱,比较建设与graphon和digraphon表示,并显示如何经典RDPG出现在一个集中的限制。由于该模型是由不断变化的强度参数化,通过偏微分方程的时间扩展自然出现。
摘要:Latent-position random graph models usually treat the node set as fixed once the sample size is chosen, while graphon-based and random-measure constructions allow more randomness at the cost of weaker geometric interpretability. We introduce Intensity Dot Product Graphs (IDPGs), which extend Random Dot Product Graphs by replacing a fixed collection of latent positions with a Poisson point process on a Euclidean latent space. This yields a model with random node populations, RDPG-style dot-product affinities, and a population-level intensity that links continuous latent structure to finite observed graphs. We define the heat map and the desire operator as continuous analogues of the probability matrix, prove a spectral consistency result connecting adjacency singular values to the operator spectrum, compare the construction with graphon and digraphon representations, and show how classical RDPGs arise in a concentrated limit. Because the model is parameterized by an evolving intensity, temporal extensions through partial differential equations arise naturally.
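The generative recipe described here (a Poisson-distributed node population with dot-product edge probabilities) can be sketched as follows. This is a toy illustration, not the paper's construction: the function name `sample_idpg` is hypothetical, the intensity is assumed homogeneous on a small cube, and the graph is assumed undirected and loop-free for simplicity.

```python
import numpy as np

def sample_idpg(expected_nodes, dim, rng):
    """Draw one graph from a toy intensity dot product model.

    Assumptions: homogeneous Poisson intensity on [0, 1/sqrt(dim)]^dim,
    undirected edges, no self-loops.
    """
    n = int(rng.poisson(expected_nodes))              # random number of nodes
    # Coordinates bounded by 1/sqrt(dim) keep every dot product inside [0, 1].
    pos = rng.uniform(0.0, 1.0 / np.sqrt(dim), size=(n, dim))
    prob = pos @ pos.T                                # RDPG-style edge probabilities
    upper = np.triu(rng.random((n, n)) < prob, k=1)   # sample each unordered pair once
    adj = (upper | upper.T).astype(int)               # symmetrise into an adjacency matrix
    return pos, adj
```

Unlike a classical RDPG, two draws with the same `expected_nodes` will generally have different numbers of nodes, which is the randomness in the node population that the abstract refers to.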
Transformer(3篇)
【1】Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings
标题:由内而外:通过内心工作衡量视觉变形者的概括性
链接:https://arxiv.org/abs/2604.08192
作者:Yunxiang Peng,Mengmeng Ma,Ziyu Yao,Xi Peng
备注:CVPR 2026(Highlight)
摘要:可靠的泛化度量是评估机器学习模型的基础。特别是在标记目标数据稀缺的高风险应用中,迫切需要评估分布偏移下模型的泛化性能。我们关注两个实际场景:(1)在部署之前,如何为未标记的目标数据选择最佳模型?(2)部署完成后,如何监控模型在分发轮班下的性能?在这两种情况下,核心需求是可靠且无标签的代理度量。然而,现有的代理指标,如模型置信度或在线准确性,往往是不可靠的,因为它们只评估模型输出,而忽略了产生它们的内部机制。我们通过引入一个新的视角来解决这个限制:使用模型的内部工作原理,即,电路,作为泛化性能的预测指标。利用电路发现,我们提取内部表示之间的因果相互作用作为一个电路,从中我们得到两个度量定制的两个实际情况。(1)在部署之前,我们引入了Dependency Depth Bias,它衡量不同模型对目标数据的泛化能力。(2)在部署之后,我们提出了电路偏移分数,它预测了模型在不同分布偏移下的泛化能力。在各种任务中,这两个指标都表现出与泛化性能的显著改善的相关性,分别比现有代理的平均性能高出13.4%和34.1%。我们的代码可在https://github.com/deep-real/GenCircuit上获得。
摘要:Reliable generalization metrics are fundamental to the evaluation of machine learning models. Especially in high-stakes applications where labeled target data are scarce, evaluation of models' generalization performance under distribution shift is a pressing need. We focus on two practical scenarios: (1) Before deployment, how to select the best model for unlabeled target data? (2) After deployment, how to monitor model performance under distribution shift? The central need in both cases is a reliable and label-free proxy metric. Yet existing proxy metrics, such as model confidence or accuracy-on-the-line, are often unreliable as they only assess model output while ignoring the internal mechanisms that produce them. We address this limitation by introducing a new perspective: using the inner workings of a model, i.e., circuits, as a predictive metric of generalization performance. Leveraging circuit discovery, we extract the causal interactions between internal representations as a circuit, from which we derive two metrics tailored to the two practical scenarios. (1) Before deployment, we introduce Dependency Depth Bias, which measures different models' generalization capability on target data. (2) After deployment, we propose Circuit Shift Score, which predicts a model's generalization under different distribution shifts. Across various tasks, both metrics demonstrate significantly improved correlation with generalization performance, outperforming existing proxies by an average of 13.4% and 34.1%, respectively. Our code is available at https://github.com/deep-real/GenCircuit.
【2】Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers
标题:循环、思考和概括:回归深度变形机中的隐式推理
链接:https://arxiv.org/abs/2604.07822
作者:Harsh Kohli,Srinivasan Parthasarathy,Huan Sun,Yuekun Yao
备注:19 pages, 18 figures. Under review
摘要:我们研究隐式推理,即在单个向前传递中组合知识或规则的能力。虽然基于transformer的大型语言模型存储了大量的事实知识和规则,但它们通常无法将这些知识用于隐式多跳推理,这表明它们的参数知识缺乏组合概括。为了解决这个问题,我们研究了递归深度Transformers,它可以在相同的Transformer层上进行迭代计算。我们研究了隐式推理场景下的两个组合泛化挑战:系统泛化,即在训练期间组合从未用于组合的知识,以及深度外推,即从有限的推理深度(例如,最多5跳的训练)泛化到更深的组合(例如,10跳)。通过对从头开始训练的模型进行对照研究,我们表明,虽然vanilla Transformers在两个泛化挑战方面都很挣扎,但递归深度Transformers可以有效地进行这种泛化。对于系统的概括,我们发现,这种能力出现通过一个三阶段的grokking过程,从记忆过渡到分布的概括,最后到系统的概括,支持的机制分析。对于深度外推,我们证明了超出训练深度的泛化可以通过扩展推理时间递归来解锁,更多的迭代可以实现更深的推理。我们进一步研究了训练策略如何影响外推,为训练递归深度Transformers提供指导,并确定了一个关键的限制,过度思考,过度递归会降低预测并限制对非常深的成分的泛化。
摘要:We study implicit reasoning, i.e. the ability to combine knowledge or rules within a single forward pass. While transformer-based large language models store substantial factual knowledge and rules, they often fail to compose this knowledge for implicit multi-hop reasoning, suggesting a lack of compositional generalization over their parametric knowledge. To address this limitation, we study recurrent-depth transformers, which enable iterative computation over the same transformer layers. We investigate two compositional generalization challenges under the implicit reasoning scenario: systematic generalization, i.e. combining knowledge that is never used for compositions during training, and depth extrapolation, i.e. generalizing from limited reasoning depth (e.g. training on up to 5-hop) to deeper compositions (e.g. 10-hop). Through controlled studies with models trained from scratch, we show that while vanilla transformers struggle with both generalization challenges, recurrent-depth transformers can effectively make such generalization. For systematic generalization, we find that this ability emerges through a three-stage grokking process, transitioning from memorization to in-distribution generalization and finally to systematic generalization, supported by mechanistic analysis. For depth extrapolation, we show that generalization beyond training depth can be unlocked by scaling inference-time recurrence, with more iterations enabling deeper reasoning. We further study how training strategies affect extrapolation, providing guidance on training recurrent-depth transformers, and identify a key limitation, overthinking, where excessive recurrence degrades predictions and limits generalization to very deep compositions.
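The notion of recurrent depth, reusing one set of layer weights for a variable number of iterations, can be sketched in a few lines. This is a schematic only: the residual tanh layer below is an assumed stand-in for a real transformer block (attention plus MLP), and the function names are hypothetical.

```python
import numpy as np

def shared_block(h, w, b):
    """One residual layer; the same (w, b) is reused at every iteration.

    Assumption: a tanh layer stands in for a full transformer block.
    """
    return h + np.tanh(h @ w + b)

def recurrent_depth_forward(h, w, b, n_iters):
    """Apply the shared block n_iters times; depth is chosen at call time."""
    for _ in range(n_iters):
        h = shared_block(h, w, b)
    return h
```

Because the parameter count does not depend on `n_iters`, a model trained with a small iteration count can be run with a larger one at inference, which is the knob the abstract describes as scaling inference-time recurrence.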
【3】Sensitivity-Positional Co-Localization in GQA Transformers
标题:GQATransformer中的灵敏度-位置协同定位
链接:https://arxiv.org/abs/2604.07766
作者:Manoj Chandrashekar Rao
备注:8 pages, 5 figures
摘要:我们研究分组查询注意力(GQA)Transformer中的一个基本结构性问题:对任务正确性最敏感的层,是否与位置编码适应具有最大杠杆作用的层相吻合?我们称之为共定位假设,并在Llama 3.1 8B上进行测试,这是一个32层的GQA模型,查询头与键值头的比例为4:1。我们引入了LSLORA,它将LoRA适应限制在通过新的正确性差分隐藏状态度量识别的层,以及GARFA(GQA感知RoPE频率适应),它将8个可学习的每KV头标量乘法器附加到每个目标层。与共定位假设相反,我们发现了强烈的反定位:任务敏感层集中在网络后部($\ell\in\{23\text{-}31\}$),而RoPE影响层主导网络前部($\ell\in\{0\text{-}9\}$),从而产生斯皮尔曼$r_s = -0.735$($p = 1.66\times10^{-6}$)。尽管存在这种反定位,但4向跨层消融显示,在六种不同的基准测试(MMLU,GPQA,HumanEval+,MATH,MGSM,ARC)中,将两种干预措施应用于敏感性识别层的性能优于所有替代配置4-16个百分点,在HumanEval+上接近Claude 3.5 Haiku(67.1% vs. 68.3%),总计算成本为100美元。
摘要:We investigate a fundamental structural question in Grouped Query Attention (GQA) transformers: do the layers most sensitive to task correctness coincide with the layers where positional encoding adaptation has the greatest leverage? We term this the co-localization hypothesis and test it on Llama 3.1 8B, a 32-layer GQA model with a 4:1 query-to-key-value head ratio. We introduce LSLORA, which restricts LoRA adaptation to layers identified via a novel correctness-differential hidden-state metric, and GARFA (GQA-Aware RoPE Frequency Adaptation), which attaches 8 learnable per-KV-head scalar multipliers to each targeted layer. Contrary to the co-localization hypothesis, we discover strong anti-localization: task-sensitive layers concentrate in the late network ($\ell\in\{23\text{-}31\}$) while RoPE-influential layers dominate the early network ($\ell\in\{0\text{-}9\}$), yielding Spearman $r_s = -0.735$ ($p = 1.66\times10^{-6}$). Despite this anti-localization, a 4-way cross-layer ablation shows that applying both interventions to the sensitivity-identified layers outperforms all alternative configurations by 4-16 percentage points across six diverse benchmarks (MMLU, GPQA, HumanEval+, MATH, MGSM, ARC), approaching Claude 3.5 Haiku on HumanEval+ (67.1% vs. 68.3%) at \$100 total compute cost.
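The anti-localization result is summarised by a Spearman rank correlation between two per-layer scores. A minimal sketch of that statistic is given below; it assumes there are no tied values, and the two score vectors are whatever per-layer quantities one has measured (the abstract does not give the raw values, so none are reproduced here).

```python
import numpy as np

def spearman_no_ties(a, b):
    """Spearman rank correlation, assuming no ties in either input."""
    ra = np.argsort(np.argsort(a)).astype(float)   # 0-based ranks of a
    rb = np.argsort(np.argsort(b)).astype(float)   # 0-based ranks of b
    ra -= ra.mean()                                # centre the ranks
    rb -= rb.mean()
    # Pearson correlation computed on the centred ranks.
    return float(ra @ rb / np.sqrt((ra @ ra) * (rb @ rb)))
```

A value near -1 means layers that rank high on one score rank low on the other, which is the pattern the abstract calls anti-localization. For data with ties, a library routine such as `scipy.stats.spearmanr` would be the safer choice.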
GAN|对抗|攻击|生成相关(9篇)
【1】What a Comfortable World: Ergonomic Principles Guided Apartment Layout Generation
标题:多么舒适的世界:人体工程学原则指导公寓布局生成
链接:https://arxiv.org/abs/2604.08411
作者:Piotr Nieciecki,Aleksander Plocharski,Przemyslaw Musialski
备注:4 pages, 2 figures, EUROGRAPHICS 2026 Short Paper
摘要:当前的数据驱动的平面图生成方法经常再现在现实世界的训练数据集中发现的人体工程学效率低下。为了解决这个问题,我们提出了一种新的方法,将建筑设计原则直接集成到基于transformer的生成过程中。我们制定了微分损失函数的基础上建立的建筑标准,从文献中优化房间的邻接和接近。通过在训练过程中使用这些人体工程学先验来指导模型,我们的方法可以生成具有显着改善的宜居性指标的布局。比较评估表明,我们的方法优于基线符合人体工程学,同时保持高的结构有效性。
摘要:Current data-driven floor plan generation methods often reproduce the ergonomic inefficiencies found in real-world training datasets. To address this, we propose a novel approach that integrates architectural design principles directly into a transformer-based generative process. We formulate differentiable loss functions based on established architectural standards from literature to optimize room adjacency and proximity. By guiding the model with these ergonomic priors during training, our method produces layouts with significantly improved livability metrics. Comparative evaluations show that our approach outperforms baselines in ergonomic compliance while maintaining high structural validity.
【2】PrivFedTalk: Privacy-Aware Federated Diffusion with Identity-Stable Adapters for Personalized Talking-Head Generation
标题:PrivFedTalk:基于身份稳定适配器的隐私感知联邦扩散,用于个性化通话头生成
链接:https://arxiv.org/abs/2604.08037
作者:Soumya Mazumdar,Vineet Kumar Rakesh,Tapas Samanta
备注:GitHub: https://github.com/mazumdarsoumya/PrivFedTalk
摘要:通过基于扩散的生成模型,说话人生成技术发展迅速,但训练通常依赖于集中式面部视频和语音数据集,这引发了重大的隐私问题。这个问题对于个性化的通话头生成来说更为严重,其中特定于身份的数据非常敏感,通常无法在用户或设备之间共享。PrivFedTalk是一个隐私感知的联邦框架,用于个性化的谈话头生成,结合了条件潜在扩散和参数有效的身份适应。在客户端之间训练共享的扩散骨干,而每个客户端从本地私有视听数据中学习轻量级LoRA身份适配器,避免原始数据共享并降低通信成本。为了解决异构客户端分布问题,身份稳定联合聚合(ISFA)使用根据设备上身份一致性和时间稳定性估计计算的隐私安全标量可靠性信号来对客户端更新进行加权。引入时间去噪一致性(TDC)正则化以减少联邦去噪期间的帧间漂移、闪烁和身份漂移。为了限制更新端的隐私风险,安全聚合和客户端级别的差异隐私被应用于适配器更新。该实现支持低内存GPU执行和异构共享硬件上的多GPU客户端并行训练。在当前设置中,使用PrivFedTalk、FedAvg和FedProx在多个训练和聚合条件下进行的比较实验表明,在资源受限的情况下,联邦优化稳定,端到端训练和评估成功。研究结果支持在联邦环境中进行隐私感知的个性化谈话头训练的可行性,同时表明更强的组件、隐私效用和定性声明需要进一步标准化评估。
摘要:Talking-head generation has advanced rapidly with diffusion-based generative models, but training usually depends on centralized face-video and speech datasets, raising major privacy concerns. The problem is more acute for personalized talking-head generation, where identity-specific data are highly sensitive and often cannot be pooled across users or devices. PrivFedTalk is presented as a privacy-aware federated framework for personalized talking-head generation that combines conditional latent diffusion with parameter-efficient identity adaptation. A shared diffusion backbone is trained across clients, while each client learns lightweight LoRA identity adapters from local private audio-visual data, avoiding raw data sharing and reducing communication cost. To address heterogeneous client distributions, Identity-Stable Federated Aggregation (ISFA) weights client updates using privacy-safe scalar reliability signals computed from on-device identity consistency and temporal stability estimates. Temporal-Denoising Consistency (TDC) regularization is introduced to reduce inter-frame drift, flicker, and identity drift during federated denoising. To limit update-side privacy risk, secure aggregation and client-level differential privacy are applied to adapter updates. The implementation supports both low-memory GPU execution and multi-GPU client-parallel training on heterogeneous shared hardware. Comparative experiments on the present setup across multiple training and aggregation conditions with PrivFedTalk, FedAvg, and FedProx show stable federated optimization and successful end-to-end training and evaluation under constrained resources. The results support the feasibility of privacy-aware personalized talking-head training in federated environments, while suggesting that stronger component-wise, privacy-utility, and qualitative claims need further standardized evaluation.
【3】Preference Redirection via Attention Concentration: An Attack on Computer Use Agents
标题:通过注意力集中进行偏好重定向:对计算机使用代理的攻击
链接:https://arxiv.org/abs/2604.08005
作者:Dominik Seip,Matthias Hein
摘要:多模态基础模型的进步使计算机使用代理(CUA)能够自主与GUI环境进行交互。由于CUA不限于某些工具,它们允许自动化更复杂的代理任务,但同时也会带来新的安全漏洞。虽然以前的工作集中在语言模式,视觉模式的脆弱性得到了较少的关注。在本文中,我们介绍了PRAC,这是一种新的攻击,与之前直接针对VLM输出的工作不同,它通过将注意力转向隐形对抗补丁来操纵模型的内部偏好。我们表明,PRAC是能够操纵选择过程中的CUA在网上购物平台上对选定的目标产品。虽然我们需要白盒访问模型来创建攻击,但我们表明,我们的攻击可以推广到同一模型的微调版本,当多家公司基于开放权重模型构建特定的CUA时,会带来严重威胁。
摘要:Advancements in multimodal foundation models have enabled the development of Computer Use Agents (CUAs) capable of autonomously interacting with GUI environments. As CUAs are not restricted to certain tools, they allow to automate more complex agentic tasks but at the same time open up new security vulnerabilities. While prior work has concentrated on the language modality, the vulnerability of the vision modality has received less attention. In this paper, we introduce PRAC, a novel attack that, unlike prior work targeting the VLM output directly, manipulates the model's internal preferences by redirecting its attention toward a stealthy adversarial patch. We show that PRAC is able to manipulate the selection process of a CUA on an online shopping platform towards a chosen target product. While we require white-box access to the model for the creation of the attack, we show that our attack generalizes to fine-tuned versions of the same model, presenting a critical threat as multiple companies build specific CUAs based on open weights models.
【4】Automatic Generation of Executable BPMN Models from Medical Guidelines
标题:根据医疗指南自动生成可执行BPMN模型
链接:https://arxiv.org/abs/2604.07817
作者:Praveen Kumar Menaka Sekar,Ion Matei,Maksym Zhenirovskyy,Hon Yung Wong,Sayuri Kohmura,Shinji Hotta,Akihiro Inomata
摘要:我们提出了一个端到端的管道,将医疗保健政策文件转换为可执行的,数据感知的业务流程模型和符号(BPMN)模型,使用大型语言模型(LLM)进行基于模拟的政策评估。我们解决了自动化策略数字化的主要挑战,有四个贡献:基于数据的BPMN生成与语法自动校正,可执行增强,KPI仪器和基于熵的不确定性检测。我们评估了来自日本三个城市的糖尿病肾病预防指南,在三个LLM中每个后端生成100个模型,并对1,000名合成患者执行每个模型。在结构良好的策略上,该管道实现了100%的地面实况匹配,并与每个患者的决策达成了完美的一致。在所有条件下,每个患者的原始决策一致性超过92%,并且熵分数随着文档复杂性单调增加,这证实了检测器可靠地将明确的策略与需要有针对性的人类澄清的策略分开。
摘要:We present an end-to-end pipeline that converts healthcare policy documents into executable, data-aware Business Process Model and Notation (BPMN) models using large language models (LLMs) for simulation-based policy evaluation. We address the main challenges of automated policy digitization with four contributions: data-grounded BPMN generation with syntax auto-correction, executable augmentation, KPI instrumentation, and entropy-based uncertainty detection. We evaluate the pipeline on diabetic nephropathy prevention guidelines from three Japanese municipalities, generating 100 models per backend across three LLMs and executing each against 1,000 synthetic patients. On well-structured policies, the pipeline achieves a 100% ground-truth match with perfect per-patient decision agreement. Across all conditions, raw per-patient decision agreement exceeds 92%, and entropy scores increase monotonically with document complexity, confirming that the detector reliably separates unambiguous policies from those requiring targeted human clarification.
【5】Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding
标题:Symbiotic-MoE:释放生成与理解之间的协同效应
链接:https://arxiv.org/abs/2604.07753
作者:Xiangyue Liu,Zijian Zhang,Miles Yang,Zhao Zhong,Liefeng Bo,Ping Tan
摘要:为大型多模态模型(LMM)赋予图像生成能力时,严重的梯度冲突往往会导致理解任务上的灾难性遗忘。虽然Mixture-of-Transformers(MoT)等现有范式通过结构隔离缓解了这一冲突,但它们从根本上切断了跨模态协同,并且存在容量碎片化问题。在这项工作中,我们提出了Symbiotic-MoE,一个统一的预训练框架,它在原生多模态混合专家(MoE)Transformer架构内解决任务干扰,且参数开销为零。我们首先发现,标准的MoE微调会导致路由崩溃,即生成梯度主导了专家利用。为了解决这个问题,我们引入了模态感知专家解耦,它将专家划分为特定于任务的组,同时利用共享专家作为多模态语义桥梁。至关重要的是,这种设计允许共享专家从生成任务中吸收细粒度的视觉语义,以丰富文本表示。为了优化这一点,我们提出了一种渐进式训练策略,其特点是差异化学习率和早期梯度屏蔽。这种机制不仅可以保护预训练知识不受早期波动的影响,而且最终可以将生成信号转化为对理解有益的建设性反馈。大量实验表明,Symbiotic-MoE实现了快速的生成收敛,同时解锁了跨模态协同,提升了固有的理解能力,在MMLU和OCRBench上获得了显著的收益。
摘要:Empowering Large Multimodal Models (LMMs) with image generation often leads to catastrophic forgetting in understanding tasks due to severe gradient conflicts. While existing paradigms like Mixture-of-Transformers (MoT) mitigate this conflict through structural isolation, they fundamentally sever cross-modal synergy and suffer from capacity fragmentation. In this work, we present Symbiotic-MoE, a unified pre-training framework that resolves task interference within a native multimodal Mixture-of-Experts (MoE) Transformers architecture with zero-parameter overhead. We first identify that standard MoE tuning leads to routing collapse, where generative gradients dominate expert utilization. To address this, we introduce Modality-Aware Expert Disentanglement, which partitions experts into task-specific groups while utilizing shared experts as a multimodal semantic bridge. Crucially, this design allows shared experts to absorb fine-grained visual semantics from generative tasks to enrich textual representations. To optimize this, we propose a Progressive Training Strategy featuring differential learning rates and early-stage gradient shielding. This mechanism not only shields pre-trained knowledge from early volatility but eventually transforms generative signals into constructive feedback for understanding. Extensive experiments demonstrate that Symbiotic-MoE achieves rapid generative convergence while unlocking cross-modal synergy, boosting inherent understanding with remarkable gains on MMLU and OCRBench.
【6】Validated Synthetic Patient Generation for Small Longitudinal Cohorts: Coagulation Dynamics Across Pregnancy
标题:经过验证的小型纵向队列合成患者生成:整个怀孕期间的凝固动态
链接:https://arxiv.org/abs/2604.07557
作者:Jeffrey D. Varner,Maria Cristina Bravo,Carole McBride,Thomas Orfeo,Ira Bernstein
摘要:在孕产妇健康、罕见疾病和早期试验中常见的小的纵向临床队列限制了计算建模:患者太少,无法训练可靠的模型,但通过额外的招募来扩大成本太高,速度太慢。我们提出了多重加权随机注意力(SA),一个基于现代Hopfield网络理论的生成框架,解决了这一差距。SA将真实的患者简档作为记忆模式嵌入连续的能量景观中,并通过Langevin动力学生成新的合成患者,该Langevin动力学在存储的模式之间进行插值,同时保留原始队列的几何形状。每个模式的多重性权重使得能够在推理时有针对性地扩增罕见的临床亚组,而无需重新训练。我们将SA应用于来自23名妊娠患者的纵向凝血数据集,这些患者跨越3次访视(妊娠前基线、妊娠早期和妊娠晚期)的72项生化特征,包括罕见亚组,如多囊卵巢综合征和先兆子痫。通过SA生成的合成患者在统计学上、结构上和机制上与其真实对应物在多个独立验证测试中无法区分,包括凝血级联的常微分方程模型。下游效用测试进一步表明,完全在合成患者上校准的机械模型预测了真实患者的结果,以及在真实数据上校准的模型。这些结果表明,SA可以从非常小的纵向数据集产生临床有用的合成队列,从而在小队列环境中实现数据增强建模。
摘要:Small longitudinal clinical cohorts, common in maternal health, rare diseases, and early-phase trials, limit computational modeling: too few patients to train reliable models, yet too costly and slow to expand through additional enrollment. We present multiplicity-weighted Stochastic Attention (SA), a generative framework based on modern Hopfield network theory that addresses this gap. SA embeds real patient profiles as memory patterns in a continuous energy landscape and generates novel synthetic patients via Langevin dynamics that interpolate between stored patterns while preserving the geometry of the original cohort. Per-pattern multiplicity weights enable targeted amplification of rare clinical subgroups at inference time without retraining. We applied SA to a longitudinal coagulation dataset from 23 pregnant patients spanning 72 biochemical features across 3 visits (pre-pregnancy baseline, first trimester, and third trimester), including rare subgroups such as polycystic ovary syndrome and preeclampsia. Synthetic patients generated by SA were statistically, structurally, and mechanistically indistinguishable from their real counterparts across multiple independent validation tests, including an ordinary differential equation model of the coagulation cascade. A downstream utility test further showed that a mechanistic model calibrated entirely on synthetic patients predicted held-out real patient outcomes as well as one calibrated on real data. These results demonstrate that SA can produce clinically useful synthetic cohorts from very small longitudinal datasets, enabling data-augmented modeling in small-cohort settings.
【7】GAN-based Domain Adaptation for Image-aware Layout Generation in Advertising Poster Design
标题:广告海报设计中基于GAN的领域自适应图像感知布局生成
链接:https://arxiv.org/abs/2604.07409
作者:Chenchen Xu,Min Zhou,Tiezheng Ge,Weiwei Xu
备注:arXiv admin note: text overlap with arXiv:2303.14377
摘要:布局在平面设计和海报生成中起着至关重要的作用。最近,深度学习模型在布局生成中的应用受到了极大的关注。本文的重点是使用以图像为条件的基于GAN的模型来生成广告海报的平面布局,这需要由配对的产品图像和布局构成的数据集。为了完成这一任务,我们引入了内容感知图形布局数据集(CGL-Dataset),由60,548对带注释的修复(inpainted)海报和121,000张干净的产品图像组成。修复伪影在修复后的海报和干净的图像之间引入了域差距。为了弥合这一差距,我们设计了两个基于GAN的模型。第一个模型CGL-GAN在修复区域上使用高斯模糊来生成布局。第二个模型结合了无监督域自适应,引入带有像素级判别器(PD)的GAN,简称PDA-GAN,以根据输入图像的视觉纹理生成图像感知布局。PD连接到浅层特征图,并为每个输入图像像素计算GAN损失。此外,我们提出了三个新的内容感知指标,用于评估模型捕捉图形元素和图像内容之间复杂关系的能力。定量和定性评估表明,PDA-GAN实现了最先进的性能,并生成高质量的图像感知布局。
摘要:Layout plays a crucial role in graphic design and poster generation. Recently, the application of deep learning models for layout generation has gained significant attention. This paper focuses on using a GAN-based model conditioned on images to generate advertising poster graphic layouts, requiring a dataset of paired product images and layouts. To address this task, we introduce the Content-aware Graphic Layout Dataset (CGL-Dataset), consisting of 60,548 paired inpainted posters with annotations and 121,000 clean product images. The inpainting artifacts introduce a domain gap between the inpainted posters and clean images. To bridge this gap, we design two GAN-based models. The first model, CGL-GAN, uses Gaussian blur on the inpainted regions to generate layouts. The second model combines unsupervised domain adaptation by introducing a GAN with a pixel-level discriminator (PD), abbreviated as PDA-GAN, to generate image-aware layouts based on the visual texture of input images. The PD is connected to shallow-level feature maps and computes the GAN loss for each input-image pixel. Additionally, we propose three novel content-aware metrics to assess the model's ability to capture the intricate relationships between graphic elements and image content. Quantitative and qualitative evaluations demonstrate that PDA-GAN achieves state-of-the-art performance and generates high-quality image-aware layouts.
【8】Accelerating Training of Autoregressive Video Generation Models via Local Optimization with Representation Continuity
标题:通过具有表示连续性的局部优化加速自回归视频生成模型的训练
链接:https://arxiv.org/abs/2604.07402
作者:Yucheng Zhou,Jianbing Shen
备注:ACL 2026 Findings
摘要:自回归模型在图像生成方面表现出了卓越的性能和效率,但在视频生成中仍然受到高计算成本和漫长训练时间的限制。在这项研究中,我们通过实证分析探索加速自回归视频生成模型训练的方法。我们的研究结果表明,虽然在更少的视频帧上进行训练显著减少了训练时间,但它也加剧了误差累积,并在生成的视频中引入了不一致性。为了解决这些问题,我们提出了局部优化(Local Opt.)方法,该方法优化局部窗口内的标记,同时利用上下文信息减少误差传播。受Lipschitz连续性的启发,我们提出了一种表示连续性(ReCo)策略来提高生成视频的一致性。ReCo利用连续性损失来约束表示变化,提高模型的鲁棒性并减少误差累积。在类别到视频和文本到视频数据集上的广泛实验表明,我们的方法实现了优于基线的性能,同时在不牺牲质量的情况下将训练成本减半。
摘要:Autoregressive models have shown superior performance and efficiency in image generation, but remain constrained by high computational costs and prolonged training times in video generation. In this study, we explore methods to accelerate training for autoregressive video generation models through empirical analyses. Our results reveal that while training on fewer video frames significantly reduces training time, it also exacerbates error accumulation and introduces inconsistencies in the generated videos. To address these issues, we propose a Local Optimization (Local Opt.) method, which optimizes tokens within localized windows while leveraging contextual information to reduce error propagation. Inspired by Lipschitz continuity, we propose a Representation Continuity (ReCo) strategy to improve the consistency of generated videos. ReCo utilizes continuity loss to constrain representation changes, improving model robustness and reducing error accumulation. Extensive experiments on class- and text-to-video datasets demonstrate that our approach achieves superior performance to the baseline while halving the training cost without sacrificing quality.
【9】Differentially Private Language Generation and Identification in the Limit
标题:极限条件下的差分私有语言生成与识别
链接:https://arxiv.org/abs/2604.08504
作者:Anay Mehrotra,Grigoris Velegkas,Xifan Yu,Felix Zhou
摘要:我们在差分隐私约束下,开启了对极限语言生成的研究,该模型最近由Kleinberg和Mullainathan [KM24]提出。我们考虑持续发布模型,其中生成器必须最终输出有效字符串流,同时保护整个输入序列的隐私。我们的第一个主要结果是,对于可数的语言集合,隐私不带来定性代价:我们给出了一个$\varepsilon$-差分隐私算法,它能对任何可数集合实现极限生成。这与许多学习环境形成鲜明对比,在这些环境中,隐私使可学习性变得不可能。然而,隐私确实带来了定量代价:存在大小为$k$的有限集合,对其进行一致的私有生成需要$Ω(k/\varepsilon)$个样本,而在非私有情形下只需一个样本。然后,我们转向更困难的极限语言识别问题。在这里,我们表明隐私造成了根本性的障碍。我们证明,没有任何$\varepsilon$-DP算法可以识别包含两种交集无限且集合差有限的语言的集合,这一条件远强于经典的非私有识别刻画。接下来,我们转向随机设置,其中样本字符串是从某个分布中独立同分布(i.i.d.)采样的(而不是由对手生成)。在这里,我们证明私有识别可行,当且仅当该集合在对抗模型中可识别。总之,我们的结果确立了生成与识别之间新的差异维度,并且对于识别,确立了由隐私约束引起的对抗设置与随机设置之间的分离。
摘要:We initiate the study of language generation in the limit, a model recently introduced by Kleinberg and Mullainathan [KM24], under the constraint of differential privacy. We consider the continual release model, where a generator must eventually output a stream of valid strings while protecting the privacy of the entire input sequence. Our first main result is that for countable collections of languages, privacy comes at no qualitative cost: we provide an $\varepsilon$-differentially-private algorithm that generates in the limit from any countable collection. This stands in contrast to many learning settings where privacy renders learnability impossible. However, privacy does impose a quantitative cost: there are finite collections of size $k$ for which uniform private generation requires $Ω(k/\varepsilon)$ samples, whereas just one sample suffices non-privately. We then turn to the harder problem of language identification in the limit. Here, we show that privacy creates fundamental barriers. We prove that no $\varepsilon$-DP algorithm can identify a collection containing two languages with an infinite intersection and a finite set difference, a condition far stronger than the classical non-private characterization of identification. Next, we turn to the stochastic setting where the sample strings are sampled i.i.d. from a distribution (instead of being generated by an adversary). Here, we show that private identification is possible if and only if the collection is identifiable in the adversarial model. Together, our results establish new dimensions along which generation and identification differ and, for identification, a separation between adversarial and stochastic settings induced by privacy constraints.
半/弱/无/有监督|不确定性|主动学习(3篇)
【1】TTVS: Boosting Self-Exploring Reinforcement Learning via Test-time Variational Synthesis
标题:TTVS:通过测试时变分合成增强自我探索强化学习
链接:https://arxiv.org/abs/2604.08468
作者:Sikai Bai,Haoxi Li,Jie Zhang,Yongjiang Liu,Song Guo
摘要:尽管大型推理模型(LRM)由具有可验证奖励的强化学习(RLVR)驱动,但这种范式从根本上局限于专门或新颖的领域,在这些领域中,这种监督非常昂贵或不可用,这对测试时适应构成了关键挑战。虽然现有的测试时方法提供了一个潜在的解决方案,但它们受到从静态查询集学习的限制,存在过度拟合文本模式的风险。为了解决这一差距,我们引入了测试时变分合成(TTVS),一种新的框架,使LRM的自我发展,通过动态增强训练流从未标记的测试查询。TTVS包括两个协同模块:(1)在线变分合成,它将静态测试查询转换为不同的,语义等价的变化的动态流,强制模型学习潜在的问题逻辑,而不是表面模式;(2)测试时混合探索,它平衡准确性驱动的开发与跨合成变体的一致性驱动的探索。大量的实验表明,TTVS在八种模型架构中具有优异的性能。值得注意的是,仅使用未标记的测试时数据,TTVS不仅优于其他测试时自适应方法,而且优于在大量高质量标记数据上训练的最先进的基于监督RL的技术。
摘要:Despite significant advances in Large Reasoning Models (LRMs) driven by reinforcement learning with verifiable rewards (RLVR), this paradigm is fundamentally limited in specialized or novel domains where such supervision is prohibitively expensive or unavailable, posing a key challenge for test-time adaptation. While existing test-time methods offer a potential solution, they are constrained by learning from static query sets, risking overfitting to textual patterns. To address this gap, we introduce Test-Time Variational Synthesis (TTVS), a novel framework that enables LRMs to self-evolve by dynamically augmenting the training stream from unlabeled test queries. TTVS comprises two synergistic modules: (1) Online Variational Synthesis, which transforms static test queries into a dynamic stream of diverse, semantically-equivalent variations, enforcing the model to learn underlying problem logic rather than superficial patterns; (2) Test-time Hybrid Exploration, which balances accuracy-driven exploitation with consistency-driven exploration across synthetic variants. Extensive experiments show TTVS yields superior performance across eight model architectures. Notably, using only unlabeled test-time data, TTVS not only surpasses other test-time adaptation methods but also outperforms state-of-the-art supervised RL-based techniques trained on vast, high-quality labeled data.
【2】Component-Adaptive and Lesion-Level Supervision for Improved Small Structure Segmentation in Brain MRI
标题:改进脑部MRI中小结构分割的组件自适应和病变级监督
链接:https://arxiv.org/abs/2604.08015
作者:Minh Sao Khue Luu,Evgeniy N. Pavlovskiy,Bair N. Tuchinov
摘要:我们提出了一个统一的目标函数,称为CATMIL,增加了基础分割损失与两个辅助监督条款在不同的水平。第一项,自适应Tversky,基于连接的组件重新加权体素贡献,以平衡不同大小的病变的影响。第二个术语基于多实例学习,通过鼓励检测每个病变实例来引入病变级监督。这些术语与标准nnU-Net损失相结合,共同优化体素级分割准确性和病变级检测。我们使用一致的nnU-Net框架和5重交叉验证在MSLesSeg数据集上评估了所提出的目标。结果表明,CATMIL在分割精度、病变检测和错误控制方面实现了最平衡的性能。与标准损失相比,它提高了Dice得分(0.7834)并减少了边界误差。更重要的是,它大大提高了小病变的召回率,减少了假阴性,同时保持了比较方法中最低的假阳性量。这些研究结果表明,在一个统一的目标内集成组件级和病变级的监督提供了一种有效和实用的方法,用于改善高度不平衡的设置中的小病变分割。所有代码和预训练模型都可以在\href{https://github.com/luumsk/SmallLesionMRI}{this url}上找到。
摘要:We propose a unified objective function, termed CATMIL, that augments the base segmentation loss with two auxiliary supervision terms operating at different levels. The first term, Component-Adaptive Tversky, reweights voxel contributions based on connected components to balance the influence of lesions of different sizes. The second term, based on Multiple Instance Learning, introduces lesion-level supervision by encouraging the detection of each lesion instance. These terms are combined with the standard nnU-Net loss to jointly optimize voxel-level segmentation accuracy and lesion-level detection. We evaluate the proposed objective on the MSLesSeg dataset using a consistent nnU-Net framework and 5-fold cross-validation. The results show that CATMIL achieves the most balanced performance across segmentation accuracy, lesion detection, and error control. It improves Dice score (0.7834) and reduces boundary error compared to standard losses. More importantly, it substantially increases small lesion recall and reduces false negatives, while maintaining the lowest false positive volume among compared methods. These findings demonstrate that integrating component-level and lesion-level supervision within a unified objective provides an effective and practical approach for improving small lesion segmentation in highly imbalanced settings. All code and pretrained models are available at \href{https://github.com/luumsk/SmallLesionMRI}{this url}.
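For readers unfamiliar with the Tversky family that this abstract builds on, the following is a minimal sketch of the plain Tversky index on flat masks. It is not the CATMIL objective: the component-adaptive reweighting and the MIL term are not specified in the abstract and are deliberately omitted. The function names and the `alpha`/`beta` defaults are illustrative assumptions.

```python
def tversky_index(pred, target, alpha=0.5, beta=0.5):
    """Plain Tversky index for flat binary (or soft) masks given as lists.

    alpha weights false positives, beta weights false negatives.
    With alpha == beta == 0.5 this reduces to the Dice coefficient.
    """
    tp = sum(p * t for p, t in zip(pred, target))        # true positives
    fp = sum(p * (1 - t) for p, t in zip(pred, target))  # false positives
    fn = sum((1 - p) * t for p, t in zip(pred, target))  # false negatives
    denom = tp + alpha * fp + beta * fn
    # Two empty masks agree perfectly; avoid dividing by zero.
    return 1.0 if denom == 0 else tp / denom


def tversky_loss(pred, target, alpha=0.5, beta=0.5):
    """Loss form: 0 for perfect overlap, approaching 1 for none."""
    return 1.0 - tversky_index(pred, target, alpha, beta)


# Toy example: one lesion voxel found, two missed, no false alarms.
pred = [1, 0, 0, 0]
target = [1, 1, 1, 0]
balanced = tversky_index(pred, target)                        # Dice-equivalent
fn_heavy = tversky_index(pred, target, alpha=0.3, beta=0.7)   # penalises misses more
```

Raising `beta` above `alpha` lowers the score when voxels are missed, which is the usual motivation for Tversky-style losses in settings with small, easily missed structures.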
【3】Non-variational supervised quantum kernel methods: a review
标题:非变分监督量子核方法:回顾
链接:https://arxiv.org/abs/2604.07896
作者:John Tanner,Chon-Fai Kam,Jingbo Wang
备注:38 pages, 11 figures, 1 table
摘要:量子核方法(QKM)已经成为监督量子机器学习的一个重要框架。与依赖于基于梯度的优化并且可能遭受贫瘠高原等问题的变分量子算法不同,非变分QKM采用固定的量子特征映射,通过凸优化和交叉验证经典地执行模型选择。量子特征嵌入与经典训练的这种分离确保了稳定的优化,同时利用量子电路在高维希尔伯特空间中编码数据。在这篇综述中,我们提供了一个彻底的分析,非变分监督QKM,涵盖其基础上的经典核理论,保真度和投影量子内核的建设,以及在实践中估计的方法。我们研究评估量子优势的框架,包括泛化边界和从经典模型中分离的必要条件,并分析关键挑战,如指数浓度,通过张量网络方法的反量化,以及核积分算子的谱特性。我们进一步讨论结构化的问题类,可能使优势,并综合比较和硬件研究的见解。总的来说,本次审查的目的是澄清制度,QKM可能提供真正的优势,并划定的概念,方法和技术的障碍,必须克服实际的量子增强学习。
摘要:Quantum kernel methods (QKMs) have emerged as a prominent framework for supervised quantum machine learning. Unlike variational quantum algorithms, which rely on gradient-based optimisation and may suffer from issues such as barren plateaus, non-variational QKMs employ fixed quantum feature maps, with model selection performed classically via convex optimisation and cross-validation. This separation of quantum feature embedding from classical training ensures stable optimisation while leveraging quantum circuits to encode data in high-dimensional Hilbert spaces. In this review, we provide a thorough analysis of non-variational supervised QKMs, covering their foundations in classical kernel theory, constructions of fidelity and projected quantum kernels, and methods for their estimation in practice. We examine frameworks for assessing quantum advantage, including generalisation bounds and necessary conditions for separation from classical models, and analyse key challenges such as exponential concentration, dequantisation via tensor-network methods, and the spectral properties of kernel integral operators. We further discuss structured problem classes that may enable advantage, and synthesise insights from comparative and hardware studies. Overall, this review aims to clarify the regimes in which QKMs may offer genuine advantages, and to delineate the conceptual, methodological, and technical obstacles that must be overcome for practical quantum-enhanced learning.
迁移|Zero/Few/One-Shot|自适应(7篇)
【1】Provably Adaptive Linear Approximation for the Shapley Value and Beyond
标题:Shapley值及其他值的可证明自适应线性逼近
链接:https://arxiv.org/abs/2604.08438
作者:Weida Li,Yaoliang Yu,Bryan Kian Hsiang Low
摘要:Shapley值及其更广泛的半值家族在各种归因问题中受到了广泛的关注。一个基本的和长期存在的挑战是他们的有效近似,因为精确的计算通常需要一个指数数量的效用查询的玩家$n$。为了满足大规模应用的挑战,我们探索了在$Θ(n)$空间约束下有效逼近半值的限制。基于向量集中不等式,我们建立了一个理论框架,使现有的无偏随机算法的查询复杂性更尖锐。在这个框架内,我们系统地开发了一个线性空间算法,需要$O(\frac{n}{ε^{2}}\log\frac{1}δ)$效用查询,以确保所有常用半值的$P(\|\hat{\boldsymbolφ}-\boldsymbolφ\|_{2}\geqε)\leq δ$。特别是,我们的框架自然地桥接了OFA、无偏kernelSHAP、SHAP-IQ和回归调整方法,并明确地描述了配对采样何时有益。此外,我们的算法允许显式最小化每个特定的效用函数的均方误差。因此,我们介绍了第一个自适应,线性时间,线性空间随机算法,Adalina,理论上实现了改进的均方误差。我们所有的理论发现都得到了实验验证。
摘要:The Shapley value, and its broader family of semi-values, has received much attention in various attribution problems. A fundamental and long-standing challenge is their efficient approximation, since exact computation generally requires an exponential number of utility queries in the number of players $n$. To meet the challenges of large-scale applications, we explore the limits of efficiently approximating semi-values under a $Θ(n)$ space constraint. Building upon a vector concentration inequality, we establish a theoretical framework that enables sharper query complexities for existing unbiased randomized algorithms. Within this framework, we systematically develop a linear-space algorithm that requires $O(\frac{n}{ε^{2}}\log\frac{1}δ)$ utility queries to ensure $P(\|\hat{\boldsymbolφ}-\boldsymbolφ\|_{2}\geqε)\leq δ$ for all commonly used semi-values. In particular, our framework naturally bridges OFA, unbiased kernelSHAP, SHAP-IQ and the regression-adjusted approach, and definitively characterizes when paired sampling is beneficial. Moreover, our algorithm allows explicit minimization of the mean square error for each specific utility function. Accordingly, we introduce the first adaptive, linear-time, linear-space randomized algorithm, Adalina, that theoretically achieves improved mean square error. All of our theoretical findings are experimentally validated.
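As background for the approximation problem described above, here is a minimal sketch of the classical permutation-sampling estimator of the Shapley value. It is a standard baseline, not the Adalina algorithm from the paper; the function name, the toy additive game, and the sample count are assumptions chosen for illustration. The estimator keeps one running total per player, so it uses memory linear in the number of players, and each sampled permutation costs one utility query per player.

```python
import random


def shapley_permutation_estimate(players, utility, num_permutations, seed=0):
    """Unbiased Monte Carlo estimate of Shapley values via random permutations.

    players: list of hashable player ids.
    utility: function mapping a frozenset of players to a float.
    """
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    for _ in range(num_permutations):
        order = list(players)
        rng.shuffle(order)
        coalition = set()
        prev_value = utility(frozenset(coalition))
        for p in order:
            coalition.add(p)
            value = utility(frozenset(coalition))
            # Marginal contribution of p given the players that precede it.
            totals[p] += value - prev_value
            prev_value = value
    return {p: totals[p] / num_permutations for p in players}


# Toy additive game: the exact Shapley value of each player is its own weight.
weights = {"a": 1.0, "b": 2.0, "c": 3.0}


def additive_utility(coalition):
    return sum(weights[p] for p in coalition)


estimate = shapley_permutation_estimate(list(weights), additive_utility, num_permutations=50)
```

In an additive game every marginal contribution equals the player's weight, so the estimate is exact for any number of permutations; for general utilities the error shrinks only as more permutations are sampled, which is the cost the abstract's query-complexity bounds address.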
【2】Zero-shot Multivariate Time Series Forecasting Using Tabular Prior Fitted Networks
标题:使用表格先验拟合网络的零样本多元时间序列预测
链接:https://arxiv.org/abs/2604.08400
作者:Mayuka Jayawardhana,Nihal Sharma,Kazem Meidani,Bayan Bruss,Tom Goldstein,Doron Bergman
摘要:表格基础模型,特别是像TabPFN这样的先验数据拟合网络,已经成为无数任务的主要竞争者,这些任务从数据插补到表格数据格式的标签预测,超越了基于树的模型的历史成功。这导致了他们的适用性预测时间序列数据,可以制定为一个表格问题的调查。虽然最近的工作,为此已显示出积极的成果,大多数作品都限制了他们的治疗多变量时间序列问题的几个独立的单变量时间序列预测子问题,从而忽略了任何通道间的相互作用。克服这一限制,我们介绍了一个普遍适用的框架,多变量时间序列预测使用表格基础模型。我们通过将多变量时间序列预测问题转换为一系列标量回归问题来实现这一点,然后可以通过具有回归功能的任何表格基础模型来解决zero-shot问题。我们目前的结果,我们的方法使用TabPFN-TS骨干和比较性能与当前最先进的表格方法。
摘要:Tabular foundation models, particularly Prior-data Fitted Networks like TabPFN have emerged as the leading contender in a myriad of tasks ranging from data imputation to label prediction on the tabular data format surpassing the historical successes of tree-based models. This has led to investigations on their applicability to forecasting time series data which can be formulated as a tabular problem. While recent work to this end has displayed positive results, most works have limited their treatment of multivariate time series problems to several independent univariate time series forecasting subproblems, thus ignoring any inter-channel interactions. Overcoming this limitation, we introduce a generally applicable framework for multivariate time series forecasting using tabular foundation models. We achieve this by recasting the multivariate time series forecasting problem as a series of scalar regression problems which can then be solved zero-shot by any tabular foundation model with regression capabilities. We present results of our method using the TabPFN-TS backbone and compare performance with the current state of the art tabular methods.
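The recasting step mentioned in the abstract can be pictured with a small sketch. The exact featurization used by the authors is not given here, so the layout below (lagged values of all channels plus a target-channel index as features, one scalar target per row) is purely an assumption, and `to_scalar_regression_rows` is a hypothetical helper name.

```python
def to_scalar_regression_rows(series, num_lags):
    """Turn a multivariate series into (features, scalar_target) rows.

    series: list of time steps, each a list with one value per channel.
    Each row predicts ONE channel at ONE time step, but its features
    contain the lagged values of ALL channels, so cross-channel
    information stays visible to the regressor.
    """
    num_channels = len(series[0])
    rows = []
    for t in range(num_lags, len(series)):
        # Flatten the lag window, oldest step first.
        lags = [value for step in series[t - num_lags:t] for value in step]
        for channel in range(num_channels):
            features = lags + [channel]  # channel index tells the model what to predict
            rows.append((features, series[t][channel]))
    return rows


# Two channels, four time steps, lag window of two steps.
series = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
rows = to_scalar_regression_rows(series, num_lags=2)
```

The resulting table could then be handed to any tabular regressor; which model and which extra features (for example calendar covariates) to use is left open.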
【3】ADAPTive Input Training for Many-to-One Pre-Training on Time-Series Classification
标题:ADAPTive输入训练,用于时间序列分类的多对一预训练
链接:https://arxiv.org/abs/2604.08398
作者:Paul Quinlan,Qingguo Li,Xiaodan Zhu
摘要:最近关于时间序列模型的工作利用自我监督训练来学习有意义的特征和模式,以提高下游任务的性能,并推广到看不见的模式。虽然这些预训练方法在一对多场景中表现出了很大的希望,其中模型在一个数据集上进行预训练并在下游数据集上进行微调,但当在预训练期间添加更多数据集时,它们很难推广到新数据集。这是为时间序列数据构建基础模型的一个根本挑战,因为它限制了开发可以从大量不同数据集中学习的模型的能力。为了应对这一挑战,我们提出了一种新的时间序列数据预训练范例,称为ADAPT,它可以有效地调整时间序列域中数据的物理属性,即使预训练数据的输入大小和通道尺寸存在极大差异,也可以实现混合批预训练。我们在162个时间序列分类数据集上进行了训练,并为分类基准设置了新的最先进性能。我们成功地训练了一个模型的时间序列域上的广泛的数据集,同时,这是一个主要的构建块,在时间序列域中建立通才的基础模型。
摘要:Recent work on time-series models has leveraged self-supervised training to learn meaningful features and patterns in order to improve performance on downstream tasks and generalize to unseen modalities. While these pretraining methods have shown great promise in one-to-many scenarios, where a model is pre-trained on one dataset and fine-tuned on a downstream dataset, they have struggled to generalize to new datasets when more datasets are added during pre-training. This is a fundamental challenge in building foundation models for time-series data, as it limits the ability to develop models that can learn from a large variety of diverse datasets available. To address this challenge, we present a new pre-training paradigm for time-series data called ADAPT, which can efficiently align the physical properties of data in the time-series domain, enabling mixed-batch pre-training despite the extreme discrepancies in the input sizes and channel dimensions of pre-training data. We trained on 162 time-series classification datasets and set new state-of-the-art performance for classification benchmarks. We successfully train a model within the time-series domain on a wide range of datasets simultaneously, which is a major building block for building generalist foundation models in time-series domains.
【4】SOLAR: Communication-Efficient Model Adaptation via Subspace-Oriented Latent Adapter Reparametrization
标题:SOLAR:通过面向子空间的潜在适配器重新参数化实现通信高效的模型适应
链接:https://arxiv.org/abs/2604.08368
作者:Seyed Mahmoud Sajjadi Mohammadabadi,Xiaolong Ma,Lei Yang,Feng Yan,Junshan Zhang
摘要:参数高效微调(PEFT)方法,如LoRA,通过注入低秩适配器实现基础模型的可扩展适配。然而,它们的通信和存储成本仍然是资源受限环境中的主要瓶颈。我们提出了SOLAR(面向子空间的潜在适配器重新参数化),这是一种训练后压缩框架,可以大大降低通信成本(即,要发送或存储的参数的数量)。SOLAR将每个PEFT更新表示为由基础模型的奇异向量与受控随机扰动形成的基向量的线性组合。通过利用基础模型和特定任务微调更新之间的子空间相似性(主方向的对齐),SOLAR从PEFT结构中提取适配器大小,并确保紧凑而富有表现力的表示。它与模型无关,并与现有的PEFT方法兼容,包括LoRA,AdaLoRA和其他适配器模块。我们从理论上建立了一个边界上的重建误差。使用LLaMA、GPT和ViT模型进行的语言和视觉任务实验表明,SOLAR在保持任务性能的同时,显著降低了模型表示大小,为分布式系统和边缘设备的部署提供了有效且通信高效的解决方案。
摘要:Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, enable scalable adaptation of foundation models by injecting low-rank adapters. However, their communication and storage costs remain a major bottleneck in resource-constrained settings. We propose SOLAR (Subspace-Oriented Latent Adapter Reparameterization), a post-training compression framework that substantially reduces the communication cost (i.e., the number of parameters to transmit or store) of PEFT adapters. SOLAR expresses each PEFT update as a linear combination of basis vectors formed from the foundation model's singular vectors with controlled random perturbations. By exploiting the subspace similarity (the alignment of principal directions) between the foundation model and task-specific fine-tuned updates, SOLAR decouples the adapter size from PEFT structure and ensures compact yet expressive representations. It is model-agnostic and compatible with existing PEFT methods, including LoRA, AdaLoRA, and other adapter modules. We theoretically establish a bound on the reconstruction error. Experiments on language and vision tasks using LLaMA, GPT, and ViT models demonstrate that SOLAR preserves task performance while significantly reducing model representation sizes, offering an effective and communication-efficient solution for deployment in distributed systems and edge devices.
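To make the communication cost concrete, the arithmetic below compares the number of values in a full weight update with the number in a standard LoRA update, where a d_out x d_in update is factored into a d_out x r matrix and an r x d_in matrix. This only illustrates the LoRA baseline that the abstract starts from; it does not model SOLAR's further compression, and the layer size and rank are example values chosen as assumptions.

```python
def full_update_params(d_out, d_in):
    """Number of values in a dense update of a d_out x d_in weight matrix."""
    return d_out * d_in


def lora_update_params(d_out, d_in, rank):
    """Number of values in a rank-r LoRA update: B (d_out x r) plus A (r x d_in)."""
    return rank * (d_out + d_in)


# Example layer size and rank (illustrative values only).
d_out, d_in, rank = 4096, 4096, 8
full = full_update_params(d_out, d_in)
lora = lora_update_params(d_out, d_in, rank)
reduction = full // lora
```

Even after this reduction, the adapter size still scales with the layer dimensions and the chosen rank, which is the coupling the abstract says SOLAR removes.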
【5】TADP-RME: A Trust-Adaptive Differential Privacy Framework for Enhancing Reliability of Data-Driven Systems
标题:TADP-RME:一个用于增强数据驱动系统可靠性的信任自适应差异隐私框架
链接:https://arxiv.org/abs/2604.08113
作者:Labani Halder,Payel Sadhukhan,Sarbani Palit
摘要:确保对抗环境中的可靠性需要将隐私视为数据驱动系统的基本组成部分。虽然差分隐私和密码协议提供了强有力的保证,但现有方案依赖于固定的隐私预算,导致僵化的效用-隐私权衡,在用户信任异构的情况下失效。此外,仅加噪声的差分隐私会保留几何结构,而推理攻击正是利用这一点造成隐私泄漏。我们提出了TADP-RME(信任自适应差分隐私与反向流形嵌入),一个在不同用户信任水平下提高可靠性的框架。它引入了一个取值范围为[0,1]的逆信任分数来自适应地调节隐私预算,从而实现效用和隐私之间的平滑过渡。此外,反向流形嵌入应用非线性变换来破坏局部几何关系,同时通过后处理保留形式化的差分隐私保证。理论和实证结果表明,该框架改进了隐私-效用权衡,将攻击成功率最多降低3.1%,且不会显著降低效用。该框架在对抗推理攻击方面始终优于现有方法,为对抗环境中的可靠学习提供了统一的方法。
摘要:Ensuring reliability in adversarial settings necessitates treating privacy as a foundational component of data-driven systems. While differential privacy and cryptographic protocols offer strong guarantees, existing schemes rely on a fixed privacy budget, leading to a rigid utility-privacy trade-off that fails under heterogeneous user trust. Moreover, noise-only differential privacy preserves geometric structure, which inference attacks exploit, causing privacy leakage. We propose TADP-RME (Trust-Adaptive Differential Privacy with Reverse Manifold Embedding), a framework that enhances reliability under varying levels of user trust. It introduces an inverse trust score in the range [0,1] to adaptively modulate the privacy budget, enabling smooth transitions between utility and privacy. Additionally, Reverse Manifold Embedding applies a nonlinear transformation to disrupt local geometric relationships while preserving formal differential privacy guarantees through post-processing. Theoretical and empirical results demonstrate improved privacy-utility trade-offs, reducing attack success rates by up to 3.1 percent without significant utility degradation. The framework consistently outperforms existing methods against inference attacks, providing a unified approach for reliable learning in adversarial environments.
【6】Efficient Dataset Selection for Continual Adaptation of Generative Recommenders
标题:高效的数据集选择,以连续适应生成性推荐
链接:https://arxiv.org/abs/2604.07739
作者:Cathy Jiao,Juan Elenter,Praveen Ravichandran,Bernd Huber,Joseph Cauteruccio,Todd Wasson,Timothy Heath,Chenyan Xiong,Mounia Lalmas,Paul Bennett
备注:ICLR 2026 CAO Workshop (Oral)
摘要:推荐系统必须不断适应不断变化的用户行为,但在大规模流媒体环境中生成的数据量使得频繁的全面重新训练变得不切实际。这项工作研究如何有针对性的数据选择可以减轻时间分布漂移造成的性能下降,同时保持可扩展性。我们评估了一系列的代表性选择和抽样策略,以策展小,但信息丰富的用户交互数据的子集。我们的研究结果表明,基于梯度的表示,加上分布匹配,提高下游模型的性能,实现训练效率的提高,同时保持对漂移的鲁棒性。这些发现突出了数据策展作为生产规模推荐系统中可扩展监控和自适应模型更新的实用机制。
摘要:Recommendation systems must continuously adapt to evolving user behavior, yet the volume of data generated in large-scale streaming environments makes frequent full retraining impractical. This work investigates how targeted data selection can mitigate performance degradation caused by temporal distributional drift while maintaining scalability. We evaluate a range of representation choices and sampling strategies for curating small but informative subsets of user interaction data. Our results demonstrate that gradient-based representations, coupled with distribution-matching, improve downstream model performance, achieving training efficiency gains while preserving robustness to drift. These findings highlight data curation as a practical mechanism for scalable monitoring and adaptive model updates in production-scale recommendation systems.
【7】Position Paper: From Edge AI to Adaptive Edge AI
标题:立场文件:从边缘AI到自适应边缘AI
链接:https://arxiv.org/abs/2604.07360
作者:Fabrizio Pittorino,Manuel Roveri
备注:8 pages, 2 tables
摘要:边缘AI通常被描述为严格约束下的模型压缩和部署。我们提出了一个更强的运行层面论点:现实部署中的边缘AI必然是自适应的。在长期运行中,固定(非自适应)配置面临一种根本性的失效模式:随着数据和运行条件随时间演变,它必然要么(i)违反时变预算(延迟/能耗/散热/连接/隐私),要么(ii)失去预测可靠性(准确性,以及至关重要的校准),且风险集中在瞬态阶段和罕见的时间区间,而不是体现在平均性能上。如果部署的系统不能在不断变化的条件和约束下重新配置其计算,并在需要时重新配置其模型状态,它就会退化为静态嵌入式推理,无法提供持续的效用。本立场文件引入了一个最小化的代理-系统-环境(ASE)视角,通过明确(i)什么在变化,(ii)观察到什么,(iii)什么可以重新配置,以及(iv)哪些约束必须随时间保持满足,使边缘侧的自适应性得到精确表述。在此框架的基础上,我们提出了未来十年的十项研究挑战,涵盖演化系统的理论保证、动态架构以及数据驱动组件与基于模型组件之间的混合过渡、故障/异常驱动的定向更新、系统1/系统2分解(随时智能)、模块化、标签稀缺条件下的验证,以及量化漂移和干预下生命周期效率与恢复/稳定性的评估协议。
摘要:Edge AI is often framed as model compression and deployment under tight constraints. We argue a stronger operational thesis: Edge AI in realistic deployments is necessarily adaptive. In long-horizon operation, a fixed (non-adaptive) configuration faces a fundamental failure mode: as data and operating conditions evolve and change in time, it must either (i) violate time-varying budgets (latency/energy/thermal/connectivity/privacy) or (ii) lose predictive reliability (accuracy and, critically, calibration), with risk concentrating in transient regimes and rare time intervals rather than in average performance. If a deployed system cannot reconfigure its computation - and, when required, its model state - under evolving conditions and constraints, it reduces to static embedded inference and cannot provide sustained utility. This position paper introduces a minimal Agent-System-Environment (ASE) lens that makes adaptivity precise at the edge by specifying (i) what changes, (ii) what is observed, (iii) what can be reconfigured, and (iv) which constraints must remain satisfied over time. Building on this framing, we formulate ten research challenges for the next decade, spanning theoretical guarantees for evolving systems, dynamic architectures and hybrid transitions between data-driven and model-based components, fault/anomaly-driven targeted updates, System-1/System-2 decompositions (anytime intelligence), modularity, validation under scarce labels, and evaluation protocols that quantify lifecycle efficiency and recovery/stability under drift and interventions.
强化学习(6篇)
【1】Value-Guidance MeanFlow for Offline Multi-Agent Reinforcement Learning
标题:离线多智能体强化学习的价值引导MeanFlow
链接:https://arxiv.org/abs/2604.08174
作者:Teng Pang,Zhiqiang Dong,Yan Zhang,Rongjian Xu,Guoqiang Wu,Yilong Yin
摘要:离线多智能体强化学习(MARL)旨在从预先收集的数据集中学习最佳联合策略,需要在最大化全局收益和减轻离线数据的分布偏移之间进行权衡。最近的研究使用扩散或流生成模型来捕获代理之间的复杂联合策略行为;然而,它们通常依赖于多步迭代采样,从而降低了训练和推理效率。虽然进一步的研究通过蒸馏等方法提高了采样效率,但它仍然对行为正则化系数敏感。为了解决上述问题,我们提出了价值指导多智能体平均流策略(VGM $^2$P),一个简单而有效的基于流的策略学习框架,使高效的动作生成与系数不敏感的条件行为克隆。具体来说,VGM $^2$P使用全局优势值来指导代理协作,将最优策略学习视为条件行为克隆。此外,为了提高多代理场景中的策略表达能力和推理效率,它利用无分类器指导MeanFlow进行策略训练和执行。在具有离散和连续动作空间的任务上的实验表明,即使仅通过条件行为克隆进行训练,VGM $^2 $P也能有效地实现与最先进方法相当的性能。
摘要:Offline multi-agent reinforcement learning (MARL) aims to learn the optimal joint policy from pre-collected datasets, requiring a trade-off between maximizing global returns and mitigating distribution shift from offline data. Recent studies use diffusion or flow generative models to capture complex joint policy behaviors among agents; however, they typically rely on multi-step iterative sampling, thereby reducing training and inference efficiency. Although further research improves sampling efficiency through methods like distillation, it remains sensitive to the behavior regularization coefficient. To address the above-mentioned issues, we propose Value Guidance Multi-agent MeanFlow Policy (VGM$^2$P), a simple yet effective flow-based policy learning framework that enables efficient action generation with coefficient-insensitive conditional behavior cloning. Specifically, VGM$^2$P uses global advantage values to guide agent collaboration, treating optimal policy learning as conditional behavior cloning. Additionally, to improve policy expressiveness and inference efficiency in multi-agent scenarios, it leverages classifier-free guidance MeanFlow for both policy training and execution. Experiments on tasks with both discrete and continuous action spaces demonstrate that, even when trained solely via conditional behavior cloning, VGM$^2$P efficiently achieves performance comparable to state-of-the-art methods.
【2】PriPG-RL: Privileged Planner-Guided Reinforcement Learning for Partially Observable Systems with Anytime-Feasible MPC
标题:PriPG-RL:具有随时可行MPC的部分可观察系统的特权规划者引导强化学习
链接:https://arxiv.org/abs/2604.08036
作者:Mohsen Amiri,Mohsen Amiri,Ali Beikmohammadi,Sindri Magnússon,Mehdi Hosseinzadeh
备注:8 pages, 3 figures
摘要:本文解决了在部分可观测性下训练强化学习(RL)策略的问题,该策略通过利用在训练期间唯一可用的特权的任何时间可行的规划代理。我们将其形式化为一个部分可观测马尔可夫决策过程(POMDP),其中一个计划代理访问一个近似的动态模型和特权状态信息指导学习代理,只观察真实状态的有损投影。为了实现这个框架,我们引入了一个随时可行的模型预测控制(MPC)算法,作为规划代理。对于学习代理,我们提出了规划者到政策软演员批评(P2P-SAC),一种方法,提取规划者代理的特权知识,以减轻部分可观察性,从而提高样本效率和最终的政策性能。我们支持这个框架与严格的理论分析。最后,我们使用NVIDIA Isaac Lab在仿真中验证了我们的方法,并将其成功部署在现实世界的Unitree Go2四足机器人导航复杂,障碍物丰富的环境中。
摘要:This paper addresses the problem of training a reinforcement learning (RL) policy under partial observability by exploiting a privileged, anytime-feasible planner agent available exclusively during training. We formalize this as a Partially Observable Markov Decision Process (POMDP) in which a planner agent with access to an approximate dynamical model and privileged state information guides a learning agent that observes only a lossy projection of the true state. To realize this framework, we introduce an anytime-feasible Model Predictive Control (MPC) algorithm that serves as the planner agent. For the learning agent, we propose Planner-to-Policy Soft Actor-Critic (P2P-SAC), a method that distills the planner agent's privileged knowledge to mitigate partial observability and thereby improve both sample efficiency and final policy performance. We support this framework with rigorous theoretical analysis. Finally, we validate our approach in simulation using NVIDIA Isaac Lab and successfully deploy it on a real-world Unitree Go2 quadruped navigating complex, obstacle-rich environments.
【3】RL-ASL: A Dynamic Listening Optimization for TSCH Networks Using Reinforcement Learning
标题:RL-ASL:使用强化学习的TSCH网络动态侦听优化
链接:https://arxiv.org/abs/2604.07533
作者:F. Fernando Jurado-Lasso,J. F. Jurado
备注:14 pages
摘要:时隙信道跳变(TSCH)是IEEE 802.15.4e标准中广泛采用的媒体访问控制(MAC)协议,旨在为工业物联网(IIoT)网络提供可靠且节能的通信。然而,现有技术的TSCH调度器依赖于静态时隙分配,导致在动态业务条件下的空闲侦听和不必要的功耗。本文介绍了RL-ASL,这是一种强化学习驱动的自适应监听框架,可以根据实时网络条件动态决定是否激活或跳过预定的监听时隙。通过将基于学习的时隙跳过与标准TSCH调度相结合,RL-ASL减少了空闲侦听,同时保持了同步和传输可靠性。在FIT IoT-LAB测试平台和Cooja网络模拟器上的实验结果表明,RL-ASL的功耗比基线调度协议低46%,同时保持了近乎完美的可靠性,并将平均延迟降低了96%。其基于链路的变体RL-ASL-LB进一步改善了高争用下的延迟性能,并具有类似的能效。重要的是,RL-ASL在受约束的微粒上执行推理,开销可以忽略不计,因为模型训练完全离线执行。总的来说,RL-ASL为下一代低功耗IIoT网络提供了一种实用、可扩展和节能的调度机制。
摘要:Time Slotted Channel Hopping (TSCH) is a widely adopted Media Access Control (MAC) protocol within the IEEE 802.15.4e standard, designed to provide reliable and energy-efficient communication in Industrial Internet of Things (IIoT) networks. However, state-of-the-art TSCH schedulers rely on static slot allocations, resulting in idle listening and unnecessary power consumption under dynamic traffic conditions. This paper introduces RL-ASL, a reinforcement learning-driven adaptive listening framework that dynamically decides whether to activate or skip a scheduled listening slot based on real-time network conditions. By integrating learning-based slot skipping with standard TSCH scheduling, RL-ASL reduces idle listening while preserving synchronization and delivery reliability. Experimental results on the FIT IoT-LAB testbed and Cooja network simulator show that RL-ASL achieves up to 46% lower power consumption than baseline scheduling protocols, while maintaining near-perfect reliability and reducing average latency by up to 96% compared to PRIL-M. Its link-based variant, RL-ASL-LB, further improves delay performance under high contention with similar energy efficiency. Importantly, RL-ASL performs inference on constrained motes with negligible overhead, as model training is fully performed offline. Overall, RL-ASL provides a practical, scalable, and energy-aware scheduling mechanism for next-generation low-power IIoT networks.
【4】GIRL: Generative Imagination Reinforcement Learning via Information-Theoretic Hallucination Control
标题:GIRL:通过信息理论幻觉控制的生成想象强化学习
链接:https://arxiv.org/abs/2604.07426
作者:Prakul Sunil Hiremath
备注:20 pages, 2 figures, 7 tables; reinforcement learning, world models
摘要:基于模型的强化学习(MBRL)通过在想象的推演中优化策略来提高样本效率,但是当模型误差复合并且想象的轨迹偏离训练流形时,长时域规划性能会下降。我们介绍了GIRL(生成想象强化学习),一个潜在世界模型框架,它通过两个关键组件来解决这种失效模式。首先,来自冻结的基础模型(DINOv2)的跨模态接地信号将潜在转移先验锚定到语义一致的嵌入空间,惩罚不一致或不可信的预测。第二,不确定性自适应信任区域瓶颈将KL正则化器解释为约束优化问题的拉格朗日乘子,将想象漂移限制在由预期信息增益和相对性能损失信号校准的学习区域内。我们使用性能差异引理和积分概率度量重新推导出一个价值差距界限,该界限在折扣因子接近1时仍然具有信息性,并将目标与真实环境中的遗憾联系起来。在三个基准套件(包括DeepMind Control、Adroit Hand Manipulation和带有视觉干扰物的Meta-World)中进行的实验表明,GIRL相对于DreamerV3在各任务中减少了38%到61%的潜在推演漂移,提高了渐近回报,并且在长时域任务中需要更少的环境交互。在标准评估指标下,GIRL在稀疏奖励和高接触设置上的表现也优于TD-MPC2。蒸馏先验变体相对于完整模型减少了推理开销并提高了计算效率。
摘要:Model-based reinforcement learning (MBRL) improves sample efficiency by optimizing policies inside imagined rollouts, but long-horizon planning degrades when model errors compound and imagined trajectories drift off the training manifold. We introduce GIRL (Generative Imagination Reinforcement Learning), a latent world-model framework that addresses this failure mode with two key components. First, a cross-modal grounding signal derived from a frozen foundation model (DINOv2) anchors the latent transition prior to a semantically consistent embedding space, penalizing inconsistent or implausible predictions. Second, an uncertainty-adaptive trust-region bottleneck interprets the KL regularizer as the Lagrange multiplier of a constrained optimization problem, restricting imagination drift within a learned region calibrated by Expected Information Gain and a Relative Performance Loss signal. We re-derive a value-gap bound using the Performance Difference Lemma and Integral Probability Metrics, yielding a bound that remains informative as the discount factor approaches one and connects the objective to real-environment regret. Experiments across three benchmark suites, including DeepMind Control, Adroit Hand Manipulation, and Meta-World with visual distractors, show that GIRL reduces latent rollout drift by 38 to 61 percent across tasks relative to DreamerV3, improves asymptotic return, and requires fewer environment interactions on long-horizon tasks. GIRL also outperforms TD-MPC2 on sparse-reward and high-contact settings under standard evaluation metrics. A distilled-prior variant reduces inference overhead and improves computational efficiency relative to the full model.
【5】Reinforcement Learning with Reward Machines for Sleep Control in Mobile Networks
标题:利用奖励机进行强化学习用于移动网络中的睡眠控制
链接:https://arxiv.org/abs/2604.07411
作者:Kristina Levina,Nikolaos Pappas,Athanasios Karapantelakis,Aneta Vulgarakis Feljan,Jendrik Seipp
备注:Under review
摘要:移动网络的能源效率对于可持续的电信基础设施至关重要,特别是在网络密集化继续增加功耗的情况下。移动网络中的组件的睡眠机制可以减少能量使用,但是决定哪些组件在保持服务质量(QoS)的同时睡眠、何时睡眠以及睡眠多长时间仍然是一个困难的优化问题。在本文中,我们利用强化学习与奖励机(RM)做出睡眠控制决策,平衡即时节能和长期QoS影响,即最后期限约束流量的时间平均丢包率和恒定速率用户的时间平均最小吞吐量保证。一个挑战是,时间平均约束取决于随着时间的推移而累积的性能,而不是立即的性能。因此,有效的奖励是非马尔可夫的,最佳行动取决于操作历史,而不是瞬时系统状态。RM通过维护一个抽象状态来解释历史依赖性,该抽象状态显式地跟踪随时间的QoS约束违反。我们的框架提供了一个原则性的,可扩展的方法,能源管理下一代移动网络在不同的流量模式和QoS要求。
摘要:Energy efficiency in mobile networks is crucial for sustainable telecommunications infrastructure, particularly as network densification continues to increase power consumption. Sleep mechanisms for the components in mobile networks can reduce energy use, but deciding which components to put to sleep, when, and for how long while preserving quality of service (QoS) remains a difficult optimisation problem. In this paper, we utilise reinforcement learning with reward machines (RMs) to make sleep-control decisions that balance immediate energy savings and long-term QoS impact, i.e. time-averaged packet drop rates for deadline-constrained traffic and time-averaged minimum-throughput guarantees for constant-rate users. A challenge is that time-averaged constraints depend on cumulative performance over time rather than immediate performance. As a result, the effective reward is non-Markovian, and optimal actions depend on operational history rather than the instantaneous system state. RMs account for the history dependence by maintaining an abstract state that explicitly tracks the QoS constraint violations over time. Our framework provides a principled, scalable approach to energy management for next-generation mobile networks under diverse traffic patterns and QoS requirements.
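The history dependence described above can be illustrated with a small sketch. This is not the authors' reward machine: the class name `QoSRewardMachine`, the two abstract states, the drop-rate bound, and the penalty value are all hypothetical, and the running counters are a simplification of a proper finite-state reward machine. It only shows how an abstract state that tracks the time-averaged drop rate makes the reward depend on history rather than on the current step alone.

```python
from dataclasses import dataclass


@dataclass
class QoSRewardMachine:
    """Hypothetical sketch: abstract state is 'ok' or 'violating'."""

    max_drop_rate: float   # assumed bound on the time-averaged packet drop rate
    penalty: float = 1.0   # assumed penalty applied while the bound is violated
    dropped: int = 0       # packets dropped so far (history)
    sent: int = 0          # packets sent so far (history)
    state: str = "ok"

    def step(self, energy_saved: float, dropped: int, sent: int) -> float:
        """Update the abstract state from history and return the shaped reward."""
        self.dropped += dropped
        self.sent += sent
        average = self.dropped / self.sent if self.sent else 0.0
        self.state = "violating" if average > self.max_drop_rate else "ok"
        # The reward depends on the abstract state, not only on this step.
        return energy_saved - (self.penalty if self.state == "violating" else 0.0)


rm = QoSRewardMachine(max_drop_rate=0.1)
r1 = rm.step(energy_saved=1.0, dropped=0, sent=10)   # average 0/10
r2 = rm.step(energy_saved=1.0, dropped=5, sent=10)   # average 5/20
r3 = rm.step(energy_saved=0.5, dropped=0, sent=80)   # average 5/100
```

Here the second step is penalised because the cumulative average exceeds the bound, and the third step is not, even though neither step can be judged from its own packet counts in isolation.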
【6】Investigation of Automated Design of Quantum Circuits for Imaginary Time Evolution Methods Using Deep Reinforcement Learning
标题:使用深度强化学习的虚时间进化量子电路自动设计研究
链接:https://arxiv.org/abs/2604.07951
作者:Ryo Suzuki,Shohei Watabe
备注:11 pages, 11 figures
摘要:高效的基态搜索是推进组合优化问题和量子化学的基础。虽然变分虚时间演化(VITE)方法为变分量子本征求解器(VQE)和量子近似优化算法(QAOA)提供了有用的替代方案,但其在含噪中等规模量子(NISQ)器件上的实现受到手动设计的ansatz的门数和深度的严重限制。在这里,我们提出了一个使用双深度Q网络(DDQN)的VITE电路设计自动化框架。我们的方法将电路构建视为一个多目标优化问题,同时最小化能量期望值并优化电路复杂度。通过引入采用阈值,我们展示了显著的硬件开销减少。在Max-Cut问题中,我们的智能体自主发现的电路与标准的硬件高效ansatz相比,平均门数减少约37%,深度减少约43%。对于分子氢($H_2$),DDQN也达到了Full-CI极限,同时保持了明显更浅的电路。这些结果表明,深度强化学习有助于找到非直观的最优电路结构,为高效的硬件感知量子算法设计提供了一条途径。
摘要:Efficient ground state search is fundamental to advancing combinatorial optimization problems and quantum chemistry. While the Variational Imaginary Time Evolution (VITE) method offers a useful alternative to the Variational Quantum Eigensolver (VQE) and the Quantum Approximate Optimization Algorithm (QAOA), its implementation on Noisy Intermediate-Scale Quantum (NISQ) devices is severely limited by the gate counts and depth of manually designed ansatz. Here, we present an automated framework for VITE circuit design using Double Deep-Q Networks (DDQN). Our approach treats circuit construction as a multi-objective optimization problem, simultaneously minimizing energy expectation values and optimizing circuit complexity. By introducing adoptive thresholds, we demonstrate significant hardware overhead reductions. In Max-Cut problems, our agent autonomously discovered circuits with approximately 37% fewer gates and 43% less depth than standard hardware-efficient ansatz on average. For molecular hydrogen ($H_2$), the DDQN also achieved the Full-CI limit while maintaining a significantly shallower circuit. These results suggest that deep reinforcement learning can be helpful to find non-intuitive, optimal circuit structures, providing a pathway toward efficient, hardware-aware quantum algorithm design.
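For readers unfamiliar with Double Deep-Q Networks, the bootstrap target they use can be sketched in a few lines. This is the generic Double DQN rule, not the paper's circuit-design agent: the function name `double_dqn_target` and the plain-list Q-values are illustrative assumptions, and the paper's actual reward (combining energy and circuit complexity) is not reproduced here.

```python
def double_dqn_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """Return the Double DQN bootstrap target for a single transition.

    q_online_next and q_target_next are lists of Q-values for the next state,
    one entry per action (for example, one per candidate gate placement).
    """
    if done:
        return reward
    # The online network selects the action ...
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # ... and the target network evaluates it, which reduces overestimation.
    return reward + gamma * q_target_next[best_action]


target = double_dqn_target(
    reward=1.0, done=False,
    q_online_next=[0.2, 0.9], q_target_next=[0.5, 0.1], gamma=0.5,
)
terminal = double_dqn_target(1.0, True, [0.2, 0.9], [0.5, 0.1])
```

In the first call the online network prefers action 1, so the target uses the target network's value for action 1 rather than its own maximum.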
元学习(1篇)
【1】Meta-learning In-Context Enables Training-Free Cross Subject Brain Decoding
标题:上下文元学习实现免训练跨受试者大脑解码
链接:https://arxiv.org/abs/2604.08537
作者:Mu Nan,Muquan Yu,Weijian Mai,Jacob S. Prince,Hossein Adeli,Rui Zhang,Jiahang Cao,Benjamin Becker,John A. Pyles,Margaret M. Henderson,Chunfeng Song,Nikolaus Kriegeskorte,Michael J. Tarr,Xiaoqing Hu,Andrew F. Luo
备注:Accepted to CVPR 2026, website: https://github.com/ezacngm/brainCodec
摘要:从大脑信号中进行视觉解码是计算机视觉和神经科学交叉领域的一个关键挑战,需要连接神经表征和视觉计算模型的方法。一个全领域的目标是实现可泛化的跨受试者模型。实现这一目标的一个主要障碍是个体之间神经表征的巨大差异,到目前为止,这需要为每个受试者单独训练定制模型或进行微调。为了解决这一挑战,我们引入了一种用于从fMRI进行语义视觉解码的元优化方法,它无需任何微调即可泛化到新的受试者。通过简单地以来自新个体的一小组图像-脑激活示例为条件,我们的模型快速推断出他们独特的神经编码模式,以促进稳健而高效的视觉解码。我们的方法针对新受试者编码模型的上下文学习进行了明确优化,并通过分层推理、反转编码器来执行解码。首先,对于多个大脑区域,我们通过构建多个刺激和响应的上下文来估计每体素视觉响应编码器参数。其次,我们构建了一个由多个体素上的编码器参数和响应值组成的上下文来执行聚合函数反演。我们在不同的视觉骨干网络上展示了强大的跨受试者和跨扫描仪泛化能力,而无需重新训练或微调。此外,我们的方法既不需要解剖对齐,也不需要刺激重叠。这项工作是迈向非侵入性大脑解码的可泛化基础模型的关键一步。
摘要:Visual decoding from brain signals is a key challenge at the intersection of computer vision and neuroscience, requiring methods that bridge neural representations and computational models of vision. A field-wide goal is to achieve generalizable, cross-subject models. A major obstacle towards this goal is the substantial variability in neural representations across individuals, which has so far required training bespoke models or fine-tuning separately for each subject. To address this challenge, we introduce a meta-optimized approach for semantic visual decoding from fMRI that generalizes to novel subjects without any fine-tuning. By simply conditioning on a small set of image-brain activation examples from the new individual, our model rapidly infers their unique neural encoding patterns to facilitate robust and efficient visual decoding. Our approach is explicitly optimized for in-context learning of the new subject's encoding model and performs decoding by hierarchical inference, inverting the encoder. First, for multiple brain regions, we estimate the per-voxel visual response encoder parameters by constructing a context over multiple stimuli and responses. Second, we construct a context consisting of encoder parameters and response values over multiple voxels to perform aggregated functional inversion. We demonstrate strong cross-subject and cross-scanner generalization across diverse visual backbones without retraining or fine-tuning. Moreover, our approach requires neither anatomical alignment nor stimulus overlap. This work is a critical step towards a generalizable foundation model for non-invasive brain decoding.
医学相关(3篇)
【1】Rhizome OS-1: Rhizome's Semi-Autonomous Operating System for Small Molecule Drug Discovery
标题:Rhizome OS-1:Rhizome用于小分子药物发现的半自主操作系统
链接:https://arxiv.org/abs/2604.07512
作者:Yiwen Wang,Gregory Sinenka,Xhuliano Brace
摘要:我们介绍了一个半自主的发现系统,其中多模态AI智能体作为一个多学科的发现团队,充当计算化学家、药物化学家和专利代理人,编写和执行分析代码,可视化评估候选分子,评估可专利性,并根据经验筛选反馈调整生成策略,而r1(一个在800M分子上训练的246M参数图神经网络(GNN))直接在分子图上生成新的化学物质。智能体在肿瘤学(BCL6,EZH2)中执行了两项活动,制定了跨三个策略层的药物化学假设,并生成了每个靶标2,355-2,876个新分子的文库。在这两个靶点中,91.9%生成的Murcko骨架在其各自靶点的ChEMBL中不存在,与最近的已知活性物质的Tanimoto距离为0.56-0.69,证实该引擎产生结构上不同的化学物质,而不是重述已知化合物。使用Boltz-2进行的结合亲和力预测根据ChEMBL实验数据进行校准,实现了-0.53至-0.64的Spearman相关性和0.88至0.93的ROC AUC值。这些结果表明,配备图原生生成工具和物理信息评分的半自主智能体系统为小分子发现的现代操作系统提供了基础。我们表明,Rhizome OS-1通过支持规模化、快速和自适应的逆向设计,为早期药物发现提供了一个新的范式。
摘要:We introduce a semi-autonomous discovery system in which multi-modal AI agents function as a multi-disciplinary discovery team, acting as computational chemists, medicinal chemists, and patent agents, writing and executing analysis code, visually evaluating molecular candidates, assessing patentability, and adapting generation strategy from empirical screening feedback, while r1, a 246M-parameter Graph Neural Network (GNN) trained on 800M molecules, generates novel chemical matter directly on molecular graphs. Agents executed two campaigns in oncology (BCL6, EZH2), formulating medicinal chemistry hypotheses across three strategy tiers and generating libraries of 2,355-2,876 novel molecules per target. Across both targets, 91.9% of generated Murcko scaffolds are absent from ChEMBL for their respective targets, with Tanimoto distances of 0.56-0.69 to the nearest known active, confirming that the engine produces structurally distinct chemical matter rather than recapitulating known compounds. Binding affinity predictions using Boltz-2 were calibrated against ChEMBL experimental data, achieving Spearman correlations of -0.53 to -0.64 and ROC AUC values of 0.88 to 0.93. These results demonstrate that semi-autonomous agent systems, equipped with graph-native generative tools and physics-informed scoring, provide a foundation for a modern operating system for small molecule discovery. We show that Rhizome OS-1 enables a new paradigm for early-stage drug discovery by supporting scaled, rapid, and adaptive inverse design.
【2】Differentially Private Modeling of Disease Transmission within Human Contact Networks
标题:人类接触网络内疾病传播的差异私人建模
链接:https://arxiv.org/abs/2604.07493
作者:Shlomi Hod,Debanuj Nayak,Jason R. Gantenberg,Iden Kalemaj,Thomas A. Trikalinos,Adam Smith
摘要:传染病的流行病学研究通常依赖于接触网络模型来捕捉控制疾病传播的复杂相互作用,正在进行的项目旨在大大增加收集此类数据的规模。然而,联系人网络可能包含敏感信息,如性关系或吸毒行为。保护个人隐私,同时保持数据的科学有用性至关重要。我们提出了一个隐私保护管道的疾病传播模拟研究的基础上,一个敏感的网络,集成差分隐私(DP)与统计网络模型,如随机块模型(SBMs)和指数随机图模型(ERGMs)。我们的管道包括三个步骤:(1)使用节点级DP(对应于保护个人贡献)计算网络摘要统计数据;(2)使用这些摘要拟合一个统计模型,如ERGM,这允许生成反映原始网络结构的合成网络;(3)使用基于代理的模型模拟疾病在合成网络上的传播。我们使用一个简单的易感-感染-易感(SIS)疾病模型在多种配置下评估我们的方法的有效性。我们比较了数值结果,如模拟疾病的发病率和患病率,以及定性的结论,如干预效应大小,在网络上产生的差异隐私限制和不。我们的实验基于ARTNet研究(一项关于艾滋病毒相关行为的调查)中的自我中心性网络数据。我们的研究结果表明,相对于其他误差来源(采样和模型误设定),为隐私添加的噪声很小。这表明,原则上,这些敏感数据的管理者可以在保护隐私的同时提供有价值的流行病学见解。
摘要:Epidemiologic studies of infectious diseases often rely on models of contact networks to capture the complex interactions that govern disease spread, and ongoing projects aim to vastly increase the scale at which such data can be collected. However, contact networks may include sensitive information, such as sexual relationships or drug use behavior. Protecting individual privacy while maintaining the scientific usefulness of the data is crucial. We propose a privacy-preserving pipeline for disease spread simulation studies based on a sensitive network that integrates differential privacy (DP) with statistical network models such as stochastic block models (SBMs) and exponential random graph models (ERGMs). Our pipeline comprises three steps: (1) compute network summary statistics using node-level DP (which corresponds to protecting individuals' contributions); (2) fit a statistical model, like an ERGM, using these summaries, which allows generating synthetic networks reflecting the structure of the original network; and (3) simulate disease spread on the synthetic networks using an agent-based model. We evaluate the effectiveness of our approach using a simple Susceptible-Infected-Susceptible (SIS) disease model under multiple configurations. We compare both numerical results, such as simulated disease incidence and prevalence, as well as qualitative conclusions such as intervention effect size, on networks generated with and without differential privacy constraints. Our experiments are based on egocentric sexual network data from the ARTNet study (a survey about HIV-related behaviors). Our results show that the noise added for privacy is small relative to other sources of error (sampling and model misspecification). This suggests that, in principle, curators of such sensitive data can provide valuable epidemiologic insights while protecting privacy.
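Step (3) of the pipeline, simulating disease spread on a (synthetic) network with an SIS model, can be sketched as follows. This is a minimal discrete-time illustration, not the authors' agent-based model: the function name `simulate_sis`, the per-contact infection probability `beta`, the per-step recovery probability `gamma`, and the toy ring network are all assumptions, and the differential privacy and ERGM-fitting steps are not shown.

```python
import random


def simulate_sis(adjacency, beta, gamma, initial_infected, steps, seed=0):
    """Simulate a discrete-time SIS process and return prevalence per step.

    adjacency maps each node to a list of its neighbours.
    """
    rng = random.Random(seed)
    infected = set(initial_infected)
    prevalence = []
    for _ in range(steps):
        next_infected = set()
        for node, neighbours in adjacency.items():
            if node in infected:
                # An infected node stays infected unless it recovers,
                # in which case it becomes susceptible again.
                if rng.random() >= gamma:
                    next_infected.add(node)
            else:
                # Each infected neighbour independently transmits with prob. beta.
                k = sum(1 for nb in neighbours if nb in infected)
                if rng.random() < 1.0 - (1.0 - beta) ** k:
                    next_infected.add(node)
        infected = next_infected
        prevalence.append(len(infected) / len(adjacency))
    return prevalence


# Toy ring network with six nodes (purely illustrative).
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
always_spreads = simulate_sis(ring, beta=1.0, gamma=0.0, initial_infected=[0], steps=3)
always_recovers = simulate_sis(ring, beta=0.0, gamma=1.0, initial_infected=[0], steps=3)
```

Running the same function on networks generated with and without privacy noise, and comparing the resulting prevalence curves, mirrors the kind of comparison the abstract describes.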
【3】HistDiT: A Structure-Aware Latent Conditional Diffusion Model for High-Fidelity Virtual Staining in Histopathology
标题:HistDiT:用于组织病理学高保真虚拟染色的结构感知潜在条件扩散模型
链接:https://arxiv.org/abs/2604.08305
作者:Aasim Bin Saleem,Amr Ahmed,Ardhendu Behera,Hafeezullah Amin,Iman Yi Liao,Mahmoud Khattab,Pan Jia Wern,Haslina Makmur
备注:Accepted to ICPR 2026
摘要:免疫组织化学(IHC)对于评估乳腺癌中的特定免疫生物标志物如人表皮生长因子受体2(HER 2)至关重要。然而,获得IHC染色剂的传统方案是资源密集型的,耗时的,并且易于结构损伤。虚拟染色已经成为一种可扩展的替代方案,但它在保留细粒度细胞结构的同时准确翻译生化表达方面面临着重大挑战。目前最先进的方法仍然依赖于生成对抗网络(GANs)或标准卷积U-Net扩散模型,这些模型经常与“结构和染色权衡”作斗争。生成的样本要么是结构相关但模糊的,要么是纹理逼真但具有影响其诊断用途的伪影。在本文中,我们介绍了HistDiT,一种新的潜在的条件扩散Transformer(DiT)架构,建立了一个新的基准,在虚拟组织学染色的视觉保真度。在这项工作中引入的新颖性是,a)双流条件反射策略,其通过VAE编码的潜伏期和通过UNI嵌入的语义表型指导明确地保持空间约束之间的平衡; b)多目标损失函数,其有助于具有清晰形态结构的更清晰的图像;以及c)使用结构相关性度量(SCM)来集中于岩心形态结构以精确评估样品质量。因此,我们的模型优于现有的基线,通过严格的定量和定性评估证明。
摘要:Immunohistochemistry (IHC) is essential for assessing specific immune biomarkers like Human Epidermal growth-factor Receptor 2 (HER2) in breast cancer. However, the traditional protocols for obtaining IHC stains are resource-intensive, time-consuming, and prone to structural damage. Virtual staining has emerged as a scalable alternative, but it faces significant challenges in preserving fine-grained cellular structures while accurately translating biochemical expressions. Current state-of-the-art methods still rely on Generative Adversarial Networks (GANs) or standard convolutional U-Net diffusion models that often struggle with "structure and staining trade-offs". The generated samples are either structurally relevant but blurry, or texturally realistic but have artifacts that compromise their diagnostic use. In this paper, we introduce HistDiT, a novel latent conditional Diffusion Transformer (DiT) architecture that establishes a new benchmark for visual fidelity in virtual histological staining. The novelties introduced in this work are: a) the Dual-Stream Conditioning strategy that explicitly maintains a balance between spatial constraints via VAE-encoded latents and semantic phenotype guidance via UNI embeddings; b) the multi-objective loss function that contributes to sharper images with clear morphological structure; and c) the use of the Structural Correlation Metric (SCM) to focus on the core morphological structure for precise assessment of sample quality. Consequently, our model outperforms existing baselines, as demonstrated through rigorous quantitative and qualitative evaluations.
蒸馏|知识提取(2篇)
【1】Are we still able to recognize pearls? Machine-driven peer review and the risk to creativity: An explainable RAG-XAI detection framework with markers extraction
标题:我们还能认出珍珠吗?机器驱动的同行评审和创造力的风险:具有标记提取的可解释RAG-XAI检测框架
链接:https://arxiv.org/abs/2604.07964
作者:Alin-Gabriel Văduva,Simona-Vasilica Oprea,Adela Bâra
摘要:将大型语言模型(LLM)集成到同行评审中,引发了作者身份和检测之外的担忧:整个编辑过程的潜在级联自动化。随着评论部分或完全由机器生成,编辑决策也可能委托给算法系统,从而导致完全自动化的评估管道。它们有可能重塑评估科学工作的标准。本文认为,机器驱动的评估可能会系统地有利于标准化,符合模式的研究,而惩罚非常规和范式转变的想法,需要上下文的人类判断。我们认为这种转变可能导致认知同质化,研究人员被隐含地激励优化他们的工作以获得算法批准,而不是真正的发现。为了解决这一风险,我们引入了一个可解释的框架(RAG-XAI),用于评估评审质量并使用标记LLM提取器检测自动模式,旨在保持科学的透明度,问责制和创造力。该框架实现了近乎完美的检测性能,XGBoost、Random Forest和LightGBM的准确率达到99.61%,AUC-ROC高于0.999,测试集上的F1分数为0.9925,同时保持极低的假阳性率(<0.23%)和假阴性率(~0.8%)。相比之下,逻辑回归基线表现较差(准确率为89.97%,F1评分为0.8314)。特征重要性和SHAP分析确定了个人信号和重复模式的缺失作为主导预测因素。此外,RAG组件实现了90.5%的top-1检索准确率,在嵌入空间中具有强大的同类聚类,进一步支持框架输出的可靠性。
摘要:The integration of large language models (LLMs) into peer review raises a concern beyond authorship and detection: the potential cascading automation of the entire editorial process. As reviews become partially or fully machine-generated, it becomes plausible that editorial decisions may also be delegated to algorithmic systems, leading to a fully automated evaluation pipeline. They risk reshaping the criteria by which scientific work is assessed. This paper argues that machine-driven assessment may systematically favor standardized, pattern-conforming research while penalizing unconventional and paradigm-shifting ideas that require contextual human judgment. We consider that this shift could lead to epistemic homogenization, where researchers are implicitly incentivized to optimize their work for algorithmic approval rather than genuine discovery. To address this risk, we introduce an explainable framework (RAG-XAI) for assessing review quality and detecting automated patterns using markers LLM extractor, aiming to preserve transparency, accountability and creativity in science. The proposed framework achieves near-perfect detection performance, with XGBoost, Random Forest and LightGBM reaching 99.61% accuracy, AUC-ROC above 0.999 and F1-scores of 0.9925 on the test set, while maintaining extremely low false positive rates (<0.23%) and false negative rates (~0.8%). In contrast, the logistic regression baseline performs substantially worse (89.97% accuracy, F1-score 0.8314). Feature importance and SHAP analyses identify absence of personal signals and repetition patterns as the dominant predictors. Additionally, the RAG component achieves 90.5% top-1 retrieval accuracy, with strong same-class clustering in the embedding space, further supporting the reliability of the framework's outputs.
【2】Structured Distillation of Web Agent Capabilities Enables Generalization
标题:Web代理功能的结构化提炼实现通用化
链接:https://arxiv.org/abs/2604.07776
作者:Xing Han Lù,Siva Reddy
摘要:Frontier LLM可以导航复杂的网站,但其成本和对第三方API的依赖使得本地部署不切实际。我们引入代理作为注释器,一个框架,通过类比人类注释角色,为Web代理构建合成轨迹生成,用模块化LLM组件取代任务设计器,注释器和管理器。使用Gemini 3 Pro作为教师,我们在六个网络环境中生成了3,000个轨迹,并在通过质量过滤的2,322个轨迹上使用纯监督学习对9 B参数学生进行微调。结果模型在WebArena上达到了41.5%,超过了在相同评估协议下的闭源模型,如Claude 3.5 Sonnet(36.0%)和GPT-4 o(31.5%),并且几乎是之前最好的开放权重结果(Go-Browse,21.7%)的两倍。功能转移到看不见的环境,在WorkArena L1(培训期间从未见过的企业平台)上获得18.2个百分点的收益,并在三个额外的基准测试中获得一致的改进。烧蚀确认每个管道组件都有意义地做出贡献,法官过滤,评估提示和推理跟踪每个都占可测量的收益。这些结果表明,结构化的轨迹合成从一个单一的前沿教师是足以产生竞争力,本地部署的网络代理。项目页面:https://agent-as-annotators.github.io
摘要:Frontier LLMs can navigate complex websites, but their cost and reliance on third-party APIs make local deployment impractical. We introduce Agent-as-Annotators, a framework that structures synthetic trajectory generation for web agents by analogy to human annotation roles, replacing the Task Designer, Annotator, and Supervisor with modular LLM components. Using Gemini 3 Pro as teacher, we generate 3,000 trajectories across six web environments and fine-tune a 9B-parameter student with pure supervised learning on the 2,322 that pass quality filtering. The resulting model achieves 41.5% on WebArena, surpassing closed-source models such as Claude 3.5 Sonnet (36.0%) and GPT-4o (31.5%) under the same evaluation protocol, and nearly doubling the previous best open-weight result (Go-Browse, 21.7%). Capabilities transfer to unseen environments, with an 18.2 percentage point gain on WorkArena L1 (an enterprise platform never seen during training) and consistent improvements across three additional benchmarks. Ablations confirm that each pipeline component contributes meaningfully, with Judge filtering, evaluation hints, and reasoning traces each accounting for measurable gains. These results demonstrate that structured trajectory synthesis from a single frontier teacher is sufficient to produce competitive, locally deployable web agents. Project page: https://agent-as-annotators.github.io
推荐(1篇)
【1】Ensembles at Any Cost? Accuracy-Energy Trade-offs in Recommender Systems
标题:不惜一切代价使用集成?推荐系统中的准确性与能耗权衡
链接:https://arxiv.org/abs/2604.07869
作者:Jannik Nitschke,Lukas Wegmeth,Joeran Beel
摘要:在推荐系统中经常使用集成方法,通过组合多个模型来提高准确性。最近的工作报告了相当大的性能提升,但大多数研究仍然主要针对准确性和鲁棒性而不是能源效率进行优化。本文测量集成技术相对于强单一模型的准确性-能耗权衡。我们在两个管道中运行93个受控实验:1. 使用Surprise进行显式评分预测(RMSE);2. 使用LensKit进行隐式反馈排名(NDCG@10)。我们评估了四个数据集,范围从10万到780万次交互(MovieLens 100K,MovieLens 1M,ModCloth,Anime)。我们将四种集成策略(Average、Weighted、Stacking或Rank Fusion、Top Performers)与基线和优化的单一模型进行比较。整个系统的能耗通过EMERS使用智能插头测量并转换为CO2当量。在不同的设置中,集成将准确度提高了0.3%到5.7%,同时将能耗提高了19%到2,549%。在MovieLens 1M上,与SVD++相比,Top Performers集成以18.8%的能耗开销将RMSE提高了0.96%。在MovieLens 100K上,平均集成将NDCG@10提高了5.7%,并增加了103%的能耗。在Anime数据集上,Surprise Top Performers集成将RMSE提高了1.2%,但多消耗了2,005%的能量(0.21 vs. 0.01 Wh),将排放量从2.6增加到53.8 mg CO2当量,而LensKit集成由于内存限制而失败。总体而言,选择性集成比穷举平均更节能,
摘要:Ensemble methods are frequently used in recommender systems to improve accuracy by combining multiple models. Recent work reports sizable performance gains, but most studies still optimize primarily for accuracy and robustness rather than for energy efficiency. This paper measures the accuracy-energy trade-offs of ensemble techniques relative to strong single models. We run 93 controlled experiments in two pipelines: 1. explicit rating prediction with Surprise (RMSE) and 2. implicit feedback ranking with LensKit (NDCG@10). We evaluate four datasets ranging from 100,000 to 7.8 million interactions (MovieLens 100K, MovieLens 1M, ModCloth, Anime). We compare four ensemble strategies (Average, Weighted, Stacking or Rank Fusion, Top Performers) against baselines and optimized single models. Whole-system energy is measured with EMERS using a smart plug and converted to CO2 equivalents. Across settings, ensembles improve accuracy by 0.3% to 5.7% while increasing energy by 19% to 2,549%. On MovieLens 1M, a Top Performers ensemble improves RMSE by 0.96% at an 18.8% energy overhead over SVD++. On MovieLens 100K, an averaging ensemble improves NDCG@10 by 5.7% with 103% additional energy. On Anime, a Surprise Top Performers ensemble improves RMSE by 1.2% but consumes 2,005% more energy (0.21 vs. 0.01 Wh), increasing emissions from 2.6 to 53.8 mg CO2 equivalents, and LensKit ensembles fail due to memory limits. Overall, selective ensembles are more energy efficient than exhaustive averaging,
聚类(1篇)
【1】The Condition-Number Principle for Prototype Clustering
标题:原型集群的条件数原则
链接:https://arxiv.org/abs/2604.07744
作者:Romano Li,Jianfei Cao
摘要:我们开发了一个几何框架,将客观准确性与基于原型的聚类中的结构恢复联系起来。该分析是算法不可知的,适用于广泛的一类容许损失函数。我们定义了一个聚类条件数,比较内集群规模的最小损失增加所需的移动一个点跨越集群边界。当这个数量很小时,任何具有小的次优差距的解决方案也必须具有相对于基准划分的小的误分类错误。该框架还阐明了鲁棒性和对集群不平衡的敏感性之间的基本权衡,从而导致在不同目标下精确恢复的急剧相变。保证是确定性的和非渐近的,它们将算法精度的作用与实例的内在几何难度分开。我们进一步表明,错误集中在集群边界附近,足够深的集群核心恢复正是在加强本地利润。总之,这些结果提供了一个几何原理解释低客观值作为有意义的聚类结构的可靠证据。
摘要:We develop a geometric framework that links objective accuracy to structural recovery in prototype-based clustering. The analysis is algorithm-agnostic and applies to a broad class of admissible loss functions. We define a clustering condition number that compares within-cluster scale to the minimum loss increase required to move a point across a cluster boundary. When this quantity is small, any solution with a small suboptimality gap must also have a small misclassification error relative to a benchmark partition. The framework also clarifies a fundamental trade-off between robustness and sensitivity to cluster imbalance, leading to sharp phase transitions for exact recovery under different objectives. The guarantees are deterministic and non-asymptotic, and they separate the role of algorithmic accuracy from the intrinsic geometric difficulty of the instance. We further show that errors concentrate near cluster boundaries and that sufficiently deep cluster cores are recovered exactly under strengthened local margins. Together, these results provide a geometric principle for interpreting low objective values as reliable evidence of meaningful clustering structure.
Autonomous Driving | Vehicles | Lane Detection, etc. (3 papers)
【1】Scaling-Aware Data Selection for End-to-End Autonomous Driving Systems
标题:端到端自动驾驶系统的可扩展感知数据选择
链接:https://arxiv.org/abs/2604.08366
作者:Tolga Dimlioglu,Nadine Chang,Maying Shen,Rafid Mahmood,Jose M. Alvarez
备注:Accepted to CVPR 2026, 8 pages of main body and 10 pages of appendix
摘要:物理AI应用的大规模深度学习模型依赖于各种训练数据收集工作。这些模型以及相应的训练数据必须满足模型在现实环境中部署所需的不同评估标准。数据选择策略可以指导训练集的开发,但目前的框架没有考虑数据点如何影响不同指标的模糊性。在这项工作中,我们提出了通过缩放感知迭代收集(MOSAIC)进行混合优化,这是一种通用的数据选择框架,其操作方法是:(i)将数据集划分为域;(ii)将每个数据域的神经缩放律拟合到评估指标;以及(iii)通过迭代添加来自最大化指标变化的域的数据来优化数据混合。我们将MOSAIC应用于自动驾驶(AD),其中端到端(E2 E)规划器模型根据扩展预测驾驶员模型分数(EPDMS)(驾驶规则合规性指标的集合)进行评估。在这里,MOSAIC优于EPDMS上的一组不同的基线,数据减少了80%。
摘要:Large-scale deep learning models for physical AI applications depend on diverse training data collection efforts. These models and correspondingly, the training data, must address different evaluation criteria necessary for the models to be deployable in real-world environments. Data selection policies can guide the development of the training set, but current frameworks do not account for the ambiguity in how data points affect different metrics. In this work, we propose Mixture Optimization via Scaling-Aware Iterative Collection (MOSAIC), a general data selection framework that operates by: (i) partitioning the dataset into domains; (ii) fitting neural scaling laws from each data domain to the evaluation metrics; and (iii) optimizing a data mixture by iteratively adding data from domains that maximize the change in metrics. We apply MOSAIC to autonomous driving (AD), where an End-to-End (E2E) planner model is evaluated on the Extended Predictive Driver Model Score (EPDMS), an aggregate of driving rule compliance metrics. Here, MOSAIC outperforms a diverse set of baselines on EPDMS with up to 80\% less data.
【2】SearchAD: Large-Scale Rare Image Retrieval Dataset for Autonomous Driving
标题:SearchAD:用于自动驾驶的大规模稀有图像检索数据集
链接:https://arxiv.org/abs/2604.08008
作者:Felix Embacher,Jonas Uhrig,Marius Cordts,Markus Enzweiler
备注:To be published in CVPR 2026
摘要:从大规模数据集中检索罕见和安全关键的驾驶场景对于构建强大的自动驾驶(AD)系统至关重要。随着数据集规模的不断增长,关键挑战从收集更多数据转变为有效识别最相关的样本。我们介绍了SearchAD,一个大规模的罕见的图像检索数据集AD包含超过423k帧从11个已建立的数据集。SearchAD提供超过513k边界框的高质量手动注释,涵盖90个罕见类别。它专门针对定位极其罕见的类的大海捞针问题,有些类在整个数据集中出现不到50次。与专注于实例级检索的现有基准测试不同,SearchAD强调具有良好定义的数据分割的语义图像检索,支持文本到图像和图像到图像检索,Few-Shot学习以及多模态检索模型的微调。综合评价表明,基于文本的方法优于基于图像的,由于更强的内在语义基础。虽然直接将空间视觉特征与语言对齐的模型实现了最佳的zero-shot结果,并且我们的微调基线显着提高了性能,但绝对检索能力仍然不能令人满意。通过在公共基准服务器上进行测试,SearchAD建立了第一个用于AD检索驱动数据策展和长尾感知研究的大规模数据集:https://iis-esslingen.github.io/searchad/
摘要:Retrieving rare and safety-critical driving scenarios from large-scale datasets is essential for building robust autonomous driving (AD) systems. As dataset sizes continue to grow, the key challenge shifts from collecting more data to efficiently identifying the most relevant samples. We introduce SearchAD, a large-scale rare image retrieval dataset for AD containing over 423k frames drawn from 11 established datasets. SearchAD provides high-quality manual annotations of more than 513k bounding boxes covering 90 rare categories. It specifically targets the needle-in-a-haystack problem of locating extremely rare classes, with some appearing fewer than 50 times across the entire dataset. Unlike existing benchmarks, which focused on instance-level retrieval, SearchAD emphasizes semantic image retrieval with a well-defined data split, enabling text-to-image and image-to-image retrieval, few-shot learning, and fine-tuning of multi-modal retrieval models. Comprehensive evaluations show that text-based methods outperform image-based ones due to stronger inherent semantic grounding. While models directly aligning spatial visual features with language achieve the best zero-shot results, and our fine-tuning baseline significantly improves performance, absolute retrieval capabilities remain unsatisfactory. With a held-out test set on a public benchmark server, SearchAD establishes the first large-scale dataset for retrieval-driven data curation and long-tail perception research in AD: https://iis-esslingen.github.io/searchad/
【3】Cognitive-Causal Multi-Task Learning with Psychological State Conditioning for Assistive Driving Perception
标题:辅助驾驶感知心理状态条件反射的认知因果多任务学习
链接:https://arxiv.org/abs/2604.07651
作者:Keito Inoshita,Nobuhiro Hayashida,Akira Imanishi
摘要:高级驾驶辅助系统的多任务学习需要对驾驶员内部状态和外部交通环境之间的复杂相互作用进行建模。然而,现有的方法处理识别任务作为平坦的和独立的目标,未能利用认知因果结构的驾驶行为。在本文中,我们提出了CauPsi,一个认知科学为基础的因果多任务学习框架,明确建模交通上下文识别(TCR),车辆上下文识别(VCR),驾驶员情绪识别(DER)和驾驶员行为识别(DBR)之间的层次依赖关系。拟议的框架引入了两个关键机制。首先,因果任务链通过可学习的原型嵌入将上游任务预测传播到下游任务,以可区分的方式实现从环境感知到行为调节的认知级联。第二,跨任务心理条件反射(CTPC)从驾驶员面部表情和身体姿势估计心理状态信号,并将其作为条件反射输入注入到包括环境识别在内的所有任务中,从而模拟驾驶员内部状态对认知和决策过程的调节作用。在AIDE数据集上进行评估,Caupsi仅用5.05 M参数就实现了82.71%的平均准确度,总体上超过了之前的工作+1.0%,DER(+3.65%)和DBR(+7.53%)有显着改善。消融研究验证了每个组成部分的独立贡献,心理状态信号的分析证实,它获得系统的任务标签依赖的模式,在自我监督的方式没有明确的心理注释。
摘要:Multi-task learning for advanced driver assistance systems requires modeling the complex interplay between driver internal states and external traffic environments. However, existing methods treat recognition tasks as flat and independent objectives, failing to exploit the cognitive causal structure underlying driving behavior. In this paper, we propose CauPsi, a cognitive science-grounded causal multi-task learning framework that explicitly models the hierarchical dependencies among Traffic Context Recognition (TCR), Vehicle Context Recognition (VCR), Driver Emotion Recognition (DER), and Driver Behavior Recognition (DBR). The proposed framework introduces two key mechanisms. First, a Causal Task Chain propagates upstream task predictions to downstream tasks via learnable prototype embeddings, realizing the cognitive cascade from environmental perception to behavioral regulation in a differentiable manner. Second, Cross-Task Psychological Conditioning (CTPC) estimates a psychological state signal from driver facial expressions and body posture and injects it as a conditioning input to all tasks including environmental recognition, thereby modeling the modulatory effect of driver internal states on cognitive and decision-making processes. Evaluated on the AIDE dataset, CauPsi achieves a mean accuracy of 82.71% with only 5.05M parameters, surpassing prior work by +1.0% overall, with notable improvements on DER (+3.65%) and DBR (+7.53%). Ablation studies validate the independent contribution of each component, and analysis of the psychological state signal confirms that it acquires systematic task-label-dependent patterns in a self-supervised manner without explicit psychological annotations.
Federated Learning | Privacy Protection | Encryption (2 papers)
【1】Quantization Impact on the Accuracy and Communication Efficiency Trade-off in Federated Learning for Aerospace Predictive Maintenance
标题:量化对航空航天预测性维护联邦学习准确性和通信效率权衡的影响
链接:https://arxiv.org/abs/2604.08474
作者:Abdelkarim Loukili
摘要:联邦学习(FL)可以在分布式航空航天机队中实现隐私保护的预测性维护,但梯度通信开销限制了在带宽有限的物联网节点上的部署。本文研究了对称均匀量化($b \in \{32,8,4,2\}$ bits)对定制设计的轻量级1-D卷积模型(AeroConv1D,9\,697个参数)的准确性-效率权衡的影响,该模型在现实的非IID客户端分区下通过FL在NASA C-MAPSS基准上进行训练。使用严格的多种子评估($N=10$种子),我们表明,INT4在FD001($p=0.341$)和FD002($p=0.264$ MAE,$p=0.534$ NASA得分)上达到与FP32\emph{统计上无法区分}的精度,同时将梯度通信成本降低$8\times$(每轮37.88~KiB $\to$ 4.73~KiB)。一个关键的方法发现是,朴素的IID客户端划分人为地抑制了方差;正确的非IID评估揭示了极端量化的真实运行不稳定性,这一点通过直接的IID与非IID经验比较得到证明。INT2的经验特征是不合适的:虽然它通过极端量化引起的过度正则化在FD002上实现了较低的MAE,但这种表面上的增益伴随着灾难性的NASA得分不稳定性(CV\,=\,45.8\%,而FP32为22.3\%),证实了在异构运行条件下的不可复现性。对Xilinx ZCU102的FPGA资源解析预测证实,INT4符合硬件限制(85.5\%的DSP利用率),有可能在单个SoC上实现完整的FL流水线。完整的仿真代码库和FPGA估计脚本可在 https://github.com/therealdeadbeef/aerospace-fl-quantization 上公开获取。
摘要:Federated learning (FL) enables privacy-preserving predictive maintenance across distributed aerospace fleets, but gradient communication overhead constrains deployment on bandwidth-limited IoT nodes. This paper investigates the impact of symmetric uniform quantization ($b \in \{32,8,4,2\}$ bits) on the accuracy--efficiency trade-off of a custom-designed lightweight 1-D convolutional model (AeroConv1D, 9\,697 parameters) trained via FL on the NASA C-MAPSS benchmark under a realistic Non-IID client partition. Using a rigorous multi-seed evaluation ($N=10$ seeds), we show that INT4 achieves accuracy \emph{statistically indistinguishable} from FP32 on both FD001 ($p=0.341$) and FD002 ($p=0.264$ MAE, $p=0.534$ NASA score) while delivering an $8\times$ reduction in gradient communication cost (37.88~KiB $\to$ 4.73~KiB per round). A key methodological finding is that naïve IID client partitioning artificially suppresses variance; correct Non-IID evaluation reveals the true operational instability of extreme quantization, demonstrated via a direct empirical IID vs.\ Non-IID comparison. INT2 is empirically characterized as unsuitable: while it achieves lower MAE on FD002 through extreme quantization-induced over-regularization, this apparent gain is accompanied by catastrophic NASA score instability (CV\,=\,45.8\% vs.\ 22.3\% for FP32), confirming non-reproducibility under heterogeneous operating conditions. Analytical FPGA resource projections on the Xilinx ZCU102 confirm that INT4 fits within hardware constraints (85.5\% DSP utilization), potentially enabling a complete FL pipeline on a single SoC. The full simulation codebase and FPGA estimation scripts are publicly available at https://github.com/therealdeadbeef/aerospace-fl-quantization.
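The communication figures quoted in this abstract can be checked with a little arithmetic. The following is a minimal sketch, not the authors' code: it assumes the per-round payload is simply the parameter count times the bit width (ignoring any scale factors or packet overhead), and it uses a generic per-tensor symmetric uniform quantizer. All function names are illustrative.

```python
def quantize_symmetric(values, bits):
    """Per-tensor symmetric uniform quantization (generic sketch).

    Maps floats to signed integer codes in [-qmax, qmax], where
    qmax = 2**(bits - 1) - 1, using a single scale for the whole tensor.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from integer codes."""
    return [c * scale for c in codes]


def payload_kib(num_params, bits):
    """Size in KiB of one gradient upload, assuming num_params * bits bits."""
    return num_params * bits / 8 / 1024


NUM_PARAMS = 9697  # AeroConv1D parameter count stated in the abstract

fp32_kib = payload_kib(NUM_PARAMS, 32)
int4_kib = payload_kib(NUM_PARAMS, 4)

# Tiny round trip on made-up gradient values at 4 bits.
codes, scale = quantize_symmetric([1.0, -1.0, 0.0], bits=4)
restored = dequantize(codes, scale)
```

Under these assumptions the payload comes to about 37.88 KiB at 32 bits and 4.73 KiB at 4 bits, which is consistent with the 8x reduction reported above.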
【2】Automating aggregation strategy selection in federated learning
标题:在联邦学习中自动选择聚合策略
链接:https://arxiv.org/abs/2604.08056
作者:Dian S. Y. Pang,Endrias Y. Ergetu,Eric Topham,Ahmed E. Fetit
摘要:联合学习可以在不集中数据的情况下进行协作模型训练,但其有效性随聚合策略的选择而变化。这种选择是不平凡的,因为性能在数据集、异构性级别和计算约束之间差异很大。我们提出了一个端到端的框架,可以自动化,简化和适应联邦学习的聚合策略选择。该框架以两种模式运行:单次试验模式,大型语言模型从用户提供或自动检测的数据特征中推断出合适的策略,以及多次试验模式,轻量级遗传搜索在有限的预算下有效地探索替代方案。在不同数据集上进行的大量实验表明,我们的方法在非IID条件下增强了鲁棒性和泛化能力,同时减少了人工干预的需要。总的来说,这项工作通过自动化其最关键的设计决策之一,即聚合策略的选择,朝着可访问和自适应的联邦学习方向发展。
摘要:Federated Learning enables collaborative model training without centralising data, but its effectiveness varies with the selection of the aggregation strategy. This choice is non-trivial, as performance varies widely across datasets, heterogeneity levels, and compute constraints. We present an end-to-end framework that automates, streamlines, and adapts aggregation strategy selection for federated learning. The framework operates in two modes: a single-trial mode, where large language models infer suitable strategies from user-provided or automatically detected data characteristics, and a multi-trial mode, where a lightweight genetic search efficiently explores alternatives under constrained budgets. Extensive experiments across diverse datasets show that our approach enhances robustness and generalisation under non-IID conditions while reducing the need for manual intervention. Overall, this work advances towards accessible and adaptive federated learning by automating one of its most critical design decisions, the choice of an aggregation strategy.
Reasoning | Analysis | Understanding | Explanation (9 papers)
【1】EgoEverything: A Benchmark for Human Behavior Inspired Long Context Egocentric Video Understanding in AR Environment
标题:EgoEverything:AR环境中受长上下文以自我为中心的视频理解启发的人类行为基准
链接:https://arxiv.org/abs/2604.08342
作者:Qiance Tang,Ziqi Wang,Jieyu Lin,Ziyun Li,Barbara De Salvo,Sai Qian Zhang
摘要:长上下文自我中心的视频理解最近引起了重要的研究关注,增强现实(AR)突出作为其最重要的应用领域之一。然而,由于需要在扩展的时间上下文和多样化的非结构化活动中进行推理,因此该任务仍然具有很大的挑战性。虽然存在几个基准,但大多数以自我为中心的数据集依赖于人类佩戴的相机,主要关注视觉内容,在形成视频相关查询时对底层用户行为的考虑有限。EgoEverything是一个基准测试,它在生成问题时,通过利用从凝视数据中提取的人类注意力信号来明确考虑人类行为。它包括超过5,000个多项选择题答案对,跨越100多个小时的视频。通过在问题生成过程中整合人类注意力信号,它可以更忠实地捕捉自然的人类行为,并为AR中的长上下文自我中心视频理解提供现实的评估设置。
摘要:Long context egocentric video understanding has recently attracted significant research attention, with augmented reality (AR) highlighted as one of its most important application domains. Nevertheless, the task remains highly challenging due to the need for reasoning over extended temporal contexts and diverse, unstructured activities. Although several benchmarks exist, most egocentric datasets rely on human worn cameras and focus mainly on visual content, with limited consideration of underlying user behavior when forming video-related queries. EgoEverything is a benchmark that explicitly considers human behavior by leveraging human attention signals, abstracted from gaze data, when generating questions. It comprises over 5,000 multiple choice question answer pairs, spanning more than 100 hours of video. By integrating human attention signals during question generation, it more faithfully captures natural human behavior and offers a realistic evaluation setting for long-context egocentric video understanding in AR.
【2】Alloc-MoE: Budget-Aware Expert Activation Allocation for Efficient Mixture-of-Experts Inference
标题:Alloc-MoE:预算感知专家激活分配,以实现高效的混合专家推理
链接:https://arxiv.org/abs/2604.08133
作者:Baihui Liu,Kaiyuan Tian,Wei Wang,Zhaoning Zhang,Linbo Qiao,Dongsheng Li
备注:ACL 2026 main
摘要:混合专家模型(Mixture-of-Experts,MoE)由于其稀疏的激活机制,已经成为扩展大型语言模型的主流架构。然而,大量的专家激活在推理期间产生了关键的延迟瓶颈,特别是在资源受限的部署场景中。现有的减少专家激活的方法可能会导致严重的模型性能下降。在这项工作中,我们引入\emph{激活预算}的概念,作为对专家激活数量的约束,并提出Alloc-MoE,一个统一的框架,在层级和令牌级协同优化预算分配,以尽量减少性能下降。在层级别上,我们引入了Alloc-L,它利用敏感性分析和动态规划来确定专家激活在层之间的最佳分配。在令牌级别,我们提出了Alloc-T,它根据路由分数动态地重新分配激活,优化预算分配而不增加延迟。跨多个MoE模型的大量实验表明,Alloc-MoE在受约束的激活预算下保持模型性能。特别是,Alloc-MoE在DeepSeek-V2-Lite上以原始预算的一半实现了$1.15\times$预填充和$1.34\times$解码加速。
摘要:Mixture-of-Experts (MoE) has become a dominant architecture for scaling large language models due to its sparse activation mechanism. However, the substantial number of expert activations creates a critical latency bottleneck during inference, especially in resource-constrained deployment scenarios. Existing approaches that reduce expert activations potentially lead to severe model performance degradation. In this work, we introduce the concept of \emph{activation budget} as a constraint on the number of expert activations and propose Alloc-MoE, a unified framework that jointly optimizes budget allocation at both the layer and token levels to minimize performance degradation. At the layer level, we introduce Alloc-L, which leverages sensitivity profiling and dynamic programming to determine the optimal allocation of expert activations across layers. At the token level, we propose Alloc-T, which dynamically redistributes activations based on routing scores, optimizing budget allocation without increasing latency. Extensive experiments across multiple MoE models demonstrate that Alloc-MoE maintains model performance under a constrained activation budget. In particular, Alloc-MoE achieves $1.15\times$ prefill and $1.34\times$ decode speedups on DeepSeek-V2-Lite at half of the original budget.
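The layer-level step described above (sensitivity profiling followed by dynamic programming) can be illustrated with a small knapsack-style sketch. This is only one plausible reading, not the Alloc-L algorithm itself: the abstract does not give the objective, so the code assumes a hypothetical table `loss[l][k-1]` holding the profiled degradation of layer `l` when `k` experts are active, and it picks per-layer counts that sum exactly to the budget while minimizing the total. The numbers are made up.

```python
def allocate_budget(loss, budget):
    """Choose experts-per-layer summing to `budget` with minimal total loss.

    loss[l][k - 1] is an assumed, pre-profiled degradation score for layer l
    when k experts are activated (k = 1 .. len(loss[l])).
    Returns (allocation list, total loss).
    """
    inf = float("inf")
    # best[b]: minimal loss over the layers seen so far using exactly b activations
    best = [0.0] + [inf] * budget
    picks = []  # picks[l][b]: experts chosen for layer l on the best path to b

    for layer_loss in loss:
        new_best = [inf] * (budget + 1)
        pick = [0] * (budget + 1)
        for used in range(budget + 1):
            if best[used] == inf:
                continue
            for k, cost in enumerate(layer_loss, start=1):
                total = used + k
                if total <= budget and best[used] + cost < new_best[total]:
                    new_best[total] = best[used] + cost
                    pick[total] = k
        best = new_best
        picks.append(pick)

    if best[budget] == inf:
        raise ValueError("budget cannot be met with the given per-layer options")

    # Backtrack from the full budget to recover the per-layer choices.
    allocation = []
    remaining = budget
    for pick in reversed(picks):
        k = pick[remaining]
        allocation.append(k)
        remaining -= k
    allocation.reverse()
    return allocation, best[budget]


# Made-up sensitivity table: 3 layers, up to 3 experts each.
LOSS = [
    [0.9, 0.4, 0.1],    # layer 0 is very sensitive to losing experts
    [0.5, 0.45, 0.4],   # layer 1 barely cares
    [0.8, 0.2, 0.15],   # layer 2 needs at least two experts
]
allocation, total_loss = allocate_budget(LOSS, budget=6)
```

In this toy table the sensitive first layer keeps all three experts while the insensitive middle layer drops to one, which is the kind of non-uniform allocation a layer-level budget is meant to allow.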
【3】Multimodal Latent Reasoning via Predictive Embeddings
标题:通过预测嵌入的多模式潜在推理
链接:https://arxiv.org/abs/2604.08065
作者:Ashutosh Adhikari,Mirella Lapata
摘要:工具增强的多模态推理使视觉语言模型(VLM)能够通过与外部工具(例如,裁剪、深度估计)。然而,这样的方法会产生大量的推理开销,需要专门的监督,并且容易出现错误的工具调用。我们提出了Pearl(潜在空间中推理的预测嵌入对齐),这是一个受JEPA启发的框架,它完全在潜在空间中从专家工具使用轨迹中学习,从而消除了在推理时显式调用工具的需要。与基于重构的潜在推理方法不同,基于重构的潜在推理方法自回归生成潜在令牌,并且受到训练推理不匹配和对多步工具使用的有限支持的影响,Pearl直接从多模态轨迹中学习预测嵌入,同时保留标准的视觉语言生成管道:它与模型无关,训练简单,并且自然支持多个工具调用的轨迹。多个感知基准的实验表明,Pearl匹配或优于标准的监督微调和基于重构的潜在推理方法。此外,我们提供的经验证据表明,基于重建的方法主要学习嵌入,而不是潜在空间中的图像编辑,激励预测嵌入学习作为一种更有原则的选择。
摘要:Tool-augmented multimodal reasoning enables visual language models (VLMs) to improve perception by interacting with external tools (e.g., cropping, depth estimation). However, such approaches incur substantial inference overhead, require specialized supervision, and are prone to erroneous tool calls. We propose Pearl (Predictive Embedding Alignment for Reasoning in Latent space), a JEPA-inspired framework that learns from expert tool-use trajectories entirely in the latent space, eliminating the need for explicit tool invocation at inference time. Unlike reconstruction-based latent reasoning methods, which autoregressively generate latent tokens and suffer from training-inference mismatch and limited support for multi-step tool use, Pearl directly learns predictive embeddings from multimodal trajectories while preserving the standard vision-language generation pipeline: it is model-agnostic, simple to train, and naturally supports trajectories with multiple tool calls. Experiments across multiple perception benchmarks show that Pearl matches or outperforms standard supervised fine-tuning and reconstruction-based latent reasoning approaches. Furthermore, we provide empirical evidence that reconstruction-based methods primarily learn embeddings rather than image edits in latent space, motivating predictive embedding learning as a more principled alternative.
【4】Sinkhorn doubly stochastic attention rank decay analysis
标题:Sinkhorn双随机注意秩衰减分析
链接:https://arxiv.org/abs/2604.07925
作者:Michela Lapenna,Rita Fioresi,Bahman Gharesifard
摘要:自我关注机制是Transformer架构成功的核心。然而,标准的行随机注意力已被证明遭受显着的跨层信号退化。特别是,它可以导致排名崩溃,导致越来越统一的令牌表示,以及熵崩溃,其特征是高度集中的注意力分布。最近的工作强调了双重随机注意力作为熵正则化的一种形式的好处,促进了更平衡的注意力分布,并提高了经验性能。在本文中,我们研究了跨网络深度的秩崩溃,并证明了用Sinkhorn算法归一化的双随机注意矩阵比标准Softmax行随机注意矩阵更有效地保持秩。如前所述,对于Softmax来说,跳过连接对于缓解秩崩溃至关重要。我们在情感分析和图像分类任务上实证验证了这种现象。此外,当使用Sinkhorn归一化时,我们推导出纯自我注意力秩衰减的理论界,并发现秩随着深度呈双指数衰减到1,这一现象已经在Softmax中表现出来。
摘要:The self-attention mechanism is central to the success of Transformer architectures. However, standard row-stochastic attention has been shown to suffer from significant signal degradation across layers. In particular, it can induce rank collapse, resulting in increasingly uniform token representations, as well as entropy collapse, characterized by highly concentrated attention distributions. Recent work has highlighted the benefits of doubly stochastic attention as a form of entropy regularization, promoting a more balanced attention distribution and leading to improved empirical performance. In this paper, we study rank collapse across network depth and show that doubly stochastic attention matrices normalized with Sinkhorn algorithm preserve rank more effectively than standard Softmax row-stochastic ones. As previously shown for Softmax, skip connections are crucial to mitigate rank collapse. We empirically validate this phenomenon on both sentiment analysis and image classification tasks. Moreover, we derive a theoretical bound for the pure self-attention rank decay when using Sinkhorn normalization and find that rank decays to one doubly exponentially with depth, a phenomenon that has already been shown for Softmax.
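To make the distinction between row-stochastic and doubly stochastic attention concrete, here is a small pure-Python sketch. It is an illustration under assumptions rather than the paper's implementation: `softmax_rows` and `sinkhorn` are hypothetical helper names, the score matrix is made up, and a fixed number of alternating row and column normalizations stands in for whatever stopping rule the authors use.

```python
import math


def softmax_rows(scores):
    """Standard attention normalization: each row sums to 1."""
    result = []
    for row in scores:
        row_max = max(row)  # subtract the max for numerical stability
        exps = [math.exp(s - row_max) for s in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


def sinkhorn(scores, num_iters=50):
    """Sinkhorn normalization of a square score matrix.

    Exponentiates the scores, then alternates row and column normalization
    so that the result is approximately doubly stochastic.
    """
    n = len(scores)
    global_max = max(max(row) for row in scores)
    mat = [[math.exp(s - global_max) for s in row] for row in scores]
    for _ in range(num_iters):
        # Row step: make every row sum to 1.
        mat = [[v / sum(row) for v in row] for row in mat]
        # Column step: make every column sum to 1.
        col_sums = [sum(mat[i][j] for i in range(n)) for j in range(n)]
        mat = [[mat[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return mat


SCORES = [
    [1.0, 0.5, -0.2],
    [0.1, 0.9, 0.3],
    [-0.4, 0.2, 0.8],
]
row_stochastic = softmax_rows(SCORES)
doubly_stochastic = sinkhorn(SCORES)

row_sums = [sum(row) for row in doubly_stochastic]
col_sums = [sum(doubly_stochastic[i][j] for i in range(3)) for j in range(3)]
```

Softmax constrains only the rows, so its column sums are generally not 1. The Sinkhorn iteration constrains both, which is the property the abstract links to better rank preservation.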
【5】QaRL: Rollout-Aligned Quantization-Aware RL for Fast and Stable Training under Training--Inference Mismatch
标题:QaRL:推出对齐量化感知RL,在训练中实现快速稳定的训练--推理不匹配
链接:https://arxiv.org/abs/2604.07853
作者:Hao Gu,Hao Wang,Jiacheng Liu,Lujun Li,Qiyuan Zhu,Bei Liu,Binxing Xu,Lei Wang,Xintong Yang,Sida Lin,Sirui Han,Yike Guo
摘要:大型语言模型(LLM)强化学习(RL)管道的瓶颈通常在于rollout生成,从而使端到端训练变得缓慢。最近的工作通过量化rollout来加速解码以缓解这一问题,解码是RL循环中最昂贵的阶段。然而,这些设置会放大训练-推理差距,从而破坏优化的稳定性:rollout以低精度运行,而学习更新以全精度计算。为了应对这一挑战,我们提出了QaRL(Rollout对齐的量化感知RL),它将训练侧的前向计算与量化rollout对齐,以最大限度地减少失配。我们进一步确定了量化rollout的一种失败模式:长篇响应往往会产生重复的、混乱的令牌(错误令牌)。为了缓解这些问题,我们引入了TBPO(信任带策略优化),这是一个对负样本进行双重裁剪的序列级目标,旨在将更新保持在信任区域内。在Qwen3-30B-A3B MoE的数学问题上,QaRL比量化rollout训练高出+5.5,同时提高了稳定性并保留了低位吞吐量优势。
摘要:Large language model (LLM) reinforcement learning (RL) pipelines are often bottlenecked by rollout generation, making end-to-end training slow. Recent work mitigates this by running rollouts with quantization to accelerate decoding, which is the most expensive stage of the RL loop. However, these setups destabilize optimization by amplifying the training-inference gap: rollouts are operated at low precision, while learning updates are computed at full precision. To address this challenge, we propose QaRL (Rollout Alignment Quantization-Aware RL), which aligns training-side forward with the quantized rollout to minimize mismatch. We further identify a failure mode in quantized rollouts: long-form responses tend to produce repetitive, garbled tokens (error tokens). To mitigate these problems, we introduce TBPO (Trust-Band Policy Optimization), a sequence-level objective with dual clipping for negative samples, aimed at keeping updates within the trust region. On Qwen3-30B-A3B MoE for math problems, QaRL outperforms quantized-rollout training by +5.5 while improving stability and preserving low-bit throughput benefits.
【6】Mathematical analysis of one-layer neural network with fixed biases, a new activation function and other observations
标题:具有固定偏差、新激活函数和其它观测值的单层神经网络的数学分析
链接:https://arxiv.org/abs/2604.07715
作者:Fabricio Macià,Shu Nakamura
摘要:我们分析了一个简单的单隐藏层神经网络,该网络具有ReLU激活函数和固定偏差,输入和输出均为一维。我们研究了模型的连续和离散版本,并严格证明了在$L^2$平方损失函数和梯度下降过程下学习过程的收敛性。我们还证明了这个学习过程的谱偏置性质。文中讨论了本分析的若干结论,特别是关于激活函数应具备的结构和性质,以及某些算子的谱与学习过程之间的关系。在此基础上,我们还提出了一种替代的激活函数,即全波整流指数函数(FReX),并讨论了使用这种替代激活函数时梯度下降的收敛性。
摘要:We analyze a simple one-hidden-layer neural network with ReLU activation functions and fixed biases, with one-dimensional input and output. We study both continuous and discrete versions of the model, and we rigorously prove the convergence of the learning process with the $L^2$ squared loss function and the gradient descent procedure. We also prove the spectral bias property for this learning process. Several conclusions of this analysis are discussed; in particular, regarding the structure and properties that activation functions should possess, as well as the relationships between the spectrum of certain operators and the learning process. Based on this, we also propose an alternative activation function, the full-wave rectified exponential function (FReX), and we discuss the convergence of the gradient descent with this alternative activation function.
【7】Joint Task Offloading, Inference Optimization and UAV Trajectory Planning for Generative AI Empowered Intelligent Transportation Digital Twin
标题:生成人工智能赋能智能交通数字孪生的联合任务卸载、推理优化和无人机轨迹规划
链接:https://arxiv.org/abs/2604.07687
作者:Xiaohuan Li,Junchuan Fan,Bingqi Zhang,Rong Yu,Xumin Huang,Qian Chen
摘要:为了实现智能交通数字孪生(ITDT),无人机(UAV)被安排处理来自路边传感器的传感数据。此时,在无人机上部署扩散模型等生成式人工智能(GAI)技术,将原始传感数据转化为高质量和有价值的数据。因此,我们提出了GAI授权的ITDT。具有动态机动能力的无人机对一组扩散模型推理任务的动态处理同时影响DT更新的保真度和延迟。在本文中,我们研究了一个联合优化问题的无人机任务卸载,推理优化和无人机轨迹规划的系统效用最大化(SUM)问题,以解决可靠性延迟权衡的GAI授权的ITDT。为了解决网络动态下的问题,将SUM问题建模为异质代理马尔可夫决策过程,提出了基于顺序更新的异质代理双延迟深度确定性策略梯度(SU-HATD 3)算法,该算法能够快速学习到近似最优解.数值结果表明,与几种基线算法相比,该算法在提高系统效用和收敛速度方面具有很大的优势。
摘要:To implement the intelligent transportation digital twin (ITDT), unmanned aerial vehicles (UAVs) are scheduled to process the sensing data from the roadside sensors. At this time, generative artificial intelligence (GAI) technologies such as diffusion models are deployed on the UAVs to transform the raw sensing data into high-quality, valuable data. Therefore, we propose the GAI-empowered ITDT. The dynamic processing of a set of diffusion model inference (DMI) tasks on the UAVs with dynamic mobility simultaneously influences the DT updating fidelity and delay. In this paper, we investigate a joint optimization problem of DMI task offloading, inference optimization and UAV trajectory planning as the system utility maximization (SUM) problem to address the fidelity-delay tradeoff for the GAI-empowered ITDT. To seek a solution to the problem under the network dynamics, we model the SUM problem as the heterogeneous-agent Markov decision process, and propose the sequential update-based heterogeneous-agent twin delayed deep deterministic policy gradient (SU-HATD3) algorithm, which can quickly learn a near-optimal solution. Numerical results demonstrate that compared with several baseline algorithms, the proposed algorithm has great advantages in improving the system utility and convergence rate.
【8】Towards Counterfactual Explanation and Assertion Inference for CPS Debugging
标题:CPS收件箱的反事实解释和断言推理
链接:https://arxiv.org/abs/2604.07679
作者:Zaid Ghazal,Hadiza Yusuf,Khouloud Gaaloul
摘要:通过大规模仿真对信息物理系统(CPS)进行验证和确认,通常会出现难以解释的故障,特别是在特定事件或时间由连续和离散行为之间的相互作用触发时。现有的调试技术可以将异常定位到特定的模型组件,但它们对触发违规的输入信号值和时序条件,或者可以防止故障的最小精确时序变化几乎没有提供任何见解。在本文中,我们介绍DeCaF,这是一个用于CPS调试的反事实指导解释和基于断言的表征框架。给定一个失败的测试输入,DeCaF会对输入信号产生反事实的变化,将测试从失败转换为通过。这些更改被设计为最小、必要且足以精确恢复正确性。然后,它将断言推断为输入上的逻辑谓词,这些输入以工程师可以推理的可解释形式概括恢复条件,而无需访问内部模型细节。我们的方法结合了三个反事实发电机与两个因果模型,并推断成功的断言。在三个CPS案例研究中,DeCaF使用KD树最近邻结合M5模型树实现了最佳成功率,而遗传算法结合随机森林在成功和因果精度之间提供了最强的平衡。
摘要:Verification and validation of cyber-physical systems (CPS) via large-scale simulation often surface failures that are hard to interpret, especially when triggered by interactions between continuous and discrete behaviors at specific events or times. Existing debugging techniques can localize anomalies to specific model components, but they provide little insight into the input-signal values and timing conditions that trigger violations, or the minimal, precisely timed changes that could have prevented the failure. In this article, we introduce DeCaF, a counterfactual-guided explanation and assertion-based characterization framework for CPS debugging. Given a failing test input, DeCaF generates counterfactual changes to the input signals that transform the test from failing to passing. These changes are designed to be minimal, necessary, and sufficient to precisely restore correctness. Then, it infers assertions as logical predicates over inputs that generalize recovery conditions in an interpretable form engineers can reason about, without requiring access to internal model details. Our approach combines three counterfactual generators with two causal models, and infers success assertions. Across three CPS case studies, DeCaF achieves its best success rate with KD-Tree Nearest Neighbors combined with M5 model tree, while Genetic Algorithm combined with Random Forest provides the strongest balance between success and causal precision.
【9】QARIMA: A Quantum Approach To Classical Time Series Analysis
标题:QARIMA:经典时间序列分析的量子方法
链接:https://arxiv.org/abs/2604.08277
作者:Nishikanta Mohanty,Bikash K. Behera,Badshah Mukherjee,Pravat Dash
备注:17 Algorithms, 19 Figures , 26 Tables
摘要:我们提出了一个量子启发的ARIMA方法,集成了量子辅助滞后发现与\emph{固定配置}变分量子电路(VQC)的参数估计和弱滞后细化。通过交换测试驱动的量子自相关(QACF)和量子偏自相关(QPACF)识别差分和候选滞后,延迟矩阵构造将量子投影与时域回归量对齐,然后是标准信息准则简约。给定筛选阶数$(p,d,q)$,我们保留固定的VQC拟设(ansatz)、优化器和训练预算,防止超参数泄漏,并将电路部署在两个估计角色中:自回归系数的VQC-AR和移动平均系数的VQC-MA。在筛选和估计之间,轻量级VQC弱滞后细化重新加权或修剪筛选的AR滞后,而不改变$(p,d,q)$。在环境和工业数据集上,我们对自动经典ARIMA进行滚动原点评估,报告样本外均方误差(MSE),平均绝对百分比误差(MAPE)以及MSE和MAE的Diebold-Mariano检验。从经验上讲,七个量子贡献--(1)差分选择,(2)QACF,(3)QPACF,(4)具有延迟矩阵构造的交换测试原语,(5)VQC-AR,(6)VQC弱滞后细化,和(7)VQC-MA --共同减少了元优化开销,并明确了量子效应进入阶数发现,滞后细化和AR/MA参数估计的位置。
摘要:We present a quantum-inspired ARIMA methodology that integrates quantum-assisted lag discovery with \emph{fixed-configuration} variational quantum circuits (VQCs) for parameter estimation and weak-lag refinement. Differencing and candidate lags are identified via swap-test-driven quantum autocorrelation (QACF) and quantum partial autocorrelation (QPACF), with a delayed-matrix construction that aligns quantum projections to time-domain regressors, followed by standard information-criterion parsimony. Given the screened orders $(p,d,q)$, we retain a fixed VQC ansatz, optimizer, and training budget, preventing hyperparameter leakage, and deploy the circuit in two estimation roles: VQC-AR for autoregressive coefficients and VQC-MA for moving-average coefficients. Between screening and estimation, a lightweight VQC weak-lag refinement re-weights or prunes screened AR lags without altering $(p,d,q)$. Across environmental and industrial datasets, we perform rolling-origin evaluations against automated classical ARIMA, reporting out-of-sample mean squared error (MSE), mean absolute percentage error (MAPE), and Diebold--Mariano tests on MSE and MAE. Empirically, the seven quantum contributions -- (1) differencing selection, (2) QACF, (3) QPACF, (4) swap-test primitives with delayed-matrix construction, (5) VQC-AR, (6) VQC weak-lag refinement, and (7) VQC-MA -- collectively reduce meta-optimization overhead and make explicit where quantum effects enter order discovery, lag refinement, and AR/MA parameter estimation.
Detection (3 papers)
【1】DeepForestSound: a multi-species automatic detector for passive acoustic monitoring in African tropical forests, a case study in Kibale National Park
标题:DeepForestSound:一种用于非洲热带森林被动声学监测的多物种自动探测器,基巴莱国家公园的案例研究
链接:https://arxiv.org/abs/2604.08087
作者:Gabriel Dubus,Théau d'Audiffret,Claire Auger,Raphaël Cornette,Sylvain Haupert,Innocent Kasekendi,Raymond Katumba,Hugo Magaldi,Lise Pernel,Harold Rugonge,Jérôme Sueur,John Justice Tibesigwa,Sabrina Krief
备注:8 pages
摘要:被动声监测(PAM)技术在生物多样性评价中有着广泛的应用。它在非洲热带森林的应用是有限的稀缺注释数据,降低了性能的通用生态声学模型代表性不足的类群。在这项研究中,我们介绍了DeepForestSound(DFS),一个多物种的自动检测模型,专为非洲热带森林中的PAM。DFS依赖于半监督管道,将未注释录音的聚类与手动验证相结合,然后使用低等级自适应对音频谱图Transformer(AST)进行监督微调,并将其与冻结的主干线性基线(DFS-Linear)进行比较。该框架支持从长期声学记录中检测多个分类组,包括鸟类,灵长类动物和大象。DFS在乌干达Kibale国家公园的Sebitoli地区收集的声学数据上进行了培训,并在两年后在同一森林的不同地点记录的独立数据集上进行了评估。因此,本评价评估了单一热带森林生态系统内跨时间和记录地点的一般化。在12个分类群中的8个分类群中,DFS优于现有的自动检测工具,特别是对于非鸟类分类群,灵长类动物的平均AP值为0.964,大象为0.961。结果进一步表明,基于LoRA的微调大大优于跨分类群的线性探测。总体而言,这些结果表明,以任务为导向,区域特定的培训大大提高了在声学复杂的热带环境中的检测性能,并强调DFS作为非洲雨林生物多样性监测和保护的实用工具的潜力。
摘要:Passive Acoustic Monitoring (PAM) is widely used for biodiversity assessment. Its application in African tropical forests is limited by scarce annotated data, reducing the performance of general-purpose ecoacoustic models on underrepresented taxa. In this study, we introduce DeepForestSound (DFS), a multi-species automatic detection model designed for PAM in African tropical forests. DFS relies on a semi-supervised pipeline combining clustering of unannotated recordings with manual validation, followed by supervised fine-tuning of an Audio Spectrogram Transformer (AST) using low-rank adaptation, which is compared to a frozen-backbone linear baseline (DFS-Linear). The framework supports the detection of multiple taxonomic groups, including birds, primates, and elephants, from long-term acoustic recordings. DFS was trained on acoustic data collected in the Sebitoli area, in Kibale National Park, Uganda, and evaluated on an independent dataset recorded two years later at different locations within the same forest. This evaluation therefore assesses generalization across time and recording sites within a single tropical forest ecosystem. Across 8 out of 12 taxons, DFS outperforms existing automatic detection tools, particularly for non-avian taxa, achieving average AP values of 0.964 for primates and 0.961 for elephants. Results further show that LoRA-based fine-tuning substantially outperforms linear probing across taxa. Overall, these results demonstrate that task-oriented, region-specific training substantially improves detection performance in acoustically complex tropical environments, and highlight the potential of DFS as a practical tool for biodiversity monitoring and conservation in African rainforests.
【2】Fraud Detection System for Banking Transactions
标题:银行交易欺诈检测系统
链接:https://arxiv.org/abs/2604.07952
作者:Ranya Batsyas,Ritesh Yaduwanshi
摘要:数字支付系统的扩展提高了在线金融交易的规模和复杂性,从而增加了欺诈活动的脆弱性。由于攻击策略的性质不断变化,以及真实交易和欺诈交易之间的显著差异,有效检测欺诈变得更加复杂。本研究介绍了一个基于机器学习的欺诈检测框架,利用PaySim合成金融交易数据集。根据CRISP-DM方法,该研究包括假设驱动的探索性分析,特征细化以及基线模型的比较评估,如逻辑回归和基于树的分类器,如随机森林,XGBoost和决策树。为了解决类别不平衡问题,采用了SMOTE,并通过GridSearchCV的超参数调整来增强模型性能。拟议的框架提供了一个强大且可扩展的解决方案,以增强金融科技交易系统的欺诈预防能力。关键词:欺诈检测,不平衡数据,HPO,SMOTE
摘要:The expansion of digital payment systems has heightened both the scale and intricacy of online financial transactions, thereby increasing vulnerability to fraudulent activities. Detecting fraud effectively is complicated by the changing nature of attack strategies and the significant disparity between genuine and fraudulent transactions. This research introduces a machine learning-based fraud detection framework utilizing the PaySim synthetic financial transaction dataset. Following the CRISP-DM methodology, the study includes hypothesis-driven exploratory analysis, feature refinement, and a comparative assessment of baseline models such as Logistic Regression and tree-based classifiers like Random Forest, XGBoost, and Decision Tree. To tackle class imbalance, SMOTE is employed, and model performance is enhanced through hyperparameter tuning with GridSearchCV. The proposed framework provides a robust and scalable solution to enhance fraud prevention capabilities in FinTech transaction systems. Keywords: fraud detection, imbalanced data, HPO, SMOTE
【3】Needle in a Haystack -- One-Class Representation Learning for Detecting Rare Malignant Cells in Computational Cytology
标题:干草堆中的针--计算细胞学中检测罕见恶性细胞的一类表示学习
链接:https://arxiv.org/abs/2604.07722
作者:Swarnadip Chatterjee,Vladimir Basic,Arrigo Capitanio,Orcun Goksel,Joakim Lindblad
备注:15 pages, 7 figures
摘要:在计算细胞学中,在全切片图像上检测恶性肿瘤是困难的,因为恶性细胞在形态上是多样的,但在大量的正常细胞背景中却非常罕见。由于大的类别不平衡和有限的注释,这些极其罕见的恶性细胞的准确检测仍然具有挑战性。传统的弱监督方法,如多实例学习(MIL),往往无法在实例级别进行推广,特别是当恶性细胞的比例(见证率)非常低时。在这项研究中,我们探索使用一类表示学习技术在低见证率的情况下检测恶性细胞。这些方法只在切片阴性的图像块上训练,不需要任何实例级监督。具体来说,我们评估两种OCC方法,DSVDD和DROC,并将它们与FS-SIL,WS-SIL,和最近的ItS2CLR方法进行比较。单类方法学习正常性的紧凑表示,并在测试时检测偏差。在公开可用的骨髓细胞形态学数据集(TCIA)和内部口腔癌细胞学数据集上的实验表明,DSVDD在实例级异常排名中实现了最先进的性能,特别是在超低见证率情形中($\leq 1\%$),在某些情况下,甚至优于完全监督学习,而由于详尽的实例级注释不可行,完全监督学习在全切片细胞学中通常不是实际的选择。DROC在极端稀有的情况下也具有竞争力,这得益于分布增强的对比学习。这些发现表明,在极端罕见的情况下进行恶性细胞检测时,一类表示学习是比MIL更稳健、更可解释的选择。
摘要:In computational cytology, detecting malignancy on whole-slide images is difficult because malignant cells are morphologically diverse yet vanishingly rare amid a vast background of normal cells. Accurate detection of these extremely rare malignant cells remains challenging due to large class imbalance and limited annotations. Conventional weakly supervised approaches, such as multiple instance learning (MIL), often fail to generalize at the instance level, especially when the fraction of malignant cells (witness rate) is exceedingly low. In this study, we explore the use of one-class representation learning techniques for detecting malignant cells in low-witness-rate scenarios. These methods are trained exclusively on slide-negative patches, without requiring any instance-level supervision. Specifically, we evaluate two OCC approaches, DSVDD and DROC, and compare them with FS-SIL, WS-SIL, and the recent ItS2CLR method. The one-class methods learn compact representations of normality and detect deviations at test time. Experiments on a publicly available bone marrow cytomorphology dataset (TCIA) and an in-house oral cancer cytology dataset show that DSVDD achieves state-of-the-art performance in instance-level abnormality ranking, particularly in ultra-low witness-rate regimes ($\leq 1\%$) and, in some cases, even outperforming fully supervised learning, which is typically not a practical option in whole-slide cytology due to the infeasibility of exhaustive instance-level annotations. DROC is also competitive under extreme rarity, benefiting from distribution-augmented contrastive learning. These findings highlight one-class representation learning as a robust and interpretable superior choice to MIL for malignant cell detection under extreme rarity.
分类|识别(3篇)
【1】Pruning Extensions and Efficiency Trade-Offs for Sustainable Time Series Classification
标题:可持续时间序列分类的修剪扩展和效率权衡
链接:https://arxiv.org/abs/2604.07953
作者:Raphael Fischer,Angus Dempster,Sebastian Buschjäger,Matthias Jakobs,Urav Maniar,Geoffrey I. Webb
摘要:时间序列分类(TSC)支持重要的用例,但缺乏对模型,数据集和硬件之间性能权衡的统一理解。虽然该领域的资源意识有所提高,但尚未对TSC方法的能源效率进行严格评估。本文介绍了一个全面的评估框架,明确探讨了TSC的预测性能和资源消耗的平衡。为了提高效率,我们将一种理论上有界的剪枝策略应用于领先的混合分类器Hydra和Quant,并提出Hydrant,一种结合二者的新型可剪枝模型。通过20个MONSTER数据集,13种方法和3种计算设置的4000多个实验配置,我们系统地分析了模型设计,超参数和硬件选择如何影响实际TSC性能。我们的研究结果表明,剪枝可以显著降低高达80%的能耗,同时保持有竞争力的预测质量,通常仅损失不到5%的准确率。所提出的方法,实验结果和配套软件推进TSC走向可持续和可重复的实践。
摘要:Time series classification (TSC) enables important use cases, however lacks a unified understanding of performance trade-offs across models, datasets, and hardware. While resource awareness has grown in the field, TSC methods have not yet been rigorously evaluated for energy efficiency. This paper introduces a holistic evaluation framework that explicitly explores the balance of predictive performance and resource consumption in TSC. To boost efficiency, we apply a theoretically bounded pruning strategy to leading hybrid classifiers - Hydra and Quant - and present Hydrant, a novel, prunable combination of both. With over 4000 experimental configurations across 20 MONSTER datasets, 13 methods, and three compute setups, we systematically analyze how model design, hyperparameters, and hardware choices affect practical TSC performance. Our results showcase that pruning can significantly reduce energy consumption by up to 80% while maintaining competitive predictive quality, usually costing the model less than 5% of accuracy. The proposed methodology, experimental results, and accompanying software advance TSC toward sustainable and reproducible practice.
【2】A Novel Edge-Assisted Quantum-Classical Hybrid Framework for Crime Pattern Learning and Classification
标题:用于犯罪模式学习和分类的新型边缘辅助量子经典混合框架
链接:https://arxiv.org/abs/2604.07389
作者:Niloy Das,Apurba Adhikary,Sheikh Salman Hassan,Yu Qiao,Zhu Han,Tharmalingam Ratnarajah,Choong Seon Hong
摘要:犯罪模式分析对于执法和预测性警务至关重要,但快速城市化带来的犯罪活动激增产生了高维、不平衡的数据集,挑战了传统的分类方法。这项研究提出了一个量子经典犯罪分析的比较框架,评估四个计算范式:量子模型,经典的基线机器学习模型,和两个混合量子经典架构。使用16年的孟加拉国犯罪统计数据,我们系统地评估了严格的交叉验证方法下的分类性能和计算效率。实验结果表明,量子启发的方法,特别是QAOA,实现了高达84.6%的准确率,同时需要比经典基线更少的可训练参数,这表明内存受限的边缘部署具有实际优势。所提出的相关感知电路设计展示了将特定领域的特征关系纳入量子模型的潜力。此外,混合方法具有竞争力的培训效率,使他们成为资源受限环境的合适人选。该框架的低计算开销和紧凑的参数足迹建议在智能城市监控系统,分布式节点执行本地化的犯罪分析,以最小的通信成本的无线传感器网络部署的潜在优势。我们的研究结果为结构化犯罪数据的量子增强机器学习提供了初步的经验评估,并激发了更大数据集和现实量子硬件考虑的进一步调查。
摘要:Crime pattern analysis is critical for law enforcement and predictive policing, yet the surge in criminal activities from rapid urbanization creates high-dimensional, imbalanced datasets that challenge traditional classification methods. This study presents a quantum-classical comparison framework for crime analytics, evaluating four computational paradigms: quantum models, classical baseline machine learning models, and two hybrid quantum-classical architectures. Using 16-year Bangladesh crime statistics, we systematically assess classification performance and computational efficiency under rigorous cross-validation methods. Experimental results show that quantum-inspired approaches, particularly QAOA, achieve up to 84.6% accuracy, while requiring fewer trainable parameters than classical baselines, suggesting practical advantages for memory-constrained edge deployment. The proposed correlation-aware circuit design demonstrates the potential of incorporating domain-specific feature relationships into quantum models. Furthermore, hybrid approaches exhibit competitive training efficiency, making them suitable candidates for resource-constrained environments. The framework's low computational overhead and compact parameter footprint suggest potential advantages for wireless sensor network deployments in smart city surveillance systems, where distributed nodes perform localized crime analytics with minimal communication costs. Our findings provide a preliminary empirical assessment of quantum-enhanced machine learning for structured crime data and motivate further investigation with larger datasets and realistic quantum hardware considerations.
【3】Sparse $ε$ insensitive zone bounded asymmetric elastic net support vector machines for pattern classification
标题:用于模式分类的稀疏$ε$不敏感区有界非对称弹性网支持向量机
链接:https://arxiv.org/abs/2604.07748
作者:Haiyan Du,Hu Yang
摘要:现有的支持向量机(SVM)模型对噪声敏感,缺乏稀疏性,这限制了它们的性能。为了解决这些问题,我们将弹性网损失与一个鲁棒损失框架相结合,构建了一个稀疏的$\varepsilon$不敏感有界非对称弹性网损失,并将其与支持向量机结合,建立基于$\varepsilon$不敏感区有界非对称弹性网损失的支持向量机($\varepsilon$-BAEN-SVM)。$\varepsilon$-BAEN-SVM是稀疏和鲁棒的。稀疏性通过证明$\varepsilon$不敏感带内的样本不是支持向量而得到。由于影响函数是有界的,因此在理论上保证了鲁棒性。针对非凸优化问题,设计了一种基于裁剪对偶坐标下降的半二次算法。它将问题转化为一系列加权子问题,通过$\varepsilon$参数提高计算效率。在模拟数据集和真实数据集上的实验表明,$\varepsilon$-BAEN-SVM的性能优于传统的和现有的鲁棒支持向量机。它在噪声环境中很好地平衡了稀疏性和鲁棒性。统计检验证实了它的优越性。在高斯核函数下,该方法具有较好的精度和对噪声的不敏感性,验证了其有效性和实用价值。
摘要:Existing support vector machines(SVM) models are sensitive to noise and lack sparsity, which limits their performance. To address these issues, we combine the elastic net loss with a robust loss framework to construct a sparse $\varepsilon$-insensitive bounded asymmetric elastic net loss, and integrate it with SVM to build $\varepsilon$ Insensitive Zone Bounded Asymmetric Elastic Net Loss-based SVM($\varepsilon$-BAEN-SVM). $\varepsilon$-BAEN-SVM is both sparse and robust. Sparsity is proven by showing that samples inside the $\varepsilon$-insensitive band are not support vectors. Robustness is theoretically guaranteed because the influence function is bounded. To solve the non-convex optimization problem, we design a half-quadratic algorithm based on clipping dual coordinate descent. It transforms the problem into a series of weighted subproblems, improving computational efficiency via the $\varepsilon$ parameter. Experiments on simulated and real datasets show that $\varepsilon$-BAEN-SVM outperforms traditional and existing robust SVMs. It balances sparsity and robustness well in noisy environments. Statistical tests confirm its superiority. Under the Gaussian kernel, it achieves better accuracy and noise insensitivity, validating its effectiveness and practical value.
表征(2篇)
【1】What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal
标题:是什么驱动了表征转向?关于拒绝行为转向的机制性案例研究
链接:https://arxiv.org/abs/2604.08524
作者:Stephen Cheng,Sarah Wiegreffe,Dinesh Manocha
备注:9 pages + appendix, 7 figures
摘要:将导向向量应用于大型语言模型(LLM)是一种高效的模型对齐技术,但我们缺乏对其工作原理的可解释说明,特别是导向向量影响哪些内部机制,以及这如何导致不同的模型输出。为了研究导向向量有效性的因果机制,我们对拒绝行为进行了全面的案例研究。我们提出了一个多令牌激活修补框架,并发现在同一层应用时,不同的转向方法利用功能上可互换的电路。这些电路揭示了导向向量主要通过OV电路与注意力机制相互作用,而在很大程度上忽略了QK电路:在两个模型系列中,在转向过程中冻结所有注意力分数仅使性能下降8.75%。对转向后的OV电路进行数学分解,可以进一步揭示语义上可解释的概念,即使导向向量本身并不具备这种可解释性。利用激活修补的结果,我们表明,导向向量可以被稀疏化高达90-99%,同时保留大部分性能,并且不同的转向方法在一部分重要维度上是一致的。
摘要:Applying steering vectors to large language models (LLMs) is an efficient and effective model alignment technique, but we lack an interpretable explanation for how it works-- specifically, what internal mechanisms steering vectors affect and how this results in different model outputs. To investigate the causal mechanisms underlying the effectiveness of steering vectors, we conduct a comprehensive case study on refusal. We propose a multi-token activation patching framework and discover that different steering methodologies leverage functionally interchangeable circuits when applied at the same layer. These circuits reveal that steering vectors primarily interact with the attention mechanism through the OV circuit while largely ignoring the QK circuit-- freezing all attention scores during steering drops performance by only 8.75% across two model families. A mathematical decomposition of the steered OV circuit further reveals semantically interpretable concepts, even in cases where the steering vector itself does not. Leveraging the activation patching results, we show that steering vectors can be sparsified by up to 90-99% while retaining most performance, and that different steering methodologies agree on a subset of important dimensions.
【2】An Illusion of Unlearning? Assessing Machine Unlearning Through Internal Representations
标题:遗忘学习的幻觉?通过内部表征评估机器遗忘
链接:https://arxiv.org/abs/2604.08271
作者:Yichen Gao,Altay Unal,Akshay Rangamani,Zhihui Zhu
备注:9 pages main text, 21 pages total, 6 figures. Accepted at AISTATS 2026
摘要:While numerous machine unlearning (MU) methods have recently been developed with promising results in erasing the influence of forgotten data, classes, or concepts, they are also highly vulnerable-for example, simple fine-tuning can inadvertently reintroduce erased concepts. In this paper, we address this contradiction by examining the internal representations of unlearned models, in contrast to prior work that focuses primarily on output-level behavior. Our analysis shows that many state-of-the-art MU methods appear successful mainly due to a misalignment between last-layer features and the classifier, a phenomenon we call feature-classifier misalignment. In fact, hidden features remain highly discriminative, and simple linear probing can recover near-original accuracy. Assuming neural collapse in the original model, we further demonstrate that adjusting only the classifier can achieve negligible forget accuracy while preserving retain accuracy, and we corroborate this with experiments using classifier-only fine-tuning. Motivated by these findings, we propose MU methods based on a class-mean features (CMF) classifier, which explicitly enforces alignment between features and classifiers. Experiments on standard benchmarks show that CMF-based unlearning reduces forgotten information in representations while maintaining high retain accuracy, highlighting the need for faithful representation-level evaluation of MU.
3D|3D重建等相关(1篇)
【1】Generative 3D Gaussian Splatting for Arbitrary-Resolution Atmospheric Downscaling and Forecasting
标题:用于任意分辨率大气降尺度和预测的生成式3D高斯飞溅
链接:https://arxiv.org/abs/2604.07928
作者:Tao Hana,Zhibin Wen,Zhenghao Chen,Fenghua Lin,Junyu Gao,Song Guo,Lei Bai
备注:20 pages, 13 figures
摘要:While AI-based numerical weather prediction (NWP) enables rapid forecasting, generating high-resolution outputs remains computationally demanding due to limited multi-scale adaptability and inefficient data representations. We propose the 3D Gaussian splatting-based scale-aware vision transformer (GSSA-ViT), a novel framework for arbitrary-resolution forecasting and flexible downscaling of high-dimensional atmospheric fields. Specifically, latitude-longitude grid points are treated as centers of 3D Gaussians. A generative 3D Gaussian prediction scheme is introduced to estimate key parameters, including covariance, attributes, and opacity, for unseen samples, improving generalization and mitigating overfitting. In addition, a scale-aware attention module is designed to capture cross-scale dependencies, enabling the model to effectively integrate information across varying downscaling ratios and support continuous resolution adaptation. To our knowledge, this is the first NWP approach that combines generative 3D Gaussian modeling with scale-aware attention for unified multi-scale prediction. Experiments on ERA5 show that the proposed method accurately forecasts 87 atmospheric variables at arbitrary resolutions, while evaluations on ERA5 and CMIP6 demonstrate its superior performance in downscaling tasks. The proposed framework provides an efficient and scalable solution for high-resolution, multi-scale atmospheric prediction and downscaling. Code is available at: https://github.com/binbin2xs/weather-GS.
编码器(1篇)
【1】Kuramoto Oscillatory Phase Encoding: Neuro-inspired Synchronization for Improved Learning Efficiency
标题:Kuramoto振荡相编码:神经启发的同步提高学习效率
链接:https://arxiv.org/abs/2604.07904
作者:Mingqing Xiao,Yansen Wang,Dongqi Han,Caihua Shan,Dongsheng Li
摘要:Spatiotemporal neural dynamics and oscillatory synchronization are widely implicated in biological information processing and have been hypothesized to support flexible coordination such as feature binding. By contrast, most deep learning architectures represent and propagate information through activation values, neglecting the joint dynamics of rate and phase. In this work, we introduce Kuramoto oscillatory Phase Encoding (KoPE) as an additional, evolving phase state to Vision Transformers, incorporating a neuro-inspired synchronization mechanism to advance learning efficiency. We show that KoPE can improve training, parameter, and data efficiency of vision models through synchronization-enhanced structure learning. Moreover, KoPE benefits tasks requiring structured understanding, including semantic and panoptic segmentation, representation alignment with language, and few-shot abstract visual reasoning (ARC-AGI). Theoretical analysis and empirical verification further suggest that KoPE can accelerate attention concentration for learning efficiency. These results indicate that synchronization can serve as a scalable, neuro-inspired mechanism for advancing state-of-the-art neural network models.
优化|敛散性(7篇)
【1】Optimal Decay Spectra for Linear Recurrences
标题:线性递归的最优衰减谱
链接:https://arxiv.org/abs/2604.07658
作者:Yang Cao
摘要:Linear recurrent models offer linear-time sequence processing but often suffer from suboptimal long-range memory. We trace this to the decay spectrum: for $N$ channels, random initialization collapses the minimum spectral gap to $O(N^{-2})$, yielding sub-exponential error $\exp(-Ω(N/\log N))$; linear spacing avoids collapse but degrades to $\exp(-O(N/\sqrt{T}))$, practically algebraic over long contexts. We introduce Position-Adaptive Spectral Tapering (PoST), an architecture-agnostic framework combining two mechanisms: (1) Spectral Reparameterization, which structurally enforces geometrically spaced log-decay rates, proven minimax optimal at rate $O(\exp(-cN/\log T))$; and (2) Position-Adaptive Scaling, the provably unique mechanism that eliminates the scale mismatch of static spectra (where only $N\log t/\log T$ of $N$ channels are effective at position $t$) by stretching the spectrum to the actual dependency range, sharpening the rate to $O(\exp(-cN/\log t))$. This scaling natively induces fractional invariance: the impulse response becomes scale-free, with channels interpolating between relative and absolute temporal coordinates. PoST integrates into any diagonal linear recurrence without overhead. We instantiate it across Mamba-2, RWKV-7, Gated DeltaNet, Gated Linear Attention, and RetNet. Pre-training at 180M-440M scales shows consistent zero-shot language modeling improvements, significant long-context retrieval gains for Mamba-2 (MQAR and NIAH), and competitive or improved performance across other architectures. Code: https://github.com/SiLifen/PoST.
【2】Regret-Aware Policy Optimization: Environment-Level Memory for Replay Suppression under Delayed Harm
标题:后悔意识政策优化:延迟伤害下重播抑制的环境级记忆
链接:https://arxiv.org/abs/2604.07428
作者:Prakul Sunil Hiremath
备注:18 pages, 3 figures. Includes theoretical analysis and experiments on graph diffusion environments
摘要:Safety in reinforcement learning (RL) is typically enforced through objective shaping while keeping environment dynamics stationary with respect to observable state-action pairs. Under delayed harm, this can lead to replay: after a washout period, reintroducing the same stimulus under matched observable conditions reproduces a similar harmful cascade. We introduce the Replay Suppression Diagnostic (RSD), a controlled exposure-decay-replay protocol that isolates this failure mode under frozen-policy evaluation. We show that, under stationary observable transition kernels, replay cannot be structurally suppressed without inducing a persistent shift in replay-time action distributions. Motivated by platform-mediated systems, we propose Regret-Aware Policy Optimization (RAPO), which augments the environment with persistent harm-trace and scar fields and applies a bounded, mass-preserving transition reweighting to reduce reachability of historically harmful regions. On graph diffusion tasks (50-1000 nodes), RAPO suppresses replay, reducing re-amplification gain (RAG) from 0.98 to 0.33 on 250-node graphs while retaining 82% of task return. Disabling transition deformation only during replay restores re-amplification (RAG 0.91), isolating environment-level deformation as the causal mechanism.
【3】Bayesian Optimization for Mixed-Variable Problems in the Natural Sciences
标题:自然科学中混合变量问题的Bayesian优化
链接:https://arxiv.org/abs/2604.07416
作者:Yuhao Zhang,Ti John,Matthias Stosiek,Patrick Rinke
摘要:Optimizing expensive black-box objectives over mixed search spaces is a common challenge across the natural sciences. Bayesian optimization (BO) offers sample-efficient strategies through probabilistic surrogate models and acquisition functions. However, its effectiveness diminishes in mixed or high-cardinality discrete spaces, where gradients are unavailable and optimizing the acquisition function becomes computationally demanding. In this work, we generalize the probabilistic reparameterization (PR) approach of Daulton et al. to handle non-equidistant discrete variables, enabling gradient-based optimization in fully mixed-variable settings with Gaussian process (GP) surrogates. With real-world scientific optimization tasks in mind, we conduct systematic benchmarks on synthetic and experimental objectives to obtain optimized kernel formulations and demonstrate the robustness of our generalized PR method. We additionally show that, when combined with a modified BO workflow, our approach can efficiently optimize highly discontinuous and discretized objective landscapes. This work establishes a practical BO framework for addressing fully mixed optimization problems in the natural sciences, and is particularly well suited to autonomous laboratory settings where noise, discretization, and limited data are inherent.
【4】Conservation Law Breaking at the Edge of Stability: A Spectral Theory of Non-Convex Neural Network Optimization
标题:稳定性边缘的守恒律破缺:非凸神经网络优化的谱理论
链接:https://arxiv.org/abs/2604.07405
作者:Daniel Nobrega Medeiros
备注:13 pages, 4 figures, 1 table, 23 experiments. Code available at https://github.com/danielxmed/TheLocalMinimumParadox
摘要:Why does gradient descent reliably find good solutions in non-convex neural network optimization, despite the landscape being NP-hard in the worst case? We show that gradient flow on L-layer ReLU networks without bias preserves L-1 conservation laws C_l = ||W_{l+1}||_F^2 - ||W_l||_F^2, confining trajectories to lower-dimensional manifolds. Under discrete gradient descent, these laws break with total drift scaling as eta^alpha where alpha is approximately 1.1-1.6 depending on architecture, loss function, and width. We decompose this drift exactly as eta^2 * S(eta), where the gradient imbalance sum S(eta) admits a closed-form spectral crossover formula with mode coefficients c_k proportional to e_k(0)^2 * lambda_{x,k}^2, derived from first principles and validated for both linear (R=0.85) and ReLU (R>0.80) networks. For cross-entropy loss, softmax probability concentration drives exponential Hessian spectral compression with timescale tau = Theta(1/eta) independent of training set size, explaining why cross-entropy self-regularizes the drift exponent near alpha=1.0. We identify two dynamical regimes separated by a width-dependent transition: a perturbative sub-Edge-of-Stability regime where the spectral formula applies, and a non-perturbative regime with extensive mode coupling. All predictions are validated across 23 experiments.
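To make the conservation-law statement concrete, here is a minimal sketch for the simplest possible case: a two-layer scalar linear network f(x) = w2 * w1 * x trained with squared loss on a single example. This toy setup, the function names, and the numeric values are assumptions chosen for illustration and are not taken from the paper's experiments. For this model, one gradient-descent step changes C = w2^2 - w1^2 to exactly C * (1 - (eta * e * x)^2), where e is the residual, so C is conserved in the limit eta -> 0 and each discrete step perturbs it at second order in eta.

```python
def gd_step(w1, w2, x, y, eta):
    """One gradient-descent step on L = 0.5 * (w2 * w1 * x - y) ** 2."""
    e = w2 * w1 * x - y            # residual before the step
    g1 = e * w2 * x                # dL/dw1
    g2 = e * w1 * x                # dL/dw2
    return w1 - eta * g1, w2 - eta * g2, e

def conserved(w1, w2):
    """C = w2^2 - w1^2, the scalar analogue of ||W_2||_F^2 - ||W_1||_F^2."""
    return w2 ** 2 - w1 ** 2

def total_drift(eta, steps, w1=0.5, w2=1.5, x=1.0, y=2.0):
    """Absolute change in C after a fixed number of gradient steps."""
    c0 = conserved(w1, w2)
    for _ in range(steps):
        w1, w2, _ = gd_step(w1, w2, x, y, eta)
    return abs(conserved(w1, w2) - c0)

# Check the one-step identity C_new = C_old * (1 - (eta * e * x)^2).
w1, w2, x, y, eta = 0.5, 1.5, 1.0, 2.0, 0.1
c_old = conserved(w1, w2)
w1_new, w2_new, e = gd_step(w1, w2, x, y, eta)
one_step_gap = abs(conserved(w1_new, w2_new) - c_old * (1 - (eta * e * x) ** 2))

# Compare the accumulated drift for two step sizes at equal eta * steps.
drift_coarse = total_drift(eta=0.1, steps=50)
drift_fine = total_drift(eta=0.05, steps=100)
```

In this toy run the smaller step size accumulates less drift over the same training time. How the total drift scales with eta over a whole trajectory depends on how the residual evolves, which is the part the abstract attributes to the gradient imbalance sum S(eta).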
【5】SCOT: Multi-Source Cross-City Transfer with Optimal-Transport Soft-Correspondence Objective
标题:SCOT:具有最优传输软对应目标的多源跨城市迁移
链接:https://arxiv.org/abs/2604.07383
作者:Yuyao Wang,Min Yang,Meng Chen,Weiming Huang,Yongshun Gong
备注:29 pages, 22 figures, 19 tables
摘要:Cross-city transfer improves prediction in label-scarce cities by leveraging labeled data from other cities, but it becomes challenging when cities adopt incompatible partitions and no ground-truth region correspondences exist. Existing approaches either rely on heuristic region matching, which is often sensitive to anchor choices, or perform distribution-level alignment that leaves correspondences implicit and can be unstable under strong heterogeneity. We propose SCOT, a cross-city representation learning framework that learns explicit soft correspondences between unequal region sets via Sinkhorn-based entropic optimal transport. SCOT further sharpens transferable structure with an OT-weighted contrastive objective and stabilizes optimization through a cycle-style reconstruction regularizer. For multi-source transfer, SCOT aligns each source and the target to a shared prototype hub using balanced entropic transport guided by a target-induced prototype prior. Across real-world cities and tasks, SCOT consistently improves transfer accuracy and robustness, while the learned transport couplings and hub assignments provide interpretable diagnostics of alignment quality.
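The abstract's "soft correspondences between unequal region sets via Sinkhorn-based entropic optimal transport" can be illustrated with the standard Sinkhorn iteration. The sketch below is a generic, pure-Python version under stated assumptions: the function name `sinkhorn`, the regularization `eps`, the one-dimensional "embeddings", and the uniform masses are all hypothetical, and SCOT's contrastive objective, reconstruction regularizer, and prototype hub are not shown.

```python
import math

def sinkhorn(cost, a, b, eps=1.0, n_iters=200):
    """Entropic OT between histograms a (length n) and b (length m).
    Returns an n x m coupling whose rows sum to about a and columns to b."""
    n, m = len(a), len(b)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy example: 3 regions in a "source city" and 2 in a "target city",
# described by made-up 1-D embeddings.
src, tgt = [0.0, 1.0, 2.0], [0.5, 1.5]
cost = [[(s - t) ** 2 for t in tgt] for s in src]
a = [1 / 3] * 3          # uniform mass over source regions
b = [0.5, 0.5]           # uniform mass over target regions
plan = sinkhorn(cost, a, b)
row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(3)) for j in range(2)]
```

Each row of `plan` spreads a source region's mass over the target regions, so a region can correspond partially to several counterparts; this is what makes the matching "soft" and lets the two sets have different sizes.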
【6】Order-Optimal Sequential 1-Bit Mean Estimation in General Tail Regimes
标题:一般尾部机制中的序优顺序1位均值估计
链接:https://arxiv.org/abs/2604.07796
作者:Ivan Lau,Jonathan Scarlett
备注:arXiv admin note: substantial text overlap with arXiv:2509.21940
摘要:In this paper, we study the problem of mean estimation under strict 1-bit communication constraints. We propose a novel adaptive mean estimator based solely on randomized threshold queries, where each 1-bit outcome indicates whether a given sample exceeds a sequentially chosen threshold. Our estimator is $(ε, δ)$-PAC for any distribution with a bounded mean $μ\in [-λ, λ]$ and a bounded $k$-th central moment $\mathbb{E}[|X-μ|^k] \le σ^k$ for any fixed $k > 1$. Crucially, our sample complexity is order-optimal in all such tail regimes, i.e., for every such $k$ value. For $k \neq 2$, our estimator's sample complexity matches the unquantized minimax lower bounds plus an unavoidable $O(\log(λ/σ))$ localization cost. For the finite-variance case ($k=2$), our estimator's sample complexity has an extra multiplicative $O(\log(σ/ε))$ penalty, and we establish a novel information-theoretic lower bound showing that this penalty is a fundamental limit of 1-bit quantization. We also establish a significant adaptivity gap: for both threshold queries and more general interval queries, the sample complexity of any non-adaptive estimator must scale linearly with the search space parameter $λ/σ$, rendering it vastly less sample efficient than our adaptive approach. Finally, we present algorithmic variants that (i) handle an unknown sampling budget, (ii) adapt to an unknown scale parameter $σ$ given (possibly loose) bounds, and (iii) require only two stages of adaptivity at the expense of more complicated general 1-bit queries.
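For intuition about what a single randomized threshold query reveals, the sketch below shows a textbook non-adaptive baseline, not the paper's adaptive estimator. It assumes the samples themselves lie in [-lam, lam], which is stronger than the moment condition in the abstract. If the threshold T is uniform on [-lam, lam] and independent of X, then P(X > T | X) = (X + lam) / (2 * lam), so 2 * lam * bit - lam is an unbiased one-bit estimate of the mean. The function name and the test distribution are hypothetical.

```python
import random

def one_bit_mean(samples, lam, rng):
    """Estimate the mean from one bit per sample: bit = 1 if sample > T,
    with T drawn uniformly from [-lam, lam]. Assumes |sample| <= lam."""
    ones = 0
    for x in samples:
        threshold = rng.uniform(-lam, lam)   # fresh random threshold per query
        ones += 1 if x > threshold else 0
    frac = ones / len(samples)
    return 2 * lam * frac - lam              # undo the affine map (x + lam) / (2 * lam)

rng = random.Random(0)
lam = 1.0
data = [rng.uniform(0.2, 0.6) for _ in range(200_000)]  # true mean is 0.4
estimate = one_bit_mean(data, lam, rng)
```

Each term of this baseline has variance on the order of lam^2 regardless of how concentrated the data are, which gives a feel for why schemes that do not adapt their thresholds pay for a wide search range.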
【7】Generative optimal transport via forward-backward HJB matching
标题:通过前向-后向HJB匹配生成最佳运输
链接:https://arxiv.org/abs/2604.07762
作者:Haiqian Yang,Vishaal Krishnan,Sumit Sinha,L. Mahadevan
备注:16 pages, 4 figures
摘要:Controlling the evolution of a many-body stochastic system from a disordered reference state to a structured target ensemble, characterized empirically through samples, arises naturally in non-equilibrium statistical mechanics and stochastic control. The natural relaxation of such a system - driven by diffusion - runs from the structured target toward the disordered reference. The natural question is then: what is the minimum-work stochastic process that reverses this relaxation, given a pathwise cost functional combining spatial penalties and control effort? Computing this optimal process requires knowledge of trajectories that already sample the target ensemble - precisely the object one is trying to construct. We resolve this by establishing a time-reversal duality: the value function governing the hard backward dynamics satisfies an equivalent forward-in-time HJB equation, whose solution can be read off directly from the tractable forward relaxation trajectories. Via the Cole-Hopf transformation and its associated Feynman-Kac representation, this forward potential is computed as a path-space free energy averaged over these forward trajectories - the same relaxation paths that are easy to simulate - without any backward simulation or knowledge of the target beyond samples. The resulting framework provides a physically interpretable description of stochastic transport in terms of path-space free energy, risk-sensitive control, and spatial cost geometry. We illustrate the theory with numerical examples that visualize the learned value function and the induced controlled diffusions, demonstrating how spatial cost fields shape transport geometry analogously to Fermat's Principle in inhomogeneous media. Our results establish a unifying connection between stochastic optimal control, Schrödinger bridge theory, and non-equilibrium statistical mechanics.
预测|估计(8篇)
【1】A Machine Learning Framework for Turbofan Health Estimation via Inverse Problem Formulation
标题:通过反问题公式进行Turbofan健康状况估计的机器学习框架
链接:https://arxiv.org/abs/2604.08460
作者:Milad Leyli-Abadi,Lucas Thil,Sebastien Razakarivony,Guillaume Doquet,Jesse Read
备注:Submitted at ECML PKDD 2026
摘要:Estimating the health state of turbofan engines is a challenging ill-posed inverse problem, hindered by sparse sensing and complex nonlinear thermodynamics. Research in this area remains fragmented, with comparisons limited by the use of unrealistic datasets and insufficient exploration of the exploitation of temporal information. This work investigates how to recover component-level health indicators from operational sensor data under realistic degradation and maintenance patterns. To support this study, we introduce a new dataset that incorporates industry-oriented complexities such as maintenance events and usage changes. Using this dataset, we establish an initial benchmark that compares steady-state and nonstationary data-driven models, and Bayesian filters, classic families of methods used to solve this problem. In addition to this benchmark, we introduce self-supervised learning (SSL) approaches that learn latent representations without access to true health labels, a scenario reflective of real-world operational constraints. By comparing the downstream estimation performance of these unsupervised representations against the direct prediction baselines, we establish a practical lower bound on the difficulty of solving this inverse problem. Our results reveal that traditional filters remain strong baselines, while SSL methods reveal the intrinsic complexity of health estimation and highlight the need for more advanced and interpretable inference strategies. For reproducibility, both the generated dataset and the implementation used in this work are made accessible.
【2】Robust Length Prediction: A Perspective from Heavy-Tailed Prompt-Conditioned Distributions
标题:鲁棒长度预测:重尾条件分布的观点
链接:https://arxiv.org/abs/2604.07931
作者:Jing Wang,Yu-Yang Qian,Ke Xue,Chao Qian,Peng Zhao,Zhi-Hua Zhou
摘要:Output-length prediction is important for efficient LLM serving, as it directly affects batching, memory reservation, and scheduling. For prompt-only length prediction, most existing methods use a one-shot sampled length as the label, implicitly treating each prompt as if it had one true target length. We show that this is unreliable: even under a fixed model and decoding setup, the same prompt induces a prompt-conditioned output length distribution, not a deterministic scalar, and this distribution is consistent with heavy-tailed behavior. Motivated by this, we cast length prediction as robust estimation from heavy-tailed prompt-conditioned length distributions. We propose prompt-conditioned length distribution (ProD) methods, which construct training targets from multiple independent generations of the same prompt. Two variants are developed to reuse the served LLM's hidden states: ProD-M, which uses a median-based target for robust point prediction, and ProD-D, which uses a distributional target that preserves prompt-conditioned uncertainty. We provide theoretical justifications by analyzing the estimation error under a surrogate model. Experiments across diverse scenarios show consistent gains in prediction quality.
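The contrast between a one-shot label and a median-based target can be seen on a tiny example. The sketch below only illustrates the general idea of building a label from several generations of the same prompt; the helper name `length_targets` and the token counts are made up, and the way ProD-M actually constructs its targets and uses hidden states is not specified in the abstract.

```python
import statistics

def length_targets(sampled_lengths):
    """Summarize repeated generations of one prompt into candidate labels."""
    return {
        "single_sample": sampled_lengths[0],            # one-shot label
        "mean": statistics.mean(sampled_lengths),       # pulled up by the tail
        "median": statistics.median(sampled_lengths),   # robust point target
    }

# Made-up output lengths (in tokens) from five generations of one prompt;
# the last one is a rare, very long response.
lengths = [120, 130, 125, 140, 4000]
targets = length_targets(lengths)
```

Here the median stays at 130 tokens while the mean is dragged to 903 by a single long generation, which is the kind of sensitivity a heavy-tailed length distribution produces.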
【3】Information-Theoretic Requirements for Gradient-Based Task Affinity Estimation in Multi-Task Learning
标题:多任务学习中基于梯度的任务亲和力估计的信息论要求
链接:https://arxiv.org/abs/2604.07848
作者:Jasper Zhang,Bryan Cheng
备注:8 pages, 4 figures. Accepted at workshop on AI for Accelerated Materials Design, Foundation Models for Science: Real-World Impact and Science-First Design, and Generative and Experimental Perspectives for Biomolecular Design at ICLR 2026
摘要:Multi-task learning shows strikingly inconsistent results -- sometimes joint training helps substantially, sometimes it actively harms performance -- yet the field lacks a principled framework for predicting these outcomes. We identify a fundamental but unstated assumption underlying gradient-based task analysis: tasks must share training instances for gradient conflicts to reveal genuine relationships. When tasks are measured on the same inputs, gradient alignment reflects shared mechanistic structure; when measured on disjoint inputs, any apparent signal conflates task relationships with distributional shift. We discover this sample overlap requirement exhibits a sharp phase transition: below 30% overlap, gradient-task correlations are statistically indistinguishable from noise; above 40%, they reliably recover known biological structure. Comprehensive validation across multiple datasets achieves strong correlations and recovers biological pathway organization. Standard benchmarks systematically violate this requirement -- MoleculeNet operates at <5% overlap, TDC at 8-14% -- far below the threshold where gradient analysis becomes meaningful. This provides the first principled explanation for seven years of inconsistent MTL results.
【4】Auto-Configured Networks for Multi-Scale Multi-Output Time-Series Forecasting
标题:用于多尺度多输出时间序列预测的自动配置网络
链接:https://arxiv.org/abs/2604.07610
作者:Yumeng Zha,Shengxiang Yang,Xianpeng Wang
摘要:Industrial forecasting often involves multi-source asynchronous signals and multi-output targets, while deployment requires explicit trade-offs between prediction error and model complexity. Current practices typically fix alignment strategies or network designs, making it difficult to systematically co-design preprocessing, architecture, and hyperparameters in budget-limited training-based evaluations. To address this issue, we propose an auto-configuration framework that outputs a deployable Pareto set of forecasting models balancing error and complexity. At the model level, a Multi-Scale Bi-Branch Convolutional Neural Network (MS-BCNN) is developed, where short- and long-kernel branches capture local fluctuations and long-term trends, respectively, for multi-output regression. At the search level, we unify alignment operators, architectural choices, and training hyperparameters into a hierarchical-conditional mixed configuration space, and apply Player-based Hybrid Multi-Objective Evolutionary Algorithm (PHMOEA) to approximate the error-complexity Pareto frontier within a limited computational budget. Experiments on hierarchical synthetic benchmarks and a real-world sintering dataset demonstrate that our framework outperforms competitive baselines under the same budget and offers flexible deployment choices.
【5】DSPR: Dual-Stream Physics-Residual Networks for Trustworthy Industrial Time Series Forecasting
标题:DSPR:用于值得信赖的工业时间序列预测的双流物理-残差网络
链接:https://arxiv.org/abs/2604.07393
作者:Yeran Zhang,Pengwei Yang,Guoqing Wang,Tianyu Li
备注:12 pages, 7 figures, submitted to KDD 2026
摘要:Accurate forecasting of industrial time series requires balancing predictive accuracy with physical plausibility under non-stationary operating conditions. Existing data-driven models often achieve strong statistical performance but struggle to respect regime-dependent interaction structures and transport delays inherent in real-world systems. To address this challenge, we propose DSPR (Dual-Stream Physics-Residual Networks), a forecasting framework that explicitly decouples stable temporal patterns from regime-dependent residual dynamics. The first stream models the statistical temporal evolution of individual variables. The second stream focuses on residual dynamics through two key mechanisms: an Adaptive Window module that estimates flow-dependent transport delays, and a Physics-Guided Dynamic Graph that incorporates physical priors to learn time-varying interaction structures while suppressing spurious correlations. Experiments on four industrial benchmarks spanning heterogeneous regimes demonstrate that DSPR consistently improves forecasting accuracy and robustness under regime shifts while maintaining strong physical plausibility. It achieves state-of-the-art predictive performance, with Mean Conservation Accuracy exceeding 99% and Total Variation Ratio reaching up to 97.2%. Beyond forecasting, the learned interaction structures and adaptive lags provide interpretable insights that are consistent with known domain mechanisms, such as flow-dependent transport delays and wind-to-power scaling behaviors. These results suggest that architectural decoupling with physics-consistent inductive biases offers an effective path toward trustworthy industrial time-series forecasting. Furthermore, DSPR's demonstrated robust performance in long-term industrial deployment bridges the gap between advanced forecasting models and trustworthy autonomous control systems.
【6】Prediction Arena: Benchmarking AI Models on Real-World Prediction Markets
标题:预测竞技场:在现实世界预测市场上对人工智能模型进行基准测试
链接:https://arxiv.org/abs/2604.07355
作者:Jaden Zhang,Gardenia Liu,Oliver Johansson,Hileamlak Yitayew,Kamryn Ohly,Grace Li
备注:18 pages, 10 figures, 3 tables. Evaluation period: January 12 - March 9, 2026
摘要:We introduce Prediction Arena, a benchmark for evaluating AI models' predictive accuracy and decision-making by enabling them to trade autonomously on live prediction markets with real capital. Unlike synthetic benchmarks, Prediction Arena tests models in environments where trades execute on actual exchanges (Kalshi and Polymarket), providing objective ground truth that cannot be gamed or overfitted. Each model operates as an independent agent starting with $10,000, making autonomous decisions every 15-45 minutes. Over a 57-day longitudinal evaluation (January 12 to March 9, 2026), we track two cohorts: six frontier models in live trading (Cohort 1, full period) and four next-generation models in paper trading (Cohort 2, 3-day preliminary). For Cohort 1, final Kalshi returns range from -16.0% to -30.8%. Our analysis identifies a clear performance hierarchy: initial prediction accuracy and the ability to capitalize on correct predictions are the main drivers, while research volume shows no correlation with outcomes. A striking cross-platform contrast emerges from parallel Polymarket live trading: Cohort 1 models averaged only -1.1% on Polymarket vs. -22.6% on Kalshi, with grok-4-20-checkpoint achieving a 71.4% settlement win rate - the highest across any platform or cohort. gemini-3.1-pro-preview (Cohort 2), which executed zero trades on Kalshi, achieved +6.02% on Polymarket in 3 days - the best return of any model across either cohort - demonstrating that platform design has a profound effect on which models succeed. Beyond performance, we analyze computational efficiency (token usage, cycle time), settlement accuracy, exit patterns, and market preferences, providing a comprehensive view of how frontier models behave under real financial pressure.
【7】Variational Approximated Restricted Maximum Likelihood Estimation for Spatial Data
标题:空间数据的变分逼近限制性最大似然估计
链接:https://arxiv.org/abs/2604.07635
作者:Debjoy Thakur
摘要:This research considers a scalable inference for spatial data modeled through Gaussian intrinsic conditional autoregressive (ICAR) structures. The classical estimation method, restricted maximum likelihood (REML), requires repeated inversion and factorization of large, sparse precision matrices, which makes this computation costly. To sort this problem out, we propose a variational restricted maximum likelihood (VREML) framework that approximates the intractable marginal likelihood using a Gaussian variational distribution. By constructing an evidence lower bound (ELBO) on the restricted likelihood, we derive a computationally efficient coordinate-ascent algorithm for jointly estimating the spatial random effects and variance components. In this article, we theoretically establish the monotone convergence of ELBO and mathematically exhibit that the variational family is exact under Gaussian ICAR settings, which is an indication of nullifying approximation error at the posterior level. We empirically establish the supremacy of our VREML over MLE and INLA.
【8】Predicting Activity Cliffs for Autonomous Medicinal Chemistry
标题:面向自主药物化学的活性悬崖预测
链接:https://arxiv.org/abs/2604.07560
作者:Michael Cuccarese
备注:8 pages, 4 figures github: https://github.com/mcuccarese/Activity-cliff-prediction webapp: https://activity-cliffs-5gnirhr3k3ybhwhz7de7ua.streamlit.app/
摘要:Activity cliff prediction - identifying positions where small structural changes cause large potency shifts - has been a persistent challenge in computational medicinal chemistry. This work focuses on a parsimonious definition: which small modifications, at which positions, confer the highest probability of an outcome change. Position-level sensitivity is calculated using 25 million matched molecular pairs from 50 ChEMBL targets across six protein families, revealing that two questions have fundamentally different answers. "Which positions vary most?" is answered by scaffold size alone (NDCG@3 = 0.966), requiring no machine learning. "Which are true activity cliffs?" - where small modifications cause disproportionately large effects, as captured by SALI normalization - requires an 11-feature model with 3D pharmacophore context (NDCG@3 = 0.910 vs. 0.839 random), generalizing across all six protein families, novel scaffolds (0.913), and temporal splits (0.878). The model identifies the cliff-prone position first 53% of the time (vs. 27% random - 2x lift), reducing positions a chemist must explore from 3.1 to 2.1 - a 31% reduction in first-round experiments. Predicting which modification to make is not tractable from structure alone (Spearman 0.268, collapsing to -0.31 on novel scaffolds). The system is released as open-source code and an interactive webapp.
其他神经网络|深度学习|模型|建模(18篇)
【1】Persistence-Augmented Neural Networks
标题:持久性增强神经网络
链接:https://arxiv.org/abs/2604.08469
作者:Elena Xinyi Wang,Arnur Nigmetov,Dmitriy Morozov
摘要:Topological Data Analysis (TDA) provides tools to describe the shape of data, but integrating topological features into deep learning pipelines remains challenging, especially when preserving local geometric structure rather than summarizing it globally. We propose a persistence-based data augmentation framework that encodes local gradient flow regions and their hierarchical evolution using the Morse-Smale complex. This representation, compatible with both convolutional and graph neural networks, retains spatially localized topological information across multiple scales. Importantly, the augmentation procedure itself is efficient, with computational complexity $O(n \log n)$, making it practical for large datasets. We evaluate our method on histopathology image classification and 3D porous material regression, where it consistently outperforms baselines and global TDA descriptors such as persistence images and landscapes. We also show that pruning the base level of the hierarchy reduces memory usage while maintaining competitive performance. These results highlight the potential of local, structured topological augmentation for scalable and interpretable learning across data modalities.
【2】Leveraging Complementary Embeddings for Replay Selection in Continual Learning with Small Buffers
标题:利用互补嵌入进行小缓冲区连续学习中的重放选择
链接:https://arxiv.org/abs/2604.08336
作者:Danit Yanowsky,Daphna Weinshall
摘要:Catastrophic forgetting remains a key challenge in Continual Learning (CL). In replay-based CL with severe memory constraints, performance critically depends on the sample selection strategy for the replay buffer. Most existing approaches construct memory buffers using embeddings learned under supervised objectives. However, class-agnostic, self-supervised representations often encode rich, class-relevant semantics that are overlooked. We propose a new method, Multiple Embedding Replay Selection, MERS, which replaces the buffer selection module with a graph-based approach that integrates both supervised and self-supervised embeddings. Empirical results show consistent improvements over SOTA selection strategies across a range of continual learning algorithms, with particularly strong gains in low-memory regimes. On CIFAR-100 and TinyImageNet, MERS outperforms single-embedding baselines without adding model parameters or increasing replay volume, making it a practical, drop-in enhancement for replay-based continual learning.
【3】Introducing Echo Networks for Computational Neuroevolution
标题:引入Echo网络用于计算神经进化
链接:https://arxiv.org/abs/2604.08204
作者:Christian Kroos,Fabian Küch
备注:Accepted for AMLDS 2026 (International Conference on Advanced Machine Learning and Data Science)
摘要:For applications on the extreme edge, minimal networks of only a few dozen artificial neurons for event detection and classification in discrete time signals would be highly desirable. Feed-forward networks, RNNs, and CNNs evolved through evolutionary algorithms can all be successful in this respect but pose the problem of allowing little systematicity in mutation and recombination if the standard direct genetic encoding of the weights is used (as for instance in the classic NEAT algorithm). We therefore introduce Echo Networks, a type of recurrent network that consists of the connection matrix only, with the source neurons of the synapses represented as rows, destination neurons as columns and weights as entries. There are no layers, and connections between neurons can be bidirectional but are technically all recurrent. Input and output can be arbitrarily assigned to any of the neurons and only use an additional (optional) function in their computational path, e.g., a sigmoid to obtain a binary classification output. We evaluated Echo Networks successfully on the classification of electrocardiography signals but see the most promising potential in their genome representation as a single matrix, allowing matrix computations and factorisations as mutation and recombination operators.
【4】The ecosystem of machine learning competitions: Platforms, participants, and their impact on AI development
标题:机器学习竞赛生态系统:平台、参与者及其对人工智能发展的影响
链接:https://arxiv.org/abs/2604.08001
作者:Ioannis Nasios
摘要:Machine learning competitions (MLCs) play a pivotal role in advancing artificial intelligence (AI) by fostering innovation, skill development, and practical problem-solving. This study provides a comprehensive analysis of major competition platforms such as Kaggle and Zindi, examining their workflows, evaluation methodologies, and reward structures. It further assesses competition quality, participant expertise, and global reach, with particular attention to demographic trends among top-performing competitors. By exploring the motivations of competition hosts, this paper underscores the significant role of MLCs in shaping AI development, promoting collaboration, and driving impactful technological progress. Furthermore, by combining literature synthesis with platform-level data analysis and practitioner insights a comprehensive understanding of the MLC ecosystem is provided. Moreover, the paper demonstrates that MLCs function at the intersection of academic research and industrial application, fostering the exchange of knowledge, data, and practical methodologies across domains. Their strong ties to open-source communities further promote collaboration, reproducibility, and continuous innovation within the broader ML ecosystem. By shaping research priorities, informing industry standards, and enabling large-scale crowdsourced problem-solving, these competitions play a key role in the ongoing evolution of AI. The study provides insights relevant to researchers, practitioners, and competition organizers, and includes an examination of the future trajectory and sustained influence of MLCs on AI development.
【5】Benchmarking Deep Learning for Future Liver Remnant Segmentation in Colorectal Liver Metastasis
标题:结直肠癌肝转移中未来残余肝分割的深度学习基准测试
链接:https://arxiv.org/abs/2604.07999
作者:Anthony T. Wu,Arghavan Rezvani,Kela Liu,Roozbeh Houshyar,Pooya Khosravi,Whitney Li,Xiaohui Xie
备注:Accepted at the 2026 International Symposium on Biomedical Imaging (ISBI) Oral 4-page paper presentation
摘要:Accurate segmentation of the future liver remnant (FLR) is critical for surgical planning in colorectal liver metastases (CRLM) to prevent fatal post-hepatectomy liver failure. However, this segmentation task is technically challenging due to complex resection boundaries, convoluted hepatic vasculature and diffuse metastatic lesions. A primary bottleneck in developing automated AI tools has been the lack of high-fidelity, validated data. We address this gap by manually refining all 197 volumes from the public CRLM-CT-Seg dataset, creating the first open-source, validated benchmark for this task. We then establish the first segmentation baselines, comparing cascaded (Liver->CRLM->FLR) and end-to-end (E2E) strategies using nnU-Net, SwinUNETR, and STU-Net. We find a cascaded nnU-Net achieves the best final FLR segmentation Dice (0.767), while the pretrained STU-Net provides superior CRLM segmentation (0.620 Dice) and is significantly more robust to cascaded errors. This work provides the first validated benchmark and a reproducible framework to accelerate research in AI-assisted surgical planning.
【6】Visual Perceptual to Conceptual First-Order Rule Learning Networks
标题:视觉感知到概念一阶规则学习网络
链接:https://arxiv.org/abs/2604.07897
作者:Kun Gao,Davide Soldà,Thomas Eiter,Katsumi Inoue
摘要:Learning rules plays a crucial role in deep learning, particularly in explainable artificial intelligence and enhancing the reasoning capabilities of large language models. While existing rule learning methods are primarily designed for symbolic data, learning rules from image data without supporting image labels and automatically inventing predicates remains a challenge. In this paper, we tackle these inductive rule learning problems from images with a framework called γILP, which provides a fully differentiable pipeline from image constant substitution to rule structure induction. Extensive experiments demonstrate that γILP achieves strong performance not only on classical symbolic relational datasets but also on relational image data and pure image datasets, such as Kandinsky patterns.
【7】Towards Rapid Constitutive Model Discovery from Multi-Modal Data: Physics Augmented Finite Element Model Updating (paFEMU)
标题:迈向从多模态数据中快速发现本构模型:物理增强有限元模型更新(paFEMU)
链接:https://arxiv.org/abs/2604.07746
作者:Jingye Tan,Govinda Anantha Padmanabha,Steven J. Yang,Nikolaos Bouklas
摘要:Recent progress in AI-enabled constitutive modeling has concentrated on moving from a purely data-driven paradigm to the enforcement of physical constraints and mechanistic principles, a concept referred to as physics augmentation. Classical phenomenological approaches rely on selecting a pre-defined model and calibrating its parameters, while machine learning methods often focus on discovery of the model itself. Sparse regression approaches lie in between, where large libraries of pre-defined models are probed during calibration. Sparsification in the aforementioned paradigm, but also in the context of neural network architecture, has been shown to enable interpretability, uncertainty quantification, but also heterogeneous software integration due to the low-dimensional nature of the resulting models. Most works in AI-enabled constitutive modeling have also focused on data from a single source, but in reality, materials modeling workflows can contain data from many different sources (multi-modal data), and also from testing other materials within the same materials class (multi-fidelity data). In this work, we introduce physics augmented finite element model updating (paFEMU), as a transfer learning approach that combines AI-enabled constitutive modeling, sparsification for interpretable model discovery, and finite element-based adjoint optimization utilizing multi-modal data. This is achieved by combining simple mechanical testing data, potentially from a distinct material, with digital image correlation-type full-field data acquisition to ultimately enable rapid constitutive modeling discovery. The simplicity of the sparse representation enables easy integration of neural constitutive models in existing finite element workflows, and also enables low-dimensional updating during transfer learning.
【8】CausalVAE as a Plug-in for World Models: Towards Reliable Counterfactual Dynamics
标题:CausalVAE作为世界模型的插件:迈向可靠的反事实动力学
链接:https://arxiv.org/abs/2604.07712
作者:Ziyi Ding,Xianxin Lai,Weiyu Chen,Xiao-Ping Zhang,Jiayu Chen
摘要:In this work, CausalVAE is introduced as a plug-in structural module for latent world models and is attached to diverse encoder-transition backbones. Across the reported benchmarks, competitive factual prediction is preserved and intervention-aware counterfactual retrieval is improved after the plug-in is added, suggesting stronger robustness under distribution shift and interventions. The largest gains are observed on the Physics benchmark: when averaged over 8 paired baselines, CF-H@1 is improved by +102.5%. In a representative GNN-NLL setting on Physics, CF-H@1 is increased from 11.0 to 41.0 (+272.7%). Through causal analysis, learned structural dependencies are shown to recover meaningful first-order physical interaction trends, supporting the interpretability of the learned latent causal structure.
【9】An Imperfect Verifier is Good Enough: Learning with Noisy Rewards
标题:不完美的验证器就足够好:利用含噪奖励进行学习
链接:https://arxiv.org/abs/2604.07666
作者:Andreas Plesner,Francisco Guzmán,Anish Athalye
摘要:Reinforcement Learning with Verifiable Rewards (RLVR) has become a prominent method for post-training Large Language Models (LLMs). However, verifiers are rarely error-free; even deterministic checks can be inaccurate, and the growing dependence on model-based judges exacerbates the issue. The extent to which RLVR is robust to such noise and the verifier accuracy required for effective training remain unresolved questions. We investigate these questions in the domains of code generation and scientific reasoning by introducing noise into RL training. Noise rates up to 15% yield peak validation accuracy within 2 percentage points of the clean baseline. These findings are consistent across controlled and model-based noise types, three model families (Qwen3, GLM4, Llama 3.1), and model sizes from 4B to 9B. Overall, the results indicate that imperfect verification does not constitute a fundamental barrier to RLVR. Furthermore, our findings suggest that practitioners should prioritize moderate accuracy with high precision over perfect verification.
【10】Implicit Regularization and Generalization in Overparameterized Neural Networks
标题:过参数化神经网络中的隐式正则化与泛化
链接:https://arxiv.org/abs/2604.07603
作者:Zeran Johannsen
备注:12 pages, 5 figures
摘要:Classical statistical learning theory predicts that overparameterized models should exhibit severe overfitting, yet modern deep neural networks with far more parameters than training samples consistently generalize well. This contradiction has become a central theoretical question in machine learning. This study investigates the role of optimization dynamics and implicit regularization in enabling generalization in overparameterized neural networks through controlled experiments. We examine stochastic gradient descent (SGD) across batch sizes, the geometry of flat versus sharp minima via Hessian eigenvalue estimation and weight perturbation analysis, the Neural Tangent Kernel (NTK) regime through wide-network experiments, double descent across model scales, and the Lottery Ticket Hypothesis through iterative magnitude pruning. All experiments use PyTorch on CIFAR-10 and MNIST with multiple random seeds. Our findings demonstrate that generalization is strongly influenced by the interaction between network architecture, optimization algorithms, and loss landscape geometry. Smaller batch sizes consistently produced lower test error and flatter minima, with an 11.8x difference in top Hessian eigenvalue between small-batch and large-batch solutions corresponding to 1.61 percentage points higher test accuracy. Sparse subnetworks retaining only 10% of parameters achieved within 1.15 percentage points of full model performance when retrained from their original initialization. These results highlight the need for revised learning-theoretic frameworks capable of explaining generalization in high-dimensional model regimes.
【11】Learning Markov Processes as Sum-of-Square Forms for Analytical Belief Propagation
标题:学习Markov过程作为分析信念传播的平方和形式
链接:https://arxiv.org/abs/2604.07525
作者:Peter Amorese,Morteza Lahijanian
备注:Twenty-Ninth Annual Conference on Artificial Intelligence and Statistics (AISTATS 2026)
摘要:Harnessing the predictive capability of Markov process models requires propagating probability density functions (beliefs) through the model. For many existing models however, belief propagation is analytically infeasible, requiring approximation or sampling to generate predictions. This paper proposes a functional modeling framework leveraging sparse Sum-of-Squares (SoS) forms for valid (conditional) density estimation. We study the theoretical restrictions of modeling conditional densities using the SoS form, and propose a novel functional form for addressing such limitations. The proposed architecture enables generalized simultaneous learning of basis functions and coefficients, while preserving analytical belief propagation. In addition, we propose a training method that allows for exact adherence to the normalization and non-negativity constraints. Our results show that the proposed method achieves accuracy comparable to state-of-the-art approaches while requiring significantly less memory in low-dimensional spaces, and it further scales to 12D systems when existing methods fail beyond 2D.
【12】ConsistRM: Improving Generative Reward Models via Consistency-Aware Self-Training
标题:ConsistRM:通过一致性感知的自我训练改进生成性奖励模型
链接:https://arxiv.org/abs/2604.07484
作者:Yu Liang,Liangxin Liu,Longzheng Wang,Yan Wang,Yueyang Zhang,Long Xia,Zhiyuan Sun,Daiting Shi
备注:Preprint
摘要:Generative reward models (GRMs) have emerged as a promising approach for aligning Large Language Models (LLMs) with human preferences by offering greater representational capacity and flexibility than traditional scalar reward models. However, GRMs face two major challenges: reliance on costly human-annotated data restricts scalability, and self-training approaches often suffer from instability and vulnerability to reward hacking. To address these issues, we propose ConsistRM, a self-training framework that enables effective and stable GRM training without human annotations. ConsistRM incorporates the Consistency-Aware Answer Reward, which produces reliable pseudo-labels with temporal consistency, thereby providing more stable model optimization. Moreover, the Consistency-Aware Critique Reward is introduced to assess semantic consistency across multiple critiques and allocates fine-grained and differentiated rewards. Experiments on five benchmark datasets across four base models demonstrate that ConsistRM outperforms vanilla Reinforcement Fine-Tuning (RFT) by an average of 1.5%. Further analysis shows that ConsistRM enhances output consistency and mitigates position bias caused by input order, highlighting the effectiveness of consistency-aware rewards in improving GRMs.
【13】Critical Patch-Aware Sparse Prompting with Decoupled Training for Continual Learning on the Edge
标题:面向边缘持续学习的关键图像块感知稀疏提示与解耦训练
链接:https://arxiv.org/abs/2604.07399
作者:Wonseon Lim,Jaesung Lee,Dae-Won Kim
备注:Accepted to CVPR 2026. 10 pages, 8 figures
摘要:Continual learning (CL) on edge devices requires not only high accuracy but also training-time efficiency to support on-device adaptation under strict memory and computational constraints. While prompt-based continual learning (PCL) is parameter-efficient and achieves competitive accuracy, prior work has focused mainly on accuracy or inference-time performance, often overlooking the memory and computational costs of on-device training. In this paper, we propose CPS-Prompt, a critical patch-aware sparse prompting framework that explicitly targets training-time memory usage and computational cost by integrating critical patch sampling (CPS) for task-aware token reduction and decoupled prompt and classifier training (DPCT) to reduce backpropagation overhead. Experiments on three public benchmarks and real edge hardware show that CPS-Prompt improves peak memory, training time, and energy efficiency by about 1.6x over the balanced CODA-Prompt baseline, while maintaining accuracy within 2% of the state-of-the-art C-Prompt on average and remaining competitive with CODA-Prompt in accuracy. The code is available at https://github.com/laymond1/cps-prompt.
【14】Event-Centric World Modeling with Memory-Augmented Retrieval for Embodied Decision-Making
标题:以事件为中心的世界建模与记忆增强检索,用于具身决策
链接:https://arxiv.org/abs/2604.07392
作者:Fan Zhaowen
备注:This is the initial version (v1) released to establish priority for the proposed framework. Subsequent versions will include expanded experimental validation and exhaustive hardware benchmarking
摘要:Autonomous agents operating in dynamic and safety-critical environments require decision-making frameworks that are both computationally efficient and physically grounded. However, many existing approaches rely on end-to-end learning, which often lacks interpretability and explicit mechanisms for ensuring consistency with physical constraints. In this work, we propose an event-centric world modeling framework with memory-augmented retrieval for embodied decision-making. The framework represents the environment as a structured set of semantic events, which are encoded into a permutation-invariant latent representation. Decision-making is performed via retrieval over a knowledge bank of prior experiences, where each entry associates an event representation with a corresponding maneuver. The final action is computed as a weighted combination of retrieved solutions, providing a transparent link between decision and stored experiences. The proposed design enables structured abstraction of dynamic environments and supports interpretable decision-making through case-based reasoning. In addition, incorporating physics-informed knowledge into the retrieval process encourages the selection of maneuvers that are consistent with observed system dynamics. Experimental evaluation in UAV flight scenarios demonstrates that the framework operates within real-time control constraints while maintaining interpretable and consistent behavior.
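Editor's illustration: the abstract above says the final action is a weighted combination of maneuvers retrieved from a knowledge bank, but it does not specify the similarity measure or the weighting rule. The sketch below is therefore only one plausible reading, not the paper's method: it assumes negative squared Euclidean distance as the similarity, top-k retrieval, and softmax weights. The names `retrieve_action`, `k`, and `temperature`, and the toy bank entries, are hypothetical.

```python
import math


def retrieve_action(query, bank, k=2, temperature=1.0):
    """Blend the maneuvers of the k bank entries closest to the query.

    bank is a list of (event_embedding, maneuver_vector) pairs.
    Returns (action, weights); the weights show which stored
    experiences contributed to the decision, and by how much.
    """
    def similarity(a, b):
        # Assumed similarity: negative squared Euclidean distance.
        return -sum((x - y) ** 2 for x, y in zip(a, b))

    # Keep the k most similar stored experiences.
    top = sorted(bank, key=lambda e: similarity(query, e[0]), reverse=True)[:k]

    # Softmax over similarities (shifted by the max for numerical stability).
    logits = [similarity(query, emb) / temperature for emb, _ in top]
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted combination of the retrieved maneuvers.
    dim = len(top[0][1])
    action = [
        sum(w * maneuver[d] for w, (_, maneuver) in zip(weights, top))
        for d in range(dim)
    ]
    return action, weights


# Toy knowledge bank: (event embedding, maneuver) pairs.
bank = [
    ([0.0, 0.0], [1.0, 0.0]),
    ([1.0, 0.0], [0.0, 1.0]),
    ([5.0, 5.0], [9.0, 9.0]),
]
action, weights = retrieve_action([0.5, 0.0], bank, k=2)
print(action, weights)
```

Returning the weights alongside the action keeps the link between a decision and the stored cases inspectable, which is the interpretability property the abstract emphasizes.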
【15】The Lifecycle of the Spectral Edge: From Gradient Learning to Weight-Decay Compression
标题:谱边缘的生命周期:从梯度学习到权重衰减压缩
链接:https://arxiv.org/abs/2604.07380
作者:Yongzhong Xu
备注:15 pages, 12 figures
摘要:We decompose the spectral edge -- the dominant direction of the Gram matrix of parameter updates -- into its gradient and weight-decay components during grokking in two sequence tasks (Dyck-1 and SCAN). We find a sharp two-phase lifecycle: before grokking the edge is gradient-driven and functionally active; at grokking, gradient and weight decay align, and the edge becomes a compression axis that is perturbation-flat yet ablation-critical (>4000x more impactful than random directions). Three universality classes emerge (functional, mixed, compression), predicted by the gap flow equation. Nonlinear probes show information is re-encoded, not lost (MLP $R^2=0.99$ where linear $R^2=0.86$), and removing weight decay post-grok reverses compression while preserving the algorithm.
【16】Prune-Quantize-Distill: An Ordered Pipeline for Efficient Neural Network Compression
标题:修剪-量化-蒸馏:高效神经网络压缩的有序管道
链接:https://arxiv.org/abs/2604.04988
作者:Longsheng Zhou,Yu Shen
备注:7 pages, submitted to IJCNN
摘要:Modern deployment often requires trading accuracy for efficiency under tight CPU and memory constraints, yet common compression proxies such as parameter count or FLOPs do not reliably predict wall-clock inference time. In particular, unstructured sparsity can reduce model storage while failing to accelerate (and sometimes slightly slowing down) standard CPU execution due to irregular memory access and sparse kernel overhead. Motivated by this gap between compression and acceleration, we study a practical, ordered pipeline that targets measured latency by combining three widely used techniques: unstructured pruning, INT8 quantization-aware training (QAT), and knowledge distillation (KD). Empirically, INT8 QAT provides the dominant runtime benefit, while pruning mainly acts as a capacity-reduction pre-conditioner that improves the robustness of subsequent low-precision optimization; KD, applied last, recovers accuracy within the already constrained sparse INT8 regime without changing the deployment form. We evaluate on CIFAR-10/100 using three backbones (ResNet-18, WRN-28-10, and VGG-16-BN). Across all settings, the ordered pipeline achieves a stronger accuracy-size-latency frontier than any single technique alone, reaching 0.99-1.42 ms CPU latency with competitive accuracy and compact checkpoints. Controlled ordering ablations with a fixed 20/40/40 epoch allocation further confirm that stage order is consequential, with the proposed ordering generally performing best among the tested permutations. Overall, our results provide a simple guideline for edge deployment: evaluate compression choices in the joint accuracy-size-latency space using measured runtime, rather than proxy metrics alone.
【17】Lecture notes on Machine Learning applications for global fits
标题:面向全局拟合的机器学习应用讲义
链接:https://arxiv.org/abs/2604.07520
作者:Jorge Alda
备注:Lecture notes for the 4th COMCHA School on Computing Challenges in Zaragoza (Spain), 8-15 April 2026. 24 pages, 10 figures, 14 code snippets, 1 appendix. Submission to SciPost Physics Lecture Notes
摘要:These lecture notes provide a comprehensive framework for performing global statistical fits in high-energy physics using modern Machine Learning (ML) surrogates. We begin by reviewing the statistical foundations of model building, including the likelihood function, Wilks' theorem, and profile likelihoods. Recognizing that the computational cost of evaluating model predictions often renders traditional minimization prohibitive, we introduce Boosted Decision Trees to approximate the log-likelihood function. The notes detail a robust ML workflow including efficient generation of training data with active learning and Gaussian processes, hyperparameter optimization, model compilation for speed-up, and interpretability through SHAP values to decode the influence of model parameters and interactions between parameters. We further discuss posterior distribution sampling using Markov Chain Monte Carlo (MCMC). These techniques are finally applied to the $B^\pm \to K^\pm ν\barν$ anomaly at Belle II, demonstrating how a two-stage ML model can efficiently explore the parameter space of Axion-Like Particles (ALPs) while satisfying stringent experimental constraints on decay lengths and flavor-violating couplings.
【18】Score Shocks: The Burgers Equation Structure of Diffusion Generative Models
标题:得分冲击:扩散生成模型的伯格斯方程结构
链接:https://arxiv.org/abs/2604.07404
作者:Krisanu Sarkar
备注:41 pages, 7 figures. Introduces a Burgers equation formulation of diffusion model score dynamics and a local binary-boundary theorem for speciation
摘要:We analyze the score field of a diffusion generative model through a Burgers-type evolution law. For VE diffusion, the heat-evolved data density implies that the score obeys viscous Burgers in one dimension and the corresponding irrotational vector Burgers system in $\mathbb{R}^d$, giving a PDE view of speciation transitions as the sharpening of inter-mode interfaces. For any binary decomposition of the noised density into two positive heat solutions, the score separates into a smooth background and a universal $\tanh$ interfacial term determined by the component log-ratio; near a regular binary mode boundary this yields a normal criterion for speciation. In symmetric binary Gaussian mixtures, the criterion recovers the critical diffusion time detected by the midpoint derivative of the score and agrees with the spectral criterion of Biroli, Bonnaire, de Bortoli, and Mézard (2024). After subtracting the background drift, the inter-mode layer has a local Burgers $\tanh$ profile, which becomes global in the symmetric Gaussian case with width $σ_τ^2/a$. We also quantify exponential amplification of score errors across this layer, show that Burgers dynamics preserves irrotationality, and use a change of variables to reduce the VP-SDE to the VE case, yielding a closed-form VP speciation time. Gaussian-mixture formulas are verified to machine precision, and the local theorem is checked numerically on a quartic double-well.
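The $\tanh$ structure mentioned in the abstract can be checked by hand in the simplest setting. The sketch below is not the paper's code: it assumes a one-dimensional, equal-weight mixture of two Gaussians with means $\pm a$ and common variance $σ^2$ (standing in for the noise level at a fixed diffusion time), for which the score is $-x/σ^2 + (a/σ^2)\tanh(ax/σ^2)$. The function names are hypothetical. The $\tanh$ argument $ax/σ^2$ shows where the interface width $σ^2/a$ comes from.

```python
import numpy as np

def mixture_score_numeric(x, a, sigma, h=1e-5):
    """Score d/dx log p(x) by central differences for the assumed
    mixture p = 0.5*N(a, sigma^2) + 0.5*N(-a, sigma^2)."""
    def log_p(z):
        # log-density up to an additive constant, computed stably
        return np.logaddexp(-(z - a) ** 2 / (2 * sigma ** 2),
                            -(z + a) ** 2 / (2 * sigma ** 2))
    return (log_p(x + h) - log_p(x - h)) / (2 * h)

def mixture_score_closed_form(x, a, sigma):
    """Smooth background -x/sigma^2 plus a tanh interfacial term."""
    return -x / sigma ** 2 + (a / sigma ** 2) * np.tanh(a * x / sigma ** 2)

xs = np.linspace(-3.0, 3.0, 61)
numeric = mixture_score_numeric(xs, a=1.0, sigma=0.7)
closed = mixture_score_closed_form(xs, a=1.0, sigma=0.7)
print(np.max(np.abs(numeric - closed)))
```

The printed value is the largest gap between the finite-difference score and the closed form on the grid; it should be small, limited only by the finite-difference step.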
其他(47篇)
【1】PIArena: A Platform for Prompt Injection Evaluation
标题:PIArena:提示注入评估平台
链接:https://arxiv.org/abs/2604.08499
作者:Runpeng Geng,Chenlong Yin,Yanting Wang,Ying Chen,Jinyuan Jia
备注:To appear in ACL 2026. The code is available at https://github.com/sleeepeer/PIArena
摘要:Prompt injection attacks pose serious security risks across a wide range of real-world applications. While receiving increasing attention, the community faces a critical gap: the lack of a unified platform for prompt injection evaluation. This makes it challenging to reliably compare defenses, understand their true robustness under diverse attacks, or assess how well they generalize across tasks and benchmarks. For instance, many defenses initially reported as effective were later found to exhibit limited robustness on diverse datasets and attacks. To bridge this gap, we introduce PIArena, a unified and extensible platform for prompt injection evaluation that enables users to easily integrate state-of-the-art attacks and defenses and evaluate them across a variety of existing and new benchmarks. We also design a dynamic strategy-based attack that adaptively optimizes injected prompts based on defense feedback. Through comprehensive evaluation using PIArena, we uncover critical limitations of state-of-the-art defenses: limited generalizability across tasks, vulnerability to adaptive attacks, and fundamental challenges when an injected task aligns with the target task. The code and datasets are available at https://github.com/sleeepeer/PIArena.
【2】The Impact of Dimensionality on the Stability of Node Embeddings
标题:维度对节点嵌入稳定性的影响
链接:https://arxiv.org/abs/2604.08492
作者:Tobias Schumacher,Simon Reichelt,Markus Strohmaier
摘要:Previous work has established that neural network-based node embeddings return different outcomes when trained with identical parameters on the same dataset, just from using different training seeds. Yet, it has not been thoroughly analyzed how key hyperparameters such as embedding dimension could impact this instability. In this work, we investigate how varying the dimensionality of node embeddings influences both their stability and downstream performance. We systematically evaluate five widely used methods -- ASNE, DGI, GraphSAGE, node2vec, and VERSE -- across multiple datasets and embedding dimensions. We assess stability from both a representational perspective and a functional perspective, alongside performance evaluation. Our results show that embedding stability varies significantly with dimensionality, but we observe different patterns across the methods we consider: while some approaches, such as node2vec and ASNE, tend to become more stable with higher dimensionality, other methods do not exhibit the same trend. Moreover, we find that maximum stability does not necessarily align with optimal task performance. These findings highlight the importance of carefully selecting embedding dimension, and provide new insights into the trade-offs between stability, performance, and computational effectiveness in graph representation learning.
【3】Less Approximates More: Harmonizing Performance and Confidence Faithfulness via Hybrid Post-Training for High-Stakes Tasks
标题:以少近多:通过面向高风险任务的混合后训练协调性能与置信度忠实性
链接:https://arxiv.org/abs/2604.08454
作者:Haokai Ma,Lee Yan Zhen,Gang Yang,Yunshan Ma,Ee-Chien Chang,Tat-Seng Chua
摘要:Large language models are increasingly deployed in high-stakes tasks, where confident yet incorrect inferences may cause severe real-world harm, bringing the previously overlooked issue of confidence faithfulness back to the forefront. A promising solution is to jointly optimize unsupervised Reinforcement Learning from Internal Feedback (RLIF) with reasoning-trace-guided Reasoning Distillation (RD), which may face three persistent challenges: scarcity of high-quality training corpora, factually unwarranted overconfidence and indiscriminate fusion that amplifies erroneous updates. Inspired by the human confidence accumulation from uncertainty to certainty, we propose Progressive Reasoning Gain (PRG) to measure whether reasoning steps progressively strengthen support for the final answer. Furthermore, we introduce HyTuning, a hybrid post-training framework that adaptively reweights RD and RLIF via a PRG-style metric, using scarce supervised reasoning traces as a stable anchor while exploiting abundant unlabeled queries for scalability. Experiments on several domain-specific and general benchmarks demonstrate that HyTuning improves accuracy while achieving confidence faithfulness under limited supervision, supporting a practical "Less Approximates More" effect.
【4】KV Cache Offloading for Context-Intensive Tasks
标题:用于上下文密集型任务的KV缓存卸载
链接:https://arxiv.org/abs/2604.08426
作者:Andrey Bocharnikov,Ivan Ermakov,Denis Kuznedelev,Vyacheslav Zhdanovskiy,Yegor Yershov
备注:Preprint, Work in progress
摘要:With the growing demand for long-context LLMs across a wide range of applications, the key-value (KV) cache has become a critical bottleneck for both latency and memory usage. Recently, KV-cache offloading has emerged as a promising approach to reduce memory footprint and inference latency while preserving accuracy. Prior evaluations have largely focused on tasks that do not require extracting large amounts of information from the context. In this work, we study KV-cache offloading on context-intensive tasks: problems where the solution requires looking up a lot of information from the input prompt. We create and release the Text2JSON benchmark, a highly context-intensive task that requires extracting structured knowledge from raw text. We evaluate modern KV offloading on Text2JSON and other context-intensive tasks and find significant performance degradation on both Llama 3 and Qwen 3 models. Our analysis identifies two key reasons for poor accuracy: low-rank projection of keys and unreliable landmarks, and proposes a simpler alternative strategy that significantly improves accuracy across multiple LLM families and benchmarks. These findings highlight the need for a comprehensive and rigorous evaluation of long-context compression techniques.
【5】On-board Telemetry Monitoring in Autonomous Satellites: Challenges and Opportunities
标题:自主卫星星载遥测监测:挑战与机遇
链接:https://arxiv.org/abs/2604.08424
作者:Lorenzo Capelli,Leandro de Souza Rosa,Maurizio De Tommasi,Livia Manovi,Andriy Enttsel,Mauro Mangia,Riccardo Rovatti,Ilaria Pinci,Carlo Ciancarelli,Eleonora Mariotti,Gianluca Furano
摘要:The increasing autonomy of spacecraft demands fault-detection systems that are both reliable and explainable. This work addresses eXplainable Artificial Intelligence for onboard Fault Detection, Isolation and Recovery within the Attitude and Orbit Control Subsystem by introducing a framework that enhances interpretability in neural anomaly detectors. We propose a method to derive low-dimensional, semantically annotated encodings from intermediate neural activations, called peepholes. Applied to a convolutional autoencoder, the framework produces interpretable indicators that enable the identification and localization of anomalies in reaction-wheel telemetry. Peepholes analysis further reveals bias detection and supports fault localization. The proposed framework enables the semantic characterization of detected anomalies while requiring only a marginal increase in computational resources, thus supporting its feasibility for on-board deployment.
【6】Synthetic Data for any Differentiable Target
标题:面向任意可微目标的合成数据
链接:https://arxiv.org/abs/2604.08423
作者:Tristan Thrush,Sung Min Park,Herman Brunborg,Luke Bailey,Marcel Roed,Neil Band,Christopher Potts,Tatsunori Hashimoto
摘要:What are the limits of controlling language models via synthetic training data? We develop a reinforcement learning (RL) primitive, the Dataset Policy Gradient (DPG), which can precisely optimize synthetic data generators to produce a dataset of targeted examples. When used for supervised fine-tuning (SFT) of a target model, these examples cause the target model to do well on a differentiable metric of our choice. Our approach achieves this by taking exact data attribution via higher-order gradients and using those scores as policy gradient rewards. We prove that this procedure closely approximates the true, intractable gradient for the synthetic data generator. To illustrate the potential of DPG, we show that, using only SFT on generated examples, we can cause the target model's LM head weights to (1) embed a QR code, (2) embed the pattern $\texttt{67}$, and (3) have lower $\ell^2$ norm. We additionally show that we can cause the generator to (4) rephrase inputs in a new language and (5) produce a specific UUID, even though neither of these objectives is conveyed in the generator's input prompts. These findings suggest that DPG is a powerful and flexible technique for shaping model properties using only synthetic training examples.
【7】Bias-Constrained Diffusion Schedules for PDE Emulations: Reconstruction Error Minimization and Efficient Unrolled Training
标题:用于PDE仿真的偏差约束扩散调度:重建误差最小化和高效展开训练
链接:https://arxiv.org/abs/2604.08357
作者:Constantin Le Cleï,Nils Thürey,Xiaoxiang Zhu
摘要:Conditional Diffusion Models are powerful surrogates for emulating complex spatiotemporal dynamics, yet they often fail to match the accuracy of deterministic neural emulators for high-precision tasks. In this work, we address two critical limitations of autoregressive PDE diffusion models: their suboptimal single-step accuracy and the prohibitive computational cost of unrolled training. First, we characterize the relationship between the noise schedule, the reconstruction error reduction rate and the diffusion exposure bias, demonstrating that standard schedules lead to suboptimal reconstruction error. Leveraging this insight, we propose an Adaptive Noise Schedule framework that minimizes inference reconstruction error by dynamically constraining the model's exposure bias. We further show that this optimized schedule enables a fast Proxy Unrolled Training method to stabilize long-term rollouts without the cost of full Markov Chain sampling. Both proposed methods enable significant improvements in short-term accuracy and long-term stability over diffusion and deterministic baselines on diverse benchmarks, including forced Navier-Stokes, Kuramoto-Sivashinsky and Transonic Flow.
【8】DMax: Aggressive Parallel Decoding for dLLMs
标题:DMax:面向dLLM的激进并行解码
链接:https://arxiv.org/abs/2604.08302
作者:Zigeng Chen,Gongfan Fang,Xinyin Ma,Ruonan Yu,Xinchao Wang
备注:Work in progress. Code is available at: https://github.com/czg1225/DMax
摘要:We present DMax, a new paradigm for efficient diffusion language models (dLLMs). It mitigates error accumulation in parallel decoding, enabling aggressive decoding parallelism while preserving generation quality. Unlike conventional masked dLLMs that decode through a binary mask-to-token transition, DMax reformulates decoding as a progressive self-refinement from mask embeddings to token embeddings. At the core of our approach is On-Policy Uniform Training, a novel training strategy that efficiently unifies masked and uniform dLLMs, equipping the model to recover clean tokens from both masked inputs and its own erroneous predictions. Building on this foundation, we further propose Soft Parallel Decoding. We represent each intermediate decoding state as an interpolation between the predicted token embedding and the mask embedding, enabling iterative self-revising in embedding space. Extensive experiments across a variety of benchmarks demonstrate the effectiveness of DMax. Compared with the original LLaDA-2.0-mini, our method improves TPF on GSM8K from 2.04 to 5.47 while preserving accuracy. On MBPP, it increases TPF from 2.71 to 5.86 while maintaining comparable performance. On two H200 GPUs, our model achieves an average of 1,338 TPS at batch size 1. Code is available at: https://github.com/czg1225/DMax
【9】Approximation of the Basset force in the Maxey-Riley-Gatignol equations via universal differential equations
标题:用通用微分方程逼近Maxey-Riley-Gatignol方程中的Basset力
链接:https://arxiv.org/abs/2604.08194
作者:Finn Sommer,Vamika Rathi,Sebastian Goetschel,Daniel Ruprecht
备注:24 pages, 15 figures
摘要:The Maxey-Riley-Gatignol equations (MaRGE) model the motion of spherical inertial particles in a fluid. They contain the Basset force, an integral term which models history effects due to the formation of wakes and boundary layer effects. This causes the force that acts on a particle to depend on its past trajectory and complicates the numerical solution of MaRGE. Therefore, the Basset force is often neglected, despite substantial evidence that it has both quantitative and qualitative impact on the movement patterns of modelled particles. Using the concept of universal differential equations, we propose an approximation of the history term via neural networks which approximates MaRGE by a system of ordinary differential equations that can be solved with standard numerical solvers like Runge-Kutta methods.
【10】Long-Term Embeddings for Balanced Personalization
标题:平衡个性化的长期嵌入
链接:https://arxiv.org/abs/2604.08181
作者:Andrii Dzhoha,Egor Malykh
摘要:Modern transformer-based sequential recommenders excel at capturing short-term intent but often suffer from recency bias, overlooking stable long-term preferences. While extending sequence lengths is an intuitive fix, it is computationally inefficient, and recent interactions tend to dominate the model's attention. We propose Long-Term Embeddings (LTE) as a high-inertia contextual anchor to bridge this gap. We address a critical production challenge: the point-in-time consistency problem caused by infrastructure constraints, as feature stores typically host only a single "live" version of features. This leads to an offline-online mismatch during model deployments and rollbacks, as models are forced to process evolved representations they never saw during training. To resolve this, we introduce an LTE framework that constrains embeddings to a fixed semantic basis of content-based item representations, ensuring cross-version compatibility. Furthermore, we investigate integration strategies for causal language modeling, considering the data leakage issue that occurs when the LTE and the transformer's short-term sequence share a temporal horizon. We evaluate two representations: a heuristic average and an asymmetric autoencoder with a fixed decoder grounded in the semantic basis to enable behavioral fine-tuning while maintaining stability. Online A/B tests on Zalando demonstrate that integrating LTE as a contextual prefix token using a lagged window yields significant uplifts in both user engagement and financial metrics.
【11】Shift- and stretch-invariant non-negative matrix factorization with an application to brain tissue delineation in emission tomography data
标题:位移和拉伸不变的非负矩阵分解及其在发射断层扫描数据中脑组织描绘中的应用
链接:https://arxiv.org/abs/2604.08161
作者:Anders S. Olsen,Miriam L. Navarro,Claus Svarer,Jesper L. Hinrich,Morten Mørup,Gitte M. Knudsen
备注:Accepted at ICASSP2026
摘要:Dynamic neuroimaging data, such as emission tomography measurements of radiotracer transport in blood or cerebrospinal fluid, often exhibit diffusion-like properties. These introduce distance-dependent temporal delays, scale-differences, and stretching effects that limit the effectiveness of conventional linear modeling and decomposition methods. To address this, we present the shift- and stretch-invariant non-negative matrix factorization framework. Our approach estimates both integer and non-integer temporal shifts as well as temporal stretching, all implemented in the frequency domain, where shifts correspond to phase modifications, and where stretching is handled via zero-padding or truncation. The model is implemented in PyTorch (https://github.com/anders-s-olsen/shiftstretchNMF). We demonstrate on synthetic data and brain emission tomography data that the model is able to account for stretching to provide more detailed characterization of brain tissue structure.
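The statement that temporal shifts correspond to phase modifications in the frequency domain is the Fourier shift theorem. The sketch below illustrates only that one ingredient with NumPy. It is not taken from the linked repository (the paper's model is implemented in PyTorch), and the function name `fractional_shift` is hypothetical. The shift is circular, and it assumes a real-valued, uniformly sampled signal.

```python
import numpy as np

def fractional_shift(x, delay):
    """Circularly delay a real 1-D signal by `delay` samples.

    `delay` may be non-integer: the delay is applied as a linear
    phase ramp exp(-2*pi*i*f*delay) on the signal's spectrum.
    """
    n = x.shape[-1]
    freqs = np.fft.rfftfreq(n)                  # frequencies in cycles per sample
    phase = np.exp(-2j * np.pi * freqs * delay)
    return np.fft.irfft(np.fft.rfft(x) * phase, n=n)

t = np.arange(64)
tone = np.sin(2 * np.pi * 3 * t / 64)           # band-limited test signal
half_sample = fractional_shift(tone, 0.5)       # non-integer delay
three_samples = fractional_shift(tone, 3.0)     # integer delay
```

For an integer delay the result matches `np.roll(tone, 3)`. For a non-integer delay on a band-limited signal it matches the signal evaluated at the shifted time points.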
【12】A Direct Approach for Handling Contextual Bandits with Latent State Dynamics
标题:处理具有潜在状态动态的上下文老虎机的直接方法
链接:https://arxiv.org/abs/2604.08149
作者:Zhen Li,Gilles Stoltz
摘要:We revisit the finite-armed linear bandit model by Nelson et al. (2022), where contexts and rewards are governed by a finite hidden Markov chain. Nelson et al. (2022) approach this model by a reduction to linear contextual bandits; but to do so, they actually introduce a simplification in which rewards are linear functions of the posterior probabilities over the hidden states given the observed contexts, rather than functions of the hidden states themselves. Their analysis (but not their algorithm) also does not take into account the estimation of the HMM parameters, and only tackles expected, not high-probability, bounds, which suffer in addition from unnecessarily complex dependencies on the model (like reward gaps). We instead study the more natural model incorporating direct dependencies in the hidden states (on top of dependencies on the observed contexts, as is natural for contextual bandits) and also obtain stronger, high-probability, regret bounds for a fully adaptive strategy that estimates HMM parameters online. These bounds do not depend on the reward functions and only depend on the model through the estimation of the HMM parameters.
【13】Bias Redistribution in Visual Machine Unlearning: Does Forgetting One Group Harm Another?
标题:视觉机器遗忘中的偏见重新分布:忘记一个群体会伤害另一个群体吗?
链接:https://arxiv.org/abs/2604.08111
作者:Yunusa Haruna,Adamu Lawan,Ibrahim Haruna Abdulhamid,Hamza Mohammed Dauda,Jiaquan Zhang,Chaoning Zhang,Shamsuddeen Hassan Muhammad
摘要:Machine unlearning enables models to selectively forget training data, driven by privacy regulations such as GDPR and CCPA. However, its fairness implications remain underexplored: when a model forgets a demographic group, does it neutralize that concept or redistribute it to correlated groups, potentially amplifying bias? We investigate this bias redistribution phenomenon on CelebA using CLIP models (ViT-B/32, ViT-L/14, ViT-B/16) under a zero-shot classification setting across intersectional groups defined by age and gender. We evaluate three unlearning methods, Prompt Erasure, Prompt Reweighting, and Refusal Vector, using per-group accuracy shifts, demographic parity gaps, and a redistribution score. Our results show that unlearning does not eliminate bias but redistributes it primarily along gender rather than age boundaries. In particular, removing the dominant Young Female group consistently transfers performance to Old Female across all model scales, revealing a gender-dominant structure in CLIP's embedding space. While the Refusal Vector method reduces redistribution, it fails to achieve complete forgetting and significantly degrades retained performance. These findings highlight a fundamental limitation of current unlearning methods: without accounting for embedding geometry, they risk amplifying bias in retained groups.
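The abstract names the demographic parity gap as one of its metrics without defining it. A common textbook definition is the largest difference in positive-prediction rate between any two groups. The sketch below assumes that definition, which may differ from the paper's exact formulation. The function name and the toy data are hypothetical.

```python
import numpy as np

def demographic_parity_gap(y_pred, groups):
    """Return (gap, per-group positive rates) for binary predictions.

    gap = max over groups of P(y_pred = 1 | group)
          minus the min over groups of the same rate.
    """
    y_pred = np.asarray(y_pred)
    groups = np.asarray(groups)
    rates = {str(g): float(y_pred[groups == g].mean()) for g in np.unique(groups)}
    return max(rates.values()) - min(rates.values()), rates

# Toy data: group "a" gets 2 of 3 positives, group "b" gets 1 of 3.
gap, rates = demographic_parity_gap([1, 1, 0, 0, 1, 0],
                                    ["a", "a", "a", "b", "b", "b"])
print(gap, rates)
```

Computing this gap before and after an unlearning step, per intersectional group, is one way to see whether bias was removed or merely moved between groups.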
【14】OV-Stitcher: A Global Context-Aware Framework for Training-Free Open-Vocabulary Semantic Segmentation
标题:OV-Stitcher:一个全局上下文感知框架,用于免训练开放词汇语义分割
链接:https://arxiv.org/abs/2604.08110
作者:Seungjae Moon,Seunghyun Oh,Youngmin Ro
摘要:Training-free open-vocabulary semantic segmentation (TF-OVSS) has recently attracted attention for its ability to perform dense prediction by leveraging the pretrained knowledge of large vision and vision-language models, without requiring additional training. However, due to the limited input resolution of these pretrained encoders, existing TF-OVSS methods commonly adopt a sliding-window strategy that processes cropped sub-images independently. While effective for managing high-resolution inputs, this approach prevents global attention over the full image, leading to fragmented feature representations and limited contextual reasoning. We propose OV-Stitcher, a training-free framework that addresses this limitation by stitching fragmented sub-image features directly within the final encoder block. By reconstructing attention representations from fragmented sub-image features, OV-Stitcher enables global attention within the final encoder block, producing coherent context aggregation and spatially consistent, semantically aligned segmentation maps. Extensive evaluations across eight benchmarks demonstrate that OV-Stitcher establishes a scalable and effective solution for open-vocabulary segmentation, achieving a notable improvement in mean Intersection over Union (mIoU) from 48.7 to 50.7 compared with prior training-free baselines.
【15】From Universal to Individualized Actionability: Revisiting Personalization in Algorithmic Recourse
标题:从通用到个体化的可操作性:重新审视算法追索中的个性化
链接:https://arxiv.org/abs/2604.08030
作者:Lena Marie Budde,Ayan Majumdar,Richard Uth,Markus Langer,Isabel Valera
备注:27 pages, 8 figures, 6 tables
摘要:Algorithmic recourse aims to provide actionable recommendations that enable individuals to change unfavorable model outcomes, and prior work has extensively studied properties such as efficiency, robustness, and fairness. However, the role of personalization in recourse remains largely implicit and underexplored. While existing approaches incorporate elements of personalization through user interactions, they typically lack an explicit definition of personalization and do not systematically analyze its downstream effects on other recourse desiderata. In this paper, we formalize personalization as individual actionability, characterized along two dimensions: hard constraints that specify which features are individually actionable, and soft, individualized constraints that capture preferences over action values and costs. We operationalize these dimensions within the causal algorithmic recourse framework, adopting a pre-hoc user-prompting approach in which individuals express preferences via rankings or scores prior to the generation of any recourse recommendation. Through extensive empirical evaluation, we investigate how personalization interacts with key recourse desiderata, including validity, cost, and plausibility. Our results highlight important trade-offs: individual actionability constraints, particularly hard ones, can substantially degrade the plausibility and validity of recourse recommendations across amortized and non-amortized approaches. Notably, we also find that incorporating individual actionability can reveal disparities in the cost and plausibility of recourse actions across socio-demographic groups. These findings underscore the need for principled definitions, careful operationalization, and rigorous evaluation of personalization in algorithmic recourse.
【16】DSCA: Dynamic Subspace Concept Alignment for Lifelong VLM Editing
标题:DSCA:面向终身VLM编辑的动态子空间概念对齐
链接:https://arxiv.org/abs/2604.07965
作者:Gyanendra Das,Sai Satyam Jena
备注:Accepted at CVPR 2026
摘要:Model editing aims to update knowledge to add new concepts and change relevant information without retraining. Lifelong editing is a challenging task, prone to disrupting previously learned concepts, especially for Vision Language Models (VLMs), because sequential edits can lead to degraded reasoning and cross modal misalignment. Existing VLM knowledge editing methods based on gated adapters, activation edits, and parameter merging techniques address catastrophic forgetting seen in full fine tuning; however, they still operate in the shared representation space of the VLM, where concepts are entangled, so edits interfere with other non relevant concepts. We hypothesize that this instability persists because current methods algorithmically control edits via optimization rather than structurally separating knowledge. We introduce Dynamic Subspace Concept Alignment (DSCA) which by design mitigates this limitation by decomposing the representation space into a set of orthogonal semantic subspaces and proposing edits only in those transformed spaces. These subspaces are obtained through incremental clustering and PCA on joint vision language representations. This process structurally isolates concepts, enabling precise, non interfering edits by turning isolation from a soft training objective into an architectural property. The surgical edits are guided by a multi term loss function for maintaining task fidelity, edit locality, and cross modal alignment. With the base model frozen, our method achieves 98 percent single edit success, remains over 95 percent after 1000 sequential edits, lowers hallucination by 3 to 5 percent, and achieves the best backward transfer (BWT) scores on continual instruction tuning benchmarks. Extensive experiments demonstrate DSCA state of the art stability and knowledge retention capability in continual lifelong editing across various datasets and benchmarks.
【17】Is your algorithm unlearning or untraining?
标题:您的算法是取消学习还是取消训练?
链接:https://arxiv.org/abs/2604.07962
作者:Eleni Triantafillou,Ahmed Imtiaz Humayun,Monica Ribero,Alexander Matt Turner,Michael C. Mozer,Georgios Kaissis
摘要:As models are getting larger and are trained on increasing amounts of data, there has been an explosion of interest in how we can "delete" specific data points or behaviours from a trained model, after the fact. This goal has been referred to as "machine unlearning". In this note, we argue that the term "unlearning" has been overloaded, with different research efforts spanning two distinct problem formulations, but without that distinction having been observed or acknowledged in the literature. This causes various issues, including ambiguity around when an algorithm is expected to work, use of inappropriate metrics and baselines when comparing different algorithms to one another, difficulty in interpreting results, as well as missed opportunities for pursuing critical research directions. In this note, we address this issue by establishing a fundamental distinction between two notions that we identify as unlearning and untraining, illustrated in Figure 1. In short, untraining aims to reverse the effect of having trained on a given forget set, i.e. to remove the influence that the specific forget set examples had on the model during training. On the other hand, the goal of unlearning is not just to remove the influence of those given examples, but to use those examples for the purpose of more broadly removing the entire underlying distribution from which those examples were sampled (e.g. the concept or behaviour that those examples represent). We discuss technical definitions of these problems and map problem settings studied in the literature to each. We hope to initiate discussions on disambiguating technical definitions and identify a set of overlooked research questions, as we believe that this is a key missing step for accelerating progress in the field of "unlearning".
【18】A Systematic Framework for Tabular Data Disentanglement
标题:表格数据解纠缠的系统框架
链接:https://arxiv.org/abs/2604.07940
作者:Ivan Tjuawinata,Andre Gunawan,Anh Quan Tran,Nitish Kumar,Payal Pote,Harsh Bansal,Chu-Hung Chi,Kwok-Yan Lam,Parventanis Murthy
摘要:Tabular data, widely used in various applications such as industrial control systems, finance, and supply chain, often contains complex interrelationships among its attributes. Data disentanglement seeks to transform such data into latent variables with reduced interdependencies, facilitating more effective and efficient processing. Despite the extensive studies on data disentanglement over image, text, or audio data, tabular data disentanglement may require further investigation due to the more intricate attribute interactions typically found in tabular data. Moreover, due to the highly complex interrelationships, direct translation from other data domains results in suboptimal data disentanglement. Existing tabular data disentanglement methods, such as factor analysis, CT-GAN, and VAE face limitations including scalability issues, mode collapse, and poor extrapolation. In this paper, we propose the use of a framework to provide a systematic view on tabular data disentanglement that modularizes the process into four core components: data extraction, data modeling, model analysis, and latent representation extrapolation. We believe this work provides a deeper understanding of tabular data disentanglement and existing methods, and lays the foundation for potential future research in developing robust, efficient, and scalable data disentanglement techniques. Finally, we demonstrate the framework's applicability through a case study on synthetic tabular data generation, showcasing its potential in the particular downstream task of data synthesis.
【19】PolicyLong: Towards On-Policy Context Extension
标题:PolicyLong:迈向同策略(on-policy)上下文扩展
链接:https://arxiv.org/abs/2604.07809
作者:Junlong Jia,Ziyang Chen,Xing Wu,Chaochen Gao,TingHao Yu,Feng Zhang,Songlin Hu
备注:Work in progress. Correspondence to ucaswu@tencent.com or wuxing@iie.ac.cn
摘要:Extending LLM context windows is hindered by scarce high-quality long-context data. Recent methods synthesize data with genuine long-range dependencies via information-theoretic verification, selecting contexts that reduce a base model's predictive entropy. However, their single-pass offline construction with a fixed model creates a fundamental off-policy gap: the static screening landscape misaligns with the model's evolving capabilities, causing the training distribution to drift. We propose PolicyLong, shifting data construction towards a dynamic on-policy paradigm. By iteratively re-executing data screening (entropy computation, retrieval, and verification) using the current model, PolicyLong ensures the training distribution tracks evolving capabilities, yielding an emergent self-curriculum. Crucially, both positive and hard negative contexts derive from the current model's entropy landscape, co-evolving what the model learns to exploit and resist. Experiments on RULER, HELMET, and LongBench-v2 (Qwen2.5-3B) show PolicyLong consistently outperforms EntropyLong and NExtLong, with gains growing at longer contexts (e.g., +2.54 at 128K on RULER), confirming the value of on-policy data evolution.
【20】Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video
标题:跨模式情感转移用于说话的面部视频中的情感编辑
链接:https://arxiv.org/abs/2604.07786
作者:Chanhyuk Choi,Taesoo Kim,Donggyu Lee,Siyeol Jung,Taehwan Kim
备注:Accepted to CVPR 2026. Project Page: https://chanhyeok-choi.github.io/C-MET/
摘要:Talking face generation has gained significant attention as a core application of generative models. To enhance the expressiveness and realism of synthesized videos, emotion editing in talking face video plays a crucial role. However, existing approaches often limit expressive flexibility and struggle to generate extended emotions. Label-based methods represent emotions with discrete categories, which fail to capture a wide range of emotions. Audio-based methods can leverage emotionally rich speech signals - and even benefit from expressive text-to-speech (TTS) synthesis - but they fail to express the target emotions because emotion and linguistic content are entangled in emotional speech. Image-based methods, on the other hand, rely on target reference images to guide emotion transfer, yet they require high-quality frontal views and face challenges in acquiring reference data for extended emotions (e.g., sarcasm). To address these limitations, we propose Cross-Modal Emotion Transfer (C-MET), a novel approach that generates facial expressions based on speech by modeling emotion semantic vectors between speech and visual feature spaces. C-MET leverages a large-scale pretrained audio encoder and a disentangled facial expression encoder to learn emotion semantic vectors that represent the difference between two different emotional embeddings across modalities. Extensive experiments on the MEAD and CREMA-D datasets demonstrate that our method improves emotion accuracy by 14% over state-of-the-art methods, while generating expressive talking face videos - even for unseen extended emotions. Code, checkpoint, and demo are available at https://chanhyeok-choi.github.io/C-MET/
【21】Mitigating Distribution Sharpening in Math RLVR via Distribution-Aligned Hint Synthesis and Backward Hint Annealing
标题:通过分布对齐提示合成和反向提示退火来缓解数学RLVR中的分布锐化
链接:https://arxiv.org/abs/2604.07747
作者:Pei-Xi Xie,Che-Yu Lin,Cheng-Lin Yang
摘要:Reinforcement learning with verifiable rewards (RLVR) can improve low-$k$ reasoning accuracy while narrowing solution coverage on challenging math questions, and pass@1 gains do not necessarily translate into better large-$k$ performance. Existing hint-based approaches can make challenging questions trainable, but they leave two issues underexplored: teacher-student distribution mismatch and the need to reduce hint exposure to match no-hint evaluation. We address these issues through two components. Distribution-Aligned Hint Synthesis (DAHS) constructs verified teacher hints conditioned on student-style responses. Backward Hint Annealing (BHA) anneals hint exposure across difficulty buckets and uses per-question hint dropout to preserve no-hint updates throughout RL training. We evaluate the method in math RLVR under the DAPO training framework across AIME24, AIME25, and AIME26 using $\texttt{Qwen3-1.7B-Base}$ and $\texttt{Llama-3.2-1B-Instruct}$. On $\texttt{Qwen3-1.7B-Base}$, our method improves both pass@1 and pass@2048 relative to DAPO across the three AIME benchmarks. On $\texttt{Llama-3.2-1B-Instruct}$, the gains are concentrated in the large-$k$ regime. These results suggest that, in math RLVR, hint scaffolding is effective when it restores learnable updates on challenging questions early in training and is then gradually removed before no-hint evaluation.
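The abstract above contrasts low-$k$ and large-$k$ pass@$k$. As a reminder of what that metric measures, here is a minimal sketch of one commonly used unbiased estimator: with n sampled solutions of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). The abstract does not say which estimator the authors use, so this choice is an assumption, and the function name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: total samples drawn for a question
    c: number of those samples that are correct
    k: budget of samples being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets that contain no correct sample;
    # it is 0 whenever fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n = 4 and c = 1, this gives pass@1 = 0.25 and pass@2 = 0.5, which shows how a question that is rarely solved in one attempt can still contribute to large-$k$ coverage.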
【22】IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures
标题:IatroBench:人工智能安全措施造成医源性伤害的预先注册证据
链接:https://arxiv.org/abs/2604.07709
作者:David Gringras
备注:30 pages, 3 figures, 11 tables. Pre-registered on OSF (DOI: 10.17605/OSF.IO/G6VMZ). Code and data: https://github.com/davidgringras/iatrobench
摘要:Ask a frontier model how to taper six milligrams of alprazolam (psychiatrist retired, ten days of pills left, abrupt cessation causes seizures) and it tells her to call the psychiatrist she just explained does not exist. Change one word ("I'm a psychiatrist; a patient presents with...") and the same model, same weights, same inference pass produces a textbook Ashton Manual taper with diazepam equivalence, anticonvulsant coverage, and monitoring thresholds. The knowledge was there; the model withheld it. IatroBench measures this gap. Sixty pre-registered clinical scenarios, six frontier models, 3,600 responses, scored on two axes (commission harm, CH 0-3; omission harm, OH 0-4) through a structured-evaluation pipeline validated against physician scoring (kappa_w = 0.571, within-1 agreement 96%). The central finding is identity-contingent withholding: match the same clinical question in physician vs. layperson framing and all five testable models provide better guidance to the physician (decoupling gap +0.38, p = 0.003; binary hit rates on safety-colliding actions drop 13.1 percentage points in layperson framing, p < 0.0001, while non-colliding actions show no change). The gap is widest for the model with the heaviest safety investment (Opus, +0.65). Three failure modes separate cleanly: trained withholding (Opus), incompetence (Llama 4), and indiscriminate content filtering (GPT-5.2, whose post-generation filter strips physician responses at 9x the layperson rate because they contain denser pharmacological tokens). The standard LLM judge assigns OH = 0 to 73% of responses a physician scores OH >= 1 (kappa = 0.045); the evaluation apparatus has the same blind spot as the training apparatus. Every scenario targets someone who has already exhausted the standard referrals.
【23】Tree-of-Evidence: Efficient "System 2" Search for Faithful Multimodal Grounding
链接:https://arxiv.org/abs/2604.07692
作者:Micky C. Nnamdi,Benoit L. Marteau,Yishan Zhong,J. Ben Tamo,May D. Wang
摘要:Large Multimodal Models (LMMs) achieve state-of-the-art performance in high-stakes domains like healthcare, yet their reasoning remains opaque. Current interpretability methods, such as attention mechanisms or post-hoc saliency, often fail to faithfully represent the model's decision-making process, particularly when integrating heterogeneous modalities like time-series and text. We introduce Tree-of-Evidence (ToE), an inference-time search algorithm that frames interpretability as a discrete optimization problem. Rather than relying on soft attention weights, ToE employs lightweight Evidence Bottlenecks that score coarse groups or units of data (e.g., vital-sign windows, report sentences) and performs a beam search to identify the compact evidence set required to reproduce the model's prediction. We evaluate ToE across six tasks spanning three datasets and two domains: four clinical prediction tasks on MIMIC-IV, cross-center validation on eICU, and non-clinical fault detection on LEMMA-RCA. ToE produces auditable evidence traces while maintaining predictive performance, retaining over 0.98 of full-model AUROC with as few as five evidence units across all settings. Under sparse evidence budgets, ToE achieves higher decision agreement and lower probability fidelity error than other approaches. Qualitative analyses show that ToE adapts its search strategy: it often resolves straightforward cases using only vitals, while selectively incorporating text when physiological signals are ambiguous. ToE therefore provides a practical mechanism for auditing multimodal models by revealing which discrete evidence units support each prediction.
【24】Tensor-based computation of the Koopman generator via operator logarithm
标题:通过算子对数对Koopman生成器进行基于张量的计算
链接:https://arxiv.org/abs/2604.07685
作者:Tatsuya Kishimoto,Jun Ohkubo
备注:9 pages, 5 figures
摘要:Identifying governing equations of nonlinear dynamical systems from data is challenging. While sparse identification of nonlinear dynamics (SINDy) and its extensions are widely used for system identification, operator-logarithm approaches use the logarithm to avoid time differentiation, enabling larger sampling intervals. However, they still suffer from the curse of dimensionality. To address this, we propose a data-driven method to compute the Koopman generator in a low-rank tensor train (TT) format by taking logarithms of Koopman eigenvalues while preserving the TT format. Experiments on 4-dimensional Lotka-Volterra and 10-dimensional Lorenz-96 systems show accurate recovery of vector field coefficients and scalability to higher-dimensional systems.
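The operator-logarithm idea mentioned in this abstract rests on the standard relation between the Koopman operator at sampling interval $\Delta t$ and its generator. The block below is a sketch of that relation only; the symbols $\mathcal{K}^{\Delta t}$, $\mathcal{L}$, $\mu_i$, and $\varphi_i$ are notation chosen here, not taken from the paper, and the logarithm is assumed to be taken on the principal branch.

```latex
\begin{align}
  \mathcal{K}^{\Delta t} &= \exp\!\left(\Delta t\,\mathcal{L}\right)
  \quad\Longrightarrow\quad
  \mathcal{L} = \frac{1}{\Delta t}\,\log \mathcal{K}^{\Delta t}, \\
  \mathcal{K}^{\Delta t}\varphi_i &= \mu_i\,\varphi_i
  \quad\Longrightarrow\quad
  \mathcal{L}\,\varphi_i = \frac{\log \mu_i}{\Delta t}\,\varphi_i .
\end{align}
```

Because only the eigenvalues are transformed while the eigenfunctions are shared, the generator can be recovered without differentiating the data in time, which is why larger sampling intervals become usable.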
【25】Sheaf-Laplacian Obstruction and Projection Hardness for Cross-Modal Compatibility on a Modality-Independent Site
标题:非模式位置上跨模式兼容性的Sheaf-Laplacian障碍和投影硬度
链接:https://arxiv.org/abs/2604.07632
作者:Tibor Sloboda
备注:21 pages, 4 figures, submitted to Annals of Mathematics and Artificial Intelligence of Springer Nature
摘要:We develop a unified framework for analyzing cross-modal compatibility in learned representations. The core object is a modality-independent neighborhood site on sample indices, equipped with a cellular sheaf of finite-dimensional real inner-product spaces. For a directed modality pair $(a\to b)$, we formalize two complementary incompatibility mechanisms: projection hardness, the minimal complexity within a nested Lipschitz-controlled projection family needed for a single global map to align whitened embeddings; and sheaf-Laplacian obstruction, the minimal spatial variation required by a locally fit field of projection parameters to achieve a target alignment error. The obstruction invariant is implemented via a projection-parameter sheaf whose 0-Laplacian energy exactly matches the smoothness penalty used in sheaf-regularized regression, making the theory directly operational. This separates two distinct failure modes: hardness failure, where no low-complexity global projection exists, and obstruction failure, where local projections exist but cannot be made globally consistent over the semantic neighborhood graph without large parameter variation. We link the sheaf spectral gap to stability of global alignment, derive bounds relating obstruction energy to excess global-map error under mild Lipschitz assumptions, and give explicit constructions showing that compatibility is generally non-transitive. We further define bridging via composed projection families and show, in a concrete ReLU setting, that an intermediate modality can strictly reduce effective hardness even when direct alignment remains infeasible.
【26】DIVERSED: Relaxed Speculative Decoding via Dynamic Ensemble Verification
标题:DIVERSED:通过动态包围验证的宽松推测解码
链接:https://arxiv.org/abs/2604.07622
作者:Ziyi Wang,Siva Rajesh Kasa,Ankith M S,Santhosh Kumar Kasa,Jiaru Zou,Sumit Negi,Ruqi Zhang,Nan Jiang,Qifan Song
备注:35 pages, 9 figures, accepted at AISTATS 2026
摘要:Speculative decoding is an effective technique for accelerating large language model inference by drafting multiple tokens in parallel. In practice, its speedup is often bottlenecked by a rigid verification step that strictly enforces the accepted token distribution to exactly match the target model. This constraint leads to the rejection of many plausible tokens, lowering the acceptance rate and limiting overall time speedup. To overcome this limitation, we propose Dynamic Verification Relaxed Speculative Decoding (DIVERSED), a relaxed verification framework that improves time efficiency while preserving generation quality. DIVERSED learns an ensemble-based verifier that blends the draft and target model distributions with a task-dependent and context-dependent weight. We provide theoretical justification for our approach and demonstrate empirically that DIVERSED achieves substantially higher inference efficiency compared to standard speculative decoding methods. Code is available at: https://github.com/comeusr/diversed.
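To make the idea of relaxed verification in this abstract concrete, here is a minimal sketch, not the authors' implementation. It assumes the verifier is a convex blend of the draft and target distributions and that a drafted token is accepted with the usual speculative-sampling ratio against that blend. The weight `w` is a fixed number here, whereas the paper learns it per task and context; all function names are hypothetical.

```python
def blend(draft, target, w):
    """Verifier distribution: w * draft + (1 - w) * target (assumed form)."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return [w * d + (1.0 - w) * t for d, t in zip(draft, target)]


def accept_prob(token, draft, verifier):
    """Probability of accepting a drafted token under the verifier.

    Assumes draft[token] > 0, which holds for any token the draft
    model could actually have sampled.
    """
    return min(1.0, verifier[token] / draft[token])


def expected_acceptance(draft, verifier):
    """Acceptance rate averaged over tokens sampled from the draft model.

    sum_x draft(x) * min(1, verifier(x) / draft(x)) = sum_x min(draft(x), verifier(x))
    """
    return sum(min(d, v) for d, v in zip(draft, verifier))
```

Setting `w = 0` recovers strict verification against the target model, and raising `w` increases the acceptance rate at the cost of drifting from the target distribution, which is the trade-off the abstract describes.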
【27】SYN-DIGITS: A Synthetic Control Framework for Calibrated Digital Twin Simulation
标题:SYN-DIGITS:用于校准数字双胞胎模拟的综合控制框架
链接:https://arxiv.org/abs/2604.07513
作者:Grace Jiarui Fan,Chengpiao Huang,Tianyi Peng,Kaizheng Wang,Yuhang Wu
摘要:AI-based persona simulation -- often referred to as digital twin simulation -- is increasingly used for market research, recommender systems, and social sciences. Despite their flexibility, large language models (LLMs) often exhibit systematic bias and miscalibration relative to real human behavior, limiting their reliability. Inspired by synthetic control methods from causal inference, we propose SYN-DIGITS (SYNthetic Control Framework for Calibrated DIGItal Twin Simulation), a principled and lightweight calibration framework that learns latent structure from digital-twin responses and transfers it to align predictions with human ground truth. SYN-DIGITS operates as a post-processing layer on top of any LLM-based simulator and thus is model-agnostic. We develop a latent factor model that formalizes when and why calibration succeeds through latent space alignment conditions, and we systematically evaluate ten calibration methods across thirteen persona constructions, three LLMs, and two datasets. SYN-DIGITS supports both individual-level and distributional simulation for previously unseen questions and unobserved populations, with provable error guarantees. Experiments show that SYN-DIGITS achieves up to 50% relative improvements in individual-level correlation and 50--90% relative reductions in distributional discrepancy compared to uncalibrated baselines.
【28】Lexical Tone is Hard to Quantize: Probing Discrete Speech Units in Mandarin and Yorùbá
标题:词汇声调难以量化:探索普通话和约鲁巴语中的离散语音单元
链接:https://arxiv.org/abs/2604.07467
作者:Opeyemi Osakuade,Simon King
备注:Accepted at Speech Prosody 2026
摘要:Discrete speech units (DSUs) are derived by quantising representations from models trained using self-supervised learning (SSL). They are a popular representation for a wide variety of spoken language tasks, including those where prosody matters. DSUs are especially convenient for tasks where text and speech are jointly modelled, such as text-to-speech and multimodal dialogue systems. But we have found that DSUs encode suprasegmental information less reliably than segmental structure, which we demonstrate in this work using lexical tone, though this limitation likely extends to other suprasegmental features such as prosody. Our investigations using the tone languages Mandarin and Yorùbá show that the SSL latent representations themselves do encode tone, yet DSUs obtained using quantisation tend to prioritise phonetic structure, which makes lexical tone less reliably encoded. This remains true for a variety of quantisation methods, not only the most common, K-means. We conclude that current DSU quantisation strategies have limitations for suprasegmental features, which suggests a need for new, tone-aware (or prosody-aware) techniques in speech representation learning. We point towards a potential form of the solution by performing K-means clustering once to encode phonetic information, then again on the residual representation, which better encodes lexical tone.
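The last sentence of this abstract describes a two-pass scheme: cluster once, subtract the assigned centroid, then cluster the residual. The sketch below illustrates that scheme on plain Python lists. It is a toy under stated assumptions: the deterministic initialisation, the iteration count, and the function names `kmeans` and `residual_kmeans` are all choices made here, and real systems would operate on SSL feature vectors with a proper K-means library.

```python
def sqdist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(p, centroids):
    """Index of the centroid closest to p (first index wins ties)."""
    return min(range(len(centroids)), key=lambda j: sqdist(p, centroids[j]))


def kmeans(points, k, iters=20):
    """Lloyd's algorithm with a deterministic, evenly spaced initialisation."""
    centroids = [list(points[i * len(points) // k]) for i in range(k)]
    for _ in range(iters):
        labels = [nearest(p, centroids) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


def residual_kmeans(points, k_first, k_second):
    """First pass on the features, second pass on what the first pass missed."""
    c1, first_ids = kmeans(points, k_first)
    residuals = [[x - y for x, y in zip(p, c1[lab])]
                 for p, lab in zip(points, first_ids)]
    _, second_ids = kmeans(residuals, k_second)
    return first_ids, second_ids
```

In a toy setting where one feature dimension has large between-class spread (standing in for phonetic identity) and another has small spread (standing in for tone), the first pass tends to capture the large-spread dimension, leaving the small-spread one for the second pass.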
【29】CMP: Robust Whole-Body Tracking for Loco-Manipulation via Competence Manifold Projection
标题:CMP:通过能力流形投影实现移动操作的稳健全身跟踪
链接:https://arxiv.org/abs/2604.07457
作者:Ziyang Cheng,Haoyu Wei,Hang Yin,Xiuwei Xu,Bingyao Yu,Jie Zhou,Jiwen Lu
备注:14 pages, 8 figures. Under review. Project page and videos: https://shepherd1226.github.io/CMP
摘要:While decoupled control schemes for legged mobile manipulators have shown robustness, learning holistic whole-body control policies for tracking global end-effector poses remains fragile against Out-of-Distribution (OOD) inputs induced by sensor noise or infeasible user commands. To improve robustness against these perturbations without sacrificing task performance and continuity, we propose Competence Manifold Projection (CMP). Specifically, we utilize a Frame-Wise Safety Scheme that transforms the infinite-horizon safety constraint into a computationally efficient single-step manifold inclusion. To instantiate this competence manifold, we employ a Lower-Bounded Safety Estimator that distinguishes unmastered intentions from the training distribution. We then introduce an Isomorphic Latent Space (ILS) that aligns manifold geometry with safety probability, enabling efficient O(1) seamless defense against arbitrary OOD intents. Experiments demonstrate that CMP achieves up to a 10-fold survival rate improvement in typical OOD scenarios where baselines suffer catastrophic failure, incurring under 10% tracking degradation. Notably, the system exhibits emergent ``best-effort'' generalization behaviors to progressively accomplish OOD goals by adhering to the competence boundaries. Result videos are available at: https://shepherd1226.github.io/CMP.
【30】Munkres' General Topology Autoformalized in Isabelle/HOL
标题:Isabelle/HOL中自动形式化Munkres的一般拓扑学
链接:https://arxiv.org/abs/2604.07455
作者:Dustin Bryant,Jonathan Julián Huerta y Munive,Cezary Kaliszyk,Josef Urban
摘要:We describe an experiment in LLM-assisted autoformalization that produced over 85,000 lines of Isabelle/HOL code covering all 39 sections of Munkres' Topology (general topology, Chapters 2--8), from topological spaces through dimension theory. The LLM-based coding agents (initially ChatGPT 5.2 and then Claude Opus 4.6) took 24 active days to do so. The formalization is complete: all 806 formal results are fully proved with zero sorry's. Proved results include the Tychonoff theorem, the Baire category theorem, the Nagata--Smirnov and Smirnov metrization theorems, the Stone--Čech compactification, Ascoli's theorem, the space-filling curve, and others. The methodology is based on a "sorry-first" declarative proof workflow combined with bulk use of sledgehammer - two of Isabelle's major strengths. This leads to relatively fast autoformalization progress. We analyze the resulting formalization in detail, analyze the human--LLM interaction patterns from the session log, and briefly compare with related autoformalization efforts in Megalodon, HOL Light, and Naproche. The results indicate that LLM-assisted formalization of standard mathematical textbooks in Isabelle/HOL is quite feasible, cheap and fast, even if some human supervision is useful.
【31】OpenPRC: A Unified Open-Source Framework for Physics-to-Task Evaluation in Physical Reservoir Computing
标题:OpenPRC:一个统一的开源框架,用于物理水库计算中的物理到任务评估
链接:https://arxiv.org/abs/2604.07423
作者:Yogesh Phalak,Wen Sin Lor,Apoorva Khairnar,Benjamin Jantzen,Noel Naughton,Suyi Li
备注:23 pages, 7 figures
摘要:Physical Reservoir Computing (PRC) leverages the intrinsic nonlinear dynamics of physical substrates, mechanical, optical, spintronic, and beyond, as fixed computational reservoirs, offering a compelling paradigm for energy-efficient and embodied machine learning. However, the practical workflow for developing and evaluating PRC systems remains fragmented: existing tools typically address only isolated parts of the pipeline, such as substrate-specific simulation, digital reservoir benchmarking, or readout training. What is missing is a unified framework that can represent both high-fidelity simulated trajectories and real experimental measurements through the same data interface, enabling reproducible evaluation, analysis, and physics-aware optimization across substrates and data sources. We present OpenPRC, an open-source Python framework that fills this gap through a schema-driven physics-to-task pipeline built around five modules: a GPU-accelerated hybrid RK4-PBD physics engine (demlat), a video-based experimental ingestion layer (openprc.vision), a modular learning layer (reservoir), information-theoretic analysis and benchmarking tools (analysis), and physics-aware optimization (optimize). A universal HDF5 schema enforces reproducibility and interoperability, allowing GPU-simulated and experimentally acquired trajectories to enter the same downstream workflow without modification. Demonstrated capabilities include simulations of Origami tessellations, video-based trajectory extraction from a physical reservoir, and a common interface for standardized PRC benchmarking, correlation diagnostics, and capacity analysis. The longer-term vision is to serve as a standardizing layer for the PRC community, compatible with external physics engines including PyBullet, PyElastica, and MERLIN.
【32】SPAMoE: Spectrum-Aware Hybrid Operator Framework for Full-Waveform Inversion
标题:SPAMoE:一种用于全波反演的频谱感知混合算子框架
链接:https://arxiv.org/abs/2604.07421
作者:Zhenyu Wang,Peiyuan Li,Yongxiang Shi,Ruoyu Wu,Chenfei Liao,Lei Zhang
摘要:Full-waveform inversion (FWI) is pivotal for reconstructing high-resolution subsurface velocity models but remains computationally intensive and ill-posed. While deep learning approaches promise efficiency, existing Convolutional Neural Networks (CNNs) and single-paradigm Neural Operators (NOs) struggle with one fundamental issue: frequency entanglement of multi-scale geological features. To address this challenge, we propose Spectral-Preserving Adaptive MoE (SPAMoE), a novel spectrum-aware framework for solving inverse problems with complex multi-scale structures. Our approach introduces a Spectral-Preserving DINO Encoder that enforces a lower bound on the high-to-low frequency energy ratio of the encoded representation, mitigating high-frequency collapse and stabilizing subsequent frequency-domain modeling. Furthermore, we design a novel Spectral Decomposition and Routing mechanism that dynamically assigns frequency bands to a Mixture-of-Experts (MoE) ensemble comprising FNO, MNO, and LNO. On the ten OpenFWI sub-datasets, experiments show that SPAMoE reduces the average MAE by 54.1% relative to the best officially reported OpenFWI baseline, thereby establishing a new architectural framework for learning-based full-waveform inversion.
【33】Dual-Rerank: Fusing Causality and Utility for Industrial Generative Reranking
标题:Dual-Rerank:融合因果性和效用的工业生成性重新排序
链接:https://arxiv.org/abs/2604.07420
作者:Chao Zhang,Shuai Lin,ChengLei Dai,Ye Qian,Fan Mingyang,Yi Zhang,Yi Wang,Jingwei Zhuo
摘要:Kuaishou serves over 400 million daily active users, processing hundreds of millions of search queries daily against a repository of tens of billions of short videos. As the final decision layer, the reranking stage determines user experience by optimizing whole-page utility. While traditional score-and-sort methods fail to capture combinatorial dependencies, Generative Reranking offers a superior paradigm by directly modeling the permutation probability. However, deploying Generative Reranking in such a high-stakes environment faces a fundamental dual dilemma: 1) the structural trade-off where Autoregressive (AR) models offer superior sequential modeling but suffer from prohibitive latency, versus Non-Autoregressive (NAR) models that enable efficiency but lack dependency capturing; 2) the optimization gap where Supervised Learning faces challenges in directly optimizing whole-page utility, while Reinforcement Learning (RL) struggles with instability in high-throughput data streams. To resolve this, we propose Dual-Rerank, a unified framework designed for industrial reranking that bridges the structural gap via Sequential Knowledge Distillation and addresses the optimization gap using List-wise Decoupled Reranking Optimization (LDRO) for stable online RL. Extensive A/B testing on production traffic demonstrates that Dual-Rerank achieves state-of-the-art performance, significantly improving user satisfaction and watch time while drastically reducing inference latency compared to AR baselines.
【34】FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios
标题:FORGE:制造场景的细粒度多模式评估
链接:https://arxiv.org/abs/2604.07413
作者:Xiangru Jian,Hao Xu,Wei Pang,Xinjian Zhao,Chengyu Tao,Qixin Zhang,Xikun Zhang,Chao Zhang,Guanzhi Deng,Alex Xue,Juan Du,Tianshu Yu,Garth Tarr,Linqi Song,Qiuzhuang Sun,Dacheng Tao
备注:Project Page: https://ai4manufacturing.github.io/forge-web
摘要:The manufacturing sector is increasingly adopting Multimodal Large Language Models (MLLMs) to transition from simple perception to autonomous execution, yet current evaluations fail to reflect the rigorous demands of real-world manufacturing environments. Progress is hindered by data scarcity and a lack of fine-grained domain semantics in existing datasets. To bridge this gap, we introduce FORGE. We first construct a high-quality multimodal dataset that combines real-world 2D images and 3D point clouds, annotated with fine-grained domain semantics (e.g., exact model numbers). We then evaluate 18 state-of-the-art MLLMs across three manufacturing tasks, namely workpiece verification, structural surface inspection, and assembly verification, revealing significant performance gaps. Counter to conventional understanding, the bottleneck analysis shows that visual grounding is not the primary limiting factor. Instead, insufficient domain-specific knowledge is the key bottleneck, setting a clear direction for future research. Beyond evaluation, we show that our structured annotations can serve as an actionable training resource: supervised fine-tuning of a compact 3B-parameter model on our data yields up to 90.8% relative improvement in accuracy on held-out manufacturing scenarios, providing preliminary evidence for a practical pathway toward domain-adapted manufacturing MLLMs. The code and datasets are available at https://ai4manufacturing.github.io/forge-web.
【35】Physics-informed neural operators for the in situ characterization of locally reacting sound absorbers
标题:用于局部反应声吸收器的现场特征的物理信息神经运算符
链接:https://arxiv.org/abs/2604.07412
作者:Jonas M. Schmid,Johannes D. Schmid,Martin Eser,Steffen Marburg
摘要:Accurate knowledge of acoustic surface admittance or impedance is essential for reliable wave-based simulations, yet its in situ estimation remains challenging due to noise, model inaccuracies, and restrictive assumptions of conventional methods. This work presents a physics-informed neural operator approach for estimating frequency-dependent surface admittance directly from near-field measurements of sound pressure and particle velocity. A deep operator network is employed to learn the mapping from measurement data, spatial coordinates, and frequency to acoustic field quantities, while simultaneously inferring a globally consistent surface admittance spectrum without requiring an explicit forward model. The governing acoustic relations, including the Helmholtz equation, the linearized momentum equation, and Robin boundary conditions, are embedded into the training process as physics-based regularization, enabling physically consistent and noise-robust predictions while avoiding frequency-wise inversion. The method is validated using synthetically generated data from a simulation model for two planar porous absorbers under semi free-field conditions across a broad frequency range. Results demonstrate accurate reconstruction of both real and imaginary admittance components and reliable prediction of acoustic field quantities. Parameter studies confirm improved robustness to noise and sparse sampling compared to purely data-driven approaches, highlighting the potential of physics-informed neural operators for in situ acoustic material characterization.
【36】Data Warmup: Complexity-Aware Curricula for Efficient Diffusion Training
标题:数据热身:有效扩散训练的复杂性意识课程
链接:https://arxiv.org/abs/2604.07397
作者:Jinhong Lin,Pan Wang,Zitong Zhan,Lin Zhang,Pedro Morgado
备注:CVPRW in the proceedings of CVPR 2026
摘要:A key inefficiency in diffusion training occurs when a randomly initialized network, lacking visual priors, encounters gradients from the full complexity spectrum--most of which it lacks the capacity to resolve. We propose Data Warmup, a curriculum strategy that schedules training images from simple to complex without modifying the model or loss. Each image is scored offline by a semantic-aware complexity metric combining foreground dominance (how much of the image salient objects occupy) and foreground typicality (how closely the salient content matches learned visual prototypes). A temperature-controlled sampler then prioritizes low-complexity images early and anneals toward uniform sampling. On ImageNet 256x256 with SiT backbones (S/2 to XL/2), Data Warmup improves IS by up to 6.11 and FID by up to 3.41, reaching baseline quality tens of thousands of iterations earlier. Reversing the curriculum (exposing hard images first) degrades performance below the uniform baseline, confirming that the simple-to-complex ordering itself drives the gains. The method combines with orthogonal accelerators such as REPA and requires only ~10 minutes of one-time preprocessing with zero per-iteration overhead.
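The temperature-controlled sampler described in this abstract can be pictured with a small sketch. It assumes each image already has an offline complexity score, that sampling weights follow a softmax over negative complexity, and that the temperature rises geometrically so the weights flatten toward uniform. The softmax form, the schedule, the default temperatures, and all names are assumptions made for illustration, not details from the paper.

```python
import math
import random


def sampling_weights(complexity, temperature):
    """Softmax over negative complexity scores.

    Low temperature concentrates mass on simple images;
    high temperature approaches uniform sampling.
    """
    logits = [-c / temperature for c in complexity]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def temperature_at(step, total_steps, t_start=0.1, t_end=100.0):
    """Hypothetical geometric schedule from t_start to t_end."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return t_start * (t_end / t_start) ** frac


def sample_batch(complexity, step, total_steps, batch_size, rng=random):
    """Draw image indices for one training step."""
    weights = sampling_weights(complexity, temperature_at(step, total_steps))
    return rng.choices(range(len(complexity)), weights=weights, k=batch_size)
```

Because the scores are computed once and only the sampling weights change, a scheme like this adds no per-iteration model cost, consistent with the zero-overhead claim in the abstract.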
【37】Decisions and Deployment: The Five-Year SAHELI Project (2020-2025) on Restless Multi-Armed Bandits for Improving Maternal and Child Health
标题:决策和部署:SAHELI五年项目(2020-2025年),旨在改善孕产妇和儿童健康
链接:https://arxiv.org/abs/2604.07384
作者:Shresth Verma,Arpan Dasgupta,Neha Madhiwalla,Aparna Taneja,Milind Tambe
摘要:Maternal and child health is a critical concern around the world. In many global health programs disseminating preventive care and health information, limited healthcare worker resources prevent continuous, personalised engagement with vulnerable beneficiaries. In such scenarios, it becomes crucial to optimally schedule limited live-service resources to maximise long-term engagement. To address this fundamental challenge, the multi-year SAHELI project (2020-2025), in collaboration with partner NGO ARMMAN, leverages AI to allocate scarce resources in a maternal and child health program in India. The SAHELI system solves this sequential resource allocation problem using a Restless Multi-Armed Bandit (RMAB) framework. A key methodological innovation is the transition from a traditional Two-Stage "predict-then-optimize" approach to Decision-Focused Learning (DFL), which directly aligns the framework's learning method with the ultimate goal of maximizing beneficiary engagement. Empirical evaluation through large-scale randomized controlled trials demonstrates that the DFL policy reduced cumulative engagement drops by 31% relative to the current standard of care, significantly outperforming the Two-Stage model. Crucially, the studies also confirmed that this increased program engagement translates directly into statistically significant improvements in real-world health behaviors, notably the continued consumption of vital iron and calcium supplements by new mothers. Ultimately, the SAHELI project provides a scalable blueprint for applying sequential decision-making AI to optimize resource allocation in health programs.
【38】Flow Learners for PDEs: Toward a Physics-to-Physics Paradigm for Scientific Computing
标题:面向PDE的流学习器:迈向科学计算的物理到物理范式
链接:https://arxiv.org/abs/2604.07366
作者:Yilong Dai,Shengyu Chen,Xiaowei Jia,Runlong Yu
摘要:Partial differential equations (PDEs) govern nearly every physical process in science and engineering, yet solving them at scale remains prohibitively expensive. Generative AI has transformed language, vision, and protein science, but learned PDE solvers have not undergone a comparable shift. Existing paradigms each capture part of the problem. Physics-informed neural networks embed residual structure, yet they are often difficult to optimize in stiff, multiscale, or large-domain regimes. Neural operators amortize across instances, yet they commonly inherit a snapshot-prediction view of solving and can degrade over long rollouts. Diffusion-based solvers model uncertainty, yet they are often built on a solver template that still centers on state regression. We argue that the core issue is the abstraction used to train learned solvers. Many models are asked to predict states, while many scientific settings require modeling how uncertainty moves through constrained dynamics. The relevant object is transport over physically admissible futures. This motivates "flow learners": models that parameterize transport vector fields and generate trajectories through integration, echoing the continuous dynamics that define PDE evolution. This physics-to-physics alignment supports continuous-time prediction, native uncertainty quantification, and new opportunities for physics-aware solver design. We explain why transport-based learning offers a stronger organizing principle for learned PDE solving and outline the research agenda that follows from this shift.
【39】ReCodeAgent: A Multi-Agent Workflow for Language-agnostic Translation and Validation of Large-scale Repositories
标题:ReCodeAgent:一种用于大规模代码仓库的语言无关翻译与验证的多智能体工作流
链接:https://arxiv.org/abs/2604.07341
作者:Ali Reza Ibrahimzada,Brandon Paulsen,Daniel Kroening,Reyhaneh Jabbarvand
摘要:Most repository-level code translation and validation techniques have been evaluated on a single source-target programming language (PL) pair, owing to the complex engineering effort required to adapt new PL pairs. Programming agents can enable PL-agnosticism in repository-level code translation and validation: they can synthesize code across many PLs and autonomously use existing tools specific to each PL's analysis. However, state-of-the-art has yet to offer a fully autonomous agentic approach for repository-level code translation and validation of large-scale programs. This paper proposes ReCodeAgent, an autonomous multi-agent approach for language-agnostic repository-level code translation and validation. Users only need to provide the project in the source PL and specify the target PL for ReCodeAgent to automatically translate and validate the entire repository. ReCodeAgent is the first technique to achieve high translation success rates across many PLs. We compare the effectiveness of ReCodeAgent with four alternative neuro-symbolic and agentic approaches to translate 118 real-world projects, with 1,975 LoC and 43 translation units for each project, on average. The projects cover 6 PLs (C, Go, Java, JavaScript, Python, and Rust) and 4 PL pairs (C-Rust, Go-Rust, Java-Python, Python-JavaScript). Our results demonstrate that ReCodeAgent consistently outperforms prior techniques on translation correctness, improving test pass rate by 60.8% on ground-truth tests, with an average cost of $15.3. We also perform process-centric analysis of ReCodeAgent trajectories to confirm its procedural efficiency. Finally, we investigate how the design choices (a multi-agent vs. single-agent architecture) influence ReCodeAgent performance: on average, the test pass rate drops by 40.4%, and trajectories become 28% longer and persistently inefficient.
【40】Scalable Neural Decoders for Practical Fault-Tolerant Quantum Computation
标题:用于实用容错量子计算的可扩展神经解码器
链接:https://arxiv.org/abs/2604.08358
作者:Andi Gu,J. Pablo Bonilla Ataides,Mikhail D. Lukin,Susanne F. Yelin
备注:18 pages, 9 figures
摘要:Quantum error correction (QEC) is essential for scalable quantum computing. However, it requires classical decoders that are fast and accurate enough to keep pace with quantum hardware. While quantum low-density parity-check codes have recently emerged as a promising route to efficient fault tolerance, current decoding algorithms do not allow one to realize the full potential of these codes in practical settings. Here, we introduce a convolutional neural network decoder that exploits the geometric structure of QEC codes, and use it to probe a novel "waterfall" regime of error suppression, demonstrating that the logical error rates required for large-scale fault-tolerant algorithms are attainable with modest code sizes at current physical error rates, and with latencies within the real-time budgets of several leading hardware platforms. For example, for the $[144, 12, 12]$ Gross code, the decoder achieves logical error rates up to $\sim 17$x below existing decoders - reaching logical error rates $\sim 10^{-10}$ at physical error $p=0.1\%$ - with 3-5 orders of magnitude higher throughput. This decoder also produces well-calibrated confidence estimates that can significantly reduce the time overhead of repeat-until-success protocols. Taken together, these results suggest that the space-time costs associated with fault-tolerant quantum computation may be significantly lower than previously anticipated.
【41】On the Unique Recovery of Transport Maps and Vector Fields from Finite Measure-Valued Data
标题:关于从有限个测度值数据中唯一恢复传输映射和向量场
链接:https://arxiv.org/abs/2604.07671
作者:Jonah Botvinick-Greenhouse,Yunan Yang
摘要:We establish guarantees for the unique recovery of vector fields and transport maps from finite measure-valued data, yielding new insights into generative models, data-driven dynamical systems, and PDE inverse problems. In particular, we provide general conditions under which a diffeomorphism can be uniquely identified from its pushforward action on finitely many densities, i.e., when the data $\{(\rho_j,f_\#\rho_j)\}_{j=1}^m$ uniquely determines $f$. As a corollary, we introduce a new metric which compares diffeomorphisms by measuring the discrepancy between finitely many pushforward densities in the space of probability measures. We also prove analogous results in an infinitesimal setting, where derivatives of the densities along a smooth vector field are observed, i.e., when $\{(\rho_j,\text{div} (\rho_j v))\}_{j=1}^m$ uniquely determines $v$. Our analysis makes use of the Whitney and Takens embedding theorems, which provide estimates on the required number of densities $m$, depending only on the intrinsic dimension of the problem. We additionally interpret our results through the lens of Perron-Frobenius and Koopman operators and demonstrate how our techniques lead to new guarantees for the well-posedness of certain PDE inverse problems related to continuity, advection, Fokker-Planck, and advection-diffusion-reaction equations. Finally, we present illustrative numerical experiments demonstrating the unique identification of transport maps from finitely many pushforward densities, and of vector fields from finitely many weighted divergence observations.
【42】Parameter-free non-ergodic extragradient algorithms for solving monotone variational inequalities
标题:求解单调变分不等式的无参数非遍历外梯度算法
链接:https://arxiv.org/abs/2604.07662
作者:Lingqing Shen,Fatma Kılınç-Karzan
摘要:Monotone variational inequalities (VIs) provide a unifying framework for convex minimization, equilibrium computation, and convex-concave saddle-point problems. Extragradient-type methods are among the most effective first-order algorithms for such problems, but their performance hinges critically on stepsize selection. While most existing theory focuses on ergodic averages of the iterates, practical performance is often driven by the significantly stronger behavior of the last iterate. Moreover, available last-iterate guarantees typically rely on fixed stepsizes chosen using problem-specific global smoothness information, which is often difficult to estimate accurately and may not even be applicable. In this paper, we develop parameter-free extragradient methods with non-asymptotic last-iterate guarantees for constrained monotone VIs. For globally Lipschitz operators, our algorithm achieves an $o(1/\sqrt{T})$ last-iterate rate. We then extend the framework to locally Lipschitz operators via backtracking line search and obtain the same rate while preserving parameter-freeness, thereby making parameter-free last-iterate methods applicable to important problem classes for which global smoothness is unrealistic. Our numerical experiments on bilinear matrix games, LASSO, minimax group fairness, and state-of-the-art maximum entropy sampling relaxations demonstrate wide applicability of our results as well as strong last-iterate performance and significant improvements over existing methods.
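For readers unfamiliar with extragradient methods, the following is a minimal sketch of the classical fixed-stepsize extragradient iteration on the bilinear saddle problem min_x max_y x*y, whose monotone operator is F(x, y) = (y, -x). It is not the parameter-free algorithm proposed in the paper (the abstract does not give its stepsize rule); the stepsize 0.5 and the toy problem are assumptions chosen only to contrast the last iterate of extragradient with plain gradient descent-ascent.

```python
import math

def operator(z):
    """Monotone operator F(x, y) = (y, -x) of the saddle problem min_x max_y x*y."""
    x, y = z
    return (y, -x)

def extragradient(z0, step, iters):
    """Classical fixed-stepsize extragradient; returns the last iterate."""
    z = z0
    for _ in range(iters):
        g = operator(z)
        # Extrapolation: take a trial step to a look-ahead point.
        z_half = (z[0] - step * g[0], z[1] - step * g[1])
        g_half = operator(z_half)
        # Update: step from z using the operator evaluated at the look-ahead point.
        z = (z[0] - step * g_half[0], z[1] - step * g_half[1])
    return z

def gradient_descent_ascent(z0, step, iters):
    """Plain simultaneous gradient descent-ascent, for comparison."""
    z = z0
    for _ in range(iters):
        g = operator(z)
        z = (z[0] - step * g[0], z[1] - step * g[1])
    return z

def norm(z):
    return math.hypot(z[0], z[1])

start = (1.0, 1.0)
eg_last = extragradient(start, step=0.5, iters=200)
gda_last = gradient_descent_ascent(start, step=0.5, iters=200)
```

On this problem each extragradient step scales the squared distance to the solution (0, 0) by 1 - step² + step⁴, which is below 1 for step = 0.5, while each descent-ascent step scales it by 1 + step², so the last iterate of the former contracts and that of the latter spirals outward.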
【43】Exponential quantum advantage in processing massive classical data
标题:处理海量经典数据的指数量子优势
链接:https://arxiv.org/abs/2604.07639
作者:Haimeng Zhao,Alexander Zlokapa,Hartmut Neven,Ryan Babbush,John Preskill,Jarrod R. McClean,Hsin-Yuan Huang
备注:144 pages, including 9 pages of main text and 10 figures. Code available at https://github.com/haimengzhao/quantum-oracle-sketching
摘要:Broadly applicable quantum advantage, particularly in classical data processing and machine learning, has been a fundamental open problem. In this work, we prove that a small quantum computer of polylogarithmic size can perform large-scale classification and dimension reduction on massive classical data by processing samples on the fly, whereas any classical machine achieving the same prediction performance requires exponentially larger size. Furthermore, classical machines that are exponentially larger yet below the required size need superpolynomially more samples and time. We validate these quantum advantages in real-world applications, including single-cell RNA sequencing and movie review sentiment analysis, demonstrating four to six orders of magnitude reduction in size with fewer than 60 logical qubits. These quantum advantages are enabled by quantum oracle sketching, an algorithm for accessing the classical world in quantum superposition using only random classical data samples. Combined with classical shadows, our algorithm circumvents the data loading and readout bottleneck to construct succinct classical models from massive classical data, a task provably impossible for any classical machine that is not exponentially larger than the quantum machine. These quantum advantages persist even when classical machines are granted unlimited time or if BPP=BQP, and rely only on the correctness of quantum mechanics. Together, our results establish machine learning on classical data as a broad and natural domain of quantum advantage and a fundamental test of quantum mechanics at the complexity frontier.
【44】From Ground Truth to Measurement: A Statistical Framework for Human Labeling
标题:从真值(Ground Truth)到测量:人工标注的统计框架
链接:https://arxiv.org/abs/2604.07591
作者:Robert Chew,Stephanie Eckman,Christoph Kern,Frauke Kreuter
摘要:Supervised machine learning assumes that labeled data provide accurate measurements of the concepts models are meant to learn. Yet in practice, human labeling introduces systematic variation arising from ambiguous items, divergent interpretations, and simple mistakes. Machine learning research commonly treats all disagreement as noise, which obscures these distinctions and limits our understanding of what models actually learn. This paper reframes annotation as a measurement process and introduces a statistical framework for decomposing labeling outcomes into interpretable sources of variation: instance difficulty, annotator bias, situational noise, and relational alignment. The framework extends classical measurement-error models to accommodate both shared and individualized notions of truth, reflecting traditional and human label variation interpretations of error, and provides a diagnostic for assessing which regime better characterizes a given task. Applying the proposed model to a multi-annotator natural language inference dataset, we find empirical evidence for all four theorized components and demonstrate the effectiveness of our approach. We conclude with implications for data-centric machine learning and outline how this approach can guide the development of a more systematic science of labeling.
【45】Geometric Entropy and Retrieval Phase Transitions in Continuous Thermal Dense Associative Memory
标题:连续热稠密联想记忆中的几何熵和检索相变
链接:https://arxiv.org/abs/2604.07401
作者:Tatiana Petrova,Evgeny Polyachenko,Radu State
摘要:We study the thermodynamic memory capacity of modern Hopfield networks (Dense Associative Memory models) with continuous states under geometric constraints, extending classical analyses of pairwise associative memory. We derive thermodynamic phase boundaries for Dense Associative Memory networks with exponential capacity $p = e^{\alpha N}$, comparing Gaussian (LSE) and Epanechnikov (LSR) kernels. For continuous neurons on an $N$-sphere, the geometric entropy depends solely on the spherical geometry, not the kernel. In the sharp-kernel regime, the maximum theoretical capacity $\alpha = 0.5$ is achieved at zero temperature; below this threshold, a critical line separates retrieval from a spin-glass phase. The two kernels differ qualitatively in their phase boundary structure: for LSE, the retrieval region extends to arbitrarily high temperatures as $\alpha \to 0$, but interference from spurious patterns is always present. For LSR, the finite support introduces a threshold $\alpha_{\text{th}}$ below which no spurious patterns contribute to the noise floor, producing a qualitatively different retrieval regime in this sub-threshold region. These results advance the theory of high-capacity associative memory and clarify fundamental limits of retrieval robustness in modern attention-like memory architectures.
【46】Quasicrystal Architected Nanomechanical Resonators via Data-Driven Design
标题:通过数据驱动设计的准晶体构建纳米机械共振器
链接:https://arxiv.org/abs/2604.07379
作者:Kawen Li,Hangjin Cho,Richard Norte,Dongil Shin
摘要:From butterfly wings to remnants of nuclear detonation, aperiodic order repeatedly emerges in nature, often exhibiting reduced sensitivity to boundaries and symmetry constraints. Inspired by this principle, a paradigm shift is introduced in nanomechanical resonator design from periodic to aperiodic structures, focusing on a special class: quasicrystals (QCs). Although soft clamping enabled by phononic stopbands has become a central strategy for achieving high-$Q_m$ nanomechanical resonators, its practical realization has been largely confined to periodic phononic crystals, where band structure engineering is well established. The potential of aperiodic architectures, however, has remained largely unexplored, owing to their intrinsic complexity and the lack of systematic approaches to identifying and exploiting stopband behavior. Here we demonstrate that soft clamping can be realized in quasicrystal architectures and that high-$Q_m$ nanomechanical resonators can be systematically achieved through a data-driven design framework. As a representative demonstration, the 12-fold QC-based resonator exhibits a quality factor $Q_m \sim 10^7$ and an effective mass of sub-nanograms at MHz frequencies, corresponding to an exceptional force sensitivity of $26.4$~aN/$\sqrt{\text{Hz}}$ compared to previous 2D phononic crystals. These results establish QCs as a robust platform for next-generation nanomechanical resonators and open a new design regime beyond periodic order.
【47】NS-RGS: Newton-Schulz based Riemannian gradient method for orthogonal group synchronization
标题:NS-RGS:基于Newton-Schulz的黎曼梯度方法用于正交群同步
链接:https://arxiv.org/abs/2604.07372
作者:Haiyang Peng,Deren Han,Xin Chen,Meng Huang
摘要:Group synchronization is a fundamental task involving the recovery of group elements from pairwise measurements. For orthogonal group synchronization, the most common approach reformulates the problem as a constrained nonconvex optimization and solves it using projection-based methods, such as the generalized power method. However, these methods rely on exact SVD or QR decompositions in each iteration, which are computationally expensive and become a bottleneck for large-scale problems. In this paper, we propose a Newton-Schulz-based Riemannian Gradient Scheme (NS-RGS) for orthogonal group synchronization that significantly reduces computational cost by replacing the SVD or QR step with the Newton-Schulz iteration. This approach leverages efficient matrix multiplications and aligns perfectly with modern GPU/TPU architectures. By employing a refined leave-one-out analysis, we overcome the challenge arising from statistical dependencies, and establish that NS-RGS with spectral initialization achieves linear convergence to the target solution up to near-optimal statistical noise levels. Experiments on synthetic data and real-world global alignment tasks demonstrate that NS-RGS attains accuracy comparable to state-of-the-art methods such as the generalized power method, while achieving nearly a 2$\times$ speedup.
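As an illustration of how a Newton-Schulz iteration can stand in for an SVD-based projection onto the orthogonal group, here is a minimal sketch of the standard cubic iteration X ← 0.5 · X (3I − XᵀX), using only matrix multiplications. The abstract does not state which Newton-Schulz variant NS-RGS uses or how it is embedded in the Riemannian gradient step, so the Frobenius-norm scaling, the iteration count, and the helper names are assumptions.

```python
def matmul(a, b):
    """Dense matrix product of two lists-of-lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def newton_schulz_orthogonalize(m, iters=30):
    """Approximate the orthogonal polar factor of a full-rank matrix m.

    Scaling by the Frobenius norm puts every singular value in (0, 1],
    inside the region where the cubic iteration converges.
    """
    fro = sum(v * v for row in m for v in row) ** 0.5
    x = [[v / fro for v in row] for row in m]
    n = len(m[0])
    for _ in range(iters):
        xtx = matmul(transpose(x), x)
        # correction = 3I - X^T X
        correction = [[(3.0 if i == j else 0.0) - xtx[i][j] for j in range(n)]
                      for i in range(n)]
        x = [[0.5 * v for v in row] for row in matmul(x, correction)]
    return x

m = [[2.0, 1.0, 0.0],
     [0.0, 1.0, 1.0],
     [1.0, 0.0, 3.0]]
q = newton_schulz_orthogonalize(m)
qtq = matmul(transpose(q), q)  # should be close to the identity
```

Each iteration maps a singular value s to 0.5 · s · (3 − s²), which drives every singular value toward 1 while leaving the singular vectors unchanged, so the result approaches the same orthogonal factor an SVD-based projection would return.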
机器翻译由腾讯交互翻译提供,仅供参考