cs.LG: 302 papers today
Large language models (40 papers)
【1】Pay for Hints, Not Answers: LLM Shepherding for Cost-Efficient Inference
Link: https://arxiv.org/abs/2601.22132
Authors: Ziming Dong, Hardik Sharma, Evan O'Toole, Jaya Prakash Champati, Kui Wu
Abstract: Large Language Models (LLMs) deliver state-of-the-art performance on complex reasoning tasks, but their inference costs limit deployment at scale. Small Language Models (SLMs) offer dramatic cost savings yet lag substantially in accuracy. Existing approaches - routing and cascading - treat the LLM as an all-or-nothing resource: either the query bypasses the LLM entirely, or the LLM generates a complete response at full cost. We introduce LLM Shepherding, a framework that requests only a short prefix (a hint) from the LLM and provides it to the SLM. This simple mechanism is surprisingly effective for math and coding tasks: even hints comprising 10-30% of the full LLM response improve SLM accuracy significantly. Shepherding generalizes both routing and cascading, and it achieves lower cost under oracle decision-making. We develop a two-stage predictor that jointly determines whether a hint is needed and how many tokens to request. On the widely used mathematical reasoning (GSM8K, CNK12) and code generation (HumanEval, MBPP) benchmarks, Shepherding reduces costs by 42-94% relative to LLM-only inference. Compared to state-of-the-art routing and cascading baselines, shepherding delivers up to 2.8x cost reduction while matching accuracy. To our knowledge, this is the first work to exploit token-level budget control for SLM-LLM collaboration.
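A minimal sketch of the shepherding control flow the abstract describes. All model calls below (slm_generate, llm_generate_prefix, hint_predictor) are hypothetical stand-ins for the paper's components, not its actual API; the predictor here is a fixed heuristic where the paper trains one.

```python
# Sketch of LLM Shepherding: pay the LLM only for a short prefix (a hint),
# then let the cheap SLM finish the answer. Stubs stand in for real models.

def slm_generate(prompt: str) -> str:
    """Stand-in for a small language model call."""
    return "SLM answer to: " + prompt

def llm_generate_prefix(prompt: str, max_tokens: int) -> str:
    """Stand-in for requesting only the first max_tokens tokens of an LLM response."""
    return " ".join(("llm-token",) * max_tokens)

def hint_predictor(prompt: str) -> int:
    """Two-stage decision: 0 means 'no hint needed', otherwise a token budget.
    The paper learns this jointly; here it is a toy heuristic."""
    return 32 if ("prove" in prompt or "def " in prompt) else 0

def shepherded_answer(prompt: str) -> str:
    budget = hint_predictor(prompt)
    if budget == 0:
        return slm_generate(prompt)                 # SLM alone, zero LLM cost
    hint = llm_generate_prefix(prompt, budget)      # pay only for a short prefix
    return slm_generate(prompt + "\nHint: " + hint)

print(shepherded_answer("prove that the sum of two even numbers is even"))
```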
【2】A Separable Architecture for Continuous Token Representation in Language Models
Link: https://arxiv.org/abs/2601.22040
Authors: Reza T. Batley, Sourav Saha
Abstract: Transformer scaling law analyses typically treat parameters as interchangeable, an abstraction that accurately predicts loss-compute relationships. Yet, in sub-billion-parameter small language models (SLMs), embedding matrices dominate the parameter budget. This work argues that this allocation is as suboptimal as it is counterintuitive. Leviathan is an architecture with a continuous embedding generator that replaces the discrete lookup tables of canonical models. Evaluating on the Pile dataset under isoparametric settings, Leviathan consistently outperforms a standard, LLaMA-style architecture. By means of an empirical power-law fit, Leviathan exhibits a markedly superior effective parameter capacity. Across the regime studied, Leviathan behaves as a dense model with $1.47$ to $2.11 \times$ more parameters.
【3】Per-parameter Task Arithmetic for Unlearning in Large Language Models
Link: https://arxiv.org/abs/2601.22030
Authors: Chengyi Cai, Zesheng Ye, Jiangchao Yao, Jianzhong Qi, Bo Han, Xiaolu Zhang, Feng Liu, Jun Zhou
Abstract: Large language model (LLM) unlearning requires removing private information from the model. Task arithmetic unlearns by subtracting a specific task vector (TV)--defined as the parameter difference between a privacy-information-tuned model and the original model. While efficient, it can cause over-forgetting by disrupting parameters essential for retaining other information. Motivated by the observation that each parameter exhibits different importance for forgetting versus retention, we propose a per-parameter task arithmetic (PerTA) mechanism that rescales the TV with per-parameter weights. These weights quantify the relative importance of each parameter for forgetting versus retention, estimated via gradients (i.e., PerTA-grad) or the diagonal Fisher information approximation (i.e., PerTA-fisher). Moreover, we discuss the effectiveness of PerTA, extend it to a more general form, and provide further analysis. Extensive experiments demonstrate that PerTA consistently improves upon standard TV, and in many cases surpasses widely used training-based unlearning methods in both forgetting effectiveness and overall model utility. By retaining the efficiency of task arithmetic while mitigating over-forgetting, PerTA offers a principled and practical framework for LLM unlearning.
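A sketch of per-parameter task-vector rescaling in the PerTA-fisher flavor. The exact weighting is an assumption here (the ratio of diagonal Fisher information on forget versus retain data, clipped to [0, 1]); the paper defines its own importance estimate.

```python
import torch

def perta_unlearn(theta_orig, theta_forget_tuned, fisher_forget, fisher_retain, eps=1e-8):
    """Subtract the task vector elementwise, scaled by per-parameter importance.
    Weighting rule below is an illustrative assumption, not the paper's exact one."""
    new_params = {}
    for name, w0 in theta_orig.items():
        tv = theta_forget_tuned[name] - w0                    # task vector
        weight = fisher_forget[name] / (fisher_retain[name] + eps)
        weight = weight.clamp(max=1.0)                        # rescaling in [0, 1]
        new_params[name] = w0 - weight * tv                   # forget where forgetting dominates
    return new_params

# Toy usage with random tensors standing in for model weights.
shape = (4, 4)
theta = {"w": torch.randn(shape)}
theta_ft = {"w": theta["w"] + 0.1 * torch.randn(shape)}       # privacy-tuned weights
f_forget = {"w": torch.rand(shape)}
f_retain = {"w": torch.rand(shape)}
print(perta_unlearn(theta, theta_ft, f_forget, f_retain)["w"].shape)
```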
【4】From Logits to Latents: Contrastive Representation Shaping for LLM Unlearning
Link: https://arxiv.org/abs/2601.22028
Authors: Haoran Tang, Rajiv Khanna
Abstract: Most LLM unlearning methods aim to approximate retrain-from-scratch behaviors with minimal distribution shift, often via alignment-style objectives defined in the prediction space. While effective at reducing forgotten content generation, such approaches may act as suppression: forgotten concepts can persist in representations and remain entangled with retained knowledge. We introduce CLReg, a contrastive representation regularizer that identifies forget features while pushing them away from retain features, explicitly reducing forget-retain interference with minimal shifts on retain features. We provide the first theoretical insights relating representation shaping to entanglement reduction. Across unlearning benchmarks and LLMs of different sizes, CLReg decreases forget-retain representation entanglement and thereby facilitates mainstream unlearning methods without posing extra privacy risks, inspiring future work that reshapes the representation space to remove forget concepts.
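An InfoNCE-style sketch of a contrastive regularizer that pushes "forget" representations away from "retain" representations while anchoring retain features to their originals. CLReg's actual objective may differ; the loss below, including the temperature tau and anchor weight lam, is only an illustration of the representation-shaping idea.

```python
import torch
import torch.nn.functional as F

def clreg_loss(h_forget, h_retain, h_retain_ref, tau=0.1, lam=1.0):
    """Repel forget features from retain features; keep retain features stable."""
    hf = F.normalize(h_forget, dim=-1)
    hr = F.normalize(h_retain, dim=-1)
    # Repulsion: penalize similarity between forget and retain features.
    repel = torch.logsumexp(hf @ hr.T / tau, dim=-1).mean()
    # Anchor: keep retain features close to the original (frozen) model's.
    anchor = F.mse_loss(h_retain, h_retain_ref)
    return repel + lam * anchor

h_f = torch.randn(8, 64)      # hidden states on forget-set inputs
h_r = torch.randn(16, 64)     # hidden states on retain-set inputs
print(clreg_loss(h_f, h_r, h_r.clone()).item())
```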
【5】Visual-Guided Key-Token Regularization for Multimodal Large Language Model Unlearning
Link: https://arxiv.org/abs/2601.22020
Authors: Chengyi Cai, Zesheng Ye, Peike Li, Bo Han, Jianzhong Qi, Feng Liu
Abstract: Unlearning in Multimodal Large Language Models (MLLMs) prevents the model from revealing private information when queried about target images. Existing MLLM unlearning methods largely adopt approaches developed for LLMs. They treat all answer tokens uniformly, disregarding their varying importance in the unlearning process. Moreover, these methods focus exclusively on the language modality, ignoring visual cues that indicate key tokens in answers. In this paper, after formulating the problem of unlearning in multimodal question answering for MLLMs, we propose Visual-Guided Key-Token Regularization (ViKeR). We leverage irrelevant visual inputs to predict ideal post-unlearning token-level distributions and use these distributions to regularize the unlearning process, thereby prioritizing key tokens. Further, we define key tokens in unlearning via information entropy and discuss ViKeR's effectiveness through token-level gradient reweighting, which amplifies updates on key tokens. Experiments on MLLMU and CLEAR benchmarks demonstrate that our method effectively performs unlearning while mitigating forgetting and maintaining response coherence.
【6】Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units
Link: https://arxiv.org/abs/2601.21996
Authors: Jianhui Chen, Yuzhang Luo, Liangming Pan
Abstract: While Mechanistic Interpretability has identified interpretable circuits in LLMs, their causal origins in training data remain elusive. We introduce Mechanistic Data Attribution (MDA), a scalable framework that employs Influence Functions to trace interpretable units back to specific training samples. Through extensive experiments on the Pythia family, we causally validate that targeted intervention--removing or augmenting a small fraction of high-influence samples--significantly modulates the emergence of interpretable heads, whereas random interventions show no effect. Our analysis reveals that repetitive structural data (e.g., LaTeX, XML) acts as a mechanistic catalyst. Furthermore, we observe that interventions targeting induction head formation induce a concurrent change in the model's in-context learning (ICL) capability. This provides direct causal evidence for the long-standing hypothesis regarding the functional link between induction heads and ICL. Finally, we propose a mechanistic data augmentation pipeline that consistently accelerates circuit convergence across model scales, providing a principled methodology for steering the developmental trajectories of LLMs.
【7】Not All Code Is Equal: A Data-Centric Study of Code Complexity and LLM Reasoning
Link: https://arxiv.org/abs/2601.21894
Authors: Lukas Twist, Shu Yang, Hanqi Yan, Jingzhi Gong, Di Wang, Helen Yannakoudakis, Jie M. Zhang
Note: 16 pages, 5 figures, 3 tables
Abstract: Large Language Models (LLMs) increasingly exhibit strong reasoning abilities, often attributed to their capacity to generate chain-of-thought-style intermediate reasoning. Recent work suggests that exposure to code can further enhance these skills, but existing studies largely treat code as a generic training signal, leaving open the question of which properties of code actually contribute to improved reasoning. To address this gap, we study the structural complexity of code, which captures control flow and compositional structure that may shape how models internalise multi-step reasoning during fine-tuning. We examine two complementary settings: solution-driven complexity, where complexity varies across multiple solutions to the same problem, and problem-driven complexity, where complexity reflects variation in the underlying tasks. Using cyclomatic complexity and logical lines of code to construct controlled fine-tuning datasets, we evaluate a range of open-weight LLMs on diverse reasoning benchmarks. Our findings show that although code can improve reasoning, structural properties strongly determine its usefulness. In 83% of experiments, restricting fine-tuning data to a specific structural complexity range outperforms training on structurally diverse code, pointing to a data-centric path for improving reasoning beyond scaling.
【8】DASH: Deterministic Attention Scheduling for High-throughput Reproducible LLM Training
Link: https://arxiv.org/abs/2601.21824
Authors: Xinwei Qiang, Hongmin Chen, Shixuan Sun, Jingwen Leng, Xin Liu, Minyi Guo
Abstract: Determinism is indispensable for reproducibility in large language model (LLM) training, yet it often exacts a steep performance cost. In widely used attention implementations such as FlashAttention-3, the deterministic backward pass can incur up to a 37.9% throughput reduction relative to its non-deterministic counterpart, primarily because gradient accumulation operations must be serialized to guarantee numerical consistency. This performance loss stems from suboptimal scheduling of compute and gradient-reduction phases, leading to significant hardware underutilization. To address this challenge, we formulate the backward pass of deterministic attention as a scheduling problem on a Directed Acyclic Graph (DAG) and derive schedules that minimize the critical path length. Building on this formulation, we present DASH (Deterministic Attention Scheduling for High-Throughput), which encapsulates two complementary scheduling strategies: (i) Descending Q-Tile Iteration, a reversed query-block traversal that shrinks pipeline stalls in causal attention, and (ii) Shift Scheduling, a theoretically optimal schedule within our DAG model that reduces pipeline stalls for both full and causal masks. Our empirical evaluations on NVIDIA H800 GPUs demonstrate that DASH narrows the performance gap of deterministic attention. The proposed strategies improve the throughput of the attention backward pass by up to 1.28$\times$ compared to the baseline, significantly advancing the efficiency of reproducible LLM training. Our code is open-sourced at https://github.com/SJTU-Liquid/deterministic-FA3.
【9】Nonparametric LLM Evaluation from Preference Data
Link: https://arxiv.org/abs/2601.21816
Authors: Dennis Frauen, Athiya Deviyani, Mihaela van der Schaar, Stefan Feuerriegel
Abstract: Evaluating the performance of large language models (LLMs) from human preference data is crucial for obtaining LLM leaderboards. However, many existing approaches either rely on restrictive parametric assumptions or lack valid uncertainty quantification when flexible machine learning methods are used. In this paper, we propose a nonparametric statistical framework, DMLEval, for comparing and ranking LLMs from preference data using debiased machine learning (DML). For this, we introduce generalized average ranking scores (GARS), which generalize commonly used ranking models, including the Bradley-Terry model or PageRank/rank centrality, with complex human responses such as ties. DMLEval comes with the following advantages: (i) It produces statistically efficient estimates of GARS ranking scores. (ii) It naturally allows the incorporation of black-box machine learning methods for estimation. (iii) It can be combined with pre-trained LLM evaluators (e.g., using LLM-as-a-judge). (iv) It suggests optimal policies for collecting preference data under budget constraints. We demonstrate these advantages both theoretically and empirically using both synthetic and real-world preference datasets. In summary, our framework provides practitioners with powerful, state-of-the-art methods for comparing or ranking LLMs.
【10】Knowledge Vector Weakening: Efficient Training-free Unlearning for Large Vision-Language Models
Link: https://arxiv.org/abs/2601.21794
Authors: Yejin Kim, Dongjun Hwang, Sungmin Cha, Junsuk Choe
Abstract: Large Vision-Language Models (LVLMs) are widely adopted for their strong multimodal capabilities, yet they raise serious concerns such as privacy leakage and harmful content generation. Machine unlearning has emerged as a promising solution for removing the influence of specific data from trained models. However, existing approaches largely rely on gradient-based optimization, incurring substantial computational costs for large-scale LVLMs. To address this limitation, we propose Knowledge Vector Weakening (KVW), a training-free unlearning method that directly intervenes in the full model without gradient computation. KVW identifies knowledge vectors that are activated during the model's output generation on the forget set and progressively weakens their contributions, thereby preventing the model from exploiting undesirable knowledge. Experiments on the MLLMU and CLEAR benchmarks demonstrate that KVW achieves a stable forget-retain trade-off while significantly improving computational efficiency over gradient-based and LoRA-based unlearning methods.
【11】Procedural Pretraining: Warming Up Language Models with Abstract Data
Link: https://arxiv.org/abs/2601.21725
Authors: Liangze Jiang, Zachary Shinnick, Anton van den Hengel, Hemanth Saratchandran, Damien Teney
Abstract: Pretraining directly on web-scale corpora is the de facto paradigm for building language models. We study an alternative setting where the model is initially exposed to abstract structured data, as a means to ease the subsequent acquisition of rich semantic knowledge, much like humans learn simple logic and mathematics before higher reasoning. We specifically focus on procedural data, generated by formal languages and other simple algorithms, as such abstract data. We first diagnose the algorithmic skills that different forms of procedural data can improve, often significantly. For example, on context recall (Needle-in-a-haystack), the accuracy jumps from 10% to 98% when pretraining on Dyck sequences (balanced brackets). Second, we study how these gains are reflected in pretraining larger models (up to 1.3B). We find that front-loading as little as 0.1% procedural data significantly outperforms standard pretraining on natural language, code, and informal mathematics (C4, CodeParrot, and DeepMind-Math datasets). Notably, this procedural pretraining enables the models to reach the same loss value with only 55%, 67%, and 86% of the original data, respectively. Third, we explore the underlying mechanisms and find that procedural pretraining instils non-trivial structure in both attention and MLP layers. The former is particularly important for structured domains (e.g. code), and the latter for language. Finally, we lay a path for combining multiple forms of procedural data. Our results show that procedural pretraining is a simple, lightweight means to improving performance and accelerating language model pretraining, ultimately suggesting the promise of disentangling knowledge acquisition from reasoning in LLMs.
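Dyck sequences of the kind mentioned in the abstract are easy to generate. A small sketch, assuming a simple open-or-close random walk; the paper's actual data generator may differ in distribution and vocabulary.

```python
import random

def dyck_sequence(n_pairs: int, vocab=("()", "[]", "{}")) -> str:
    """Generate a balanced-bracket (Dyck) string with n_pairs bracket pairs."""
    out, stack = [], []
    opens_left = n_pairs
    while opens_left or stack:
        # Open a new bracket (if any remain) or close the most recent one.
        if opens_left and (not stack or random.random() < 0.5):
            pair = random.choice(vocab)
            out.append(pair[0])
            stack.append(pair[1])
            opens_left -= 1
        else:
            out.append(stack.pop())
    return "".join(out)

random.seed(0)
print(dyck_sequence(8))   # always balanced, e.g. '{[()]}[]{}...'
```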
【12】Curriculum Learning for LLM Pretraining: An Analysis of Learning Dynamics
Link: https://arxiv.org/abs/2601.21698
Authors: Mohamed Elgaar, Hadi Amiri
Abstract: Curriculum learning changes the order of pre-training data, but it remains unclear whether it changes the learning trajectory or mainly reorders exposure over a fixed trajectory. We train Pythia models (14M-410M parameters) for 300B tokens under three linguistically motivated curricula (Age-of-Acquisition, word frequency, and Verb Variation (VV)) and compare each against Random ordering; at 1B parameters we compare Random and VV. Across orderings, training follows a shared sequence of latent phases, while curricula mainly change within-phase data exposure. In smaller models (up to 160M parameters), Random ordering exhibits higher gradient noise and stronger late-training output-head spectral saturation, alongside lower final accuracy; curricula reduce both effects at matched compute. At larger scales, saturation differences are smaller and curriculum gains shrink. We formalize the link between difficulty pacing and optimization stability in an idealized analysis based on gradient-variance control, and our results point to a practical takeaway: curricula help by stabilizing within-phase optimization rather than by creating new phases.
【13】FIT: Defying Catastrophic Forgetting in Continual LLM Unlearning
Link: https://arxiv.org/abs/2601.21682
Authors: Xiaoyu Xu, Minxin Du, Kun Fang, Zi Liang, Yaxin Xiao, Zhicong Huang, Cheng Hong, Qingqing Ye, Haibo Hu
Note: 20 pages
Abstract: Large language models (LLMs) demonstrate impressive capabilities across diverse tasks but raise concerns about privacy, copyright, and harmful materials. Existing LLM unlearning methods rarely consider the continual and high-volume nature of real-world deletion requests, which can cause utility degradation and catastrophic forgetting as requests accumulate. To address this challenge, we introduce FIT, a framework for continual unlearning that handles large numbers of deletion requests while maintaining robustness against both catastrophic forgetting and post-unlearning recovery. FIT mitigates degradation through rigorous data Filtering, Importance-aware updates, and Targeted layer attribution, enabling stable performance across long sequences of unlearning operations and achieving a favorable balance between forgetting effectiveness and utility retention. To support realistic evaluation, we present PCH, a benchmark covering Personal information, Copyright, and Harmful content in sequential deletion scenarios, along with two symmetric metrics, Forget Degree (F.D.) and Retain Utility (R.U.), which jointly assess forgetting quality and utility preservation. Extensive experiments on four open-source LLMs with hundreds of deletion requests show that FIT achieves the strongest trade-off between F.D. and R.U., surpasses existing methods on MMLU, CommonsenseQA, and GSM8K, and remains resistant against both relearning and quantization recovery attacks.
【14】LLM4Fluid: Large Language Models as Generalizable Neural Solvers for Fluid Dynamics
Link: https://arxiv.org/abs/2601.21681
Authors: Qisong Xiao, Xinhai Chen, Qinglin Wang, Xiaowei Guo, Binglin Wang, Weifeng Chen, Zhichao Wang, Yunfei Liu, Rui Xia, Hang Zou, Gencheng Liu, Shuai Li, Jie Liu
Abstract: Deep learning has emerged as a promising paradigm for spatio-temporal modeling of fluid dynamics. However, existing approaches often suffer from limited generalization to unseen flow conditions and typically require retraining when applied to new scenarios. In this paper, we present LLM4Fluid, a spatio-temporal prediction framework that leverages Large Language Models (LLMs) as generalizable neural solvers for fluid dynamics. The framework first compresses high-dimensional flow fields into a compact latent space via reduced-order modeling enhanced with a physics-informed disentanglement mechanism, effectively mitigating spatial feature entanglement while preserving essential flow structures. A pretrained LLM then serves as a temporal processor, autoregressively predicting the dynamics of physical sequences with time series prompts. To bridge the modality gap between prompts and physical sequences, which can otherwise degrade prediction accuracy, we propose a dedicated modality alignment strategy that resolves representational mismatch and stabilizes long-term prediction. Extensive experiments across diverse flow scenarios demonstrate that LLM4Fluid functions as a robust and generalizable neural solver without retraining, achieving state-of-the-art accuracy while exhibiting powerful zero-shot and in-context learning capabilities. Code and datasets are publicly available at https://github.com/qisongxiao/LLM4Fluid.
【15】ILRR: Inference-Time Steering Method for Masked Diffusion Language Models
Link: https://arxiv.org/abs/2601.21647
Authors: Eden Avrahami, Eliya Nachmani
Abstract: Discrete Diffusion Language Models (DLMs) offer a promising non-autoregressive alternative for text generation, yet effective mechanisms for inference-time control remain relatively underexplored. Existing approaches include sampling-level guidance procedures or trajectory optimization mechanisms. In this work, we introduce Iterative Latent Representation Refinement (ILRR), a learning-free framework for steering DLMs using a single reference sequence. ILRR guides generation by dynamically aligning the internal activations of the generated sequence with those of a given reference throughout the denoising process. This approach captures and transfers high-level semantic properties, with a tunable steering scale enabling flexible control over attributes such as sentiment. We further introduce Spatially Modulated Steering, an extension that enables steering long texts using shorter references by regulating guidance intensity across the sequence. Empirically, we demonstrate that ILRR achieves effective attribute steering on LLaDA and MDLM architectures with a minor computational overhead, requiring only one additional parallel forward pass per denoising step. Under the same compute budget, ILRR improves attribute accuracy over comparable baselines by 10 to 60 percentage points, while maintaining high generation quality.
【16】LAMP: Look-Ahead Mixed-Precision Inference of Large Language Models
Link: https://arxiv.org/abs/2601.21623
Authors: Stanislav Budzinskiy, Marian Gloser, Tolunay Yilmaz, Ying Hong Tham, Yuanyi Lin, Wenyi Fang, Fan Wu, Philipp Petersen
Abstract: Mixed-precision computations are a hallmark of the current stage of AI, driving the progress in large language models towards efficient, locally deployable solutions. This article addresses the floating-point computation of compositionally-rich functions, concentrating on transformer inference. Based on the rounding error analysis of a composition $f(g(\mathrm{x}))$, we provide an adaptive strategy that selects a small subset of components of $g(\mathrm{x})$ to be computed more accurately while all other computations can be carried out with lower accuracy. We then explain how this strategy can be applied to different compositions within a transformer and illustrate its overall effect on transformer inference. We study the effectiveness of this algorithm numerically on GPT-2 models and demonstrate that already very low recomputation rates allow for improvements of up to two orders of magnitude in accuracy.
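A toy illustration of the adaptive idea: compute $g(\mathrm{x})$ in low precision, then recompute only the few components to which $f$ is most sensitive in high precision. The paper's selection rule comes from a rounding-error analysis; the top-|gradient| heuristic, the 2% recomputation rate, and the log-sum-exp outer function below are simplified stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((512, 512))
x = rng.standard_normal(512)

def f(y):
    """Outer function of the composition f(g(x)); here a scaled log-sum-exp."""
    return np.log(np.sum(np.exp(y / np.sqrt(len(y)))))

# g(x) = A @ x in low precision (float16), promoted back for comparison.
y_lo = (A.astype(np.float16) @ x.astype(np.float16)).astype(np.float64)

# Sensitivity proxy: |df/dy_i| is proportional to exp(y_i / sqrt(n)).
grad = np.exp(y_lo / np.sqrt(len(y_lo)))
top = np.argsort(-np.abs(grad))[: int(0.02 * len(y_lo))]   # recompute only 2%
y_mixed = y_lo.copy()
y_mixed[top] = A[top] @ x                                   # high-precision rows only

y_hi = A @ x
print("low-precision error:  ", abs(f(y_lo) - f(y_hi)))
print("mixed-precision error:", abs(f(y_mixed) - f(y_hi)))  # typically much smaller
```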
【17】Scalable Power Sampling: Unlocking Efficient, Training-Free Reasoning for LLMs via Distribution Sharpening
Link: https://arxiv.org/abs/2601.21590
Authors: Xiaotong Ji, Rasul Tutunov, Matthieu Zimmer, Haitham Bou Ammar
Abstract: Reinforcement learning (RL) post-training is a dominant approach for improving the reasoning performance of large language models (LLMs), yet growing evidence suggests that its gains arise primarily from distribution sharpening rather than the acquisition of new capabilities. Recent work has shown that sampling from the power distribution of LLMs using Markov chain Monte Carlo (MCMC) can recover performance comparable to RL post-training without relying on external rewards; however, the high computational cost of MCMC makes such approaches impractical for widespread adoption. In this work, we propose a theoretically grounded alternative that eliminates the need for iterative MCMC. We derive a novel formulation showing that the global power distribution can be approximated by a token-level scaled low-temperature one, where the scaling factor captures future trajectory quality. Leveraging this insight, we introduce a training-free and verifier-free algorithm that sharpens the base model's generative distribution autoregressively. Empirically, we evaluate our method on math, QA, and code tasks across four LLMs, and show that our method matches or surpasses one-shot GRPO without relying on any external rewards, while reducing inference latency by over 10x compared to MCMC-based sampling.
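A token-level sketch of the base sharpening step: sampling from a power of the next-token distribution, $p^{\alpha}$ renormalized, which equals low-temperature sampling with temperature $1/\alpha$. The paper additionally multiplies in a factor for future trajectory quality; that term is omitted here, so this is only the simplest piece of the method.

```python
import numpy as np

def sharpened_sample(logits: np.ndarray, alpha: float, rng) -> int:
    """p^alpha normalized == softmax(alpha * logits): low-temperature sampling."""
    z = alpha * logits
    p = np.exp(z - z.max())
    p /= p.sum()
    return rng.choice(len(p), p=p)

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.5, 0.2, -1.0])
for alpha in (1.0, 4.0):
    draws = [sharpened_sample(logits, alpha, rng) for _ in range(1000)]
    # Larger alpha concentrates mass on the highest-probability token.
    print(alpha, np.bincount(draws, minlength=4) / 1000)
```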
【18】More Bang for the Buck: Improving the Inference of Large Language Models at a Fixed Budget using Reset and Discard (ReD)
Link: https://arxiv.org/abs/2601.21522
Authors: Sagi Meir, Tommer D. Keidar, Noam Levi, Shlomi Reuveni, Barak Hirshberg
Abstract: The performance of large language models (LLMs) on verifiable tasks is usually measured by pass@k, the probability of answering a question correctly at least once in k trials. At a fixed budget, a more suitable metric is coverage@cost, the average number of unique questions answered as a function of the total number of attempts. We connect the two metrics and show that the empirically observed power-law behavior in pass@k leads to sublinear growth of coverage@cost (diminishing returns). To solve this problem, we propose Reset-and-Discard (ReD), a query method for LLMs that increases coverage@cost for any given budget, regardless of the pass@k form. Moreover, given a pass@k, we can quantitatively predict the savings in the total number of attempts using ReD. If pass@k is not available for the model, ReD can infer its power-law exponent. Experiments on three LLMs using HumanEval demonstrate that ReD substantially reduces the required attempts, tokens, and USD cost to reach a desired coverage, while also offering an efficient way to measure inference power-laws.
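A small numeric illustration of the metric connection, not of ReD itself: if pass@k follows a power law, spreading a fixed budget uniformly over questions yields sublinear coverage@cost. The functional form pass@k = 1 - a*k^(-b) and the constants a, b below are illustrative assumptions.

```python
def coverage_at_cost(total_attempts: int, n_questions: int, a=0.7, b=0.4) -> float:
    """Expected unique questions solved under uniform attempt allocation,
    assuming pass@k = 1 - a * k^(-b) (an assumed empirical power law)."""
    k = max(total_attempts / n_questions, 1.0)   # attempts per question
    return n_questions * (1.0 - a * k ** (-b))

for budget in (100, 1000, 10000):
    print(budget, round(coverage_at_cost(budget, n_questions=100), 1))
# 10x more attempts buys ever-smaller coverage gains: the diminishing
# returns that ReD is designed to counteract.
```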
【19】MAR: Efficient Large Language Models via Module-aware Architecture Refinement
Link: https://arxiv.org/abs/2601.21503
Authors: Junhong Cai, Guiqin Wang, Kejie Zhao, Jianxiong Tang, Xiang Wang, Luziwei Leng, Ran Cheng, Yuxin Ma, Qinghai Guo
Note: Accepted by ICASSP 2026. 5 pages, 5 figures
Abstract: Large Language Models (LLMs) excel across diverse domains but suffer from high energy costs due to quadratic attention and dense Feed-Forward Network (FFN) operations. To address these issues, we propose Module-aware Architecture Refinement (MAR), a two-stage framework that integrates State Space Models (SSMs) for linear-time sequence modeling and applies activation sparsification to reduce FFN costs. In addition, to mitigate low information density and temporal mismatch in integrating Spiking Neural Networks (SNNs) with SSMs, we design the Adaptive Ternary Multi-step Neuron (ATMN) and the Spike-aware Bidirectional Distillation Strategy (SBDS). Extensive experiments demonstrate that MAR effectively restores the performance of its dense counterpart under constrained resources while substantially reducing inference energy consumption. Furthermore, it outperforms efficient models of comparable or even larger scale, underscoring its potential for building efficient and practical LLMs.
【20】Task-Awareness Improves LLM Generations and Uncertainty
Link: https://arxiv.org/abs/2601.21500
Authors: Tim Tomov, Dominik Fuchsgruber, Stephan Günnemann
Abstract: In many applications of LLMs, natural language responses often have an underlying structure such as representing discrete labels, numerical values, or graphs. Yet, existing decoding and uncertainty estimation methods operate only in language space and largely disregard structural information. We address this by modeling LLM outputs directly in a task-dependent latent structure. By equipping this structure with a dissimilarity measure, we can compute Bayes-optimal responses. These are not selected from sampled generations but are newly synthesized by combining individual responses in the latent space. Across different tasks, Bayes-optimal responses consistently outperform standard decoding methods like beam search. Moreover, quantifying uncertainty via the induced Bayesian risk captures variations in terms of the latent structure and improves alignment with output quality and correctness. Our decision-theoretic framework is applicable to any problem that admits a latent response structure and enables reliable task-aware LLM predictions.
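For one simple latent structure this recipe is concrete: with numeric answers and squared-error dissimilarity, the Bayes-optimal response over sampled generations is their latent mean (a value that may never appear verbatim among the samples), and the Bayesian risk is the latent variance. A toy sketch; real tasks would use task-specific parsers and dissimilarities.

```python
import numpy as np

samples = ["12.0", "11.5", "12.5", "40.0"]        # strings sampled from an LLM
latent = np.array([float(s) for s in samples])    # map to the latent structure

bayes_optimal = latent.mean()   # argmin of expected squared loss over samples
risk = latent.var()             # induced Bayesian risk, used as uncertainty
print(f"response={bayes_optimal:.2f}, uncertainty={risk:.2f}")
```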
【21】Best Arm Identification with LLM Judges and Limited Human
Link: https://arxiv.org/abs/2601.21471
Authors: Ruicheng Ao, Hongyu Chen, Siyang Gao, Hanwei Li, David Simchi-Levi
Note: 22 pages, 3 figures
Abstract: We study fixed-confidence best-arm identification (BAI) where a cheap but potentially biased proxy (e.g., LLM judge) is available for every sample, while an expensive ground-truth label can only be acquired selectively when using a human for auditing. Unlike classical multi-fidelity BAI, the proxy is biased (arm- and context-dependent) and ground truth is selectively observed. Consequently, standard multi-fidelity methods can mis-select the best arm, and uniform auditing, though accurate, wastes scarce resources and is inefficient. We prove that without bias correction and propensity adjustment, mis-selection probability may not vanish (even with unlimited proxy data). We then develop an estimator for the mean of each arm that combines proxy scores with inverse-propensity-weighted residuals and form anytime-valid confidence sequences for that estimator. Based on the estimator and confidence sequence, we propose an algorithm that adaptively selects and audits arms. The algorithm concentrates audits on unreliable contexts and close arms and we prove that a plug-in Neyman rule achieves near-oracle audit efficiency. Numerical experiments confirm the theoretical guarantees and demonstrate the superior empirical performance of the proposed algorithm.
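A sketch of the bias-corrected arm-mean estimate: proxy scores everywhere, plus inverse-propensity-weighted residuals on the audited subset. The propensity pi (probability a sample is sent for human audit), the noise model, and the constants are illustrative; the paper's anytime-valid confidence sequences are omitted.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
truth = rng.binomial(1, 0.6, size=n).astype(float)          # ground-truth outcomes
proxy = np.clip(truth + rng.normal(0.15, 0.3, n), 0, 1)     # biased LLM-judge score
pi = 0.1                                                    # audit 10% of samples
audited = rng.random(n) < pi                                # truth observed only here

naive = proxy.mean()                                        # inherits the judge's bias
# Residual term is nonzero only on audited samples, reweighted by 1/pi.
corrected = (proxy + audited / pi * (truth - proxy)).mean() # unbiased for E[truth]
print(f"true={truth.mean():.3f}  naive proxy={naive:.3f}  corrected={corrected:.3f}")
```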
【22】HER: Human-like Reasoning and Reinforcement Learning for LLM Role-playing
Link: https://arxiv.org/abs/2601.21459
Authors: Chengyu Du, Xintao Wang, Aili Chen, Weiyuan Li, Rui Xu, Junteng Liu, Zishan Huang, Rong Tian, Zijun Sun, Yuhao Li, Liheng Feng, Deming Ding, Pengyu Zhao, Yanghua Xiao
Note: 41 pages, 10 figures
Abstract: LLM role-playing, i.e., using LLMs to simulate specific personas, has emerged as a key capability in various applications, such as companionship, content creation, and digital games. While current models effectively capture character tones and knowledge, simulating the inner thoughts behind their behaviors remains a challenge. Towards cognitive simulation in LLM role-play, previous efforts mainly suffer from two deficiencies: data with high-quality reasoning traces, and reliable reward signals aligned with human preferences. In this paper, we propose HER, a unified framework for cognitive-level persona simulation. HER introduces dual-layer thinking, which distinguishes characters' first-person thinking from LLMs' third-person thinking. To bridge these gaps, we curate reasoning-augmented role-playing data via reverse engineering and construct human-aligned principles and reward models. Leveraging these resources, we train HER models based on Qwen3-32B via supervised and reinforcement learning. Extensive experiments validate the effectiveness of our approach. Notably, our models significantly outperform the Qwen3-32B baseline, achieving a 30.26-point improvement on the CoSER benchmark and a 14.97-point gain on the Minimax Role-Play Bench. Our datasets, principles, and models will be released to facilitate future research.
【23】Accurate Network Traffic Matrix Prediction via LEAD: an LLM-Enhanced Adapter-Based Conditional Diffusion Model
Link: https://arxiv.org/abs/2601.21437
Authors: Yu Sun, Yaqiong Liu, Nan Cheng, Jiayuan Li, Zihan Jia, Xialin Du, Mugen Peng
Abstract: Driven by the evolution toward 6G and AI-native edge intelligence, network operations increasingly require predictive and risk-aware adaptation under stringent computation and latency constraints. The Network Traffic Matrix (TM), which characterizes flow volumes between nodes, is a fundamental signal for proactive traffic engineering. However, accurate TM forecasting remains challenging due to the stochastic, non-linear, and bursty nature of network dynamics. Existing discriminative models often suffer from over-smoothing and provide limited uncertainty awareness, leading to poor fidelity under extreme bursts. To address these limitations, we propose LEAD, a Large Language Model (LLM)-Enhanced Adapter-based conditional Diffusion model. First, LEAD adopts a "Traffic-to-Image" paradigm to transform traffic matrices into RGB images, enabling global dependency modeling via vision backbones. Then, we design a "Frozen LLM with Trainable Adapter" model, which efficiently captures temporal semantics with limited computational cost. Moreover, we propose a Dual-Conditioning Strategy to precisely guide a diffusion model to generate complex, dynamic network traffic matrices. Experiments on the Abilene and GEANT datasets demonstrate that LEAD outperforms all baselines. On the Abilene dataset, LEAD attains a remarkable 45.2% reduction in RMSE against the best baseline, with the error margin rising only marginally from 0.1098 at one-step to 0.1134 at 20-step predictions. Meanwhile, on the GEANT dataset, LEAD achieves a 0.0258 RMSE at the 20-step prediction horizon, which is 27.3% lower than the best baseline.
【24】Rethinking Federated Graph Foundation Models: A Graph-Language Alignment-based Approach
Link: https://arxiv.org/abs/2601.21369
Authors: Yinlin Zhu, Di Wu, Xianzhi Zhang, Yuming Ai, Xunkai Li, Miao Hu, Guocong Quan
Note: Under review. E-mail: zhuylin27@mail2.sysu.edu.cn
Abstract: Recent studies of federated graph foundation models (FedGFMs) break the idealized and untenable assumption of having centralized data storage to train graph foundation models, and accommodate the reality of distributed, privacy-restricted data silos. Despite their simplicity and intuition, existing studies that project aligned generalizable knowledge onto a discrete token space via vector-quantized backbones suffer from irreversible knowledge loss during the quantization process. In this context, we argue that reconciling the semantic-structural orthogonality and integrity between pre-trained language models (PLMs) and graph neural networks (GNNs) is paramount for developing effective FedGFMs while simultaneously mitigating the severe data heterogeneity and communication constraints inherent in distributed, resource-limited environments. To address these issues, we propose FedGALA (Federated Graph And Language Alignment), a framework that resolves graph-based semantic-structural orthogonality and integrity in federated settings by employing unsupervised contrastive learning to align GNNs and frozen PLMs within a continuous embedding space, thereby capturing robust, transferable general knowledge. Subsequently, FedGALA leverages a communication-efficient prompt tuning mechanism to steer these pre-aligned encoders and frozen PLMs, facilitating effective adaptation to diverse downstream tasks while circumventing the prohibitive overhead of full-parameter fine-tuning. Comprehensive experiments validate that FedGALA outperforms all competitive baselines across multi-domain datasets on multiple tasks, with up to 14.37% performance improvement.
【25】Theoretically Optimal Attention/FFN Ratios in Disaggregated LLM Serving
Link: https://arxiv.org/abs/2601.21351
Authors: Chendong Song, Meixuan Wang, Hang Zhou, Hong Liang, Yuan Lyu, Zixi Chen, Yuwei Fan, Zijie Zhou
Note: Submitted to ICML 2026
Abstract: Attention-FFN disaggregation (AFD) is an emerging architecture for LLM decoding that separates state-heavy, KV-cache-dominated Attention computation from stateless, compute-intensive FFN computation, connected by per-step communication. While AFD enables independent scaling of memory and compute resources, its performance is highly sensitive to the Attention/FFN provisioning ratio: mis-sizing induces step-level blocking and costly device idle time. We develop a tractable analytical framework for sizing AFD bundles in an $r$A-$1$F topology, where the key difficulty is that Attention-side work is nonstationary-token context grows and requests are continuously replenished with random lengths-while FFN work is stable given the aggregated batch. Using a probabilistic workload model, we derive closed-form rules for the optimal A/F ratio that maximize average throughput per instance across the system. A trace-calibrated AFD simulator validates the theory: across workloads, the theoretical optimal A/F ratio matches the simulation-optimal within 10%, and consistently reduces idle time.
【26】DUET: Distilled LLM Unlearning from an Efficiently Contextualized Teacher
Link: https://arxiv.org/abs/2601.21283
Authors: Yisheng Zhong, Zhengbang Yang, Zhuangdi Zhu
Abstract: LLM unlearning is a technique to remove the impacts of undesirable knowledge from the model without retraining from scratch, which is indispensable for trustworthy AI. Existing unlearning methods face significant limitations: conventional tuning-based unlearning is computationally heavy and prone to catastrophic forgetting. In contrast, in-contextualized unlearning is lightweight for precise unlearning but vulnerable to prompt removal or reverse engineering attacks. In response, we propose Distilled Unlearning from an Efficient Teacher (DUET), a novel distillation-based unlearning method that combines the merits of these two lines of work. It learns a student model to imitate the behavior of a prompt-steered teacher that effectively refuses undesirable knowledge generation while preserving general domain knowledge. Extensive evaluations on existing benchmarks with our enriched evaluation protocols demonstrate that DUET achieves higher performance in both forgetting and utility preservation, while being orders of magnitude more data-efficient than state-of-the-art unlearning methods.
【27】Scaling Reasoning Hop Exposes Weaknesses: Demystifying and Improving Hop Generalization in Large Language Models
Link: https://arxiv.org/abs/2601.21214
Authors: Zhaoyi Li, Jiatong Li, Gangwei Jiang, Linqi Song, Defu Lian, Ying Wei
Note: 52 pages, accepted by ICLR 2026 main conference
Abstract: Chain-of-thought (CoT) reasoning has become the standard paradigm for enabling Large Language Models (LLMs) to solve complex problems. However, recent studies reveal a sharp performance drop in reasoning hop generalization scenarios, where the required number of reasoning steps exceeds training distributions while the underlying algorithm remains unchanged. The internal mechanisms driving this failure remain poorly understood. In this work, we conduct a systematic study on tasks from multiple domains, and find that errors concentrate at token positions of a few critical error types, rather than being uniformly distributed. Closer inspection reveals that these token-level erroneous predictions stem from internal competition mechanisms: certain attention heads, termed erroneous processing heads (ep heads), tip the balance by amplifying incorrect reasoning trajectories while suppressing correct ones. Notably, removing individual ep heads during inference can often restore the correct predictions. Motivated by these insights, we propose test-time correction of reasoning, a lightweight intervention method that dynamically identifies and deactivates ep heads in the reasoning process. Extensive experiments across different tasks and LLMs show that it consistently improves reasoning hop generalization, highlighting both its effectiveness and potential.
【28】Scaling Embeddings Outperforms Scaling Experts in Language Models
Link: https://arxiv.org/abs/2601.21204
Authors: Hong Liu, Jiaqi Zhang, Chao Wang, Xing Hu, Linkun Lyu, Jiaqi Sun, Xurui Yang, Bo Wang, Fengcun Li, Yulei Qian, Lingtong Si, Yerui Sun, Rumei Li, Peng Pei, Yuchen Xie, Xunliang Cai
Abstract: While Mixture-of-Experts (MoE) architectures have become the standard for sparsity scaling in large language models, they increasingly face diminishing returns and system-level bottlenecks. In this work, we explore embedding scaling as a potent, orthogonal dimension for scaling sparsity. Through a comprehensive analysis and experiments, we identify specific regimes where embedding scaling achieves a superior Pareto frontier compared to expert scaling. We systematically characterize the critical architectural factors governing this efficacy -- ranging from parameter budgeting to the interplay with model width and depth. Moreover, by integrating tailored system optimizations and speculative decoding, we effectively convert this sparsity into tangible inference speedups. Guided by these insights, we introduce LongCat-Flash-Lite, a 68.5B parameter model with ~3B activated parameters, trained from scratch. Despite allocating over 30B parameters to embeddings, LongCat-Flash-Lite not only surpasses parameter-equivalent MoE baselines but also exhibits exceptional competitiveness against existing models of comparable scale, particularly in agentic and coding domains.
【29】FRISM: Fine-Grained Reasoning Injection via Subspace-Level Model Merging for Vision-Language Models
Link: https://arxiv.org/abs/2601.21187
Authors: Chenyu Huang, Peng Ye, Xudong Tan, Jinhan Mu, Shenghe Zheng, Li Shen, Tao Chen
Note: 23 pages, 8 figures
Abstract: Efficiently enhancing the reasoning capabilities of Vision-Language Models (VLMs) by merging them with Large Reasoning Models (LRMs) has emerged as a promising direction. However, existing methods typically operate at a coarse-grained layer level, which often leads to a trade-off between injecting reasoning capabilities and preserving visual capabilities. To address this limitation, we propose FRISM (Fine-grained Reasoning Injection via Subspace-level model Merging), a fine-grained reasoning injection framework based on subspace-level model merging. Observing that reasoning capabilities are encoded in distinct subspaces, FRISM decomposes LRM task vectors via Singular Value Decomposition (SVD) and adaptively tunes the scaling coefficients of each subspace through learning to realize fine-grained reasoning injection. Furthermore, we introduce a label-free self-distillation learning strategy with a dual-objective optimization using common vision-language perception datasets. Extensive experiments demonstrate that FRISM effectively improves reasoning capabilities without compromising the model's original visual capabilities by consistently achieving state-of-the-art performance across diverse visual reasoning benchmarks.
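A sketch of subspace-level merging for a single weight matrix: decompose the task vector with SVD and rescale each singular direction by a coefficient before adding it back. The coefficients are what FRISM tunes with its label-free objective; the constant 0.5 initialization and toy shapes below are assumptions.

```python
import torch

def merge_subspaces(w_vlm: torch.Tensor, tau: torch.Tensor, coeffs=None):
    """Add a per-subspace-rescaled task vector to a VLM weight matrix."""
    U, S, Vh = torch.linalg.svd(tau, full_matrices=False)
    if coeffs is None:
        coeffs = torch.full_like(S, 0.5)            # one learnable scale per subspace
    tau_scaled = U @ torch.diag(coeffs * S) @ Vh    # reassembled, rescaled task vector
    return w_vlm + tau_scaled

w = torch.randn(64, 64)                 # VLM weight (toy size)
tau = 0.01 * torch.randn(64, 64)        # LRM weights minus shared base weights
print(merge_subspaces(w, tau).shape)
```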
【30】Human-LLM Collaborative Feature Engineering for Tabular Data
Link: https://arxiv.org/abs/2601.21060
Authors: Zhuoyan Li, Aditya Bansal, Jinzhao Li, Shishuang He, Zhuoran Lu, Mutian Zhang, Qin Liu, Yiwei Yang, Swati Jain, Ming Yin, Yunyao Li
Note: ICLR 2026
Abstract: Large language models (LLMs) are increasingly used to automate feature engineering in tabular learning. Given task-specific information, LLMs can propose diverse feature transformation operations to enhance downstream model performance. However, current approaches typically assign the LLM as a black-box optimizer, responsible for both proposing and selecting operations based solely on its internal heuristics, which often lack calibrated estimations of operation utility and consequently lead to repeated exploration of low-yield operations without a principled strategy for prioritizing promising directions. In this paper, we propose a human-LLM collaborative feature engineering framework for tabular learning. We begin by decoupling the transformation operation proposal and selection processes, where LLMs are used solely to generate operation candidates, while the selection is guided by explicitly modeling the utility and uncertainty of each proposed operation. Since accurate utility estimation can be difficult especially in the early rounds of feature engineering, we design a mechanism within the framework that selectively elicits and incorporates human expert preference feedback, comparing which operations are more promising, into the selection process to help identify more effective operations. Our evaluations on both the synthetic study and the real user study demonstrate that the proposed framework improves feature engineering performance across a variety of tabular datasets and reduces users' cognitive load during the feature engineering process.
【31】Llama-3.1-FoundationAI-SecurityLLM-Reasoning-8B Technical Report
Link: https://arxiv.org/abs/2601.21051
Authors: Zhuoran Yang, Ed Li, Jianliang He, Aman Priyanshu, Baturay Saglam, Paul Kassianik, Sajana Weerawardhena, Anu Vellore, Blaine Nelson, Neusha Javidnia, Arthur Goldblatt, Fraser Burch, Avi Zohary, Assaf Eisenman, Mahdi Sabbaghi, Supriti Vijay, Rahim Dharssi, Dhruv Kedia, Kojin Oshiba, Yaron Singer, Amin Karbasi
Note: 31 pages, 5 figures, 7 tables
Abstract: We present Foundation-Sec-8B-Reasoning, the first open-source native reasoning model for cybersecurity. Built upon our previously released Foundation-Sec-8B base model (derived from Llama-3.1-8B-Base), the model is trained through a two-stage process combining supervised fine-tuning (SFT) and reinforcement learning from verifiable rewards (RLVR). Our training leverages proprietary reasoning data spanning cybersecurity analysis, instruction-following, and mathematical reasoning. Evaluation across 10 cybersecurity benchmarks and 10 general-purpose benchmarks demonstrates performance competitive with significantly larger models on cybersecurity tasks while maintaining strong general capabilities. The model shows effective generalization on multi-hop reasoning tasks and strong safety performance when deployed with appropriate system prompts and guardrails. This work demonstrates that domain-specialized reasoning models can achieve strong performance on specialized tasks while maintaining broad general capabilities. We release the model publicly at https://huggingface.co/fdtn-ai/Foundation-Sec-8B-Reasoning.
【32】Noisy but Valid: Robust Statistical Evaluation of LLMs with Imperfect Judges
Link: https://arxiv.org/abs/2601.20913
Authors: Chen Feng, Minghe Shen, Ananth Balashankar, Carsten Gerner-Beuerle, Miguel R. D. Rodrigues
Note: Accepted to ICLR 2026
Abstract: Reliable certification of Large Language Models (LLMs)-verifying that failure rates are below a safety threshold-is critical yet challenging. While "LLM-as-a-Judge" offers scalability, judge imperfections, noise, and bias can invalidate statistical guarantees. We introduce a "Noisy but Valid" hypothesis testing framework to address this. By leveraging a small human-labelled calibration set to estimate the judge's True Positive and False Positive Rates (TPR/FPR), we derive a variance-corrected critical threshold applied to a large judge-labelled dataset. Crucially, our framework theoretically guarantees finite-sample Type-I error control (validity) despite calibration uncertainty. This distinguishes our work from Prediction-Powered Inference (PPI), positioning our method as a diagnostic tool that explicitly models judge behavior rather than a black-box estimator. Our contributions include: (1) Theoretical Guarantees: We derive the exact conditions under which noisy testing yields higher statistical power than direct evaluation; (2) Empirical Validation: Experiments on Jigsaw Comment, Hate Speech and SafeRLHF confirm our theory; (3) The Oracle Gap: We reveal a significant performance gap between practical methods and the theoretical "Oracle" (perfectly known judge parameters), quantifying the cost of estimation. Specifically, we provide the first systematic treatment of the imperfect-judge setting, yielding interpretable diagnostics of judge reliability and clarifying how evaluation power depends on judge quality, dataset size, and certification levels. Together, these results sharpen understanding of statistical evaluation with LLM judges, and highlight trade-offs among competing inferential tools.
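The point estimate underlying this setup is simple: if a judge flags failures with known TPR and FPR, the observed flag rate q mixes the true failure rate p as q = TPR*p + (1-p)*FPR, so p = (q - FPR) / (TPR - FPR). The sketch below shows only that correction with a first-order standard error; the paper's variance-corrected threshold additionally propagates calibration-set uncertainty, which is omitted here.

```python
import math

def corrected_failure_rate(q: float, tpr: float, fpr: float, n: int):
    """Invert the judge's confusion mixture to estimate the true failure rate.
    Standard error covers only the judge-sample term (a simplification)."""
    p = (q - fpr) / (tpr - fpr)
    se = math.sqrt(q * (1 - q) / n) / abs(tpr - fpr)
    return p, se

p, se = corrected_failure_rate(q=0.12, tpr=0.9, fpr=0.05, n=20000)
print(f"estimated failure rate: {p:.4f} +/- {1.96 * se:.4f}")
```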
【33】TwinWeaver: An LLM-Based Foundation Model Framework for Pan-Cancer Digital Twins
Link: https://arxiv.org/abs/2601.20906
Authors: Nikita Makarov, Maria Bordukova, Lena Voith von Voithenberg, Estrella Pivel-Villanueva, Sabrina Mielke, Jonathan Wickes, Hanchen Wang, Mingyu Derek Ma, Keunwoo Choi, Kyunghyun Cho, Stephen Ra, Raul Rodriguez-Esteban, Fabian Schmich, Michael Menden
Abstract: Precision oncology requires forecasting clinical events and trajectories, yet modeling sparse, multi-modal clinical time series remains a critical challenge. We introduce TwinWeaver, an open-source framework that serializes longitudinal patient histories into text, enabling unified event prediction as well as forecasting with large language models, and use it to build Genie Digital Twin (GDT) on 93,054 patients across 20 cancer types. In benchmarks, GDT significantly reduces forecasting error, achieving a median Mean Absolute Scaled Error (MASE) of 0.87 compared to 0.97 for the strongest time-series baseline (p<0.001). Furthermore, GDT improves risk stratification, achieving an average concordance index (C-index) of 0.703 across survival, progression, and therapy switching tasks, surpassing the best baseline of 0.662. GDT also generalizes to out-of-distribution clinical trials, matching trained baselines at zero-shot and surpassing them with fine-tuning, achieving a median MASE of 0.75-0.88 and outperforming the strongest baseline in event prediction with an average C-index of 0.672 versus 0.648. Finally, TwinWeaver enables an interpretable clinical reasoning extension, providing a scalable and transparent foundation for longitudinal clinical modeling.
【34】Text-only adaptation in LLM-based ASR through text denoising
Link: https://arxiv.org/abs/2601.20900
Authors: Sergio Burdisso, Esaú Villatoro-Tello, Andrés Carofilis, Shashi Kumar, Kadri Hacioglu, Srikanth Madikeri, Pradeep Rangappa, Manjunath K E, Petr Motlicek, Shankar Venkatesan, Andreas Stolcke
Note: Paper accepted at ICASSP 2026
Abstract: Adapting automatic speech recognition (ASR) systems based on large language models (LLMs) to new domains using text-only data is a significant yet underexplored challenge. Standard fine-tuning of the LLM on target-domain text often disrupts the critical alignment between speech and text modalities learned by the projector, degrading performance. We introduce a novel text-only adaptation method that emulates the audio projection task by treating it as a text denoising task. Our approach thus trains the LLM to recover clean transcripts from noisy inputs. This process effectively adapts the model to a target domain while preserving cross-modal alignment. Our solution is lightweight, requiring no architectural changes or additional parameters. Extensive evaluation on two datasets demonstrates up to 22.1% relative improvement, outperforming recent state-of-the-art text-only adaptation methods.
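A sketch of how (noisy, clean) training pairs could be built from target-domain text alone, so the LLM learns a denoising task shaped like the audio-projection task. The specific corruption operators (word drops, character repeats, word duplication) and the rate p are our assumptions; the paper defines its own noising scheme.

```python
import random

def corrupt(text: str, p: float = 0.15) -> str:
    """Apply simple ASR-like corruptions to a clean transcript."""
    out = []
    for w in text.split():
        r = random.random()
        if r < p / 3:
            continue                       # drop the word entirely
        elif r < 2 * p / 3:
            out.append(w + w[-1])          # stutter: repeat the last character
        elif r < p:
            out.extend([w, w])             # duplicate the word
        else:
            out.append(w)
    return " ".join(out)

random.seed(3)
clean = "please schedule a follow up appointment for next tuesday"
print("input :", corrupt(clean))   # noisy side of the training pair
print("target:", clean)            # clean side the LLM learns to recover
```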
【35】IDE-Bench: Evaluating Large Language Models as IDE Agents on Real-World Software Engineering Tasks
Link: https://arxiv.org/abs/2601.20886
Authors: Spencer Mateega, Jeff Yang, Tiana Costello, Shaurya Jadhav, Nicole Tian, Agustin Garcinuño
Abstract: IDE-Bench is a comprehensive framework for evaluating AI IDE agents on real-world software engineering tasks through an IDE-native tool interface. We present a Dockerized test harness that goes beyond raw terminal execution, granting models a structured tool ecosystem that represents AI-native IDEs like Cursor and Windsurf. By providing high-level abstractions for codebase search, structured file editing, and tools for testing full-stack applications, IDE-Bench evaluates an agent's ability to act as a true engineering collaborator. For evaluation and to prevent training data contamination, we created 80 tasks across eight never-published repositories spanning C/C++, Java, and MERN stacks, representing modern tech stack production scenarios, including feature implementation, bug fixing, refactoring, and performance optimization that mirror daily developer workflows in private codebases. Our benchmark is the first to systematically correlate agent-reported intent with successful project-level modifications in a multi-language, full-stack environment on completely uncontaminated code.
【36】Rethinking LLM-Driven Heuristic Design: Generating Efficient and Specialized Solvers via Dynamics-Aware Optimization
Link: https://arxiv.org/abs/2601.20868
Authors: Rongzheng Wang, Yihong Huang, Muquan Li, Jiakai Li, Di Liang, Bob Simons, Pei Ke, Shuang Liang, Ke Qin
Abstract: Large Language Models (LLMs) have advanced the field of Combinatorial Optimization through automated heuristic generation. Instead of relying on manual design, this LLM-Driven Heuristic Design (LHD) process leverages LLMs to iteratively generate and refine solvers to achieve high performance. However, existing LHD frameworks face two critical limitations: (1) Endpoint-only evaluation, which ranks solvers solely by final quality, ignoring the convergence process and runtime efficiency; (2) High adaptation costs, where distribution shifts necessitate re-adaptation to generate specialized solvers for new instance groups. To address these issues, we propose Dynamics-Aware Solver Heuristics (DASH), a framework that co-optimizes solver search mechanisms and runtime schedules guided by a convergence-aware metric, thereby identifying efficient and high-performance solvers. Furthermore, to mitigate expensive re-adaptation, DASH incorporates Profiled Library Retrieval (PLR). PLR efficiently archives specialized solvers concurrently with the evolutionary process, enabling cost-effective warm-starts for heterogeneous distributions. Experiments on four combinatorial optimization problems demonstrate that DASH improves runtime efficiency by over 3 times, while surpassing the solution quality of state-of-the-art baselines across diverse problem scales. Furthermore, by enabling profile-based warm starts, DASH maintains superior accuracy under different distributions while cutting LLM adaptation costs by over 90%.
【37】Fake News Detection After LLM Laundering: Measurement and Explanation
Link: https://arxiv.org/abs/2501.18649
Authors: Rupak Kumar Das, Jonathan Dodge
Abstract: With their advanced capabilities, Large Language Models (LLMs) can generate highly convincing and contextually relevant fake news, which can contribute to disseminating misinformation. Though there is much research on fake news detection for human-written text, the field of detecting LLM-generated fake news is still under-explored. This research measures the efficacy of detectors in identifying LLM-paraphrased fake news, in particular, determining whether adding a paraphrase step in the detection pipeline helps or impedes detection. This study makes the following contributions: (1) We show that detectors struggle more to detect LLM-paraphrased fake news than human-written text; (2) We find which models excel at which tasks (evading detection, paraphrasing to evade detection, and paraphrasing for semantic similarity); (3) Via LIME explanations, we discovered a possible reason for detection failures: sentiment shift; (4) We discover a worrisome trend in paraphrase quality measurement: samples that exhibit sentiment shift despite a high BERTSCORE; (5) We provide a pair of datasets augmenting existing datasets with paraphrase outputs and scores. The dataset is available on GitHub.
【38】A Judge-Aware Ranking Framework for Evaluating Large Language Models without Ground Truth
Link: https://arxiv.org/abs/2601.21817
Authors: Mingyuan Xu, Xinzi Tan, Jiawei Wu, Doudou Zhou
Abstract: Evaluating large language models (LLMs) on open-ended tasks without ground-truth labels is increasingly done via the LLM-as-a-judge paradigm. A critical but under-modeled issue is that judge LLMs differ substantially in reliability; treating all judges equally can yield biased leaderboards and misleading uncertainty estimates. More data can make evaluation more confidently wrong under misspecified aggregation. We propose a judge-aware ranking framework that extends the Bradley-Terry-Luce model by introducing judge-specific discrimination parameters, jointly estimating latent model quality and judge reliability from pairwise comparisons without reference labels. We establish identifiability up to natural normalizations and prove consistency and asymptotic normality of the maximum likelihood estimator, enabling confidence intervals for score differences and rank comparisons. Across multiple public benchmarks and a newly collected dataset, our method improves agreement with human preferences, achieves higher data efficiency than unweighted baselines, and produces calibrated uncertainty quantification for LLM rankings.
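A minimal MLE for one natural judge-aware Bradley-Terry extension, assuming P(i beats j | judge k) = sigmoid(a_k * (theta_i - theta_j)) with judge discrimination a_k; the paper's exact parameterization and inference may differ. Identifiability is fixed by centering theta and pinning a_0 = 1. Gradient ascent on the log-likelihood recovers the quality ordering even with one noisy judge in the mix.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit(comps, n_models, n_judges, lr=0.5, steps=2000):
    """comps: list of (winner i, loser j, judge k). Returns (theta, a)."""
    I, J, K = (np.array(c) for c in zip(*comps))
    theta, a = np.zeros(n_models), np.ones(n_judges)
    for _ in range(steps):
        r = 1.0 - sigmoid(a[K] * (theta[I] - theta[J]))      # residual per comparison
        g_t = np.bincount(I, a[K] * r, n_models) - np.bincount(J, a[K] * r, n_models)
        g_a = np.bincount(K, (theta[I] - theta[J]) * r, n_judges)
        theta += lr * g_t / len(I)
        a += lr * g_a / len(I)
        theta -= theta.mean()                                 # normalizations for
        a[0] = 1.0                                            # identifiability
    return theta, a

rng = np.random.default_rng(0)
true_theta = np.array([1.0, 0.0, -1.0])
true_a = np.array([1.0, 0.2])                                 # judge 1 is unreliable
comps = []
for _ in range(3000):
    i, j = rng.choice(3, size=2, replace=False)
    k = rng.integers(2)
    win = rng.random() < sigmoid(true_a[k] * (true_theta[i] - true_theta[j]))
    comps.append((i, j, k) if win else (j, i, k))
print(np.round(fit(comps, 3, 2)[0], 2))   # recovers the quality ordering
```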
【39】Statsformer: Validated Ensemble Learning with LLM-Derived Semantic Priors
标题:Statsformer:基于LLM衍生语义先验的经验证集成学习
链接:https://arxiv.org/abs/2601.21410
作者:Erica Zhang,Naomi Sagan,Danny Tse,Fangzhao Zhang,Mert Pilanci,Jose Blanchet
摘要:We introduce Statsformer, a principled framework for integrating large language model (LLM)-derived knowledge into supervised statistical learning. Existing approaches are limited in adaptability and scope: they either inject LLM guidance as an unvalidated heuristic, which is sensitive to LLM hallucination, or embed semantic information within a single fixed learner. Statsformer overcomes both limitations through a guardrailed ensemble architecture. We embed LLM-derived feature priors within an ensemble of linear and nonlinear learners, adaptively calibrating their influence via cross-validation. This design yields a flexible system with an oracle-style guarantee that it performs no worse than any convex combination of its in-library base learners, up to statistical error. Empirically, informative priors yield consistent performance improvements, while uninformative or misspecified LLM guidance is automatically downweighted, mitigating the impact of hallucinations across a diverse range of prediction tasks.
【40】Reducing Prompt Sensitivity in LLM-based Speech Recognition Through Learnable Projection
标题:通过可学习投影降低基于LLM的语音识别中的提示敏感性
链接:https://arxiv.org/abs/2601.20898
作者:Sergio Burdisso,Esaú Villatoro-Tello,Shashi Kumar,Srikanth Madikeri,Andrés Carofilis,Pradeep Rangappa,Manjunath K E,Kadri Hacioglu,Petr Motlicek,Andreas Stolcke
备注:Paper accepted at ICASSP 2026
摘要:LLM-based automatic speech recognition (ASR), a well-established approach, connects speech foundation models to large language models (LLMs) through a speech-to-LLM projector, yielding promising results. A common design choice in these architectures is the use of a fixed, manually defined prompt during both training and inference. This setup not only enables applicability across a range of practical scenarios, but also helps maximize model performance. However, the impact of prompt design remains underexplored. This paper presents a comprehensive analysis of commonly used prompts across diverse datasets, showing that prompt choice significantly affects ASR performance and introduces instability, with no single prompt performing best across all cases. Inspired by the speech-to-LLM projector, we propose a prompt projector module, a simple, model-agnostic extension that learns to project prompt embeddings to more effective regions of the LLM input space, without modifying the underlying LLM-based ASR model. Experiments on four datasets show that the addition of a prompt projector consistently improves performance, reduces variability, and outperforms the best manually selected prompts.
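The projector itself is described only at a high level; a plausible minimal form (our guess at the architecture, with illustrative dimensions) is a small residual MLP applied to the prompt embeddings before they enter the frozen LLM:

import torch
import torch.nn as nn

class PromptProjector(nn.Module):
    # Trainable projection of prompt embeddings; the LLM-based ASR model
    # itself stays frozen and unmodified.
    def __init__(self, d_model: int, hidden: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, hidden),
            nn.GELU(),
            nn.Linear(hidden, d_model),
        )

    def forward(self, prompt_emb: torch.Tensor) -> torch.Tensor:
        # Residual form keeps the projected prompt near the original embedding
        return prompt_emb + self.mlp(prompt_emb)

prompt_emb = torch.randn(1, 12, 4096)    # (batch, prompt tokens, d_model)
projected = PromptProjector(4096)(prompt_emb)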
Graph相关(图学习|图神经网络|图优化等)(13篇)
【1】Prior-Informed Flow Matching for Graph Reconstruction
标题:用于图重建的先验信息引导流匹配
链接:https://arxiv.org/abs/2601.22107
作者:Harvey Chen,Nicolas Zilberstein,Santiago Segarra
摘要:We introduce Prior-Informed Flow Matching (PIFM), a conditional flow model for graph reconstruction. Reconstructing graphs from partial observations remains a key challenge; classical embedding methods often lack global consistency, while modern generative models struggle to incorporate structural priors. PIFM bridges this gap by integrating embedding-based priors with continuous-time flow matching. Grounded in a permutation equivariant version of the distortion-perception theory, our method first uses a prior, such as graphons or GraphSAGE/node2vec, to form an informed initial estimate of the adjacency matrix based on local information. It then applies rectified flow matching to refine this estimate, transporting it toward the true distribution of clean graphs and learning a global coupling. Experiments on different datasets demonstrate that PIFM consistently enhances classical embeddings, outperforming them and state-of-the-art generative baselines in reconstruction accuracy.
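The two-stage recipe can be sketched roughly as follows, under our own naming; the abstract does not specify the exact parameterization of the velocity network:

import torch

def rectified_flow_step(model, a_prior, a_true):
    # a_prior: prior-informed adjacency estimate (e.g., from graphon or
    # node2vec similarities); a_true: clean adjacency. Shapes: (batch, n, n).
    t = torch.rand(a_true.size(0), 1, 1)          # random time in [0, 1]
    x_t = (1 - t) * a_prior + t * a_true          # straight-line interpolation
    target_velocity = a_true - a_prior            # constant along rectified paths
    pred_velocity = model(x_t, t)                 # any (x_t, t) -> velocity net
    return ((pred_velocity - target_velocity) ** 2).mean()

At inference, integrating the learned velocity field from the prior-informed estimate toward t = 1 would yield the refined reconstruction.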
【2】Bridging Graph Structure and Knowledge-Guided Editing for Interpretable Temporal Knowledge Graph Reasoning
标题:桥接图结构与知识引导编辑以实现可解释时态知识图推理
链接:https://arxiv.org/abs/2601.21978
作者:Shiqi Fan,Quanming Yao,Hongyi Nie,Wentao Ma,Zhen Wang,Wen Hua
摘要:Temporal knowledge graph reasoning (TKGR) aims to predict future events by inferring missing entities with dynamic knowledge structures. Existing LLM-based reasoning methods prioritize contextual over structural relations, struggling to extract relevant subgraphs from dynamic graphs. This limits structural information understanding, leading to unstructured, hallucination-prone inferences especially with temporal inconsistencies. To address this problem, we propose IGETR (Integration of Graph and Editing-enhanced Temporal Reasoning), a hybrid reasoning framework that combines the structured temporal modeling capabilities of Graph Neural Networks (GNNs) with the contextual understanding of LLMs. IGETR operates through a three-stage pipeline. The first stage aims to ground the reasoning process in the actual data by identifying structurally and temporally coherent candidate paths through a temporal GNN, ensuring that inference starts from reliable graph-based evidence. The second stage introduces LLM-guided path editing to address logical and semantic inconsistencies, leveraging external knowledge to refine and enhance the initial paths. The final stage focuses on integrating the refined reasoning paths to produce predictions that are both accurate and interpretable. Experiments on standard TKG benchmarks show that IGETR achieves state-of-the-art performance, outperforming strong baselines with relative improvements of up to 5.6% on Hits@1 and 8.1% on Hits@3 on the challenging ICEWS datasets. Additionally, ablation studies and further analyses confirm the effectiveness of each component.
【3】How Expressive Are Graph Neural Networks in the Presence of Node Identifiers?
标题:在存在节点标识符的情况下,图神经网络的表现力如何?
链接:https://arxiv.org/abs/2601.21882
作者:Arie Soeteman,Michael Benedikt,Martin Grohe,Balder ten Cate
备注:35 pages
摘要:Graph neural networks (GNNs) are a widely used class of machine learning models for graph-structured data, based on local aggregation over neighbors. GNNs have close connections to logic. In particular, their expressive power is linked to that of modal logics and bounded-variable logics with counting. In many practical scenarios, graphs processed by GNNs have node features that act as unique identifiers. In this work, we study how such identifiers affect the expressive power of GNNs. We initiate a study of the key-invariant expressive power of GNNs, inspired by the notion of order-invariant definability in finite model theory: which node queries that depend only on the underlying graph structure can GNNs express on graphs with unique node identifiers? We provide answers for various classes of GNNs with local max- or sum-aggregation.
【4】Heterogeneity-Aware Knowledge Sharing for Graph Federated Learning
标题:面向图联邦学习的异质性感知知识共享
链接:https://arxiv.org/abs/2601.21589
作者:Wentao Yu,Sheng Wan,Shuo Chen,Bo Han,Chen Gong
备注:33 pages
摘要:Graph Federated Learning (GFL) enables distributed graph representation learning while protecting the privacy of graph data. However, GFL suffers from heterogeneity arising from diverse node features and structural topologies across multiple clients. To address both types of heterogeneity, we propose a novel graph Federated learning method via Semantic and Structural Alignment (FedSSA), which shares the knowledge of both node features and structural topologies. For node feature heterogeneity, we propose a novel variational model to infer class-wise node distributions, so that we can cluster clients based on inferred distributions and construct cluster-level representative distributions. We then minimize the divergence between local and cluster-level distributions to facilitate semantic knowledge sharing. For structural heterogeneity, we employ spectral Graph Neural Networks (GNNs) and propose a spectral energy measure to characterize structural information, so that we can cluster clients based on spectral energy and build cluster-level spectral GNNs. We then align the spectral characteristics of local spectral GNNs with those of cluster-level spectral GNNs to enable structural knowledge sharing. Experiments on six homophilic and five heterophilic graph datasets under both non-overlapping and overlapping partitioning settings demonstrate that FedSSA consistently outperforms eleven state-of-the-art methods.
【5】Explicit Credit Assignment through Local Rewards and Dependence Graphs in Multi-Agent Reinforcement Learning
标题:多智能体强化学习中通过局部奖励和依赖图进行显式信用分配
链接:https://arxiv.org/abs/2601.21523
作者:Bang Giang Le,Viet Cuong Ta
摘要:To promote cooperation in Multi-Agent Reinforcement Learning, the reward signals of all agents can be aggregated together, forming global rewards that are commonly known as the fully cooperative setting. However, global rewards are usually noisy because they contain the contributions of all agents, which have to be resolved in the credit assignment process. On the other hand, using local rewards benefits from faster learning due to the separation of agents' contributions, but can be suboptimal as agents myopically optimize their own rewards while disregarding global optimality. In this work, we propose a method that combines the merits of both approaches. By using a graph of interaction between agents, our method discerns individual agent contributions in a more fine-grained manner than a global reward, while alleviating the cooperation problem with agents' local rewards. We also introduce a practical approach for approximating such a graph. Our experiments demonstrate the flexibility of the approach, enabling improvements over the traditional local and global reward settings.
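One plausible instantiation of the idea (the paper's exact aggregation rule and its graph-approximation procedure are not given in the abstract) blends each agent's local reward with those of the agents it interacts with:

import numpy as np

def graph_mixed_rewards(local_rewards, adjacency):
    # Row-normalized interaction graph with self-loops: each agent's credit
    # covers its own reward plus those of agents it directly influences.
    a = adjacency + np.eye(len(local_rewards))
    a = a / a.sum(axis=1, keepdims=True)
    return a @ local_rewards

r = np.array([1.0, 0.0, 0.5])
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
print(graph_mixed_rewards(r, adj))    # finer-grained than one global reward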
【6】Mean-Field Control on Sparse Graphs: From Local Limits to GNNs via Neighborhood Distributions
标题:稀疏图上的平均场控制:经由邻域分布从局部极限到GNN
链接:https://arxiv.org/abs/2601.21477
作者:Tobias Schmidt,Kai Cui
备注:19 pages
摘要:Mean-field control (MFC) offers a scalable solution to the curse of dimensionality in multi-agent systems but traditionally hinges on the restrictive assumption of exchangeability via dense, all-to-all interactions. In this work, we bridge the gap to real-world network structures by proposing a rigorous framework for MFC on large sparse graphs. We redefine the system state as a probability measure over decorated rooted neighborhoods, effectively capturing local heterogeneity. Our central contribution is a theoretical foundation for scalable reinforcement learning in this setting. We prove horizon-dependent locality: for finite-horizon problems, an agent's optimal policy at time t depends strictly on its (T-t)-hop neighborhood. This result renders the infinite-dimensional control problem tractable and underpins a novel Dynamic Programming Principle (DPP) on the lifted space of neighborhood distributions. Furthermore, we formally and experimentally justify the use of Graph Neural Networks (GNNs) for actor-critic algorithms in this context. Our framework naturally recovers classical MFC as a degenerate case while enabling efficient, theoretically grounded control on complex sparse topologies.
【7】Synthetic Pattern Generation and Detection of Financial Activities using Graph Autoencoders
标题:使用图自编码器进行金融活动的合成模式生成与检测
链接:https://arxiv.org/abs/2601.21446
作者:Francesco Zola,Lucia Muñoz,Andrea Venturi,Amaia Gil
备注:Accepted to the 7th International Workshop on Statistical Methods and Artificial Intelligence (IWSMAI'26)
摘要:Illicit financial activities such as money laundering often manifest through recurrent topological patterns in transaction networks. Detecting these patterns automatically remains challenging due to the scarcity of labeled real-world data and strict privacy constraints. To address this, we investigate whether Graph Autoencoders (GAEs) can effectively learn and distinguish topological patterns that mimic money laundering operations when trained on synthetic data. The analysis consists of two phases: (i) data generation, where synthetic samples are created for seven well-known illicit activity patterns using parametrized generators that preserve structural consistency while introducing realistic variability; and (ii) model training and validation, where separate GAEs are trained on each pattern without explicit labels, relying solely on reconstruction error as an indicator of learned structure. We compare three GAE implementations based on three distinct convolutional layers: Graph Convolutional (GAE-GCN), GraphSAGE (GAE-SAGE), and Graph Attention Network (GAE-GAT). Experimental results show that GAE-GCN achieves the most consistent reconstruction performance across patterns, while GAE-SAGE and GAE-GAT exhibit competitive results only in few specific patterns. These findings suggest that graph-based representation learning on synthetic data provides a viable path toward developing AI-driven tools for detecting illicit behaviors, overcoming the limitations of financial datasets.
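The reconstruction-error scheme is a standard one and easy to sketch; the following dense toy version (our simplification of the GCN variant) trains one autoencoder per pattern and scores new graphs by reconstruction loss:

import torch
import torch.nn as nn

class DenseGAE(nn.Module):
    # One GCN-style mean-aggregation layer encodes nodes; an inner-product
    # decoder reconstructs the (0/1-valued, float) adjacency matrix.
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, hid_dim)

    def forward(self, x, adj):
        deg = adj.sum(1, keepdim=True).clamp(min=1)
        z = torch.relu(self.lin((adj @ x) / deg))
        return torch.sigmoid(z @ z.t())

def anomaly_score(model, x, adj):
    # High reconstruction error = graph unlike the pattern this GAE learned
    return nn.functional.binary_cross_entropy(model(x, adj), adj).item()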
【8】Graph-Free Root Cause Analysis
标题:无图根本原因分析
链接:https://arxiv.org/abs/2601.21359
作者:Luan Pham
摘要:Failures in complex systems demand rapid Root Cause Analysis (RCA) to prevent cascading damage. Existing RCA methods that operate without a dependency graph typically assume that the root cause has the highest anomaly score. This assumption fails when faults propagate, as a small delay at the root cause can accumulate into a much larger anomaly downstream. In this paper, we propose PRISM, a simple and efficient framework for RCA when the dependency graph is absent. We formulate a class of component-based systems under which PRISM performs RCA with theoretical guarantees. On 735 failures across 9 real-world datasets, PRISM achieves 68% Top-1 accuracy, a 258% improvement over the best baseline, while requiring only 8ms per diagnosis.
【9】Transferable Graph Condensation from the Causal Perspective
标题:因果关系视角下的可迁移图凝聚
链接:https://arxiv.org/abs/2601.21309
作者:Huaming Du,Yijie Huang,Su Yao,Yiying Wang,Yueyang Zhou,Jingwen Yang,Jinshi Zhang,Han Ji,Yu Zhao,Guisong Liu,Hegui Zhang,Carl Yang,Gang Kou
摘要:The increasing scale of graph datasets has significantly improved the performance of graph representation learning methods, but it has also introduced substantial training challenges. Graph dataset condensation techniques have emerged to compress large datasets into smaller yet information-rich datasets, while maintaining similar test performance. However, these methods strictly require downstream applications to match the original dataset and task, which often fails in cross-task and cross-domain scenarios. To address these challenges, we propose a novel causal-invariance-based and transferable graph dataset condensation method, named \textbf{TGCC}, providing effective and transferable condensed datasets. Specifically, to preserve domain-invariant knowledge, we first extract domain causal-invariant features from the spatial domain of the graph using causal interventions. Then, to fully capture the structural and feature information of the original graph, we perform enhanced condensation operations. Finally, through spectral-domain enhanced contrastive learning, we inject the causal-invariant features into the condensed graph, ensuring that the compressed graph retains the causal information of the original graph. Experimental results on five public datasets and our novel \textbf{FinReport} dataset demonstrate that TGCC achieves up to a 13.41\% improvement in cross-task and cross-domain complex scenarios compared to existing methods, and achieves state-of-the-art performance on 5 out of 6 datasets in the single dataset and task scenario.
【10】EGAM: Extended Graph Attention Model for Solving Routing Problems
标题:EGAM:解决路由问题的扩展图注意力模型
链接:https://arxiv.org/abs/2601.21281
作者:Licheng Wang,Yuzi Yan,Mingtao Huang,Yuan Shen
摘要:Neural combinatorial optimization (NCO) solvers, implemented with graph neural networks (GNNs), have introduced new approaches for solving routing problems. Trained with reinforcement learning (RL), the state-of-the-art graph attention model (GAM) achieves near-optimal solutions without requiring expert knowledge or labeled data. In this work, we generalize the existing graph attention mechanism and propose the extended graph attention model (EGAM). Our model utilizes multi-head dot-product attention to update both node and edge embeddings, addressing the limitations of the conventional GAM, which considers only node features. We employ an autoregressive encoder-decoder architecture and train it with policy gradient algorithms that incorporate a specially designed baseline. Experiments show that EGAM matches or outperforms existing methods across various routing problems. Notably, the proposed model demonstrates exceptional performance on highly constrained problems, highlighting its efficiency in handling complex graph structures.
【11】A Sheaf-Theoretic and Topological Perspective on Complex Network Modeling and Attention Mechanisms in Graph Neural Models
标题:复杂网络建模与图神经模型中注意力机制的层论与拓扑视角
链接:https://arxiv.org/abs/2601.21207
作者:Chuan-Shen Hu
摘要:Combinatorial and topological structures, such as graphs, simplicial complexes, and cell complexes, form the foundation of geometric and topological deep learning (GDL and TDL) architectures. These models aggregate signals over such domains, integrate local features, and generate representations for diverse real-world applications. However, the distribution and diffusion behavior of GDL and TDL features during training remains an open and underexplored problem. Motivated by this gap, we introduce a cellular sheaf theoretic framework for modeling and analyzing the local consistency and harmonicity of node features and edge weights in graph-based architectures. By tracking local feature alignments and agreements through sheaf structures, the framework offers a topological perspective on feature diffusion and aggregation. Furthermore, a multiscale extension inspired by topological data analysis (TDA) is proposed to capture hierarchical feature interactions in graph models. This approach enables a joint characterization of GDL and TDL architectures based on their underlying geometric and topological structures and the learned signals defined on them, providing insights for future studies on conventional tasks such as node classification, substructure detection, and community detection.
【12】AC2L-GAD: Active Counterfactual Contrastive Learning for Graph Anomaly Detection
标题:AC2L-GAD:用于图异常检测的主动反事实对比学习
链接:https://arxiv.org/abs/2601.21171
作者:Kamal Berahmand,Saman Forouzandeh,Mehrnoush Mohammadi,Parham Moradi,Mahdi Jalili
摘要:Graph anomaly detection aims to identify abnormal patterns in networks, but faces significant challenges from label scarcity and extreme class imbalance. While graph contrastive learning offers a promising unsupervised solution, existing methods suffer from two critical limitations: random augmentations break semantic consistency in positive pairs, while naive negative sampling produces trivial, uninformative contrasts. We propose AC2L-GAD, an Active Counterfactual Contrastive Learning framework that addresses both limitations through principled counterfactual reasoning. By combining information-theoretic active selection with counterfactual generation, our approach identifies structurally complex nodes and generates anomaly-preserving positive augmentations alongside normal negative counterparts that provide hard contrasts, while restricting expensive counterfactual generation to a strategically selected subset. This design reduces computational overhead by approximately 65% compared to full-graph counterfactual generation while maintaining detection quality. Experiments on nine benchmark datasets, including real-world financial transaction graphs from GADBench, show that AC2L-GAD achieves competitive or superior performance compared to state-of-the-art baselines, with notable gains in datasets where anomalies exhibit complex attribute-structure interactions.
【13】Out-of-Distribution Generalization in Graph Foundation Models
标题:图基础模型中的分布外泛化
链接:https://arxiv.org/abs/2601.21067
作者:Haoyang Li,Haibo Chen,Xin Wang,Wenwu Zhu
摘要:Graphs are a fundamental data structure for representing relational information in domains such as social networks, molecular systems, and knowledge graphs. However, graph learning models often suffer from limited generalization when applied beyond their training distributions. In practice, distribution shifts may arise from changes in graph structure, domain semantics, available modalities, or task formulations. To address these challenges, graph foundation models (GFMs) have recently emerged, aiming to learn general-purpose representations through large-scale pretraining across diverse graphs and tasks. In this survey, we review recent progress on GFMs from the perspective of out-of-distribution (OOD) generalization. We first discuss the main challenges posed by distribution shifts in graph learning and outline a unified problem setting. We then organize existing approaches based on whether they are designed to operate under a fixed task specification or to support generalization across heterogeneous task formulations, and summarize the corresponding OOD handling strategies and pretraining objectives. Finally, we review common evaluation protocols and discuss open directions for future research. To the best of our knowledge, this paper is the first survey for OOD generalization in GFMs.
Transformer(7篇)
【1】EditYourself: Audio-Driven Generation and Manipulation of Talking Head Videos with Diffusion Transformers
标题:EditYourself:基于扩散Transformer的音频驱动说话人头像视频生成与编辑
链接:https://arxiv.org/abs/2601.22127
作者:John Flynn,Wolfgang Paier,Dimitar Dinev,Sam Nhut Nguyen,Hayk Poghosyan,Manuel Toribio,Sandipan Banerjee,Guy Gafni
备注:Project page: https://edit-yourself.github.io/
摘要:Current generative video models excel at producing novel content from text and image prompts, but leave a critical gap in editing existing pre-recorded videos, where minor alterations to the spoken script require preserving motion, temporal coherence, speaker identity, and accurate lip synchronization. We introduce EditYourself, a DiT-based framework for audio-driven video-to-video (V2V) editing that enables transcript-based modification of talking head videos, including the seamless addition, removal, and retiming of visually spoken content. Building on a general-purpose video diffusion model, EditYourself augments its V2V capabilities with audio conditioning and region-aware, edit-focused training extensions. This enables precise lip synchronization and temporally coherent restructuring of existing performances via spatiotemporal inpainting, including the synthesis of realistic human motion in newly added segments, while maintaining visual fidelity and identity consistency over long durations. This work represents a foundational step toward generative video models as practical tools for professional video post-production.
【2】Rate-Distortion Optimization for Transformer Inference
标题:Transformer推理的率失真优化
链接:https://arxiv.org/abs/2601.22002
作者:Anderson de Andrade,Alon Harell,Ivan V. Bajić
摘要:Transformers achieve superior performance on many tasks, but impose heavy compute and memory requirements during inference. This inference can be made more efficient by partitioning the process across multiple devices, which, in turn, requires compressing its intermediate representations. In this work, we introduce a principled rate-distortion-based framework for lossy compression that learns compact encodings that explicitly trade off bitrate against accuracy. Experiments on language benchmarks show that the proposed codec achieves substantial savings with improved accuracy in some cases, outperforming more complex baseline methods. We characterize and analyze the rate-distortion performance of transformers, offering a unified lens for understanding performance in representation coding. This formulation extends information-theoretic concepts to define the gap between rate and entropy, and derive some of its bounds. We further develop probably approximately correct (PAC)-style bounds for estimating this gap. For different architectures and tasks, we empirically demonstrate that their rates are driven by these bounds, adding to the explainability of the formulation.
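A minimal sketch of how such a codec could look, assuming vector quantization with a straight-through estimator and an empirical entropy estimate (the paper's actual encoding scheme may differ):

import torch

def rate_distortion_loss(h, codebook, lam, task_loss_fn):
    # h: intermediate activations (..., d) sent across the device split;
    # codebook: (K, d) learned codewords; lam trades bitrate vs. distortion.
    flat = h.flatten(0, -2)
    idx = torch.cdist(flat, codebook).argmin(dim=-1)
    q = codebook[idx].view_as(h)
    h_q = h + (q - h).detach()                     # straight-through estimator
    probs = torch.bincount(idx, minlength=len(codebook)).float()
    probs = probs / probs.sum()
    rate = -(probs[idx] + 1e-9).log2().mean()      # avg bits per element; this
    # empirical estimate is for monitoring, a learned entropy model would make
    # the rate term itself trainable.
    return task_loss_fn(h_q) + lam * rate          # distortion + lambda * rate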
【3】Seg-MoE: Multi-Resolution Segment-wise Mixture-of-Experts for Time Series Forecasting Transformers
标题:Seg-MoE:用于时间序列预测Transformer的多分辨率分段专家混合
链接:https://arxiv.org/abs/2601.21641
作者:Evandro S. Ortigossa,Eran Segal
备注:Under review
摘要:Transformer-based models have recently made significant advances in accurate time-series forecasting, but even these architectures struggle to scale efficiently while capturing long-term temporal dynamics. Mixture-of-Experts (MoE) layers are a proven solution to scaling problems in natural language processing. However, existing MoE approaches for time-series forecasting rely on token-wise routing mechanisms, which may fail to exploit the natural locality and continuity of temporal data. In this work, we introduce Seg-MoE, a sparse MoE design that routes and processes contiguous time-step segments rather than making independent expert decisions. Token segments allow each expert to model intra-segment interactions directly, naturally aligning with inherent temporal patterns. We integrate Seg-MoE layers into a time-series Transformer and evaluate it on multiple multivariate long-term forecasting benchmarks. Seg-MoE consistently achieves state-of-the-art forecasting accuracy across almost all prediction horizons, outperforming both dense Transformers and prior token-wise MoE models. Comprehensive ablation studies confirm that segment-level routing is the key factor driving these gains. Our results show that aligning the MoE routing granularity with the inherent structure of time series provides a powerful, yet previously underexplored, inductive bias, opening new avenues for conditionally sparse architectures in sequential data modeling.
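The key design point, routing whole segments rather than individual tokens, can be sketched as follows (top-1 routing and the expert shape are our simplifications; the time dimension is assumed divisible by the segment length):

import torch
import torch.nn as nn

class SegmentRouter(nn.Module):
    def __init__(self, d_model, n_experts, seg_len):
        super().__init__()
        self.seg_len = seg_len
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))

    def forward(self, x):                          # x: (batch, time, d_model)
        b, t, d = x.shape
        segs = x.view(b, t // self.seg_len, self.seg_len, d)
        seg_repr = segs.mean(dim=2)                # one decision per segment
        expert_idx = self.gate(seg_repr).argmax(dim=-1)
        out = torch.zeros_like(segs)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e                 # route contiguous segments
            out[mask] = expert(segs[mask])
        return out.view(b, t, d)

y = SegmentRouter(64, n_experts=4, seg_len=8)(torch.randn(2, 96, 64))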
【4】A Unified SPD Token Transformer Framework for EEG Classification: Systematic Comparison of Geometric Embeddings
标题:用于脑电分类的统一SPD令牌Transformer框架:几何嵌入的系统比较
链接:https://arxiv.org/abs/2601.21521
作者:Chi-Sheng Chen,En-Jui Kuo,Guan-Ying Chen,Xinyu Zhang,Fan Zhang
摘要:Spatial covariance matrices of EEG signals are Symmetric Positive Definite (SPD) and lie on a Riemannian manifold, yet the theoretical connection between embedding geometry and optimization dynamics remains unexplored. We provide a formal analysis linking embedding choice to gradient conditioning and numerical stability for SPD manifolds, establishing three theoretical results: (1) BWSPD's $\sqrt{\kappa}$ gradient conditioning (vs $\kappa$ for Log-Euclidean) via Daleckii-Kreĭn matrices provides better gradient conditioning on high-dimensional inputs ($d \geq 22$), with this advantage reducing on low-dimensional inputs ($d \leq 8$) where eigendecomposition overhead dominates; (2) Embedding-Space Batch Normalization (BN-Embed) approximates Riemannian normalization up to $O(\varepsilon^2)$ error, yielding $+26\%$ accuracy on 56-channel ERP data but negligible effect on 8-channel SSVEP data, matching the channel-count-dependent prediction; (3) bi-Lipschitz bounds prove BWSPD tokens preserve manifold distances with distortion governed solely by the condition ratio $\kappa$. We validate these predictions via a unified Transformer framework comparing BWSPD, Log-Euclidean, and Euclidean embeddings within identical architecture across 1,500+ runs on three EEG paradigms (motor imagery, ERP, SSVEP; 36 subjects). Our Log-Euclidean Transformer achieves state-of-the-art performance on all datasets, substantially outperforming classical Riemannian classifiers and recent SPD baselines, while BWSPD offers competitive accuracy with similar training time.
【5】Physics-Guided Tiny-Mamba Transformer for Reliability-Aware Early Fault Warning
标题:面向可靠性感知早期故障预警的物理引导Tiny-Mamba Transformer
链接:https://arxiv.org/abs/2601.21293
作者:Changyu Li,Dingcheng Huang,Kexuan Yao,Xiaoya Ni,Lijuan Shen,Fei Luo
备注:Submitted to IEEE Transactions on Reliability
摘要:Reliability-centered prognostics for rotating machinery requires early warning signals that remain accurate under nonstationary operating conditions, domain shifts across speed/load/sensors, and severe class imbalance, while keeping the false-alarm rate small and predictable. We propose the Physics-Guided Tiny-Mamba Transformer (PG-TMT), a compact tri-branch encoder tailored for online condition monitoring. A depthwise-separable convolutional stem captures micro-transients, a Tiny-Mamba state-space branch models near-linear long-range dynamics, and a lightweight local Transformer encodes cross-channel resonances. We derive an analytic temporal-to-spectral mapping that ties the model's attention spectrum to classical bearing fault-order bands, yielding a band-alignment score that quantifies physical plausibility and provides physics-grounded explanations. To ensure decision reliability, healthy-score exceedances are modeled with extreme-value theory (EVT), which yields an on-threshold achieving a target false-alarm intensity (events/hour); a dual-threshold hysteresis with a minimum hold time further suppresses chatter. Under a leakage-free streaming protocol with right-censoring of missed detections on CWRU, Paderborn, XJTU-SY, and an industrial pilot, PG-TMT attains higher precision-recall AUC (primary under imbalance), competitive or better ROC AUC, and shorter mean time-to-detect at matched false-alarm intensity, together with strong cross-domain transfer. By coupling physics-aligned representations with EVT-calibrated decision rules, PG-TMT delivers calibrated, interpretable, and deployment-ready early warnings for reliability-centric prognostics and health management.
【6】The Depth Delusion: Why Transformers Should Be Wider, Not Deeper
标题:深度错觉:为什么Transformer应该更宽,而不是更深
链接:https://arxiv.org/abs/2601.20994
作者:Md Muhtasim Munif Fahim,Md Rezaul Karim
摘要:Neural scaling laws describe how language model loss decreases with parameters and data, but treat architecture as interchangeable--a billion parameters could arise from a shallow-wide model (10 layers & 8,192 hidden dimension) or a deep-narrow one (80 layers & 2,048 hidden dimension). We propose architecture-conditioned scaling laws decomposing this dependence, finding that optimal depth scales as D* ~ C^0.12 while optimal width scales as W* ~ C^0.34, meaning width should grow 2.8x faster than depth. We discover a critical depth phenomenon: beyond D_crit ~ W^0.44 (sublinear in W), adding layers increases loss despite adding parameters--the Depth Delusion. Empirically, we validate these findings across 30 transformer architectures spanning 17M to 7B parameters, each trained on representative high-compute samples, achieving R^2 = 0.922. Our central finding: at 7B scale, a 64-layer model (6.38B params) underperforms a 32-layer model (6.86B params) by 0.12 nats, despite being significantly deeper. This demonstrates that optimal depth-width tradeoffs persist at the production scale.
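As a worked example of the fitted exponents (the multiplicative constants below are made up; the abstract reports only the exponents), the prescribed shapes at two compute budgets:

def optimal_shape(compute, c_d=1.0, c_w=1.0):
    # D* ~ C^0.12, W* ~ C^0.34; width should grow faster than depth,
    # since 0.34 / 0.12 ≈ 2.8
    return c_d * compute ** 0.12, c_w * compute ** 0.34

for c in (1e18, 1e21):
    d, w = optimal_shape(c)
    print(f"C={c:.0e}: relative depth {d:.1f}, relative width {w:.1f}")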
【7】Clustering in Deep Stochastic Transformers
标题:深度随机Transformer中的聚集
链接:https://arxiv.org/abs/2601.21942
作者:Lev Fedorov,Michaël E. Sander,Romuald Elie,Pierre Marion,Mathieu Laurière
备注:24 pages
摘要:Transformers have revolutionized deep learning across various domains but understanding the precise token dynamics remains a theoretical challenge. Existing theories of deep Transformers with layer normalization typically predict that tokens cluster to a single point; however, these results rely on deterministic weight assumptions, which fail to capture the standard initialization scheme in Transformers. In this work, we show that accounting for the intrinsic stochasticity of random initialization alters this picture. More precisely, we analyze deep Transformers where noise arises from the random initialization of value matrices. Under diffusion scaling and token-wise RMS normalization, we prove that, as the number of Transformer layers goes to infinity, the discrete token dynamics converge to an interacting-particle system on the sphere where tokens are driven by a \emph{common} matrix-valued Brownian noise. In this limit, we show that initialization noise prevents the collapse to a single cluster predicted by deterministic models. For two tokens, we prove a phase transition governed by the interaction strength and the token dimension: unlike deterministic attention flows, antipodal configurations become attracting with positive probability. Numerical experiments confirm the predicted transition, reveal that antipodal formations persist for more than two tokens, and demonstrate that suppressing the intrinsic noise degrades accuracy.
GAN|对抗|攻击|生成相关(10篇)
【1】Latent Adversarial Regularization for Offline Preference Optimization
标题:离线偏好优化的潜在空间对抗正则化
链接:https://arxiv.org/abs/2601.22083
作者:Enyi Jiang,Yibo Jacky Zhang,Yinglun Xu,Andreas Haupt,Nancy Amato,Sanmi Koyejo
摘要:Learning from human feedback typically relies on preference optimization that constrains policy updates through token-level regularization. However, preference optimization for language models is particularly challenging because token-space similarity does not imply semantic or behavioral similarity. To address this challenge, we leverage latent-space regularization for language model preference optimization. We introduce GANPO, which achieves latent-space regularization by penalizing divergence between the internal representations of a policy model and a reference model. Given that latent representations are not associated with explicit probability densities, we adopt an adversarial approach inspired by GANs to minimize latent-space divergence. We integrate GANPO as a regularizer into existing offline preference optimization objectives. Experiments across multiple model architectures and tasks show consistent improvements from latent-space regularization. Further, by comparing GANPO-induced inferential biases with those from token-level regularization, we find that GANPO provides more robust structural feedback under distributional shift and noise while maintaining comparable downstream performance with minor computational overhead.
【2】Exploring Diverse Generation Paths via Inference-time Stiefel Activation Steering
标题:通过推理时Stiefel激活引导探索多元化的生成路径
链接:https://arxiv.org/abs/2601.22010
作者:Dongxuan Zhu,Ly Tran Ho Khanh,Andy Yat-Ming Cheung,Man-Chung Yue,Viet Anh Nguyen
备注:34 pages, 2 figures. Accepted for publication at ICLR 2026
摘要:Language models often default to a narrow set of high-probability outputs, leaving their generation paths homogeneous and prone to mode collapse. Sampling-based strategies inject randomness but still struggle to guarantee diversity across multiple concurrent generation runs. We address this limitation by introducing STARS ($\textbf{St}$iefel-based $\textbf{A}$ctivation Steering for Diverse $\textbf{R}$ea$\textbf{S}$oning), a training-free, inference-time intervention method that transforms activation steering into an exploration engine. At each token, STARS collects the hidden activations of concurrent generation runs and optimizes multiple additive steering directions jointly on the Stiefel manifold. STARS maximizes the geometric volume of the steered activations, while the Stiefel manifold induces orthogonality of the steering interventions. This formulation explicitly promotes divergent activation vectors of concurrent generation runs, and implicitly promotes divergent generation trajectories. This manifold optimization formulation can be solved using a Riemannian gradient descent algorithm with convergence guarantees, but this algorithm is too time-consuming for real-time inference. To guarantee low latency, we further design a lightweight one-step update with an aggressive, closed-form stepsize. For test case generation and scientific discovery benchmarks, STARS consistently outperforms standard sampling methods, achieving greater diversity without sacrificing qualitative performance.
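The diversity objective itself is easy to illustrate: with one steered hidden state per concurrent run, maximize the log-volume of their Gram matrix. The Stiefel-manifold update of the steering directions, the core of the method, is omitted in this sketch:

import torch

def steering_log_volume(hidden_states):
    # hidden_states: (n_runs, d) steered activations at the current token
    h = torch.nn.functional.normalize(hidden_states, dim=-1)
    gram = h @ h.t()
    # Small jitter keeps logdet finite when runs are nearly collinear
    return torch.logdet(gram + 1e-4 * torch.eye(len(h)))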
【3】From Tokens to Blocks: A Block-Diffusion Perspective on Molecular Generation
标题:从词元到块:分子生成的块扩散视角
链接:https://arxiv.org/abs/2601.21964
作者:Qianwei Yang,Dong Xu,Zhangfan Yang,Sisi Yuan,Zexuan Zhu,Jianqiang Li,Junkai Ji
备注:30 pages, 13 figures, 11 tables
摘要:Drug discovery can be viewed as a combinatorial search over an immense chemical space, motivating the development of deep generative models for de novo molecular design. Among these, GPT-based molecular language models (MLM) have shown strong molecular design performance by learning chemical syntax and semantics from large-scale data. However, existing MLMs face two fundamental limitations: they inadequately capture the graph-structured nature of molecules when formulated as next-token prediction problems, and they typically lack explicit mechanisms for target-aware generation. Here, we propose SoftMol, a unified framework that co-designs molecular representation, model architecture, and search strategy for target-aware molecular generation. SoftMol introduces soft fragments, a rule-free block representation of SMILES that enables diffusion-native modeling, and develops SoftBD, the first block-diffusion molecular language model that combines local bidirectional diffusion with autoregressive generation under molecular structural constraints. To favor generated molecules with high drug-likeness and synthetic accessibility, SoftBD is trained on a carefully curated dataset named ZINC-Curated. SoftMol further integrates a gated Monte Carlo tree search to assemble fragments in a target-aware manner. Experimental results show that, compared with current state-of-the-art models, SoftMol achieves 100% chemical validity, improves binding affinity by 9.7%, yields a 2-3x increase in molecular diversity, and delivers a 6.6x speedup in inference efficiency. Code is available at https://github.com/szu-aicourse/softmol
【4】Visual Disentangled Diffusion Autoencoders: Scalable Counterfactual Generation for Foundation Models
标题:视觉解耦扩散自编码器:面向基础模型的可扩展反事实生成
链接:https://arxiv.org/abs/2601.21851
作者:Sidney Bender,Marco Morik
摘要:Foundation models, despite their robust zero-shot capabilities, remain vulnerable to spurious correlations and 'Clever Hans' strategies. Existing mitigation methods often rely on unavailable group labels or computationally expensive gradient-based adversarial optimization. To address these limitations, we propose Visual Disentangled Diffusion Autoencoders (DiDAE), a novel framework integrating frozen foundation models with disentangled dictionary learning for efficient, gradient-free counterfactual generation directly for the foundation model. DiDAE first edits foundation model embeddings in interpretable disentangled directions of the disentangled dictionary and then decodes them via a diffusion autoencoder. This allows the generation of multiple diverse, disentangled counterfactuals for each factual, much faster than existing baselines, which generate single entangled counterfactuals. When paired with Counterfactual Knowledge Distillation, DiDAE-CFKD achieves state-of-the-art performance in mitigating shortcut learning, improving downstream performance on unbalanced datasets.
【5】Noise as a Probe: Membership Inference Attacks on Diffusion Models Leveraging Initial Noise
标题:噪声作为探针:利用初始噪声对扩散模型的成员推理攻击
链接:https://arxiv.org/abs/2601.21628
作者:Puwei Lian,Yujun Cai,Songze Li,Bingkun Bao
摘要:Diffusion models have achieved remarkable progress in image generation, but their increasing deployment raises serious concerns about privacy. In particular, fine-tuned models are highly vulnerable, as they are often fine-tuned on small and private datasets. Membership inference attacks (MIAs) are used to assess privacy risks by determining whether a specific sample was part of a model's training data. Existing MIAs against diffusion models either assume access to intermediate results or require auxiliary datasets for training a shadow model. In this work, we exploit a critical yet overlooked vulnerability: the widely used noise schedules fail to fully eliminate semantic information in the images, resulting in residual semantic signals even at the maximum noise step. We empirically demonstrate that the fine-tuned diffusion model captures hidden correlations between the residual semantics in initial noise and the original images. Building on this insight, we propose a simple yet effective membership inference attack, which injects semantic information into the initial noise and infers membership by analyzing the model's generation result. Extensive experiments demonstrate that the semantic initial noise can strongly reveal membership information, highlighting the vulnerability of diffusion models to MIAs.
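The attack primitive can be sketched in a few lines; the mixing rule and weight alpha below are our guesses, as the abstract does not specify them:

import torch

def semantic_initial_noise(candidate_image, alpha=0.1):
    # Blend the candidate member image into the initial noise; residual
    # semantics then probe whether the model memorized this sample.
    noise = torch.randn_like(candidate_image)
    return alpha * candidate_image + (1 - alpha ** 2) ** 0.5 * noise

Membership would then be inferred by comparing the generation seeded with this noise against the candidate image.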
【6】Generation Enhances Understanding in Unified Multimodal Models via Multi-Representation Generation
标题:通过多表示生成使生成增强统一多模态模型的理解
链接:https://arxiv.org/abs/2601.21406
作者:Zihan Su,Hongyang Wei,Kangrui Cen,Yong Wang,Guanhua Chen,Chun Yuan,Xiangxiang Chu
摘要:Unified Multimodal Models (UMMs) integrate both visual understanding and generation within a single framework. Their ultimate aspiration is to create a cycle where understanding and generation mutually reinforce each other. While recent post-training methods have successfully leveraged understanding to enhance generation, the reverse direction of utilizing generation to improve understanding remains largely unexplored. In this work, we propose UniMRG (Unified Multi-Representation Generation), a simple yet effective architecture-agnostic post-training method. UniMRG enhances the understanding capabilities of UMMs by incorporating auxiliary generation tasks. Specifically, we train UMMs to generate multiple intrinsic representations of input images, namely pixel (reconstruction), depth (geometry), and segmentation (structure), alongside standard visual understanding objectives. By synthesizing these diverse representations, UMMs capture complementary information regarding appearance, spatial relations, and structural layout. Consequently, UMMs develop a deeper and more comprehensive understanding of visual inputs. Extensive experiments across diverse UMM architectures demonstrate that our method notably enhances fine-grained perception, reduces hallucinations, and improves spatial understanding, while simultaneously boosting generation capabilities.
【7】Adversarial Vulnerability Transcends Computational Paradigms: Feature Engineering Provides No Defense Against Neural Adversarial Transfer
标题:对抗漏洞超越计算范式:特征工程无法防御神经对抗转移
链接:https://arxiv.org/abs/2601.21323
作者:Achraf Hsain,Ahmed Abdelkader,Emmanuel Baldwin Mbaya,Hamoud Aljamaan
摘要:Deep neural networks are vulnerable to adversarial examples--inputs with imperceptible perturbations causing misclassification. While adversarial transfer within neural networks is well-documented, whether classical ML pipelines using handcrafted features inherit this vulnerability when attacked via neural surrogates remains unexplored. Feature engineering creates information bottlenecks through gradient quantization and spatial binning, potentially filtering high-frequency adversarial signals. We evaluate this hypothesis through the first comprehensive study of adversarial transfer from DNNs to HOG-based classifiers. Using VGG16 as a surrogate, we generate FGSM and PGD adversarial examples and test transfer to four classical classifiers (KNN, Decision Tree, Linear SVM, Kernel SVM) and a shallow neural network across eight HOG configurations on CIFAR-10. Our results strongly refute the protective hypothesis: all classifiers suffer 16.6%-59.1% relative accuracy drops, comparable to neural-to-neural transfer. More surprisingly, we discover attack hierarchy reversal--contrary to patterns where iterative PGD dominates FGSM within neural networks, FGSM causes greater degradation than PGD in 100% of classical ML cases, suggesting iterative attacks overfit to surrogate-specific features that don't survive feature extraction. Block normalization provides partial but insufficient mitigation. These findings demonstrate that adversarial vulnerability is not an artifact of end-to-end differentiability but a fundamental property of image classification systems, with implications for security-critical deployments across computational paradigms.
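The transfer setup is straightforward to reproduce in outline; the surrogate, classifier, and function names below are illustrative, and the HOG call assumes skimage >= 0.19:

import torch
from skimage.feature import hog

def fgsm(surrogate, x, y, eps=8 / 255):
    # Single-step attack crafted on the neural surrogate (e.g., VGG16)
    x = x.clone().requires_grad_(True)
    torch.nn.functional.cross_entropy(surrogate(x), y).backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

def hog_features(batch):
    # HOG pipeline the classical classifiers see; CHW images in [0, 1]
    return [hog(img.permute(1, 2, 0).cpu().numpy(), orientations=9,
                pixels_per_cell=(8, 8), cells_per_block=(2, 2),
                channel_axis=-1) for img in batch]

# x_adv = fgsm(vgg16, x, y)            # craft on the surrogate ...
# clf.predict(hog_features(x_adv))     # ... evaluate transfer on HOG + SVM/KNN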
【8】Quantifying Noise in Language Generation
标题:量化语言生成中的噪音
链接:https://arxiv.org/abs/2601.21237
作者:Aaron Li,Ian Zhang
摘要:Kleinberg and Mullainathan recently proposed a formal framework for studying the phenomenon of language generation, called language generation in the limit. In this model, an adversary gives an enumeration of example strings from an unknown target language, and the algorithm is tasked with correctly generating unseen strings from the target language within finite time. Refined notions of non-uniform and uniform generation were later introduced by Li, Raman, and Tewari (2025), and a noisy model was introduced by Raman and Raman (2025), which allows the adversary to insert extraneous strings. A natural question in the noisy model is to quantify the effect of noise, by studying the impact of each additional extraneous string. We show two complementary results in this setting. We first show that for both uniform and non-uniform generation, a single noisy string strictly reduces the set of collections that can be generated, thus answering an open question in Raman and Raman (2025). Then, we show for both uniform and non-uniform generation that generation with a single noisy string is equivalent to generation with any finite amount of noise, sharply contrasting with the strict hierarchy for noisy generation in the limit shown by Bai, Panigrahi, and Zhang (2026). Finally, we leverage our previous results to provide the first known characterization for non-uniform noise-dependent generatability.
【9】Diverse Approaches to Optimal Execution Schedule Generation
标题:最佳执行计划生成的多种方法
链接:https://arxiv.org/abs/2601.22113
作者:Robert de Witt,Mikko S. Pakkanen
备注:27 pages, 15 figures, 5 tables
摘要:We present the first application of MAP-Elites, a quality-diversity algorithm, to trade execution. Rather than searching for a single optimal policy, MAP-Elites generates a diverse portfolio of regime-specialist strategies indexed by liquidity and volatility conditions. Individual specialists achieve 8-10% performance improvements within their behavioural niches, while other cells show degradation, suggesting opportunities for ensemble approaches that combine improved specialists with the baseline PPO policy. Results indicate that quality-diversity methods offer promise for regime-adaptive execution, though substantial computational resources per behavioural cell may be required for robust specialist development across all market conditions. To ensure experimental integrity, we develop a calibrated Gymnasium environment focused on order scheduling rather than tactical placement decisions. The simulator features a transient impact model with exponential decay and square-root volume scaling, fit to 400+ U.S. equities with R^2>0.02 out-of-sample. Within this environment, two Proximal Policy Optimization architectures - both MLP and CNN feature extractors - demonstrate substantial improvements over industry baselines, with the CNN variant achieving 2.13 bps arrival slippage versus 5.23 bps for VWAP on 4,900 out-of-sample orders ($21B notional). These results validate the simulation's realism and provide strong single-policy baselines for quality-diversity methods.
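For readers unfamiliar with the algorithm, a minimal MAP-Elites loop looks like this; here the behaviour cell would bin liquidity and volatility conditions, and all callables are placeholders:

import random

def map_elites(evaluate, mutate, random_solution, n_iters=10000):
    # evaluate(sol) -> (fitness, cell); cell indexes the behaviour grid
    archive = {}                                  # cell -> (fitness, elite)
    for _ in range(n_iters):
        if archive and random.random() < 0.9:
            candidate = mutate(random.choice(list(archive.values()))[1])
        else:
            candidate = random_solution()
        fitness, cell = evaluate(candidate)
        if cell not in archive or fitness > archive[cell][0]:
            archive[cell] = (fitness, candidate)  # keep the best per cell
    return archive                                # a portfolio of specialists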
【10】Data-Driven Generation of Neutron Star Equations of State Using Variational Autoencoders
标题:使用变分自动编码器的数据驱动生成中子星状态方程
链接:https://arxiv.org/abs/2601.21231
作者:Alex Ross,Tianqi Zhao,Sanjay Reddy
备注:12 pages, 8 figures. In preparation for submission to Machine Learning: Science and Technology
摘要:We develop a machine learning model based on a structured variational autoencoder (VAE) framework to reconstruct and generate neutron star (NS) equations of state (EOS). The VAE consists of an encoder network that maps high-dimensional EOS data into a lower-dimensional latent space and a decoder network that reconstructs the full EOS from the latent representation. The latent space includes supervised NS observables derived from the training EOS data, as well as latent random variables corresponding to additional unspecified EOS features learned automatically. Sampling the latent space enables the generation of new, causal, and stable EOS models that satisfy astronomical constraints on the supervised NS observables, while allowing Bayesian inference of the EOS incorporating additional multimessenger data, including gravitational waves from LIGO/Virgo and mass and radius measurements of pulsars. Based on a VAE trained on a Skyrme EOS dataset, we find that a latent space with two supervised NS observables, the maximum mass $M_{\max}$ and the canonical radius $R_{1.4}$, together with one latent random variable controlling the EOS near the crust--core transition, can already reconstruct Skyrme EOSs with high fidelity, achieving mean absolute percentage errors of approximately $0.15\%$ for $M_{\max}$ and $R_{1.4}$ derived from the decoder-reconstructed EOS.
半/弱/无/有监督|不确定性|主动学习(12篇)
【1】Generalized Information Gathering Under Dynamics Uncertainty
标题:动态不确定性下的广义信息收集
链接:https://arxiv.org/abs/2601.21988
作者:Fernando Palafox,Jingqi Li,Jesse Milzman,David Fridovich-Keil
摘要:An agent operating in an unknown dynamical system must learn its dynamics from observations. Active information gathering accelerates this learning, but existing methods derive bespoke costs for specific modeling choices: dynamics models, belief update procedures, observation models, and planners. We present a unifying framework that decouples these choices from the information-gathering cost by explicitly exposing the causal dependencies between parameters, beliefs, and controls. Using this framework, we derive a general information-gathering cost based on Massey's directed information that assumes only Markov dynamics with additive noise and is otherwise agnostic to modeling choices. We prove that the mutual information cost used in existing literature is a special case of our cost. Then, we leverage our framework to establish an explicit connection between the mutual information cost and information gain in linearized Bayesian estimation, thereby providing theoretical justification for mutual information-based active learning approaches. Finally, we illustrate the practical utility of our framework through experiments spanning linear, nonlinear, and multi-agent systems.
【2】MoE-ACT: Improving Surgical Imitation Learning Policies through Supervised Mixture-of-Experts
标题:MoE-ACT:通过有监督专家混合改进手术模仿学习策略
链接:https://arxiv.org/abs/2601.21971
作者:Lorenzo Mazza,Ariel Rodriguez,Rayan Younis,Martin Lelis,Ortrun Hellig,Chenpan Li,Sebastian Bodenstedt,Martin Wagner,Stefanie Speidel
摘要:Imitation learning has achieved remarkable success in robotic manipulation, yet its application to surgical robotics remains challenging due to data scarcity, constrained workspaces, and the need for an exceptional level of safety and predictability. We present a supervised Mixture-of-Experts (MoE) architecture designed for phase-structured surgical manipulation tasks, which can be added on top of any autonomous policy. Unlike prior surgical robot learning approaches that rely on multi-camera setups or thousands of demonstrations, we show that a lightweight action decoder policy like Action Chunking Transformer (ACT) can learn complex, long-horizon manipulation from less than 150 demonstrations using solely stereo endoscopic images, when equipped with our architecture. We evaluate our approach on the collaborative surgical task of bowel grasping and retraction, where a robot assistant interprets visual cues from a human surgeon, executes targeted grasping on deformable tissue, and performs sustained retraction. We benchmark our method against state-of-the-art Vision-Language-Action (VLA) models and the standard ACT baseline. Our results show that generalist VLAs fail to acquire the task entirely, even under standard in-distribution conditions. Furthermore, while standard ACT achieves moderate success in-distribution, adopting a supervised MoE architecture significantly boosts its performance, yielding higher success rates in-distribution and demonstrating superior robustness in out-of-distribution scenarios, including novel grasp locations, reduced illumination, and partial occlusions. Notably, it generalizes to unseen testing viewpoints and also transfers zero-shot to ex vivo porcine tissue without additional training, offering a promising pathway toward in vivo deployment. To support this, we present qualitative preliminary results of policy roll-outs during in vivo porcine surgery.
【3】Uncertainty-Aware Data-Based Method for Fast and Reliable Shape Optimization
标题:面向快速可靠形状优化的不确定性感知数据驱动方法
链接:https://arxiv.org/abs/2601.21956
作者:Yunjia Yang,Runze Li,Yufei Zhang,Haixin Chen
摘要:Data-based optimization (DBO) offers a promising approach for efficiently optimizing shape for better aerodynamic performance by leveraging a pretrained surrogate model for offline evaluations during iterations. However, DBO heavily relies on the quality of the training database. Samples outside the training distribution encountered during optimization can lead to significant prediction errors, potentially misleading the optimization process. Therefore, incorporating uncertainty quantification into optimization is critical for detecting outliers and enhancing robustness. This study proposes an uncertainty-aware data-based optimization (UA-DBO) framework to monitor and minimize surrogate model uncertainty during DBO. A probabilistic encoder-decoder surrogate model is developed to predict uncertainties associated with its outputs, and these uncertainties are integrated into a model-confidence-aware objective function to penalize samples with large prediction errors during data-based optimization process. The UA-DBO framework is evaluated on two multipoint optimization problems aimed at improving airfoil drag divergence and buffet performance. Results demonstrate that UA-DBO consistently reduces prediction errors in optimized samples and achieves superior performance gains compared to original DBO. Moreover, compared to multipoint optimization based on full computational simulations, UA-DBO offers comparable optimization effectiveness while significantly accelerating optimization speed.
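The confidence-aware objective amounts to an uncertainty penalty on the surrogate's prediction; a minimal form (the weight beta is our illustrative choice) is:

def confidence_aware_objective(surrogate, x, beta=2.0):
    # surrogate(x) -> (mean, std): probabilistic prediction of, e.g., drag.
    # Candidates the model is uncertain about are penalized, steering the
    # optimizer away from out-of-distribution shapes.
    mean, std = surrogate(x)
    return mean + beta * std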
【4】Embracing Aleatoric Uncertainty in Medical Multimodal Learning with Missing Modalities
标题:在模态缺失的医学多模态学习中拥抱偶然不确定性
链接:https://arxiv.org/abs/2601.21950
作者:Linxiao Gong,Yang Liu,Lianlong Sun,Yulai Bi,Jing Liu,Xiaoguang Zhu
摘要:Medical multimodal learning faces significant challenges with missing modalities prevalent in clinical practice. Existing approaches assume equal contributions across modalities and random missing patterns, neglecting inherent uncertainty in medical data acquisition. In this regard, we propose Aleatoric Uncertainty Modeling (AUM), which explicitly quantifies unimodal aleatoric uncertainty to address missing modalities. Specifically, AUM models each unimodal representation as a multivariate Gaussian distribution to capture aleatoric uncertainty and enable principled modality reliability quantification. To adaptively aggregate captured information, we develop a dynamic message-passing mechanism within a bipartite patient-modality graph using an uncertainty-aware aggregation mechanism. Through this process, missing modalities are naturally accommodated, while more reliable information from available modalities is dynamically emphasized to guide representation generation. Our AUM framework achieves an improvement of 2.26% AUC-ROC on MIMIC-IV mortality prediction and a 2.17% gain on eICU, outperforming existing state-of-the-art approaches.
【5】XFACTORS: Disentangled Information Bottleneck via Contrastive Supervision
标题:XFACTORS:通过对比监督实现解耦信息瓶颈
链接:https://arxiv.org/abs/2601.21688
作者:Alexandre Myara,Nicolas Bourriez,Thomas Boyer,Thomas Lemercier,Ihab Bendidi,Auguste Genovesio
摘要:Disentangled representation learning aims to map independent factors of variation to independent representation components. On one hand, purely unsupervised approaches have proven successful on fully disentangled synthetic data, but fail to recover semantic factors from real data without strong inductive biases. On the other hand, supervised approaches are unstable and hard to scale to large attribute sets because they rely on adversarial objectives or auxiliary classifiers. We introduce \textsc{XFactors}, a weakly-supervised VAE framework that disentangles and provides explicit control over a chosen set of factors. Building on the Disentangled Information Bottleneck perspective, we decompose the representation into factor-specific subspaces $\mathcal{T}_1,\ldots,\mathcal{T}_K$ and a residual subspace $\mathcal{S}$. Each target factor is encoded in its assigned $\mathcal{T}_i$ through contrastive supervision: an InfoNCE loss pulls together latents sharing the same factor value and pushes apart mismatched pairs. In parallel, KL regularization imposes a Gaussian structure on both $\mathcal{S}$ and the aggregated factor subspaces, organizing the geometry without additional supervision for non-targeted factors and avoiding adversarial training and classifiers. Across multiple datasets, with constant hyperparameters, \textsc{XFactors} achieves state-of-the-art disentanglement scores and yields consistent qualitative factor alignment in the corresponding subspaces, enabling controlled factor swapping via latent replacement. We further demonstrate that our method scales correctly with increasing latent capacity and evaluate it on the real-world dataset CelebA. Our code is available at \href{https://github.com/ICML26-anon/XFactors}{github.com/ICML26-anon/XFactors}.
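The per-factor contrastive supervision is a standard InfoNCE term restricted to one subspace; a sketch follows, where the temperature and cosine similarity are our assumptions:

import torch
import torch.nn.functional as F

def factor_infonce(z, z_pos, z_negs, tau=0.1):
    # z, z_pos: (batch, d) latents in subspace T_i sharing the factor value;
    # z_negs: (batch, n_neg, d) latents with mismatched factor values.
    pos = F.cosine_similarity(z, z_pos, dim=-1) / tau
    neg = F.cosine_similarity(z.unsqueeze(1), z_negs, dim=-1) / tau
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)
    target = torch.zeros(len(z), dtype=torch.long)   # positive sits at index 0
    return F.cross_entropy(logits, target)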
【6】Can Local Learning Match Self-Supervised Backpropagation?
标题:局部学习能否媲美自监督的反向传播?
链接:https://arxiv.org/abs/2601.21683
作者:Wu S. Zihan,Ariane Delrocq,Wulfram Gerstner,Guillaume Bellec
摘要:While end-to-end self-supervised learning with backpropagation (global BP-SSL) has become central for training modern AI systems, theories of local self-supervised learning (local-SSL) have struggled to build functional representations in deep neural networks. To establish a link between global and local rules, we first develop a theory for deep linear networks: we identify conditions for local-SSL algorithms (like Forward-forward or CLAPP) to implement exactly the same weight update as a global BP-SSL. Starting from the theoretical insights, we then develop novel variants of local-SSL algorithms to approximate global BP-SSL in deep non-linear convolutional neural networks. Variants that improve the similarity between gradient updates of local-SSL with those of global BP-SSL also show better performance on image datasets (CIFAR-10, STL-10, and Tiny ImageNet). The best local-SSL rule with the CLAPP loss function matches the performance of a comparable global BP-SSL with InfoNCE or CPC-like loss functions, and improves upon state-of-the-art for local SSL on these benchmarks.
【7】Epistemic Uncertainty Quantification for Pre-trained VLMs via Riemannian Flow Matching
标题:通过黎曼流匹配对预训练的VLM进行认知不确定性量化
链接:https://arxiv.org/abs/2601.21662
作者:Li Ju,Mayank Nautiyal,Andreas Hellander,Ekta Vats,Prashant Singh
摘要:Vision-Language Models (VLMs) are typically deterministic in nature and lack intrinsic mechanisms to quantify epistemic uncertainty, which reflects the model's lack of knowledge or ignorance of its own representations. We theoretically motivate negative log-density of an embedding as a proxy for the epistemic uncertainty, where low-density regions signify model ignorance. The proposed method REPVLM computes the probability density on the hyperspherical manifold of the VLM embeddings using Riemannian Flow Matching. We empirically demonstrate that REPVLM achieves near-perfect correlation between uncertainty and prediction error, significantly outperforming existing baselines. Beyond classification, we also demonstrate that the model also provides a scalable metric for out-of-distribution detection and automated data curation.
【8】Evaluating Prediction Uncertainty Estimates from BatchEnsemble
标题:评估BatchEnsemble的预测不确定性估计
链接:https://arxiv.org/abs/2601.21581
作者:Morten Blørstad,Herman Jangsett Mostein,Nello Blaser,Pekka Parviainen
备注:17 pages, 19 figures
摘要:Deep learning models struggle with uncertainty estimation. Many approaches are either computationally infeasible or underestimate uncertainty. We investigate \textit{BatchEnsemble} as a general and scalable method for uncertainty estimation across both tabular and time series tasks. To extend BatchEnsemble to sequential modeling, we introduce GRUBE, a novel BatchEnsemble GRU cell. We compare BatchEnsemble to Monte Carlo dropout and deep ensemble models. Our results show that BatchEnsemble matches the uncertainty estimation performance of deep ensembles, and clearly outperforms Monte Carlo dropout. GRUBE achieves similar or better performance in both prediction and uncertainty estimation. These findings show that BatchEnsemble and GRUBE achieve similar performance with fewer parameters and reduced training and inference time compared to traditional ensembles.
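BatchEnsemble itself is well documented: all members share one weight matrix and each member owns only rank-1 "fast weights", so member $k$ effectively applies $W \circ (r_k s_k^\top)$ at near single-model cost. A minimal linear-layer version (a GRUBE-style cell would apply the same trick inside a GRU's gate matrices):
```python
import torch
import torch.nn as nn

class BatchEnsembleLinear(nn.Module):
    """Shared weight W plus per-member rank-1 factors r_k (input) and s_k (output)."""
    def __init__(self, d_in, d_out, n_members):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)
        self.r = nn.Parameter(torch.ones(n_members, d_in))
        self.s = nn.Parameter(torch.ones(n_members, d_out))

    def forward(self, x, member):
        # x: [B, d_in]; member: [B] ensemble-member index per example
        return self.W(x * self.r[member]) * self.s[member]

layer = BatchEnsembleLinear(8, 4, n_members=4)
x = torch.randn(16, 8)
member = torch.randint(0, 4, (16,))
print(layer(x, member).shape)  # torch.Size([16, 4])
```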
【9】PPI-SVRG: Unifying Prediction-Powered Inference and Variance Reduction for Semi-Supervised Optimization
标题:PPI-SVRG:统一用于半监督优化的预测驱动推断与方差缩减
链接:https://arxiv.org/abs/2601.21470
作者:Ruicheng Ao,Hongyu Chen,Haoyang Liu,David Simchi-Levi,Will Wei Sun
备注:27 pages, 4 figures
摘要:We study semi-supervised stochastic optimization when labeled data is scarce but predictions from pre-trained models are available. Prediction-powered inference (PPI) and stochastic variance reduced gradient (SVRG) both reduce variance through control variates -- PPI uses predictions, SVRG uses reference gradients. We show they are mathematically equivalent and develop PPI-SVRG, which combines both. Our convergence bound decomposes into the standard SVRG rate plus an error floor from prediction uncertainty. The rate depends only on loss geometry; predictions affect only the neighborhood size. When predictions are perfect, we recover SVRG exactly. When predictions degrade, convergence remains stable but reaches a larger neighborhood. Experiments confirm the theory: PPI-SVRG reduces MSE by 43--52\% under label scarcity on mean estimation benchmarks and improves test accuracy by 2.7--2.9 percentage points on MNIST with only 10\% labeled data.
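The shared control-variate structure is easiest to see in the classic prediction-powered mean estimator: predictions on abundant unlabeled data carry the low-variance signal, and the scarce labels debias them, mirroring SVRG's reference-gradient correction. The synthetic numbers below are purely illustrative.
```python
import numpy as np

rng = np.random.default_rng(0)
theta = 2.0
n_unlab, n_lab = 10_000, 50
# labels are noisy; predictions track each label closely but carry a +0.3 bias
y_lab = theta + rng.standard_normal(n_lab)
y_lab_pred = y_lab + 0.3 + 0.1 * rng.standard_normal(n_lab)
y_unlab = theta + rng.standard_normal(n_unlab)
y_unlab_pred = y_unlab + 0.3 + 0.1 * rng.standard_normal(n_unlab)

# control variate: cheap prediction-based signal plus a labeled-data bias correction
ppi_estimate = y_unlab_pred.mean() + (y_lab - y_lab_pred).mean()
print(ppi_estimate, y_lab.mean())  # PPI estimate vs. naive labeled-only mean
```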
【10】Learning to Optimize Job Shop Scheduling Under Structural Uncertainty
标题:结构不确定性下学习优化车间调度
链接:https://arxiv.org/abs/2601.21389
作者:Rui Zhang,Jianwei Niu,Xuefeng Liu,Shaojie Tang,Jing Yuan
摘要:The Job-Shop Scheduling Problem (JSSP), under various forms of manufacturing uncertainty, has recently attracted considerable research attention. Most existing studies focus on parameter uncertainty, such as variable processing times, and typically adopt the actor-critic framework. In this paper, we explore a different but prevalent form of uncertainty in JSSP: structural uncertainty. Structural uncertainty arises when a job may follow one of several routing paths, and the selection is determined not by policy, but by situational factors (e.g., the quality of intermediate products) that cannot be known in advance. Existing methods struggle to address this challenge due to incorrect credit assignment: a high-quality action may be unfairly penalized if it is followed by a time-consuming path. To address this problem, we propose a novel method named UP-AAC. In contrast to conventional actor-critic methods, UP-AAC employs an asymmetric architecture. While its actor receives a standard stochastic state, the critic is crucially provided with a deterministic state reconstructed in hindsight. This design allows the critic to learn a more accurate value function, which in turn provides a lower-variance policy gradient to the actor, leading to more stable learning. In addition, we design an attention-based Uncertainty Perception Model (UPM) to enhance the actor's scheduling decisions. Extensive experiments demonstrate that our method outperforms existing approaches in reducing makespan on benchmark instances.
【11】Distributionally Robust Classification for Multi-source Unsupervised Domain Adaptation
标题:多源无监督领域自适应的分布鲁棒分类
链接:https://arxiv.org/abs/2601.21315
作者:Seonghwi Kim,Sung Ho Jo,Wooseok Ha,Minwoo Chae
备注:Accepted at ICLR 2026. 10 pages (excluding references)
摘要:Unsupervised domain adaptation (UDA) is a statistical learning problem when the distribution of training (source) data is different from that of test (target) data. In this setting, one has access to labeled data only from the source domain and unlabeled data from the target domain. The central objective is to leverage the source data and the unlabeled target data to build models that generalize to the target domain. Despite its potential, existing UDA approaches often struggle in practice, particularly in scenarios where the target domain offers only limited unlabeled data or spurious correlations dominate the source domain. To address these challenges, we propose a novel distributionally robust learning framework that models uncertainty in both the covariate distribution and the conditional label distribution. Our approach is motivated by the multi-source domain adaptation setting but is also directly applicable to the single-source scenario, making it versatile in practice. We develop an efficient learning algorithm that can be seamlessly integrated with existing UDA methods. Extensive experiments under various distribution shift scenarios show that our method consistently outperforms strong baselines, especially when target data are extremely scarce.
【12】Test-Time Adaptation for Unsupervised Combinatorial Optimization
标题:无监督组合优化的测试时间自适应
链接:https://arxiv.org/abs/2601.21048
作者:Yiqiao Liao,Farinaz Koushanfar,Parinaz Naghizadeh
摘要:Unsupervised neural combinatorial optimization (NCO) enables learning powerful solvers without access to ground-truth solutions. Existing approaches fall into two disjoint paradigms: models trained for generalization across instances, and instance-specific models optimized independently at test time. While the former are efficient during inference, they lack effective instance-wise adaptability; the latter are flexible but fail to exploit learned inductive structure and are prone to poor local optima. This motivates the central question of our work: how can we leverage the inductive bias learned through generalization while unlocking the flexibility required for effective instance-wise adaptation? We first identify a challenge in bridging these two paradigms: generalization-focused models often constitute poor warm starts for instance-wise optimization, potentially underperforming even randomly initialized models when fine-tuned at test time. To resolve this incompatibility, we propose TACO, a model-agnostic test-time adaptation framework that unifies and extends the two existing paradigms for unsupervised NCO. TACO applies strategic warm-starting to partially relax trained parameters while preserving inductive bias, enabling rapid and effective unsupervised adaptation. Crucially, compared to naively fine-tuning a trained generalizable model or optimizing an instance-specific model from scratch, TACO achieves better solution quality while incurring negligible additional computational cost. Experiments on canonical CO problems, Minimum Vertex Cover and Maximum Clique, demonstrate the effectiveness and robustness of TACO across static, distribution-shifted, and dynamic combinatorial optimization problems, establishing it as a practical bridge between generalizable and instance-specific unsupervised NCO.
迁移|Zero/Few/One-Shot|自适应(19篇)
【1】Routing the Lottery: Adaptive Subnetworks for Heterogeneous Data
标题:彩票路由:面向异构数据的自适应子网络
链接:https://arxiv.org/abs/2601.22141
作者:Grzegorz Stefanski,Alberto Presta,Michal Byra
摘要:In pruning, the Lottery Ticket Hypothesis posits that large networks contain sparse subnetworks, or winning tickets, that can be trained in isolation to match the performance of their dense counterparts. However, most existing approaches assume a single universal winning ticket shared across all inputs, ignoring the inherent heterogeneity of real-world data. In this work, we propose Routing the Lottery (RTL), an adaptive pruning framework that discovers multiple specialized subnetworks, called adaptive tickets, each tailored to a class, semantic cluster, or environmental condition. Across diverse datasets and tasks, RTL consistently outperforms single- and multi-model baselines in balanced accuracy and recall, while using up to 10 times fewer parameters than independent models and exhibiting semantically aligned specialization. Furthermore, we identify subnetwork collapse, a performance drop under aggressive pruning, and introduce a subnetwork similarity score that enables label-free diagnosis of oversparsification. Overall, our results recast pruning as a mechanism for aligning model structure with data heterogeneity, paving the way toward more modular and context-aware deep learning.
【2】PRISM: Distribution-free Adaptive Computation of Matrix Functions for Accelerating Neural Network Training
标题:PRISM:用于加速神经网络训练的矩阵函数无分布自适应计算
链接:https://arxiv.org/abs/2601.22137
作者:Shenghao Yang,Zhichao Wang,Oleg Balabanov,N. Benjamin Erichson,Michael W. Mahoney
摘要:Matrix functions such as square root, inverse roots, and orthogonalization play a central role in preconditioned gradient methods for neural network training. This has motivated the development of iterative algorithms that avoid explicit eigendecompositions and rely primarily on matrix multiplications, making them well suited for modern GPU accelerators. We present PRISM (Polynomial-fitting and Randomized Iterative Sketching for Matrix functions computation), a general framework for accelerating iterative algorithms for computing matrix functions. PRISM combines adaptive polynomial approximation with randomized sketching: at each iteration, it fits a polynomial surrogate to the current spectrum via a sketched least-squares problem, adapting to the instance at hand with minimal overhead. We apply PRISM to accelerate Newton-Schulz-like iterations for matrix square roots and orthogonalization, which are core primitives in machine learning. Unlike prior methods, PRISM requires no explicit spectral bounds or singular value estimates; and it adapts automatically to the evolving spectrum. Empirically, PRISM accelerates training when integrated into Shampoo and Muon optimizers.
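For context, the non-adaptive baseline that PRISM-style methods accelerate is the cubic Newton-Schulz iteration for orthogonalization (the primitive behind Muon-style optimizers). A bare-bones version, with the crude norm-based scaling that adaptive polynomial fitting is meant to improve on:
```python
import torch

def newton_schulz_orthogonalize(G, steps=30):
    """Cubic Newton-Schulz: drives the singular values of G toward 1,
    converging to the orthogonal polar factor. Pure matrix multiplies."""
    X = G / (G.norm() + 1e-7)   # Frobenius scaling puts singular values in (0, 1]
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

torch.manual_seed(0)
G = torch.randn(6, 6)
Q = newton_schulz_orthogonalize(G)
print((Q @ Q.T - torch.eye(6)).norm())  # near-zero residual: Q is nearly orthogonal
```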
【3】Entropy-Based Dimension-Free Convergence and Loss-Adaptive Schedules for Diffusion Models
标题:扩散模型的基于熵的无维度收敛与损失自适应调度
链接:https://arxiv.org/abs/2601.21943
作者:Ahmad Aghapour,Erhan Bayraktar,Ziqing Zhang
摘要:Diffusion generative models synthesize samples by discretizing reverse-time dynamics driven by a learned score (or denoiser). Existing convergence analyses of diffusion models typically scale at least linearly with the ambient dimension, and sharper rates often depend on intrinsic-dimension assumptions or other geometric restrictions on the target distribution. We develop an alternative, information-theoretic approach to dimension-free convergence that avoids any geometric assumptions. Under mild assumptions on the target distribution, we bound the KL divergence between the target and generated distributions by $O(H^2/K)$ (up to endpoint factors), where $H$ is the Shannon entropy and $K$ is the number of sampling steps. Moreover, using a reformulation of the KL divergence, we propose a Loss-Adaptive Schedule (LAS) for efficient discretization of the reverse SDE which is lightweight and relies only on the training loss, requiring no heavy post-training computation. Empirically, LAS improves sampling quality over common heuristic schedules.
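One way to picture a loss-driven schedule: place the $K$ reverse-time steps so that each interval carries an equal share of the recorded training-loss mass, putting more steps where the loss (and hence discretization error) concentrates. The equal-mass rule below is our illustrative reading, not necessarily the paper's exact LAS construction.
```python
import numpy as np

def loss_adaptive_schedule(t_grid, loss, K):
    """Choose K+1 time points so each interval holds an equal share of loss mass."""
    cum = np.cumsum(loss)
    cum = np.concatenate([[0.0], cum]) / cum[-1]   # normalized cumulative loss mass
    t_ext = np.concatenate([[t_grid[0]], t_grid])
    return np.interp(np.linspace(0.0, 1.0, K + 1), cum, t_ext)

t_grid = np.linspace(1e-3, 1.0, 100)   # fine reference grid
loss = 1.0 / t_grid                    # e.g. loss concentrated near t = 0
print(loss_adaptive_schedule(t_grid, loss, K=8))  # denser steps at small t
```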
【4】Low-Rank Plus Sparse Matrix Transfer Learning under Growing Representations and Ambient Dimensions
标题:增长表示和环境维数下的低秩加稀疏矩阵迁移学习
链接:https://arxiv.org/abs/2601.21873
作者:Jinhang Chai,Xuyuan Liu,Elynn Chen,Yujun Yan
摘要:Learning systems often expand their ambient features or latent representations over time, embedding earlier representations into larger spaces with limited new latent structure. We study transfer learning for structured matrix estimation under simultaneous growth of the ambient dimension and the intrinsic representation, where a well-estimated source task is embedded as a subspace of a higher-dimensional target task. We propose a general transfer framework in which the target parameter decomposes into an embedded source component, low-dimensional low-rank innovations, and sparse edits, and develop an anchored alternating projection estimator that preserves transferred subspaces while estimating only low-dimensional innovations and sparse modifications. We establish deterministic error bounds that separate target noise, representation growth, and source estimation error, yielding strictly improved rates when rank and sparsity increments are small. We demonstrate the generality of the framework by applying it to two canonical problems. For Markov transition matrix estimation from a single trajectory, we derive end-to-end theoretical guarantees under dependent noise. For structured covariance estimation under enlarged dimensions, we provide complementary theoretical analysis in the appendix and empirically validate consistent transfer gains.
【5】Goal-Driven Adaptive Sampling Strategies for Machine Learning Models Predicting Fields
标题:用于预测场的机器学习模型的目标驱动自适应采样策略
链接:https://arxiv.org/abs/2601.21832
作者:Jigar Parekh,Philipp Bekemeyer
摘要:Machine learning models are widely regarded as a way forward to tackle multi-query challenges that arise once expensive black-box simulations such as computational fluid dynamics are investigated. However, ensuring the desired level of accuracy for a certain task at minimal computational cost, e.g. with as few black-box samples as possible, remains a challenge. Active learning strategies are used for scalar quantities to overcome this challenge, and different so-called infill criteria exist and are commonly employed in several scenarios. Even though needed in various fields, an extension of active learning strategies towards field predictions is still lacking or limited to very specific scenarios and/or model types. In this paper we propose an active learning strategy for machine learning models that are capable of predicting fields, which is agnostic to the model architecture itself. To do so, we combine a well-established Gaussian process model for a scalar reference value and simultaneously aim at reducing the epistemic model error and the difference between scalar and field predictions. Different specific forms of the above-mentioned approach are introduced and compared to each other as well as to purely scalar-valued infill. Results are presented for the NASA common research model on an uncertainty propagation task, showcasing a high level of accuracy at significantly smaller cost compared to an approach without active learning.
【6】Signal-Adaptive Trust Regions for Gradient-Free Optimization of Recurrent Spiking Neural Networks
标题:循环尖峰神经网络无梯度优化的信号自适应信任域
链接:https://arxiv.org/abs/2601.21572
作者:Jinhao Li,Yuhao Sun,Zhiyuan Ma,Hao He,Xinche Zhang,Xing Chen,Jin Li,Sen Song
摘要:Recurrent spiking neural networks (RSNNs) are a promising substrate for energy-efficient control policies, but training them for high-dimensional, long-horizon reinforcement learning remains challenging. Population-based, gradient-free optimization circumvents backpropagation through non-differentiable spike dynamics by estimating gradients. However, with finite populations, high variance of these estimates can induce harmful and overly aggressive update steps. Inspired by trust-region methods in reinforcement learning that constrain policy updates in distribution space, we propose \textbf{Signal-Adaptive Trust Regions (SATR)}, a distributional update rule that constrains relative change by bounding KL divergence normalized by an estimated signal energy. SATR automatically expands the trust region under strong signals and contracts it when updates are noise-dominated. We instantiate SATR for Bernoulli connectivity distributions, which have shown strong empirical performance for RSNN optimization. Across a suite of high-dimensional continuous-control benchmarks, SATR improves stability under limited populations and reaches competitive returns against strong baselines including PPO-LSTM. In addition, to make SATR practical at scale, we introduce a bitset implementation for binary spiking and binary weights, substantially reducing wall-clock training time and enabling fast RSNN policy search.
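A rough sketch of enforcing a signal-normalized KL trust region on Bernoulli connectivity parameters; the backtracking enforcement and all constants are our assumptions about how such a bound could be implemented.
```python
import numpy as np

def bernoulli_kl(p, q):
    p, q = np.clip(p, 1e-6, 1 - 1e-6), np.clip(q, 1e-6, 1 - 1e-6)
    return np.sum(p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q)))

def satr_step(p, grad, signal_energy, delta=0.05, lr=0.5):
    """Shrink the update until KL(new || old) / signal_energy fits the trust region:
    strong signals admit large steps, noise-dominated updates get contracted."""
    step = lr
    while step > 1e-6:
        p_new = np.clip(p + step * grad, 1e-6, 1 - 1e-6)
        if bernoulli_kl(p_new, p) / max(signal_energy, 1e-8) <= delta:
            return p_new
        step *= 0.5
    return p

p = np.full(32, 0.5)   # connection probabilities of a Bernoulli distribution
grad = np.random.default_rng(0).standard_normal(32)
print(satr_step(p, grad, signal_energy=0.1)[:4])
```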
【7】SAL: Selective Adaptive Learning for Backpropagation-Free Training with Sparsification
标题:SAL:采用稀疏化的无反向传播训练的选择性自适应学习
链接:https://arxiv.org/abs/2601.21561
作者:Fanping Liu,Hua Yang,Jiasi Zou
摘要:Standard deep learning relies on Backpropagation (BP), which is constrained by biologically implausible weight symmetry and suffers from significant gradient interference within dense representations. To mitigate these bottlenecks, we propose Selective Adaptive Learning (SAL), a training method that combines selective parameter activation with adaptive area partitioning. Specifically, SAL decomposes the parameter space into mutually exclusive, sample-dependent regions. This decoupling mitigates gradient interference across divergent semantic patterns and addresses explicit weight symmetry requirements through our refined feedback alignment. Empirically, SAL demonstrates competitive convergence rates, leading to improved classification performance across 10 standard benchmarks. Additionally, SAL achieves numerical consistency and competitive accuracy even in deep regimes (up to 128 layers) and large-scale models (up to 1B parameters). Our approach is loosely inspired by biological learning mechanisms, offering a plausible alternative that contributes to the study of scalable neural network training.
【8】Cascaded Transfer: Learning Many Tasks under Budget Constraints
标题:级联迁移:在预算约束下学习大量任务
链接:https://arxiv.org/abs/2601.21513
作者:Eloi Campagne,Yvenn Amara-Ouali,Yannig Goude,Mathilde Mougeot,Argyris Kalogeratos
摘要:Many-Task Learning refers to the setting where a large number of related tasks need to be learned while the exact relationships between tasks are not known. We introduce Cascaded Transfer Learning, a novel many-task transfer learning paradigm where information (e.g. model parameters) cascades hierarchically through tasks that are learned by individual models of the same class, while respecting given budget constraints. The cascade is organized as a rooted tree that specifies the order in which tasks are learned and refined. We design a cascaded transfer mechanism deployed over a minimum spanning tree structure that connects the tasks according to a suitable distance measure, and allocates the available training budget along its branches. Experiments on synthetic and real many-task settings show that the resulting method enables more accurate and cost-effective adaptation across large task collections compared to alternative approaches.
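The tree construction is easy to sketch with an off-the-shelf MST routine: build the minimum spanning tree over pairwise task distances, then traverse it from a root task so that parameters cascade parent-to-child along its edges. The budget-allocation rule along branches is omitted here.
```python
import numpy as np
from collections import deque
from scipy.sparse.csgraph import minimum_spanning_tree

def cascade_edges(dist, root=0):
    """Return (parent, child) edges of the MST over task distances, in the
    breadth-first order in which models would be trained and transferred."""
    mst = minimum_spanning_tree(dist).toarray()
    adj = np.maximum(mst, mst.T) > 0                # undirected adjacency
    edges, seen, queue = [], {root}, deque([root])
    while queue:
        u = queue.popleft()
        for v in np.flatnonzero(adj[u]):
            if int(v) not in seen:
                seen.add(int(v))
                queue.append(int(v))
                edges.append((u, int(v)))
    return edges

dist = np.array([[0, 1, 4], [1, 0, 2], [4, 2, 0]], dtype=float)
print(cascade_edges(dist))  # [(0, 1), (1, 2)]
```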
【9】Task-free Adaptive Meta Black-box Optimization
标题:无任务自适应Meta黑箱优化
链接:https://arxiv.org/abs/2601.21475
作者:Chao Wang,Licheng Jiao,Lingling Li,Jiaxuan Zhao,Guanchun Wang,Fang Liu,Shuyuan Yang
备注:This article was published as a conference paper at ICLR 2026
摘要:Handcrafted optimizers become prohibitively inefficient for complex black-box optimization (BBO) tasks. MetaBBO addresses this challenge by meta-learning to automatically configure optimizers for low-level BBO tasks, thereby eliminating heuristic dependencies. However, existing methods typically require extensive handcrafted training tasks to learn meta-strategies that generalize to target tasks, which poses a critical limitation for realistic applications with unknown task distributions. To overcome the issue, we propose the Adaptive meta Black-box Optimization Model (ABOM), which performs online parameter adaptation using solely optimization data from the target task, obviating the need for predefined task distributions. Unlike conventional metaBBO frameworks that decouple meta-training and optimization phases, ABOM introduces a closed-loop adaptive parameter learning mechanism, where parameterized evolutionary operators continuously self-update by leveraging generated populations during optimization. This paradigm shift enables zero-shot optimization: ABOM achieves competitive performance on synthetic BBO benchmarks and realistic unmanned aerial vehicle path planning problems without any handcrafted training tasks. Visualization studies reveal that parameterized evolutionary operators exhibit statistically significant search patterns, including natural selection and genetic recombination.
【10】SAGE: Sequence-level Adaptive Gradient Evolution for Generative Recommendation
标题:SAGE:生成式推荐的序列级自适应梯度进化
链接:https://arxiv.org/abs/2601.21452
作者:Yu Xie,Xing Kai Ren,Ying Qi,Hu Yao
备注:arXiv admin note: text overlap with arXiv:2506.19235
摘要:While works such as OneRec have validated the scaling laws of Large Language Models (LLMs) in recommender systems, they rely on a cumbersome separate vocabulary. This dependency prevents the model architecture from reusing native LLM vocabularies, resulting in high maintenance costs and poor scalability. In response, we aim to efficiently reuse open-source LLM architectures without constructing a separate tokenization vocabulary. Furthermore, we identify that the optimization strategy of OneRec, Gradient Bounded Policy Optimization (GBPO), suffers from a "Symmetric Conservatism" problem: its static gradient boundaries structurally suppress the update momentum required for cold-start items and fail to prevent diversity collapse in high-noise environments. To address this issue, we propose SAGE (Sequence-level Adaptive Gradient Evolution), a unified optimization framework tailored for list-wise generative recommendation. SAGE introduces two key innovations: (1) Sequence-level Signal Decoupling: By combining a geometric mean importance ratio with decoupled multi-objective advantages, we eliminate token-level variance and resolve the "Reward Collapse" problem. (2) Asymmetric Adaptive Dynamics: We construct a dynamic gradient manifold that applies a "Boost Factor" to high-potential cold-start items to achieve super-linear updates and employs an "Entropy Aware Penalty" to break information cocoons. Theoretical analysis and empirical results demonstrate that SAGE effectively unblocks cold-start traffic and sustains recommendation diversity, all while retaining the numerical stability of GBPO.
【11】ConceptMoE: Adaptive Token-to-Concept Compression for Implicit Compute Allocation
标题:ConceptMoE:用于隐式计算分配的自适应令牌到概念压缩
链接:https://arxiv.org/abs/2601.21420
作者:Zihao Huang,Jundong Zhou,Xingwei Qu,Qiyang Min,Ge Zhang
摘要:Large language models allocate uniform computation across all tokens, ignoring that some sequences are trivially predictable while others require deep reasoning. We introduce ConceptMoE, which dynamically merges semantically similar tokens into concept representations, performing implicit token-level compute allocation. A learnable chunk module identifies optimal boundaries by measuring inter-token similarity, compressing sequences by a target ratio $R$ before they enter the compute-intensive concept model. Crucially, the MoE architecture enables controlled evaluation: we reallocate saved computation to match baseline activated FLOPs (excluding attention map computation) and total parameters, isolating genuine architectural benefits. Under these conditions, ConceptMoE consistently outperforms standard MoE across language and vision-language tasks, achieving +0.9 points on language pretraining, +2.3 points on long context understanding, and +0.6 points on multimodal benchmarks. When converting pretrained MoE during continual training with layer looping, gains reach +5.5 points, demonstrating practical applicability. Beyond performance, ConceptMoE reduces attention computation by up to $R^2\times$ and KV cache by $R\times$. At $R=2$, empirical measurements show prefill speedups reaching 175\% and decoding speedups up to 117\% on long sequences. The minimal architectural modifications enable straightforward integration into existing MoE, demonstrating that adaptive concept-level processing fundamentally improves both effectiveness and efficiency of large language models.
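A heuristic stand-in for the learned chunk module: cut the sequence where adjacent-token cosine similarity is lowest until the target ratio $R$ is met, then mean-pool each chunk into one concept vector. The paper's trainable module replaces this hand-written rule.
```python
import torch

def merge_tokens(h, ratio=2):
    """h: [T, D] token hidden states -> [T // ratio, D] concept vectors."""
    T = h.size(0)
    sim = torch.nn.functional.cosine_similarity(h[:-1], h[1:], dim=-1)  # [T-1]
    n_chunks = max(1, T // ratio)
    cuts = (sim.argsort()[: n_chunks - 1].sort().values + 1).tolist()   # lowest-similarity gaps
    bounds = [0] + cuts + [T]
    return torch.stack([h[a:b].mean(0) for a, b in zip(bounds, bounds[1:])])

h = torch.randn(12, 32)                 # 12 tokens, 32-dim hidden states
print(merge_tokens(h, ratio=2).shape)   # torch.Size([6, 32])
```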
【12】Few-Shot Learning for Dynamic Operations of Automated Electric Taxi Fleets under Evolving Charging Infrastructure: A Meta-Deep Reinforcement Learning Approach
标题:不断发展的充电基础设施下自动电动出租车车队动态运营的Few-Shot学习:元深度强化学习方法
链接:https://arxiv.org/abs/2601.21312
作者:Xiaozhuang Li,Xindi Tang,Fang He
摘要:With the rapid expansion of electric vehicles (EVs) and charging infrastructure, the effective management of Autonomous Electric Taxi (AET) fleets faces a critical challenge in environments with dynamic and uncertain charging availability. While most existing research assumes a static charging network, this simplification creates a significant gap between theoretical models and real-world operations. To bridge this gap, we propose GAT-PEARL, a novel meta-reinforcement learning framework that learns an adaptive operational policy. Our approach integrates a graph attention network (GAT) to effectively extract robust spatial representations under infrastructure layouts and model the complex spatiotemporal relationships of the urban environment, and employs probabilistic embeddings for actor-critic reinforcement learning (PEARL) to enable rapid, inference-based adaptation to changes in charging network layouts without retraining. Through extensive simulations on real-world data in Chengdu, China, we demonstrate that GAT-PEARL significantly outperforms conventional reinforcement learning baselines, showing superior generalization to unseen infrastructure layouts and achieving higher overall operational efficiency in dynamic settings.
【13】Rethinking Self-Training Based Cross-Subject Domain Adaptation for SSVEP Classification
标题:重新思考基于自训练的跨被试领域自适应SSVEP分类
链接:https://arxiv.org/abs/2601.21203
作者:Weiguang Wang,Yong Liu,Yingjie Gao,Guangyuan Xu
备注:Accepted to ICASSP 2026
摘要:Steady-state visually evoked potentials (SSVEP)-based brain-computer interfaces (BCIs) are widely used due to their high signal-to-noise ratio and user-friendliness. Accurate decoding of SSVEP signals is crucial for interpreting user intentions in BCI applications. However, signal variability across subjects and the costly user-specific annotation limit recognition performance. Therefore, we propose a novel cross-subject domain adaptation method built upon the self-training paradigm. Specifically, a Filter-Bank Euclidean Alignment (FBEA) strategy is designed to exploit frequency information from SSVEP filter banks. Then, we propose a Cross-Subject Self-Training (CSST) framework consisting of two stages: Pre-Training with Adversarial Learning (PTAL), which aligns the source and target distributions, and Dual-Ensemble Self-Training (DEST), which refines pseudo-label quality. Moreover, we introduce a Time-Frequency Augmented Contrastive Learning (TFA-CL) module to enhance feature discriminability across multiple augmented views. Extensive experiments on the Benchmark and BETA datasets demonstrate that our approach achieves state-of-the-art performance across varying signal lengths, highlighting its superiority.
【14】Order-Aware Test-Time Adaptation: Leveraging Temporal Dynamics for Robust Streaming Inference
标题:顺序感知的测试时自适应:利用时间动态实现稳健的流式推理
链接:https://arxiv.org/abs/2601.21012
作者:Young Kyung Kim,Oded Schlesinger,Qiangqiang Wu,J. Matías Di Martino,Guillermo Sapiro
备注:18 pages, 4 figures
摘要:Test-Time Adaptation (TTA) enables pre-trained models to adjust to distribution shift by learning from unlabeled test-time streams. However, existing methods typically treat these streams as independent samples, overlooking the supervisory signal inherent in temporal dynamics. To address this, we introduce Order-Aware Test-Time Adaptation (OATTA). We formulate test-time adaptation as a gradient-free recursive Bayesian estimation task, using a learned dynamic transition matrix as a temporal prior to refine the base model's predictions. To ensure safety in weakly structured streams, we introduce a likelihood-ratio gate (LLR) that reverts to the base predictor when temporal evidence is absent. OATTA is a lightweight, model-agnostic module that incurs negligible computational overhead. Extensive experiments across image classification, wearable and physiological signal analysis, and language sentiment analysis demonstrate its universality; OATTA consistently boosts established baselines, improving accuracy by up to 6.35%. Our findings establish that modeling temporal dynamics provides a critical, orthogonal signal beyond standard order-agnostic TTA approaches.
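The gradient-free recursive Bayesian step is compact: propagate the previous belief through the learned class-transition matrix, fuse it with the base model's probabilities, and only keep the refined posterior when the temporal prior actually raises the evidence. The gate statistic below is our simplification of the paper's likelihood-ratio gate.
```python
import numpy as np

def oatta_step(belief, probs, T, gate_threshold=0.0):
    """belief: previous posterior over classes; probs: base model's prediction;
    T[i, j]: learned probability of transitioning from class i to class j."""
    prior = T.T @ belief                          # temporal prior over classes
    posterior = prior * probs
    posterior /= posterior.sum()
    uniform = np.full_like(prior, 1.0 / len(prior))
    llr = np.log((prior * probs).sum()) - np.log((uniform * probs).sum())
    return posterior if llr > gate_threshold else probs / probs.sum()

T = np.array([[0.9, 0.1], [0.2, 0.8]])            # learned dynamic transition matrix
belief = np.array([0.5, 0.5])
for probs in [np.array([0.7, 0.3]), np.array([0.6, 0.4])]:
    belief = oatta_step(belief, probs, T)
print(belief)
```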
【15】Top-k on a Budget: Adaptive Ranking with Weak and Strong Oracles
标题:预算内的Top-k:利用弱、强预言机进行自适应排序
链接:https://arxiv.org/abs/2601.20989
作者:Lutz Oettershagen
摘要:Identifying the top-$k$ items is fundamental but often prohibitive when exact valuations are expensive. We study a two-oracle setting with a fast, noisy weak oracle and a scarce, high-fidelity strong oracle (e.g., human expert verification or expensive simulation). We first analyze a simple screen-then-certify baseline (STC) and prove it makes at most $m(4\varepsilon_{\max})$ strong calls given jointly valid weak confidence intervals with maximum radius $\varepsilon_{\max}$, where $m(\cdot)$ denotes the near-tie mass around the top-$k$ threshold. We establish a conditional lower bound of $\Omega(m(\varepsilon_{\max}))$ for any algorithm given the same weak uncertainty. Our main contribution is ACE, an adaptive certification algorithm that focuses strong queries on critical boundary items, achieving the same $O(m(4\varepsilon_{\max}))$ bound while reducing strong calls in practice. We then introduce ACE-W, a fully adaptive two-phase method that allocates weak budget adaptively before running ACE, further reducing strong costs.
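A deliberately simplified screen-then-certify loop: weak confidence intervals settle clear-cut items for free, and only items straddling the top-$k$ threshold (the near-tie mass $m(\cdot)$) pay a strong-oracle call. ACE's adaptivity in ordering those calls is not modeled here.
```python
def screen_then_certify(lo, hi, k, strong_oracle):
    """lo/hi: per-item weak-oracle confidence bounds. Returns top-k indices."""
    n = len(lo)
    thresh = sorted(lo, reverse=True)[k - 1]        # k-th largest lower bound
    values = {}
    for i in range(n):
        if hi[i] >= thresh and lo[i] < thresh:      # boundary item: interval straddles
            values[i] = strong_oracle(i)            # resolve with one exact call
        else:
            values[i] = (lo[i] + hi[i]) / 2         # weak estimate is already decisive
    return sorted(range(n), key=lambda i: values[i], reverse=True)[:k]

lo = [0.90, 0.70, 0.55, 0.50, 0.10]
hi = [1.00, 0.80, 0.75, 0.60, 0.20]
truth, calls = [0.95, 0.75, 0.60, 0.55, 0.15], []
def strong(i):
    calls.append(i)
    return truth[i]
print(screen_then_certify(lo, hi, k=2, strong_oracle=strong), calls)  # [0, 1] [2]
```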
【16】Pre-trained Encoders for Global Child Development: Transfer Learning Enables Deployment in Data-Scarce Settings
标题:经过预训练的全球儿童发展编码器:迁移学习支持在数据稀缺环境中部署
链接:https://arxiv.org/abs/2601.20987
作者:Md Muhtasim Munif Fahim,Md Rezaul Karim
摘要:A large number of children experience preventable developmental delays each year, yet the deployment of machine learning in new countries has been stymied by a data bottleneck: reliable models require thousands of samples, while new programs begin with fewer than 100. We introduce the first pre-trained encoder for global child development, trained on 357,709 children across 44 countries using UNICEF survey data. With only 50 training samples, the pre-trained encoder achieves an average AUC of 0.65 (95% CI: 0.56-0.72), outperforming cold-start gradient boosting at 0.61 by 8-12% across regions. At N=500, the encoder achieves an AUC of 0.73. Zero-shot deployment to unseen countries achieves AUCs up to 0.84. We apply a transfer learning bound to explain why pre-training diversity enables few-shot generalization. These results establish that pre-trained encoders can transform the feasibility of ML for SDG 4.2.1 monitoring in resource-constrained settings.
【17】VoxMorph: Scalable Zero-shot Voice Identity Morphing via Disentangled Embeddings
标题:VoxMorph:通过分解嵌入实现可扩展的Zero-Shot语音身份变形
链接:https://arxiv.org/abs/2601.20883
作者:Bharath Krishnamurthy,Ajita Rattani
备注:Accepted to IEEE ICASSP 2026 (51st International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2026). 5 pages, 1 figure, 3 tables. Project page: https://vcbsl.github.io/VoxMorph/
摘要:Morphing techniques generate artificial biometric samples that combine features from multiple individuals, allowing each contributor to be verified against a single enrolled template. While extensively studied in face recognition, this vulnerability remains largely unexplored in voice biometrics. Prior work on voice morphing is computationally expensive, non-scalable, and limited to acoustically similar identity pairs, constraining practical deployment. Moreover, existing sound-morphing methods target audio textures, music, or environmental sounds and are not transferable to voice identity manipulation. We propose VoxMorph, a zero-shot framework that produces high-fidelity voice morphs from as little as five seconds of audio per subject without model retraining. Our method disentangles vocal traits into prosody and timbre embeddings, enabling fine-grained interpolation of speaking style and identity. These embeddings are fused via Spherical Linear Interpolation (Slerp) and synthesized using an autoregressive language model coupled with a Conditional Flow Matching network. VoxMorph achieves state-of-the-art performance, delivering a 2.6x gain in audio quality, a 73% reduction in intelligibility errors, and a 67.8% morphing attack success rate on automated speaker verification systems under strict security thresholds. This work establishes a practical and scalable paradigm for voice morphing with significant implications for biometric security. The code and dataset are available on our project page: https://vcbsl.github.io/VoxMorph/
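Slerp itself is standard; applied to speaker embeddings it blends two identities at a controllable ratio $t$. The 192-dimensional vectors below are placeholders for the disentangled timbre/prosody embeddings, not the paper's actual representation.
```python
import numpy as np

def slerp(a, b, t):
    """Spherical linear interpolation between embeddings a and b at ratio t."""
    a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
    omega = np.arccos(np.clip(a_n @ b_n, -1.0, 1.0))  # angle between embeddings
    if omega < 1e-6:                                   # nearly parallel: lerp suffices
        return (1 - t) * a + t * b
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

spk_a, spk_b = np.random.default_rng(0).standard_normal((2, 192))
morph = slerp(spk_a, spk_b, t=0.5)   # 50/50 identity morph
print(morph.shape)
```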
【18】Reinforcement Learning for Adaptive Composition of Quantum Circuit Optimisation Passes
标题:用于量子电路优化pass自适应组合的强化学习
链接:https://arxiv.org/abs/2601.21629
作者:Daniel Mills,Ifan Williams,Jacob Swain,Gabriel Matos,Enrico Rinaldi,Alexander Koziell-Pipe
备注:14 pages, 7 figures
摘要:Many quantum software development kits provide a suite of circuit optimisation passes. These passes have been highly optimised and tested in isolation. However, the order in which they are applied is left to the user, or else defined in general-purpose default pass sequences. While general-purpose sequences miss opportunities for optimisation which are particular to individual circuits, designing pass sequences bespoke to particular circuits requires exceptional knowledge about quantum circuit design and optimisation. Here we propose and demonstrate training a reinforcement learning agent to compose optimisation-pass sequences. In particular the agent's action space consists of passes for two-qubit gate count reduction used in default PyTKET pass sequences. For the circuits in our diverse test set, the (mean, median) fraction of two-qubit gates removed by the agent is $(57.7\%, \ 56.7 \%)$, compared to $(41.8 \%, \ 50.0 \%)$ for the next best default pass sequence.
【19】Thompson sampling: Precise arm-pull dynamics and adaptive inference
标题:汤普森采样:精确的拉臂动态与自适应推断
链接:https://arxiv.org/abs/2601.21131
作者:Qiyang Han
摘要:Adaptive sampling schemes are well known to create complex dependence that may invalidate conventional inference methods. A recent line of work shows that this need not be the case for UCB-type algorithms in multi-armed bandits. A central emerging theme is a `stability' property with asymptotically deterministic arm-pull counts in these algorithms, making inference as easy as in the i.i.d. setting. In this paper, we study the precise arm-pull dynamics in another canonical class of Thompson-sampling type algorithms. We show that the phenomenology is qualitatively different: the arm-pull count is asymptotically deterministic if and only if the arm is suboptimal or is the unique optimal arm; otherwise it converges in distribution to the unique invariant law of an SDE. This dichotomy uncovers a unifying principle behind many existing (in)stability results: an arm is stable if and only if its interaction with statistical noise is asymptotically negligible. As an application, we show that normalized arm means obey the same dichotomy, with Gaussian limits for stable arms and a semi-universal, non-Gaussian limit for unstable arms. This not only enables the construction of confidence intervals for the unknown mean rewards despite non-normality, but also reveals the potential of developing tractable inference procedures beyond the stable regime. The proofs rely on two new approaches. For suboptimal arms, we develop an `inverse process' approach that characterizes the inverse of the arm-pull count process via a Stieltjes integral. For optimal arms, we adopt a reparametrization of the arm-pull and noise processes that reduces the singularity in the natural SDE to proving the uniqueness of the invariant law of another SDE. We prove the latter by a set of analytic tools, including the parabolic Hörmander condition and the Stroock-Varadhan support theorem.
强化学习(7篇)
【1】Constrained Meta Reinforcement Learning with Provable Test-Time Safety
标题:具有可证明测试时间安全性的约束Meta强化学习
链接:https://arxiv.org/abs/2601.21845
作者:Tingting Ni,Maryam Kamgarpour
摘要:Meta reinforcement learning (RL) allows agents to leverage experience across a distribution of tasks on which the agent can train at will, enabling faster learning of optimal policies on new test tasks. Despite its success in improving sample complexity on test tasks, many real-world applications, such as robotics and healthcare, impose safety constraints during testing. Constrained meta RL provides a promising framework for integrating safety into meta RL. An open question in constrained meta RL is how to ensure the safety of the policy on the real-world test task, while reducing the sample complexity and thus, enabling faster learning of optimal policies. To address this gap, we propose an algorithm that refines policies learned during training, with provable safety and sample complexity guarantees for learning a near optimal policy on the test tasks. We further derive a matching lower bound, showing that this sample complexity is tight.
【2】Expected Return Causes Outcome-Level Mode Collapse in Reinforcement Learning and How to Fix It with Inverse Probability Scaling
标题:预期回报导致强化学习中结果水平模式崩溃以及如何利用逆概率缩放修复它
链接:https://arxiv.org/abs/2601.21669
作者:Abhijeet Sinha,Sundari Elango,Dianbo Liu
摘要:Many reinforcement learning (RL) problems admit multiple terminal solutions of comparable quality, where the goal is not to identify a single optimum but to represent a diverse set of high-quality outcomes. Nevertheless, policies trained by standard expected return maximization routinely collapse onto a small subset of outcomes, a phenomenon commonly attributed to insufficient exploration or weak regularization. We show that this explanation is incomplete: outcome level mode collapse is a structural consequence of the expected-return objective itself. Under idealized learning dynamics, the log-probability ratio between any two outcomes evolves linearly in their reward difference, implying exponential ratio divergence and inevitable collapse independent of the exploration strategy, entropy regularization, or optimization algorithm. We identify the source of this pathology as the probability multiplier inside the expectation and propose a minimal correction: inverse probability scaling, which removes outcome-frequency amplification from the learning signal, fundamentally changes the learning dynamics, and provably yields reward-proportional terminal distributions, preventing collapse in multimodal settings. We instantiate this principle in Group Relative Policy Optimization (GRPO) as a drop-in modification, IPS-GRPO, requiring no auxiliary models or architectural changes. Across different reasoning and molecular generation tasks, IPS-GRPO consistently reduces outcome-level mode collapse while matching or exceeding baseline performance, suggesting that correcting the objective rather than adding exploration heuristics is key to reliable multimodal policy optimization.
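The proposed correction is small enough to show inline: divide each sampled outcome's contribution by its (detached) probability, so frequently sampled outcomes are not further amplified by the objective. A minimal sketch, not the exact IPS-GRPO loss.
```python
import torch

def ips_weighted_pg_loss(logps, advantages, outcome_logps):
    """logps: differentiable log-probs of sampled outcomes; advantages: group-
    relative advantages; outcome_logps: log p(outcome) used only for weighting."""
    w = torch.exp(-outcome_logps.detach())  # 1 / p(outcome), no gradient through it
    w = w / w.sum()                         # normalize within the sampled group
    return -(w * advantages * logps).sum()

logps = torch.tensor([-0.1, -2.3, -2.3], requires_grad=True)  # one frequent, two rare outcomes
advantages = torch.tensor([1.0, 1.0, 1.0])                    # equally good outcomes
print(ips_weighted_pg_loss(logps, advantages, logps))         # rare outcomes now weigh more
```
Passing `logps` twice is only for this toy example; in practice the outcome probability would be accumulated over the full generated sequence.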
【3】Mitigating Overthinking in Large Reasoning Models via Difficulty-aware Reinforcement Learning
标题:通过难度感知强化学习减轻大型推理模型中的过度思考
链接:https://arxiv.org/abs/2601.21418
作者:Qian Wan,Ziao Xu,Luona Wei,Xiaoxuan Shen,Jianwen Sun
摘要:Large Reasoning Models (LRMs) achieve explicit chain-of-thought expansion by imitating the deep thinking behaviors of humans, demonstrating excellent performance in complex task scenarios. However, the deep-thinking mode often leads to unnecessarily lengthy reasoning and resource inefficiency when handling simple tasks. This overthinking phenomenon may arise from the generation preference triggered by the reward function during post-training. Existing research attempts to mitigate overthinking from the perspective of prompt design or model training, but generally underestimates the importance of task difficulty awareness, which makes it difficult for LRMs to effectively allocate reasoning resources. In this paper, we propose Difficulty-aware Policy Optimization (DiPO), a reinforcement learning-based LRM training framework. DiPO encourages the LRM to spontaneously model task complexity and integrates these difficulty estimates into the reinforcement learning framework to adjust the generation preferences introduced by post-training. A difficulty modeling method based on model self-reasoning is proposed, which significantly reduces the dependence on manual annotation and formalizes task complexity. We further develop a difficulty-signal-enhanced reward function that incorporates a penalty for lengthy reasoning while considering reasoning performance and output format. Experimental results indicate that DiPO enables the model to spontaneously adjust inference overhead, significantly reducing redundant tokens without losing performance due to thought compression.
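One plausible shape for a difficulty-aware reward: performance and format terms plus a length penalty that shrinks as estimated difficulty grows, so easy tasks are discouraged from long chains of thought. The functional form and constants are illustrative assumptions, not DiPO's exact reward.
```python
def dipo_style_reward(correct, fmt_ok, n_tokens, difficulty, budget=4096, alpha=0.5):
    """difficulty in [0, 1], self-estimated by the model; higher difficulty
    relaxes the penalty on long reasoning."""
    length_penalty = alpha * (1.0 - difficulty) * min(n_tokens / budget, 1.0)
    return float(correct) + 0.1 * float(fmt_ok) - length_penalty

print(dipo_style_reward(correct=True, fmt_ok=True, n_tokens=2048, difficulty=0.2))  # easy + long: penalized
print(dipo_style_reward(correct=True, fmt_ok=True, n_tokens=2048, difficulty=0.9))  # hard + long: mild penalty
```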
【4】Heterogeneous Vertiport Selection Optimization for On-Demand Air Taxi Services: A Deep Reinforcement Learning Approach
标题:按需空中出租车服务的异构垂直港选择优化:深度强化学习方法
链接:https://arxiv.org/abs/2601.21316
作者:Aoyu Pang,Maonan Wang,Zifan Sha,Wenwei Yue,Changle Li,Chung Shue Chen,Man-On Pun
摘要:Urban Air Mobility (UAM) has emerged as a transformative solution to alleviate urban congestion by utilizing low-altitude airspace, thereby reducing pressure on ground transportation networks. To enable truly efficient and seamless door-to-door travel experiences, UAM requires close integration with existing ground transportation infrastructure. However, current research on optimal integrated routing strategies for passengers in air-ground mobility systems remains limited, with a lack of systematic exploration. To address this gap, we first propose a unified optimization model that integrates strategy selection for both air and ground transportation. This model captures the dynamic characteristics of multimodal transport networks and incorporates real-time traffic conditions alongside passenger decision-making behavior. Building on this model, we propose a Unified Air-Ground Mobility Coordination (UAGMC) framework, which leverages deep reinforcement learning (RL) and Vehicle-to-Everything (V2X) communication to optimize vertiport selection and dynamically plan air taxi routes. Experimental results demonstrate that UAGMC achieves a 34\% reduction in average travel time compared to conventional proportional allocation methods, enhancing overall travel efficiency and providing novel insights into the integration and optimization of multimodal transportation systems. This work lays a solid foundation for advancing intelligent urban mobility solutions through the coordination of air and ground transportation modes. The related code can be found at https://github.com/Traffic-Alpha/UAGMC.
【5】The Surprising Difficulty of Search in Model-Based Reinforcement Learning
标题:基于模型的强化学习中搜索的惊人困难
链接:https://arxiv.org/abs/2601.21306
作者:Wei-Di Chang,Mikael Henaff,Brandon Amos,Gregory Dudek,Scott Fujimoto
摘要:This paper investigates search in model-based reinforcement learning (RL). Conventional wisdom holds that long-term predictions and compounding errors are the primary obstacles for model-based RL. We challenge this view, showing that search is not a plug-and-play replacement for a learned policy. Surprisingly, we find that search can harm performance even when the model is highly accurate. Instead, we show that mitigating distribution shift matters more than improving model or value function accuracy. Building on this insight, we identify key techniques for enabling effective search, achieving state-of-the-art performance across multiple popular benchmark domains.
【6】Less Noise, More Voice: Reinforcement Learning for Reasoning via Instruction Purification
标题:更少的噪音,更多的声音:通过指令净化进行推理的强化学习
链接:https://arxiv.org/abs/2601.21244
作者:Yiju Guo,Tianyi Hu,Zexu Sun,Yankai Lin
备注:Work in progress
摘要:Reinforcement Learning with Verifiable Rewards (RLVR) has advanced LLM reasoning, but remains constrained by inefficient exploration under limited rollout budgets, leading to low sampling success and unstable training in complex tasks. We find that many exploration failures arise not from problem difficulty, but from a small number of prompt tokens that introduce interference. Building on this insight, we propose the Less Noise Sampling Framework (LENS), which first purifies prompts by identifying and removing interference tokens, then transfers successful rollouts from the purification process to supervise policy optimization on the original noisy prompts, enabling the model to learn to ignore interference in real-world, noisy prompting settings. Experimental results show that LENS significantly outperforms GRPO, delivering higher performance and faster convergence, with a 3.88% average gain and over 1.6$\times$ speedup. Our work highlights the critical role of pruning interference tokens in improving rollout efficiency, offering a new perspective for RLVR research.
【7】Safety Generalization Under Distribution Shift in Safe Reinforcement Learning: A Diabetes Testbed
标题:安全强化学习中分布偏移下的安全泛化:糖尿病测试平台
链接:https://arxiv.org/abs/2601.21094
作者:Minjae Kwon,Josephine Lamp,Lu Feng
摘要:Safe Reinforcement Learning (RL) algorithms are typically evaluated under fixed training conditions. We investigate whether training-time safety guarantees transfer to deployment under distribution shift, using diabetes management as a safety-critical testbed. We benchmark safe RL algorithms on a unified clinical simulator and reveal a safety generalization gap: policies satisfying constraints during training frequently violate safety requirements on unseen patients. We demonstrate that test-time shielding, which filters unsafe actions using learned dynamics models, effectively restores safety across algorithms and patient populations. Across eight safe RL algorithms, three diabetes types, and three age groups, shielding achieves Time-in-Range gains of 13--14\% for strong baselines such as PPO-Lag and CPO while reducing clinical risk index and glucose variability. Our simulator and benchmark provide a platform for studying safety under distribution shift in safety-critical control domains. Code is available at https://github.com/safe-autonomy-lab/GlucoSim and https://github.com/safe-autonomy-lab/GlucoAlg.
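Test-time shielding of this kind can be sketched as a greedy filter: simulate each candidate action a few steps through the learned dynamics model and take the highest-value action whose rollout stays in the safe set. The horizon, the repeated-action rollout, and the fallback rule are all our simplifying assumptions.
```python
import numpy as np

def shielded_action(state, q_values, dynamics, is_safe, horizon=3):
    """q_values: per-action values; dynamics(s, a): learned one-step model;
    is_safe(s): safe-set membership (e.g. predicted glucose within range)."""
    order = np.argsort(q_values)[::-1]          # try high-value actions first
    for a in order:
        s, safe = state, True
        for _ in range(horizon):
            s = dynamics(s, a)                  # short model-based rollout
            if not is_safe(s):
                safe = False
                break
        if safe:
            return int(a)
    return int(order[0])                        # nothing certifies: domain-specific fallback
```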
元学习(3篇)
【1】SMOG: Scalable Meta-Learning for Multi-Objective Bayesian Optimization
标题:SMOG:用于多目标Bayesian优化的可扩展元学习
链接:https://arxiv.org/abs/2601.22131
作者:Leonard Papenmeier,Petru Tighineanu
备注:19 pages, 15 figures
摘要:Multi-objective optimization aims to solve problems with competing objectives, often with only black-box access to a problem and a limited budget of measurements. In many applications, historical data from related optimization tasks is available, creating an opportunity for meta-learning to accelerate the optimization. Bayesian optimization, as a promising technique for black-box optimization, has been extended to meta-learning and multi-objective optimization independently, but methods that simultaneously address both settings - meta-learned priors for multi-objective Bayesian optimization - remain largely unexplored. We propose SMOG, a scalable and modular meta-learning model based on a multi-output Gaussian process that explicitly learns correlations between objectives. SMOG builds a structured joint Gaussian process prior across meta- and target tasks and, after conditioning on metadata, yields a closed-form target-task prior augmented by a flexible residual multi-output kernel. This construction propagates metadata uncertainty into the target surrogate in a principled way. SMOG supports hierarchical, parallel training: meta-task Gaussian processes are fit once and then cached, achieving linear scaling with the number of meta-tasks. The resulting surrogate integrates seamlessly with standard multi-objective Bayesian optimization acquisition functions.
【2】Optimizing Agentic Workflows using Meta-tools
标题:使用元工具优化智能体工作流
链接:https://arxiv.org/abs/2601.22037
作者:Sami Abuzakuk,Anne-Marie Kermarrec,Rishi Sharma,Rasmus Moorits Veski,Martijn de Vos
摘要:Agentic AI enables LLMs to dynamically reason, plan, and interact with tools to solve complex tasks. However, agentic workflows often require many iterative reasoning steps and tool invocations, leading to significant operational expense, end-to-end latency, and failures due to hallucinations. This work introduces Agent Workflow Optimization (AWO), a framework that identifies and optimizes redundant tool execution patterns to improve the efficiency and robustness of agentic workflows. AWO analyzes existing workflow traces to discover recurring sequences of tool calls and transforms them into meta-tools, which are deterministic, composite tools that bundle multiple agent actions into a single invocation. Meta-tools bypass unnecessary intermediate LLM reasoning steps and reduce operational cost while also shortening execution paths, leading to fewer failures. Experiments on two agentic AI benchmarks show that AWO reduces the number of LLM calls by up to 11.9% while also increasing the task success rate by up to 4.2 percentage points.
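In the simplest reading, discovering meta-tool candidates reduces to mining frequent contiguous tool-call subsequences from workflow traces; each frequent pattern becomes a deterministic composite tool. The thresholds below are illustrative assumptions.
```python
from collections import Counter

def discover_meta_tools(traces, min_len=2, max_len=4, min_support=3):
    """traces: lists of tool names. Returns recurring call sequences that are
    worth bundling into a single deterministic meta-tool."""
    counts = Counter()
    for trace in traces:
        for n in range(min_len, max_len + 1):
            for i in range(len(trace) - n + 1):
                counts[tuple(trace[i:i + n])] += 1
    return [seq for seq, c in counts.items() if c >= min_support]

traces = [["search", "parse", "summarize", "search", "parse"],
          ["search", "parse", "rank"],
          ["search", "parse", "summarize"]]
print(discover_meta_tools(traces))  # [('search', 'parse')]
```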
【3】READY: Reward Discovery for Meta-Black-Box Optimization
标题:READY:元黑箱优化的奖励发现
链接:https://arxiv.org/abs/2601.21847
作者:Zechuan Huang,Zhiguang Cao,Hongshu Guo,Yue-Jiao Gong,Zeyuan Ma
摘要:Meta-Black-Box Optimization (MetaBBO) is an emerging avenue within Optimization community, where algorithm design policy could be meta-learned by reinforcement learning to enhance optimization performance. So far, the reward functions in existing MetaBBO works are designed by human experts, introducing certain design bias and risks of reward hacking. In this paper, we use Large Language Model~(LLM) as an automated reward discovery tool for MetaBBO. Specifically, we consider both effectiveness and efficiency sides. On effectiveness side, we borrow the idea of evolution of heuristics, introducing tailored evolution paradigm in the iterative LLM-based program search process, which ensures continuous improvement. On efficiency side, we additionally introduce multi-task evolution architecture to support parallel reward discovery for diverse MetaBBO approaches. Such parallel process also benefits from knowledge sharing across tasks to accelerate convergence. Empirical results demonstrate that the reward functions discovered by our approach could be helpful for boosting existing MetaBBO works, underscoring the importance of reward design in MetaBBO. We provide READY's project at https://anonymous.4open.science/r/ICML_READY-747F.
符号|符号学习(1篇)
【1】TimeSliver : Symbolic-Linear Decomposition for Explainable Time Series Classification
标题:TimeSliver:可解释时间序列分类的符号线性分解
链接:https://arxiv.org/abs/2601.21289
作者:Akash Pandey,Payal Mohapatra,Wei Chen,Qi Zhu,Sinan Keten
备注:Accepted to ICLR 2026
摘要:Identifying the extent to which every temporal segment influences a model's predictions is essential for explaining model decisions and increasing transparency. While post-hoc explainable methods based on gradients and feature-based attributions have been popular, they suffer from reference state sensitivity and struggle to generalize across time-series datasets, as they treat time points independently and ignore sequential dependencies. Another perspective on explainable time-series classification is through interpretable components of the model, for instance, leveraging self-attention mechanisms to estimate temporal attribution; however, recent findings indicate that these attention weights often fail to provide faithful measures of temporal importance. In this work, we advance this perspective and present a novel explainability-driven deep learning framework, TimeSliver, which jointly utilizes raw time-series data and its symbolic abstraction to construct a representation that maintains the original temporal structure. Each element in this representation linearly encodes the contribution of each temporal segment to the final prediction, allowing us to assign a meaningful importance score to every time point. For time-series classification, TimeSliver outperforms other temporal attribution methods by 11% on 7 distinct synthetic and real-world multivariate time-series datasets. TimeSliver also achieves predictive performance within 2% of state-of-the-art baselines across 26 UEA benchmark datasets, positioning it as a strong and explainable framework for general time-series classification.
医学相关(3篇)
【1】HistoPrism: Unlocking Functional Pathway Analysis from Pan-Cancer Histology via Gene Expression Prediction
标题:HistoPrism:通过基因表达预测解锁泛癌组织学的功能途径分析
链接:https://arxiv.org/abs/2601.21560
作者:Susu Hu,Qinghe Zeng,Nithya Bhasker,Jakob Nicolas Kather,Stefanie Speidel
备注:Accepted at ICLR2026
摘要:Predicting spatial gene expression from H&E histology offers a scalable and clinically accessible alternative to sequencing, but realizing clinical impact requires models that generalize across cancer types and capture biologically coherent signals. Prior work is often limited to per-cancer settings and variance-based evaluation, leaving functional relevance underexplored. We introduce HistoPrism, an efficient transformer-based architecture for pan-cancer prediction of gene expression from histology. To evaluate biological meaning, we introduce a pathway-level benchmark, shifting assessment from isolated gene-level variance to coherent functional pathways. HistoPrism not only surpasses prior state-of-the-art models on highly variable genes but also, more importantly, achieves substantial gains on pathway-level prediction, demonstrating its ability to recover biologically coherent transcriptomic patterns. With strong pan-cancer generalization and improved efficiency, HistoPrism establishes a new standard for clinically relevant transcriptomic modeling from routinely available histology.
【2】ECGFlowCMR: Pretraining with ECG-Generated Cine CMR Improves Cardiac Disease Classification and Phenotype Prediction
标题:ECGFlowCMR:使用心电图生成的电影CMR进行预训练改善心脏病分类和表型预测
链接:https://arxiv.org/abs/2601.20904
作者:Xiaocheng Fang,Zhengyao Ding,Jieyi Cai,Yujie Xiao,Bo Liu,Jiarui Jin,Haoyu Wang,Guangkun Nie,Shun Huang,Ting Chen,Hongyan Li,Shenda Hong
摘要:Cardiac Magnetic Resonance (CMR) imaging provides a comprehensive assessment of cardiac structure and function but remains constrained by high acquisition costs and reliance on expert annotations, limiting the availability of large-scale labeled datasets. In contrast, electrocardiograms (ECGs) are inexpensive, widely accessible, and offer a promising modality for conditioning the generative synthesis of cine CMR. To this end, we propose ECGFlowCMR, a novel ECG-to-CMR generative framework that integrates a Phase-Aware Masked Autoencoder (PA-MAE) and an Anatomy-Motion Disentangled Flow (AMDF) to address two fundamental challenges: (1) the cross-modal temporal mismatch between multi-beat ECG recordings and single-cycle CMR sequences, and (2) the anatomical observability gap due to the limited structural information inherent in ECGs. Extensive experiments on the UK Biobank and a proprietary clinical dataset demonstrate that ECGFlowCMR can generate realistic cine CMR sequences from ECG inputs, enabling scalable pretraining and improving performance on downstream cardiac disease classification and phenotype prediction tasks.
【3】Analyzing the Temporal Factors for Anxiety and Depression Symptoms with the Rashomon Perspective
标题:用罗生门视角分析焦虑和抑郁症状的时间因素
链接:https://arxiv.org/abs/2601.20874
作者:Mustafa Cavus,Przemysław Biecek,Julian Tejada,Fernando Marmolejo-Ramos,Andre Faro
备注:19 pages, 2 figures
摘要:This paper introduces a new modeling perspective in the public mental health domain to provide a robust interpretation of the relations between anxiety and depression, and the demographic and temporal factors. This perspective particularly leverages the Rashomon Effect, where multiple models exhibit similar predictive performance but rely on diverse internal structures. Instead of considering these multiple models, choosing a single best model risks masking alternative narratives embedded in the data. To address this, we employed this perspective in the interpretation of a large-scale psychological dataset, specifically focusing on the Patient Health Questionnaire-4. We use a random forest model combined with partial dependence profiles to rigorously assess the robustness and stability of predictive relationships across the resulting Rashomon set, which consists of multiple models that exhibit similar predictive performance. Our findings confirm that the demographic variables age, sex, and education lead to consistent structural shifts in anxiety and depression risk. Crucially, we identify significant temporal effects: risk probability demonstrates clear diurnal and circaseptan fluctuations, peaking during early morning hours. This work demonstrates the necessity of moving beyond the best model to analyze the entire Rashomon set. Our results highlight that the observed variability, particularly due to circadian and circaseptan rhythms, must be meticulously considered for robust interpretation in psychological screening. We advocate for a multiplicity-aware approach to enhance the stability and generalizability of ML-based conclusions in mental health research.
蒸馏|知识提取(2篇)
【1】Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts
标题:正确的混合线性注意力:超长上下文的有效蒸馏和有效架构
链接:https://arxiv.org/abs/2601.22156
作者:Yingfa Chen,Zhen Leng Thai,Zihan Zhou,Zhu Zhang,Xingyu Shen,Shuo Wang,Chaojun Xiao,Xu Han,Zhiyuan Liu
备注:20 pages, 8 figures
摘要:Hybrid Transformer architectures, which combine softmax attention blocks and recurrent neural networks (RNNs), have shown a desirable performance-throughput tradeoff for long-context modeling, but their adoption and studies are hindered by the prohibitive cost of large-scale pre-training from scratch. Some recent studies have shown that pre-trained softmax attention blocks can be converted into RNN blocks through parameter transfer and knowledge distillation. However, these transfer methods require substantial amounts of training data (more than 10B tokens), and the resulting hybrid models also exhibit poor long-context performance, which is the scenario where hybrid models enjoy significant inference speedups over Transformer-based models. In this paper, we present HALO (Hybrid Attention via Layer Optimization), a pipeline for distilling Transformer models into RNN-attention hybrid models. We then present HypeNet, a hybrid architecture with superior length generalization enabled by a novel position encoding scheme (named HyPE) and various architectural modifications. We convert the Qwen3 series into HypeNet using HALO, achieving performance comparable to the original Transformer models while enjoying superior long-context performance and efficiency. The conversion requires just 2.3B tokens, less than 0.01% of their pre-training data.
【2】Grounding and Enhancing Informativeness and Utility in Dataset Distillation
标题:数据集蒸馏中的基础和增强信息性和实用性
链接:https://arxiv.org/abs/2601.21296
作者:Shaobo Wang,Yantai Yang,Guo Chen,Peiru Li,Kaixin Li,Yufa Zhou,Zhaorun Chen,Linfeng Zhang
备注:Accepted by ICLR 2026, 20 pages, 9 figures, 11 tables
摘要:Dataset Distillation (DD) seeks to create a compact dataset from a large, real-world dataset. While recent methods often rely on heuristic approaches to balance efficiency and quality, the fundamental relationship between original and synthetic data remains underexplored. This paper revisits knowledge distillation-based dataset distillation within a solid theoretical framework. We introduce the concepts of Informativeness and Utility, capturing crucial information within a sample and essential samples in the training set, respectively. Building on these principles, we define optimal dataset distillation mathematically. We then present InfoUtil, a framework that balances informativeness and utility in synthesizing the distilled dataset. InfoUtil incorporates two key components: (1) game-theoretic informativeness maximization using Shapley Value attribution to extract key information from samples, and (2) principled utility maximization by selecting globally influential samples based on Gradient Norm. These components ensure that the distilled dataset is both informative and utility-optimized. Experiments demonstrate that our method achieves a 6.1% performance improvement over the previous state-of-the-art approach on the ImageNet-1K dataset using ResNet-18.
推荐(1篇)
【1】Zenith: Scaling up Ranking Models for Billion-scale Livestreaming Recommendation
标题:Zenith:为十亿级直播推荐扩展排名模型
链接:https://arxiv.org/abs/2601.21285
作者:Ruifeng Zhang,Zexi Huang,Zikai Wang,Ke Sun,Bohang Zheng,Zhen Ouyang,Huimin Xie,Phil Shen,Junlin Zhang,Wentao Guo,Qinglei Wang
备注:9 pages
摘要:Accurately capturing feature interactions is essential in recommender systems, and recent trends show that scaling up model capacity could be a key driver for next-level predictive performance. While prior work has explored various model architectures to capture multi-granularity feature interactions, relatively little attention has been paid to efficient feature handling and scaling model capacity without incurring excessive inference latency. In this paper, we address this by presenting Zenith, a scalable and efficient ranking architecture that learns complex feature interactions with minimal runtime overhead. Zenith is designed to handle a few high-dimensional Prime Tokens with Token Fusion and Token Boost modules, which exhibits superior scaling laws compared to other state-of-the-art ranking methods, thanks to its improved token heterogeneity. Its real-world effectiveness is demonstrated by deploying the architecture to TikTok Live, a leading online livestreaming platform that attracts billions of users globally. Our A/B test shows that Zenith achieves +1.05%/-1.10% in online CTR AUC and Logloss, and realizes +9.93% gains in Quality Watch Session / User and +8.11% in Quality Watch Duration / User.
聚类(1篇)
【1】TabClustPFN: A Prior-Fitted Network for Tabular Data Clustering
标题:TabClustPFN:一个用于表格数据聚类的先验拟合网络
链接:https://arxiv.org/abs/2601.21656
作者:Tianqi Zhao,Guanyang Wang,Yan Shuo Tan,Qiong Zhang
摘要:Clustering tabular data is a fundamental yet challenging problem due to heterogeneous feature types, diverse data-generating mechanisms, and the absence of transferable inductive biases across datasets. Prior-fitted networks (PFNs) have recently demonstrated strong generalization in supervised tabular learning by amortizing Bayesian inference under a broad synthetic prior. Extending this paradigm to clustering is nontrivial: clustering is unsupervised, admits a combinatorial and permutation-invariant output space, and requires inferring the number of clusters. We introduce TabClustPFN, a prior-fitted network for tabular data clustering that performs amortized Bayesian inference over both cluster assignments and cluster cardinality. Pretrained on synthetic datasets drawn from a flexible clustering prior, TabClustPFN clusters unseen datasets in a single forward pass, without dataset-specific retraining or hyperparameter tuning. The model naturally handles heterogeneous numerical and categorical features and adapts to a wide range of clustering structures. Experiments on synthetic data and curated real-world tabular benchmarks show that TabClustPFN outperforms classical, deep, and amortized clustering baselines, while exhibiting strong robustness in out-of-the-box exploratory settings. Code is available at https://github.com/Tianqi-Zhao/TabClustPFN.
超分辨率|去噪|去模糊|去雾(3篇)
【1】Memorization Control in Diffusion Models from Denoising-centric Perspective
标题:以去噪为中心视角下扩散模型的记忆控制
链接:https://arxiv.org/abs/2601.21348
作者:Thuy Phuong Vu,Mai Viet Hoang Do,Minhhuy Le,Dinh-Cuong Hoang,Phan Xuan Tan
摘要:Controlling memorization in diffusion models is critical for applications that require generated data to closely match the training distribution. Existing approaches mainly focus on data centric or model centric modifications, treating the diffusion model as an isolated predictor. In this paper, we study memorization in diffusion models from a denoising centric perspective. We show that uniform timestep sampling leads to unequal learning contributions across denoising steps due to differences in signal to noise ratio, which biases training toward memorization. To address this, we propose a timestep sampling strategy that explicitly controls where learning occurs along the denoising trajectory. By adjusting the width of the confidence interval, our method provides direct control over the memorization generalization trade off. Experiments on image and 1D signal generation tasks demonstrate that shifting learning emphasis toward later denoising steps consistently reduces memorization and improves distributional alignment with training data, validating the generality and effectiveness of our approach.
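A minimal PyTorch sketch of the idea: instead of drawing timesteps uniformly over [0, T), draw them from a restricted window whose width plays the role of the paper's confidence interval. The exact parameterization is not given in the abstract, so the window endpoints and the q_sample helper below are assumptions.
```python
import torch
import torch.nn.functional as F

def biased_timesteps(batch_size, T=1000, lo_frac=0.0, hi_frac=0.5):
    """Draw diffusion timesteps from a restricted window [lo, hi) instead of
    uniformly over [0, T). Narrowing the window toward small t (the late,
    low-noise denoising steps) shifts where learning effort is spent."""
    lo, hi = int(lo_frac * T), max(int(lo_frac * T) + 1, int(hi_frac * T))
    return torch.randint(lo, hi, (batch_size,))

# Usage inside an otherwise standard DDPM training step (q_sample is the
# usual forward-noising helper, assumed to exist):
#   t = biased_timesteps(x0.shape[0], T=1000, hi_frac=0.5)
#   noise = torch.randn_like(x0)
#   x_t = q_sample(x0, t, noise)
#   loss = F.mse_loss(model(x_t, t), noise)
```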
【2】Conditional Denoising Model as a Physical Surrogate Model
标题:作为物理替代模型的条件去噪模型
链接:https://arxiv.org/abs/2601.21021
作者:José Afonso,Pedro Viegas,Rodrigo Ventura,Vasco Guerra
备注:15 pages, 2 figures, 2 tables
摘要:Surrogate modeling for complex physical systems typically faces a trade-off between data-fitting accuracy and physical consistency. Physics-consistent approaches typically treat physical laws as soft constraints within the loss function, a strategy that frequently fails to guarantee strict adherence to the governing equations, or rely on post-processing corrections that do not intrinsically learn the underlying solution geometry. To address these limitations, we introduce the Conditional Denoising Model (CDM), a generative model designed to learn the geometry of the physical manifold itself. By training the network to restore clean states from noisy ones, the model learns a vector field that points continuously towards the valid solution subspace. We introduce a time-independent formulation that transforms inference into a deterministic fixed-point iteration, effectively projecting noisy approximations onto the equilibrium manifold. Validated on a low-temperature plasma physics and chemistry benchmark, the CDM achieves higher parameter and data efficiency than physics-consistent baselines. Crucially, we demonstrate that the denoising objective acts as a powerful implicit regularizer: despite never seeing the governing equations during training, the model adheres to physical constraints more strictly than baselines trained with explicit physics losses.
【3】Denoising and Baseline Correction of Low-Scan FTIR Spectra: A Benchmark of Deep Learning Models Against Traditional Signal Processing
标题:低扫描FTIR光谱的去噪与基线校正:深度学习模型与传统信号处理的基准比较
链接:https://arxiv.org/abs/2601.20905
作者:Azadeh Mokari,Shravan Raghunathan,Artem Shydliukh,Oleg Ryabchykov,Christoph Krafft,Thomas Bocklitz
摘要:High-quality Fourier Transform Infrared (FTIR) imaging usually needs extensive signal averaging to reduce noise and drift which severely limits clinical speed. Deep learning can accelerate imaging by reconstructing spectra from rapid, single-scan inputs. However, separating noise and baseline drift simultaneously without ground truth is an ill-posed inverse problem. Standard black-box architectures often rely on statistical approximations that introduce spectral hallucinations or fail to generalize to unstable atmospheric conditions. To solve these issues we propose a physics-informed cascade Unet that separates denoising and baseline correction tasks using a new, deterministic Physics Bridge. This architecture forces the network to separate random noise from chemical signals using an embedded SNIP layer to enforce spectroscopic constraints instead of learning statistical approximations. We benchmarked this approach against a standard single Unet and a traditional Savitzky-Golay/SNIP workflow. We used a dataset of human hypopharyngeal carcinoma cells (FaDu). The cascade model outperformed all other methods, achieving a 51.3% reduction in RMSE compared to raw single-scan inputs, surpassing both the single Unet (40.2%) and the traditional workflow (33.7%). Peak-aware metrics show that the cascade architecture eliminates spectral hallucinations found in standard deep learning. It also preserves peak intensity with much higher fidelity than traditional smoothing. These results show that the cascade Unet is a robust solution for diagnostic-grade FTIR imaging. It enables imaging speeds 32 times faster than current methods.
自动驾驶|车辆|车道检测等(2篇)
【1】NetMamba+: A Framework of Pre-trained Models for Efficient and Accurate Network Traffic Classification
标题:NetMamba+:预训练模型框架,用于高效、准确的网络流量分类
链接:https://arxiv.org/abs/2601.21792
作者:Tongze Wang,Xiaohui Xie,Wenduo Wang,Chuyi Wang,Jinzhou Liu,Boyan Huang,Yannan Hu,Youjian Zhao,Yong Cui
摘要:With the rapid growth of encrypted network traffic, effective traffic classification has become essential for network security and quality of service management. Current machine learning and deep learning approaches for traffic classification face three critical challenges: computational inefficiency of Transformer architectures, inadequate traffic representations with loss of crucial byte-level features while retaining detrimental biases, and poor handling of long-tail distributions in real-world data. We propose NetMamba+, a framework that addresses these challenges through three key innovations: (1) an efficient architecture combining Mamba and Flash Attention mechanisms, (2) a multimodal traffic representation scheme that preserves essential traffic information while eliminating biases, and (3) a label distribution-aware fine-tuning strategy. Evaluation experiments on massive datasets encompassing four main classification tasks showcase NetMamba+'s superior classification performance compared to state-of-the-art baselines, with improvements of up to 6.44% in F1 score. Moreover, NetMamba+ demonstrates excellent efficiency, achieving 1.7x higher inference throughput than the best baseline while maintaining comparably low memory usage. Furthermore, NetMamba+ exhibits superior few-shot learning abilities, achieving better classification performance with fewer labeled data. Additionally, we implement an online traffic classification system that demonstrates robust real-world performance with a throughput of 261.87 Mb/s. As the first framework to adapt the Mamba architecture for network traffic classification, NetMamba+ opens new possibilities for efficient and accurate traffic analysis in complex network environments.
【2】Sim-MSTNet: sim2real based Multi-task SpatioTemporal Network Traffic Forecasting
标题:Sim-MSTNet:基于sim2real的多任务时空网络流量预测
链接:https://arxiv.org/abs/2601.21384
作者:Hui Ma,Qingzhong Li,Jin Wang,Jie Wu,Shaoyu Dou,Li Feng,Xinjun Pei
备注:accepted in ICASSP 2026
摘要:Network traffic forecasting plays a crucial role in intelligent network operations, but existing techniques often perform poorly when faced with limited data. Additionally, multi-task learning methods struggle with task imbalance and negative transfer, especially when modeling various service types. To overcome these challenges, we propose Sim-MSTNet, a multi-task spatiotemporal network traffic forecasting model based on the sim2real approach. Our method leverages a simulator to generate synthetic data, effectively addressing the issue of poor generalization caused by data scarcity. By employing a domain randomization technique, we reduce the distributional gap between synthetic and real data through bi-level optimization of both sample weighting and model training. Moreover, Sim-MSTNet incorporates attention-based mechanisms to selectively share knowledge between tasks and applies dynamic loss weighting to balance task objectives. Extensive experiments on two open-source datasets show that Sim-MSTNet consistently outperforms state-of-the-art baselines, achieving enhanced accuracy and generalization.
点云|SLAM|雷达|激光|深度RGBD相关(1篇)
【1】Depth-Recurrent Attention Mixtures: Giving Latent Reasoning the Attention it Deserves
标题:深度递归注意力混合:给予潜在推理应得的注意力
链接:https://arxiv.org/abs/2601.21582
作者:Jonas Knupp,Jan Hendrik Metzen,Jeremias Bohn,Georg Groh,Kristian Kersting
摘要:Depth-recurrence facilitates latent reasoning by sharing parameters across depths. However, prior work lacks combined FLOP-, parameter-, and memory-matched baselines, underutilizes depth-recurrence due to partially fixed layer stacks, and ignores the bottleneck of constant hidden-sizes that restricts many-step latent reasoning. To address this, we introduce a modular framework of depth-recurrent attention mixtures (Dreamer), combining sequence attention, depth attention, and sparse expert attention. It alleviates the hidden-size bottleneck through attention along depth, decouples scaling dimensions, and allows depth-recurrent models to scale efficiently and effectively. Across language reasoning benchmarks, our models require 2 to 8x fewer training tokens for the same accuracy as FLOP-, parameter-, and memory-matched SOTA, and outperform ca. 2x larger SOTA models with the same training tokens. We further present insights into knowledge usage across depths, e.g., showing 2 to 11x larger expert selection diversity than SOTA MoEs.
推理|分析|理解|解释(14篇)
【1】Where Do the Joules Go? Diagnosing Inference Energy Consumption
标题:焦耳都去哪了?诊断推理能耗
链接:https://arxiv.org/abs/2601.22076
作者:Jae-Won Chung,Ruofan Wu,Jeff J. Ma,Mosharaf Chowdhury
备注:The ML.ENERGY Leaderboard v3.0 is open https://ml.energy/leaderboard
摘要:Energy is now a critical ML computing resource. While measuring energy consumption and observing trends is a valuable first step, accurately understanding and diagnosing why those differences occur is crucial for optimization. To that end, we begin by presenting a large-scale measurement study of inference time and energy across the generative AI landscape with 46 models, 7 tasks, and 1,858 different configurations on NVIDIA H100 and B200 GPUs. Our empirical findings span order-of-magnitude variations: LLM task type can lead to 25$\times$ energy differences, video generation sometimes consumes more than 100$\times$ the energy of images, and GPU utilization differences can result in 3--5$\times$ energy differences. Based on our observations, we present a framework for reasoning about the underlying mechanisms that govern time and energy consumption. The essence is that time and energy are determined by latent metrics like memory and utilization, which are in turn affected by various factors across the algorithm, software, and hardware layers. Our framework also extends directly to throughput per watt, a critical metric for power-constrained datacenters.
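As a hedged illustration of how per-request GPU energy is measured in practice, an inference call can be bracketed with NVML's cumulative energy counter. This is generic NVML-based measurement, not necessarily the paper's ML.ENERGY tooling.
```python
import pynvml

def measure_gpu_energy(fn, device_index=0):
    """Run fn() and return (result, joules) via NVML's cumulative energy
    counter (millijoules since driver load; Volta GPUs and newer)."""
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        start_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
        result = fn()
        end_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
    finally:
        pynvml.nvmlShutdown()
    return result, (end_mj - start_mj) / 1000.0

# e.g. _, joules = measure_gpu_energy(lambda: model.generate(**inputs))
```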
【2】Investigating Batch Inference in a Sequential Monte Carlo Framework for Neural Networks
标题:在神经网络序列蒙特卡罗框架中研究批推理
链接:https://arxiv.org/abs/2601.21983
作者:Andrew Millard,Joshua Murphy,Peter Green,Simon Maskell
摘要:Bayesian inference allows us to define a posterior distribution over the weights of a generic neural network (NN). Exact posteriors are usually intractable, in which case approximations can be employed. One such approximation - variational inference - is computationally efficient when using mini-batch stochastic gradient descent as subsets of the data are used for likelihood and gradient evaluations, though the approach relies on the selection of a variational distribution which sufficiently matches the form of the posterior. Particle-based methods such as Markov chain Monte Carlo and Sequential Monte Carlo (SMC) do not assume a parametric family for the posterior but typically require higher computational cost. These sampling methods typically use the full batch of data for likelihood and gradient evaluations, which contributes to this computational expense. We explore several methods of gradually introducing more mini-batches of data (data annealing) into likelihood and gradient evaluations of an SMC sampler. We find that we can achieve up to $6\times$ faster training with minimal loss in accuracy on benchmark image classification problems using NNs.
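A minimal sketch of one possible data-annealing schedule for the SMC likelihood evaluations, assuming the data have been shuffled once beforehand; the paper compares several schedules, and the linear ramp and rescaling below are illustrative assumptions.
```python
import numpy as np

def annealed_loglik(params, X, y, loglik_fn, step, total_steps, batch_size=256):
    """Mini-batch data annealing for SMC: early iterations evaluate the
    likelihood on a few batches, later iterations approach the full batch.
    The subsample log-likelihood is rescaled by n/m so it acts as a
    full-data stand-in."""
    n = len(X)
    n_batches = int(np.ceil(n / batch_size))
    k = max(1, int(np.ceil((step + 1) / total_steps * n_batches)))  # linear ramp
    m = min(k * batch_size, n)
    return (n / m) * loglik_fn(params, X[:m], y[:m])
```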
【3】ECSEL: Explainable Classification via Signomial Equation Learning
标题:ECSEL:通过signomial方程学习实现可解释分类
链接:https://arxiv.org/abs/2601.21789
作者:Adia Lumadjeng,Ilker Birbil,Erman Acar
摘要:We introduce ECSEL, an explainable classification method that learns formal expressions in the form of signomial equations, motivated by the observation that many symbolic regression benchmarks admit compact signomial structure. ECSEL directly constructs a structural, closed-form expression that serves as both a classifier and an explanation. On standard symbolic regression benchmarks, our method recovers a larger fraction of target equations than competing state-of-the-art approaches while requiring substantially less computation. Leveraging this efficiency, ECSEL achieves classification accuracy competitive with established machine learning models without sacrificing interpretability. Further, we show that ECSEL satisfies some desirable properties regarding global feature behavior, decision-boundary analysis, and local feature attributions. Experiments on benchmark datasets and two real-world case studies i.e., e-commerce and fraud detection, demonstrate that the learned equations expose dataset biases, support counterfactual reasoning, and yield actionable insights.
【4】Understanding Model Merging: A Unified Generalization Framework for Heterogeneous Experts
标题:理解模型合并:针对异类专家的统一概括框架
链接:https://arxiv.org/abs/2601.21690
作者:Qinglun Li,Anke Tang,Miao Zhang,Mengzhu Wang,Quanjun Yin,Li Shen
摘要:Model merging efficiently aggregates capabilities from multiple fine-tuned models into a single one, operating purely in parameter space without original data or expensive re-computation. Despite empirical successes, a unified theory for its effectiveness under heterogeneous finetuning hyperparameters (e.g., varying learning rates, batch sizes) remains missing. Moreover, the lack of hyperparameter transparency in open-source fine-tuned models makes it difficult to predict merged-model performance, leaving practitioners without guidance on how to fine-tune merge-friendly experts. To address these two challenges, we employ $L_2$-Stability theory under heterogeneous hyperparameter environments to analyze the generalization of the merged model $\boldsymbol{x}_{avg}$. This pioneering analysis yields two key contributions: (i) a unified theoretical framework is provided to explain existing merging algorithms, revealing how they optimize specific terms in our bound, thus offering a strong theoretical foundation for empirical observations; (ii) actionable recommendations are proposed for practitioners to strategically fine-tune expert models, enabling the construction of merge-friendly models within the pretraining-to-finetuning pipeline. Extensive experiments on the ResNet/ViT family across 20/8 visual classification tasks, involving thousands of fine-tuned models, robustly confirm the impact of different hyperparameters on the generalization of $\boldsymbol{x}_{avg}$ predicted by our theoretical results.
【5】From Consistency to Complementarity: Aligned and Disentangled Multi-modal Learning for Time Series Understanding and Reasoning
标题:从一致性到互补性:用于时间序列理解和推理的对齐和解开的多模式学习
链接:https://arxiv.org/abs/2601.21436
作者:Hang Ni,Weijia Zhang,Fei Wang,Zezhi Shao,Hao Liu
摘要:Advances in multi-modal large language models (MLLMs) have inspired time series understanding and reasoning tasks that enable natural language querying over time series and produce textual analyses of complex temporal dynamics. Recent attempts hybridize numerical time series with their visualized plots, facilitating precise value reasoning and visual structure comprehension for comprehensive time series understanding of MLLMs. However, effective cross-modal integration remains challenging due to fine-grained temporal misalignment across modalities and severe entanglement between shared and modality-specific semantics, which hinder localized interpretation and complementary reasoning. To address these issues, we propose MADI, a multi-modal LLM enhanced with fine-grained alignment and disentangled interaction, featuring (1) Patch-level Alignment, which enforces physically grounded fine-grained correspondence across heterogeneous modalities, (2) Discrete Disentangled Interaction, which separates modality-common semantics into compact discrete latents and adaptively synergizes the purified modality-unique information, and (3) Critical-token Highlighting, which emphasizes informative, query-relevant signals for robust reasoning. Experiments on synthetic and real-world benchmarks show that MADI consistently outperforms general-purpose LLMs and time-series-specialized MLLMs.
【6】DA-SPS: A Dual-stage Network based on Singular Spectrum Analysis, Patching-strategy and Spearman-correlation for Multivariate Time-series Prediction
标题:DA-SPS:基于奇异谱分析、修补策略和Spearman相关的多元时间序列预测双阶段网络
链接:https://arxiv.org/abs/2601.21381
作者:Tianhao Zhang,Shusen Ma,Yu Kang,Yun-Bo Zhao
备注:12 pages, 7 figures, 6 tables, submitted to IEEE Transactions on Emerging Topics in Computational Intelligence
摘要:Multivariate time-series forecasting, as a typical problem in the field of time series prediction, has a wide range of applications in weather forecasting, traffic flow prediction, and other scenarios. However, existing works do not effectively consider the impact of extraneous variables on the prediction of the target variable. On the other hand, they fail to fully extract complex sequence information based on the various time patterns of the sequences. To address these drawbacks, we propose the DA-SPS model, which adopts different modules for feature extraction based on the information characteristics of different variables. DA-SPS mainly consists of two stages: the target variable processing stage (TVPS) and the extraneous variables processing stage (EVPS). In TVPS, the model first uses Singular Spectrum Analysis (SSA) to process the target variable sequence and then uses Long Short-Term Memory (LSTM) and P-Conv-LSTM, which deploys a patching strategy, to extract features from the trend and seasonality components, respectively. In EVPS, the model filters extraneous variables that have a strong correlation with the target variable using Spearman correlation analysis and further analyzes them using the L-Attention module, which consists of an LSTM and an attention mechanism. Finally, the results obtained by TVPS and EVPS are combined through weighted summation and linear mapping to produce the final prediction. The results on four public datasets demonstrate that the DA-SPS model outperforms existing state-of-the-art methods. Additionally, its performance in real-world scenarios is further validated using a private dataset collected by ourselves, which contains test-item information from laptop motherboards.
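The EVPS filtering step reduces to a simple correlation screen; a sketch with SciPy, where the cutoff threshold is an assumed hyperparameter rather than a value given in the abstract.
```python
import pandas as pd
from scipy.stats import spearmanr

def select_extraneous_vars(df: pd.DataFrame, target: str, threshold: float = 0.5):
    """Keep extraneous variables whose |Spearman rho| with the target exceeds
    a cutoff; the surviving columns would feed the L-Attention module."""
    keep = []
    for col in df.columns.drop(target):
        rho, _ = spearmanr(df[col], df[target])
        if abs(rho) >= threshold:
            keep.append(col)
    return keep
```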
【7】GeoRC: A Benchmark for Geolocation Reasoning Chains
标题:GeoRC:地理位置推理链的基准
链接:https://arxiv.org/abs/2601.21278
作者:Mohit Talreja,Joshua Diao,Jim Thannikary James,Radu Casapu,Tejas Santanam,Ethan Mendes,Alan Ritter,Wei Xu,James Hays
摘要:Vision Language Models (VLMs) are good at recognizing the global location of a photograph -- their geolocation prediction accuracy rivals the best human experts. But many VLMs are startlingly bad at explaining which image evidence led to their prediction, even when their location prediction is correct. The reasoning chains produced by VLMs frequently hallucinate scene attributes to support their location prediction (e.g. phantom writing, imagined infrastructure, misidentified flora). In this paper, we introduce the first benchmark for geolocation reasoning chains. We focus on the global location prediction task in the popular GeoGuessr game which draws from Google Street View spanning more than 100 countries. We collaborate with expert GeoGuessr players, including the reigning world champion, to produce 800 ground truth reasoning chains for 500 query scenes. These expert reasoning chains address hundreds of different discriminative visual attributes such as license plate shape, architecture, and soil properties to name just a few. We evaluate LLM-as-a-judge and VLM-as-a-judge strategies for scoring VLM-generated reasoning chains against our expert reasoning chains and find that Qwen 3 LLM-as-a-judge correlates best with human scoring. Our benchmark reveals that while large, closed-source VLMs such as Gemini and GPT 5 rival human experts at predicting locations, they still lag behind human experts when it comes to producing auditable reasoning chains. Open weights VLMs such as Llama and Qwen catastrophically fail on our benchmark -- they perform only slightly better than a baseline in which an LLM hallucinates a reasoning chain with oracle knowledge of the photo location but no visual information at all. We believe the gap between human experts and VLMs on this task points to VLM limitations at extracting fine-grained visual attributes from high resolution images.
【8】Understanding Diffusion Models via Ratio-Based Function Approximation with SignReLU Networks
标题:通过SignReLU网络基于比率的函数逼近来理解扩散模型
链接:https://arxiv.org/abs/2601.21242
作者:Luwei Sun,Dongrui Shen,Jianfe Li,Yulong Zhao,Han Feng
备注:34 pages
摘要:Motivated by challenges in conditional generative modeling, where the target conditional density takes the form of a ratio $f_1/f_2$, this paper develops a theoretical framework for approximating such ratio-type functionals. Here, $f_1$ and $f_2$ are kernel-based marginal densities that capture structured interactions, a setting central to diffusion-based generative models. We provide a concise proof for approximating these ratio-type functionals using deep neural networks with the SignReLU activation function, leveraging the activation's piecewise structure. Under standard regularity assumptions, we establish $L^p(\Omega)$ approximation bounds and convergence rates. Specializing to Denoising Diffusion Probabilistic Models (DDPMs), we construct a SignReLU-based neural estimator for the reverse process and derive bounds on the excess Kullback-Leibler (KL) risk between the generated and true data distributions. Our analysis decomposes this excess risk into approximation and estimation error components. These results provide generalization guarantees for finite-sample training of diffusion-based generative models.
【9】Sycophantic Anchors: Localizing and Quantifying User Agreement in Reasoning Models
标题:谄媚锚点:定位和量化推理模型中的用户同意
链接:https://arxiv.org/abs/2601.21183
作者:Jacek Duszenko
摘要:Reasoning models frequently agree with incorrect user suggestions -- a behavior known as sycophancy. However, it is unclear where in the reasoning trace this agreement originates and how strong the commitment is. To localize and quantify this behavior, we introduce sycophantic anchors -- sentences that causally lock models into user agreement. Analyzing over 10,000 counterfactual rollouts on a distilled reasoning model, we show that anchors can be reliably detected and quantified mid-inference. Linear probes distinguish sycophantic anchors with 84.6% balanced accuracy, while activation-based regressors predict the magnitude of the commitment ($R^2 = 0.74$). We further observe asymmetry where sycophantic anchors are significantly more distinguishable than correct reasoning anchors, and find that sycophancy builds gradually during reasoning, revealing a potential window for intervention. These results offer sentence-level mechanisms for localizing model misalignment mid-inference.
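The probing setup can be reproduced in a few lines, assuming sentence-level hidden states and anchor labels (from counterfactual rollouts) have already been extracted; this is a generic linear probe, not the paper's exact configuration.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score

def train_anchor_probe(H, labels):
    """H: (n_sentences, d) pooled hidden states per reasoning sentence;
    labels: 1 if counterfactual rollouts mark the sentence as an anchor
    that locks in user agreement, else 0."""
    Xtr, Xte, ytr, yte = train_test_split(
        H, labels, stratify=labels, random_state=0)
    probe = LogisticRegression(max_iter=1000, class_weight="balanced")
    probe.fit(Xtr, ytr)
    return probe, balanced_accuracy_score(yte, probe.predict(Xte))
```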
【10】Breaking the Reasoning Horizon in Entity Alignment Foundation Models
标题:突破实体对齐基础模型中的推理视野
链接:https://arxiv.org/abs/2601.21174
作者:Yuanning Cui,Zequn Sun,Wei Hu,Kexuan Xin,Zhangjie Fu
摘要:Entity alignment (EA) is critical for knowledge graph (KG) fusion. Existing EA models lack transferability and are incapable of aligning unseen KGs without retraining. While using graph foundation models (GFMs) offers a solution, we find that directly adapting GFMs to EA remains largely ineffective. This stems from a critical "reasoning horizon gap": unlike link prediction in GFMs, EA necessitates capturing long-range dependencies across sparse and heterogeneous KG structures. To address this challenge, we propose an EA foundation model driven by a parallel encoding strategy. We utilize seed EA pairs as local anchors to guide the information flow, initializing and encoding two parallel streams simultaneously. This facilitates anchor-conditioned message passing and significantly shortens the inference trajectory by leveraging local structural proximity instead of global search. Additionally, we incorporate a merged relation graph to model global dependencies and a learnable interaction module for precise matching. Extensive experiments verify the effectiveness of our framework, highlighting its strong generalizability to unseen KGs.
【11】Thinking in Frames: How Visual Context and Test-Time Scaling Empower Video Reasoning
标题:以帧思考:视觉上下文和测试时扩展如何赋能视频推理
链接:https://arxiv.org/abs/2601.21037
作者:Chengzu Li,Zanyi Wang,Jiaang Li,Yi Xu,Han Zhou,Huanyu Zhang,Ruichuan An,Dengyang Jiang,Zhaochong An,Ivan Vulić,Serge Belongie,Anna Korhonen
备注:8 pages, 3 figures, 3 tables (26 pages, 13 figures, 6 tables including references and appendices)
摘要:Vision-Language Models have excelled at textual reasoning, but they often struggle with fine-grained spatial understanding and continuous action planning, failing to simulate the dynamics required for complex visual reasoning. In this work, we formulate visual reasoning by means of video generation models, positing that generated frames can act as intermediate reasoning steps between initial states and solutions. We evaluate their capacity in two distinct regimes: Maze Navigation for sequential discrete planning with low visual change and Tangram Puzzle for continuous manipulation with high visual change. Our experiments reveal three critical insights: (1) Robust Zero-Shot Generalization: In both tasks, the model demonstrates strong performance on unseen data distributions without specific finetuning. (2) Visual Context: The model effectively uses visual context as explicit control, such as agent icons and tangram shapes, enabling it to maintain high visual consistency and adapt its planning capability robustly to unseen patterns. (3) Visual Test-Time Scaling: We observe a test-time scaling law in sequential planning; increasing the generated video length (visual inference budget) empowers better zero-shot generalization to spatially and temporally complex paths. These findings suggest that video generation is not merely a media tool, but a scalable, generalizable paradigm for visual reasoning.
【12】Distributional Active Inference
标题:分布主动推理
链接:https://arxiv.org/abs/2601.20985
作者:Abdullah Akgül,Gulcin Baykal,Manuel Haußmann,Mustafa Mert Çelikok,Melih Kandemir
摘要:Optimal control of complex environments with robotic systems faces two complementary and intertwined challenges: efficient organization of sensory state information and far-sighted action planning. Because the reinforcement learning framework addresses only the latter, it tends to deliver sample-inefficient solutions. Active inference is the state-of-the-art process theory that explains how biological brains handle this dual problem. However, its applications to artificial intelligence have thus far been limited to extensions of existing model-based approaches. We present a formal abstraction of reinforcement learning algorithms that spans model-based, distributional, and model-free approaches. This abstraction seamlessly integrates active inference into the distributional reinforcement learning framework, making its performance advantages accessible without transition dynamics modeling.
【13】Latent-IMH: Efficient Bayesian Inference for Inverse Problems with Approximate Operators
标题:Latent-IMH:具有近似算子的反问题的高效贝叶斯推理
链接:https://arxiv.org/abs/2601.20888
作者:Youguang Chen,George Biros
摘要:We study sampling from posterior distributions in Bayesian linear inverse problems where $A$, the parameters-to-observables operator, is computationally expensive. In many applications, $A$ can be factored in a manner that facilitates the construction of a cost-effective approximation $\tilde{A}$. In this framework, we introduce Latent-IMH, a sampling method based on the independence Metropolis-Hastings (IMH) sampler. Latent-IMH first generates intermediate latent variables using the approximate $\tilde{A}$, and then refines them using the exact $A$. Its primary benefit is that it shifts the computational cost to an offline phase. We theoretically analyze the performance of Latent-IMH using KL divergence and mixing time bounds. Using numerical experiments on several model problems, we show that, under reasonable assumptions, it outperforms state-of-the-art methods such as the No-U-Turn sampler (NUTS) in computational efficiency. In some cases, Latent-IMH can be orders of magnitude faster than existing schemes.
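At its core this is an independence Metropolis-Hastings scheme whose proposal comes from the cheap approximate operator; a generic sketch follows (not the paper's exact Latent-IMH construction, which additionally refines intermediate latent variables).
```python
import numpy as np

def imh_with_surrogate(logpost_exact, logprop, draw_prop, x0, n_steps, rng):
    """Independence MH: propose from a cheap surrogate (e.g. the posterior
    under an approximate operator A_tilde, prepared offline) and correct
    with the exact posterior in the accept/reject step."""
    x, lp_x, lq_x = x0, logpost_exact(x0), logprop(x0)
    chain = [x]
    for _ in range(n_steps):
        y = draw_prop(rng)                      # cheap, offline-built proposal
        lp_y, lq_y = logpost_exact(y), logprop(y)
        if np.log(rng.uniform()) < (lp_y - lp_x) - (lq_y - lq_x):
            x, lp_x, lq_x = y, lp_y, lq_y       # accept
        chain.append(x)
    return np.asarray(chain)
```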
【14】Distributed Causality in the SDG Network: Evidence from Panel VAR and Conditional Independence Analysis
标题:SDG网络中的分布因果关系:来自面板VAR和条件独立性分析的证据
链接:https://arxiv.org/abs/2601.20875
作者:Md Muhtasim Munif Fahim,Md Jahid Hasan Imran,Luknath Debnath,Tonmoy Shill,Md. Naim Molla,Ehsanul Bashar Pranto,Md Shafin Sanyan Saad,Md Rezaul Karim
备注:Comprehensive Manuscript with Code & Data
摘要:The achievement of the 2030 Sustainable Development Goals (SDGs) is dependent upon strategic resource distribution. We propose a causal discovery framework using Panel Vector Autoregression with country-specific fixed effects, together with PCMCI+ conditional independence testing, on 168 countries (2000-2025) to develop the first complete causal architecture of SDG dependencies. Utilizing 8 strategically chosen SDGs, we identify a distributed causal network (i.e., no single 'hub' SDG), with 10 statistically significant Granger-causal relationships yielding 11 unique direct effects. Education to Inequality is identified as the most statistically significant direct relationship (r = -0.599; p < 0.05), while effect magnitude varies significantly with income level (e.g., high-income: r = -0.65; lower-middle-income: r = -0.06, non-significant). We also reject the idea that there exists a single 'keystone' SDG. Additionally, we propose a tiered priority framework for the SDGs, identifying upstream drivers (Education, Growth), enabling goals (Institutions, Energy), and downstream outcomes (Poverty, Health). Therefore, we conclude that effective SDG acceleration can be accomplished through coordinated multi-dimensional interventions, and that single-goal sequential strategies are insufficient.
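The pairwise building block of such an analysis is a Granger causality test; a sketch with statsmodels for a single country's series (the paper's panel VAR with fixed effects and PCMCI+ conditioning go well beyond this primitive).
```python
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

def granger_p(df: pd.DataFrame, cause: str, effect: str, maxlag: int = 3):
    """Smallest ssr-F-test p-value across lags for 'cause Granger-causes
    effect' on one country's series; statsmodels tests whether the SECOND
    column helps predict the FIRST."""
    res = grangercausalitytests(df[[effect, cause]].dropna(), maxlag=maxlag)
    return min(r[0]["ssr_ftest"][1] for r in res.values())
```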
检测相关(4篇)
【1】Music Plagiarism Detection: Problem Formulation and a Segment-based Solution
标题:音乐抄袭检测:问题制定和基于片段的解决方案
链接:https://arxiv.org/abs/2601.21260
作者:Seonghyeon Go,Yumin Kim
摘要:Recently, the problem of music plagiarism has emerged as an ever more pressing social issue. As music information retrieval (MIR) research advances, there is a growing effort to address issues related to music plagiarism. However, many studies, including our previous work, have conducted research without clearly defining what the music plagiarism detection task actually involves. This lack of a clear definition has slowed research progress and made it hard to apply results to real-world scenarios. To address this, we define how Music Plagiarism Detection differs from other MIR tasks and explain which problems need to be solved. We introduce the Similar Music Pair dataset to support this newly defined task. In addition, we propose a method based on segment transcription as one way to solve the task. Our demo and dataset are available at https://github.com/Mippia/ICASSP2026-MPD.
【2】Conditional Generative Framework with Peak-Aware Attention for Robust Chemical Detection under Interferences
标题:具有峰值感知注意力的条件生成框架,用于干扰下的鲁棒化学检测
链接:https://arxiv.org/abs/2601.21246
作者:Namkyung Yoon,Sanghong Kim,Hwangnam Kim
备注:24 pages, 5 figures
摘要:Gas chromatography-mass spectrometry (GC-MS) is a widely used analytical method for chemical substance detection, but measurement reliability tends to deteriorate in the presence of interfering substances. In particular, interfering substances cause nonspecific peaks, retention time shifts, and increased background noise, resulting in reduced sensitivity and false alarms. To overcome these challenges, in this paper, we propose an artificial intelligence discrimination framework based on a peak-aware conditional generative model to improve the reliability of GC-MS measurements under interference conditions. The framework is trained with a novel peak-aware mechanism that highlights the characteristic peaks of GC-MS data, allowing it to generate important spectral features more faithfully. In addition, chemical and solvent information is encoded into an embedded latent vector, allowing a conditional generative adversarial network (CGAN) to generate synthetic GC-MS signals consistent with the experimental conditions. This makes it possible to build datasets for interfering-substance scenarios where real data acquisition is limited, without conducting additional experiments. These data are used to train AI-based GC-MS discrimination models, supporting accurate chemical substance discrimination. We conduct various quantitative and qualitative evaluations of the generated simulated data to verify the validity of the proposed framework. We also verify how the generative model improves the performance of the AI discrimination framework. Representatively, the proposed method is shown to consistently achieve cosine similarity and Pearson correlation coefficient values above 0.9 while preserving peak number diversity and reducing false alarms in the discrimination model.
【3】The Powers of Precision: Structure-Informed Detection in Complex Systems -- From Customer Churn to Seizure Onset
标题:精确度的力量:复杂系统中的结构知情检测--从客户流失到癫痫发作
链接:https://arxiv.org/abs/2601.21170
作者:Augusto Santos,Teresa Santos,Catarina Rodrigues,José M. F. Moura
摘要:Emergent phenomena -- onset of epileptic seizures, sudden customer churn, or pandemic outbreaks -- often arise from hidden causal interactions in complex systems. We propose a machine learning method for their early detection that addresses a core challenge: unveiling and harnessing a system's latent causal structure despite the data-generating process being unknown and partially observed. The method learns an optimal feature representation from a one-parameter family of estimators -- powers of the empirical covariance or precision matrix -- offering a principled way to tune in to the underlying structure driving the emergence of critical events. A supervised learning module then classifies the learned representation. We prove structural consistency of the family and demonstrate the empirical soundness of our approach on seizure detection and churn prediction, attaining competitive results in both. Beyond prediction, and toward explainability, we ascertain that the optimal covariance power exhibits evidence of good identifiability while capturing structural signatures, thus reconciling predictive performance with interpretable statistical structure.
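The one-parameter family is straightforward to instantiate; a sketch in NumPy, where the ridge regularizer and the flattening choice are assumptions and the power k would be tuned by the supervised module.
```python
import numpy as np

def precision_power_features(X, k=2, eps=1e-3):
    """Features from the k-th power of the (ridge-regularized) empirical
    precision matrix, one member of the one-parameter estimator family.
    The power k is a hyperparameter to be tuned on validation data."""
    S = np.cov(X, rowvar=False) + eps * np.eye(X.shape[1])
    Pk = np.linalg.matrix_power(np.linalg.inv(S), k)
    return Pk[np.triu_indices_from(Pk)]  # flattened upper triangle
```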
【4】SMKC: Sketch Based Kernel Correlation Images for Variable Cardinality Time Series Anomaly Detection
标题:SMKC:基于草图的核相关图像用于可变基数时间序列异常检测
链接:https://arxiv.org/abs/2601.21050
作者:Haokun Zhou
摘要:Conventional anomaly detection in multivariate time series relies on the assumption that the set of observed variables remains static. In operational environments, however, monitoring systems frequently experience sensor churn. Signals may appear, disappear, or be renamed, creating data windows where the cardinality varies and may include values unseen during training. To address this challenge, we propose SMKC, a framework that decouples the dynamic input structure from the anomaly detector. We first employ permutation-invariant feature hashing to sketch raw inputs into a fixed size state sequence. We then construct a hybrid kernel image to capture global temporal structure through pairwise comparisons of the sequence and its derivatives. The model learns normal patterns using masked reconstruction and a teacher-student prediction objective. Our evaluation reveals that robust log-distance channels provide the primary discriminative signal, whereas cosine representations often fail to capture sufficient contrast. Notably, we find that a detector using random projections and nearest neighbors on the SMKC representation performs competitively with fully trained baselines without requiring gradient updates. This highlights the effectiveness of the representation itself and offers a practical cold-start solution for resource-constrained deployments.
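A minimal sketch of the permutation-invariant hashing step, assuming windows arrive as name-to-series dictionaries; the signed-bucket scheme below is the standard feature-hashing trick and stands in for whatever sketch function the paper uses.
```python
import hashlib
import numpy as np

def sketch_window(window, dim=64):
    """Hash a variable-cardinality window {signal_name: series} into a fixed
    (T, dim) state sequence: each signal lands in a bucket chosen by a hash
    of its name, with a hashed sign, so the sketch is invariant to signal
    order and tolerant of sensors appearing or disappearing."""
    T = len(next(iter(window.values())))
    sketch = np.zeros((T, dim))
    for name, series in window.items():
        h = int(hashlib.md5(name.encode()).hexdigest(), 16)
        bucket, sign = h % dim, 1.0 if (h >> 8) % 2 == 0 else -1.0
        sketch[:, bucket] += sign * np.asarray(series, dtype=float)
    return sketch
```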
分类|识别(4篇)
【1】Beyond Parameter Finetuning: Test-Time Representation Refinement for Node Classification
标题:超越参数微调:节点分类的测试时表示微调
链接:https://arxiv.org/abs/2601.21615
作者:Jiaxin Zhang,Yiqi Wang,Siwei Wang,Xihong Yang,Yu Shi,Xinwang Liu,En Zhu
摘要:Graph Neural Networks frequently exhibit significant performance degradation in out-of-distribution test scenarios. While test-time training (TTT) offers a promising solution, the existing Parameter Finetuning (PaFT) paradigm suffers from catastrophic forgetting, hindering its real-world applicability. We propose TTReFT, a novel Test-Time Representation FineTuning framework that transitions the adaptation target from model parameters to latent representations. Specifically, TTReFT achieves this through three key innovations: (1) uncertainty-guided node selection for targeted interventions, (2) low-rank representation interventions that preserve pre-trained knowledge, and (3) an intervention-aware masked autoencoder that dynamically adjusts the masking strategy to accommodate the node selection scheme. Theoretically, we establish guarantees for TTReFT in OOD settings. Empirically, extensive experiments across five benchmark datasets demonstrate that TTReFT achieves consistent and superior performance. Our work establishes representation finetuning as a new paradigm for graph TTT, offering both theoretical grounding and immediate practical utility for real-world deployment.
【2】Noninvasive Intracranial Pressure Estimation Using Subspace System Identification and Bespoke Machine Learning Algorithms: A Learning-to-Rank Approach
标题:使用子空间系统辨识和定制机器学习算法的无创颅内压估计:一种学习排序方法
链接:https://arxiv.org/abs/2601.20916
作者:Anni Zhao,Ayca Ermis,Jeffrey Robert Vitt,Sergio Brasil,Wellingson Paiva,Magdalena Kasprowicz,Malgorzata Burzynska,Robert Hamilton,Runze Yan,Ofer Sadan,J. Claude Hemphill,Lieven Vandenberghe,Xiao Hu
备注:17 pages, 9 figures
摘要:Objective: Accurate noninvasive estimation of intracranial pressure (ICP) remains a major challenge in critical care. We developed a bespoke machine learning algorithm that integrates system identification and ranking-constrained optimization to estimate mean ICP from noninvasive signals. Methods: A machine learning framework was proposed to obtain accurate mean ICP values using arbitrary noninvasive signals. The subspace system identification algorithm is employed to identify cerebral hemodynamics models for ICP simulation using arterial blood pressure (ABP), cerebral blood velocity (CBv), and R-wave to R-wave interval (R-R interval) signals in a comprehensive database. A mapping function describing the relationship between the features of noninvasive signals and the estimation errors is learned with innovative ranking constraints through convex optimization. Patients across multiple clinical settings were randomly split into testing and training datasets for performance evaluation of the mapping function. Results: The results indicate that about 31.88% of testing entries achieved estimation errors within 2 mmHg and 34.07% of testing entries fell between 2 mmHg and 6 mmHg with the nonlinear mapping with constraints. Conclusion: Our results demonstrate the feasibility of the proposed noninvasive ICP estimation approach. Significance: Further validation and technical refinement are required before clinical deployment, but this work lays the foundation for safe and broadly accessible ICP monitoring in patients with acute brain injury and related conditions.
【3】Provably Reliable Classifier Guidance through Cross-entropy Error Control
标题:通过交叉熵误差控制实现可证明可靠的分类器引导
链接:https://arxiv.org/abs/2601.21200
作者:Sharan Sahu,Arisina Banerjee,Yuchen Wu
备注:32 pages, 6 figures
摘要:Classifier-guided diffusion models generate conditional samples by augmenting the reverse-time score with the gradient of a learned classifier, yet it remains unclear whether standard classifier training procedures yield effective diffusion guidance. We address this gap by showing that, under mild smoothness assumptions on the classifiers, controlling the cross-entropy error at each diffusion step also controls the error of the resulting guidance vectors: classifiers achieving conditional KL divergence $\varepsilon^2$ from the ground-truth conditional label probabilities induce guidance vectors with mean squared error $\widetilde{O}(d \varepsilon )$. Our result yields an upper bound on the sampling error under classifier guidance and bears resemblance to a reverse log-Sobolev-type inequality. Moreover, we show that the classifier smoothness assumption is essential, by constructing simple counterexamples demonstrating that, without it, control of the guidance vector can fail for almost all distributions. To our knowledge, our work establishes the first quantitative link between classifier training and guidance alignment, yielding both a theoretical foundation for classifier guidance and principled guidelines for classifier selection.
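For reference, the guidance vector being analyzed is the standard classifier-guidance construction; a PyTorch sketch with an assumed noise-conditional classifier.
```python
import torch

def guided_score(score_model, classifier, x_t, t, y, scale=1.0):
    """Classifier guidance: augment the unconditional score with the
    gradient of log p(y | x_t, t) from a noise-conditional classifier."""
    x_t = x_t.detach().requires_grad_(True)
    log_probs = torch.log_softmax(classifier(x_t, t), dim=-1)
    selected = log_probs[torch.arange(x_t.shape[0]), y].sum()
    grad = torch.autograd.grad(selected, x_t)[0]
    return score_model(x_t, t) + scale * grad
```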
【4】A Diffusive Classification Loss for Learning Energy-based Generative Models
标题:学习基于能量的生成模型的扩散性分类损失
链接:https://arxiv.org/abs/2601.21025
作者:Louis Grenioux,RuiKang OuYang,José Miguel Hernández-Lobato
摘要:Score-based generative models have recently achieved remarkable success. While they are usually parameterized by the score, an alternative way is to use a series of time-dependent energy-based models (EBMs), where the score is obtained from the negative input-gradient of the energy. Crucially, EBMs can be leveraged not only for generation, but also for tasks such as compositional sampling or building Boltzmann Generators via Monte Carlo methods. However, training EBMs remains challenging. Direct maximum likelihood is computationally prohibitive due to the need for nested sampling, while score matching, though efficient, suffers from mode blindness. To address these issues, we introduce the Diffusive Classification (DiffCLF) objective, a simple method that avoids blindness while remaining computationally efficient. DiffCLF reframes EBM learning as a supervised classification problem across noise levels, and can be seamlessly combined with standard score-based objectives. We validate the effectiveness of DiffCLF by comparing the estimated energies against ground truth in analytical Gaussian mixture cases, and by applying the trained models to tasks such as model composition and Boltzmann Generator sampling. Our results show that DiffCLF enables EBMs with higher fidelity and broader applicability than existing approaches.
表征(10篇)
【1】Cross-Fusion Distance: A Novel Metric for Measuring Fusion and Separability Between Data Groups in Representation Space
标题:交叉融合距离:一种用于测量表示空间中数据组之间融合和分离性的新型指标
链接:https://arxiv.org/abs/2601.22036
作者:Xiaolong Zhang,Jianwei Zhang,Xubo Song
备注:19 pages
摘要:Quantifying degrees of fusion and separability between data groups in representation space is a fundamental problem in representation learning, particularly under domain shift. A meaningful metric should capture fusion-altering factors like geometric displacement between representation groups, whose variations change the extent of fusion, while remaining invariant to fusion-preserving factors such as global scaling and sampling-induced layout changes, whose variations do not. Existing distributional distance metrics conflate these factors, leading to measures that are not informative of the true extent of fusion between data groups. We introduce Cross-Fusion Distance (CFD), a principled measure that isolates fusion-altering geometry while remaining robust to fusion-preserving variations, with linear computational complexity. We characterize the invariance and sensitivity properties of CFD theoretically and validate them in controlled synthetic experiments. For practical utility on real-world datasets with domain shift, CFD aligns more closely with downstream generalization degradation than commonly used alternatives. Overall, CFD provides a theoretically grounded and interpretable distance measure for representation learning.
【2】Investigation into using stochastic embedding representations for evaluating the trustworthiness of the Fréchet Inception Distance
标题:使用随机嵌入表示评估Fréchet Inception距离可信度的研究
链接:https://arxiv.org/abs/2601.21979
作者:Ciaran Bench,Vivek Desai,Carlijn Roozemond,Ruben van Engen,Spencer A. Thomas
摘要:Feature embeddings acquired from pretrained models are widely used in medical applications of deep learning to assess the characteristics of datasets; e.g. to determine the quality of synthetic, generated medical images. The Fréchet Inception Distance (FID) is one popular synthetic image quality metric that relies on the assumption that the characteristic features of the data can be detected and encoded by an InceptionV3 model pretrained on ImageNet1K (natural images). While it is widely known that this makes it less effective for applications involving medical images, the extent to which the metric fails to capture meaningful differences in image characteristics is not obviously known. Here, we use Monte Carlo dropout to compute the predictive variance in the FID as well as a supplemental estimate of the predictive variance in the feature embedding model's latent representations. We show that the magnitudes of the predictive variances considered exhibit varying degrees of correlation with the extent to which test inputs (ImageNet1K validation set augmented at various strengths, and other external datasets) are out-of-distribution relative to its training data, providing some insight into the effectiveness of their use as indicators of the trustworthiness of the FID.
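The Monte Carlo dropout machinery itself is simple; a hedged sketch, assuming the embedding model contains dropout layers (note that train() also flips any batch-norm layers, which may need freezing separately).
```python
import torch

def mc_dropout_embeddings(model, x, n_samples=20):
    """Keep dropout active at inference and collect repeated embeddings;
    their spread gives a predictive variance for the features behind the
    FID. An FID variance estimate then follows by recomputing the FID once
    per MC sample of the embedding sets."""
    model.train()  # enables dropout at inference time
    with torch.no_grad():
        embs = torch.stack([model(x) for _ in range(n_samples)])
    return embs.mean(dim=0), embs.var(dim=0)
```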
【3】Robust Multimodal Representation Learning in Healthcare
标题:医疗保健中的鲁棒多模式表示学习
链接:https://arxiv.org/abs/2601.21941
作者:Xiaoguang Zhu,Linxiao Gong,Lianlong Sun,Yang Liu,Haoyu Wang,Jing Liu
摘要:Medical multimodal representation learning aims to integrate heterogeneous data into unified patient representations to support clinical outcome prediction. However, real-world medical datasets commonly contain systematic biases from multiple sources, which poses significant challenges for medical multimodal representation learning. Existing approaches typically focus on effective multimodal fusion while neglecting inherent biased features that harm generalization. To address these challenges, we propose a Dual-Stream Feature Decorrelation Framework that uses structural causal analysis to identify and handle biases introduced by latent confounders. Our method employs a causal-biased decorrelation framework with dual-stream neural networks to disentangle causal features from spurious correlations, utilizing generalized cross-entropy loss and mutual information minimization for effective decorrelation. The framework is model-agnostic and can be integrated into existing medical multimodal learning methods. Comprehensive experiments on MIMIC-IV, eICU, and ADNI datasets demonstrate consistent performance improvements.
【4】Effective LoRA Adapter Routing using Task Representations
标题:使用任务表示的有效LoRA适配器路由
链接:https://arxiv.org/abs/2601.21795
作者:Akash Dhasade,Anne-Marie Kermarrec,Igor Pavlovic,Diana Petrescu,Rafael Pires,Mathis Randl,Martijn de Vos
摘要:Low-rank adaptation (LoRA) enables parameter efficient specialization of large language models (LLMs) through modular adapters, resulting in rapidly growing public adapter pools spanning diverse tasks. Effectively using these adapters requires routing: selecting and composing the appropriate adapters for a query. We introduce LORAUTER, a novel routing framework that selects and composes LoRA adapters using task representations rather than adapter characteristics. Unlike existing approaches that map queries directly to adapters, LORAUTER routes queries via task embeddings derived from small validation sets and does not require adapter training data. By operating at the task level, LORAUTER achieves efficient routing that scales with the number of tasks rather than the number of adapters. Experiments across multiple tasks show that LORAUTER consistently outperforms baseline routing approaches, matching Oracle performance (101.2%) when task-aligned adapters exist and achieving state-of-the-art results on unseen tasks (+5.2 points). We further demonstrate the robustness of LORAUTER to very large, noisy adapter pools by scaling it to over 1500 adapters.
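A sketch of task-level routing under the stated design, where task embeddings are assumed to be precomputed from small validation sets; the cosine-similarity rule and top-k composition below are illustrative assumptions.
```python
import numpy as np

def route_query(query_emb, task_embs, task_to_adapters, top_k=1):
    """Route a query to adapters via its nearest task embeddings; each task
    embedding is derived offline from a small validation set, so routing
    cost scales with the number of tasks, not the number of adapters."""
    q = query_emb / np.linalg.norm(query_emb)
    sims = {t: float(q @ (e / np.linalg.norm(e))) for t, e in task_embs.items()}
    best = sorted(sims, key=sims.get, reverse=True)[:top_k]
    return [a for t in best for a in task_to_adapters[t]]
```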
【5】Gauge-invariant representation holonomy
标题:规范不变的表示和乐(holonomy)
链接:https://arxiv.org/abs/2601.21653
作者:Vasileios Sevetlidis,George Pavlidis
备注:14th International Conference on Learning Representations (ICLR)
摘要:Deep networks learn internal representations whose geometry--how features bend, rotate, and evolve--affects both generalization and robustness. Existing similarity measures such as CKA or SVCCA capture pointwise overlap between activation sets, but miss how representations change along input paths. Two models may appear nearly identical under these metrics yet respond very differently to perturbations or adversarial stress. We introduce representation holonomy, a gauge-invariant statistic that measures this path dependence. Conceptually, holonomy quantifies the "twist" accumulated when features are parallel-transported around a small loop in input space: flat representations yield zero holonomy, while nonzero values reveal hidden curvature. Our estimator fixes gauge through global whitening, aligns neighborhoods using shared subspaces and rotation-only Procrustes, and embeds the result back to the full feature space. We prove invariance to orthogonal (and affine, post-whitening) transformations, establish a linear null for affine layers, and show that holonomy vanishes at small radii. Empirically, holonomy increases with loop radius, separates models that appear similar under CKA, and correlates with adversarial and corruption robustness. It also tracks training dynamics as features form and stabilize. Together, these results position representation holonomy as a practical and scalable diagnostic for probing the geometric structure of learned representations beyond pointwise similarity.
【6】CORDS: Continuous Representations of Discrete Structures
标题:CORDS:离散结构的连续表示
链接:https://arxiv.org/abs/2601.21583
作者:Tin Hadži Veljković,Erik Bekkers,Michael Tiemann,Jan-Willem van de Meent
备注:Preprint, accepted at ICLR 2026
摘要:Many learning problems require predicting sets of objects when the number of objects is not known beforehand. Examples include object detection, molecular modeling, and scientific inference tasks such as astrophysical source detection. Existing methods often rely on padded representations or must explicitly infer the set size, which often poses challenges. We present a novel strategy for addressing this challenge by casting prediction of variable-sized sets as a continuous inference problem. Our approach, CORDS (Continuous Representations of Discrete Structures), provides an invertible mapping that transforms a set of spatial objects into continuous fields: a density field that encodes object locations and count, and a feature field that carries their attributes over the same support. Because the mapping is invertible, models operate entirely in field space while remaining exactly decodable to discrete sets. We evaluate CORDS across molecular generation and regression, object detection, simulation-based inference, and a mathematical task involving recovery of local maxima, demonstrating robust handling of unknown set sizes with competitive accuracy.
【7】Representation Unlearning: Forgetting through Information Compression
标题:表示取消学习:通过信息压缩实现遗忘
链接:https://arxiv.org/abs/2601.21564
作者:Antonio Almudévar,Alfonso Ortega
摘要:Machine unlearning seeks to remove the influence of specific training data from a model, a need driven by privacy regulations and robustness concerns. Existing approaches typically modify model parameters, but such updates can be unstable, computationally costly, and limited by local approximations. We introduce Representation Unlearning, a framework that performs unlearning directly in the model's representation space. Instead of modifying model parameters, we learn a transformation over representations that imposes an information bottleneck: maximizing mutual information with retained data while suppressing information about data to be forgotten. We derive variational surrogates that make this objective tractable and show how they can be instantiated in two practical regimes: when both retain and forget data are available, and in a zero-shot setting where only forget data can be accessed. Experiments across several benchmarks demonstrate that Representation Unlearning achieves more reliable forgetting, better utility retention, and greater computational efficiency than parameter-centric baselines.
【8】Factored Causal Representation Learning for Robust Reward Modeling in RLHF
标题:RLHF中鲁棒奖励建模的因子化因果表示学习
链接:https://arxiv.org/abs/2601.21350
作者:Yupei Yang,Lin Yang,Wanxi Deng,Lin Qu,Fan Feng,Biwei Huang,Shikui Tu,Lei Xu
摘要:A reliable reward model is essential for aligning large language models with human preferences through reinforcement learning from human feedback. However, standard reward models are susceptible to spurious features that are not causally related to human labels. This can lead to reward hacking, where high predicted reward does not translate into better behavior. In this work, we address this problem from a causal perspective by proposing a factored representation learning framework that decomposes the model's contextual embedding into (1) causal factors that are sufficient for reward prediction and (2) non-causal factors that capture reward-irrelevant attributes such as length or sycophantic bias. The reward head is then constrained to depend only on the causal component. In addition, we introduce an adversarial head trained to predict reward from the non-causal factors, while applying gradient reversal to discourage them from encoding reward-relevant information. Experiments on both mathematical and dialogue tasks demonstrate that our method learns more robust reward models and consistently improves downstream RLHF performance over state-of-the-art baselines. Analyses on length and sycophantic bias further validate the effectiveness of our method in mitigating reward hacking behaviors.
【9】Hypersolid: Emergent Vision Representations via Short-Range Repulsion
标题:超实体:通过短程排斥的涌现视觉表示
链接:https://arxiv.org/abs/2601.21255
作者:Esteban Rodríguez-Betancourt,Edgar Casasola-Murillo
备注:17 pages, 16 figures
摘要:A recurring challenge in self-supervised learning is preventing representation collapse. Existing solutions typically rely on global regularization, such as maximizing distances, decorrelating dimensions or enforcing certain distributions. We instead reinterpret representation learning as a discrete packing problem, where preserving information simplifies to maintaining injectivity. We operationalize this in Hypersolid, a method using short-range hard-ball repulsion to prevent local collisions. This constraint results in a high-separation geometric regime that preserves augmentation diversity, excelling on fine-grained and low-resolution classification tasks.
【10】TRACE: Trajectory Recovery for Continuous Mechanism Evolution in Causal Representation Learning
标题:TRACE:因果表示学习中连续机制进化的轨迹恢复
链接:https://arxiv.org/abs/2601.21135
作者:Shicheng Fan,Kun Zhang,Lu Cheng
备注:23 pages, 11 figures
摘要:Temporal causal representation learning methods assume that causal mechanisms switch instantaneously between discrete domains, yet real-world systems often exhibit continuous mechanism transitions. For example, a vehicle's dynamics evolve gradually through a turning maneuver, and human gait shifts smoothly from walking to running. We formalize this setting by modeling transitional mechanisms as convex combinations of finitely many atomic mechanisms, governed by time-varying mixing coefficients. Our theoretical contributions establish that both the latent causal variables and the continuous mixing trajectory are jointly identifiable. We further propose TRACE, a Mixture-of-Experts framework where each expert learns one atomic mechanism during training, enabling recovery of mechanism trajectories at test time. This formulation generalizes to intermediate mechanism states never observed during training. Experiments on synthetic and real-world data demonstrate that TRACE recovers mixing trajectories with up to 0.99 correlation, substantially outperforming discrete-switching baselines.
优化|敛散性(8篇)
【1】Boosting CVaR Policy Optimization with Quantile Gradients
标题:利用分位数梯度增强CVaR策略优化
链接:https://arxiv.org/abs/2601.22100
作者:Yudong Luo,Erick Delage
摘要:Optimizing Conditional Value-at-risk (CVaR) using policy gradient (a.k.a CVaR-PG) faces significant challenges of sample inefficiency. This inefficiency stems from the fact that it focuses on tail-end performance and overlooks many sampled trajectories. We address this problem by augmenting CVaR with an expected quantile term. Quantile optimization admits a dynamic programming formulation that leverages all sampled data, thus improves sample efficiency. This does not alter the CVaR objective since CVaR corresponds to the expectation of quantile over the tail. Empirical results in domains with verifiable risk-averse behavior show that our algorithm within the Markovian policy class substantially improves upon CVaR-PG and consistently outperforms other existing methods.
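The identity the abstract leans on, that CVaR_α is the expectation of returns over the α-tail whose boundary is the α-quantile (VaR), is easy to verify numerically; the weighted combination below is an illustrative augmentation, not the paper's exact algorithm.
```python
import numpy as np

rng = np.random.default_rng(0)
returns = rng.normal(size=10_000)       # sampled trajectory returns
alpha = 0.10                            # risk level: worst 10% of outcomes

var_alpha = np.quantile(returns, alpha)              # alpha-quantile (VaR)
cvar_alpha = returns[returns <= var_alpha].mean()    # expectation over the tail

# Since CVaR is the tail expectation of the quantile, adding an expected
# quantile term (weight illustrative) keeps the risk target intact while
# letting every sample, not just the tail, contribute a learning signal.
augmented = cvar_alpha + 0.5 * var_alpha
print(f"VaR={var_alpha:.3f}  CVaR={cvar_alpha:.3f}  augmented={augmented:.3f}")
```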
【2】GeoNorm: Unify Pre-Norm and Post-Norm with Geodesic Optimization
标题:GeoNorm:通过测地优化统一前归一化与后归一化
链接:https://arxiv.org/abs/2601.22095
作者:Chuanyang Zheng,Jiankai Sun,Yihang Gao,Chi Wang,Yuehao Wang,Jing Xiong,Liliang Ren,Bo Peng,Qingmei Wang,Xiaoran Shang,Mac Schwager,Anderson Schneider,Yuriy Nevmyvaka,Xiaodong Liu
备注:Tech Report
摘要:The placement of normalization layers, specifically Pre-Norm and Post-Norm, remains an open question in Transformer architecture design. In this work, we rethink these approaches through the lens of manifold optimization, interpreting the outputs of the Feed-Forward Network (FFN) and attention layers as update directions in optimization. Building on this perspective, we introduce GeoNorm, a novel method that replaces standard normalization with geodesic updates on the manifold. Furthermore, analogous to learning rate schedules, we propose a layer-wise update decay for the FFN and attention components. Comprehensive experiments demonstrate that GeoNorm consistently outperforms existing normalization methods in Transformer models. Crucially, GeoNorm can be seamlessly integrated into standard Transformer architectures, achieving performance improvements with negligible additional computational cost.
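One plausible reading of "geodesic updates on the manifold", sketched for the unit hypersphere: the sublayer output is projected onto the tangent space at the current hidden state, which then moves along the exponential map instead of through an "add & normalize" step. Function and argument names are illustrative, not the paper's API.
```python
import torch

def geodesic_residual_update(x, sublayer_out, step=0.1, eps=1e-6):
    """Move x (unit-norm along the last dim) along the sphere geodesic
    whose initial direction is the tangent projection of sublayer_out."""
    v = sublayer_out - (sublayer_out * x).sum(-1, keepdim=True) * x
    v_norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    theta = step * v_norm                       # geodesic step length
    return torch.cos(theta) * x + torch.sin(theta) * (v / v_norm)

x = torch.nn.functional.normalize(torch.randn(2, 8), dim=-1)
y = geodesic_residual_update(x, torch.randn(2, 8))
print(y.norm(dim=-1))   # stays on the unit sphere by construction
```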
【3】When Gradient Optimization Is Not Enough: $\dagger$ Dispersive and Anchoring Geometric Regularizer for Multimodal Learning
标题:当梯度优化不够时:用于多模态学习的$\dagger$弥散与锚定几何正则化器
链接:https://arxiv.org/abs/2601.21670
作者:Zixuan Xia,Hao Wang,Pengcheng Weng,Yanyu Qian,Yangxin Xu,William Dan,Fei Wang
摘要:Multimodal learning aims to integrate complementary information from heterogeneous modalities, yet strong optimization alone does not guarantee well-structured representations. Even under carefully balanced training schemes, multimodal models often exhibit geometric pathologies, including intra-modal representation collapse and sample-level cross-modal inconsistency, which degrade both unimodal robustness and multimodal fusion. We identify representation geometry as a missing control axis in multimodal learning and propose $\dagger$, a lightweight geometry-aware regularization framework. $\dagger$ enforces two complementary constraints on intermediate embeddings: an intra-modal dispersive regularization that promotes representation diversity, and an inter-modal anchoring regularization that bounds sample-level cross-modal drift without rigid alignment. The proposed regularizer is plug-and-play, requires no architectural modifications, and is compatible with various training paradigms. Extensive experiments across multiple multimodal benchmarks demonstrate consistent improvements in both multimodal and unimodal performance, showing that explicitly regulating representation geometry effectively mitigates modality trade-offs.
【4】Intrinsic Reward Policy Optimization for Sparse-Reward Environments
标题:用于稀疏奖励环境的内在奖励策略优化
链接:https://arxiv.org/abs/2601.21391
作者:Minjae Cho,Huy Trong Tran
摘要:Exploration is essential in reinforcement learning as an agent relies on trial and error to learn an optimal policy. However, when rewards are sparse, naive exploration strategies, like noise injection, are often insufficient. Intrinsic rewards can also provide principled guidance for exploration by, for example, combining them with extrinsic rewards to optimize a policy or using them to train subpolicies for hierarchical learning. However, the former approach suffers from unstable credit assignment, while the latter exhibits sample inefficiency and sub-optimality. We propose a policy optimization framework that leverages multiple intrinsic rewards to directly optimize a policy for an extrinsic reward without pretraining subpolicies. Our algorithm -- intrinsic reward policy optimization (IRPO) -- achieves this by using a surrogate policy gradient that provides a more informative learning signal than the true gradient in sparse-reward environments. We demonstrate that IRPO improves performance and sample efficiency relative to baselines in discrete and continuous environments, and formally analyze the optimization problem solved by IRPO. Our code is available at https://github.com/Mgineer117/IRPO.
【5】Optimal Transport-Induced Samples against Out-of-Distribution Overconfidence
标题:针对分布外过度自信的最优传输诱导样本
链接:https://arxiv.org/abs/2601.21320
作者:Keke Tang,Ziyong Du,Xiaofei Wang,Weilong Peng,Peican Zhu,Zhihong Tian
备注:Accepted by ICLR 2026
摘要:Deep neural networks (DNNs) often produce overconfident predictions on out-of-distribution (OOD) inputs, undermining their reliability in open-world environments. Singularities in semi-discrete optimal transport (OT) mark regions of semantic ambiguity, where classifiers are particularly prone to unwarranted high-confidence predictions. Motivated by this observation, we propose a principled framework to mitigate OOD overconfidence by leveraging the geometry of OT-induced singular boundaries. Specifically, we formulate an OT problem between a continuous base distribution and the latent embeddings of training data, and identify the resulting singular boundaries. By sampling near these boundaries, we construct a class of OOD inputs, termed optimal transport-induced OOD samples (OTIS), which are geometrically grounded and inherently semantically ambiguous. During training, a confidence suppression loss is applied to OTIS to guide the model toward more calibrated predictions in structurally uncertain regions. Extensive experiments show that our method significantly alleviates OOD overconfidence and outperforms state-of-the-art methods.
【6】Magellan: Autonomous Discovery of Novel Compiler Optimization Heuristics with AlphaEvolve
标题:麦哲伦:利用AlphaEvolve自主发现新颖的编译器优化启发式方法
链接:https://arxiv.org/abs/2601.21096
作者:Hongzheng Chen,Alexander Novikov,Ngân Vũ,Hanna Alam,Zhiru Zhang,Aiden Grossman,Mircea Trofin,Amir Yazdanbakhsh
备注:Accepted to C4ML@CGO'26
摘要:Modern compilers rely on hand-crafted heuristics to guide optimization passes. These human-designed rules often struggle to adapt to the complexity of modern software and hardware and lead to high maintenance burden. To address this challenge, we present Magellan, an agentic framework that evolves the compiler pass itself by synthesizing executable C++ decision logic. Magellan couples an LLM coding agent with evolutionary search and autotuning in a closed loop of generation, evaluation on user-provided macro-benchmarks, and refinement, producing compact heuristics that integrate directly into existing compilers. Across several production optimization tasks, Magellan discovers policies that match or surpass expert baselines. In LLVM function inlining, Magellan synthesizes new heuristics that outperform decades of manual engineering for both binary-size reduction and end-to-end performance. In register allocation, it learns a concise priority rule for live-range processing that matches intricate human-designed policies on a large-scale workload. We also report preliminary results on XLA problems, demonstrating portability beyond LLVM with reduced engineering effort.
【7】Snowball: A Scalable All-to-All Ising Machine with Dual-Mode Markov Chain Monte Carlo Spin Selection and Asynchronous Spin Updates for Fast Combinatorial Optimization
标题:滚雪球:具有双模式马尔可夫链蒙特卡罗自旋选择和异步自旋更新的可扩展全对全伊辛机,用于快速组合优化
链接:https://arxiv.org/abs/2601.21058
作者:Seungki Hong,Kyeongwon Jeong,Taekwang Jang
摘要:Ising machines have emerged as accelerators for combinatorial optimization. To enable practical deployment, this work aims to reduce time-to-solution by addressing three challenges: (1) hardware topology, (2) spin selection and update algorithms, and (3) scalable coupling-coefficient precision. Restricted topologies require minor embedding; naive parallel updates can oscillate or stall; and limited precision can preclude feasible mappings or degrade solution quality. This work presents Snowball, a digital, scalable, all-to-all coupled Ising machine that integrates dual-mode Markov chain Monte Carlo spin selection with asynchronous spin updates to promote convergence and reduce time-to-solution. The digital architecture supports wide, configurable coupling precision, unlike many analog realizations at high bit widths. A prototype on an AMD Alveo U250 accelerator card achieves an 8$\times$ reduction in time-to-solution relative to a state-of-the-art Ising machine on the same benchmark instance.
【8】Near-Optimal Private Tests for Simple and MLR Hypotheses
标题:简单假设与单调似然比(MLR)假设的近最优隐私检验
链接:https://arxiv.org/abs/2601.21959
作者:Yu-Wei Chen,Raghu Pasupathy,Jordan Awan
摘要:We develop a near-optimal testing procedure under the framework of Gaussian differential privacy for simple as well as one- and two-sided tests under monotone likelihood ratio conditions. Our mechanism is based on a private mean estimator with data-driven clamping bounds, whose population risk matches the private minimax rate up to logarithmic factors. Using this estimator, we construct private test statistics that achieve the same asymptotic relative efficiency as the non-private, most powerful tests while maintaining conservative type I error control. In addition to our theoretical results, our numerical experiments show that our private tests outperform competing DP methods and offer comparable power to the non-private most powerful tests, even at moderately small sample sizes and privacy loss budgets.
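A bare-bones sketch of the kind of private estimator the test statistics are built on, under the μ-GDP Gaussian mechanism; the paper's data-driven clamping bounds are replaced by fixed ones here for simplicity.
```python
import numpy as np

def gdp_clamped_mean(x, lo, hi, mu=1.0, rng=None):
    """mu-GDP release of a mean: clamp to [lo, hi], then add Gaussian
    noise with sigma = sensitivity / mu, where the L2 sensitivity of the
    clamped mean under a one-record change is (hi - lo) / n."""
    rng = rng or np.random.default_rng()
    n = len(x)
    sensitivity = (hi - lo) / n
    return np.clip(x, lo, hi).mean() + rng.normal(0.0, sensitivity / mu)

rng = np.random.default_rng(0)
data = rng.normal(loc=0.3, size=500)
# a one-sided test of H0: theta <= 0 would threshold this private statistic
print("private mean estimate:", gdp_clamped_mean(data, -3.0, 3.0, rng=rng))
```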
预测|估计(13篇)
【1】Breaking the Regional Barrier: Inductive Semantic Topology Learning for Worldwide Air Quality Forecasting
标题:打破区域障碍:用于全球空气质量预测的归纳语义拓扑学习
链接:https://arxiv.org/abs/2601.21899
作者:Zhiqing Cui,Siru Zhong,Ming Jin,Shirui Pan,Qingsong Wen,Yuxuan Liang
摘要:Global air quality forecasting grapples with extreme spatial heterogeneity and the poor generalization of existing transductive models to unseen regions. To tackle this, we propose OmniAir, a semantic topology learning framework tailored for global station-level prediction. By encoding invariant physical environmental attributes into generalizable station identities and dynamically constructing adaptive sparse topologies, our approach effectively captures long-range non-Euclidean correlations and physical diffusion patterns across unevenly distributed global networks. We further curate WorldAir, a massive dataset covering over 7,800 stations worldwide. Extensive experiments show that OmniAir achieves state-of-the-art performance against 18 baselines, maintaining high efficiency and scalability with speeds nearly 10 times faster than existing models, while effectively bridging the monitoring gap in data-sparse regions.
【2】MoHETS: Long-term Time Series Forecasting with Mixture-of-Heterogeneous-Experts
标题:MoHETS:基于异构专家混合的长期时间序列预测
链接:https://arxiv.org/abs/2601.21866
作者:Evandro S. Ortigossa,Guy Lutsker,Eran Segal
备注:Under review
摘要:Real-world multivariate time series can exhibit intricate multi-scale structures, including global trends, local periodicities, and non-stationary regimes, which makes long-horizon forecasting challenging. Although sparse Mixture-of-Experts (MoE) approaches improve scalability and specialization, they typically rely on homogeneous MLP experts that poorly capture the diverse temporal dynamics of time series data. We address these limitations with MoHETS, an encoder-only Transformer that integrates sparse Mixture-of-Heterogeneous-Experts (MoHE) layers. MoHE routes temporal patches to a small subset of expert networks, combining a shared depthwise-convolution expert for sequence-level continuity with routed Fourier-based experts for patch-level periodic structures. MoHETS further improves robustness to non-stationary dynamics by incorporating exogenous information via cross-attention over covariate patch embeddings. Finally, we replace parameter-heavy linear projection heads with a lightweight convolutional patch decoder, improving parameter efficiency, reducing training instability, and allowing a single model to generalize across arbitrary forecast horizons. We validate across seven multivariate benchmarks and multiple horizons, with MoHETS consistently achieving state-of-the-art performance, reducing the average MSE by $12\%$ compared to strong recent baselines, demonstrating effective heterogeneous specialization for long-term forecasting.
【3】When does predictive inverse dynamics outperform behavior cloning?
标题:预测性反向动力学何时优于行为克隆?
链接:https://arxiv.org/abs/2601.21718
作者:Lukas Schäfer,Pallavi Choudhury,Abdelhak Lemkhenter,Chris Lovett,Somjit Nath,Luis França,Matheus Ribeiro Furtado de Mendonça,Alex Lamb,Riashat Islam,Siddhartha Sen,John Langford,Katja Hofmann,Sergio Valcarcel Macua
备注:Preprint
摘要:Behavior cloning (BC) is a practical offline imitation learning method, but it often fails when expert demonstrations are limited. Recent works have introduced a class of architectures named predictive inverse dynamics models (PIDM) that combine a future state predictor with an inverse dynamics model (IDM). While PIDM often outperforms BC, the reasons behind its benefits remain unclear. In this paper, we provide a theoretical explanation: PIDM introduces a bias-variance tradeoff. While predicting the future state introduces bias, conditioning the IDM on the prediction can significantly reduce variance. We establish conditions on the state predictor bias for PIDM to achieve lower prediction error and higher sample efficiency than BC, with the gap widening when additional data sources are available. We validate the theoretical insights empirically in 2D navigation tasks, where BC requires up to five times (three times on average) more demonstrations than PIDM to reach comparable performance; and in a complex 3D environment in a modern video game with high-dimensional visual inputs and stochastic transitions, where BC requires over 66% more samples than PIDM.
【4】Multi-Modal Time Series Prediction via Mixture of Modulated Experts
标题:通过混合调制专家进行多模态时间序列预测
链接:https://arxiv.org/abs/2601.21547
作者:Lige Zhang,Ali Maatouk,Jialin Chen,Leandros Tassiulas,Rex Ying
备注:26 pages, 12 figures
摘要:Real-world time series exhibit complex and evolving dynamics, making accurate forecasting extremely challenging. Recent multi-modal forecasting methods leverage textual information such as news reports to improve prediction, but most rely on token-level fusion that mixes temporal patches with language tokens in a shared embedding space. However, such fusion can be ill-suited when high-quality time-text pairs are scarce and when time series exhibit substantial variation in scale and characteristics, thus complicating cross-modal alignment. In parallel, Mixture-of-Experts (MoE) architectures have proven effective for both time series modeling and multi-modal learning, yet many existing MoE-based modality integration methods still depend on token-level fusion. To address this, we propose Expert Modulation, a new paradigm for multi-modal time series prediction that conditions both routing and expert computation on textual signals, enabling direct and efficient cross-modal control over expert behavior. Through comprehensive theoretical analysis and experiments, our proposed method demonstrates substantial improvements in multi-modal time series prediction. The current code is available at https://github.com/BruceZhangReve/MoME
【5】A block-coordinate descent framework for non-convex composite optimization. Application to sparse precision matrix estimation
标题:非凸复合优化的块坐标下降框架:应用于稀疏精度矩阵估计
链接:https://arxiv.org/abs/2601.21467
作者:Guillaume Lauga
摘要:Block-coordinate descent (BCD) is the method of choice for numerous large-scale optimization problems; however, its theoretical study for non-convex optimization has received less attention. In this paper, we present a new BCD framework to tackle non-convex composite optimization problems, ensuring decrease of the objective function and convergence to a solution. This framework is general enough to include variable metric proximal gradient updates, proximal Newton updates, and alternated minimization updates. This generality allows it to encompass three versions of the most widely used solvers for the sparse precision matrix estimation problem, known as the Graphical Lasso: graphical ISTA, Primal GLasso, and QUIC. We demonstrate the value of this new framework on non-convex sparse precision matrix estimation problems, providing convergence guarantees and up to a $100$-fold reduction in the number of iterations required to reach state-of-the-art estimation quality.
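For concreteness, a bare-bones sketch of graphical ISTA, one of the three Graphical Lasso solvers the framework covers, in its standard convex form; the step size and penalty are illustrative, and no positive-definiteness safeguard or line search is included.
```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def graphical_ista(S, lam=0.1, step=0.05, iters=1000):
    """Proximal-gradient iterations for the Graphical Lasso objective
        min_Theta  -log det(Theta) + <S, Theta> + lam * ||Theta||_1 ."""
    Theta = np.eye(S.shape[0])
    for _ in range(iters):
        grad = S - np.linalg.inv(Theta)          # gradient of the smooth part
        Theta = soft_threshold(Theta - step * grad, step * lam)
        Theta = (Theta + Theta.T) / 2.0          # re-symmetrize
    return Theta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
S = np.cov(X, rowvar=False)                      # empirical covariance
print(np.round(graphical_ista(S)[:3, :3], 2))    # sparse precision estimate
```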
【6】Model-Free Neural State Estimation in Nonlinear Dynamical Systems: A Comparative Study of Neural Architectures and Classical Filters
标题:非线性动力系统中的无模型神经状态估计:神经架构与经典滤波器的比较研究
链接:https://arxiv.org/abs/2601.21266
作者:Zhuochen Liu,Hans Walker,Rahul Jain
备注:8 pages, 2 figures
摘要:Neural network models are increasingly used for state estimation in control and decision-making problems, yet it remains unclear to what extent they behave as principled filters in nonlinear dynamical systems. Unlike classical filters, which rely on explicit knowledge of system dynamics and noise models, neural estimators can be trained purely from data without access to the underlying system equations. In this work, we present a systematic empirical comparison between such model-free neural network models and classical filtering methods across multiple nonlinear scenarios. Our study evaluates Transformer-based models, state-space neural networks, and recurrent architectures alongside particle filters and nonlinear Kalman filters. The results show that neural models (in particular, state-space models (SSMs)) achieve state estimation performance that approaches strong nonlinear Kalman filters in nonlinear scenarios and outperform weaker classical baselines despite lacking access to system models, while also attaining substantially higher inference throughput.
【7】Flow Perturbation++: Multi-Step Unbiased Jacobian Estimation for High-Dimensional Boltzmann Sampling
标题:流微扰++:用于高维Boltzmann采样的多步无偏雅可比估计
链接:https://arxiv.org/abs/2601.21177
作者:Xin Peng,Ang Gao
摘要:The scalability of continuous normalizing flows (CNFs) for unbiased Boltzmann sampling remains limited in high-dimensional systems due to the cost of Jacobian-determinant evaluation, which requires $D$ backpropagation passes through the flow layers. Existing stochastic Jacobian estimators such as the Hutchinson trace estimator reduce computation but introduce bias, while the recently proposed Flow Perturbation method is unbiased yet suffers from high variance. We present \textbf{Flow Perturbation++}, a variance-reduced extension of Flow Perturbation that discretizes the probability-flow ODE and performs unbiased stepwise Jacobian estimation at each integration step. This multi-step construction retains the unbiasedness of Flow Perturbation while achieves substantially lower estimator variance. Integrated into a Sequential Monte Carlo framework, Flow Perturbation++ achieves significantly improved equilibrium sampling on a 1000D Gaussian Mixture Model and the all-atom Chignolin protein compared with Hutchinson-based and single-step Flow Perturbation baselines.
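For context, the identity these stochastic estimators build on is Hutchinson's: E[vᵀAv] = tr(A) whenever E[vvᵀ] = I. The estimator is unbiased for the trace itself; the bias the paper targets appears once stochastic traces are plugged into downstream nonlinear quantities such as log-densities. A toy check on an explicit matrix standing in for a Jacobian:
```python
import numpy as np

rng = np.random.default_rng(0)
d = 500
A = 2.0 * np.eye(d) + rng.normal(size=(d, d)) / np.sqrt(d)   # toy "Jacobian"

def hutchinson_trace(A, n_probes=128):
    """Average of v^T A v over Rademacher probes v with E[v v^T] = I."""
    v = rng.choice([-1.0, 1.0], size=(n_probes, A.shape[0]))
    return np.einsum("nd,de,ne->n", v, A, v).mean()

print("estimate:", hutchinson_trace(A), "  exact:", np.trace(A))
```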
【8】Learning to Advect: A Neural Semi-Lagrangian Architecture for Weather Forecasting
标题:学习平流:用于天气预报的神经半拉格朗日架构
链接:https://arxiv.org/abs/2601.21151
作者:Carlos A. Pereira,Stéphane Gaudreault,Valentin Dallerit,Christopher Subich,Shoyon Panday,Siqi Wei,Sasa Zhang,Siddharth Rout,Eldad Haber,Raymond J. Spiteri,David Millard,Emilia Diaconescu
摘要:Recent machine-learning approaches to weather forecasting often employ a monolithic architecture, where distinct physical mechanisms (advection, transport), diffusion-like mixing, thermodynamic processes, and forcing are represented implicitly within a single large network. This representation is particularly problematic for advection, where long-range transport must be treated with expensive global interaction mechanisms or through deep, stacked convolutional layers. To mitigate this, we present PARADIS, a physics-inspired global weather prediction model that imposes inductive biases on network behavior through a functional decomposition into advection, diffusion, and reaction blocks acting on latent variables. We implement advection through a Neural Semi-Lagrangian operator that performs trajectory-based transport via differentiable interpolation on the sphere, enabling end-to-end learning of both the latent modes to be transported and their characteristic trajectories. Diffusion-like processes are modeled through depthwise-separable spatial mixing, while local source terms and vertical interactions are modeled via pointwise channel interactions, enabling operator-level physical structure. PARADIS provides state-of-the-art forecast skill at a fraction of the training cost. On ERA5-based benchmarks, the 1 degree PARADIS model, with a total training cost of less than a GPU month, meets or exceeds the performance of 0.25 degree traditional and machine-learning baselines, including the ECMWF HRES forecast and DeepMind's GraphCast.
【9】Faster Predictive Coding Networks via Better Initialization
标题:通过更好的初始化实现更快的预测编码网络
链接:https://arxiv.org/abs/2601.20895
作者:Luca Pinchetti,Simon Frieder,Thomas Lukasiewicz,Tommaso Salvatori
摘要:Research aimed at scaling up neuroscience inspired learning algorithms for neural networks is accelerating. Recently, a key research area has been the study of energy-based learning algorithms such as predictive coding, due to their versatility and mathematical grounding. However, the applicability of such methods is held back by the large computational requirements caused by their iterative nature. In this work, we address this problem by showing that the choice of initialization of the neurons in a predictive coding network matters significantly and can notably reduce the required training times. Consequently, we propose a new initialization technique for predictive coding networks that aims to preserve the iterative progress made on previous training samples. Our approach suggests a promising path toward reconciling the disparities between predictive coding and backpropagation in terms of computational efficiency and final performance. In fact, our experiments demonstrate substantial improvements in convergence speed and final test loss in both supervised and unsupervised settings.
【10】VSE: Variational state estimation of complex model-free process
标题:VSE:复杂无模型过程的变分状态估计
链接:https://arxiv.org/abs/2601.21887
作者:Gustav Norén,Anubhab Ghosh,Fredrik Cumlin,Saikat Chatterjee
备注:The article is accepted at ICASSP 2026
摘要:We design a variational state estimation (VSE) method that provides a closed-form Gaussian posterior of an underlying complex dynamical process from (noisy) nonlinear measurements. The complex process is model-free. That is, we do not have a suitable physics-based model characterizing the temporal evolution of the process state. The closed-form Gaussian posterior is provided by a recurrent neural network (RNN). The use of RNN is computationally simple in the inference phase. For learning the RNN, an additional RNN is used in the learning phase. Both RNNs help each other learn better based on variational inference principles. The VSE is demonstrated for a tracking application - state estimation of a stochastic Lorenz system (a benchmark process) using a 2-D camera measurement model. The VSE is shown to be competitive against a particle filter that knows the Lorenz system model and a recently proposed data-driven state estimation method that does not know the Lorenz system model.
【11】A Decomposable Forward Process in Diffusion Models for Time-Series Forecasting
标题:用于时间序列预测的扩散模型中的可分解正向过程
链接:https://arxiv.org/abs/2601.21812
作者:Francisco Caldas,Sahil Kumar,Cláudia Soares
备注:submitted to ICML'26
摘要:We introduce a model-agnostic forward diffusion process for time-series forecasting that decomposes signals into spectral components, preserving structured temporal patterns such as seasonality more effectively than standard diffusion. Unlike prior work that modifies the network architecture or diffuses directly in the frequency domain, our proposed method alters only the diffusion process itself, making it compatible with existing diffusion backbones (e.g., DiffWave, TimeGrad, CSDI). By staging noise injection according to component energy, it maintains high signal-to-noise ratios for dominant frequencies throughout the diffusion trajectory, thereby improving the recoverability of long-term patterns. This strategy enables the model to maintain the signal structure for a longer period in the forward process, leading to improved forecast quality. Across standard forecasting benchmarks, we show that applying spectral decomposition strategies, such as the Fourier or Wavelet transform, consistently improves upon diffusion models using the baseline forward process, with negligible computational overhead. The code for this paper is available at https://anonymous.4open.science/r/D-FDP-4A29.
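A toy sketch of "staging noise injection according to component energy", under a crude illustrative schedule: at progress t, the lowest-energy fraction t of rFFT components is replaced by noise, so dominant (e.g., seasonal) harmonics keep a high signal-to-noise ratio deeper into the forward trajectory. The paper's actual forward process and noise scaling may differ.
```python
import numpy as np

def staged_spectral_noising(x, t, rng):
    """Noise the weakest frequency components first (t in [0, 1])."""
    X = np.fft.rfft(x)
    order = np.argsort(np.abs(X) ** 2)           # weakest components first
    k = int(len(X) * t)
    noise = rng.normal(size=k) + 1j * rng.normal(size=k)
    X[order[:k]] = noise * np.abs(X).mean()      # replace weak bands by noise
    return np.fft.irfft(X, n=len(x))

rng = np.random.default_rng(0)
grid = np.arange(256)
series = np.sin(2 * np.pi * grid / 24) + 0.1 * rng.normal(size=256)
half_noised = staged_spectral_noising(series, t=0.5, rng=rng)
# the dominant 24-step harmonic survives to t = 0.5 nearly untouched
```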
【12】Questioning the Coverage-Length Metric in Conformal Prediction: When Shorter Intervals Are Not Better
标题:质疑保形预测中的覆盖长度指标:当较短的间隔并不更好时
链接:https://arxiv.org/abs/2601.21455
作者:Yizhou Min,Yizhou Lu,Lanqi Li,Zhen Zhang,Jiaye Teng
摘要:Conformal prediction (CP) has become a cornerstone of distribution-free uncertainty quantification, conventionally evaluated by its coverage and interval length. This work critically examines the sufficiency of these standard metrics. We demonstrate that the interval length might be deceptively improved through a counter-intuitive approach termed Prejudicial Trick (PT), while the coverage remains valid. Specifically, for any given test sample, PT probabilistically returns an interval, which is either null or constructed using an adjusted confidence level, thereby preserving marginal coverage. While PT potentially yields a deceptively lower interval length, it introduces practical vulnerabilities: the same input can yield completely different prediction intervals across repeated runs of the algorithm. We formally derive the conditions under which PT achieves these misleading improvements and provides extensive empirical evidence across various regression and classification tasks. Furthermore, we introduce a new metric interval stability which helps detect whether a new CP method implicitly improves the length based on such PT-like techniques.
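The PT construction, made concrete for split conformal regression: with probability p_null the interval is empty, and otherwise the calibration level is inflated so that marginal coverage is preserved, with the adjusted miscoverage solving (1 - p_null)(1 - α') = 1 - α. A sketch:
```python
import numpy as np

def pt_interval(cal_scores, y_hat, alpha=0.1, p_null=0.05, rng=None):
    """Prejudicial Trick: empty set with prob p_null, else a conformal
    interval at the adjusted level, so marginal coverage stays 1 - alpha."""
    rng = rng or np.random.default_rng()
    if rng.random() < p_null:
        return None                                   # length-zero "interval"
    alpha_adj = 1.0 - (1.0 - alpha) / (1.0 - p_null)  # coverage bookkeeping
    n = len(cal_scores)
    k = min(n, int(np.ceil((n + 1) * (1.0 - alpha_adj))))
    q = np.sort(cal_scores)[k - 1]                    # conformal quantile
    return (y_hat - q, y_hat + q)
```
Rerunning `pt_interval` on the same input can return either `None` or a wider-than-nominal interval, which is exactly the run-to-run instability the proposed interval-stability metric is meant to detect.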
【13】A new strategy for finite-sample valid prediction of future insurance claims in the regression setting
标题:回归环境下未来保险索赔的有限样本有效预测的新策略
链接:https://arxiv.org/abs/2601.21153
作者:Liang Hong
摘要:The extant insurance literature demonstrates a paucity of finite-sample valid prediction intervals of future insurance claims in the regression setting. To address this challenge, this article proposes a new strategy that converts a predictive method in the unsupervised iid (independent identically distributed) setting to a predictive method in the regression setting. In particular, it enables an actuary to obtain infinitely many finite-sample valid prediction intervals in the regression setting.
其他神经网络|深度学习|模型|建模(55篇)
【1】Discovering Hidden Gems in Model Repositories
标题:在模型库中发现隐藏的宝石
链接:https://arxiv.org/abs/2601.22157
作者:Jonathan Kahana,Eliahu Horwitz,Yedid Hoshen
摘要:Public repositories host millions of fine-tuned models, yet community usage remains disproportionately concentrated on a small number of foundation checkpoints. We investigate whether this concentration reflects efficient market selection or if superior models are systematically overlooked. Through an extensive evaluation of over 2,000 models, we show the prevalence of "hidden gems", unpopular fine-tunes that significantly outperform their popular counterparts. Notably, within the Llama-3.1-8B family, we find rarely downloaded checkpoints that improve math performance from 83.2% to 96.0% without increasing inference costs. However, discovering these models through exhaustive evaluation of every uploaded model is computationally infeasible. We therefore formulate model discovery as a Multi-Armed Bandit problem and accelerate the Sequential Halving search algorithm by using shared query sets and aggressive elimination schedules. Our method retrieves top models with as few as 50 queries per candidate, accelerating discovery by over 50x.
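A sketch of the accelerated Sequential Halving search under the stated assumptions: a query set shared by all surviving candidates in each round, and aggressive halving. The `evaluate` interface is hypothetical.
```python
import numpy as np

def sequential_halving(candidates, evaluate, budget, rng=None):
    """Keep the top half of candidates each round; evaluate(m, n) is
    assumed to return mean accuracy of model m on n shared queries."""
    candidates = list(candidates)
    n_rounds = max(1, int(np.ceil(np.log2(len(candidates)))))
    per_round = budget // n_rounds
    while len(candidates) > 1:
        n_queries = max(1, per_round // len(candidates))
        scores = [evaluate(m, n_queries) for m in candidates]
        keep = np.argsort(scores)[::-1][: max(1, len(candidates) // 2)]
        candidates = [candidates[i] for i in keep]   # aggressive elimination
    return candidates[0]

# toy usage: "models" are biased coins whose accuracy is their heads rate
rng = np.random.default_rng(0)
true_acc = rng.uniform(0.5, 0.95, size=64)
best = sequential_halving(range(64),
                          lambda m, n: rng.binomial(n, true_acc[m]) / n,
                          budget=64 * 50)   # ~50 queries per candidate
print(best, true_acc[best], true_acc.max())
```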
【2】Late Breaking Results: Conversion of Neural Networks into Logic Flows for Edge Computing
标题:最新成果:将神经网络转换为边缘计算的逻辑流
链接:https://arxiv.org/abs/2601.22151
作者:Daniel Stein,Shaoyi Huang,Rolf Drechsler,Bing Li,Grace Li Zhang
备注:accepted by DATE2026
摘要:Neural networks have been successfully applied in various resource-constrained edge devices, where usually central processing units (CPUs) instead of graphics processing units exist due to limited power availability. State-of-the-art research still focuses on efficiently executing enormous numbers of multiply-accumulate (MAC) operations. However, CPUs themselves are not good at executing such mathematical operations on a large scale, since they are more suited to execute control flow logic, i.e., computer algorithms. To enhance the computation efficiency of neural networks on CPUs, in this paper, we propose to convert them into logic flows for execution. Specifically, neural networks are first converted into equivalent decision trees, from which decision paths with constant leaves are then selected and compressed into logic flows. Such logic flows consist of if and else structures and a reduced number of MAC operations. Experimental results demonstrate that the latency can be reduced by up to 14.9 % on a simulated RISC-V CPU without any accuracy degradation. The code is open source at https://github.com/TUDa-HWAI/NN2Logic
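A toy illustration of the idea, not the paper's conversion pipeline: a decision path with constant leaves compiles into plain control flow, so multiply-accumulate work survives only on the non-constant path.
```python
def tiny_logic_flow(x):
    """A depth-2 decision path extracted from a (hypothetical) network:
    constant leaves become branches with zero MACs; only the residual
    branch keeps its multiply-accumulate computation."""
    if x[0] <= 0.2:        # constant leaf: no MACs evaluated
        return 0
    if x[1] > 0.8:         # constant leaf: no MACs evaluated
        return 1
    return 1 if 0.7 * x[0] - 0.4 * x[1] > 0.1 else 0   # residual MACs

print(tiny_logic_flow([0.5, 0.3]))
```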
【3】Learning Hamiltonian Flow Maps: Mean Flow Consistency for Large-Timestep Molecular Dynamics
标题:学习汉密尔顿流图:大时步分子动力学的平均流一致性
链接:https://arxiv.org/abs/2601.22123
作者:Winfried Ripken,Michael Plainer,Gregor Lied,Thorben Frank,Oliver T. Unke,Stefan Chmiela,Frank Noé,Klaus Robert Müller
摘要:Simulating the long-time evolution of Hamiltonian systems is limited by the small timesteps required for stable numerical integration. To overcome this constraint, we introduce a framework to learn Hamiltonian Flow Maps by predicting the mean phase-space evolution over a chosen time span $Δt$, enabling stable large-timestep updates far beyond the stability limits of classical integrators. To this end, we impose a Mean Flow consistency condition for time-averaged Hamiltonian dynamics. Unlike prior approaches, this allows training on independent phase-space samples without access to future states, avoiding expensive trajectory generation. Validated across diverse Hamiltonian systems, our method in particular improves upon molecular dynamics simulations using machine-learned force fields (MLFF). Our models maintain comparable training and inference cost, but support significantly larger integration timesteps while trained directly on widely-available trajectory-free MLFF datasets.
【4】Making Foundation Models Probabilistic via Singular Value Ensembles
标题:通过奇异值集成使基础模型概率化
链接:https://arxiv.org/abs/2601.22068
作者:Mehmet Ozgur Turkoglu,Dominik J. Mühlematter,Alexander Becker,Konrad Schindler,Helge Aasen
摘要:Foundation models have become a dominant paradigm in machine learning, achieving remarkable performance across diverse tasks through large-scale pretraining. However, these models often yield overconfident, uncalibrated predictions. The standard approach to quantifying epistemic uncertainty, training an ensemble of independent models, incurs prohibitive computational costs that scale linearly with ensemble size, making it impractical for large foundation models. We propose Singular Value Ensemble (SVE), a parameter-efficient implicit ensemble method that builds on a simple, but powerful core assumption: namely, that the singular vectors of the weight matrices constitute meaningful subspaces of the model's knowledge. Pretrained foundation models encode rich, transferable information in their weight matrices, and the singular vectors can be viewed as meaningful (orthogonal) "knowledge directions". To obtain a model ensemble, we modulate only how strongly each direction contributes to the output. Rather than learning entirely new parameters, we freeze the singular vectors and only train per-member singular values that rescale the contribution of each direction in that shared knowledge basis. Ensemble diversity emerges naturally as stochastic initialization and random sampling of mini-batches during joint training cause different members to converge to different combinations of the same underlying knowledge. SVE achieves uncertainty quantification comparable to explicit deep ensembles while increasing the parameter count of the base model by less than 1%, making principled uncertainty estimation accessible in resource-constrained settings. We validate SVE on NLP and vision tasks with various different backbones and show that it improves calibration while maintaining predictive accuracy.
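A minimal sketch of the core mechanism for a single linear layer, with illustrative class and argument names: the SVD factors of the pretrained weight are frozen as the shared knowledge basis, and each ensemble member trains only its own singular-value vector.
```python
import torch
import torch.nn as nn

class SVELinear(nn.Module):
    """Singular Value Ensemble layer: U, V frozen; one trainable
    singular-value vector per ensemble member."""
    def __init__(self, pretrained_weight, n_members=4):
        super().__init__()
        U, S, Vh = torch.linalg.svd(pretrained_weight, full_matrices=False)
        self.register_buffer("U", U)      # frozen "knowledge directions"
        self.register_buffer("Vh", Vh)    # frozen
        self.S = nn.Parameter(S.unsqueeze(0).repeat(n_members, 1))

    def forward(self, x, member):
        W = (self.U * self.S[member]) @ self.Vh    # U diag(S_member) V^T
        return x @ W.T

layer = SVELinear(torch.randn(32, 64), n_members=4)
x = torch.randn(8, 64)
preds = torch.stack([layer(x, m) for m in range(4)])  # implicit ensemble
uncertainty = preds.var(dim=0)    # disagreement as epistemic uncertainty
```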
【5】Learning to Communicate Across Modalities: Perceptual Heterogeneity in Multi-Agent Systems
标题:学习跨模态交流:多智能体系统中的感知异质性
链接:https://arxiv.org/abs/2601.22041
作者:Naomi Pitzer,Daniela Mihai
备注:To be published in EvoLang XVI proceedings. 15 pages, 17 figures
摘要:Emergent communication offers insight into how agents develop shared structured representations, yet most research assumes homogeneous modalities or aligned representational spaces, overlooking the perceptual heterogeneity of real-world settings. We study a heterogeneous multi-step binary communication game where agents differ in modality and lack perceptual grounding. Despite perceptual misalignment, multimodal systems converge to class-consistent messages grounded in perceptual input. Unimodal systems communicate more efficiently, using fewer bits and achieving lower classification entropy, while multimodal agents require greater information exchange and exhibit higher uncertainty. Bit perturbation experiments provide strong evidence that meaning is encoded in a distributional rather than compositional manner, as each bit's contribution depends on its surrounding pattern. Finally, interoperability analyses show that systems trained in different perceptual worlds fail to directly communicate, but limited fine-tuning enables successful cross-system communication. This work positions emergent communication as a framework for studying how agents adapt and transfer representations across heterogeneous modalities, opening new directions for both theory and experimentation.
【6】Putting a Face to Forgetting: Continual Learning meets Mechanistic Interpretability
标题:为遗忘画像:当持续学习遇上机制可解释性
链接:https://arxiv.org/abs/2601.22012
作者:Sergi Masip,Gido M. van de Ven,Javier Ferrando,Tinne Tuytelaars
摘要:Catastrophic forgetting in continual learning is often measured at the performance or last-layer representation level, overlooking the underlying mechanisms. We introduce a mechanistic framework that offers a geometric interpretation of catastrophic forgetting as the result of transformations to the encoding of individual features. These transformations can lead to forgetting by reducing the allocated capacity of features (worse representation) and disrupting their readout by downstream computations. Analysis of a tractable model formalizes this view, allowing us to identify best- and worst-case scenarios. Through experiments on this model, we empirically test our formal analysis and highlight the detrimental effect of depth. Finally, we demonstrate how our framework can be used in the analysis of practical models through the use of Crosscoders. We present a case study of a Vision Transformer trained on sequential CIFAR-10. Our work provides a new, feature-centric vocabulary for continual learning.
【7】Negatives-Dominant Contrastive Learning for Generalization in Imbalanced Domains
标题:用于不平衡域泛化的负主导对比学习
链接:https://arxiv.org/abs/2601.21999
作者:Meng Cao,Jiexi Liu,Songcan Chen
摘要:Imbalanced Domain Generalization (IDG) focuses on mitigating both domain and label shifts, both of which fundamentally shape the model's decision boundaries, particularly under heterogeneous long-tailed distributions across domains. Despite its practical significance, it remains underexplored, primarily due to the technical complexity of handling their entanglement and the paucity of theoretical foundations. In this paper, we begin by theoretically establishing the generalization bound for IDG, highlighting the role of posterior discrepancy and decision margin. This bound motivates us to focus on directly steering decision boundaries, marking a clear departure from existing methods. Subsequently, we propose a novel Negative-Dominant Contrastive Learning (NDCL) for IDG to enhance discriminability while enforcing posterior consistency across domains. Specifically, inter-class decision-boundary separation is enhanced by placing greater emphasis on negatives as the primary signal in our contrastive learning, naturally amplifying gradient signals for minority classes to avoid the decision boundary being biased toward majority classes. Meanwhile, intra-class compactness is encouraged through a re-weighted cross-entropy strategy, and posterior consistency across domains is enforced through a prediction-central alignment strategy. Finally, rigorous yet challenging experiments on benchmarks validate the effectiveness of our NDCL. The code is available at https://github.com/Alrash/NDCL.
【8】Elign: Equivariant Diffusion Model Alignment from Foundational Machine Learning Force Fields
标题:Elign:基于基础机器学习力场的等变扩散模型对齐
链接:https://arxiv.org/abs/2601.21985
作者:Yunyang Li,Lin Huang,Luojia Xia,Wenhe Zhang,Mark Gerstein
摘要:Generative models for 3D molecular conformations must respect Euclidean symmetries and concentrate probability mass on thermodynamically favorable, mechanically stable structures. However, E(3)-equivariant diffusion models often reproduce biases from semi-empirical training data rather than capturing the equilibrium distribution of a high-fidelity Hamiltonian. While physics-based guidance can correct this, it faces two computational bottlenecks: expensive quantum-chemical evaluations (e.g., DFT) and the need to repeat such queries at every sampling step. We present Elign, a post-training framework that amortizes both costs. First, we replace expensive DFT evaluations with a faster, pretrained foundational machine-learning force field (MLFF) to provide physical signals. Second, we eliminate repeated run-time queries by shifting physical steering to the training phase. To achieve the second amortization, we formulate reverse diffusion as a reinforcement learning problem and introduce Force--Energy Disentangled Group Relative Policy Optimization (FED-GRPO) to fine-tune the denoising policy. FED-GRPO includes a potential-based energy reward and a force-based stability reward, which are optimized and group-normalized independently. Experiments show that Elign generates conformations with lower gold-standard DFT energies and forces, while improving stability. Crucially, inference remains as fast as unguided sampling, since no energy evaluations are required during generation.
【9】Dependence of Equilibrium Propagation Training Success on Network Architecture
标题:均衡传播训练成功对网络架构的依赖
链接:https://arxiv.org/abs/2601.21945
作者:Qingshan Wang,Clara C. Wanjura,Florian Marquardt
备注:9 pages, 5 figures
摘要:The rapid rise of artificial intelligence has led to an unsustainable growth in energy consumption. This has motivated progress in neuromorphic computing and physics-based training of learning machines as alternatives to digital neural networks. Many theoretical studies focus on simple architectures like all-to-all or densely connected layered networks. However, these may be challenging to realize experimentally, e.g. due to connectivity constraints. In this work, we investigate the performance of the widespread physics-based training method of equilibrium propagation for more realistic architectural choices, specifically, locally connected lattices. We train an XY model and explore the influence of architecture on various benchmark tasks, tracking the evolution of spatially distributed responses and couplings during training. Our results show that sparse networks with only local connections can achieve performance comparable to dense networks. Our findings provide guidelines for further scaling up architectures based on equilibrium propagation in realistic settings.
【10】Clarity: The Flexibility-Interpretability Trade-Off in Sparsity-aware Concept Bottleneck Models
标题:清晰度:稀疏感知概念瓶颈模型中的灵活性-可解释性权衡
链接:https://arxiv.org/abs/2601.21944
作者:Konstantinos P. Panousis,Diego Marcos
摘要:The widespread adoption of Vision-Language Models (VLMs) across fields has amplified concerns about model interpretability. Distressingly, these models are often treated as black-boxes, with limited or non-existent investigation of their decision making process. Despite numerous post- and ante-hoc interpretability methods, systematic and objective evaluation of the learned representations remains limited, particularly for sparsity-aware methods that are increasingly considered to "induce interpretability". In this work, we focus on Concept Bottleneck Models and investigate how different modeling decisions affect the emerging representations. We introduce the notion of clarity, a measure capturing the interplay between the downstream performance and the sparsity and precision of the concept representation, while proposing an interpretability assessment framework using datasets with ground truth concept annotations. We consider both VLM- and attribute predictor-based CBMs, and three different sparsity-inducing strategies: per example $\ell_1, \ell_0$ and Bernoulli-based formulations. Our experiments reveal a critical trade-off between flexibility and interpretability, under which a given method can exhibit markedly different behaviors even at comparable performance levels. The code will be made publicly available upon publication.
【11】A Low-Complexity Plug-and-Play Deep Learning Model for Generalizable Massive MIMO Precoding
标题:用于可泛化大规模MIMO预编码的低复杂度即插即用深度学习模型
链接:https://arxiv.org/abs/2601.21897
作者:Ali Hasanzadeh Karkan,Ahmed Ibrahim,Jean-François Frigon,François Leduc-Primeau
摘要:Massive multiple-input multiple-output (mMIMO) downlink precoding offers high spectral efficiency but remains challenging to deploy in practice because near-optimal algorithms such as the weighted minimum mean squared error (WMMSE) are computationally expensive, and sensitive to SNR and channel-estimation quality, while existing deep learning (DL)-based solutions often lack robustness and require retraining for each deployment site. This paper proposes a plug-and-play precoder (PaPP), a DL framework with a backbone that can be trained for either fully digital (FDP) or hybrid beamforming (HBF) precoding and reused across sites, transmit-power levels, and with varying amounts of channel estimation error, avoiding the need to train a new model from scratch at each deployment. PaPP combines a high-capacity teacher and a compact student with a self-supervised loss that balances teacher imitation and normalized sum-rate, trained using meta-learning domain-generalization and transmit-power-aware input normalization. Numerical results on ray-tracing data from three unseen sites show that the PaPP FDP and HBF models both outperform conventional and deep learning baselines, after fine-tuning with a small set of local unlabeled samples. Across both architectures, PaPP achieves more than 21$\times$ reduction in modeled computation energy and maintains good performance under channel-estimation errors, making it a practical solution for energy-efficient mMIMO precoding.
【12】Managing Solution Stability in Decision-Focused Learning with Cost Regularization
标题:利用成本正则化管理决策聚焦学习中的解稳定性
链接:https://arxiv.org/abs/2601.21883
作者:Victor Spitzer,Francois Sanson
摘要:Decision-focused learning integrates predictive modeling and combinatorial optimization by training models to directly improve decision quality rather than prediction accuracy alone. Differentiating through combinatorial optimization problems represents a central challenge, and recent approaches tackle this difficulty by introducing perturbation-based approximations. In this work, we focus on estimating the objective function coefficients of a combinatorial optimization problem. Our study demonstrates that fluctuations in perturbation intensity occurring during the learning phase can lead to ineffective training, by establishing a theoretical link to the notion of solution stability in combinatorial optimization. We propose addressing this issue by introducing a regularization of the estimated cost vectors which improves the robustness and reliability of the learning process, as demonstrated by extensive numerical experiments.
【13】Quantum LEGO Learning: A Modular Design Principle for Hybrid Artificial Intelligence
标题:量子乐高学习:混合人工智能的模块化设计原则
链接:https://arxiv.org/abs/2601.21780
作者:Jun Qi,Chao-Han Huck Yang,Pin-Yu Chen,Min-Hsiu Hsieh,Hector Zenil,Jesper Tegner
备注:In submission
摘要:Hybrid quantum-classical learning models increasingly integrate neural networks with variational quantum circuits (VQCs) to exploit complementary inductive biases. However, many existing approaches rely on tightly coupled architectures or task-specific encoders, limiting conceptual clarity, generality, and transferability across learning settings. In this work, we introduce Quantum LEGO Learning, a modular and architecture-agnostic learning framework that treats classical and quantum components as reusable, composable learning blocks with well-defined roles. Within this framework, a pre-trained classical neural network serves as a frozen feature block, while a VQC acts as a trainable adaptive module that operates on structured representations rather than raw inputs. This separation enables efficient learning under constrained quantum resources and provides a principled abstraction for analyzing hybrid models. We develop a block-wise generalization theory that decomposes learning error into approximation and estimation components, explicitly characterizing how the complexity and training status of each block influence overall performance. Our analysis generalizes prior tensor-network-specific results and identifies conditions under which quantum modules provide representational advantages over comparably sized classical heads. Empirically, we validate the framework through systematic block-swap experiments across frozen feature extractors and both quantum and classical adaptive heads. Experiments on quantum dot classification demonstrate stable optimization, reduced sensitivity to qubit count, and robustness to realistic noise.
【14】Temporal Sepsis Modeling: a Fully Interpretable Relational Way
标题:时态脓毒症建模:一种完全可解释的关系方法
链接:https://arxiv.org/abs/2601.21747
作者:Vincent Lemaire,Nédra Meloulli,Pierre Jaquet
摘要:Sepsis remains one of the most complex and heterogeneous syndromes in intensive care, characterized by diverse physiological trajectories and variable responses to treatment. While deep learning models perform well in the early prediction of sepsis, they often lack interpretability and ignore latent patient sub-phenotypes. In this work, we propose a machine learning framework by opening up a new avenue for addressing this issue: a relational approach. Temporal data from electronic medical records (EMRs) are viewed as multivariate patient logs and represented in a relational data schema. Then, a propositionalisation technique (based on classic aggregation/selection functions from the field of relational data) is applied to construct interpretable features to "flatten" the data. Finally, the flattened data is classified using a selective naive Bayesian classifier. Experimental validation demonstrates the relevance of the suggested approach as well as its extreme interpretability. The interpretation is fourfold: univariate, global, local, and counterfactual.
【15】Amortized Spectral Kernel Discovery via Prior-Data Fitted Network
标题:通过先验数据拟合网络实现摊销谱核发现
链接:https://arxiv.org/abs/2601.21731
作者:Kaustubh Sharma,Srijan Tiwari,Ojasva Nema,Parikshit Pareek
摘要:Prior-Data Fitted Networks (PFNs) enable efficient amortized inference but lack transparent access to their learned priors and kernels. This opacity hinders their use in downstream tasks, such as surrogate-based optimization, that require explicit covariance models. We introduce an interpretability-driven framework for amortized spectral discovery from pre-trained PFNs with decoupled attention. We perform a mechanistic analysis on a trained PFN that identifies attention latent output as the key intermediary, linking observed function data to spectral structure. Building on this insight, we propose decoder architectures that map PFN latents to explicit spectral density estimates and corresponding stationary kernels via Bochner's theorem. We study this pipeline in both single-realization and multi-realization regimes, contextualizing theoretical limits on spectral identifiability and proving consistency when multiple function samples are available. Empirically, the proposed decoders recover complex multi-peak spectral mixtures and produce explicit kernels that support Gaussian process regression with accuracy comparable to PFNs and optimization-based baselines, while requiring only a single forward pass. This yields orders-of-magnitude reductions in inference time compared to optimization-based baselines.
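A short worked version of the Bochner step the decoders rely on: a Gaussian spectral-mixture density maps to an explicit stationary kernel. The weights, peak locations, and widths below are placeholder values, not decoder outputs.

```python
# A minimal sketch of Bochner's theorem for a Gaussian spectral mixture in
# 1-D: k(tau) = sum_i w_i * exp(-2 pi^2 sigma_i^2 tau^2) * cos(2 pi mu_i tau).
import numpy as np

def spectral_mixture_kernel(tau, w, mu, sigma):
    tau = np.asarray(tau)[..., None]            # broadcast over components
    envelope = np.exp(-2.0 * np.pi**2 * sigma**2 * tau**2)
    return np.sum(w * envelope * np.cos(2.0 * np.pi * mu * tau), axis=-1)

w     = np.array([0.6, 0.4])    # mixture weights (placeholders)
mu    = np.array([0.5, 2.0])    # spectral peak locations (frequencies)
sigma = np.array([0.1, 0.3])    # peak widths

taus = np.linspace(0.0, 3.0, 7)
print(spectral_mixture_kernel(taus, w, mu, sigma))  # Gram entries K(x, x + tau)
```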
【16】SmartMeterFM: Unifying Smart Meter Data Generative Tasks Using Flow Matching Models
标题:SmartMeterFM:使用流匹配模型统一智能电表数据生成任务
链接:https://arxiv.org/abs/2601.21706
作者:Nan Lin,Yanbo Wang,Jacco Heres,Peter Palensky,Pedro P. Vergara
备注:10 pages, 6 figures, 6 tables
摘要:Smart meter data is the foundation for planning and operating the distribution network. Unfortunately, such data are not always available due to privacy regulations. Meanwhile, the collected data may be corrupted due to sensor or transmission failure, or it may not have sufficient resolution for downstream tasks. A wide range of generative tasks is formulated to address these issues, including synthetic data generation, missing data imputation, and super-resolution. Despite the success of machine learning models on these tasks, dedicated models need to be designed and trained for each task, leading to redundancy and inefficiency. In this paper, recognizing the powerful modeling capability of flow matching models, we propose a new approach that unifies diverse smart meter data generative tasks with a single model trained for conditional generation. The proposed flow matching models are trained to generate challenging, high-dimensional time series data, specifically monthly smart meter data at 15-minute resolution. By viewing different generative tasks as distinct forms of partial data observations and injecting them into the generation process, we unify tasks such as imputation and super-resolution with a single model, eliminating the need for re-training. The data generated by our model are not only consistent with the given observations but also realistic, outperforming interpolation and other task-dedicated machine-learning baselines.
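A minimal sketch (assuming PyTorch; not the paper's code) of a conditional flow-matching objective where partial observations enter as conditioning, so that imputation and super-resolution differ only in the observation mask. `model` is any network taking (noisy state, time, mask, masked observation) and predicting a velocity.

```python
# A minimal sketch of conditional flow matching with observation injection.
import torch

def flow_matching_loss(model, x1, mask):
    """x1: (B, T) smart-meter profiles; mask: (B, T), 1 = observed entry."""
    b = x1.shape[0]
    x0 = torch.randn_like(x1)                  # noise endpoint
    t = torch.rand(b, 1)                       # time per sample
    xt = (1 - t) * x0 + t * x1                 # linear interpolation path
    target_v = x1 - x0                         # conditional velocity target
    pred_v = model(xt, t, mask, mask * x1)     # observations injected as context
    return ((pred_v - target_v) ** 2).mean()

# Imputation vs. super-resolution: same model, different observation masks.
x1 = torch.randn(8, 96)                            # e.g., one day at 15-min steps
impute_mask = (torch.rand(8, 96) > 0.3).float()    # random gaps to fill
sr_mask = torch.zeros(8, 96); sr_mask[:, ::4] = 1  # hourly observations only
```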
【17】Don't be so Stief! Learning KV Cache low-rank approximation over the Stiefel manifold
标题:别这么Stief!在Stiefel流形上学习KV缓存低秩近似
链接:https://arxiv.org/abs/2601.21686
作者:Luca Benfenati,Matteo Risso,Andrea Vannozzi,Ahmet Caner Yüzügüler,Lukas Cavigelli,Enrico Macii,Daniele Jahier Pagliari,Alessio Burrello
摘要:Key-value (KV) caching enables fast autoregressive decoding but at long contexts becomes a dominant bottleneck in High Bandwidth Memory (HBM) capacity and bandwidth. A common mitigation is to compress cached keys and values by projecting per-head matrices to a lower rank, storing only the projections in the HBM. However, existing post-training approaches typically fit these projections using SVD-style proxy objectives, which may poorly reflect end-to-end reconstruction after softmax, value mixing, and subsequent decoder-layer transformations. For these reasons, we introduce StiefAttention, a post-training KV-cache compression method that learns \emph{orthonormal} projection bases by directly minimizing \emph{decoder-layer output reconstruction error}. StiefAttention additionally precomputes, for each layer, an error-rank profile over candidate ranks, enabling flexible layer-wise rank allocation under a user-specified error budget. Notably, on Llama3-8B under the same conditions, StiefAttention outperforms EigenAttention by $11.9$ points on C4 perplexity and $5.4\%$ on 0-shot MMLU accuracy at iso-compression, yielding lower relative error and higher cosine similarity with respect to the original decoder-layer outputs.
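A minimal sketch of the storage arithmetic behind low-rank KV-cache compression with an orthonormal basis. The basis here comes from a plain SVD of a calibration cache, i.e. exactly the SVD-style proxy objective that StiefAttention replaces with its decoder-output objective.

```python
# A minimal sketch (not the paper's code) of low-rank KV caching: with an
# orthonormal basis P (r x d, rows orthonormal, so P @ P.T = I_r), keys are
# stored as (n, r) projections and reconstructed on the fly.
import torch

d, r, n = 64, 16, 128
K_calib = torch.randn(1024, d)                 # calibration keys for one head

# Stand-in basis: top-r right singular vectors (the SVD-style proxy).
_, _, Vh = torch.linalg.svd(K_calib, full_matrices=False)
P = Vh[:r]                                     # (r, d), orthonormal rows

K = torch.randn(n, d)                          # keys at inference time
K_cached = K @ P.T                             # store only (n, r) in HBM
K_approx = K_cached @ P                        # reconstruct (n, d) when needed

rel_err = (K - K_approx).norm() / K.norm()
print(f"relative reconstruction error: {rel_err:.3f}")
```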
【18】SWE-Spot: Building Small Repo-Experts with Repository-Centric Learning
标题:SWE-Spot:通过以仓库为中心的学习培养小型仓库专家
链接:https://arxiv.org/abs/2601.21649
作者:Jinjun Peng,Magnus Saebo,Tianjun Zhong,Yi-Jie Cheng,Junfeng Yang,Baishakhi Ray,Simin Chen,Yangruibo Ding
摘要:The deployment of coding agents in privacy-sensitive and resource-constrained environments drives the demand for capable open-weight Small Language Models (SLMs). However, they suffer from a fundamental capability gap: unlike frontier large models, they lack the inference-time strong generalization to work with complicated, unfamiliar codebases. We identify that the prevailing Task-Centric Learning (TCL) paradigm, which scales exposure across disparate repositories, fails to address this limitation. In response, we propose Repository-Centric Learning (RCL), a paradigm shift that prioritizes vertical repository depth over horizontal task breadth, suggesting SLMs must internalize the "physics" of a target software environment through parametric knowledge acquisition, rather than attempting to recover it via costly inference-time search. Following this new paradigm, we design a four-unit Repository-Centric Experience, transforming static codebases into interactive learning signals, to train SWE-Spot-4B, a family of highly compact models built as repo-specialized experts that breaks established scaling trends, outperforming much larger open-weight models (e.g., CWM by Meta, Qwen3-Coder-30B) and matching or surpassing efficiency-focused commercial models (e.g., GPT-4.1-mini, GPT-5-nano) across multiple SWE tasks. Further analysis reveals that RCL yields higher training sample efficiency and lower inference costs, emphasizing that for building efficient intelligence, repository mastery is a distinct and necessary dimension that complements general coding capability.
【19】Identifiable Equivariant Networks are Layerwise Equivariant
标题:可识别等变网络是分层等变的
链接:https://arxiv.org/abs/2601.21645
作者:Vahid Shahverdi,Giovanni Luca Marchetti,Georg Bökman,Kathlén Kohn
摘要:We investigate the relation between end-to-end equivariance and layerwise equivariance in deep neural networks. We prove the following: For a network whose end-to-end function is equivariant with respect to group actions on the input and output spaces, there is a parameter choice yielding the same end-to-end function such that its layers are equivariant with respect to some group actions on the latent spaces. Our result assumes that the parameters of the model are identifiable in an appropriate sense. This identifiability property has been established in the literature for a large class of networks, to which our results apply immediately, while it is conjectural for others. The theory we develop is grounded in an abstract formalism, and is therefore architecture-agnostic. Overall, our results provide a mathematical explanation for the emergence of equivariant structures in the weights of neural networks during training -- a phenomenon that is consistently observed in practice.
【20】Training Memory in Deep Neural Networks: Mechanisms, Evidence, and Measurement Gaps
标题:深度神经网络中的训练记忆:机制、证据和测量差距
链接:https://arxiv.org/abs/2601.21624
作者:Vasileios Sevetlidis,George Pavlidis
摘要:Modern deep-learning training is not memoryless. Updates depend on optimizer moments and averaging, data-order policies (random reshuffling vs with-replacement, staged augmentations and replay), the nonconvex path, and auxiliary state (teacher EMA/SWA, contrastive queues, BatchNorm statistics). This survey organizes mechanisms by source, lifetime, and visibility. It introduces seed-paired, function-space causal estimands; portable perturbation primitives (carry/reset of momentum/Adam/EMA/BN, order-window swaps, queue/teacher tweaks); and a reporting checklist with audit artifacts (order hashes, buffer/BN checksums, RNG contracts). The conclusion is a protocol for portable, causal, uncertainty-aware measurement that attributes how much training history matters across models, data, and regimes.
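A minimal sketch of one perturbation primitive mentioned above, assuming PyTorch: seed-paired runs that either carry or reset the optimizer's Adam moments at a segment boundary, isolating that particular memory channel. The toy model and data are placeholders.

```python
# A minimal sketch (not the survey's protocol code) of a carry/reset
# perturbation on optimizer state in seed-paired training runs.
import torch

def train_segment(model, opt, data, steps):
    for x, y in data[:steps]:
        opt.zero_grad()
        torch.nn.functional.mse_loss(model(x), y).backward()
        opt.step()

torch.manual_seed(0)                              # seed-paired: same init/data
model = torch.nn.Linear(8, 1)
opt = torch.optim.Adam(model.parameters())
data = [(torch.randn(16, 8), torch.randn(16, 1)) for _ in range(20)]

train_segment(model, opt, data, steps=10)
opt.state.clear()                                 # "reset" arm: drop Adam moments
train_segment(model, opt, data[10:], steps=10)    # compare against a "carry" arm
```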
【21】Learning the Mechanism of Catastrophic Forgetting: A Perspective from Gradient Similarity
标题:学习灾难性遗忘的机制:从梯度相似性的角度
链接:https://arxiv.org/abs/2601.21577
作者:Mutian Yang,Zisen Zhan,Yutong Chen,Haolin Li,Kaiwen Wang,Kaili Zheng,Yuguang Wang,Qi Wang,Jiandong Gao,Ji Wu
摘要:Catastrophic forgetting during knowledge injection severely undermines the continual learning capability of large language models (LLMs). Although existing methods attempt to mitigate this issue, they often lack a foundational theoretical explanation. We establish a gradient-based theoretical framework to explain catastrophic forgetting. We first prove that strongly negative gradient similarity is a fundamental cause of forgetting. We then use gradient similarity to identify two types of neurons: conflicting neurons that induce forgetting and account for 50%-75% of neurons, and collaborative neurons that mitigate forgetting and account for 25%-50%. Based on this analysis, we propose a knowledge injection method, Collaborative Neural Learning (CNL). By freezing conflicting neurons and updating only collaborative neurons, CNL theoretically eliminates catastrophic forgetting under an infinitesimal learning rate and an exactly known mastered set. Experiments on five LLMs, four datasets, and four optimizers show that CNL achieves zero forgetting in in-set settings and reduces forgetting by 59.1%-81.7% in out-of-set settings.
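A minimal sketch of the neuron-selection rule, assuming PyTorch and treating each output row of a weight matrix as one neuron; the two losses are placeholders, and this illustrates the gradient-similarity test, not the paper's implementation.

```python
# A minimal sketch: freeze rows whose new-task and retention gradients
# conflict (negative cosine similarity); update only collaborative rows.
import torch

def collaborative_mask(weight, loss_new, loss_retain):
    g_new = torch.autograd.grad(loss_new, weight, retain_graph=True)[0]
    g_ret = torch.autograd.grad(loss_retain, weight, retain_graph=True)[0]
    cos = torch.nn.functional.cosine_similarity(g_new, g_ret, dim=1)  # per row
    return (cos >= 0).float().unsqueeze(1)       # 1 = collaborative neuron

W = torch.randn(32, 16, requires_grad=True)
x = torch.randn(4, 16)
loss_new = (x @ W.T).sum()                       # placeholder injection loss
loss_retain = ((x + 0.1) @ W.T).pow(2).mean()    # placeholder retention loss

mask = collaborative_mask(W, loss_new, loss_retain)
grad_new = torch.autograd.grad(loss_new, W)[0]
with torch.no_grad():
    W -= 1e-3 * mask * grad_new                  # conflicting rows stay frozen
```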
【22】Fast and Geometrically Grounded Lorentz Neural Networks
标题:快速且具有几何基础的洛伦兹神经网络
链接:https://arxiv.org/abs/2601.21529
作者:Robert van der Klis,Ricardo Chávez Torres,Max van Spengler,Yuhui Ding,Thomas Hofmann,Pascal Mettes
备注:19 pages, 4 figures
【23】Partial Feedback Online Learning
标题:部分反馈在线学习
链接:https://arxiv.org/abs/2601.21462
作者:Shihao Shao,Cong Fang,Zhouchen Lin,Dacheng Tao
备注:32 pages
【24】Lossy Common Information in a Learnable Gray-Wyner Network
标题:可学习Gray-Wyner网络中的有损公共信息
链接:https://arxiv.org/abs/2601.21424
作者:Anderson de Andrade,Alon Harell,Ivan V. Bajić
【25】Revisiting Diffusion Model Predictions Through Dimensionality
标题:通过维度重新审视扩散模型预测
链接:https://arxiv.org/abs/2601.21419
作者:Qing Jin,Chaoyang Wang
备注:19 pages, 5 figures
【26】Hebbian Learning with Global Direction
标题:具有全局方向的Hebbian学习
链接:https://arxiv.org/abs/2601.21367
作者:Wenjia Hua,Kejie Zhao,Luziwei Leng,Ran Cheng,Yuxin Ma,Qinghai Guo
备注:Accepted to ICASSP 2026
【27】Self-Improving Pretraining: using post-trained models to pretrain better models
标题:自我改进预训练:使用后训练模型预训练更好的模型
链接:https://arxiv.org/abs/2601.21343
作者:Ellen Xiaoqing Tan,Shehzaad Dhuliawala,Jing Xu,Ping Yu,Sainbayar Sukhbaatar,Jason Weston,Olga Golovneva
【28】An introductory Generalization of the standard SVMs loss and its applications to Shallow and Deep Neural Networks
标题:标准SVM损失的一种入门性推广及其在浅层和深层神经网络中的应用
链接:https://arxiv.org/abs/2601.21331
【29】Achieving $\varepsilon^{-2}$ Dependence for Average-Reward Q-Learning with a New Contraction Principle
标题:通过新的压缩原理实现平均奖励Q学习的$\varepsilon^{-2}$依赖性
链接:https://arxiv.org/abs/2601.21301
作者:Zijun Chen,Zaiwei Chen,Nian Si,Shengbo Wang
【30】Missing-Data-Induced Phase Transitions in Spectral PLS for Multimodal Learning
标题:多模态学习中谱偏最小二乘的缺失数据诱发相变
链接:https://arxiv.org/abs/2601.21294
作者:Anders Gjølbye,Ida Kargaard,Emma Kargaard,Lars Kai Hansen
备注:Preprint
【31】PILD: Physics-Informed Learning via Diffusion
标题:PILD:基于扩散的物理信息学习
链接:https://arxiv.org/abs/2601.21284
作者:Tianyi Zeng,Tianyi Wang,Jiaru Zhang,Zimo Zeng,Feiyang Zhang,Yiming Xu,Sikai Chen,Yajie Zou,Yangyang Wang,Junfeng Jiao,Christian Claudel,Xinbo Chen
【32】PHDME: Physics-Informed Diffusion Models without Explicit Governing Equations
标题:PHDME:无显式控制方程的物理信息扩散模型
链接:https://arxiv.org/abs/2601.21234
作者:Kaiyuan Tan,Kendra Givens,Peilun Li,Thomas Beckers
【33】Soft Quantization: Model Compression Via Weight Coupling
标题:软量化:通过权重耦合进行模型压缩
链接:https://arxiv.org/abs/2601.21219
作者:Daniel T. Bernstein,Luca Di Carlo,David Schwab
备注:7 pages, 6 figures
【34】A Federated Generalized Expectation-Maximization Algorithm for Mixture Models with an Unknown Number of Components
标题:分量数未知的混合模型的联邦广义期望最大化算法
链接:https://arxiv.org/abs/2601.21160
作者:Michael Ibrahim,Nagi Gebraeel,Weijun Xie
备注:49 Pages, Accepted at ICLR 2026
【35】Can Neural Networks Learn Small Algebraic Worlds? An Investigation Into the Group-theoretic Structures Learned By Narrow Models Trained To Predict Group Operations
标题:神经网络能学习小代数世界吗?对训练用于预测群运算的窄模型所学群论结构的研究
链接:https://arxiv.org/abs/2601.21150
作者:Henry Kvinge,Andrew Aguilar,Nayda Farnsworth,Grace O'Brien,Robert Jasper,Sarah Scullen,Helen Jenne
备注:Presented at TAG-DS 2025
【36】Mobility-Embedded POIs: Learning What A Place Is and How It Is Used from Human Movement
标题:移动嵌入式兴趣点:从人类移动中学习一个地点是什么以及它如何被使用
链接:https://arxiv.org/abs/2601.21149
作者:Maria Despoina Siampou,Shushman Choudhury,Shang-Ling Hsu,Neha Arora,Cyrus Shahabi
【37】Smooth Dynamic Cutoffs for Machine Learning Interatomic Potentials
标题:机器学习原子间势的平滑动态截断
链接:https://arxiv.org/abs/2601.21147
作者:Kevin Han,Haolin Cong,Bowen Deng,Amir Barati Farimani
【38】MapPFN: Learning Causal Perturbation Maps in Context
标题:MapPFN:在上下文中学习因果微扰图
链接:https://arxiv.org/abs/2601.21092
作者:Marvin Sextro,Weronika Kłos,Gabriel Dernbach
【39】LOCUS: Low-Dimensional Model Embeddings for Efficient Model Exploration, Comparison, and Selection
标题:LOCUS:用于高效模型探索、比较和选择的低维模型嵌入
链接:https://arxiv.org/abs/2601.21082
作者:Shivam Patel,William Cocke,Gauri Joshi
【40】Signal from Structure: Exploiting Submodular Upper Bounds in Generative Flow Networks
标题:来自结构的信号:利用生成流网络中的次模上界
链接:https://arxiv.org/abs/2601.21061
作者:Alexandre Larouche,Audrey Durand
【41】Predict-Project-Renoise: Sampling Diffusion Models under Hard Constraints
标题:预测-投影-重加噪:硬约束下的扩散模型采样
链接:https://arxiv.org/abs/2601.21033
作者:Omer Rochman-Sharabi,Gilles Louppe
备注:Code coming soon
【42】SIGMA-PPG: Statistical-prior Informed Generative Masking Architecture for PPG Foundation Model
标题:SIGMA-PPG:用于PPG基础模型的统计先验引导生成式掩码架构
链接:https://arxiv.org/abs/2601.21031
作者:Zongheng Guo,Tao Chen,Yang Jiao,Yi Pan,Xiao Hu,Manuela Ferrario
备注:31 pages, 9 figures, 14 tables
【43】A Theory of Universal Agnostic Learning
标题:一种通用不可知学习理论
链接:https://arxiv.org/abs/2601.20961
作者:Steve Hanneke,Shay Moran
【44】Is Parameter Isolation Better for Prompt-Based Continual Learning?
标题:参数隔离对于基于提示的持续学习更好吗?
链接:https://arxiv.org/abs/2601.20894
作者:Jiangyang Li,Chenhao Ding,Songlin Dong,Qiang Wang,Jianchao Zhao,Yuhang He,Yihong Gong
备注:17 pages, 5 figures
【45】A generative machine learning model for designing metal hydrides applied to hydrogen storage
标题:一种用于设计氢储存金属氢化物的生成式机器学习模型
链接:https://arxiv.org/abs/2601.20892
作者:Xiyuan Liu,Christian Hacker,Shengnian Wang,Yuhua Duan
【46】STAER: Temporal Aligned Rehearsal for Continual Spiking Neural Network
标题:STAER:持续脉冲神经网络的时间对齐重放
链接:https://arxiv.org/abs/2601.20870
作者:Matteo Gianferrari,Omayma Moussadek,Riccardo Salami,Cosimo Fiorini,Lorenzo Tartarini,Daniela Gandolfi,Simone Calderara
【47】Alpha Discovery via Grammar-Guided Learning and Search
标题:通过文法引导学习和搜索进行Alpha发现
链接:https://arxiv.org/abs/2601.22119
作者:Han Yang,Dong Hao,Zhuohan Wang,Qi Shi,Xingtong Li
备注:24 pages, 10 figures
【48】On Forgetting and Stability of Score-based Generative models
标题:基于分数的生成模型的遗忘和稳定性
链接:https://arxiv.org/abs/2601.21868
作者:Stanislas Strasman,Gabriel Cardoso,Sylvain Le Corff,Vincent Lemaire,Antonio Ocello
【49】Generative Modeling of Discrete Data Using Geometric Latent Subspaces
标题:使用几何潜子空间的离散数据生成建模
链接:https://arxiv.org/abs/2601.21831
作者:Daniel Gonzalez-Alvarado,Jonas Cassel,Stefania Petra,Christoph Schnörr
【50】A Flexible Empirical Bayes Approach to Generalized Linear Models, with Applications to Sparse Logistic Regression
标题:广义线性模型的灵活经验Bayes方法及其在稀疏逻辑回归中的应用
链接:https://arxiv.org/abs/2601.21217
作者:Dongyue Xie,Wanrong Zhu,Matthew Stephens
【51】High-dimensional learning dynamics of multi-pass Stochastic Gradient Descent in multi-index models
标题:多指标模型中多遍随机梯度下降的高维学习动态
链接:https://arxiv.org/abs/2601.21093
【52】An efficient, accurate, and interpretable machine learning method for computing probability of failure
标题:一种高效、准确且可解释的机器学习方法,用于计算故障概率
链接:https://arxiv.org/abs/2601.21089
作者:Jacob Zhu,Donald Estep
【53】Towards regularized learning from functional data with covariate shift
标题:协变量偏移下函数型数据的正则化学习
链接:https://arxiv.org/abs/2601.21019
作者:Markus Holzleitner,Sergiy Pereverzyev,Sergei V. Pereverzyev,Vaibhav Silmana,S. Sivananthan
备注:38 pages
【54】Efficient Causal Structure Learning via Modular Subgraph Integration
标题:通过模块子图集成进行高效因果结构学习
链接:https://arxiv.org/abs/2601.21014
作者:Haixiang Sun,Pengchao Tian,Zihan Zhou,Jielei Zhang,Peiyi Li,Andrew L. Liu
【55】ATTNSOM: Learning Cross-Isoform Attention for Cytochrome P450 Site-of-Metabolism
标题:ATTNSOM:学习细胞色素P450代谢部位的跨同工型注意力
链接:https://arxiv.org/abs/2601.20891
作者:Hajung Kim,Eunha Lee,Sohyun Chung,Jueon Park,Seungheun Baek,Jaewoo Kang
备注:14 pages
其他(69篇)
【1】FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale
标题:FineInstructions:将合成指令扩展到预训练规模
链接:https://arxiv.org/abs/2601.22146
作者:Ajay Patel,Colin Raffel,Chris Callison-Burch
【2】StepShield: When, Not Whether to Intervene on Rogue Agents
标题:StepShield:何时而非是否干预失控智能体
链接:https://arxiv.org/abs/2601.22136
作者:Gloria Felicia,Michael Eniolade,Jinfeng He,Zitha Sasindran,Hemant Kumar,Milan Hussain Angati,Sandeep Bandarupalli
备注:16 pages, 2 figures, 14 tables
【3】SWE-Replay: Efficient Test-Time Scaling for Software Engineering Agents
标题:SWE-Replay:面向软件工程智能体的高效测试时扩展
链接:https://arxiv.org/abs/2601.22129
作者:Yifeng Ding,Lingming Zhang
【4】Physics Informed Reconstruction of Four-Dimensional Atmospheric Wind Fields Using Multi-UAS Swarm Observations in a Synthetic Turbulent Environment
标题:在合成湍流环境中利用多UAS群观测进行物理信息的四维大气风场重建
链接:https://arxiv.org/abs/2601.22111
作者:Abdullah Tasim,Wei Sun
【5】Value-Based Pre-Training with Downstream Feedback
标题:结合下游反馈的基于价值的预训练
链接:https://arxiv.org/abs/2601.22108
【6】ECO: Quantized Training without Full-Precision Master Weights
标题:ECO:无全精度主权重的量化训练
链接:https://arxiv.org/abs/2601.22101
作者:Mahdi Nikdan,Amir Zandieh,Dan Alistarh,Vahab Mirrokni
【7】Holographic generative flows with AdS/CFT
标题:使用AdS/CFT的全息生成流
链接:https://arxiv.org/abs/2601.22033
作者:Ehsan Mirafzali,Sanjit Shashi,Sanya Murdeshwar,Edgar Shaghoulian,Daniele Venturi,Razvan Marinescu
备注:v1: 13 pages, 6 figures
【8】The Ensemble Inverse Problem: Applications and Methods
标题:系综反问题:应用与方法
链接:https://arxiv.org/abs/2601.22029
作者:Zhengyan Huan,Camila Pazos,Martin Klassen,Vincent Croft,Pierre-Hugues Beauchemin,Shuchin Aeron
备注:26 pages, 11 figures, in peer review
【9】TBDFiltering: Sample-Efficient Tree-Based Data Filtering
标题:TBDFiltering:样本高效的基于树的数据过滤
链接:https://arxiv.org/abs/2601.22016
作者:Robert Istvan Busa-Fekete,Julian Zimmert,Anne Xiangyi Zheng,Claudio Gentile,Andras Gyorgy
【10】Geometry of Drifting MDPs with Path-Integral Stability Certificates
标题:具有路径积分稳定性证书的漂移MDP的几何
链接:https://arxiv.org/abs/2601.21991
作者:Zuyuan Zhang,Mahdi Imani,Tian Lan
【11】PowerGenie: Analytically-Guided Evolutionary Discovery of Superior Reconfigurable Power Converters
标题:PowerGenie:分析引导的卓越可重新配置电源转换器的进化发现
链接:https://arxiv.org/abs/2601.21984
作者:Jian Gao,Yiwei Zou,Abhishek Pradhan,Wenhao Huang,Yumin Su,Kaiyuan Yang,Xuan Zhang
【12】LoRIF: Low-Rank Influence Functions for Scalable Training Data Attribution
标题:LoRIF:用于可扩展训练数据归因的低秩影响函数
链接:https://arxiv.org/abs/2601.21929
作者:Shuangqi Li,Hieu Le,Jingyi Xu,Mathieu Salzmann
【13】Optimistic Transfer under Task Shift via Bellman Alignment
标题:通过贝尔曼对齐实现任务偏移下的乐观迁移
链接:https://arxiv.org/abs/2601.21924
作者:Jinhang Chai,Enpei Zhang,Elynn Chen,Yujun Yan
【14】Hardware-Triggered Backdoors
标题:硬件触发后门
链接:https://arxiv.org/abs/2601.21902
作者:Jonas Möller,Erik Imgrund,Thorsten Eisenhofer,Konrad Rieck
【15】LEMUR: Learned Multi-Vector Retrieval
标题:LEMUR:学习型多向量检索
链接:https://arxiv.org/abs/2601.21853
作者:Elias Jääsaari,Ville Hyvönen,Teemu Roos
备注:17 pages
【16】Test-Time Compute Games
标题:测试时计算博弈
链接:https://arxiv.org/abs/2601.21839
作者:Ander Artola Velasco,Dimitrios Rontogiannis,Stratis Tsirtsis,Manuel Gomez-Rodriguez
【17】Scalable Linearized Laplace Approximation via Surrogate Neural Kernel
标题:通过代理神经核的可扩展线性化拉普拉斯逼近
链接:https://arxiv.org/abs/2601.21835
作者:Luis A. Ortega,Simón Rodríguez-Santana,Daniel Hernández-Lobato
备注:6 pages, 1 table. Accepted at European Symposium on Artificial Neural Networks (ESANN 2026) as oral presentation
【18】Error Amplification Limits ANN-to-SNN Conversion in Continuous Control
标题:误差放大限制连续控制中的ANN到SNN转换
链接:https://arxiv.org/abs/2601.21778
作者:Zijie Xu,Zihan Huang,Yiting Dong,Kang Chen,Wenxuan Liu,Zhaofei Yu
【19】Differentiable Knapsack and Top-k Operators via Dynamic Programming
标题:基于动态规划的可微背包与Top-k算子
链接:https://arxiv.org/abs/2601.21775
作者:Germain Vivier-Ardisson,Michaël E. Sander,Axel Parmentier,Mathieu Blondel
【20】FISMO: Fisher-Structured Momentum-Orthogonalized Optimizer
标题:FISMO:Fisher结构化动量正交化优化器
链接:https://arxiv.org/abs/2601.21750
作者:Chenrui Xu,Wenjing Yan,Ying-Jun Angela Zhang
【21】Why Adam Works Better with $β_1 = β_2$: The Missing Gradient Scale Invariance Principle
标题:为什么Adam在$β_1 = β_2$时效果更好:缺失的梯度尺度不变性原则
链接:https://arxiv.org/abs/2601.21739
作者:Alberto Fernández-Hernández,Cristian Pérez-Corral,Jose I. Mestre,Manuel F. Dolz,Enrique S. Quintana-Ortí
备注:23 pages, 8 figures. Preprint
【22】Mixed-Precision Training and Compilation for RRAM-based Computing-in-Memory Accelerators
标题:基于RRAM的存内计算加速器的混合精度训练与编译
链接:https://arxiv.org/abs/2601.21737
作者:Rebecca Pelke,Joel Klein,Jose Cubero-Cascante,Nils Bosbach,Jan Moritz Joseph,Rainer Leupers
【23】LoRA and Privacy: When Random Projections Help (and When They Don't)
标题:LoRA与隐私:随机投影何时有帮助(以及何时没有)
链接:https://arxiv.org/abs/2601.21719
作者:Yaxi Hu,Johanna Düngler,Bernhard Schölkopf,Amartya Sanyal
【24】Beyond Forgetting: Machine Unlearning Elicits Controllable Side Behaviors and Capabilities
标题:超越遗忘:机器遗忘引发可控的副行为与能力
链接:https://arxiv.org/abs/2601.21702
作者:Tien Dang,The-Hai Nguyen,Dinh Mai Phuong,Nguyen Minh Phuong,Hoang Thanh-Tung,Le-Minh Nguyen,Naoya Inoue
备注:21 pages, 11 tables, 12 figures
【25】Do Not Waste Your Rollouts: Recycling Search Experience for Efficient Test-Time Scaling
标题:不要浪费你的rollouts:回收搜索经验以实现高效的测试时扩展
链接:https://arxiv.org/abs/2601.21684
作者:Xinglin Wang,Jiayi Shi,Shaoxiong Feng,Peiwen Yuan,Yiwei Li,Yueqi Zhang,Chuyi Tan,Ji Zhang,Boyuan Pan,Yao Hu,Kan Li
备注:preprint
【26】SENDAI: A Hierarchical Sparse-measurement, EfficieNt Data AssImilation Framework
标题:SENDAI:一种分层稀疏测量的高效数据同化框架
链接:https://arxiv.org/abs/2601.21664
作者:Xingyue Zhang,Yuxuan Bao,Mars Liyao Gao,J. Nathan Kutz
【27】Generative Design of Ship Propellers using Conditional Flow Matching
标题:基于条件流匹配的船舶螺旋桨生成式设计
链接:https://arxiv.org/abs/2601.21637
作者:Patrick Kruger,Rafael Diaz,Simon Hauschulz,Stefan Harries,Hanno Gottschalk
备注:19 pages, 13 figures, 3 tables
【28】Sampling-Free Privacy Accounting for Matrix Mechanisms under Random Allocation
标题:随机分配下矩阵机制的免采样隐私会计
链接:https://arxiv.org/abs/2601.21636
作者:Jan Schuchardt,Nikita Kalinin
【29】HeRo-Q: A General Framework for Stable Low Bit Quantization via Hessian Conditioning
标题:HeRo-Q:通过Hessian条件化实现稳定低比特量化的通用框架
链接:https://arxiv.org/abs/2601.21626
作者:Jinhao Zhang,Yunquan Zhang,Zicheng Yan,Boyang Zhang,Jun Sun,Daning Cheng
【30】Breaking the Overscaling Curse: Thinking Parallelism Before Parallel Thinking
标题:打破过度扩展诅咒:在并行思考之前思考并行性
链接:https://arxiv.org/abs/2601.21619
作者:Yiming Wang,Zhuosheng Zhang,Rui Wang
【31】Age Matters: Analyzing Age-Related Discussions in App Reviews
标题:年龄很重要:分析应用程序评论中与年龄相关的讨论
链接:https://arxiv.org/abs/2601.21605
作者:Shashiwadana Nirmania,Garima Sharma,Hourieh Khalajzadeh,Mojtaba Shahin
【32】Dynamics Reveals Structure: Challenging the Linear Propagation Assumption
标题:动力学揭示结构:挑战线性传播假设
链接:https://arxiv.org/abs/2601.21601
作者:Hoyeon Chang,Bálint Mucsányi,Seong Joon Oh
【33】KromHC: Manifold-Constrained Hyper-Connections with Kronecker-Product Residual Matrices
标题:KromHC:具有Kronecker积残差矩阵的流形约束超连接
链接:https://arxiv.org/abs/2601.21579
作者:Wuyang Zhou,Yuxuan Gu,Giorgos Iacovides,Danilo Mandic
【34】Shaping capabilities with token-level data filtering
标题:通过词元级数据过滤塑造能力
链接:https://arxiv.org/abs/2601.21571
作者:Neil Rathi,Alec Radford
备注:33 pages, 28 figures
【35】Bridging Functional and Representational Similarity via Usable Information
标题:通过可用信息弥合功能相似性和表示相似性
链接:https://arxiv.org/abs/2601.21568
作者:Antonio Almudévar,Alfonso Ortega
【36】FlexCausal: Flexible Causal Disentanglement via Structural Flow Priors and Manifold-Aware Interventions
标题:FlexCausal:通过结构流先验和流形感知干预实现灵活的因果解纠缠
链接:https://arxiv.org/abs/2601.21567
作者:Yutao Jin,Yuang Tao,Junyong Zhai
【37】ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment
标题:ETS:用于免训练RL对齐的能量引导测试时扩展
链接:https://arxiv.org/abs/2601.21484
作者:Xiuyu Li,Jinkai Zhang,Mingyang Yi,Yu Li,Longqiang Wang,Yue Wang,Ju Fan
【38】L$^3$: Large Lookup Layers
标题:L$^3$:大型查找层
链接:https://arxiv.org/abs/2601.21461
作者:Albert Tseng,Christopher De Sa
备注:Preprint
【39】Perceptrons and localization of attention's mean-field landscape
标题:感知器和注意力平均场景观的局部化
链接:https://arxiv.org/abs/2601.21366
作者:Antonio Álvarez-López,Borjan Geshkovski,Domènec Ruiz-Balet
【40】The Compliance Paradox: Semantic-Instruction Decoupling in Automated Academic Code Evaluation
标题:合规悖论:自动学术代码评估中的语义与指令脱钩
链接:https://arxiv.org/abs/2601.21360
作者:Devanshu Sahoo,Manish Prasad,Vasudev Majhi,Arjun Neekhra,Yash Sinha,Murari Mandal,Vinay Chamola,Dhruv Kumar
【41】Expected Improvement via Gradient Norms
标题:基于梯度范数的预期改进
链接:https://arxiv.org/abs/2601.21357
作者:Joshua Hang Sai Ip,Georgios Makrygiorgos,Ali Mesbah
【42】L2R: Low-Rank and Lipschitz-Controlled Routing for Mixture-of-Experts
标题:L2R:用于混合专家的低秩且Lipschitz受控的路由
链接:https://arxiv.org/abs/2601.21349
作者:Minghao Yang,Ren Togo,Guang Li,Takahiro Ogawa,Miki Haseyama
【43】Position: Certifiable State Integrity in Cyber-Physical Systems -- Why Modular Sovereignty Solves the Plasticity-Stability Paradox
标题:立场:信息物理系统中可认证的状态完整性——为什么模块化主权解决了可塑性-稳定性悖论
链接:https://arxiv.org/abs/2601.21249
作者:Enzo Nicolás Spotorno,Antônio Augusto Medeiros Fröhlich
备注:14 pages, (8 main text, 6 references and appendices), 2 figures
【44】Temporal Context and Architecture: A Benchmark for Naturalistic EEG Decoding
标题:时间上下文与架构:自然情境脑电解码的基准
链接:https://arxiv.org/abs/2601.21215
【45】ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling
标题:ZipMoE:通过无损压缩和缓存亲和力调度提供高效的设备上MoE服务
链接:https://arxiv.org/abs/2601.21198
作者:Yuchen Yang,Yaru Zhao,Pu Yang,Shaowei Wang,Zhi-Hua Zhou
【46】Rethinking Refinement: Correcting Generative Bias without Noise Injection
标题:重新思考细化:在不注入噪音的情况下纠正生成偏差
链接:https://arxiv.org/abs/2601.21182
【47】Efficient Simple Regret Algorithms for Stochastic Contextual Bandits
标题:随机上下文老虎机的高效简单遗憾算法
链接:https://arxiv.org/abs/2601.21167
作者:Shuai Liu,Alireza Bakhtiari,Alex Ayoub,Botao Hao,Csaba Szepesvári
【48】FrontierScience: Evaluating AI's Ability to Perform Expert-Level Scientific Tasks
标题:FrontierScience:评估人工智能执行专家级科学任务的能力
链接:https://arxiv.org/abs/2601.21165
作者:Miles Wang,Robi Lin,Kat Hu,Joy Jiao,Neil Chowdhury,Ethan Chang,Tejal Patwardhan
【49】Parametric Hyperbolic Conservation Laws: A Unified Framework for Conservation, Entropy Stability, and Hyperbolicity
标题:参数化双曲守恒律:守恒性、熵稳定性与双曲性的统一框架
链接:https://arxiv.org/abs/2601.21080
作者:Lizuo Liu,Lu Zhang,Anne Gelb
备注:arXiv admin note: text overlap with arXiv:2507.01795
【50】Textual Equilibrium Propagation for Deep Compound AI Systems
标题:深度复合人工智能系统的文本平衡传播
链接:https://arxiv.org/abs/2601.21064
作者:Minghui Chen,Wenlong Deng,James Zou,Han Yu,Xiaoxiao Li
备注:Accepted to ICLR 2026
【51】Solver-in-the-Loop: MDP-Based Benchmarks for Self-Correction and Behavioral Rationality in Operations Research
标题:求解器在环:基于MDP的运筹学自我纠正与行为理性基准
链接:https://arxiv.org/abs/2601.21008
作者:Ruicheng Ao,David Simchi-Levi,Xinshang Wang
备注:55 pages, 5 figures
【52】MADE: Benchmark Environments for Closed-Loop Materials Discovery
标题:MADE:闭环材料发现的基准环境
链接:https://arxiv.org/abs/2601.20996
作者:Shreshth A Malik,Tiarnan Doherty,Panagiotis Tigas,Muhammed Razzak,Stephen J. Roberts,Aron Walsh,Yarin Gal
【53】Monotone Optimisation with Learned Projections
标题:基于学习投影的单调优化
链接:https://arxiv.org/abs/2601.20983
作者:Ahmed Rashwan,Keith Briggs,Chris Budd,Lisa Kreusser
【54】Leveraging Generative AI for Enhancing Domain-Driven Software Design
标题:利用生成性人工智能增强领域驱动软件设计
链接:https://arxiv.org/abs/2601.20909
作者:Götz-Henrik Wiegand,Filip Stepniak,Patrick Baier
备注:Part of the Proceedings of the Upper-Rhine Artificial Intelligence Symposium 2024
【55】Finetune-Informed Pretraining Boosts Downstream Performance
标题:微调感知的预训练提升下游性能
链接:https://arxiv.org/abs/2601.20884
作者:Atik Faysal,Mohammad Rostami,Reihaneh Gh. Roshan,Nikhil Muralidhar,Huaxia Wang
【56】MEIDNet: Multimodal generative AI framework for inverse materials design
标题:MEIDNet:用于逆向材料设计的多模态生成式人工智能框架
链接:https://arxiv.org/abs/2601.22009
作者:Anand Babu,Rogério Almeida Gouvêa,Pierre Vandergheynst,Gian-Marco Rignanese
【57】Efficient Stochastic Optimisation via Sequential Monte Carlo
标题:通过顺序蒙特卡罗进行高效随机优化
链接:https://arxiv.org/abs/2601.22003
作者:James Cuin,Davide Carbone,Yanbo Tang,O. Deniz Akyildiz
【58】Batched First-Order Methods for Parallel LP Solving in MIP
标题:MIP中并行LP求解的批量一阶方法
链接:https://arxiv.org/abs/2601.21990
作者:Nicolas Blin,Stefano Gualandi,Christopher Maes,Andrea Lodi,Bartolomeo Stellato
备注:15 pages, 4 figures, 4 tables
【59】Diffusion Path Samplers via Sequential Monte Carlo
标题:通过顺序蒙特卡罗的扩散路径采样器
链接:https://arxiv.org/abs/2601.21951
作者:James Matthew Young,Paula Cordero-Encinar,Sebastian Reich,Andrew Duncan,O. Deniz Akyildiz
【60】On Approximate Computation of Critical Points
标题:关于临界点的近似计算
链接:https://arxiv.org/abs/2601.21917
作者:Amir Ali Ahmadi,Georgina Hall
【61】Manifold constrained steepest descent
标题:流形约束最速下降
链接:https://arxiv.org/abs/2601.21487
作者:Kaiwei Yang,Lexiao Lai
备注:23 pages, 7 figures, and 5 tables
【62】Bulk-Calibrated Credal Ambiguity Sets: Fast, Tractable Decision Making under Out-of-Sample Contamination
标题:批量校准的Credal模糊集:样本外污染下快速、可处理的决策
链接:https://arxiv.org/abs/2601.21324
作者:Mengqi Chen,Thomas B. Berrett,Theodoros Damoulas,Michele Caprio
【63】Solving the Offline and Online Min-Max Problem of Non-smooth Submodular-Concave Functions: A Zeroth-Order Approach
标题:求解非光滑次模-凹函数的离线与在线Min-Max问题:一种零阶方法
链接:https://arxiv.org/abs/2601.21243
作者:Amir Ali Farzin,Yuen-Man Pun,Philipp Braun,Tyler Summers,Iman Shames
【64】Multilevel and Sequential Monte Carlo for Training-Free Diffusion Guidance
标题:用于免训练扩散引导的多层与序贯蒙特卡罗方法
链接:https://arxiv.org/abs/2601.21104
作者:Aidan Gleich,Scott C. Schmidler
【65】Better without U: Impact of Selective Hubbard U Correction on Foundational MLIPs
标题:没有U更好:选择性哈伯德U修正对基础MLIP的影响
链接:https://arxiv.org/abs/2601.21056
作者:Thomas Warford,Fabian L. Thiemann,Gábor Csányi
【66】Diffusion-based Annealed Boltzmann Generators : benefits, pitfalls and hopes
标题:基于扩散的退火玻尔兹曼生成器:优点、陷阱与希望
链接:https://arxiv.org/abs/2601.21026
作者:Louis Grenioux,Maxence Noble
【67】The augmented NLP bound for maximum-entropy remote sampling
标题:最大熵远程采样的增广NLP界
链接:https://arxiv.org/abs/2601.20970
作者:Gabriel Ponte,Marcia Fampa,Jon Lee
【68】Parametric Quantum State Tomography with HyperRBMs
标题:使用HyperRBM的参数量子状态断层扫描
链接:https://arxiv.org/abs/2601.20950
作者:Simon Tonner,Viet T. Tran,Richard Kueng
【69】Spatial Heterogeneity in Climate Risk and Human Flourishing: An Exploration with Generative AI
标题:气候风险与人类繁荣的空间异质性:基于生成式人工智能的探索
链接:https://arxiv.org/abs/2601.20880
作者:Stefano Maria Iacus,Haodong Qi,Devika Jain
机器翻译由腾讯交互翻译提供,仅供参考
点击“阅读原文”获取带摘要的学术速递