
Machine Learning Academic Digest [1.9]

arXiv Daily Academic Digest • 2 months ago • 258 views

Click "Read the original" to visit arxivdaily.com, covering CS, Physics, Math, Economics, Statistics, Finance, Biology, and Electrical Engineering, with search, bookmarking, and more!


cs.LG: 151 papers today


Large Models (24 papers)

【1】Cutting AI Research Costs: How Task-Aware Compression Makes Large Language Model Agents Affordable
Link: https://arxiv.org/abs/2601.05191

Authors: Zuhair Ahmed Khan Taha, Mohammed Mudassir Uddin, Shahnawaz Alam
Abstract: When researchers deploy large language models for autonomous tasks like reviewing literature or generating hypotheses, the computational bills add up quickly. A single research session using a 70-billion-parameter model can cost around $127 in cloud fees, putting these tools out of reach for many academic labs. We developed AgentCompress to tackle this problem head-on. The core idea came from a simple observation during our own work: writing a novel hypothesis clearly demands more from the model than reformatting a bibliography. Why should both tasks run at full precision? Our system uses a small neural network to gauge how hard each incoming task will be, based only on its opening words, then routes it to a suitably compressed model variant. The decision happens in under a millisecond. Testing across 500 research workflows in four scientific fields, we cut compute costs by 68.3% while keeping 96.2% of the original success rate. For labs watching their budgets, this could mean the difference between running experiments and sitting on the sidelines.
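The routing idea in the abstract (score a task's difficulty from its opening words, then pick a compression level) can be sketched as a toy. The cue words, thresholds, and variant names below are invented for illustration; the paper uses a small trained classifier, not a keyword heuristic.

```python
# Toy sketch of task-aware routing: a cheap difficulty score computed from
# a prompt's opening words selects a (hypothetical) compressed model variant.

HARD_CUES = {"hypothesis", "hypothesize", "derive", "prove", "design"}
EASY_CUES = {"reformat", "list", "copy", "sort"}

def difficulty_score(prompt: str, first_n: int = 8) -> float:
    """Cheap difficulty estimate from only the first few words."""
    words = prompt.lower().split()[:first_n]
    score = 0.5
    if any(w in HARD_CUES for w in words):
        score += 0.4
    if any(w in EASY_CUES for w in words):
        score -= 0.4
    return score

def route(prompt: str) -> str:
    """Map the score to a (hypothetical) compressed variant of one model."""
    s = difficulty_score(prompt)
    if s >= 0.7:
        return "70b-fp16"   # full precision for creative/hard tasks
    if s >= 0.4:
        return "70b-int8"   # moderate compression
    return "70b-int4"       # heavy compression for mechanical tasks

print(route("Generate a hypothesis about protein folding"))  # 70b-fp16
print(route("Reformat this bibliography into BibTeX"))       # 70b-int4
```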


【2】Token-Level LLM Collaboration via FusionRoute
Link: https://arxiv.org/abs/2601.05106

Authors: Nuoya Xiong, Yuhang Zhou, Hanqing Zeng, Zhaorun Chen, Furong Huang, Shuchao Bi, Lizhu Zhang, Zhuokai Zhao
Comments: 25 pages
Abstract: Large language models (LLMs) exhibit strengths across diverse domains. However, achieving strong performance across these domains with a single general-purpose model typically requires scaling to sizes that are prohibitively expensive to train and deploy. On the other hand, while smaller domain-specialized models are much more efficient, they struggle to generalize beyond their training distributions. To address this dilemma, we propose FusionRoute, a robust and effective token-level multi-LLM collaboration framework in which a lightweight router simultaneously (i) selects the most suitable expert at each decoding step and (ii) contributes a complementary logit that refines or corrects the selected expert's next-token distribution via logit addition. Unlike existing token-level collaboration methods, which rely solely on fixed expert outputs, we provide a theoretical analysis showing that pure expert-only routing is fundamentally limited: unless strong global coverage assumptions hold, it cannot in general realize the optimal decoding policy. By augmenting expert selection with a trainable complementary generator, FusionRoute expands the effective policy class and enables recovery of optimal value functions under mild conditions. Empirically, across both the Llama-3 and Gemma-2 families and diverse benchmarks spanning mathematical reasoning, code generation, and instruction following, FusionRoute outperforms sequence- and token-level collaboration, model merging, and direct fine-tuning, while remaining competitive with domain experts on their respective tasks.
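The logit-addition step at the heart of FusionRoute is simple to illustrate: the router's complementary logits are added to the selected expert's logits before the softmax. The vocabulary, logit values, and mixing weight below are invented, and this toy omits the router's expert-selection half.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def fused_next_token_dist(expert_logits, router_logits, alpha=1.0):
    """Add the router's complementary logits to the selected expert's
    logits, then normalize. alpha is a hypothetical mixing weight."""
    fused = [e + alpha * r for e, r in zip(expert_logits, router_logits)]
    return softmax(fused)

# Invented 3-token vocabulary: the expert alone prefers token 0, but the
# router's complementary logits shift the fused distribution to token 2.
expert = [2.0, 0.5, 1.0]
router = [-1.0, 0.0, 2.5]
dist = fused_next_token_dist(expert, router)
print(max(range(3), key=lambda i: dist[i]))  # 2
```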


【3】Compositional Steering of Large Language Models with Steering Tokens
Link: https://arxiv.org/abs/2601.05062

Authors: Gorjan Radevski, Kiril Gashteovski, Giwon Hong, Carolin Lawrence, Goran Glavaš
Abstract: Deploying LLMs in real-world applications requires controllable output that satisfies multiple desiderata at the same time. While existing work extensively addresses LLM steering for a single behavior, compositional steering -- i.e., steering LLMs simultaneously towards multiple behaviors -- remains an underexplored problem. In this work, we propose compositional steering tokens for multi-behavior steering. We first embed individual behaviors, expressed as natural language instructions, into dedicated tokens via self-distillation. Contrary to most prior work, which operates in the activation space, our behavior steers live in the space of input tokens, enabling more effective zero-shot composition. We then train a dedicated composition token on pairs of behaviors and show that it successfully captures the notion of composition: it generalizes well to unseen compositions, including those with unseen behaviors as well as those with an unseen number of behaviors. Our experiments across different LLM architectures show that steering tokens lead to superior multi-behavior control compared to competing approaches (instructions, activation steering, and LoRA merging). Moreover, we show that steering tokens complement natural language instructions, with their combination resulting in further gains.


【4】From Understanding to Engagement: Personalized pharmacy Video Clips via Vision Language Models (VLMs)
Link: https://arxiv.org/abs/2601.05059

Authors: Suyash Mishra, Qiang Li, Srikanth Patil, Anubhav Girdhar
Comments: Contributed original research to a top-tier conference in VLMs; currently undergoing peer review
Abstract: Vision Language Models (VLMs) are poised to revolutionize the digital transformation of the pharmaceutical industry by enabling intelligent, scalable, and automated multi-modality content processing. Traditional manual annotation of heterogeneous data modalities (text, images, video, audio, and web links) is prone to inconsistencies, quality degradation, and inefficiencies in content utilization. The sheer volume of long video and audio data (e.g., long clinical trial interviews and educational seminars) further exacerbates these challenges. Here, we introduce a domain-adapted video-to-video-clip generation framework that integrates Audio Language Models (ALMs) and Vision Language Models (VLMs) to produce highlight clips. Our contributions are threefold: (i) a reproducible Cut & Merge algorithm with fade-in/out and timestamp normalization, ensuring smooth transitions and audio/visual alignment; (ii) a personalization mechanism based on role definition and prompt injection for tailored outputs (marketing, training, regulatory); (iii) a cost-efficient end-to-end pipeline strategy balancing ALM/VLM-enhanced processing. Evaluations on the Video-MME benchmark (900 videos) and our proprietary dataset of 16,159 pharmacy videos across 14 disease areas demonstrate 3-4x speedup, 4x cost reduction, and competitive clip quality. Beyond efficiency gains, we also report that our method improved clip coherence scores (0.348) and informativeness scores (0.721) over state-of-the-art VLM baselines (e.g., Gemini 2.5 Pro), highlighting the potential of transparent, custom-extractive, and compliance-supporting video summarization for life sciences.


【5】Challenges and Research Directions for Large Language Model Inference Hardware
Link: https://arxiv.org/abs/2601.05047

Authors: Xiaoyu Ma, David Patterson
Comments: Accepted for publication by IEEE Computer, 2026
Abstract: Large Language Model (LLM) inference is hard. The autoregressive Decode phase of the underlying Transformer model makes LLM inference fundamentally different from training. Exacerbated by recent AI trends, the primary challenges are memory and interconnect rather than compute. To address these challenges, we highlight four architecture research opportunities: High Bandwidth Flash for 10X memory capacity with HBM-like bandwidth; Processing-Near-Memory and 3D memory-logic stacking for high memory bandwidth; and low-latency interconnect to speed up communication. While our focus is datacenter AI, we also review their applicability to mobile devices.


【6】Scaling Vision Language Models for Pharmaceutical Long Form Video Reasoning on Industrial GenAI Platform
Link: https://arxiv.org/abs/2601.04891

Authors: Suyash Mishra, Qiang Li, Srikanth Patil, Satyanarayan Pati, Baddu Narendra
Comments: Submitted to the Industry Track of a top-tier conference; currently under peer review
Abstract: Vision Language Models (VLMs) have shown strong performance on multimodal reasoning tasks, yet most evaluations focus on short videos and assume unconstrained computational resources. In industrial settings such as pharmaceutical content understanding, practitioners must process long-form videos under strict GPU, latency, and cost constraints, where many existing approaches fail to scale. In this work, we present an industrial GenAI framework that processes over 200,000 PDFs, 25,326 videos across eight formats (e.g., MP4, M4V), and 888 multilingual audio files in more than 20 languages. Our study makes three contributions: (i) an industrial large-scale architecture for multimodal reasoning in pharmaceutical domains; (ii) an empirical analysis of over 40 VLMs on two leading benchmarks (Video-MME and MMBench) and a proprietary dataset of 25,326 videos across 14 disease areas; and (iii) four findings relevant to long-form video reasoning: the role of multimodality, attention-mechanism trade-offs, temporal reasoning limits, and the challenges of video splitting under GPU constraints. Results show 3-8x efficiency gains with SDPA attention on commodity GPUs, multimodality improving up to 8 of 12 task domains (especially length-dependent tasks), and clear bottlenecks in temporal alignment and keyframe detection across open- and closed-source VLMs. Rather than proposing a new "A+B" model, this paper characterizes the practical limits, trade-offs, and failure patterns of current VLMs under realistic deployment constraints, and provides actionable guidance for researchers and practitioners designing scalable multimodal systems for long-form video understanding in industrial domains.


【7】Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers
Link: https://arxiv.org/abs/2601.04890

Authors: Maksim Velikanov, Ilyas Chahed, Jingwei Zuo, Dhia Eddine Rhaiem, Younes Belkada, Hakim Hacid
Abstract: Applying weight decay (WD) to matrix layers is standard practice in large-language-model pretraining. Prior work suggests that stochastic gradient noise induces a Brownian-like expansion of the weight matrices W, whose growth is counteracted by WD, leading to a WD-noise equilibrium with a certain weight norm ||W||. In this work, we view the equilibrium norm as a harmful artifact of the training procedure, and address it by introducing learnable multipliers to learn the optimal scale. First, we attach a learnable scalar multiplier to W and confirm that the WD-noise equilibrium norm is suboptimal: the learned scale adapts to data and improves performance. We then argue that individual row and column norms are similarly constrained, and free their scale by introducing learnable per-row and per-column multipliers. Our method can be viewed as a learnable, more expressive generalization of muP multipliers. It outperforms a well-tuned muP baseline, reduces the computational overhead of multiplier tuning, and surfaces practical questions such as forward-pass symmetries and the width-scaling of the learned multipliers. Finally, we validate learnable multipliers with both the Adam and Muon optimizers, where they yield downstream-evaluation improvements comparable to the improvement from switching from Adam to Muon.
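A minimal sketch of the multiplier parameterization, assuming the effective weight is the product of a scalar, per-row, and per-column multipliers with the base matrix; the paper's training procedure and initialization are not reproduced, and all values below are illustrative.

```python
def effective_weight(W, s, row_mult, col_mult):
    """W'_ij = s * r_i * c_j * W_ij: one learnable scalar plus per-row and
    per-column multipliers attached to a matrix layer, freeing its overall,
    row, and column scales."""
    return [[s * row_mult[i] * col_mult[j] * W[i][j]
             for j in range(len(W[0]))]
            for i in range(len(W))]

W = [[1.0, 2.0],
     [3.0, 4.0]]
# With all multipliers at 1.0 the layer is unchanged; during training the
# multipliers would be updated by gradient descent alongside W.
assert effective_weight(W, 1.0, [1.0, 1.0], [1.0, 1.0]) == W
print(effective_weight(W, 2.0, [1.0, 0.5], [1.0, 1.0]))  # [[2.0, 4.0], [3.0, 4.0]]
```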


【8】CuMA: Aligning LLMs with Sparse Cultural Values via Demographic-Aware Mixture of Adapters
Link: https://arxiv.org/abs/2601.04885

Authors: Ao Sun, Xiaoyu Wang, Zhe Tan, Yu Li, Jiachen Zhu, Shu Su, Yuheng Jia
Abstract: As Large Language Models (LLMs) serve a global audience, alignment must transition from enforcing universal consensus to respecting cultural pluralism. We demonstrate that dense models, when forced to fit conflicting value distributions, suffer from Mean Collapse, converging to a generic average that fails to represent diverse groups. We attribute this to Cultural Sparsity, where gradient interference prevents dense parameters from spanning distinct cultural modes. To resolve this, we propose CuMA (Cultural Mixture of Adapters), a framework that frames alignment as a conditional capacity separation problem. By incorporating demographic-aware routing, CuMA internalizes a Latent Cultural Topology to explicitly disentangle conflicting gradients into specialized expert subspaces. Extensive evaluations on WorldValuesBench, Community Alignment, and PRISM demonstrate that CuMA achieves state-of-the-art performance, significantly outperforming both dense baselines and semantic-only MoEs. Crucially, our analysis confirms that CuMA effectively mitigates mean collapse, preserving cultural diversity. Our code is available at https://github.com/Throll/CuMA.


【9】MPM-LLM4DSE: Reaching the Pareto Frontier in HLS with Multimodal Learning and LLM-Driven Exploration
Link: https://arxiv.org/abs/2601.04801

Authors: Lei Xu, Shanshan Wang, Chenglong Xiao
Abstract: High-Level Synthesis (HLS) design space exploration (DSE) seeks Pareto-optimal designs within expansive pragma configuration spaces. To accelerate HLS DSE, graph neural networks (GNNs) are commonly employed as surrogates for HLS tools to predict quality-of-results (QoR) metrics, while multi-objective optimization algorithms expedite the exploration. However, GNN-based prediction methods may not fully capture the rich semantic features inherent in behavioral descriptions, and conventional multi-objective optimization algorithms often do not explicitly account for domain-specific knowledge regarding how pragma directives influence QoR. To address these limitations, this paper proposes the MPM-LLM4DSE framework, which incorporates a multimodal prediction model (MPM) that simultaneously fuses features from behavioral descriptions and control and data flow graphs. Furthermore, the framework employs a large language model (LLM) as an optimizer, accompanied by a tailored prompt engineering methodology. This methodology incorporates pragma impact analysis on QoR to guide the LLM in generating high-quality configurations (LLM4DSE). Experimental results demonstrate that our multimodal predictive model significantly outperforms the state-of-the-art work ProgSG by up to 10.25x. Furthermore, in DSE tasks, the proposed LLM4DSE achieves an average performance gain of 39.90% over prior methods, validating the effectiveness of our prompting methodology. Code and models are available at https://github.com/wslcccc/MPM-LLM4DSE.


【10】Differential syntactic and semantic encoding in LLMs
Link: https://arxiv.org/abs/2601.04765

Authors: Santiago Acevedo, Alessandro Laio, Marco Baroni
Abstract: We study how syntactic and semantic information is encoded in the inner-layer representations of Large Language Models (LLMs), focusing on the very large DeepSeek-V3. We find that, by averaging hidden-representation vectors of sentences sharing syntactic structure or meaning, we obtain vectors that capture a significant proportion of the syntactic and semantic information contained in the representations. In particular, subtracting these syntactic and semantic "centroids" from sentence vectors strongly affects their similarity with syntactically and semantically matched sentences, respectively, suggesting that syntax and semantics are, at least partially, linearly encoded. We also find that the cross-layer encoding profiles of syntax and semantics are different, and that the two signals can to some extent be decoupled, suggesting differential encoding of these two types of linguistic information in LLM representations.
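The centroid-subtraction probe can be illustrated with toy vectors: average the representations of sentences that share a structure, subtract that centroid, and compare similarities before and after. The 3-d "hidden vectors" below are invented for illustration.

```python
import math

def centroid(vectors):
    """Average a list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]

def subtract(a, b):
    return [x - y for x, y in zip(a, b)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# Invented 3-d "hidden vectors" for sentences sharing one syntactic frame.
same_syntax = [[1.0, 0.2, 0.1], [0.9, 0.1, 0.3], [1.1, 0.3, 0.2]]
c = centroid(same_syntax)

s1, s2 = same_syntax[0], same_syntax[1]
before = cosine(s1, s2)
after = cosine(subtract(s1, c), subtract(s2, c))
print(before > after)  # True: removing the shared component lowers similarity
```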


【11】GPU-Accelerated INT8 Quantization for KV Cache Compression in Large Language Models
Link: https://arxiv.org/abs/2601.04719

Authors: Maanas Taneja, Purab Shingvi
Abstract: The key-value (KV) cache in large language models presents a significant memory bottleneck during inference, growing linearly with sequence length and often exceeding the memory footprint of the model weights themselves. We implement and evaluate GPU-accelerated INT8 quantization for KV cache compression, achieving 4x memory reduction with minimal accuracy degradation. We develop four CUDA kernel variants -- naive, tiled, coarsened, and vectorized -- and benchmark them across realistic workload sizes up to 1 billion elements. Our vectorized kernel achieves up to 1,694x speedup over CPU baselines while maintaining reconstruction error below 0.004 and attention score error below 0.1, even for 8K-dimensional heads. These results demonstrate that INT8 quantization provides a practical approach for reducing memory pressure in LLM inference, with negligible computational overhead (6-58 ms) and minimal impact on downstream model behavior.
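Symmetric INT8 quantization of the kind described (an int8 payload plus one float scale per tensor) is easy to sketch in pure Python; the paper's CUDA kernels, tiling, and vectorization are not reproduced here, and the toy values are invented.

```python
def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization: an int8 payload plus one
    float scale, giving roughly 4x memory reduction versus FP32."""
    scale = max(abs(v) for v in x) / 127.0
    q = [max(-128, min(127, round(v / scale))) for v in x]
    return q, scale

def dequantize_int8(q, scale):
    return [v * scale for v in q]

kv = [0.031, -0.254, 0.118, 0.007, -0.092]  # toy values from a KV cache
q, scale = quantize_int8(kv)
recon = dequantize_int8(q, scale)

max_err = max(abs(a - b) for a, b in zip(kv, recon))
print(max_err < scale)  # True: rounding error stays below one step
```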


【12】Prior-Informed Zeroth-Order Optimization with Adaptive Direction Alignment for Memory-Efficient LLM Fine-Tuning
Link: https://arxiv.org/abs/2601.04710

Authors: Feihu Jin, Shipeng Cen, Ying Tan
Comments: 12 pages, 6 figures
Abstract: Fine-tuning large language models (LLMs) has achieved remarkable success across various NLP tasks, but the substantial memory overhead during backpropagation remains a critical bottleneck, especially as model scales grow. Zeroth-order (ZO) optimization alleviates this issue by estimating gradients through forward passes and Gaussian sampling, avoiding the need for backpropagation. However, conventional ZO methods suffer from high variance in gradient estimation due to their reliance on random perturbations, leading to slow convergence and suboptimal performance. We propose a simple plug-and-play method that incorporates prior-informed perturbations to refine gradient estimation. Our method dynamically computes a guiding vector from Gaussian samples, which directs perturbations toward more informative directions, significantly accelerating convergence compared to standard ZO approaches. We further investigate a greedy perturbation strategy to explore the impact of prior knowledge on gradient estimation. Theoretically, we prove that our gradient estimator achieves stronger alignment with the true gradient direction, enhancing optimization efficiency. Extensive experiments across LLMs of varying scales and architectures demonstrate that our proposed method integrates seamlessly into existing optimization methods, delivering faster convergence and superior performance. Notably, on the OPT-13B model, our method outperforms traditional ZO optimization across all 11 benchmark tasks and surpasses gradient-based baselines on 9 of 11 tasks, establishing a robust balance between efficiency and accuracy.
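The standard two-point ZO estimator that this work builds on can be sketched as follows; only the Gaussian-sampling baseline is implemented, not the paper's prior-informed guiding vector, and the toy objective is invented.

```python
import random

def zo_gradient(f, theta, eps=1e-3, n_samples=200, seed=0):
    """Two-point zeroth-order estimate with Gaussian perturbations:
    g ~= mean_z [ (f(theta + eps*z) - f(theta - eps*z)) / (2*eps) * z ].
    Only forward evaluations of f are needed, no backpropagation."""
    rng = random.Random(seed)
    d = len(theta)
    g = [0.0] * d
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        tp = [t + eps * zi for t, zi in zip(theta, z)]
        tm = [t - eps * zi for t, zi in zip(theta, z)]
        coeff = (f(tp) - f(tm)) / (2 * eps)
        g = [gi + coeff * zi / n_samples for gi, zi in zip(g, z)]
    return g

# Quadratic toy objective f(theta) = sum(theta_i^2); true gradient is 2*theta.
f = lambda th: sum(t * t for t in th)
g = zo_gradient(f, [1.0, -2.0], n_samples=2000)
print(g)  # roughly [2.0, -4.0], noisy due to random perturbations
```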


【13】Do LLMs Benefit from User and Item Embeddings in Recommendation Tasks?
Link: https://arxiv.org/abs/2601.04690

Authors: Mir Rayat Imtiaz Hossain, Leo Feng, Leonid Sigal, Mohamed Osama Ahmed
Comments: Presented at the Multimodal Algorithmic Reasoning Workshop at NeurIPS 2025
Abstract: Large Language Models (LLMs) have emerged as promising recommendation systems, offering novel ways to model user preferences through generative approaches. However, many existing methods rely solely on text semantics or incorporate collaborative signals in a limited manner, typically using only user or item embeddings. These methods struggle to handle multiple item embeddings representing user history, reverting to textual semantics and neglecting richer collaborative information. In this work, we propose a simple yet effective solution that projects user and item embeddings, learned from collaborative filtering, into the LLM token space via separate lightweight projector modules. A finetuned LLM then conditions on these projected embeddings alongside textual tokens to generate recommendations. Preliminary results show that this design effectively leverages structured user-item interaction data, improves recommendation performance over text-only LLM baselines, and offers a practical path for bridging traditional recommendation systems with modern LLMs.
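The projection step can be sketched as a single linear map from the collaborative-filtering embedding dimension into the LLM's token-embedding dimension. The shapes and values below are invented; in the paper such projectors are small trained modules, not fixed matrices.

```python
def project(embedding, proj_matrix, bias):
    """Linear projector: maps a CF embedding (here dim 2) into a toy LLM
    token-embedding space (here dim 3)."""
    return [sum(proj_matrix[i][j] * embedding[j]
                for j in range(len(embedding))) + bias[i]
            for i in range(len(proj_matrix))]

user_emb = [0.5, -1.0]                      # hypothetical CF user embedding
P = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]    # hypothetical 3x2 projection
b = [0.0, 0.0, 0.1]
token_vec = project(user_emb, P, b)
print(token_vec)  # a dim-3 vector the LLM can consume alongside text tokens
```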


【14】Learning Dynamics in RL Post-Training for Language Models
Link: https://arxiv.org/abs/2601.04670

Authors: Akiyoshi Tomihari
Abstract: Reinforcement learning (RL) post-training is a critical stage in modern language model development, playing a key role in improving alignment and reasoning ability. However, several phenomena remain poorly understood, including the reduction in output diversity. To gain a broader understanding of RL post-training, we analyze its learning dynamics from a perspective that has been studied in supervised learning but remains underexplored in RL. We adopt an empirical neural tangent kernel (NTK) framework and decompose the NTK into two components to characterize how RL updates propagate across training samples. Our analysis reveals that limited variability in feature representations can cause RL updates to systematically increase model confidence, providing an explanation for the commonly observed reduction in output diversity after RL post-training. Furthermore, we show that effective learning in this regime depends on rapidly shaping the classifier, which directly affects the gradient component of the NTK. Motivated by these insights, we propose classifier-first reinforcement learning (CF-RL), a simple two-stage training strategy that prioritizes classifier updates before standard RL optimization. Experimental results validate our theoretical analysis by demonstrating increased model confidence and accelerated optimization under CF-RL. Additional analysis shows that the mechanism underlying CF-RL differs from that of linear-probing-then-fine-tuning in supervised learning. Overall, our study formalizes the learning dynamics of RL post-training and motivates further analysis and improvement.


【15】Not All Steps are Informative: On the Linearity of LLMs' RLVR Training
Link: https://arxiv.org/abs/2601.04537

Authors: Tianle Wang, Zhongyuan Wu, Shenghao Jin, Hao Xu, Wei Chen, Ning Miao
Comments: Preprint
Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a central component of large language model (LLM) post-training. Unlike supervised fine-tuning (SFT), RLVR lets an LLM generate multiple candidate solutions and reinforces those that lead to a verifiably correct final answer. However, in practice, RLVR often requires thousands of training steps to reach strong performance, incurring substantial computation largely attributed to prolonged exploration. In this work, we make a surprising observation: during RLVR, LLMs evolve in a strongly linear manner. Specifically, both model weights and model output log-probabilities exhibit strong linear correlations with RL training steps. This suggests that RLVR predominantly amplifies trends that emerge early in training, rather than continuously discovering new behaviors throughout the entire optimization trajectory. Motivated by this linearity, we investigate whether future model states can be predicted from intermediate checkpoints via extrapolation, avoiding continued expensive training. We show that Weight Extrapolation produces models with performance comparable to standard RL training while requiring significantly less computation. Moreover, Logits Extrapolation consistently outperforms continued RL training on all four benchmarks by extrapolating beyond the step range where RL training remains stable.
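Weight extrapolation as described reduces to per-parameter linear extrapolation from two checkpoints; the checkpoint values and step numbers below are invented, and the paper's exact recipe may differ.

```python
def extrapolate_weights(w_early, w_late, step_early, step_late, step_target):
    """Per-parameter linear extrapolation from two checkpoints, exploiting
    the observed near-linear evolution of weights over RL training steps."""
    frac = (step_target - step_early) / (step_late - step_early)
    return [we + (wl - we) * frac for we, wl in zip(w_early, w_late)]

# Invented checkpoints at steps 100 and 200, extrapolated to step 400.
w100 = [0.10, -0.40, 0.25]
w200 = [0.14, -0.48, 0.27]
w400 = extrapolate_weights(w100, w200, 100, 200, 400)
print(w400)  # approximately [0.22, -0.64, 0.31]
```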


【16】IGenBench: Benchmarking the Reliability of Text-to-Infographic Generation
Link: https://arxiv.org/abs/2601.04498

Authors: Yinghao Tang, Xueding Liu, Boyuan Zhang, Tingfeng Lan, Yupeng Xie, Jiale Lao, Yiyao Wang, Haoxuan Li, Tingting Gao, Bo Pan, Luoxuan Weng, Xiuqi Huang, Minfeng Zhu, Yingchaojie Feng, Yuyu Luo, Wei Chen
Abstract: Infographics are composite visual artifacts that combine data visualizations with textual and illustrative elements to communicate information. While recent text-to-image (T2I) models can generate aesthetically appealing images, their reliability in generating infographics remains unclear. Generated infographics may appear correct at first glance but contain easily overlooked issues, such as distorted data encoding or incorrect textual content. We present IGENBENCH, the first benchmark for evaluating the reliability of text-to-infographic generation, comprising 600 curated test cases spanning 30 infographic types. We design an automated evaluation framework that decomposes reliability verification into atomic yes/no questions based on a taxonomy of 10 question types. We employ multimodal large language models (MLLMs) to verify each question, yielding question-level accuracy (Q-ACC) and infographic-level accuracy (I-ACC). We comprehensively evaluate 10 state-of-the-art T2I models on IGENBENCH. Our systematic analysis reveals key insights for future model development: (i) a three-tier performance hierarchy, with the top model achieving a Q-ACC of 0.90 but an I-ACC of only 0.49; (ii) data-related dimensions emerging as universal bottlenecks (e.g., Data Completeness: 0.21); and (iii) the challenge of achieving end-to-end correctness across all models. We release IGENBENCH at https://igen-bench.vercel.app/.


【17】Using Large Language Models to Detect Socially Shared Regulation of Collaborative Learning
标题:使用大型语言模型检测协作学习的社会共享调节
链接:https://arxiv.org/abs/2601.04458

作者:Jiayi Zhang,Conrad Borchers,Clayton Cohn,Namrata Srivastava,Caitlin Snyder,Siyuan Guo,Ashwin T S,Naveeduddin Mohammed,Haley Noh,Gautam Biswas
备注:Short research paper accepted at Learning Analytics and Knowledge (LAK '26)
摘要:学习分析领域在自动检测多模态数据中的复杂学习过程方面取得了显著进展。然而,大多数进展都集中在个体化的问题解决上,而非协作式、开放式的问题解决;后者可能为行为预测同时带来便利条件(更丰富的数据)和挑战(低凝聚力)。在这里,我们扩展了预测模型,使用基于嵌入的方法在协作计算建模环境中自动检测学习的社会共享调节(SSRL)行为。我们利用大型语言模型(LLM)作为摘要工具,生成与系统日志对齐的学生对话的任务感知表示。这些摘要与纯文本嵌入、上下文丰富嵌入和日志派生特征相结合,用于训练预测模型。结果表明,纯文本嵌入在检测与执行或群体动态相关的SSRL行为(例如,任务外行为或请求帮助)方面往往表现更强。相比之下,上下文和多模态特征为规划和反思等构念提供了互补的好处。总的来说,我们的发现突出了基于嵌入的模型在扩展学习分析方面的前景:通过实现SSRL行为的可扩展检测,最终支持教师所重视的协作学习环境中的实时反馈和自适应支架。
摘要:The field of learning analytics has made notable strides in automating the detection of complex learning processes in multimodal data. However, most advancements have focused on individualized problem-solving instead of collaborative, open-ended problem-solving, which may offer both affordances (richer data) and challenges (low cohesion) to behavioral prediction. Here, we extend predictive models to automatically detect socially shared regulation of learning (SSRL) behaviors in collaborative computational modeling environments using embedding-based approaches. We leverage large language models (LLMs) as summarization tools to generate task-aware representations of student dialogue aligned with system logs. These summaries, combined with text-only embeddings, context-enriched embeddings, and log-derived features, were used to train predictive models. Results show that text-only embeddings often achieve stronger performance in detecting SSRL behaviors related to enactment or group dynamics (e.g., off-task behavior or requesting assistance). In contrast, contextual and multimodal features provide complementary benefits for constructs such as planning and reflection. Overall, our findings highlight the promise of embedding-based models for extending learning analytics by enabling scalable detection of SSRL behaviors, ultimately supporting real-time feedback and adaptive scaffolding in collaborative learning environments that teachers value.


【18】Large Language Models for Detecting Cyberattacks on Smart Grid Protective Relays
标题:用于检测智能电网保护继电器网络攻击的大型语言模型
链接:https://arxiv.org/abs/2601.04443

作者:Ahmad Mohammad Saber,Saeed Jafari,Zhengmao Ouyang,Paul Budnarain,Amr Youssef,Deepa Kundur
摘要:本文提出了一种基于大语言模型(LLM)的框架,用于检测针对变压器电流差动继电器(TCDR)的网络攻击;此类攻击若未被检测到,可能触发关键变压器的误跳闸。所提出的方法调整并微调DistilBERT等紧凑型LLM,利用跳闸前后记录的文本化多维TCDR电流测量值来区分网络攻击与实际故障。我们的结果表明,DistilBERT在不影响TCDR可靠性的前提下检测到97.6%的网络攻击,并在商用工作站上实现低于6毫秒的推理延迟。额外的评估确认了该框架在时间同步与假数据注入组合攻击下的鲁棒性、对测量噪声的弹性,以及在不同提示表述变体下的稳定性。此外,GPT-2和DistilBERT+LoRA实现了相当的性能,突出了LLM增强智能电网网络安全的潜力。我们提供了本研究中使用的完整数据集,以确保可重现性。
摘要:This paper presents a large language model (LLM)-based framework for detecting cyberattacks on transformer current differential relays (TCDRs), which, if undetected, may trigger false tripping of critical transformers. The proposed approach adapts and fine-tunes compact LLMs such as DistilBERT to distinguish cyberattacks from actual faults using textualized multidimensional TCDR current measurements recorded before and after tripping. Our results demonstrate that DistilBERT detects 97.6% of cyberattacks without compromising TCDR dependability and achieves inference latency below 6 ms on a commercial workstation. Additional evaluations confirm the framework's robustness under combined time-synchronization and false-data-injection attacks, resilience to measurement noise, and stability across prompt formulation variants. Furthermore, GPT-2 and DistilBERT+LoRA achieve comparable performance, highlighting the potential of LLMs for enhancing smart grid cybersecurity. We provide the full dataset used in this study for reproducibility.


【19】From Domains to Instances: Dual-Granularity Data Synthesis for LLM Unlearning
标题:从领域到实例:面向LLM遗忘的双粒度数据合成
链接:https://arxiv.org/abs/2601.04278

作者:Xiaoyu Xu,Minxin Du,Zitong Li,Zi Liang,Zhibiao Guo,Shiyu Zhang,Peizhao Hu,Qingqing Ye,Haibo Hu
备注:16 pages
摘要:尽管机器遗忘对于从LLM中删除私有、有害或受版权保护的内容至关重要,但当前的基准测试通常无法忠实地表示模型学习到的真正"遗忘范围"。我们形式化了两种不同的遗忘粒度(域级和实例级),并提出BiForget,一个用于合成高质量遗忘集的自动化框架。与依赖外部生成器的先前工作不同,BiForget利用目标模型本身,通过种子引导和对抗性提示来获取与其内部知识分布相匹配的数据。我们在不同基准上的实验表明,它实现了相关性、多样性和效率的卓越平衡。定量来看,在哈利波特领域,它将相关性提高了${\sim}20$,多样性提高了${\sim}0.05$,同时与SOTA方法相比将总数据量减半。最终,它促进了更稳健的遗忘和更好的效用保持,为评估LLM遗忘提供了更严格的基础。
摘要:Although machine unlearning is essential for removing private, harmful, or copyrighted content from LLMs, current benchmarks often fail to faithfully represent the true "forgetting scope" learned by the model. We formalize two distinct unlearning granularities, domain-level and instance-level, and propose BiForget, an automated framework for synthesizing high-quality forget sets. Unlike prior work relying on external generators, BiForget exploits the target model per se to elicit data that matches its internal knowledge distribution through seed-guided and adversarial prompting. Our experiments across diverse benchmarks show that it achieves a superior balance of relevance, diversity, and efficiency. Quantitatively, in the Harry Potter domain, it improves relevance by ${\sim}20$ and diversity by ${\sim}$0.05 while halving the total data size compared to SOTAs. Ultimately, it facilitates more robust forgetting and better utility preservation, providing a more rigorous foundation for evaluating LLM unlearning.


【20】Unlocking the Pre-Trained Model as a Dual-Alignment Calibrator for Post-Trained LLMs
标题:解锁预训练模型作为后训练LLM的双重对齐校准器
链接:https://arxiv.org/abs/2601.04277

作者:Beier Luo,Cheng Wang,Hongxin Wei,Sharon Li,Xuefeng Du
摘要:后训练改进了大型语言模型(LLM),但通常会恶化置信度校准,导致系统性的过度自信。最近针对后训练语言模型(PoLM)的无监督事后方法通过将PoLM的置信度与校准良好的预训练对应模型的置信度对齐来缓解这一问题。然而,将校准表述为静态的输出分布匹配,忽略了后训练引入的推理时动态。特别地,我们表明校准误差来自两种机制:(i)置信度漂移,即尽管中间决策过程在很大程度上保持一致,最终置信度却被夸大;(ii)过程漂移,即中间推理路径发生分歧。在此诊断的指导下,我们提出了Dual-Align,一个用于置信度校准中双重对齐的无监督事后框架。Dual-Align通过最终分布匹配执行置信度对齐以纠正置信度漂移,并通过定位轨迹开始分歧的层并重新对齐后续推理的稳定性来引入过程对齐以解决过程漂移。这种双重策略学习单个温度参数,可在不牺牲后训练性能增益的情况下纠正两种漂移类型。实验表明,该方法相比基线取得一致的改进,降低了校准误差,并接近有监督oracle的水平。
摘要:Post-training improves large language models (LLMs) but often worsens confidence calibration, leading to systematic overconfidence. Recent unsupervised post-hoc methods for post-trained LMs (PoLMs) mitigate this by aligning PoLM confidence to that of well-calibrated pre-trained counterparts. However, framing calibration as static output-distribution matching overlooks the inference-time dynamics introduced by post-training. In particular, we show that calibration errors arise from two regimes: (i) confidence drift, where final confidence inflates despite largely consistent intermediate decision processes, and (ii) process drift, where intermediate inference pathways diverge. Guided by this diagnosis, we propose Dual-Align, an unsupervised post-hoc framework for dual alignment in confidence calibration. Dual-Align performs confidence alignment to correct confidence drift via final-distribution matching, and introduces process alignment to address process drift by locating the layer where trajectories diverge and realigning the stability of subsequent inference. This dual strategy learns a single temperature parameter that corrects both drift types without sacrificing post-training performance gains. Experiments show consistent improvements over baselines, reducing calibration errors and approaching a supervised oracle.
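Dual-Align最终学习的是单个温度参数。作为背景,下面给出温度缩放本身的极简示意(纯标准库玩具实现,logits为虚构数据,并非论文代码):温度T>1会使softmax分布变平,从而在不改变argmax预测的情况下降低过度自信。

```python
import math

def softmax(logits, temperature=1.0):
    """带温度的softmax:T>1时分布更平,最大置信度下降。"""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                      # 减去最大值保证数值稳定
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 1.0, 0.5]                 # 虚构的模型输出
p_raw = softmax(logits)                  # 原始(可能过度自信的)分布
p_cal = softmax(logits, temperature=2.0) # 温度缩放后的分布

print(f"raw  max prob = {max(p_raw):.3f}")
print(f"T=2  max prob = {max(p_cal):.3f}")
```

注意温度缩放只改变置信度、不改变预测类别,因此不会牺牲任务准确率;这也是摘要中"不牺牲后训练性能增益"的一种直观解释。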


【21】State Backdoor: Towards Stealthy Real-world Poisoning Attack on Vision-Language-Action Model in State Space
标题:状态后门:针对状态空间中视觉-语言-动作模型的隐蔽现实世界中毒攻击
链接:https://arxiv.org/abs/2601.04266

作者:Ji Guo,Wenbo Jiang,Yansong Lin,Yijing Liu,Ruichen Zhang,Guomin Lu,Aiguo Chen,Xinshuo Han,Hongwei Li,Dusit Niyato
摘要:视觉-语言-动作(VLA)模型广泛部署在机器人等安全关键型具身AI应用中。然而,它们复杂的多模态交互也暴露出新的安全漏洞。在本文中,我们研究了VLA模型中的后门威胁:恶意输入会导致有针对性的错误行为,同时保持在干净数据上的性能。现有后门方法主要依赖于向视觉模态插入可见触发器,由于环境多变,这类触发器在现实世界设置中鲁棒性差且隐蔽性低。为克服这些限制,我们引入了状态后门,一种新颖而实用的后门攻击,利用机械臂的初始状态作为触发器。为了兼顾触发器的隐蔽性和有效性,我们设计了一种偏好引导的遗传算法(PGA),在状态空间中高效搜索最小但有效的触发器。在五个代表性VLA模型和五个真实任务上的大量实验表明,我们的方法在不影响良性任务性能的情况下实现了超过90%的攻击成功率,揭示了具身AI系统中一个尚未被充分认识的漏洞。
摘要:Vision-Language-Action (VLA) models are widely deployed in safety-critical embodied AI applications such as robotics. However, their complex multimodal interactions also expose new security vulnerabilities. In this paper, we investigate a backdoor threat in VLA models, where malicious inputs cause targeted misbehavior while preserving performance on clean data. Existing backdoor methods predominantly rely on inserting visible triggers into visual modality, which suffer from poor robustness and low insusceptibility in real-world settings due to environmental variability. To overcome these limitations, we introduce the State Backdoor, a novel and practical backdoor attack that leverages the robot arm's initial state as the trigger. To optimize trigger for insusceptibility and effectiveness, we design a Preference-guided Genetic Algorithm (PGA) that efficiently searches the state space for minimal yet potent triggers. Extensive experiments on five representative VLA models and five real-world tasks show that our method achieves over 90% attack success rate without affecting benign task performance, revealing an underexplored vulnerability in embodied AI systems.
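摘要未给出偏好引导遗传算法(PGA)的实现细节;下面用一个通用遗传算法在二维连续空间中搜索的玩具示意来说明"选择-交叉-变异"的基本循环(适应度函数与全部参数均为假设,仅模拟"有效且偏离初始状态尽量小"这一双重偏好):

```python
import random

random.seed(0)

def fitness(state):
    """虚构的适应度:越接近目标状态 (0.3, -0.2) 越好,
    同时惩罚偏离原点过远(模拟"最小触发器"偏好)。"""
    effectiveness = -((state[0] - 0.3) ** 2 + (state[1] + 0.2) ** 2)
    stealth_penalty = 0.1 * (abs(state[0]) + abs(state[1]))
    return effectiveness - stealth_penalty

def evolve(pop_size=30, generations=40, mutation=0.1):
    pop = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]              # 选择:保留前一半
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = random.sample(parents, 2)
            child = tuple((x + y) / 2 + random.gauss(0, mutation)
                          for x, y in zip(a, b))    # 交叉 + 变异
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
print("best state:", best, "fitness:", round(fitness(best), 4))
```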


【22】Towards a Mechanistic Understanding of Propositional Logical Reasoning in Large Language Models
标题:走向对大型语言模型中命题逻辑推理的机械理解
链接:https://arxiv.org/abs/2601.04260

作者:Danchun Chen,Qiyao Yan,Liangming Pan
摘要:理解大型语言模型(LLM)如何在内部执行逻辑推理仍然是一个根本性的挑战。虽然以前的机制研究集中在识别特定任务的电路,但它们留下了一个悬而未决的问题:LLM在命题推理中采用了什么计算策略。我们通过对Qwen3(8B和14B)在PropLogic-MI上的全面分析来弥补这一空白;PropLogic-MI是一个受控数据集,涵盖一跳和两跳推理下的11类命题逻辑规则。我们不是问"哪些组件是必需的",而是问"模型如何组织计算?"我们的分析揭示了一个连贯的计算架构,包括四个联锁机制:阶段计算(分层处理阶段)、信息传输(边界令牌处的信息流聚合)、事实回顾(源事实的持久重访)和专业注意力头(功能不同的头部类型)。这些机制可泛化到不同的模型规模、规则类型和推理深度,提供了LLM采用结构化计算策略进行逻辑推理的机制性证据。
摘要:Understanding how Large Language Models (LLMs) perform logical reasoning internally remains a fundamental challenge. While prior mechanistic studies focus on identifying task-specific circuits, they leave open the question of what computational strategies LLMs employ for propositional reasoning. We address this gap through comprehensive analysis of Qwen3 (8B and 14B) on PropLogic-MI, a controlled dataset spanning 11 propositional logic rule categories across one-hop and two-hop reasoning. Rather than asking ''which components are necessary,'' we ask ''how does the model organize computation?'' Our analysis reveals a coherent computational architecture comprising four interlocking mechanisms: Staged Computation (layer-wise processing phases), Information Transmission (information flow aggregation at boundary tokens), Fact Retrospection (persistent re-access of source facts), and Specialized Attention Heads (functionally distinct head types). These mechanisms generalize across model scales, rule types, and reasoning depths, providing mechanistic evidence that LLMs employ structured computational strategies for logical reasoning.


【23】Scaling Trends for Multi-Hop Contextual Reasoning in Mid-Scale Language Models
标题:中等规模语言模型中多跳上下文推理的扩展趋势
链接:https://arxiv.org/abs/2601.04254

作者:Brady Steele,Micah Katz
备注:18 pages, 6 figures, 8 tables
摘要:我们提出了一项关于大型语言模型中多跳上下文推理的受控研究,清晰地展示了任务与方法之间的分离:基于规则的模式匹配在结构化信息检索上取得100%的成功率,但在需要跨文档推理的任务上仅为6.7%;而基于LLM的多智能体系统呈现相反的模式,在基于规则方法失效的推理任务上成功率高达80%。我们使用一个合成评估框架,在四个模型(LLaMA-3 8B、LLaMA-2 13B、Mixtral 8x7B、DeepSeek-V2 16B)上进行了120次试验,报告了三个关键发现:(1)多智能体增益取决于基础能力:只有具备足够推理能力的模型才能获得统计显著的收益(LLaMA-3 8B的p < 0.001,Mixtral的p = 0.014),提升幅度最高达46.7个百分点,而较弱的模型没有任何收益,表明多智能体起到的是放大而非补偿作用;(2)活跃参数量预测推理性能:Mixtral的表现与其约12B活跃参数而非47B总参数一致,符合推理时计算量驱动MoE架构推理能力的假设;(3)架构质量很重要:LLaMA-3 8B尽管参数更少,却优于LLaMA-2 13B,与已知的训练改进一致。我们的结果为有关多智能体协调和MoE扩展的直觉提供了受控的定量证据,同时强调了多智能体收益对基础模型能力的依赖。我们发布了我们的评估框架,以支持对中等规模模型推理的可重复研究。
摘要:We present a controlled study of multi-hop contextual reasoning in large language models, providing a clean demonstration of the task-method dissociation: rule-based pattern matching achieves 100% success on structured information retrieval but only 6.7% on tasks requiring cross-document reasoning, while LLM-based multi-agent systems show the inverse pattern, achieving up to 80% on reasoning tasks where rule-based methods fail. Using a synthetic evaluation framework with 120 trials across four models (LLaMA-3 8B, LLaMA-2 13B, Mixtral 8x7B, DeepSeek-V2 16B), we report three key findings: (1) Multi-agent amplification depends on base capability: statistically significant gains occur only for models with sufficient reasoning ability (p < 0.001 for LLaMA-3 8B, p = 0.014 for Mixtral), with improvements of up to 46.7 percentage points, while weaker models show no benefit, suggesting amplification rather than compensation; (2) Active parameters predict reasoning performance: Mixtral's performance aligns with its ~12B active parameters rather than 47B total, consistent with the hypothesis that inference-time compute drives reasoning capability in MoE architectures; (3) Architecture quality matters: LLaMA-3 8B outperforms LLaMA-2 13B despite fewer parameters, consistent with known training improvements. Our results provide controlled quantitative evidence for intuitions about multi-agent coordination and MoE scaling, while highlighting the dependence of multi-agent benefits on base model capability. We release our evaluation framework to support reproducible research on reasoning in mid-scale models.


【24】TeleTables: A Benchmark for Large Language Models in Telecom Table Interpretation
标题:TeleTables:电信表解释中大型语言模型的基准
链接:https://arxiv.org/abs/2601.04202

作者:Anas Ezzakri,Nicola Piovesan,Mohamed Sana,Antonio De Domenico,Fadhel Ayed,Haozhe Zhang
摘要:电信行业越来越多地探索利用大型语言模型(LLM)来支持工程任务、加速故障排除并协助解释复杂的技术文档。然而,最近的研究表明,LLM在电信标准(特别是3GPP规范)上表现不佳。我们认为,一个关键原因是这些标准密集地使用表格来呈现必要信息,而LLM对此类表格的知识及其解释能力在很大程度上仍未被检验。为了解决这一差距,我们引入了TeleTables,一个旨在同时评估LLM对技术规范中表格的隐式知识及其显式解释能力的基准。TeleTables通过一个新颖的多阶段数据生成管道构建,该管道从3GPP标准中提取表格,并使用多模态和面向推理的LLM来生成和验证问题。由此产生的数据集是公开的,包括500个经人工验证的问答对,每个问答对都与以多种格式提供的相应表格相关联。我们的评估表明,较小的模型(10B参数以下)既难以回忆3GPP知识,也难以解释表格,表明它们在预训练中对电信标准的接触有限,且缺乏驾驭复杂技术材料的归纳偏置。另一方面,更大的模型在表格解释上表现出更强的推理能力。总体而言,TeleTables凸显了为可靠地解释和推理电信标准而进行领域专门化微调的必要性。
摘要:Language Models (LLMs) are increasingly explored in the telecom industry to support engineering tasks, accelerate troubleshooting, and assist in interpreting complex technical documents. However, recent studies show that LLMs perform poorly on telecom standards, particularly 3GPP specifications. We argue that a key reason is that these standards densely include tables to present essential information, yet the LLM knowledge and interpretation ability of such tables remains largely unexamined. To address this gap, we introduce TeleTables, a benchmark designed to evaluate both the implicit knowledge LLMs have about tables in technical specifications and their explicit ability to interpret them. TeleTables is built through a novel multi-stage data generation pipeline that extracts tables from 3GPP standards and uses multimodal and reasoning-oriented LLMs to generate and validate questions. The resulting dataset, which is publicly available, comprises 500 human-verified question-answer pairs, each associated with the corresponding table in multiple formats. Our evaluation shows that, smaller models (under 10B parameters) struggle both to recall 3GPP knowledge and to interpret tables, indicating the limited exposure to telecom standards in their pretraining and the insufficient inductive biases for navigating complex technical material. Larger models, on the other hand, show stronger reasoning on table interpretation. Overall, TeleTables highlights the need for domain-specialized fine-tuning to reliably interpret and reason over telecom standards.


Graph相关(图学习|图神经网络|图优化等)(4篇)

【1】FaST: Efficient and Effective Long-Horizon Forecasting for Large-Scale Spatial-Temporal Graphs via Mixture-of-Experts
标题:FaST:通过混合专家对大规模时空图进行高效有效的长期预测
链接:https://arxiv.org/abs/2601.05174

作者:Yiji Zhao,Zihao Zhong,Ao Wang,Haomin Wen,Ming Jin,Yuxuan Liang,Huaiyu Wan,Hao Wu
备注:Accepted to KDD 2026
摘要:大规模网络上的时空图(STG)预测已经引起了广泛关注。然而,现有模型主要集中于短期预测,在扩展到长期预测和大规模图时面临高昂的计算成本和内存消耗。针对上述挑战,我们提出了FaST,一个基于异质性感知混合专家(MoE)的高效长期大规模STG预测框架,它可以对数千个节点进行提前一周(15分钟粒度下672步)的预测。FaST由两项关键创新支撑。首先,提出了一种自适应图代理注意力机制,以减轻传统图卷积和自注意力模块应用于大规模图时固有的计算负担。其次,我们提出了一个新的并行MoE模块,用门控线性单元(GLU)取代传统的前馈网络,从而实现高效且可扩展的并行结构。在真实世界数据集上的大量实验表明,与最先进的基线相比,FaST不仅提供了卓越的长期预测准确性,还实现了显著的计算效率。我们的源代码可在https://github.com/yijizhao/FaST获取。
摘要:Spatial-Temporal Graph (STG) forecasting on large-scale networks has garnered significant attention. However, existing models predominantly focus on short-horizon predictions and suffer from notorious computational costs and memory consumption when scaling to long-horizon predictions and large graphs. Targeting the above challenges, we present FaST, an effective and efficient framework based on heterogeneity-aware Mixture-of-Experts (MoEs) for long-horizon and large-scale STG forecasting, which unlocks one-week-ahead (672 steps at a 15-minute granularity) prediction with thousands of nodes. FaST is underpinned by two key innovations. First, an adaptive graph agent attention mechanism is proposed to alleviate the computational burden inherent in conventional graph convolution and self-attention modules when applied to large-scale graphs. Second, we propose a new parallel MoE module that replaces traditional feed-forward networks with Gated Linear Units (GLUs), enabling an efficient and scalable parallel structure. Extensive experiments on real-world datasets demonstrate that FaST not only delivers superior long-horizon predictive accuracy but also achieves remarkable computational efficiency compared to state-of-the-art baselines. Our source code is available at: https://github.com/yijizhao/FaST.
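摘要提到用门控线性单元(GLU)取代传统前馈网络。GLU的核心是逐元素门控:GLU(x) = (xW) ⊙ σ(xV)。下面是一个纯标准库的极简示意(权重为随机虚构值,与论文实现无关):

```python
import math
import random

random.seed(1)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, x):
    """按行计算 W·x。"""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def glu(x, W, V):
    """门控线性单元:线性变换 xW 与门 sigmoid(xV) 逐元素相乘。"""
    values = matvec(W, x)
    gates = [sigmoid(g) for g in matvec(V, x)]
    return [v * g for v, g in zip(values, gates)]

dim_in, dim_out = 4, 3
W = [[random.gauss(0, 0.5) for _ in range(dim_in)] for _ in range(dim_out)]
V = [[random.gauss(0, 0.5) for _ in range(dim_in)] for _ in range(dim_out)]
x = [0.2, -0.1, 0.4, 0.3]
print("GLU output:", [round(y, 4) for y in glu(x, W, V)])
```

由于门取值在(0,1)内,GLU输出的每个分量幅度都不超过对应的线性变换值,相当于一种可学习的逐元素选通。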


【2】Parallelizing Node-Level Explainability in Graph Neural Networks
标题:图神经网络中的节点级解释性并行化
链接:https://arxiv.org/abs/2601.04807

作者:Oscar Llorente,Jaime Boal,Eugenio F. Sánchez-Úbeda,Antonio Diaz-Cano,Miguel Familiar
摘要:图神经网络(GNN)通过利用图结构数据中的结构信息,在节点分类、链接预测和图分类等广泛任务中表现出卓越的性能。然而,在节点分类中,随着图规模的增大,计算节点级可解释性变得极其耗时,而批处理策略通常会降低解释质量。本文介绍了一种通过图划分来并行化GNN节点级可解释性计算的新方法。通过将图分解为不相交的子图,我们可以并行计算节点邻居的可解释性,在内存充足的前提下显著提高可扩展性和效率,而不影响结果的正确性。对于内存有限的情况,我们进一步提出了一种基于dropout的重建机制,在内存使用与解释保真度之间提供可控的权衡。在真实世界数据集上的实验结果显示了显著的加速,为大规模GNN模型提供了可扩展且透明的可解释性。
摘要:Graph Neural Networks (GNNs) have demonstrated remarkable performance in a wide range of tasks, such as node classification, link prediction, and graph classification, by exploiting the structural information in graph-structured data. However, in node classification, computing node-level explainability becomes extremely time-consuming as the size of the graph increases, while batching strategies often degrade explanation quality. This paper introduces a novel approach to parallelizing node-level explainability in GNNs through graph partitioning. By decomposing the graph into disjoint subgraphs, we enable parallel computation of explainability for node neighbors, significantly improving the scalability and efficiency without affecting the correctness of the results, provided sufficient memory is available. For scenarios where memory is limited, we further propose a dropout-based reconstruction mechanism that offers a controllable trade-off between memory usage and explanation fidelity. Experimental results on real-world datasets demonstrate substantial speedups, enabling scalable and transparent explainability for large-scale GNN models.
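下面用标准库勾勒"图划分+并行计算"这一思路(玩具示例:图结构与"解释"函数均为虚构,仅示意不相交子图可以独立并行处理):

```python
from concurrent.futures import ThreadPoolExecutor

# 虚构的小图:邻接表
graph = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1], 4: [5], 5: [4]}

# 将图划分为不相交的子图(此处手工划分;实践中可用图划分工具)
partitions = [{0, 1, 2, 3}, {4, 5}]

def explain_partition(nodes):
    """虚构的"解释"计算:对分区内每个节点,
    以邻居数量作为玩具重要性得分。"""
    return {n: len(graph[n]) for n in nodes}

# 各分区互不相交,可以安全地并行计算
with ThreadPoolExecutor() as pool:
    results = list(pool.map(explain_partition, partitions))

scores = {n: s for part in results for n, s in part.items()}
print(scores)  # 每个节点一个得分
```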


【3】A zone-based training approach for last-mile routing using Graph Neural Networks and Pointer Networks
标题:使用图神经网络和指针网络的最后一英里路由基于区域的训练方法
链接:https://arxiv.org/abs/2601.04705

作者:Àngel Ruiz-Fas,Carlos Granell,José Francisco Ramos,Joaquín Huerta,Sergio Trilles
备注:Accepted in SMF 2026. 8 pages, 3 figures
摘要:电子商务的快速增长已将最后一英里配送网络推向极限;在这种情况下,微小的路由增益即可转化为更低的成本、更快的服务和更少的排放。当旅行时间高度不对称时(例如,单行道、交通拥堵),经典的启发式方法难以适应。本文提出了一种基于深度学习的最后一英里路由方法,生成由停靠点序列组成的地理区域,以最大限度地减少最后一英里的配送时间。   所提出的方法是一种编码器-解码器架构。每条路线被表示为一个完全有向图,其节点是停靠点,其边权重是不对称的旅行时间。图神经网络编码器产生节点嵌入,捕获停靠点之间的空间关系。然后,指针网络解码器接收嵌入和路线的起始节点以顺序选择下一站,为每个未访问的节点分配作为下一目的地的概率。   获取并聚类离散全球网格系统中包含训练数据路线停靠点的单元,以生成大小相近的地理区域,训练和推理过程据此划分。随后,每个区域训练模型的一个独立实例,仅考虑该区域内训练路线的停靠点。   该方法使用2021年亚马逊最后一英里路由挑战赛的洛杉矶路线进行评估。对一般训练与基于区域训练的结果进行了比较,结果显示基于区域的训练相比一般训练降低了平均预测路线长度。随着每条路线停靠点数量的增加,基于区域方法的性能改进变得更加明显。
摘要:Rapid e-commerce growth has pushed last-mile delivery networks to their limits, where small routing gains translate into lower costs, faster service, and fewer emissions. Classical heuristics struggle to adapt when travel times are highly asymmetric (e.g., one-way streets, congestion). A deep learning-based approach to the last-mile routing problem is presented to generate geographical zones composed of stop sequences to minimize last-mile delivery times.   The presented approach is an encoder-decoder architecture. Each route is represented as a complete directed graph whose nodes are stops and whose edge weights are asymmetric travel times. A Graph Neural Network encoder produces node embeddings that captures the spatial relationships between stops. A Pointer Network decoder then takes the embeddings and the route's start node to sequentially select the next stops, assigning a probability to each unvisited node as the next destination.   Cells of a Discrete Global Grid System which contain route stops in the training data are obtained and clustered to generate geographical zones of similar size in which the process of training and inference are divided. Subsequently, a different instance of the model is trained per zone only considering the stops of the training routes which are included in that zone.   This approach is evaluated using the Los Angeles routes from the 2021 Amazon Last Mile Routing Challenge. Results from general and zone-based training are compared, showing a reduction in the average predicted route length in the zone-based training compared to the general training. The performance improvement of the zone-based approach becomes more pronounced as the number of stops per route increases.
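指针网络解码器在每一步对未访问停靠点打分、屏蔽已访问节点后选出下一站。下面是一个不含神经网络的贪心版示意(直接以不对称旅行时间作为打分依据,矩阵为虚构数据;真实模型用学习到的概率代替这里的启发式):

```python
# 虚构的不对称旅行时间矩阵:travel[i][j] 为从 i 到 j 的时间
travel = [
    [0, 4, 9, 7],
    [6, 0, 3, 8],
    [9, 5, 0, 2],
    [7, 8, 4, 0],
]

def decode_route(start, travel):
    """从起点开始,每步在未访问节点中选旅行时间最短的下一站
    (指针网络用学习到的概率代替这里的打分)。"""
    n = len(travel)
    route = [start]
    visited = {start}
    while len(route) < n:
        current = route[-1]
        # 排除已访问节点,相当于指针网络中的注意力掩码
        candidates = [j for j in range(n) if j not in visited]
        nxt = min(candidates, key=lambda j: travel[current][j])
        route.append(nxt)
        visited.add(nxt)
    return route

print(decode_route(0, travel))  # [0, 1, 2, 3]
```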


【4】Phasor Agents: Oscillatory Graphs with Three-Factor Plasticity and Sleep-Staged Learning
标题:相量代理:具有三因素可塑性和睡眠阶段学习的振荡图
链接:https://arxiv.org/abs/2601.04362

作者:Rodja Trappe
备注:22 pages, 14 figures
摘要:相量代理是一类动力系统,其内部状态是一个相量图:由耦合的Stuart-Landau振荡器构成的加权图。Stuart-Landau振荡器是一种最小的稳定"节律发生器"(Hopf分叉附近的标准形式);每个振荡器被视为一个抽象计算单元(受生物振荡群体启发,但并不声称对其建模)。在这种解释中,振荡器相位跟踪相对定时(相干性),而幅度跟踪局部增益或活动。相对相位结构充当表征介质;耦合权重通过三因素局部可塑性学习(由稀疏全局调制器和振荡定时的写入窗口门控的资格迹),无需反向传播。   振荡基底的一个核心挑战是稳定性:在线权重更新可能将网络驱动到不需要的状态(例如全局同步),使表征多样性坍缩。因此,受突触标记-捕获机制和睡眠阶段动态的启发,我们将清醒期标记与离线巩固分开:类似深睡眠的门控捕获安全地提交被标记的变化,而类似REM的重放则重建并扰动经验以用于规划。   一个分阶段的实验套件通过消融和证伪检验验证了每种机制:资格迹在延迟调制下保留信用分配;压缩进度信号通过了时间戳洗牌对照检验;在噪声下,相位相干检索达到扩散基线的4倍;在匹配的权重范数预算下,清醒/睡眠分离将稳定学习范围扩大67%;REM重放将迷宫成功率提高45.5个百分点;一种托尔曼式的潜在学习特征(在无奖励探索后的即时能力与绕行优势,与内部模型一致)从重放中涌现(Tolman, 1948)。   代码库和所有工件均已开源。
摘要:Phasor Agents are dynamical systems whose internal state is a Phasor Graph: a weighted graph of coupled Stuart-Landau oscillators. A Stuart-Landau oscillator is a minimal stable "rhythm generator" (the normal form near a Hopf bifurcation); each oscillator is treated as an abstract computational unit (inspired by, but not claiming to model, biological oscillatory populations). In this interpretation, oscillator phase tracks relative timing (coherence), while amplitude tracks local gain or activity. Relative phase structure serves as a representational medium; coupling weights are learned via three-factor local plasticity - eligibility traces gated by sparse global modulators and oscillation-timed write windows - without backpropagation.   A central challenge in oscillatory substrates is stability: online weight updates can drive the network into unwanted regimes (e.g., global synchrony), collapsing representational diversity. We therefore separate wake tagging from offline consolidation, inspired by synaptic tagging-and-capture and sleep-stage dynamics: deep-sleep-like gated capture commits tagged changes safely, while REM-like replay reconstructs and perturbs experience for planning.   A staged experiment suite validates each mechanism with ablations and falsifiers: eligibility traces preserve credit under delayed modulation; compression-progress signals pass timestamp-shuffle controls; phase-coherent retrieval reaches 4x diffusive baselines under noise; wake/sleep separation expands stable learning by 67 percent under matched weight-norm budgets; REM replay improves maze success rate by +45.5 percentage points; and a Tolman-style latent-learning signature - immediate competence and detour advantage after unrewarded exploration, consistent with an internal model - emerges from replay (Tolman, 1948).   The codebase and all artifacts are open-source.
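Stuart-Landau振荡器是Hopf分叉附近的标准形式 dz/dt = (μ+iω)z − |z|²z。下面用欧拉法对两个扩散耦合的振荡器做一个极简数值示意(参数任取,并非论文设置),可以看到幅度收敛到√μ附近、相位差在耦合作用下趋近于零:

```python
import cmath

def simulate(steps=20000, dt=0.001, mu=1.0, omega=2.0, k=0.2):
    """欧拉积分两个扩散耦合的 Stuart-Landau 振荡器。"""
    z = [0.1 + 0.0j, 0.0 + 0.2j]           # 任取的初始状态
    for _ in range(steps):
        dz0 = (mu + 1j * omega) * z[0] - abs(z[0]) ** 2 * z[0] + k * (z[1] - z[0])
        dz1 = (mu + 1j * omega) * z[1] - abs(z[1]) ** 2 * z[1] + k * (z[0] - z[1])
        z = [z[0] + dt * dz0, z[1] + dt * dz1]
    return z

z = simulate()
amps = [abs(v) for v in z]
phase_gap = abs(cmath.phase(z[0] / z[1]))
print("amplitudes:", [round(a, 3) for a in amps])   # 均接近 sqrt(mu)=1
print("phase gap:", round(phase_gap, 4))            # 正的扩散耦合使相位差趋近 0
```

这个玩具示例同时也展示了摘要中的稳定性难题:正耦合天然趋向全局同步,这正是该工作用清醒/睡眠分离来对抗的表征坍缩模式。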


Transformer(2篇)

【1】Transformer-based Multi-agent Reinforcement Learning for Separation Assurance in Structured and Unstructured Airspaces
标题:基于转换器的多智能体强化学习用于结构化和非结构化空域中的隔离保证
链接:https://arxiv.org/abs/2601.04401

作者:Arsyi Aziz,Peng Wei
备注:9 pages, 4 figures, 4 tables. Presented at SESAR Innovation Days 2025
摘要:传统的基于优化的流量计量依赖于对预先计算时刻表的严格遵守,这限制了先进空中机动性(AAM)随机运行所需的灵活性。相比之下,多智能体强化学习(MARL)提供了一个去中心化的自适应框架,可以更好地处理安全飞机间隔保证所需应对的不确定性。尽管有这一优势,当前的MARL方法往往过拟合于特定的空域结构,限制了它们对新配置的适应性。为了提高泛化能力,我们在相对极坐标状态空间中重新表述了MARL问题,并在不同的交通模式和交叉角度上训练了一个Transformer编码器模型。学习到的模型提供解决冲突的速度指令,同时使飞机保持在其期望巡航速度附近。在实验中,我们在结构化和非结构化空域中评估了1层、2层和3层的编码器深度,发现单层编码器配置优于更深的变体:其空中接近碰撞率接近于零,且间隔损失侵犯时间比更深的配置更短。此外,我们还表明,相同的配置优于一个纯注意力设计的基线模型。总之,我们的结果表明,新提出的状态表示、新颖的神经网络架构设计以及所提出的训练策略,为结构化和非结构化空域中的飞机间隔保证提供了一个适应性强、可扩展的去中心化解决方案。
摘要:Conventional optimization-based metering depends on strict adherence to precomputed schedules, which limits the flexibility required for the stochastic operations of Advanced Air Mobility (AAM). In contrast, multi-agent reinforcement learning (MARL) offers a decentralized, adaptive framework that can better handle uncertainty, required for safe aircraft separation assurance. Despite this advantage, current MARL approaches often overfit to specific airspace structures, limiting their adaptability to new configurations. To improve generalization, we recast the MARL problem in a relative polar state space and train a transformer encoder model across diverse traffic patterns and intersection angles. The learned model provides speed advisories to resolve conflicts while maintaining aircraft near their desired cruising speeds. In our experiments, we evaluated encoder depths of 1, 2, and 3 layers in both structured and unstructured airspaces, and found that a single encoder configuration outperformed deeper variants, yielding near-zero near mid-air collision rates and shorter loss-of-separation infringements than the deeper configurations. Additionally, we showed that the same configuration outperforms a baseline model designed purely with attention. Together, our results suggest that the newly formulated state representation, novel design of neural network architecture, and proposed training strategy provide an adaptable and scalable decentralized solution for aircraft separation assurance in both structured and unstructured airspaces.
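摘要提到将问题重新表述在相对极坐标状态空间中。下面是一个假设性的二维运动学示意(非论文实现):把入侵机相对本机的笛卡尔位置转换为(距离, 相对方位角)表示,这种表示对空域的平移和旋转天然不敏感。

```python
import math

def relative_polar_state(own, intruder):
    """own / intruder: (x, y, heading_rad)。
    返回 (距离, 相对方位角);方位角以本机航向为参考,
    归一化到 (-pi, pi]。"""
    dx = intruder[0] - own[0]
    dy = intruder[1] - own[1]
    rho = math.hypot(dx, dy)                      # 相对距离
    bearing = math.atan2(dy, dx) - own[2]         # 相对本机航向的方位
    bearing = math.atan2(math.sin(bearing), math.cos(bearing))  # 归一化
    return rho, bearing

own = (0.0, 0.0, math.pi / 2)        # 本机朝向 +y 方向
intruder = (0.0, 10.0, -math.pi / 2) # 入侵机在正前方 10 个单位
rho, bearing = relative_polar_state(own, intruder)
print(f"rho={rho:.1f}, bearing={bearing:.2f} rad")  # rho=10.0, bearing=0.00 rad
```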


【2】Transformer-Based Multi-Modal Temporal Embeddings for Explainable Metabolic Phenotyping in Type 1 Diabetes
标题:基于转换器的多模式时态嵌入用于1型糖尿病的可解释代谢表型
链接:https://arxiv.org/abs/2601.04299

作者:Pir Bakhsh Khokhar,Carmine Gravino,Fabio Palomba,Sule Yildrim Yayilgan,Sarang Shaikh
摘要:1型糖尿病(T1D)是一种高度代谢异质性疾病,无法通过常规生物标志物(如糖化血红蛋白(HbA1c))充分表征。这项研究提出了一个可解释的深度学习框架,该框架将连续葡萄糖监测(CGM)数据与实验室配置文件相结合,以学习个体代谢状态的多模态时间嵌入。使用Transformer编码器对模态之间的时间依赖性进行建模,而通过高斯混合建模来识别潜在的代谢表型。通过Transformer注意力可视化和基于SHAP的特征属性实现模型的可解释性。在577名T1D患者中确定了5种潜在的代谢表型,从代谢稳定性到心脏代谢风险升高。这些表型表现出不同的生化特征,包括血糖控制、脂质代谢、肾标志物和促甲状腺激素(TSH)水平的差异。注意力分析强调葡萄糖变异性是主要的时间因素,而SHAP分析将HbA1c、甘油三酯、胆固醇、肌酐和TSH确定为表型分化的关键因素。表型成员关系显示与高血压、心肌梗死和心力衰竭有统计学显著性相关,尽管程度不高。总的来说,这种可解释的多模态时间嵌入框架揭示了T1D中生理上连贯的代谢亚组,并支持单一生物标志物之外的风险分层。
摘要:Type 1 diabetes (T1D) is a highly metabolically heterogeneous disease that cannot be adequately characterized by conventional biomarkers such as glycated hemoglobin (HbA1c). This study proposes an explainable deep learning framework that integrates continuous glucose monitoring (CGM) data with laboratory profiles to learn multimodal temporal embeddings of individual metabolic status. Temporal dependencies across modalities are modeled using a transformer encoder, while latent metabolic phenotypes are identified via Gaussian mixture modeling. Model interpretability is achieved through transformer attention visualization and SHAP-based feature attribution. Five latent metabolic phenotypes, ranging from metabolic stability to elevated cardiometabolic risk, were identified among 577 individuals with T1D. These phenotypes exhibit distinct biochemical profiles, including differences in glycemic control, lipid metabolism, renal markers, and thyrotropin (TSH) levels. Attention analysis highlights glucose variability as a dominant temporal factor, while SHAP analysis identifies HbA1c, triglycerides, cholesterol, creatinine, and TSH as key contributors to phenotype differentiation. Phenotype membership shows statistically significant, albeit modest, associations with hypertension, myocardial infarction, and heart failure. Overall, this explainable multimodal temporal embedding framework reveals physiologically coherent metabolic subgroups in T1D and supports risk stratification beyond single biomarkers.
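摘要通过高斯混合建模在嵌入空间中识别潜在表型。下面给出一维高斯混合后验责任(E步)的极简示意(纯标准库,参数为虚构;实际中通常直接使用现成库如scikit-learn的GaussianMixture):

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def responsibilities(x, weights, mus, sigmas):
    """给定混合参数,计算样本 x 属于每个分量的后验概率。"""
    likelihoods = [w * gaussian_pdf(x, m, s)
                   for w, m, s in zip(weights, mus, sigmas)]
    total = sum(likelihoods)
    return [l / total for l in likelihoods]

# 两个虚构的"表型"分量,例如血糖稳定组与高变异组
weights = [0.6, 0.4]
mus = [5.5, 9.0]       # 虚构的平均血糖水平
sigmas = [0.8, 1.5]

r = responsibilities(6.0, weights, mus, sigmas)
print([round(p, 3) for p in r])   # 样本 6.0 更可能属于第一个分量
```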


GAN|对抗|攻击|生成相关(10篇)

【1】DeepWeightFlow: Re-Basined Flow Matching for Generating Neural Network Weights
标题:DeepWeightFlow:用于生成神经网络权重的重新基础流匹配
链接:https://arxiv.org/abs/2601.05052

作者:Saumya Gupta,Scott Biggs,Moritz Laber,Zohair Shafi,Robin Walters,Ayan Paul
备注:25 pages, 20 tables, 2 figures
摘要:建立高效的神经网络权值生成模型是一个重要的研究热点,它面临着现代神经网络的高维权值空间及其对称性所带来的挑战。几个现有的生成模型仅限于生成部分神经网络权重,特别是对于较大的模型,如ResNet和ViT。那些生成完整权重的方法会在生成速度上遇到困难,或者需要对生成的模型进行微调。在这项工作中,我们提出了DeepWeightFlow,这是一种直接在权重空间中操作的流匹配模型,可以为各种架构,神经网络大小和数据模式生成多样化和高精度的神经网络权重。DeepWeightFlow生成的神经网络不需要微调就能表现良好,并且可以扩展到大型网络。我们在生成权重模型的背景下将Git Re-Basin和TransFusion应用于神经网络规范化,以考虑神经网络排列对称性的影响,并提高更大模型大小的生成效率。生成的网络擅长迁移学习,可以在几分钟内生成数百个神经网络的集合,远远超过基于扩散的方法的效率。DeepWeightFlow模型为更高效和可扩展地生成不同的神经网络集铺平了道路。
摘要:Building efficient and effective generative models for neural network weights has been a research focus of significant interest that faces challenges posed by the high-dimensional weight spaces of modern neural networks and their symmetries. Several prior generative models are limited to generating partial neural network weights, particularly for larger models, such as ResNet and ViT. Those that do generate complete weights struggle with generation speed or require finetuning of the generated models. In this work, we present DeepWeightFlow, a Flow Matching model that operates directly in weight space to generate diverse and high-accuracy neural network weights for a variety of architectures, neural network sizes, and data modalities. The neural networks generated by DeepWeightFlow do not require fine-tuning to perform well and can scale to large networks. We apply Git Re-Basin and TransFusion for neural network canonicalization in the context of generative weight models to account for the impact of neural network permutation symmetries and to improve generation efficiency for larger model sizes. The generated networks excel at transfer learning, and ensembles of hundreds of neural networks can be generated in minutes, far exceeding the efficiency of diffusion-based methods. DeepWeightFlow models pave the way for more efficient and scalable generation of diverse sets of neural networks.
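流匹配的训练目标可概括为:沿线性插值路径 x_t = (1−t)x₀ + t·x₁ 回归目标速度场 u = x₁ − x₀。下面是一个与架构无关的玩具示意(纯标准库;"模型"用简单函数代替,数据为虚构),仅说明损失的构造方式:

```python
import random

random.seed(0)

def flow_matching_loss(x0, x1, t, velocity_model):
    """条件流匹配损失:||v(x_t, t) - (x1 - x0)||^2。"""
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]   # 插值点
    target = [b - a for a, b in zip(x0, x1)]             # 目标速度
    pred = velocity_model(xt, t)
    return sum((p - u) ** 2 for p, u in zip(pred, target))

# 虚构样本:x0 来自噪声,x1 代表目标网络权重
x0 = [random.gauss(0, 1) for _ in range(4)]
x1 = [0.5, -0.2, 0.1, 0.3]

perfect = lambda xt, t: [b - a for a, b in zip(x0, x1)]  # 恰好预测目标速度
zero = lambda xt, t: [0.0] * len(xt)                      # 恒零速度场

t = random.random()
print("perfect model loss:", flow_matching_loss(x0, x1, t, perfect))  # 0.0
print("zero model loss:", round(flow_matching_loss(x0, x1, t, zero), 4))
```

训练完成后,从噪声出发沿学到的速度场积分,即可一次性"生成"一组权重,这解释了摘要中集成生成远快于扩散式方法的原因。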


【2】Token Maturation: Autoregressive Language Generation via Continuous Token Dynamics
标题:令牌成熟:通过连续令牌动力学的自回归语言生成
链接:https://arxiv.org/abs/2601.04854

作者:Oshri Naparstek
备注:In preperation to ICML 2026
摘要:自回归语言模型通常定义在离散的令牌序列上,在每个生成步骤提交一个特定令牌。这种早期离散化迫使不确定性通过令牌级采样来解决,通常导致不稳定、重复以及对解码启发式方法的敏感。   在这项工作中,我们引入了语言生成的连续自回归表述,其中令牌被表示为连续向量,在离散化之前经过多个更新步骤逐渐"成熟"。模型不是对令牌进行采样,而是通过确定性的动力学过程演化连续令牌表示,仅当表示充分收敛时才提交为离散令牌。离散文本通过硬解码恢复,而不确定性在连续空间中得以保持和解决。   我们表明,仅凭这一成熟过程,使用确定性解码(argmax)就足以产生连贯且多样的文本,而无需依赖令牌级采样、扩散式去噪或辅助稳定机制。随机动力学或历史平滑等额外扰动可以自然地并入,但模型的运行并不依赖它们。   据我们所知,这是第一个通过在离散化之前将连续令牌表示演化至收敛来生成文本的自回归语言模型,从而在没有令牌级采样的情况下实现稳定生成。
摘要:Autoregressive language models are conventionally defined over discrete token sequences, committing to a specific token at every generation step. This early discretization forces uncertainty to be resolved through token-level sampling, often leading to instability, repetition, and sensitivity to decoding heuristics.   In this work, we introduce a continuous autoregressive formulation of language generation in which tokens are represented as continuous vectors that \emph{mature} over multiple update steps before being discretized. Rather than sampling tokens, the model evolves continuous token representations through a deterministic dynamical process, committing to a discrete token only when the representation has sufficiently converged. Discrete text is recovered via hard decoding, while uncertainty is maintained and resolved in the continuous space.   We show that this maturation process alone is sufficient to produce coherent and diverse text using deterministic decoding (argmax), without reliance on token-level sampling, diffusion-style denoising, or auxiliary stabilization mechanisms. Additional perturbations, such as stochastic dynamics or history smoothing, can be incorporated naturally but are not required for the model to function.   To our knowledge, this is the first autoregressive language model that generates text by evolving continuous token representations to convergence prior to discretization, enabling stable generation without token-level sampling.
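The maturation dynamics can be illustrated with a toy sketch: a continuous token vector is updated deterministically until it stops moving, and only then committed to a discrete token via argmax. The 2-D embeddings and the mixture-style update rule below are hypothetical illustrations, not the paper's learned dynamics.

```python
import math

# Toy 2-D "vocabulary" embeddings (hypothetical values; the paper's
# embeddings are learned and high-dimensional).
EMB = {"cat": (1.0, 0.0), "dog": (0.8, 0.6), "car": (-1.0, 0.2)}

def _softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def mature(vec, steps=100, tol=1e-9, temp=0.1):
    """Deterministically evolve a continuous token vector until it
    stops moving, then commit to a discrete token via argmax."""
    words = list(EMB)
    for _ in range(steps):
        sims = [vec[0] * EMB[w][0] + vec[1] * EMB[w][1] for w in words]
        probs = _softmax([s / temp for s in sims])
        # One simple deterministic dynamic: move to the
        # probability-weighted mixture of embeddings.
        new = (sum(p * EMB[w][0] for p, w in zip(probs, words)),
               sum(p * EMB[w][1] for p, w in zip(probs, words)))
        moved = math.hypot(new[0] - vec[0], new[1] - vec[1])
        vec = new
        if moved < tol:   # the representation has "matured"
            break
    sims = [vec[0] * EMB[w][0] + vec[1] * EMB[w][1] for w in words]
    return words[sims.index(max(sims))]
```

Note the decoding is fully deterministic: uncertainty lives in the continuous vector while it matures, and no token-level sampling is ever performed.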


【3】Neurosymbolic Retrievers for Retrieval-augmented Generation
标题:用于检索增强生成的神经符号检索器
链接:https://arxiv.org/abs/2601.04568

作者:Yash Saxena,Manas Gaur
备注:8 pages, 2 Figures, To Appear in IEEE Intelligent Systems
摘要:检索增强生成(RAG)在克服大型语言模型的关键限制方面取得了重大进展,例如幻觉、缺乏上下文基础以及透明度问题。然而,传统的RAG系统由三个相互连接的神经组件(检索器、重排序器和生成器)组成,其内部推理过程仍然不透明。这种透明度的缺乏使可解释性变得复杂,阻碍了调试工作,并侵蚀了信任,特别是在明确决策至关重要的高风险领域。为了应对这些挑战,我们引入了神经符号RAG的概念,它使用知识图将符号推理与神经检索技术相结合。这一新框架旨在回答两个主要问题:(a)检索器能否为文档选择提供清晰且可解释的基础?(b)符号知识能否增强检索过程的清晰度?我们提出了三种方法来改善这种整合。首先是MAR(Knowledge Modulation Aligned Retrieval,知识调制对齐检索),它使用调制网络和可解释的符号特征来细化查询嵌入,从而使文档匹配更加明确。其次,KG-Path RAG通过遍历知识图来增强查询,以提高整体检索质量和可解释性。最后,流程知识注入的RAG利用特定领域的工具,根据经过验证的工作流程对检索到的内容重新排序。心理健康风险评估任务的初步结果表明,这种神经符号方法提高了透明度和整体性能。
摘要:Retrieval Augmented Generation (RAG) has made significant strides in overcoming key limitations of large language models, such as hallucination, lack of contextual grounding, and issues with transparency. However, traditional RAG systems consist of three interconnected neural components - the retriever, re-ranker, and generator - whose internal reasoning processes remain opaque. This lack of transparency complicates interpretability, hinders debugging efforts, and erodes trust, especially in high-stakes domains where clear decision-making is essential. To address these challenges, we introduce the concept of Neurosymbolic RAG, which integrates symbolic reasoning using a knowledge graph with neural retrieval techniques. This new framework aims to answer two primary questions: (a) Can retrievers provide a clear and interpretable basis for document selection? (b) Can symbolic knowledge enhance the clarity of the retrieval process? We propose three methods to improve this integration. First is MAR (Knowledge Modulation Aligned Retrieval) that employs modulation networks to refine query embeddings using interpretable symbolic features, thereby making document matching more explicit. Second, KG-Path RAG enhances queries by traversing knowledge graphs to improve overall retrieval quality and interpretability. Lastly, Process Knowledge-infused RAG utilizes domain-specific tools to reorder retrieved content based on validated workflows. Preliminary results from mental health risk assessment tasks indicate that this neurosymbolic approach enhances both transparency and overall performance


【4】TSSR: Two-Stage Swap-Reward-Driven Reinforcement Learning for Character-Level SMILES Generation
标题:TSSR:用于字符级SMILES生成的两阶段交换奖励驱动强化学习
链接:https://arxiv.org/abs/2601.04521

作者:Jacob Ede Levine,Yun Lyan Luo,Sai Chandra Kosaraju
备注:Under Review
摘要:设计可靠、有效且多样化的分子是现代药物发现的基础,因为改进的分子生成有助于高效探索潜在候选药物的化学空间,并降低早期设计工作的成本。尽管有这些需求,当前将分子生成为SMILES字符串的化学语言模型仍容易受到累积令牌错误的影响:许多样本无法解析或在化学上不可信,而旨在防止失败的硬约束又可能限制探索。为了解决这个问题,我们引入了TSSR,一个两阶段、交换奖励驱动的强化学习(RL)框架,用于字符级SMILES生成。第一阶段奖励修复语法的局部令牌交换,促进从无效字符串到可解析字符串的转换。第二阶段提供来自RDKit诊断的化学感知反馈,奖励化合价、芳香性和连接性问题的减少。奖励分解为可解释的项(交换效率、错误减少、到有效性的距离),与模型无关,并且不需要特定于任务的标签或手工制作的语法。我们使用GRU策略在MOSES基准上评估了TSSR,该策略使用PPO分别在随机初始化的纯RL(P-RL)和从预训练的化学语言模型开始的微调RL(F-RL)中训练,每次运行评估10,000个生成的SMILES。在P-RL中,TSSR显著提高了句法有效性、化学有效性和新颖性。在F-RL中,TSSR保留了药物相似性和可合成性,同时提升了有效性和新颖性。令牌级分析表明,语法编辑和化学修复共同作用,以减少RDKit检测到的错误。TSSR将稀疏的终端目标转换为更密集、更可解释的奖励,在不降低多样性的情况下同时提高语法和化学质量。TSSR与数据集无关,可以适配各种强化学习方法。
摘要:The design of reliable, valid, and diverse molecules is fundamental to modern drug discovery, as improved molecular generation supports efficient exploration of the chemical space for potential drug candidates and reduces the cost of early design efforts. Despite these needs, current chemical language models that generate molecules as SMILES strings are vulnerable to compounding token errors: many samples are unparseable or chemically implausible, and hard constraints meant to prevent failure can restrict exploration. To address this gap, we introduce TSSR, a Two-Stage, Swap-Reward-driven reinforcement learning (RL) framework for character-level SMILES generation. Stage one rewards local token swaps that repair syntax, promoting transitions from invalid to parseable strings. Stage two provides chemistry-aware feedback from RDKit diagnostics, rewarding reductions in valence, aromaticity, and connectivity issues. The reward decomposes into interpretable terms (swap efficiency, error reduction, distance to validity), is model agnostic, and requires no task-specific labels or hand-crafted grammars. We evaluated TSSR on the MOSES benchmark using a GRU policy trained with PPO in both pure RL (P-RL) from random initialization and fine-tuning RL (F-RL) starting from a pretrained chemical language model, assessing 10,000 generated SMILES per run. In P-RL, TSSR significantly improves syntactic validity, chemical validity, and novelty. In F-RL, TSSR preserves drug-likeness and synthesizability while increasing validity and novelty. Token-level analysis shows that syntax edits and chemistry fixes act jointly to reduce RDKit detected errors. TSSR converts a sparse terminal objective into a denser and more interpretable reward, improving both syntactic and chemical quality without reducing diversity. TSSR is dataset-agnostic and can be adapted to various reinforcement learning approaches.
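Stage one's swap reward can be sketched with a toy "syntax checker" that only counts unmatched brackets; the paper uses real SMILES parsing and RDKit diagnostics, so the error function and bonus term below are illustrative assumptions.

```python
def syntax_errors(s):
    """Toy stand-in for a SMILES parser: count unmatched brackets.
    (The paper relies on real SMILES parsing / RDKit diagnostics.)"""
    depth, errors = 0, 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                errors += 1      # closing bracket with no opener
            else:
                depth -= 1
    return errors + depth        # leftover openers are also errors

def swap_reward(s, i, j):
    """Stage-one reward: positive when swapping characters i and j
    repairs syntax, scaled by how many errors the swap removes, with
    a bonus for an invalid-to-valid transition."""
    t = list(s)
    t[i], t[j] = t[j], t[i]
    repaired = "".join(t)
    delta = syntax_errors(s) - syntax_errors(repaired)
    bonus = 1.0 if syntax_errors(s) > 0 and syntax_errors(repaired) == 0 else 0.0
    return delta + bonus, repaired
```

For example, `")C(C"` has two bracket errors, and swapping positions 0 and 2 yields the parseable `"(C)C"`, earning both the error-reduction term and the validity bonus.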


【5】Disco-RAG: Discourse-Aware Retrieval-Augmented Generation
标题:Disco-RAG:话语感知检索增强一代
链接:https://arxiv.org/abs/2601.04377

作者:Dongqi Liu,Hang Ding,Qiming Feng,Jian Li,Xurong Xie,Zhucun Xue,Chengjie Wang,Jiangning Zhang,Yabiao Wang
摘要:检索增强生成(RAG)已成为提高大型语言模型(LLM)在知识密集型任务中性能的重要手段。然而,大多数现有的RAG策略以扁平和非结构化的方式处理检索到的段落,这阻止了模型捕获结构线索,并限制了其从分散在文档中的证据综合知识的能力。为了克服这些局限性,我们提出了Disco-RAG,一个话语感知框架,将话语信号显式注入生成过程。我们的方法构建块内话语树以捕捉局部层次结构,并构建块间修辞图以建模跨段落连贯性。这些结构被共同整合到一个规划蓝图中,为生成过程提供条件。问答和长文档摘要基准测试的实验结果表明了该方法的有效性。Disco-RAG无需微调即可在基准测试中实现最先进的结果。这些发现强调了语篇结构在推进RAG系统中的重要作用。
摘要:Retrieval-Augmented Generation (RAG) has emerged as an important means of enhancing the performance of large language models (LLMs) in knowledge-intensive tasks. However, most existing RAG strategies treat retrieved passages in a flat and unstructured way, which prevents the model from capturing structural cues and constrains its ability to synthesize knowledge from dispersed evidence across documents. To overcome these limitations, we propose Disco-RAG, a discourse-aware framework that explicitly injects discourse signals into the generation process. Our method constructs intra-chunk discourse trees to capture local hierarchies and builds inter-chunk rhetorical graphs to model cross-passage coherence. These structures are jointly integrated into a planning blueprint that conditions the generation. Experiments on question answering and long-document summarization benchmarks show the efficacy of our approach. Disco-RAG achieves state-of-the-art results on the benchmarks without fine-tuning. These findings underscore the important role of discourse structure in advancing RAG systems.


【6】Generation of synthetic delay time series for air transport applications
标题:航空运输应用的合成延迟时间序列的生成
链接:https://arxiv.org/abs/2601.04279

作者:Pau Esteve,Massimiliano Zanin
备注:18 pages, 13 figures
摘要:合成数据的生成越来越受到科学界的关注,这要归功于它能够解决数据稀缺和隐私等问题,并开始在航空运输中找到应用。我们在这里解决的问题,产生合成的,但现实的,在机场延误的时间序列,从欧洲和美国的操作的大集合开始。我们具体比较了三种模型,其中两种基于最先进的深度学习算法,一种是简化的遗传算法方法。我们展示了后者如何生成与真实序列几乎无法区分的时间序列,同时保持高变异性。我们进一步验证所产生的时间序列中的一个问题,检测机场之间的延迟传播。我们最终将合成数据提供给科学界。
摘要:The generation of synthetic data is receiving increasing attention from the scientific community, thanks to its ability to solve problems like data scarcity and privacy, and is starting to find applications in air transport. We here tackle the problem of generating synthetic, yet realistic, time series of delays at airports, starting from large collections of operations in Europe and the US. We specifically compare three models, two of them based on state of the art Deep Learning algorithms, and one simplified Genetic Algorithm approach. We show how the latter can generate time series that are almost indistinguishable from real ones, while maintaining a high variability. We further validate the resulting time series in a problem of detecting delay propagations between airports. We finally make the synthetic data available to the scientific community.
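A minimal sketch of the simplified Genetic Algorithm idea: a population of candidate delay series is evolved toward the summary statistics of a toy "real" series. The fitness function (matching mean and standard deviation), mutation scheme, and all numbers are illustrative assumptions, not the paper's actual setup.

```python
import random

random.seed(0)

REAL = [12, 5, 0, 30, 8, 22, 3, 15]          # toy delay series (minutes)
TARGET_MEAN = sum(REAL) / len(REAL)
TARGET_STD = (sum((x - TARGET_MEAN) ** 2 for x in REAL) / len(REAL)) ** 0.5

def fitness(series):
    """Higher is better: penalize distance to the real series'
    mean and standard deviation (a toy realism criterion)."""
    m = sum(series) / len(series)
    s = (sum((x - m) ** 2 for x in series) / len(series)) ** 0.5
    return -abs(m - TARGET_MEAN) - abs(s - TARGET_STD)

def mutate(series):
    out = series[:]
    i = random.randrange(len(out))
    out[i] = max(0.0, out[i] + random.gauss(0, 3))   # delays stay >= 0
    return out

# Elitist GA: keep the 10 fittest, refill with mutated copies.
pop = [[random.uniform(0, 40) for _ in REAL] for _ in range(30)]
for _ in range(300):
    pop.sort(key=fitness, reverse=True)
    pop = pop[:10] + [mutate(random.choice(pop[:10])) for _ in range(20)]
best = max(pop, key=fitness)
```

A real application would score candidates on richer statistics (autocorrelation, propagation structure), but the select-mutate loop is the same.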


【7】FronTalk: Benchmarking Front-End Development as Conversational Code Generation with Multi-Modal Feedback
标题:FronTalk:将前端开发作为具有多模态反馈的对话式代码生成进行基准测试
链接:https://arxiv.org/abs/2601.04203

作者:Xueqing Wu,Zihan Xue,Da Yin,Shuyan Zhou,Kai-Wei Chang,Nanyun Peng,Yeming Wen
摘要 :我们提出了FronTalk,前端代码生成的基准,开创了一个独特的互动动态的研究:会话代码生成与多模态反馈。在前端开发中,草图、模型和带注释的屏幕截图等视觉工件对于传达设计意图至关重要,但它们在多轮代码生成中的作用在很大程度上仍未得到探索。为了解决这一差距,我们专注于前端开发任务和策划FronTalk,这是一个来自新闻,金融和艺术等不同领域的真实网站的100个多回合对话的集合。每个回合都有一个文本指令和一个等效的视觉指令,每个指令都代表相同的用户意图。为了全面评估模型的性能,我们提出了一种新的基于代理的评估框架,利用Web代理来模拟用户和探索网站,从而衡量功能的正确性和用户体验。对20个模型的评估揭示了两个在文献中系统地探索不足的关键挑战:(1)一个重要的遗忘问题,即模型覆盖先前实现的功能,导致任务失败,以及(2)解释视觉反馈的持续挑战,特别是对于开源视觉语言模型(VLM)。我们提出了一个强有力的基线来解决遗忘问题与AceCoder,一种方法,批评的每一个过去的指令使用一个自主的Web代理的实现。这种方法显著地将遗忘减少到几乎为零,并将性能提高了9.3%(56.0%到65.3%)。总的来说,我们的目标是为未来的前端开发和多回合,多模态代码生成的一般交互动力学研究提供坚实的基础。代码和数据发布于https://github.com/shirley-wu/frontalk
摘要:We present FronTalk, a benchmark for front-end code generation that pioneers the study of a unique interaction dynamic: conversational code generation with multi-modal feedback. In front-end development, visual artifacts such as sketches, mockups and annotated screenshots are essential for conveying design intent, yet their role in multi-turn code generation remains largely unexplored. To address this gap, we focus on the front-end development task and curate FronTalk, a collection of 100 multi-turn dialogues derived from real-world websites across diverse domains such as news, finance, and art. Each turn features both a textual instruction and an equivalent visual instruction, each representing the same user intent. To comprehensively evaluate model performance, we propose a novel agent-based evaluation framework leveraging a web agent to simulate users and explore the website, and thus measuring both functional correctness and user experience. Evaluation of 20 models reveals two key challenges that are under-explored systematically in the literature: (1) a significant forgetting issue where models overwrite previously implemented features, resulting in task failures, and (2) a persistent challenge in interpreting visual feedback, especially for open-source vision-language models (VLMs). We propose a strong baseline to tackle the forgetting issue with AceCoder, a method that critiques the implementation of every past instruction using an autonomous web agent. This approach significantly reduces forgetting to nearly zero and improves the performance by up to 9.3% (56.0% to 65.3%). Overall, we aim to provide a solid foundation for future research in front-end development and the general interaction dynamics of multi-turn, multi-modal code generation. Code and data are released at https://github.com/shirley-wu/frontalk


【8】Listen to Rhythm, Choose Movements: Autoregressive Multimodal Dance Generation via Diffusion and Mamba with Decoupled Dance Dataset
标题:聆听节奏,选择动作:基于扩散和Mamba以及解耦舞蹈数据集的自回归多模态舞蹈生成
链接:https://arxiv.org/abs/2601.03323

作者:Oran Duan,Yinghua Shen,Yingzhu Lv,Luyang Jie,Yaxin Liu,Qiong Wu
备注:12 pages, 13 figures
摘要:生成模型和序列学习的发展极大地促进了舞蹈动作生成的研究,但目前的方法仍然存在语义控制粗糙和长序列连贯性差的问题。在这项工作中,我们提出了"聆听节奏,选择动作"(LRCM),一个多模态引导的扩散框架,支持多样的输入模态和自回归舞蹈动作生成。我们探索了一种舞蹈数据集的特征解耦范式,并将其推广到Motorica Dance数据集,分离运动捕捉数据、音频节奏以及专业标注的全局和局部文本描述。我们的扩散架构集成了一个音频潜在Conformer和一个文本潜在Cross-Conformer,并采用了运动时间Mamba模块(MTMM),以实现平滑的长时程自回归合成。实验结果表明,LRCM在功能能力和定量指标上均表现出色,在多模态输入场景和扩展序列生成中展现了显著潜力。我们将在论文被接收后公开发布完整的代码库、数据集和预训练模型。
摘要:Advances in generative models and sequence learning have greatly promoted research in dance motion generation, yet current methods still suffer from coarse semantic control and poor coherence in long sequences. In this work, we present Listen to Rhythm, Choose Movements (LRCM), a multimodal-guided diffusion framework supporting both diverse input modalities and autoregressive dance motion generation. We explore a feature decoupling paradigm for dance datasets and generalize it to the Motorica Dance dataset, separating motion capture data, audio rhythm, and professionally annotated global and local text descriptions. Our diffusion architecture integrates an audio-latent Conformer and a text-latent Cross-Conformer, and incorporates a Motion Temporal Mamba Module (MTMM) to enable smooth, long-duration autoregressive synthesis. Experimental results indicate that LRCM delivers strong performance in both functional capability and quantitative metrics, demonstrating notable potential in multimodal input scenarios and extended sequence generation. We will release the full codebase, dataset, and pretrained models publicly upon acceptance.


【9】Exponential capacity scaling of classical GANs compared to hybrid latent style-based quantum GANs
标题:与基于混合潜在风格的量子GAN相比,经典GAN的指数容量扩展
链接:https://arxiv.org/abs/2601.05036

作者:Milan Liepelt,Julien Baglio
备注:34 pages, 7 figures, 7 tables
摘要:量子生成建模是一个非常活跃的研究领域,致力于在数据分析中寻找实际优势。量子生成对抗网络(QGAN)是量子生成建模的主要候选者,并已应用于从高能物理到图像生成的各个领域。基于潜在风格的QGAN依赖于经典的变分自编码器将输入数据编码到潜在空间中,然后使用基于风格的QGAN进行数据生成,已被证明对于图像生成或药物设计是有效的,暗示其使用比经典对应物少得多的可训练参数即可实现相当的性能,然而这种优势从未被系统地研究过。在这项工作中,我们首次对应用于SAT4图像生成的QGAN的这一优势进行了全面的实验分析,发现在混合潜在风格QGAN架构中,量子生成器的容量缩放具有指数级优势。仔细调整自编码器对于获得稳定、可靠的结果至关重要。一旦完成这种调整,并将训练最优性定义为训练稳定且FID分数低而稳定时,经典判别器的最优容量(即可训练参数的数量)相对于量子生成器的容量呈指数级缩放,经典生成器的容量亦是如此。这暗示了量子生成建模的一种量子优势。
摘要 :Quantum generative modeling is a very active area of research in looking for practical advantage in data analysis. Quantum generative adversarial networks (QGANs) are leading candidates for quantum generative modeling and have been applied to diverse areas, from high-energy physics to image generation. The latent style-based QGAN, relying on a classical variational autoencoder to encode the input data into a latent space and then using a style-based QGAN for data generation has been proven to be efficient for image generation or drug design, hinting at the use of far less trainable parameters than their classical counterpart to achieve comparable performance, however this advantage has never been systematically studied. We present in this work the first comprehensive experimental analysis of this advantage of QGANS applied to SAT4 image generation, obtaining an exponential advantage in capacity scaling for a quantum generator in the hybrid latent style-based QGAN architecture. Careful tuning of the autoencoder is crucial to obtain stable, reliable results. Once this tuning is performed and defining training optimality as when the training is stable and the FID score is low and stable as well, the optimal capacity (or number of trainable parameters) of the classical discriminator scales exponentially with respect to the capacity of the quantum generator, and the same is true for the capacity of the classical generator. This hints toward a type of quantum advantage for quantum generative modeling.


【10】Crystal Generation using the Fully Differentiable Pipeline and Latent Space Optimization
标题:使用完全可区分管道和潜在空间优化的晶体生成
链接:https://arxiv.org/abs/2601.04606

作者:Osman Goni Ridwan,Gilles Frapper,Hongfei Xue,Qiang Zhu
摘要:我们提出了一个材料生成框架,该框架将对称性条件变分自编码器(CVAE)与可微SO(3)功率谱目标耦合,以在晶体学约束下将候选结构导向指定的局部环境。特别是,我们实现了一个完全可微的管道,对直接晶体表示和潜在晶体表示执行批量优化。借助GPU加速,与我们以前的CPU工作流程相比实现了约五倍的加速,同时产生相当的结果。此外,我们介绍了在直接晶体表示和潜在晶体表示之间交替进行优化的策略。这种双层松弛方法可以有效地克服由不同目标梯度定义的局部势垒,从而提高生成满足目标局部环境的复杂结构的成功率。该框架可以扩展到由多组分和多环境组成的系统,为生成具有目标局部环境的材料结构提供了一条可扩展的途径。
摘要:We present a materials generation framework that couples a symmetry-conditioned variational autoencoder (CVAE) with a differentiable SO(3) power spectrum objective to steer candidates toward a specified local environment under crystallographic constraints. In particular, we implement a fully differentiable pipeline that performs batch-wise optimization on both direct and latent crystallographic representations. Using GPU acceleration, the implementation achieves about a fivefold speedup over our previous CPU workflow, while yielding comparable outcomes. In addition, we introduce an optimization strategy that alternately performs optimization on the direct and latent crystal representations. This dual-level relaxation approach can effectively overcome local barriers defined by different objective gradients, thus increasing the success rate of generating complex structures satisfying the target local environments. This framework can be extended to systems consisting of multiple components and environments, providing a scalable route to generate material structures with the target local environment.
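The alternating dual-level update can be sketched in one dimension with a toy linear "decoder" and a quadratic objective. This shows only the mechanics of interleaving a direct-space gradient step with a latent-space step through the decoder, not the paper's crystallographic objective.

```python
def f(x):                 # toy objective on the direct representation
    return (x - 3.0) ** 2

def df(x):                # its gradient
    return 2.0 * (x - 3.0)

def dec(z):               # toy linear "decoder" from latent to direct space
    return 2.0 * z

def optimize(x, lr=0.1, iters=50):
    """Alternate one gradient step on the direct representation x
    with one step on the latent representation z (via the chain
    rule through the decoder), then decode back."""
    for _ in range(iters):
        x = x - lr * df(x)              # direct-space step
        z = x / 2.0                     # encode current structure
        z = z - lr * df(dec(z)) * 2.0   # latent step (d dec / dz = 2)
        x = dec(z)                      # decode back to direct space
    return x

x_final = optimize(10.0)
```

In the paper's setting the two spaces see different gradient landscapes, which is what lets the alternation step over barriers that trap single-space descent; here both steps simply cooperate and converge to the minimum at x = 3.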


半/弱/无/有监督|不确定性|主动学习(5篇)

【1】Improving Semi-Supervised Contrastive Learning via Entropy-Weighted Confidence Integration of Anchor-Positive Pairs
标题:通过锚定-正样本对的熵加权置信度集成改进半监督对比学习
链接:https://arxiv.org/abs/2601.04555

作者:Shogo Nakayama,Masahiro Okuda
摘要:传统的半监督对比学习方法仅将伪标签分配给其最高预测类概率超过预定义阈值的样本,然后使用这些选定的样本执行有监督对比学习。在本研究中,我们提出了一种新的损失函数,基于每个样本预测概率分布的熵来估计其置信度,并应用基于置信度的自适应加权。这种方法甚至能够为先前被排除在训练之外的样本分配伪标签,并以更有原则的方式促进同时考虑锚样本和正样本置信度的对比学习。实验结果表明,该方法提高了分类精度,即使在低标签条件下也能获得更稳定的学习性能。
摘要:Conventional semi-supervised contrastive learning methods assign pseudo-labels only to samples whose highest predicted class probability exceeds a predefined threshold, and then perform supervised contrastive learning using those selected samples. In this study, we propose a novel loss function that estimates the confidence of each sample based on the entropy of its predicted probability distribution and applies confidence-based adaptive weighting. This approach enables pseudo-label assignment even to samples that were previously excluded from training and facilitates contrastive learning that accounts for the confidence of both anchor and positive samples in a more principled manner. Experimental results demonstrate that the proposed method improves classification accuracy and achieves more stable learning performance even under low-label conditions.
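The entropy-based confidence weight described above can be sketched directly. Normalizing the entropy by the log of the class count is one natural choice; the exact weighting used in the paper may differ.

```python
import math

def confidence(probs):
    """Entropy-based confidence in [0, 1]: 1 for a one-hot
    prediction, 0 for a uniform one (entropy normalized by the
    log of the class count)."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return 1.0 - h / math.log(len(probs))

def pair_weight(p_anchor, p_positive):
    # Weight a contrastive pair by the confidence of BOTH views,
    # instead of hard-thresholding on max pseudo-label probability.
    return confidence(p_anchor) * confidence(p_positive)
```

Unlike a hard threshold, this keeps low-confidence samples in the loss with a proportionally small weight rather than discarding them outright.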


【2】Integrating Distribution Matching into Semi-Supervised Contrastive Learning for Labeled and Unlabeled Data
标题:将分布匹配集成到标签和未标签数据的半监督对比学习中
链接:https://arxiv.org/abs/2601.04518

作者:Shogo Nakayama,Masahiro Okuda
备注:ITC-CSCC accepted
摘要:深度学习的进步极大地改善了监督图像分类。然而,标记数据的成本很高,这促使人们研究无监督学习方法,如对比学习。在现实世界中,完全未标记的数据集是罕见的,使得半监督学习(SSL)在少量标记数据与大量未标记数据共存的情况下高度相关。一种众所周知的半监督对比学习方法涉及为未标记的数据分配伪标签。本研究旨在通过在标记和未标记特征嵌入之间进行分布匹配来增强基于伪标签的SSL,以提高多个数据集的图像分类精度。
摘要:The advancement of deep learning has greatly improved supervised image classification. However, labeling data is costly, prompting research into unsupervised learning methods such as contrastive learning. In real-world scenarios, fully unlabeled datasets are rare, making semi-supervised learning (SSL) highly relevant in scenarios where a small amount of labeled data coexists with a large volume of unlabeled data. A well-known semi-supervised contrastive learning approach involves assigning pseudo-labels to unlabeled data. This study aims to enhance pseudo-label-based SSL by incorporating distribution matching between labeled and unlabeled feature embeddings to improve image classification accuracy across multiple datasets.
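One common choice of distribution-matching penalty between labeled and unlabeled feature embeddings is the squared Maximum Mean Discrepancy (MMD). Below is a 1-D sketch assuming an RBF kernel; the paper's exact matching loss may differ.

```python
import math

def rbf(x, y, gamma=1.0):
    return math.exp(-gamma * (x - y) ** 2)

def mmd2(xs, ys, gamma=1.0):
    """Squared Maximum Mean Discrepancy between two 1-D samples:
    zero when the samples coincide, larger the further apart the
    two distributions are (biased estimator, diagonal included)."""
    kxx = sum(rbf(a, b, gamma) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(rbf(a, b, gamma) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2.0 * kxy

labeled = [0.1, 0.2, 0.0, 0.15]           # toy labeled embeddings
unlabeled_near = [0.12, 0.18, 0.05]       # similar distribution
unlabeled_far = [2.0, 2.2, 1.9]           # shifted distribution
```

Adding such a term to the training objective penalizes embeddings whose labeled and unlabeled distributions drift apart, which is the intuition behind the proposed distribution matching.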


【3】Stochastic Deep Learning: A Probabilistic Framework for Modeling Uncertainty in Structured Temporal Data
标题:随机深度学习:一种用于建模结构化时态数据中不确定性的概率框架
链接:https://arxiv.org/abs/2601.05227

作者:James Rice
备注:20 pages, 6330 words
摘要:我提出了一个新的框架,将随机微分方程(SDE)与深度生成模型相结合,以改善涉及结构化和时态数据的机器学习应用中的不确定性量化。这种方法被称为随机潜在微分推理(SLDI),它将Itô SDE嵌入到变分自编码器的潜在空间中,允许对不确定性进行灵活的连续时间建模,同时保留有原则的数学基础。通过神经网络对SDE的漂移项和扩散项进行参数化,从而实现数据驱动的推理,并将经典时间序列模型推广到处理不规则采样和复杂的动态结构。   一个核心的理论贡献是用一个专用神经网络对伴随状态进行协同参数化,形成一个不仅捕捉潜在演化、而且捕捉梯度动态的耦合前向-后向系统。我引入了一个路径正则化的伴随损失,并通过随机微积分的视角分析方差缩减的梯度流,为提高深度潜在SDE的训练稳定性提供了新工具。我的论文统一并扩展了变分推理、连续时间生成建模和控制论优化,为随机概率机器学习的未来发展提供了严格的基础。
摘要:I propose a novel framework that integrates stochastic differential equations (SDEs) with deep generative models to improve uncertainty quantification in machine learning applications involving structured and temporal data. This approach, termed Stochastic Latent Differential Inference (SLDI), embeds an Itô SDE in the latent space of a variational autoencoder, allowing for flexible, continuous-time modeling of uncertainty while preserving a principled mathematical foundation. The drift and diffusion terms of the SDE are parameterized by neural networks, enabling data-driven inference and generalizing classical time series models to handle irregular sampling and complex dynamic structure.   A central theoretical contribution is the co-parameterization of the adjoint state with a dedicated neural network, forming a coupled forward-backward system that captures not only latent evolution but also gradient dynamics. I introduce a pathwise-regularized adjoint loss and analyze variance-reduced gradient flows through the lens of stochastic calculus, offering new tools for improving training stability in deep latent SDEs. My paper unifies and extends variational inference, continuous-time generative modeling, and control-theoretic optimization, providing a rigorous foundation for future developments in stochastic probabilistic machine learning.
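The latent Itô SDE can be illustrated with Euler-Maruyama integration of dz = f(z) dt + g(z) dW, substituting toy drift and diffusion functions for the paper's neural-network parameterizations.

```python
import math
import random

random.seed(42)

def drift(z):
    # Toy stand-in for the learned drift f(z): pull toward 1.0
    # (an Ornstein-Uhlenbeck-style mean reversion).
    return 1.0 - z

def diffusion(z):
    # Toy stand-in for the learned diffusion g(z): constant scale.
    return 0.2

def simulate(z0, t_end=5.0, n=500):
    """Euler-Maruyama integration of dz = f(z) dt + g(z) dW."""
    dt = t_end / n
    z = z0
    for _ in range(n):
        dw = random.gauss(0.0, math.sqrt(dt))   # Brownian increment
        z = z + drift(z) * dt + diffusion(z) * dw
    return z

samples = [simulate(0.0) for _ in range(200)]
mean = sum(samples) / len(samples)
```

With this mean-reverting drift the latent state settles near 1.0 while the diffusion term keeps a spread of trajectories, which is exactly the kind of calibrated uncertainty the latent SDE is meant to carry.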


【4】Quantitative mapping from conventional MRI using self-supervised physics-guided deep learning: applications to a large-scale, clinically heterogeneous dataset
标题:使用自监督物理学引导的深度学习从常规MRI进行定量映射:应用于大规模临床异质数据集
链接:https://arxiv.org/abs/2601.05063

作者:Jelmer van Lune,Stefano Mandija,Oscar van der Heide,Matteo Maspero,Martin B. Schilder,Jan Willem Dankbaar,Cornelis A. T. van den Berg,Alessandro Sbrizzi
备注:30 pages, 13 figures, full paper
摘要:磁共振成像(MRI)是临床神经成像的基石,但传统MRI提供的定性信息严重依赖于扫描仪硬件和采集设置。虽然定量MRI(qMRI)提供了内在的组织参数,但对专门采集协议和重建算法的要求限制了其可用性,并阻碍了大规模生物标志物研究。这项研究提出了一个自监督、物理引导的深度学习框架,可以直接从广泛使用的临床常规T1加权、T2加权和FLAIR MRI中推断定量T1、T2和质子密度(PD)图。该框架在一个大规模、临床异质的数据集上进行了训练和评估,该数据集包括六年间在我们机构的四种不同3 T MRI扫描仪系统上采集的4,121个扫描会话,捕获了真实世界的临床变异性。该框架将基于Bloch的信号模型直接集成到训练目标中。在超过600个测试会话中,生成的参数图显示白质和灰质值与文献范围一致。此外,生成的参数图对扫描仪硬件和采集协议分组保持不变,组间变异系数≤1.1%。针对特定受试者的分析表明,在不同扫描仪系统和序列参数下具有出色的体素级再现性,T1和T2的皮尔逊$r$和一致性相关系数均超过0.82。所有定量参数的平均相对体素差异均较低,尤其是T2(<6%)。这些结果表明,所提出的框架可以稳健地将不同的临床常规MRI数据转换为定量参数图,为大规模定量生物标志物研究铺平了道路。
摘要:Magnetic resonance imaging (MRI) is a cornerstone of clinical neuroimaging, yet conventional MRIs provide qualitative information heavily dependent on scanner hardware and acquisition settings. While quantitative MRI (qMRI) offers intrinsic tissue parameters, the requirement for specialized acquisition protocols and reconstruction algorithms restricts its availability and impedes large-scale biomarker research. This study presents a self-supervised physics-guided deep learning framework to infer quantitative T1, T2, and proton-density (PD) maps directly from widely available clinical conventional T1-weighted, T2-weighted, and FLAIR MRIs. The framework was trained and evaluated on a large-scale, clinically heterogeneous dataset comprising 4,121 scan sessions acquired at our institution over six years on four different 3 T MRI scanner systems, capturing real-world clinical variability. The framework integrates Bloch-based signal models directly into the training objective. Across more than 600 test sessions, the generated maps exhibited white matter and gray matter values consistent with literature ranges. Additionally, the generated maps showed invariance to scanner hardware and acquisition protocol groups, with inter-group coefficients of variation $\leq$ 1.1%. Subject-specific analyses demonstrated excellent voxel-wise reproducibility across scanner systems and sequence parameters, with Pearson $r$ and concordance correlation coefficients exceeding 0.82 for T1 and T2. Mean relative voxel-wise differences were low across all quantitative parameters, especially for T2 ($<6\%$). These results demonstrate that the proposed framework robustly translates heterogeneous clinical routine MRI data into quantitative maps, paving the way for large-scale quantitative biomarker research.
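The physics-guided, self-supervised objective can be sketched with the classic spin-echo approximation S = PD (1 - exp(-TR/T1)) exp(-TE/T2): predicted maps are re-synthesized into signals and compared against the measured contrasts, so no ground-truth quantitative maps are ever needed. The sequence model, protocol, and tissue values below are illustrative stand-ins for the paper's sequence-specific Bloch models.

```python
import math

def spin_echo_signal(pd, t1, t2, tr, te):
    """Classic spin-echo approximation:
    S = PD * (1 - exp(-TR/T1)) * exp(-TE/T2)."""
    return pd * (1.0 - math.exp(-tr / t1)) * math.exp(-te / t2)

def self_supervised_loss(pred_maps, acquisitions):
    """Re-synthesize signals from predicted (PD, T1, T2) and compare
    them to the measured images: the training signal comes from the
    physics model, not from ground-truth quantitative maps."""
    pd, t1, t2 = pred_maps
    return sum((spin_echo_signal(pd, t1, t2, tr, te) - s) ** 2
               for (tr, te), s in acquisitions)

# Simulate "measured" contrasts from known tissue values
# (plausible white-matter numbers; PD in a.u., T1/T2 in ms).
truth = (1.0, 900.0, 80.0)
protocol = [(500.0, 15.0), (4000.0, 100.0), (9000.0, 120.0)]
meas = [((tr, te), spin_echo_signal(*truth, tr, te)) for tr, te in protocol]
```

Minimizing this loss over (PD, T1, T2) recovers the generating tissue parameters whenever the protocol makes them identifiable.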


【5】Prediction of Cellular Malignancy Using Electrical Impedance Signatures and Supervised Machine Learning
标题:使用电阻抗特征和监督机器学习预测细胞恶性程度
链接:https://arxiv.org/abs/2601.04478

作者:Shadeeb Hossain
摘要:细胞的生物电特性,如相对介电常数、电导率和特征时间常数,在不同频率下在健康细胞和恶性细胞之间变化显著。这些区别为诊断和分类应用提供了有希望的基础。本研究系统地回顾了33篇学术文章,以汇编定量生物电参数的数据集,并评估其在预测建模中的实用性。实现并使用关键超参数调整了三种监督机器学习算法(随机森林(RF)、支持向量机(SVM)和K最近邻(KNN))来评估分类性能。使用准确性和F1评分作为性能指标来评估模型有效性。结果表明,当配置最大深度为4和100个估计量时,随机森林实现了约90%的最高预测准确度。这些发现突出了将生物电特性分析与机器学习相结合以改善诊断决策的潜力。类似地,对于KNN和SVM,F1得分分别达到约78%和76.5%的峰值。未来的工作将探索纳入额外的判别特征,利用模拟数据集,并通过高级搜索策略优化超参数。最终,具有嵌入式微电极和实时控制系统的硬件原型可以为能够进行原位细胞分类的实用诊断工具铺平道路。
摘要:Bioelectrical properties of cells such as relative permittivity, conductivity, and characteristic time constants vary significantly between healthy and malignant cells across different frequencies. These distinctions provide a promising foundation for diagnostic and classification applications. This study systematically reviewed 33 scholarly articles to compile datasets of quantitative bioelectric parameters and evaluated their utility in predictive modeling. Three supervised machine learning algorithms, Random Forest (RF), Support Vector Machine (SVM), and K-Nearest Neighbor (KNN), were implemented and tuned using key hyperparameters to assess classification performance. Model effectiveness was evaluated using accuracy and F1 score as performance metrics. Results demonstrate that Random Forest achieved the highest predictive accuracy of ~ 90% when configured with a maximum depth of 4 and 100 estimators. These findings highlight the potential of integrating bioelectrical property analysis with machine learning for improved diagnostic decision-making. Similarly, for KNN and SVM, the F1 score peaked at approximately 78% and 76.5%, respectively. Future work will explore incorporating additional discriminative features, leveraging simulated datasets, and optimizing hyperparameters through advanced search strategies. Ultimately, a hardware prototype with embedded micro-electrodes and real-time control systems could pave the path for practical diagnostic tools capable of in-situ cell classification.


迁移|Zero/Few/One-Shot|自适应(7篇)

【1】Leveraging Prediction Entropy for Automatic Prompt Weighting in Zero-Shot Audio-Language Classification
标题:在Zero-Shot音频语言分类中利用预测熵进行自动提示加权
链接:https://arxiv.org/abs/2601.05011

作者:Karim El Khoury,Maxime Zanella,Tiffanie Godelaine,Christophe De Vleeschouwer,Benoit Macq
摘要:最近,音频语言模型通过利用自然语言监督来对没有标记训练数据的音频事件进行分类,展示了强大的zero-shot能力。然而,它们的表现对文本提示的措辞高度敏感,微小的变化会导致准确性的大幅波动。以前的工作通过提示学习或提示集成来缓解这个问题。然而,这些策略要么需要标注数据,要么无法考虑某些提示可能对性能产生负面影响这一事实。在这项工作中,我们提出了一种熵引导的提示加权方法,旨在找到提示贡献的鲁棒组合,以最大化预测置信度。为此,我们制定了一个量身定制的目标函数,最小化预测熵以产生新的提示权重,利用低熵作为高置信度的代理。我们的方法可以应用于单个样本或一批音频样本,不需要额外的标签,产生的计算开销可以忽略不计。在涵盖环境、城市和人声的五个音频分类数据集上的实验表明,在zero-shot设置下,与经典的提示集成方法相比取得了一致的增益,在整个基准上的精度提升最高可达其5倍。
摘要:Audio-language models have recently demonstrated strong zero-shot capabilities by leveraging natural-language supervision to classify audio events without labeled training data. Yet, their performance is highly sensitive to the wording of text prompts, with small variations leading to large fluctuations in accuracy. Prior work has mitigated this issue through prompt learning or prompt ensembling. However, these strategies either require annotated data or fail to account for the fact that some prompts may negatively impact performance. In this work, we present an entropy-guided prompt weighting approach that aims to find a robust combination of prompt contributions to maximize prediction confidence. To this end, we formulate a tailored objective function that minimizes prediction entropy to yield new prompt weights, utilizing low-entropy as a proxy for high confidence. Our approach can be applied to individual samples or a batch of audio samples, requiring no additional labels and incurring negligible computational overhead. Experiments on five audio classification datasets covering environmental, urban, and vocal sounds, demonstrate consistent gains compared to classical prompt ensembling methods in a zero-shot setting, with accuracy improvements 5-times larger across the whole benchmark.
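The entropy-minimization idea can be sketched as follows: softmax-parameterized prompt weights are tuned so that the weighted mixture of per-prompt class probabilities has minimal entropy. The per-prompt probabilities and the finite-difference optimizer below are illustrative assumptions, not the paper's setup.

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    z = sum(e)
    return [v / z for v in e]

def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0)

# Per-prompt class probabilities for one audio clip (hypothetical
# numbers): prompts 0 and 1 agree on class 0; prompt 2 is noisy.
PROMPT_PROBS = [[0.8, 0.1, 0.1],
                [0.7, 0.2, 0.1],
                [0.2, 0.3, 0.5]]

def mixture_entropy(logits):
    w = softmax(logits)
    mix = [sum(wi * p[c] for wi, p in zip(w, PROMPT_PROBS))
           for c in range(3)]
    return entropy(mix)

def tune_weights(steps=200, lr=0.5, eps=1e-4):
    """Minimize prediction entropy over prompt weights via
    finite-difference gradient descent on the softmax logits
    (an illustrative stand-in for the paper's optimizer)."""
    logits = [0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = []
        for i in range(3):
            bumped = logits[:]
            bumped[i] += eps
            grad.append((mixture_entropy(bumped) - mixture_entropy(logits)) / eps)
        logits = [l - lr * g for l, g in zip(logits, grad)]
    return softmax(logits)

weights = tune_weights()
```

The optimization needs no labels: it simply shifts weight away from the noisy, high-entropy prompt and toward the confident ones.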


【2】Succeeding at Scale: Automated Multi-Retriever Fusion and Query-Side Adaptation for Multi-Tenant Search
标题:大规模成功:多租户搜索的自动多检索器融合和查询端自适应
链接:https://arxiv.org/abs/2601.04646

作者:Prateek Jain,Shabari S Nair,Ritesh Goru,Prakhar Agarwal,Ajay Yadav,Yoga Sri Varshan Varadharajan,Constantine Caramanis
摘要:大规模的多租户检索系统积累了大量的用户查询日志,但严重缺乏有效域适应所需的精心策划的相关性标签。模型更新的运营成本加剧了这种"暗数据"问题:联合微调查询编码器和文档编码器需要重新索引整个语料库,这在具有数千个独立索引的多租户环境中成本高得令人望而却步。为了解决这些双重挑战,我们引入了\textbf{DevRev Search},这是一个通过全自动管道构建的、面向技术客户支持的段落检索基准。我们采用\textbf{fusion-based candidate generation}策略,汇集来自不同稀疏和密集检索器的结果,并利用LLM作为法官来执行严格的\textbf{consistency filtering}和相关性分配。我们进一步提出了一个实用的\textbf{索引保持自适应}策略:通过低秩自适应(LoRA)仅微调查询编码器,我们在保持文档索引冻结的同时实现了有竞争力的性能改进。我们在DevRev Search和SciFact上的实验表明,针对查询编码器中特定的Transformer层进行调整可以产生最佳的质量-效率权衡,为个性化企业搜索提供了可扩展的路径。
摘要:Large-scale multi-tenant retrieval systems amass vast user query logs yet critically lack the curated relevance labels required for effective domain adaptation. This "dark data" problem is exacerbated by the operational cost of model updates: jointly fine-tuning query and document encoders requires re-indexing the entire corpus, which is prohibitive in multi-tenant environments with thousands of isolated indices. To address these dual challenges, we introduce \textbf{DevRev Search}, a passage retrieval benchmark for technical customer support constructed through a fully automatic pipeline. We employ a \textbf{fusion-based candidate generation} strategy, pooling results from diverse sparse and dense retrievers, and utilize an LLM-as-a-Judge to perform rigorous \textbf{consistency filtering} and relevance assignment. We further propose a practical \textbf{Index-Preserving Adaptation} strategy: by fine-tuning only the query encoder via Low-Rank Adaptation (LoRA), we achieve competitive performance improvements while keeping the document index frozen. Our experiments on DevRev Search and SciFact demonstrate that targeting specific transformer layers in the query encoder yields optimal quality-efficiency trade-offs, offering a scalable path for personalized enterprise search.
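The index-preserving idea rests on the standard LoRA update y = W x + (alpha/r) * B A x: the pretrained weight W stays frozen (so the document index built with it never has to be recomputed) and only the small A and B matrices are trained. A toy dense-layer sketch with made-up numbers:

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, alpha=2.0, r=1):
    """y = W x + (alpha / r) * B (A x). W is frozen; only the
    rank-r factors A (down-projection) and B (up-projection)
    would receive gradient updates during fine-tuning."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

# Frozen 2x2 query-encoder weight and a rank-1 adapter (toy numbers).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, 0.5]]          # r x d
B = [[0.1], [0.0]]        # d x r
y = lora_forward(W, A, B, [1.0, 2.0])
```

Because documents were embedded with the frozen W (here, the identity), only query vectors change under the adapter, which is exactly what keeps thousands of per-tenant indices valid.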


【3】DP-MGTD: Privacy-Preserving Machine-Generated Text Detection via Adaptive Differentially Private Entity Sanitization
标题:DP-MGTD:通过自适应差异私有实体清理进行隐私保护的机器生成文本检测
链接:https://arxiv.org/abs/2601.04641

作者 :Lionel Z. Wang,Yusheng Zhao,Jiabin Luo,Xinfeng Li,Lixu Wang,Yinan Peng,Haoyang Li,XiaoFeng Wang,Wei Dong
备注:12 pages, 1 figure, 1 tables
摘要:机器生成文本(MGT)检测系统的部署需要处理敏感的用户数据,这在作者身份验证和隐私保护之间产生了根本性的冲突。标准的匿名化技术通常会破坏语言流畅性,而严格的差分隐私(DP)机制通常会削弱准确检测所需的统计信号。为了解决这一困境,我们提出了\textbf{DP-MGTD},一个结合自适应差分隐私实体清理算法的框架。我们的方法采用两阶段机制,执行带噪频率估计并动态校准隐私预算,对数值实体和文本实体分别应用拉普拉斯机制和指数机制。至关重要的是,我们发现了一个反直觉的现象:DP噪声的应用通过暴露对扰动的不同敏感性模式,反而放大了人类文本与机器文本之间的可区分性。在MGTBench-2.0数据集上进行的大量实验表明,我们的方法达到了近乎完美的检测精度,在满足严格隐私保证的同时显著优于非私有基线。
摘要:The deployment of Machine-Generated Text (MGT) detection systems necessitates processing sensitive user data, creating a fundamental conflict between authorship verification and privacy preservation. Standard anonymization techniques often disrupt linguistic fluency, while rigorous Differential Privacy (DP) mechanisms typically degrade the statistical signals required for accurate detection. To resolve this dilemma, we propose \textbf{DP-MGTD}, a framework incorporating an Adaptive Differentially Private Entity Sanitization algorithm. Our approach utilizes a two-stage mechanism that performs noisy frequency estimation and dynamically calibrates privacy budgets, applying Laplace and Exponential mechanisms to numerical and textual entities respectively. Crucially, we identify a counter-intuitive phenomenon where the application of DP noise amplifies the distinguishability between human and machine text by exposing distinct sensitivity patterns to perturbation. Extensive experiments on the MGTBench-2.0 dataset show that our method achieves near-perfect detection accuracy, significantly outperforming non-private baselines while satisfying strict privacy guarantees.
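The first stage builds on the textbook Laplace mechanism for count queries: each entity frequency receives Laplace(sensitivity/epsilon) noise. A minimal sketch (the paper layers adaptive per-entity budget calibration on top of this):

```python
import math
import random

random.seed(7)

def laplace_noise(scale):
    """Sample Laplace(0, scale) by inverse-CDF from one uniform draw."""
    u = random.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def noisy_counts(counts, epsilon, sensitivity=1.0):
    """Noisy frequency estimation: perturbing each count with
    Laplace(sensitivity / epsilon) noise satisfies epsilon-DP for
    count queries whose sensitivity is 1 (one record changes a
    count by at most 1)."""
    scale = sensitivity / epsilon
    return {k: v + laplace_noise(scale) for k, v in counts.items()}

# Hypothetical entity frequencies from a text corpus.
counts = {"Alice": 12, "Acme Corp": 5, "2024-01-01": 3}
private = noisy_counts(counts, epsilon=1.0)
```

A smaller epsilon means a larger noise scale and stronger privacy; the adaptive variant in the paper spends different fractions of the budget on different entity types.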


【4】Causally-Aware Information Bottleneck for Domain Adaptation
标题:用于域适应的因果感知信息瓶颈
链接:https://arxiv.org/abs/2601.04361

作者:Mohammad Ali Javidian
备注:An extended abstract version of this work was accepted for the Proceedings of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)
摘要:我们解决了因果系统中的一个共同的域适应设置。在此设置中,目标变量在源域中被观察到,但在目标域中完全缺失。我们的目标是从不同变化下的剩余观测变量中估算目标域中的目标变量。我们将其定义为学习一个紧凑的,机制稳定的表示。这种表示保留了与预测目标相关的信息,同时丢弃了虚假变化。对于线性高斯因果模型,我们得到了一个封闭形式的高斯信息瓶颈(GIB)的解决方案。此解决方案简化为典型相关分析(CCA)风格的投影,并在需要时提供有向无环图(DAG)感知选项。对于非线性或非高斯数据,我们引入了变分信息瓶颈(VIB)编码器预测器。这种方法可以扩展到高维,并且可以在源数据上进行训练,并将zero-shot部署到目标域。在合成数据集和真实数据集上,我们的方法始终能够获得准确的估算,支持高维因果模型的实际使用,并为因果域适应提供统一的轻量级工具包。
摘要:We tackle a common domain adaptation setting in causal systems. In this setting, the target variable is observed in the source domain but is entirely missing in the target domain. We aim to impute the target variable in the target domain from the remaining observed variables under various shifts. We frame this as learning a compact, mechanism-stable representation. This representation preserves information relevant for predicting the target while discarding spurious variation. For linear Gaussian causal models, we derive a closed-form Gaussian Information Bottleneck (GIB) solution. This solution reduces to a canonical correlation analysis (CCA)-style projection and offers Directed Acyclic Graph (DAG)-aware options when desired. For nonlinear or non-Gaussian data, we introduce a Variational Information Bottleneck (VIB) encoder-predictor. This approach scales to high dimensions and can be trained on source data and deployed zero-shot to the target domain. Across synthetic and real datasets, our approach consistently attains accurate imputations, supporting practical use in high-dimensional causal models and furnishing a unified, lightweight toolkit for causal domain adaptation.
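在线性高斯设定下,"封闭形式解简化为CCA风格投影"大致可以这样理解:压缩表示取 β = Σ_xx⁻¹ Σ_xy 方向上的一维投影,而目标域中只需观测到的 X 即可计算该表示。下面用一个纯Python的二维玩具例子示意(数据生成过程与系数均为假设,非论文原始实验):

```python
import random

random.seed(0)
n = 2000
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(n)]
# 真实线性高斯机制:y = 2*x0 - x1 + 噪声(仅源域可观测 y)
y = [2.0 * x[0] - 1.0 * x[1] + 0.1 * random.gauss(0, 1) for x in X]

def mean(v):
    return sum(v) / len(v)

# 经验二阶矩(零均值数据,近似协方差)
s00 = mean([x[0] * x[0] for x in X]); s01 = mean([x[0] * x[1] for x in X])
s11 = mean([x[1] * x[1] for x in X])
s0y = mean([x[0] * yi for x, yi in zip(X, y)])
s1y = mean([x[1] * yi for x, yi in zip(X, y)])

# β = Σ_xx^{-1} Σ_xy(2x2 手解),给出一维压缩表示 T = βᵀx
det = s00 * s11 - s01 * s01
b0 = ( s11 * s0y - s01 * s1y) / det
b1 = (-s01 * s0y + s00 * s1y) / det
t = [b0 * x[0] + b1 * x[1] for x in X]   # 目标域仅凭 X 即可计算 T 并回填 y
```

在该玩具例子中 β 应恢复出真实系数 (2, -1) 附近的值,压缩表示 T 即是对缺失目标的估算。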


【5】Comparative Analysis of Custom CNN Architectures versus Pre-trained Models and Transfer Learning: A Study on Five Bangladesh Datasets
标题:定制CNN架构与预训练模型和迁移学习的比较分析:对五个孟加拉国数据集的研究
链接:https://arxiv.org/abs/2601.04352

作者:Ibrahim Tanvir,Alif Ruslan,Sartaj Solaiman
摘要:本研究使用特征提取和迁移学习方法对定制卷积神经网络(CNN)与流行的预训练架构(ResNet-18和VGG-16)进行了全面的比较分析。我们在来自孟加拉国的五个不同的图像分类数据集上评估了这些模型:足迹视觉,自动人力车检测,芒果图像分类,水稻品种识别和道路损坏检测。我们的实验结果表明,带有微调的迁移学习始终优于从头开始构建的自定义CNN和特征提取方法,在不同的数据集上实现了3%到76%的准确性提高。值得注意的是,经过微调的ResNet-18在Road Damage BD数据集上实现了完美的100%准确率。虽然自定义CNN在模型大小(3.4M参数与预训练模型的11- 134 M参数)和简单任务的训练效率方面具有优势,但具有迁移学习的预训练模型提供了卓越的性能,特别是在训练数据有限的复杂分类任务中。这项研究为从业者提供了实用的见解,可以根据数据集特征,计算资源和性能要求选择适当的深度学习方法。
摘要:This study presents a comprehensive comparative analysis of custom-built Convolutional Neural Networks (CNNs) against popular pre-trained architectures (ResNet-18 and VGG-16) using both feature extraction and transfer learning approaches. We evaluated these models across five diverse image classification datasets from Bangladesh: Footpath Vision, Auto Rickshaw Detection, Mango Image Classification, Paddy Variety Recognition, and Road Damage Detection. Our experimental results demonstrate that transfer learning with fine-tuning consistently outperforms both custom CNNs built from scratch and feature extraction methods, achieving accuracy improvements ranging from 3% to 76% across different datasets. Notably, ResNet-18 with fine-tuning achieved perfect 100% accuracy on the Road Damage BD dataset. While custom CNNs offer advantages in model size (3.4M parameters vs. 11-134M for pre-trained models) and training efficiency on simpler tasks, pre-trained models with transfer learning provide superior performance, particularly on complex classification tasks with limited training data. This research provides practical insights for practitioners in selecting appropriate deep learning approaches based on dataset characteristics, computational resources, and performance requirements.


【6】Learning to Reason: Temporal Saliency Distillation for Interpretable Knowledge Transfer
标题:学习推理:面向可解释知识迁移的时间显著性蒸馏
链接:https://arxiv.org/abs/2601.04263

作者:Nilushika Udayangani Hewa Dehigahawattage,Kishor Nandakishor,Marimuthu Palaniswami
备注:In Proceedings of the 27th European Conference on Artificial Intelligence (ECAI 2025), IOS Press
摘要:通过将知识从称为教师的较大网络转移到称为学生的较小网络,知识蒸馏已被证明对模型压缩有效。当前时间序列中的知识蒸馏主要基于最初为计算机视觉任务开发的logit和特征对齐技术。这些方法没有明确说明时间数据,在两个关键方面存在不足。首先,由于逻辑和特征的不可解释性,转移的知识帮助学生模型学习过程的机制仍然不清楚。其次,这些方法只传递有限的知识,主要是复制教师预测的准确性。因此,学生模型通常会产生与教师模型显著不同的预测分布,这阻碍了他们对教师模型的安全替代。在这项工作中,我们建议通过扩展传统的logit转移来传递可解释的知识,不仅要传达正确的预测,还要传达教师的正确推理。具体来说,我们从教师logits中归纳出其他有用的知识,称为时间显着性,它捕获了每个输入时间步对教师预测的重要性。通过使用时间显著性蒸馏训练学生,我们鼓励它根据与教师相同的输入特征进行预测。时间显著性蒸馏不需要额外的参数或架构特定的假设。我们证明了时间显着性蒸馏有效地提高了基线方法的性能,同时还实现了超出预测精度的理想特性。我们希望我们的工作建立了一个新的范式,可解释的知识提取的时间序列分析。
摘要:Knowledge distillation has proven effective for model compression by transferring knowledge from a larger network called the teacher to a smaller network called the student. Current knowledge distillation in time series is predominantly based on logit and feature aligning techniques originally developed for computer vision tasks. These methods do not explicitly account for temporal data and fall short in two key aspects. First, the mechanisms by which the transferred knowledge helps the student model learning process remain unclear due to uninterpretability of logits and features. Second, these methods transfer only limited knowledge, primarily replicating the teacher predictive accuracy. As a result, student models often produce predictive distributions that differ significantly from those of their teachers, hindering their safe substitution for teacher models. In this work, we propose transferring interpretable knowledge by extending conventional logit transfer to convey not just the right prediction but also the right reasoning of the teacher. Specifically, we induce other useful knowledge from the teacher logits termed temporal saliency which captures the importance of each input timestep to the teacher prediction. By training the student with Temporal Saliency Distillation we encourage it to make predictions based on the same input features as the teacher. Temporal Saliency Distillation requires no additional parameters or architecture specific assumptions. We demonstrate that Temporal Saliency Distillation effectively improves the performance of baseline methods while also achieving desirable properties beyond predictive accuracy. We hope our work establishes a new paradigm for interpretable knowledge distillation in time series analysis.
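论文未给出实现细节;下面用"遮挡法"近似时间显著性(将某时间步置零并测量教师对真实类 logit 的下降),再用 KL 散度对齐学生与教师的显著性分布。玩具线性"模型"与数值仅作示意性假设:

```python
import math

def temporal_saliency(model, x, y):
    """逐时间步遮挡:显著性 = 遮挡后真实类logit的下降量(截断为非负并归一化)。"""
    base = model(x)[y]
    raw = []
    for t in range(len(x)):
        x_occ = list(x)
        x_occ[t] = 0.0
        raw.append(max(base - model(x_occ)[y], 0.0))
    s = sum(raw)
    return [v / s for v in raw] if s > 0 else [1.0 / len(x)] * len(x)

def tsd_loss(teacher_sal, student_sal, eps=1e-8):
    """KL(教师显著性 || 学生显著性),作为蒸馏目标的附加损失项。"""
    return sum((p + eps) * math.log((p + eps) / (q + eps))
               for p, q in zip(teacher_sal, student_sal))

# 玩具教师:两类logit分别只依赖前两步与后两步
teacher = lambda x: [x[0] + x[1], x[2] + x[3]]
x, y = [2.0, 0.0, 1.0, 1.0], 0
t_sal = temporal_saliency(teacher, x, y)     # 全部显著性质量落在第0步
loss = tsd_loss(t_sal, [0.25] * 4)           # 学生若均匀关注各步,则损失为正
```

训练时将该损失与常规 logit 蒸馏损失相加,即可促使学生依据与教师相同的时间步做出预测。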


【7】CAOS: Conformal Aggregation of One-Shot Predictors
标题:CAOS:单样本预测器的保形聚合
链接:https://arxiv.org/abs/2601.05219

作者:Maja Waldron
摘要:单样本预测使预训练的基础模型能够仅使用一个标注示例快速适应新任务,但缺乏有原则的不确定性量化。虽然保形预测提供有限样本覆盖保证,但由于数据切分以及对单一预测器的依赖,标准的分裂保形方法在单样本设定中效率低下。我们提出了单样本预测器的保形聚合(CAOS),这是一个保形框架,自适应地聚合多个单样本预测器,并使用留一校准方案充分利用稀缺的标注数据。尽管违反了经典的可交换性假设,我们利用基于单调性的论证证明了CAOS实现了有效的边际覆盖。在单样本面部关键点定位和RAFT文本分类任务上的实验表明,CAOS产生的预测集比分裂保形基线小得多,同时保持可靠的覆盖率。
摘要:One-shot prediction enables rapid adaptation of pretrained foundation models to new tasks using only one labeled example, but lacks principled uncertainty quantification. While conformal prediction provides finite-sample coverage guarantees, standard split conformal methods are inefficient in the one-shot setting due to data splitting and reliance on a single predictor. We propose Conformal Aggregation of One-Shot Predictors (CAOS), a conformal framework that adaptively aggregates multiple one-shot predictors and uses a leave-one-out calibration scheme to fully exploit scarce labeled data. Despite violating classical exchangeability assumptions, we prove that CAOS achieves valid marginal coverage using a monotonicity-based argument. Experiments on one-shot facial landmarking and RAFT text classification tasks show that CAOS produces substantially smaller prediction sets than split conformal baselines while maintaining reliable coverage.
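以回归为例,"留一校准 + 预测器聚合"可以勾勒如下:对每个标注点 i,用"去掉 i 后误差最小"的预测器给 i 打不合格分数,使同一批稀缺标注既用于选择聚合又用于校准(此处的选择式聚合与分数函数均为示意性假设,非论文原始定义):

```python
import math

def caos_interval(predictors, X, Y, x_new, alpha=0.1):
    """留一校准的保形区间:分数 = |y_i - 留一选出的预测器(x_i)|。"""
    def best_without(i):
        def loo_mse(f):
            return sum((Y[j] - f(X[j])) ** 2 for j in range(len(X)) if j != i)
        return min(predictors, key=loo_mse)

    scores = sorted(abs(Y[i] - best_without(i)(X[i])) for i in range(len(X)))
    n = len(scores)
    k = min(n - 1, math.ceil((n + 1) * (1.0 - alpha)) - 1)  # ⌈(n+1)(1-α)⌉阶次序统计量
    radius = scores[k]
    # 最终用全部数据选出的预测器给出区间中心
    f_star = min(predictors, key=lambda f: sum((yi - f(xi)) ** 2
                                               for xi, yi in zip(X, Y)))
    c = f_star(x_new)
    return c - radius, c + radius

# 两个"单样本预测器",其中第一个恰好拟合校准数据
preds = [lambda x: x + 0.5, lambda x: 2.0 * x]
X, Y = [0.0, 1.0, 2.0, 3.0, 4.0], [0.5, 1.5, 2.5, 3.5, 4.5]
lo, hi = caos_interval(preds, X, Y, x_new=10.0)
```

当某个预测器在留一意义下残差很小时,区间随之收窄;这正是"聚合多个预测器可缩小预测集"的直觉。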


强化学习(9篇)

【1】Safe Continual Reinforcement Learning Methods for Nonstationary Environments. Towards a Survey of the State of the Art
标题:面向非平稳环境的安全持续强化学习方法:最新研究综述
链接:https://arxiv.org/abs/2601.05152

作者:Timofey Tomashevskiy
备注:20 pages, 4 figures
摘要:这项工作提供了一个国家的最先进的调查持续安全在线强化学习(COSRL)方法。我们讨论了构建持续在线安全强化学习算法的理论方面,挑战和开放性问题。我们提供的分类和持续在线安全强化学习方法的细节的基础上的类型的安全学习机制,考虑到适应非平稳性。我们对在线强化学习算法的安全约束制定进行了分类,最后,我们讨论了创建可靠,安全的在线学习算法的前景。   保留字:非平稳环境中的安全强化学习,非平稳下的安全持续强化学习,HM-MDP,NSMDP,POMDP,安全POMDP,持续学习的约束,安全持续强化学习回顾,安全持续强化学习调查,安全持续强化学习,分布转移下的安全在线学习,安全持续在线适应,安全强化学习,安全探索,安全适应,约束马尔可夫决策过程,安全强化学习,部分可观察马尔可夫决策过程,安全强化学习和隐马尔可夫决策过程,安全在线强化学习,安全元学习,安全元强化学习,安全基于上下文的强化学习,制定持续学习的安全限制
摘要:This work provides a state-of-the-art survey of continual safe online reinforcement learning (COSRL) methods. We discuss theoretical aspects, challenges, and open questions in building continual online safe reinforcement learning algorithms. We provide the taxonomy and the details of continual online safe reinforcement learning methods based on the type of safe learning mechanism that takes adaptation to nonstationarity into account. We categorize safety constraints formulation for online reinforcement learning algorithms, and finally, we discuss prospects for creating reliable, safe online learning algorithms.   Keywords: safe RL in nonstationary environments, safe continual reinforcement learning under nonstationarity, HM-MDP, NSMDP, POMDP, safe POMDP, constraints for continual learning, safe continual reinforcement learning review, safe continual reinforcement learning survey, safe continual reinforcement learning, safe online learning under distribution shift, safe continual online adaptation, safe reinforcement learning, safe exploration, safe adaptation, constrained Markov decision processes, safe reinforcement learning, partially observable Markov decision process, safe reinforcement learning and hidden Markov decision processes, Safe Online Reinforcement Learning, safe online reinforcement learning, safe online reinforcement learning, safe meta-learning, safe meta-reinforcement learning, safe context-based reinforcement learning, formulating safety constraints for continual learning


【2】On the Hidden Objective Biases of Group-based Reinforcement Learning
标题:基于群体的强化学习的隐藏目标偏差
链接:https://arxiv.org/abs/2601.05002

作者:Aleksandar Fontana,Marco Simoni,Giulio Rossolini,Andrea Saracino,Paolo Mori
摘要:基于组的强化学习方法,如组相对策略优化(GRPO),现在被广泛用于后训练大型语言模型。尽管他们的经验成功,他们表现出奖励优化和潜在的培训目标之间的结构性不匹配。在本文中,我们提出了一个理论分析GRPO风格的方法,通过研究他们在一个统一的代理制定。这种观点揭示了影响所有分析方法的重复属性:(i)不均匀的组加权会导致共享前缀令牌上的系统梯度偏差;(ii)与AdamW优化器的交互使训练动态对奖励缩放很不敏感;以及(iii)优化器动量可以在重复优化步骤下将策略更新推到预期的裁剪区域之外。我们认为,这些研究结果突出了当前方法的根本局限性,并为未来配方的设计提供了原则性指导。
摘要:Group-based reinforcement learning methods, like Group Relative Policy Optimization (GRPO), are widely used nowadays to post-train large language models. Despite their empirical success, they exhibit structural mismatches between reward optimization and the underlying training objective. In this paper, we present a theoretical analysis of GRPO style methods by studying them within a unified surrogate formulation. This perspective reveals recurring properties that affect all the methods under analysis: (i) non-uniform group weighting induces systematic gradient biases on shared prefix tokens; (ii) interactions with the AdamW optimizer make training dynamics largely insensitive to reward scaling; and (iii) optimizer momentum can push policy updates beyond the intended clipping region under repeated optimization steps. We believe that these findings highlight fundamental limitations of current approaches and provide principled guidance for the design of future formulations.
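上述性质(i)可以用一个极简的数值例子验证:即便组内优势之和为零,GRPO 式的按长度归一化(1/|o_i|)也会让所有完成共享的前缀 token 获得非零的净梯度权重(示例中的奖励与长度均为假设):

```python
def grpo_advantages(rewards):
    """组相对优势:A_i = (r_i - mean) / std(组内标准化)。"""
    n = len(rewards)
    mu = sum(rewards) / n
    std = (sum((r - mu) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mu) / std if std > 0 else 0.0 for r in rewards]

rewards = [1.0, 0.0, 0.0, 1.0]          # 同一提示下4个采样完成的奖励
adv = grpo_advantages(rewards)           # [1, -1, -1, 1],组内和为0
lengths = [10, 20, 30, 40]               # 各完成的token数

# 共享前缀token在4条完成中都出现,其净梯度权重为 Σ A_i/|o_i|
prefix_weight = sum(a / L for a, L in zip(adv, lengths))
```

当各完成长度不均时 `prefix_weight` 不为零:前缀 token 被系统性地推向某个方向,这正是文中指出的共享前缀梯度偏差。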


【3】TourPlanner: A Competitive Consensus Framework with Constraint-Gated Reinforcement Learning for Travel Planning
标题:TourPlanner:一个用于旅行规划的约束门控强化学习竞争共识框架
链接:https://arxiv.org/abs/2601.04698

作者:Yinuo Wang,Mining Tan,Wenxiang Jiao,Xiaoxi Li,Hao Wang,Xuanyu Zhang,Yuan Lu,Weiming Dong
摘要:旅行计划是一个复杂的决策过程,需要综合多方面的信息来构建行程。然而,现有的旅行规划方法面临着几个挑战:(1)修剪候选兴趣点(POI),同时保持高召回率;(2)单一的推理路径限制了旅行规划的可行解空间内的探索能力;(3)同时优化硬约束和软约束仍然是一个显着的困难。为了应对这些挑战,我们提出了TourPlanner,这是一个具有多路径推理和约束门控强化学习的综合框架。具体来说,我们首先介绍了一个个性化的召回和空间优化(PReSO)的工作流程来构建空间感知的候选POI的集合。随后,我们提出了竞争共识思想链(CCoT),多路径推理范式,提高了探索可行解空间的能力。为了进一步完善该计划,我们将基于S形的门控机制集成到强化学习阶段,只有在满足硬约束后,才动态优先考虑软约束的满足。在旅行规划基准上的实验结果表明,TourPlanner实现了最先进的性能,在可行性和用户偏好对齐方面都大大超过了现有的方法。
摘要:Travel planning is a sophisticated decision-making process that requires synthesizing multifaceted information to construct itineraries. However, existing travel planning approaches face several challenges: (1) Pruning candidate points of interest (POIs) while maintaining a high recall rate; (2) A single reasoning path restricts the exploration capability within the feasible solution space for travel planning; (3) Simultaneously optimizing hard constraints and soft constraints remains a significant difficulty. To address these challenges, we propose TourPlanner, a comprehensive framework featuring multi-path reasoning and constraint-gated reinforcement learning. Specifically, we first introduce a Personalized Recall and Spatial Optimization (PReSO) workflow to construct spatially-aware candidate POIs' set. Subsequently, we propose Competitive consensus Chain-of-Thought (CCoT), a multi-path reasoning paradigm that improves the ability of exploring the feasible solution space. To further refine the plan, we integrate a sigmoid-based gating mechanism into the reinforcement learning stage, which dynamically prioritizes soft-constraint satisfaction only after hard constraints are met. Experimental results on travel planning benchmarks demonstrate that TourPlanner achieves state-of-the-art performance, significantly surpassing existing methods in both feasibility and user-preference alignment.


【4】Tape: A Cellular Automata Benchmark for Evaluating Rule-Shift Generalization in Reinforcement Learning
标题:Tape:用于评估强化学习中规则转变泛化的元胞自动机基准
链接:https://arxiv.org/abs/2601.04695

作者:Enze Pan
备注:4 tables
摘要:我们提出了Tape,一个受控的强化学习基准,旨在隔离潜在规则转变下的分布外(OOD)失效。Tape源自一维元胞自动机,可以构造精确的训练/测试划分:观察和动作空间保持固定,而转移规则发生变化。使用可重复的评估管道,我们比较了无模型基线、基于学习世界模型的规划方法和任务推理(元RL)方法。一个一致的模式出现了:分布内(ID)表现强的方法可能在留出规则的OOD设定下崩溃,并且除非实验被充分复现,高方差的OOD评估可能使方法排名不稳定。我们提供(i)标准化OOD协议,(ii)统计报告要求(种子、置信区间和假设检验),以及(iii)将熵减少与条件互信息和期望的后验KL散度联系起来的信息论恒等式,澄清"减少不确定性"的目标在规则转变下能够和不能保证什么。
摘要:We present Tape, a controlled reinforcement-learning benchmark designed to isolate out-of-distribution (OOD) failure under latent rule shifts. Tape is derived from one-dimensional cellular automata, enabling precise train/test splits where observation and action spaces are held fixed while transition rules change. Using a reproducible evaluation pipeline, we compare model-free baselines, model-based planning with learned world models, and task-inference (meta-RL) methods. A consistent pattern emerges: methods that are strong in-distribution (ID) can collapse under heldout-rule OOD, and high-variance OOD evaluation can make rankings unstable unless experiments are sufficiently replicated. We provide (i) standardized OOD protocols, (ii) statistical reporting requirements (seeds, confidence intervals, and hypothesis tests), and (iii) information-theoretic identities connecting entropy reduction to conditional mutual information and expected posterior KL divergence, clarifying what "uncertainty reduction" objectives can and cannot guarantee under rule shifts.
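Tape 所基于的一维初等元胞自动机只需几行代码即可实现;"规则转变"即在观察/动作空间不变的前提下更换 Wolfram 规则编号(下面选用规则 110 与 30 仅作示例):

```python
def eca_step(state, rule):
    """一维初等元胞自动机的一步演化:rule 为 0-255 的 Wolfram 编号,周期边界。"""
    n = len(state)
    out = []
    for i in range(n):
        left, center, right = state[(i - 1) % n], state[i], state[(i + 1) % n]
        idx = (left << 2) | (center << 1) | right   # 3位邻域编码
        out.append((rule >> idx) & 1)               # 查询规则表的对应位
    return out

s = [0, 0, 0, 1, 0, 0, 0]
after_110 = eca_step(s, 110)   # 训练时的转移规则
after_30 = eca_step(s, 30)     # 测试时的"规则转变":同一状态,不同演化
```

同一初始状态在两个规则下给出不同的下一状态,这正是基准中"固定观测/动作空间、只改转移规则"的划分方式。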


【5】Multiagent Reinforcement Learning with Neighbor Action Estimation
标题:具有邻居动作估计的多智能体强化学习
链接:https://arxiv.org/abs/2601.04511

作者:Zhenglong Luo,Zhiyong Chen,Aoxiang Liu
摘要:多智能体强化学习作为一种重要的智能范式,能够在复杂系统中实现协同决策。然而,现有的方法通常依赖于代理之间的显式动作交换来评估动作值函数,由于通信约束、延迟、能量消耗和可靠性要求,这在现实世界的工程环境中通常是不切实际的。从人工智能的角度来看,本文提出了一个增强的多智能体强化学习框架,采用动作估计神经网络来推断代理的行为。通过集成一个轻量级的动作估计模块,每个代理只使用本地可观察的信息来推断相邻代理的行为,从而实现协作策略学习,而无需显式的动作共享。这种方法与标准TD3算法完全兼容,并可扩展到更大的多智能体系统。在工程应用层面上,该框架已在双臂机器人操作任务中得到实现和验证:两个机器人手臂协同举起物体。实验结果表明,这种方法显着提高了鲁棒性和部署的可行性,同时减少对信息基础设施的依赖。总的来说,这项研究推动了分散式多智能体人工智能系统的发展,同时使人工智能能够在动态的、信息受限的现实世界环境中有效地运行。
摘要:Multiagent reinforcement learning, as a prominent intelligent paradigm, enables collaborative decision-making within complex systems. However, existing approaches often rely on explicit action exchange between agents to evaluate action value functions, which is frequently impractical in real-world engineering environments due to communication constraints, latency, energy consumption, and reliability requirements. From an artificial intelligence perspective, this paper proposes an enhanced multiagent reinforcement learning framework that employs action estimation neural networks to infer agent behaviors. By integrating a lightweight action estimation module, each agent infers neighboring agents' behaviors using only locally observable information, enabling collaborative policy learning without explicit action sharing. This approach is fully compatible with standard TD3 algorithms and scalable to larger multiagent systems. At the engineering application level, this framework has been implemented and validated in dual-arm robotic manipulation tasks: two robotic arms collaboratively lift objects. Experimental results demonstrate that this approach significantly enhances the robustness and deployment feasibility of real-world robotic systems while reducing dependence on information infrastructure. Overall, this research advances the development of decentralized multiagent artificial intelligence systems while enabling AI to operate effectively in dynamic, information-constrained real-world environments.


【6】Rate or Fate? RLV$^\varepsilon$R: Reinforcement Learning with Verifiable Noisy Rewards
标题:速率还是命运?RLV$^\varepsilon$R:具有可验证噪声奖励的强化学习
链接:https://arxiv.org/abs/2601.04411

作者:Ali Rad,Khashayar Filom,Darioush Keivan,Peyman Mohajerin Esfahani,Ehsan Kamalinejad
摘要:带有可验证奖励的强化学习(RLVR)是一种简单但功能强大的LLM训练范式:采样一个补全,验证它,然后更新。然而,在实践中,验证器几乎从来都不是干净的:单元测试只探测有限的角落情况;人类和合成标签是不完美的;LLM判断(例如RLAIF)有噪声且可被利用。这一问题在更难的领域(特别是编码)中更加严重,那里的测试稀疏且越来越多由模型生成。我们提出一个务实的问题:验证噪声仅仅是减缓了学习(速率),还是可以翻转结果(命运)?   为了回答这个问题,我们建立了一个易于分析的RLVR动态的多臂老虎机视角,以GRPO实例化,并在受控实验中得到验证。对假阳性和假阴性进行建模,并将补全归入反复出现的推理模式,在概率单纯形上产生了复制子风格(自然选择)的流。动力学解耦为正确模式内部的竞争和不正确模式上质量的一维演化,其漂移仅由Youden指数J=TPR-FPR决定。这产生了一个尖锐的相变:当J>0时,不正确的质量被驱动到灭绝(学习);当J=0时,过程是中性的;当J<0时,不正确的模式放大,直到它们占主导地位(反学习和崩溃)。在学习机制J>0中,噪声主要重新调整收敛时间("速率,而非命运")。在合成噪声下的可验证编程任务上的实验重现了预测的J=0边界。除了噪声之外,该框架还为分析RLVR的稳定性、收敛性和算法干预提供了通用视角。
摘要:Reinforcement learning with verifiable rewards (RLVR) is a simple but powerful paradigm for training LLMs: sample a completion, verify it, and update. In practice, however, the verifier is almost never clean--unit tests probe only limited corner cases; human and synthetic labels are imperfect; and LLM judges (e.g., RLAIF) are noisy and can be exploited--and this problem worsens on harder domains (especially coding) where tests are sparse and increasingly model-generated. We ask a pragmatic question: does the verification noise merely slow down the learning (rate), or can it flip the outcome (fate)?   To address this, we develop an analytically tractable multi-armed bandit view of RLVR dynamics, instantiated with GRPO and validated in controlled experiments. Modeling false positives and false negatives and grouping completions into recurring reasoning modes yields a replicator-style (natural-selection) flow on the probability simplex. The dynamics decouples into within-correct-mode competition and a one-dimensional evolution for the mass on incorrect modes, whose drift is determined solely by Youden's index J=TPR-FPR. This yields a sharp phase transition: when J>0, the incorrect mass is driven toward extinction (learning); when J=0, the process is neutral; and when J<0, incorrect modes amplify until they dominate (anti-learning and collapse). In the learning regime J>0, noise primarily rescales convergence time ("rate, not fate"). Experiments on verifiable programming tasks under synthetic noise reproduce the predicted J=0 boundary. Beyond noise, the framework offers a general lens for analyzing RLVR stability, convergence, and algorithmic interventions.
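文中的一维复制子动力学可以直接模拟:错误模式质量 p 的漂移为 dp ∝ -J·p(1-p),其中 J=TPR-FPR 决定其命运(下面的欧拉离散化、学习率与步数均为示意性假设):

```python
def incorrect_mass(p0, tpr, fpr, lr=0.5, steps=200):
    """在噪声验证器下迭代错误模式质量 p(复制子方程的欧拉离散化)。
    期望意义下,错误/正确模式的适应度分别为 FPR 与 TPR。"""
    p = p0
    for _ in range(steps):
        f_bar = p * fpr + (1.0 - p) * tpr        # 平均适应度
        p += lr * p * (fpr - f_bar)              # dp = -lr * J * p * (1-p)
    return p

learn = incorrect_mass(0.5, tpr=0.8, fpr=0.2)    # J>0:错误质量走向灭绝
neutral = incorrect_mass(0.5, tpr=0.5, fpr=0.5)  # J=0:中性,p不动
collapse = incorrect_mass(0.5, tpr=0.2, fpr=0.8) # J<0:错误模式占主导
```

三种参数设置分别落在论文预言的三个相:J 的符号决定终点,而 |J| 的大小只改变收敛所需的步数。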


【7】Survival Dynamics of Neural and Programmatic Policies in Evolutionary Reinforcement Learning
标题:进化强化学习中神经和程序策略的生存动力学
链接:https://arxiv.org/abs/2601.04365

作者:Anton Roupassov-Ruiz,Yiyang Zuo
摘要:在进化强化学习任务(ERL)中,智能体策略通常被编码为小型人工神经网络(NERL)。这种表征缺乏明确的模块结构,限制了对行为的解释。我们研究以软的、可微的决策列表(SDDL)实现的程序化策略(PERL)能否匹配NERL的性能。为了支持可重复的评估,我们提供了经典的1992年人工生命(ALife)ERL测试平台的第一个完全指定且开源的重新实现。我们对4000项独立试验进行了严格的生存分析,使用了原始研究中没有的Kaplan-Meier曲线和限制平均生存时间(RMST)指标。我们发现PERL和NERL之间的生存概率存在统计学显著差异。PERL智能体的平均生存时间比NERL智能体长201.69步。此外,仅使用学习(无进化)的SDDL智能体比同时使用学习和进化的神经智能体平均多存活73.67步。这些结果表明,程序化策略可以超过ALife中神经策略的生存性能。
摘要:In evolutionary reinforcement learning tasks (ERL), agent policies are often encoded as small artificial neural networks (NERL). Such representations lack explicit modular structure, limiting behavioral interpretation. We investigate whether programmatic policies (PERL), implemented as soft, differentiable decision lists (SDDL), can match the performance of NERL. To support reproducible evaluation, we provide the first fully specified and open-source reimplementation of the classic 1992 Artificial Life (ALife) ERL testbed. We conduct a rigorous survival analysis across 4000 independent trials utilizing Kaplan-Meier curves and Restricted Mean Survival Time (RMST) metrics absent in the original study. We find a statistically significant difference in survival probability between PERL and NERL. PERL agents survive on average 201.69 steps longer than NERL agents. Moreover, SDDL agents using learning alone (no evolution) survive on average 73.67 steps longer than neural agents using both learning and evolution. These results demonstrate that programmatic policies can exceed the survival performance of neural policies in ALife.
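"软的、可微的决策列表"的一种常见形式化(此处为假设性示意,非论文原始定义)是 stick-breaking 式门控:按规则次序用 sigmoid 门分配概率质量,未被任何规则消费的剩余质量落入默认动作:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sddl_policy(rules, default_action, obs):
    """软决策列表:规则 (w, b, action) 依序消费剩余概率质量。"""
    remaining = 1.0
    probs = {}
    for w, b, action in rules:
        gate = sigmoid(sum(wi * oi for wi, oi in zip(w, obs)) + b)
        probs[action] = probs.get(action, 0.0) + remaining * gate
        remaining *= 1.0 - gate
    probs[default_action] = probs.get(default_action, 0.0) + remaining
    return probs

# 两条规则 + 默认动作;权重、偏置与观测仅作示例
rules = [((2.0, 0.0), -1.0, "eat"), ((0.0, 3.0), 0.0, "flee")]
probs = sddl_policy(rules, "wander", obs=(1.0, -1.0))
```

输出是动作上的合法概率分布,且整个计算对 (w, b) 可微,因此既可以用梯度学习,也便于按规则逐条解读行为。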


【8】Online Action-Stacking Improves Reinforcement Learning Performance for Air Traffic Control
标题:在线动作堆叠提高空中交通管制的强化学习性能
链接:https://arxiv.org/abs/2601.04287

作者:Ben Carvell,George De Ath,Eseoghene Benjamin,Richard Everson
摘要:我们引入了在线动作堆叠,这是一种用于强化学习策略的推理时封装器,可以生成逼真的空中交通管制指令,同时允许在更小的离散动作空间上进行训练。策略通过简单的增量航向或高度层调整进行训练,同时使用动作阻尼惩罚,以降低指令频率并促使智能体以短促的突发形式发出命令。在推理时,在线动作堆叠将这些原始动作的突发编译成符合领域惯例的复合放行指令。使用近端策略优化和BluebirdDT数字孪生平台,我们训练智能体沿横向航线导航飞机,管理爬升和下降到目标飞行高度层,并在最小间隔约束下执行两架飞机的避撞。在我们的横向导航实验中,尽管只使用五个动作,动作堆叠仍大大减少了相对于阻尼基线发出的指令数量,并实现了与在37维动作空间上训练的策略相当的性能。这些结果表明,在线动作堆叠有助于弥合标准强化学习公式与实际ATC运行要求之间的关键差距,并为扩展到更复杂的控制场景提供了一种简单的机制。
摘要:We introduce online action-stacking, an inference-time wrapper for reinforcement learning policies that produces realistic air traffic control commands while allowing training on a much smaller discrete action space. Policies are trained with simple incremental heading or level adjustments, together with an action-damping penalty that reduces instruction frequency and leads agents to issue commands in short bursts. At inference, online action-stacking compiles these bursts of primitive actions into domain-appropriate compound clearances. Using Proximal Policy Optimisation and the BluebirdDT digital twin platform, we train agents to navigate aircraft along lateral routes, manage climb and descent to target flight levels, and perform two-aircraft collision avoidance under a minimum separation constraint. In our lateral navigation experiments, action stacking greatly reduces the number of issued instructions relative to a damped baseline and achieves comparable performance to a policy trained with a 37-dimensional action space, despite operating with only five actions. These results indicate that online action-stacking helps bridge a key gap between standard reinforcement learning formulations and operational ATC requirements, and provides a simple mechanism for scaling to more complex control scenarios.
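"把原始动作的突发编译成复合指令"这一步可以用一个极简封装器示意(动作名、每步转角与指令文案均为假设,非论文实现):

```python
from itertools import groupby

def stack_actions(primitives, step_deg=5):
    """在线动作堆叠:把连续重复的增量转向动作合并为一条复合放行指令。"""
    clearances = []
    for act, run in groupby(primitives):          # 按连续相同动作分组
        n = len(list(run))
        if act == "HOLD":                         # 保持当前航向,不产生指令
            continue
        turn = "右转" if act == "RIGHT" else "左转"
        clearances.append(f"{turn}{n * step_deg}度")
    return clearances

# 策略在6个决策步里发出的原始动作 → 2条复合指令
cmds = stack_actions(["RIGHT", "RIGHT", "RIGHT", "HOLD", "LEFT", "LEFT"])
```

策略仍只需在 {LEFT, RIGHT, HOLD, ...} 这样的小动作空间上学习,封装器负责把突发翻译成管制员习惯的指令粒度。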


【9】Making Tunable Parameters State-Dependent in Weather and Climate Models with Reinforcement Learning
标题:利用强化学习使天气和气候模型中的可调参数取决于状态
链接:https://arxiv.org/abs/2601.04268

作者:Pritthijit Nath,Sebastian Schemm,Henry Moss,Peter Haynes,Emily Shuckburgh,Mark J. Webb
备注:66 pages, 22 figures
摘要:天气和气候模式依赖参数化来表示未解析的次网格过程。传统方案依赖于约束较弱且离线调优的固定系数,导致持续的偏差,限制了它们适应底层物理过程的能力。这项研究提出了一个框架,使用强化学习(RL)将参数化方案的组成部分作为不断演变的模式状态的函数进行在线学习,并在一个理想化试验台层级中评估所得到的RL驱动的参数更新,该层级涵盖简单气候偏差校正(SCBC)、辐射对流平衡(RCE),以及具有单智能体和联邦多智能体设置的纬向平均能量平衡模型(EBM)。在九种RL算法中,截断分位数批评家(TQC)、深度确定性策略梯度(DDPG)和双延迟DDPG(TD3)在各配置中实现了最高的技巧和最稳定的收敛;性能评估以静态基线为参照,使用面积加权RMSE、温度廓线和气压层诊断。对于EBM,单智能体RL优于静态参数调优,在热带和中纬度带获得最大收益,而多智能体设置上的联邦RL实现了地理上专业化的控制和更快的收敛,其中使用频繁聚合的六智能体DDPG配置在热带和中纬度产生最低的面积加权RMSE。学习到的校正也具有物理意义:智能体调节EBM辐射参数以减少经向偏差,调整RCE温度递减率以匹配垂直温度误差,并稳定SCBC加热增量以限制漂移。总体而言,结果表明RL能够提供有技巧的、依赖状态且感知气候态的参数化,为数值模式中的在线学习提供了一条可扩展的途径。
摘要:Weather and climate models rely on parametrisations to represent unresolved sub-grid processes. Traditional schemes rely on fixed coefficients that are weakly constrained and tuned offline, contributing to persistent biases that limit their ability to adapt to the underlying physics. This study presents a framework that learns components of parametrisation schemes online as a function of the evolving model state using reinforcement learning (RL) and evaluates the resulting RL-driven parameter updates across a hierarchy of idealised testbeds spanning a simple climate bias correction (SCBC), a radiative-convective equilibrium (RCE), and a zonal mean energy balance model (EBM) with both single-agent and federated multi-agent settings. Across nine RL algorithms, Truncated Quantile Critics (TQC), Deep Deterministic Policy Gradient (DDPG), and Twin Delayed DDPG (TD3) achieved the highest skill and the most stable convergence across configurations, with performance assessed against a static baseline using area-weighted RMSE, temperature profile and pressure-level diagnostics. For the EBM, single-agent RL outperformed static parameter tuning with the strongest gains in tropical and mid-latitude bands, while federated RL on multi-agent setups enabled geographically specialised control and faster convergence, with a six-agent DDPG configuration using frequent aggregation yielding the lowest area-weighted RMSE across the tropics and mid-latitudes. The learnt corrections were also physically meaningful as agents modulated EBM radiative parameters to reduce meridional biases, adjusted RCE lapse rates to match vertical temperature errors, and stabilised SCBC heating increments to limit drift. Overall, results highlight RL to deliver skilful state-dependent, and regime-aware parametrisations, offering a scalable pathway for online learning within numerical models.


元学习(1篇)

【1】Meta-probabilistic Modeling
标题:元概率建模
链接:https://arxiv.org/abs/2601.04462

作者:Kevin Zhang,Yixin Wang
摘要:虽然概率图模型可以发现数据中的潜在结构,但其有效性取决于选择良好的指定模型。在实践中,识别这些模型具有挑战性,通常需要通过反复试验进行反复检查和修订。为此,我们提出了元概率建模(MPM),这是一种元学习算法,可以直接从多个相关数据集学习生成模型结构。MPM使用分层架构,其中全局模型规范在数据集之间共享,而局部参数保持特定于数据集。对于学习和推理,我们提出了一个易于处理的VAE启发的代理目标,并通过双层优化进行优化:局部变量通过坐标上升进行分析更新,而全局参数则使用基于梯度的方法进行训练。我们在以对象为中心的图像建模和序列文本建模方面评估了MPM,证明它使生成模型适应数据,同时恢复有意义的潜在表示。
摘要:While probabilistic graphical models can discover latent structure in data, their effectiveness hinges on choosing well-specified models. Identifying such models is challenging in practice, often requiring iterative checking and revision through trial and error. To this end, we propose meta-probabilistic modeling (MPM), a meta-learning algorithm that learns generative model structure directly from multiple related datasets. MPM uses a hierarchical architecture where global model specifications are shared across datasets while local parameters remain dataset-specific. For learning and inference, we propose a tractable VAE-inspired surrogate objective, and optimize it through bi-level optimization: local variables are updated analytically via coordinate ascent, while global parameters are trained with gradient-based methods. We evaluate MPM on object-centric image modeling and sequential text modeling, demonstrating that it adapts generative models to data while recovering meaningful latent representations.


符号|符号学习(1篇)

【1】Neural-Symbolic Integration with Evolvable Policies
标题:具有可进化策略的神经符号集成
链接:https://arxiv.org/abs/2601.04799

作者:Marios Thoma,Vassilis Vassiliades,Loizos Michael
备注:18 pages, 12 figures, related code available at https://github.com/CYENS/evolvable-nesy
摘要:神经符号人工智能(NeSy)已经成为一种很有前途的方法,将神经网络的学习能力与符号系统的可解释推理相结合。然而,现有的NeSy框架通常需要预定义的符号策略或可微的策略,当领域专业知识不可用或策略本质上不可微时,限制了它们的适用性。我们提出了一个框架,通过进化过程使不可微符号策略和神经网络权重得以并行学习,从而解决这一限制。我们的方法将NeSy系统视为种群中通过突变(符号规则添加和神经权重变化)进化的个体,由基于适应度的选择引导其收敛到隐藏的目标策略。该框架扩展了NEUROLOG架构以使符号策略可训练,将Valiant的可进化性(Evolvability)框架适配到NeSy情境,并采用机器教练(Machine Coaching)语义来处理可变的符号表示。神经网络通过来自符号部分的溯因推理进行训练,消除了可微性要求。通过大量实验,我们证明了从空策略和随机神经权重开始的NeSy系统可以成功地逼近隐藏的不可微目标策略,中位数正确性能接近100%。这项工作朝着在专家符号知识难以获取或不可行的领域开展NeSy研究迈出了一步。
摘要:Neural-Symbolic (NeSy) Artificial Intelligence has emerged as a promising approach for combining the learning capabilities of neural networks with the interpretable reasoning of symbolic systems. However, existing NeSy frameworks typically require either predefined symbolic policies or policies that are differentiable, limiting their applicability when domain expertise is unavailable or when policies are inherently non-differentiable. We propose a framework that addresses this limitation by enabling the concurrent learning of both non-differentiable symbolic policies and neural network weights through an evolutionary process. Our approach casts NeSy systems as organisms in a population that evolve through mutations (both symbolic rule additions and neural weight changes), with fitness-based selection guiding convergence toward hidden target policies. The framework extends the NEUROLOG architecture to make symbolic policies trainable, adapts Valiant's Evolvability framework to the NeSy context, and employs Machine Coaching semantics for mutable symbolic representations. Neural networks are trained through abductive reasoning from the symbolic component, eliminating differentiability requirements. Through extensive experimentation, we demonstrate that NeSy systems starting with empty policies and random neural weights can successfully approximate hidden non-differentiable target policies, achieving median correct performance approaching 100%. This work represents a step toward enabling NeSy research in domains where the acquisition of symbolic knowledge from experts is challenging or infeasible.


医学相关(3篇)

【1】An interpretable data-driven approach to optimizing clinical fall risk assessment
标题:优化临床跌倒风险评估的可解释数据驱动方法
链接:https://arxiv.org/abs/2601.05194

作者:Fardin Ganjkhanloo,Emmett Springer,Erik H. Hoyer,Daniel L. Young,Holley Farley,Kimia Ghobadi
备注:arXiv admin note: substantial text overlap with arXiv:2510.20714
摘要:在这项研究中,我们的目标是通过数据驱动的建模方法,使约翰霍普金斯跌倒风险评估工具(JHFRAT)的跌倒风险预测与其他具有临床意义的指标更好地对齐。我们对2022年3月至2023年10月期间来自约翰霍普金斯卫生系统三家医院的54,209次住院就诊进行了回顾性队列分析。其中20,208次就诊被纳入为高跌倒风险,13,941次被纳入为低跌倒风险。为了融入临床知识并保持可解释性,我们采用约束评分优化(CSO)模型来重新加权JHFRAT的评分权重,同时保留其加分结构和临床阈值。重新校准是指调整各评分项权重,使所得分数能够按照研究的风险标签更一致地对就诊进行排序,而无需更改工具的形式或部署工作流程。与当前JHFRAT相比,该模型显示出预测性能的显著改善(CSO AUC-ROC=0.91,JHFRAT AUC-ROC=0.86)。这一性能提升意味着约翰霍普金斯卫生系统每周可额外保护35名高风险患者。无论是否使用EHR变量,约束评分优化模型的表现都相似。尽管基准黑盒模型(XGBoost)在性能指标上优于基于知识的约束逻辑回归(AUC-ROC=0.94),但CSO对风险标签的变化表现出更强的鲁棒性。这种基于证据的方法为卫生系统提供了坚实的基础,可以使用数据驱动的优化技术系统地加强住院患者跌倒预防方案和患者安全,有助于改善医疗环境中的风险评估和资源分配。
摘要:In this study, we aim to better align fall risk prediction from the Johns Hopkins Fall Risk Assessment Tool (JHFRAT) with additional clinically meaningful measures via a data-driven modelling approach. We conducted a retrospective cohort analysis of 54,209 inpatient admissions from three Johns Hopkins Health System hospitals between March 2022 and October 2023. A total of 20,208 admissions were included as high fall risk encounters, and 13,941 were included as low fall risk encounters. To incorporate clinical knowledge and maintain interpretability, we employed constrained score optimization (CSO) models to reweight the JHFRAT scoring weights, while preserving its additive structure and clinical thresholds. Recalibration refers to adjusting item weights so that the resulting score can order encounters more consistently by the study's risk labels, and without changing the tool's form factor or deployment workflow. The model demonstrated significant improvements in predictive performance over the current JHFRAT (CSO AUC-ROC=0.91, JHFRAT AUC-ROC=0.86). This performance improvement translates to protecting an additional 35 high-risk patients per week across the Johns Hopkins Health System. The constrained score optimization models performed similarly with and without the EHR variables. Although the benchmark black-box model (XGBoost), improves upon the performance metrics of the knowledge-based constrained logistic regression (AUC-ROC=0.94), the CSO demonstrates more robustness to variations in risk labeling. This evidence-based approach provides a robust foundation for health systems to systematically enhance inpatient fall prevention protocols and patient safety using data-driven optimization techniques, contributing to improved risk assessment and resource allocation in healthcare settings.
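"在保留加分结构的同时重新加权评分项"可以理解为一个带约束的逻辑回归:学习各项权重并投影到非负区间,使量表仍是"逐项加分"的形式(下面的玩具数据与约束形式均为假设性示意,非论文的CSO实现):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cso_reweight(X, y, lr=0.5, epochs=300):
    """约束评分优化的极简示意:逻辑回归梯度下降 + 投影到 w >= 0,
    以保持临床量表的加分式结构。"""
    d = len(X[0])
    w = [1.0] * d                     # 从原量表的等权重出发
    for _ in range(epochs):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad[j] += err * xi[j]
        # 梯度步后投影回非负象限
        w = [max(0.0, wj - lr * gj / len(X)) for wj, gj in zip(w, grad)]
    return w

# 玩具数据:第0项(如"跌倒史")有判别力,第1项是噪声项
X = [(1, 0), (1, 1), (0, 1), (0, 0)]
y = [1, 1, 0, 0]
w = cso_reweight(X, y)   # 期望:w[0] 明显大于 w[1],且均非负
```

得到的权重仍可直接填回原量表逐项相加,这正是此类方法相对黑盒模型在部署上的优势。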


【2】Atlas 2 - Foundation models for clinical deployment
标题:Atlas 2 -临床部署的基础模型
链接:https://arxiv.org/abs/2601.05148

作者:Maximilian Alber,Timo Milbich,Alexandra Carpen-Amarie,Stephan Tietz,Jonas Dippel,Lukas Muttenthaler,Beatriz Perez Cancer,Alessandro Benetti,Panos Korfiatis,Elias Eulig,Jérôme Lüscher,Jiasen Wu,Sayed Abid Hashimi,Gabriel Dernbach,Simon Schallenberg,Neelay Shah,Moritz Krügener,Aniruddh Jammoria,Jake Matras,Patrick Duffy,Matt Redlon,Philipp Jurmeister,David Horst,Lukas Ruff,Klaus-Robert Müller,Frederick Klauschen,Andrew Norgan
摘要:病理学基础模型大大提高了计算病理学的可能性,但在性能、鲁棒性和计算要求方面的权衡仍然存在,这限制了它们的临床部署。在本报告中,我们介绍了Atlas 2、Atlas 2-B和Atlas 2-S三个病理学视觉基础模型,它们在涵盖80个公共基准的综合评估中展现了预测性能、鲁棒性和资源效率方面的最先进水平,从而弥补了上述不足。我们的模型是在迄今为止最大的病理学基础模型数据集上训练的,该数据集包括从Charité - Universitätsmedizin Berlin、LMU Munich和Mayo Clinic三家医疗机构收集的550万张组织病理学全切片图像。
摘要:Pathology foundation models substantially advanced the possibilities in computational pathology -- yet tradeoffs in terms of performance, robustness, and computational requirements remained, which limited their clinical deployment. In this report, we present Atlas 2, Atlas 2-B, and Atlas 2-S, three pathology vision foundation models which bridge these shortcomings by showing state-of-the-art performance in prediction performance, robustness, and resource efficiency in a comprehensive evaluation across eighty public benchmarks. Our models were trained on the largest pathology foundation model dataset to date comprising 5.5 million histopathology whole slide images, collected from three medical institutions: Charité - Universitätsmedizin Berlin, LMU Munich, and Mayo Clinic.


【3】The Forgotten Shield: Safety Grafting in Parameter-Space for Medical MLLMs
标题:被遗忘的盾牌:医疗MLLM参数空间中的安全移植
链接:https://arxiv.org/abs/2601.04199

作者:Jiale Zhao,Xing Mou,Jinlin Wu,Hongyuan Yu,Mingrui Sun,Yang Shi,Xuanwu Yin,Zhen Chen,Zhen Lei,Yaohua Wang
摘要:医学多模态大型语言模型(Medical MLLM)在专业医疗任务中取得了显着进展;然而,对其安全性的研究却滞后,为现实世界的部署带来了潜在风险。在本文中,我们首先建立了一个多维度的评估框架,系统地基准目前SOTA医疗MLLM的安全性。我们的实证分析揭示了现有模型在一般和医疗特定安全方面的普遍漏洞,特别是突出了它们对跨模式越狱攻击的脆弱性。此外,我们发现,医疗微调过程中经常会导致灾难性的忘记模型的原始安全对齐。为了应对这一挑战,我们提出了一种新的“参数空间干预”的方法,有效的安全重新调整。该方法从原始基础模型中提取本质安全知识表示,并在医疗能力构建过程中将其注入目标模型。此外,我们设计了一个细粒度的参数搜索算法,以实现安全性和医疗性能之间的最佳权衡。实验结果表明,我们的方法显着支持医疗MLLM的安全护栏,而不依赖于额外的特定领域的安全数据,同时最大限度地降低核心医疗性能。
摘要 :Medical Multimodal Large Language Models (Medical MLLMs) have achieved remarkable progress in specialized medical tasks; however, research into their safety has lagged, posing potential risks for real-world deployment. In this paper, we first establish a multidimensional evaluation framework to systematically benchmark the safety of current SOTA Medical MLLMs. Our empirical analysis reveals pervasive vulnerabilities across both general and medical-specific safety dimensions in existing models, particularly highlighting their fragility against cross-modality jailbreak attacks. Furthermore, we find that the medical fine-tuning process frequently induces catastrophic forgetting of the model's original safety alignment. To address this challenge, we propose a novel "Parameter-Space Intervention" approach for efficient safety re-alignment. This method extracts intrinsic safety knowledge representations from original base models and concurrently injects them into the target model during the construction of medical capabilities. Additionally, we design a fine-grained parameter search algorithm to achieve an optimal trade-off between safety and medical performance. Experimental results demonstrate that our approach significantly bolsters the safety guardrails of Medical MLLMs without relying on additional domain-specific safety data, while minimizing degradation to core medical performance.
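摘要未给出"参数空间干预"的具体公式;一个常见的近似思路是任务向量算术:把对齐模型与基础模型的参数差视为"安全方向",再按比例叠加到医疗微调后的模型上。以下仅为基于这一假设的示意,`graft_safety` 及全部数值均为虚构,并非论文原方法:

```python
def graft_safety(base, aligned, medical, alpha=1.0):
    """Parameter-space grafting sketch: add the 'safety direction'
    (aligned - base) onto a medically fine-tuned model's weights.
    Models are dicts mapping parameter names to lists of floats."""
    grafted = {}
    for name in medical:
        grafted[name] = [
            m + alpha * (a - b)
            for m, a, b in zip(medical[name], aligned[name], base[name])
        ]
    return grafted

base    = {"w": [0.0, 1.0]}
aligned = {"w": [0.5, 1.0]}   # alignment mostly moved the first weight
medical = {"w": [0.1, 2.0]}   # medical tuning moved the second weight
out = graft_safety(base, aligned, medical)
```

系数 `alpha` 对应摘要中"细粒度参数搜索"所要权衡的自由度:太小保不住安全护栏,太大则侵蚀医疗性能。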


蒸馏|知识提取(3篇)

【1】FedKDX: Federated Learning with Negative Knowledge Distillation for Enhanced Healthcare AI Systems
标题:FedKDX:用于增强医疗保健人工智能系统的负知识提炼联合学习
链接:https://arxiv.org/abs/2601.04587

作者:Quang-Tu Pham,Hoang-Dieu Vu,Dinh-Dat Pham,Hieu H. Pham
摘要:本文介绍了FedKDX,这是一个联邦学习框架,通过负面知识蒸馏(NKD)解决了医疗保健AI的局限性。与仅关注积极知识转移的现有方法不同,FedKDX捕获目标和非目标信息,以提高医疗保健应用中的模型泛化能力。该框架将多种知识转移技术(包括传统知识蒸馏、对比学习和NKD)集成在一个统一的架构中,在降低通信成本的同时保持隐私。通过对医疗数据集(SLEEP、UCI-HAR和PAMAP 2)的实验,FedKDX在非IID数据分布上表现出更高的准确性(比最先进的方法高出2.53%)、更快的收敛速度和更好的性能。理论分析支持NKD的贡献,以解决分布式医疗数据的统计异质性。该方法显示了在HIPAA和GDPR等监管框架下对隐私敏感的医疗应用的承诺,在分散的医疗保健环境中提供了性能和实际实施要求之间的平衡解决方案。代码和模型可在https://github.com/phamdinhdat-ai/Fed_2024上获得。
摘要:This paper introduces FedKDX, a federated learning framework that addresses limitations in healthcare AI through Negative Knowledge Distillation (NKD). Unlike existing approaches that focus solely on positive knowledge transfer, FedKDX captures both target and non-target information to improve model generalization in healthcare applications. The framework integrates multiple knowledge transfer techniques--including traditional knowledge distillation, contrastive learning, and NKD--within a unified architecture that maintains privacy while reducing communication costs. Through experiments on healthcare datasets (SLEEP, UCI-HAR, and PAMAP2), FedKDX demonstrates improved accuracy (up to 2.53% over state-of-the-art methods), faster convergence, and better performance on non-IID data distributions. Theoretical analysis supports NKD's contribution to addressing statistical heterogeneity in distributed healthcare data. The approach shows promise for privacy-sensitive medical applications under regulatory frameworks like HIPAA and GDPR, offering a balanced solution between performance and practical implementation requirements in decentralized healthcare settings. The code and model are available at https://github.com/phamdinhdat-ai/Fed_2024.
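论文的负知识蒸馏(NKD)损失在摘要中未给出具体形式;其"同时利用目标与非目标信息"的思想可以用类似解耦知识蒸馏的方式示意:将软化分布拆成目标类二元分布与重归一化的非目标类分布,分别计算KL散度。以下为纯Python草图,函数名与温度、权重取值均为假设:

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def kl(p, q, eps=1e-12):
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def decoupled_kd(teacher_logits, student_logits, target, T=2.0, beta=1.0):
    """Split KD into a target-class term and a non-target ('negative')
    term computed on the renormalized non-target distribution."""
    pt = softmax([z / T for z in teacher_logits])
    ps = softmax([z / T for z in student_logits])
    # binary target vs. non-target KL
    bt, bs = [pt[target], 1 - pt[target]], [ps[target], 1 - ps[target]]
    target_loss = kl(bt, bs)
    # KL over the classes other than the target, renormalized
    nt = [p for i, p in enumerate(pt) if i != target]
    ns = [p for i, p in enumerate(ps) if i != target]
    zt, zs = sum(nt), sum(ns)
    nontarget_loss = kl([p / zt for p in nt], [p / zs for p in ns])
    return target_loss + beta * nontarget_loss

loss_same = decoupled_kd([3.0, 1.0, 0.2], [3.0, 1.0, 0.2], target=0)
loss_diff = decoupled_kd([3.0, 1.0, 0.2], [0.5, 2.0, 1.0], target=0)
```

非目标项让学生也去拟合教师在"错误类别"之间的相对排序,这正是摘要所说仅关注正向目标迁移的方法会丢失的信息。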


【2】MemKD: Memory-Discrepancy Knowledge Distillation for Efficient Time Series Classification
标题:MemKD:用于高效时间序列分类的记忆差异知识提炼
链接:https://arxiv.org/abs/2601.04264

作者:Nilushika Udayangani,Kishor Nandakishor,Marimuthu Palaniswami
备注:In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2025), Hyderabad, India
摘要:深度学习模型,特别是循环神经网络及其变体(如长短期记忆网络),大大提高了时间序列数据分析的水平。这些模型捕捉时间序列中复杂的连续模式,从而能够进行实时评估。然而,它们的高计算复杂性和大模型尺寸对在资源受限环境(例如可穿戴设备和边缘计算平台)中的部署提出了挑战。知识蒸馏(KD)通过将知识从大型复杂模型(教师)转移到更小、更有效的模型(学生)来提供解决方案,从而在保持高性能的同时降低计算需求。目前的KD方法最初是为计算机视觉任务设计的,忽略了时间序列模型独特的时间依赖性和记忆保持特性。为此,我们提出了一种新的KD框架,称为记忆差异知识蒸馏(MemKD)。MemKD利用专门的损失函数来捕获教师和学生模型在时间序列数据各子序列上的记忆保留差异,确保学生模型有效地模仿教师模型的行为。这种方法有助于开发适用于实时时间序列分析任务的紧凑、高性能的递归神经网络。我们广泛的实验表明,MemKD显著优于最先进的KD方法。它将参数大小和内存使用量减少了大约500倍,同时保持与教师模型相当的性能。
摘要:Deep learning models, particularly recurrent neural networks and their variants, such as long short-term memory, have significantly advanced time series data analysis. These models capture complex, sequential patterns in time series, enabling real-time assessments. However, their high computational complexity and large model sizes pose challenges for deployment in resource-constrained environments, such as wearable devices and edge computing platforms. Knowledge Distillation (KD) offers a solution by transferring knowledge from a large, complex model (teacher) to a smaller, more efficient model (student), thereby retaining high performance while reducing computational demands. Current KD methods, originally designed for computer vision tasks, neglect the unique temporal dependencies and memory retention characteristics of time series models. To this end, we propose a novel KD framework termed Memory-Discrepancy Knowledge Distillation (MemKD). MemKD leverages a specialized loss function to capture memory retention discrepancies between the teacher and student models across subsequences within time series data, ensuring that the student model effectively mimics the teacher model's behaviour. This approach facilitates the development of compact, high-performing recurrent neural networks suitable for real-time, time series analysis tasks. Our extensive experiments demonstrate that MemKD significantly outperforms state-of-the-art KD methods. It reduces parameter size and memory usage by approximately 500 times while maintaining comparable performance to the teacher model.
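MemKD的具体损失形式未在摘要中给出;其"在子序列上度量师生记忆保留差异"的思想可以粗略示意为:在每个滑动子序列窗口的末端比较教师与学生的隐状态。`memory_discrepancy` 与示例隐状态均为假设的玩具数据:

```python
def memory_discrepancy(teacher_states, student_states, window=3):
    """Average squared gap between teacher and student hidden states,
    measured at the end of every sliding subsequence window."""
    assert len(teacher_states) == len(student_states)
    total, count = 0.0, 0
    for end in range(window - 1, len(teacher_states)):
        ht, hs = teacher_states[end], student_states[end]
        total += sum((a - b) ** 2 for a, b in zip(ht, hs))
        count += 1
    return total / max(count, 1)

T = [[1.0, 0.0], [0.8, 0.2], [0.6, 0.4], [0.5, 0.5]]
S_good = [[1.0, 0.0], [0.8, 0.2], [0.6, 0.4], [0.5, 0.5]]
S_bad  = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
d_good = memory_discrepancy(T, S_good)
d_bad  = memory_discrepancy(T, S_bad)
```

训练时可将这一差异项与常规蒸馏损失加权求和,迫使学生在压缩后仍保留教师随时间演化的记忆轨迹。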


【3】SAGE-32B: Agentic Reasoning via Iterative Distillation
标题:SAGE-32 B:通过迭代蒸馏进行抽象推理
链接:https://arxiv.org/abs/2601.04237

作者:Basab Jha,Firoj Paudel,Ujjwal Puri,Ethan Henkel,Zhang Yuting,Mateusz Kowalczyk,Mei Huang,Choi Donghyuk,Wang Junhao
备注:23 Pages, 3 figures, 4 tables
摘要 :我们展示了SAGE-32 B,一个320亿参数的语言模型,专注于代理推理和长期规划任务。与旨在实现一般会话流畅性的聊天模型不同,SAGE-32 B旨在以代理循环的方式运行,强调任务分解,工具使用和错误恢复。该模型从Qwen2.5- 32 B预训练模型初始化,并使用迭代蒸馏进行微调,这是一个两阶段的训练过程,通过严格测试的反馈回路提高推理性能。SAGE-32 B还引入了一种逆向推理方法,该方法使用Meta认知头在执行之前预测规划过程中的潜在故障。在包括MMLU-Pro,AgentBench和MATH-500在内的代理推理基准测试中,与类似大小的基线模型相比,SAGE-32 B在多工具使用场景中实现了更高的成功率,同时在标准推理评估中保持竞争力。模型权重在https://huggingface.co/sagea-ai/sage-reasoning-32b上公开发布
摘要:We demonstrate SAGE-32B, a 32 billion parameter language model that focuses on agentic reasoning and long range planning tasks. Unlike chat models that aim for general conversation fluency, SAGE-32B is designed to operate in an agentic loop, emphasizing task decomposition, tool usage, and error recovery. The model is initialized from the Qwen2.5-32B pretrained model and fine tuned using Iterative Distillation, a two stage training process that improves reasoning performance through rigorously tested feedback loops. SAGE-32B also introduces an inverse reasoning approach, which uses a meta cognition head to forecast potential failures in the planning process before execution. On agentic reasoning benchmarks including MMLU-Pro, AgentBench, and MATH-500, SAGE-32B achieves higher success rates in multi tool usage scenarios compared to similarly sized baseline models, while remaining competitive on standard reasoning evaluations. Model weights are publicly released at https://huggingface.co/sagea-ai/sage-reasoning-32b


推荐(1篇)

【1】Correct and Weight: A Simple Yet Effective Loss for Implicit Feedback Recommendation
标题:正确性和权重:隐性反馈推荐的简单而有效的损失
链接:https://arxiv.org/abs/2601.04291

作者:Minglei Yin,Chuanbo Hu,Bin Liu,Neil Zhenqiang Gong,Yanfang Ye,Xin Li
备注:arXiv admin note: text overlap with arXiv:2508.05673 by other authors
摘要:从隐式反馈中学习已经成为现代推荐系统的标准范例。然而,这种设置充满了假阴性的持续挑战,其中未观察到的用户-项目交互不一定指示负面偏好。为了解决这个问题,本文引入了一种新的原则性损失函数,称为校正和加权(CW)损失,它系统地校正了训练目标中假阴性的影响。我们的方法集成了两个关键技术。首先,受Positive-Unlabeled学习的启发,我们通过重新校准假设的负分布来消除负采样过程的偏差。通过使用可观察的一般数据分布(p)和正交互分布(p^+)在理论上近似真负分布(p^-),我们的方法对采样的未标记项是真负例的可能性提供了更准确的估计。其次,我们引入了一个动态重新加权机制,根据模型的当前预测来调整每个负例的重要性。该方案鼓励模型在正例项和被置信预测的(即容易的)负例项之间强制执行更大的排名间隔,同时降低对假阴性概率较高的不确定负例项的惩罚。我们的方法的一个关键优势是它的优雅和效率;它不需要对数据采样过程进行复杂的修改,也不需要大量的计算开销,这使得它很容易适用于各种现有的推荐模型。在四个大规模的稀疏基准数据集上进行的大量实验证明了我们提出的损失的优越性。结果表明,我们的方法在多个面向排名的指标上一致且显著优于一套最先进的损失函数。
摘要:Learning from implicit feedback has become the standard paradigm for modern recommender systems. However, this setting is fraught with the persistent challenge of false negatives, where unobserved user-item interactions are not necessarily indicative of negative preference. To address this issue, this paper introduces a novel and principled loss function, named Corrected and Weighted (CW) loss, that systematically corrects for the impact of false negatives within the training objective. Our approach integrates two key techniques. First, inspired by Positive-Unlabeled learning, we debias the negative sampling process by re-calibrating the assumed negative distribution. By theoretically approximating the true negative distribution (p^-) using the observable general data distribution (p) and the positive interaction distribution (p^+), our method provides a more accurate estimate of the likelihood that a sampled unlabeled item is truly negative. Second, we introduce a dynamic re-weighting mechanism that modulates the importance of each negative instance based on the model's current prediction. This scheme encourages the model to enforce a larger ranking margin between positive items and confidently predicted (i.e., easy) negative items, while simultaneously down-weighting the penalty on uncertain negatives that have a higher probability of being false negatives. A key advantage of our approach is its elegance and efficiency; it requires no complex modifications to the data sampling process or significant computational overhead, making it readily applicable to a wide array of existing recommendation models. Extensive experiments conducted on four large-scale, sparse benchmark datasets demonstrate the superiority of our proposed loss. The results show that our method consistently and significantly outperforms a suite of state-of-the-art loss functions across multiple ranking-oriented metrics.
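上述PU式去偏的核心恒等式是 p^-(x) = (p(x) - pi * p^+(x)) / (1 - pi),其中 pi 为正类先验;据此可用未标记样本与正样本上的损失来估计"真负例"上的期望损失。以下为假设先验已知的极简示意(非负截断借鉴了nnPU的常见做法,并非论文原实现):

```python
def corrected_negative_loss(loss_unlabeled, loss_positive, pi):
    """PU-style debiasing: estimate the loss over true negatives from
    unlabeled and positive samples, using
    E_{p^-}[l] = (E_p[l] - pi * E_{p^+}[l]) / (1 - pi),
    where pi is the (assumed known) positive class prior."""
    mean_u = sum(loss_unlabeled) / len(loss_unlabeled)
    mean_p = sum(loss_positive) / len(loss_positive)
    est = (mean_u - pi * mean_p) / (1.0 - pi)
    return max(est, 0.0)   # non-negative correction, as in nnPU

# unlabeled pool mixes true negatives (loss 1.0) and hidden positives (loss 0.2)
pi = 0.25
loss_u = [1.0, 1.0, 1.0, 0.2]   # one in four is a hidden positive
loss_p = [0.2, 0.2]
est = corrected_negative_loss(loss_u, loss_p, pi)
```

在这个玩具例子中,未标记池的原始平均损失0.8被隐藏正例拉低,而校正后的估计恢复为真负例的损失1.0。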


自动驾驶|车辆|车道检测等(4篇)

【1】Spatial-Temporal Feedback Diffusion Guidance for Controlled Traffic Imputation
标题:控制交通估算的时空反馈扩散引导
链接:https://arxiv.org/abs/2601.04572

作者:Xiaowei Mao,Huihu Ding,Yan Lin,Tingrui Wu,Shengnan Guo,Dazhuo Qiu,Feiling Fang,Jilin Hu,Huaiyu Wan
摘要:在时空交通数据中填补缺失值是智能交通系统的关键。在先进的插补方法中,基于分数的扩散模型表现出了具有竞争力的性能。这些模型通过反转噪声过程来生成数据,使用观测值作为条件指导。然而,现有的扩散模型通常在空间和时间维度上应用统一的指导尺度,这对于具有高缺失数据率的节点是不够的。稀疏观测提供的条件指导不足,导致生成过程向学习到的先验分布漂移,而不是紧密跟随条件观测,从而导致次优插补性能。   为了解决这个问题,我们提出了FENCE,一种时空反馈扩散指导方法,旨在在插补过程中自适应地控制指导尺度。首先,FENCE引入了一种动态反馈机制,该机制基于后验似然近似来调整引导尺度。当生成的值偏离观测值时,引导尺度增加;当对准改善时,引导尺度减小,以防止过度校正。其次,由于与观测值的对齐程度在节点和去噪步骤之间变化,对所有节点使用全局指导尺度是次优的。FENCE通过基于节点的注意力分数对节点进行分组,在聚类级别计算指导尺度,利用时空相关性来提供更准确的指导。在真实交通数据集上的实验结果表明,FENCE显著提高了插补精度。
摘要 :Imputing missing values in spatial-temporal traffic data is essential for intelligent transportation systems. Among advanced imputation methods, score-based diffusion models have demonstrated competitive performance. These models generate data by reversing a noising process, using observed values as conditional guidance. However, existing diffusion models typically apply a uniform guidance scale across both spatial and temporal dimensions, which is inadequate for nodes with high missing data rates. Sparse observations provide insufficient conditional guidance, causing the generative process to drift toward the learned prior distribution rather than closely following the conditional observations, resulting in suboptimal imputation performance.   To address this, we propose FENCE, a spatial-temporal feedback diffusion guidance method designed to adaptively control guidance scales during imputation. First, FENCE introduces a dynamic feedback mechanism that adjusts the guidance scale based on the posterior likelihood approximations. The guidance scale is increased when generated values diverge from observations and reduced when alignment improves, preventing overcorrection. Second, because alignment to observations varies across nodes and denoising steps, a global guidance scale for all nodes is suboptimal. FENCE computes guidance scales at the cluster level by grouping nodes based on their attention scores, leveraging spatial-temporal correlations to provide more accurate guidance. Experimental results on real-world traffic datasets show that FENCE significantly enhances imputation accuracy.
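FENCE的动态反馈机制可以用一条极简的标量更新规则来示意:当生成值与观测的残差变大时放大指导尺度,反之衰减,并钳制在安全区间内以防过度校正。倍率与区间等数值均为假设,并非论文原参数:

```python
def update_guidance_scale(scale, residual, prev_residual,
                          up=1.2, down=0.9, lo=0.1, hi=10.0):
    """Feedback rule sketch: grow the guidance scale when the generated
    values drift from the observations (residual worsens), shrink it
    when alignment improves, and clamp to a safe range."""
    if residual > prev_residual:
        scale *= up
    else:
        scale *= down
    return min(max(scale, lo), hi)

scales = [1.0]
residuals = [0.5, 0.8, 0.6, 0.3, 0.2]   # misalignment per denoising step
for t in range(1, len(residuals)):
    scales.append(update_guidance_scale(scales[-1], residuals[t], residuals[t - 1]))
```

论文中这一调节进一步按注意力聚类分组进行,使高缺失率节点簇获得与其观测稀疏程度相称的指导强度。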


【2】GEnSHIN: Graphical Enhanced Spatio-temporal Hierarchical Inference Network for Traffic Flow Prediction
标题:GEnSHIN:用于交通流预测的图形增强时空层次推理网络
链接:https://arxiv.org/abs/2601.04550

作者:Zhiyan Zhou,Junjie Liao,Manho Zhang,Yingyi Liao,Ziai Wang
摘要:随着城市化进程的加快,智能交通系统对交通流准确预测的需求越来越大。本文提出了一种新的图增强时空层次推理网络(GEnSHIN),用于处理交通流预测中复杂的时空相关性。该模型集成了三个创新设计:1)注意力增强的图卷积递归单元(GCRU),通过引入Transformer模块,增强了对长期时间依赖的建模能力;2)非对称双嵌入图生成机制,利用真实路网和数据驱动的潜在非对称拓扑,生成更符合实际交通流特征的图结构;3)动态记忆库模块,其利用可学习的交通模式原型为每个传感器节点提供个性化的交通模式表示,并在解码阶段引入轻量级图更新器以适应路网状态的动态变化。在公共数据集METR-LA上进行的大量实验表明,GEnSHIN在平均绝对误差(MAE)、均方根误差(RMSE)和平均绝对百分比误差(MAPE)等多个指标上达到或超过了对比模型的性能。值得注意的是,该模型在早晚交通高峰时段表现出良好的预测稳定性。消融实验进一步验证了各个核心模块的有效性及其对最终性能的贡献。
摘要:With the acceleration of urbanization, intelligent transportation systems have an increasing demand for accurate traffic flow prediction. This paper proposes a novel Graph Enhanced Spatio-temporal Hierarchical Inference Network (GEnSHIN) to handle the complex spatio-temporal dependencies in traffic flow prediction. The model integrates three innovative designs: 1) An attention-enhanced Graph Convolutional Recurrent Unit (GCRU), which strengthens the modeling capability for long-term temporal dependencies by introducing Transformer modules; 2) An asymmetric dual-embedding graph generation mechanism, which leverages the real road network and data-driven latent asymmetric topology to generate graph structures that better fit the characteristics of actual traffic flow; 3) A dynamic memory bank module, which utilizes learnable traffic pattern prototypes to provide personalized traffic pattern representations for each sensor node, and introduces a lightweight graph updater during the decoding phase to adapt to dynamic changes in road network states. Extensive experiments on the public dataset METR-LA show that GEnSHIN achieves or surpasses the performance of comparative models across multiple metrics such as Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Mean Absolute Percentage Error (MAPE). Notably, the model demonstrates excellent prediction stability during peak morning and evening traffic hours. Ablation experiments further validate the effectiveness of each core module and its contribution to the final performance.
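其中的"非对称双嵌入图生成"与Graph WaveNet一类工作中常见的自适应邻接构造相似:用两组独立的节点嵌入做内积,经ReLU与按行softmax得到有向(非对称)邻接矩阵。以下示意基于这一常见做法,并非论文原实现:

```python
import numpy as np

def dual_embedding_graph(E1, E2):
    """Data-driven adjacency sketch: two independent node embeddings
    give a directed (asymmetric) graph, A = row_softmax(ReLU(E1 E2^T))."""
    logits = np.maximum(E1 @ E2.T, 0.0)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
E1 = rng.standard_normal((4, 3))   # "source" embeddings, one row per sensor
E2 = rng.standard_normal((4, 3))   # "target" embeddings
A = dual_embedding_graph(E1, E2)
```

因为两组嵌入相互独立,A 一般不等于其转置,这正是用来刻画上下游路段之间方向性影响的关键。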


【3】Human-in-the-Loop Testing of AI Agents for Air Traffic Control with a Regulated Assessment Framework
标题:采用监管评估框架对空中交通管制人工智能代理进行人在环测试
链接:https://arxiv.org/abs/2601.04288

作者:Ben Carvell,Marc Thomas,Andrew Pace,Christopher Dorney,George De Ath,Richard Everson,Nick Pepper,Adam Keane,Samuel Tomlinson,Richard Cannon
摘要:我们提出了一个严格的人在回路评估框架,用于评估人工智能代理在空中交通管制任务中的表现,该框架基于监管机构认证的、用于培训和测试真实世界实习管制员的模拟器课程。通过利用受法律监管的评估,并在评估过程中引入专家级的人类讲师,我们的框架能够对人工智能的性能进行更真实、更符合领域实际的测量。这项工作解决了现有文献中的一个关键差距:空中交通管制的学术表述与实际操作环境的复杂性之间经常存在的不一致。它还通过将机器性能与人类评估目标相结合,为未来有效的人机协作范式奠定了基础。
摘要:We present a rigorous, human-in-the-loop evaluation framework for assessing the performance of AI agents on the task of Air Traffic Control, grounded in a regulator-certified simulator-based curriculum used for training and testing real-world trainee controllers. By leveraging legally regulated assessments and involving expert human instructors in the evaluation process, our framework enables a more authentic and domain-accurate measurement of AI performance. This work addresses a critical gap in the existing literature: the frequent misalignment between academic representations of Air Traffic Control and the complexities of the actual operational environment. It also lays the foundations for effective future human-machine teaming paradigms by aligning machine performance with human assessment targets.


【4】A Future Capabilities Agent for Tactical Air Traffic Control
标题:战术空中交通管制的未来能力代理
链接:https://arxiv.org/abs/2601.04285

作者:Paul Kent,George De Ath,Martin Layton,Allen Hart,Richard Everson,Ben Carvell
摘要:不断升级的空中交通需求正在推动采用自动化来支持空中交通管制员,但现有方法面临安全保证和可解释性之间的权衡。强化学习等基于优化的方法提供了强大的性能,但难以验证和解释,而基于规则的系统是透明的,但很少在不确定性下检查安全性。本文概述了代理Mallard,一个前瞻性的规划,基于规则的代理系统化的空域,直接嵌入一个随机的数字孪生到其冲突解决循环的战术控制。Mallard在预定义的GPS引导路线上运行,将连续的4D矢量化减少到对车道和水平的离散选择,并从专家知情的消除冲突策略库中构建分层计划。深度受限回溯搜索使用因果归因、拓扑计划拼接和单调轴约束来为所有飞机寻求完整的安全计划,针对不确定的执行场景(例如,风变化、飞行员响应、通信损失)。   与英国管制员的初步演练和BluebirdDT空域数字孪生的初步测试表明,Mallard的行为与专家推理一致,并在简化的场景中解决了冲突。该架构旨在结合基于模型的安全评估,可解释的决策逻辑,和易于处理的计算性能,在未来的结构化航路环境。
摘要 :Escalating air traffic demand is driving the adoption of automation to support air traffic controllers, but existing approaches face a trade-off between safety assurance and interpretability. Optimisation-based methods such as reinforcement learning offer strong performance but are difficult to verify and explain, while rules-based systems are transparent yet rarely check safety under uncertainty. This paper outlines Agent Mallard, a forward-planning, rules-based agent for tactical control in systemised airspace that embeds a stochastic digital twin directly into its conflict-resolution loop. Mallard operates on predefined GPS-guided routes, reducing continuous 4D vectoring to discrete choices over lanes and levels, and constructs hierarchical plans from an expert-informed library of deconfliction strategies. A depth-limited backtracking search uses causal attribution, topological plan splicing, and monotonic axis constraints to seek a complete safe plan for all aircraft, validating each candidate manoeuvre against uncertain execution scenarios (e.g., wind variation, pilot response, communication loss) before commitment.   Preliminary walkthroughs with UK controllers and initial tests in the BluebirdDT airspace digital twin indicate that Mallard's behaviour aligns with expert reasoning and resolves conflicts in simplified scenarios. The architecture is intended to combine model-based safety assessment, interpretable decision logic, and tractable computational performance in future structured en-route environments.
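Mallard的"深度受限回溯搜索"可以抽象为经典的约束满足回溯:逐机分配航道,每个候选分配先通过冲突校验(论文中该校验会针对不确定执行场景做验证,此处抽象为一个谓词),失败则回溯。以下为玩具化示意,函数与冲突规则均为假设:

```python
def safe_plan(aircraft, lanes, conflict, max_depth=10):
    """Depth-limited backtracking sketch: assign each aircraft a lane so
    that no validated pairwise conflict remains; conflict(a, la, b, lb)
    stands in for validation against uncertain execution scenarios."""
    plan = {}

    def extend(i, depth):
        if depth > max_depth:
            return False
        if i == len(aircraft):
            return True          # complete safe plan found
        a = aircraft[i]
        for lane in lanes:
            if all(not conflict(a, lane, b, plan[b]) for b in plan):
                plan[a] = lane
                if extend(i + 1, depth + 1):
                    return True
                del plan[a]      # backtrack
        return False

    return plan if extend(0, 0) else None

# toy scenario: two aircraft conflict whenever they share a lane
def shares_lane(a, la, b, lb):
    return la == lb

plan = safe_plan(["AC1", "AC2", "AC3"], ["L1", "L2", "L3"], shares_lane)
```

深度上限对应论文中对可处理计算性能的要求:搜索要么在预算内返回一份对全部飞机都安全的完整计划,要么明确失败并交还人工。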


联邦学习|隐私保护|加密(2篇)

【1】Mechanism Design for Federated Learning with Non-Monotonic Network Effects
标题:具有非单调网络效应的联邦学习机制设计
链接:https://arxiv.org/abs/2601.04648

作者:Xiang Li,Bing Luo,Jianwei Huang,Yuan Luo
备注:Journal extension of Mobihoc conference version, under review of IEEE TMC
摘要:机制设计是联邦学习(FL)的关键,通过协调自利的客户来最大化社会福利。然而,现有的机制往往忽视了客户参与的网络效应和不同应用对模型性能(即泛化误差)的不同要求,导致次优激励和社会福利,甚至在实际部署中不适用。为了解决这一差距,我们探索了具有网络效应和应用特定模型性能要求的FL激励机制设计。我们开发了一个理论模型来量化网络效应对异质客户参与的影响,揭示了这种效应的非单调性。基于这些见解,我们提出了一个模型交易和共享(MoTS)框架,使客户能够通过参与或购买获得FL模型。为了进一步应对客户的策略行为,我们设计了一个应用感知与网络效应下的社会福利最大化(SWAN)机制,利用模型客户的付费进行激励。在硬件原型上的实验结果表明,SWAN机制优于现有的FL机制,最多可将社会福利提高352.42%,并将额外激励成本降低93.07%。
摘要:Mechanism design is pivotal to federated learning (FL) for maximizing social welfare by coordinating self-interested clients. Existing mechanisms, however, often overlook the network effects of client participation and the diverse model performance requirements (i.e., generalization error) across applications, leading to suboptimal incentives and social welfare, or even inapplicability in real deployments. To address this gap, we explore incentive mechanism design for FL with network effects and application-specific requirements of model performance. We develop a theoretical model to quantify the impact of network effects on heterogeneous client participation, revealing the non-monotonic nature of such effects. Based on these insights, we propose a Model Trading and Sharing (MoTS) framework, which enables clients to obtain FL models through either participation or purchase. To further address clients' strategic behaviors, we design a Social Welfare maximization with Application-aware and Network effects (SWAN) mechanism, exploiting model customer payments for incentivization. Experimental results on a hardware prototype demonstrate that our SWAN mechanism outperforms existing FL mechanisms, improving social welfare by up to $352.42\%$ and reducing extra incentive costs by $93.07\%$.


【2】Hybrid Federated Learning for Noise-Robust Training
标题:用于噪音稳健训练的混合联邦学习
链接:https://arxiv.org/abs/2601.04483

作者:Yongjun Kim,Hyeongjun Park,Hwanjin Kim,Junil Choi
摘要:联邦学习(FL)和联邦蒸馏(FD)是分布式学习范例,其训练具有增强隐私的UE模型,每种都提供噪声鲁棒性和学习速度之间的不同权衡。为了减轻它们各自的弱点,我们提出了一个混合联邦学习(HFL)框架,其中每个用户设备(UE)发送梯度或对数,基站(BS)选择FL和FD更新的每轮权重。我们推导了HFL框架的收敛性,并介绍了两种方法来利用HFL中的自由度(DoF),它们是(i)通过Jenks优化的自适应UE聚类和(ii)通过阻尼牛顿方法的自适应权重选择。数值结果表明,当两个自由度都被利用时,HFL在低信噪比下实现了优异的测试精度。
摘要:Federated learning (FL) and federated distillation (FD) are distributed learning paradigms that train UE models with enhanced privacy, each offering different trade-offs between noise robustness and learning speed. To mitigate their respective weaknesses, we propose a hybrid federated learning (HFL) framework in which each user equipment (UE) transmits either gradients or logits, and the base station (BS) selects the per-round weights of FL and FD updates. We derive convergence of HFL framework and introduce two methods to exploit degrees of freedom (DoF) in HFL, which are (i) adaptive UE clustering via Jenks optimization and (ii) adaptive weight selection via a damped Newton method. Numerical results show that HFL achieves superior test accuracy at low SNR when both DoF are exploited.
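HFL中"BS选择FL与FD更新的每轮权重"这一步可以示意为对两种聚合结果做凸组合(论文中权重由阻尼牛顿法自适应选取,此处直接给定)。`hybrid_round` 为假设的名称,参数向量被简化为浮点列表:

```python
def hybrid_round(fl_params, fd_params, w_fl):
    """Per-round HFL sketch: the BS blends the parameter vector produced
    by gradient aggregation (FL) with the one produced by logit
    distillation (FD), using a per-round weight w_fl in [0, 1]."""
    assert 0.0 <= w_fl <= 1.0
    return [w_fl * a + (1 - w_fl) * b for a, b in zip(fl_params, fd_params)]

# low-SNR round: gradients are noisy, so the BS leans toward the FD update
mixed = hybrid_round([1.0, 2.0], [3.0, 4.0], w_fl=0.25)
```

直观上,低信噪比时梯度噪声大,BS可调低 `w_fl` 倚重对噪声更稳健的FD更新;信道好转时再调高以换取更快收敛。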


推理|分析|理解|解释(13篇)

【1】Robust Reasoning as a Symmetry-Protected Topological Phase
标题:鲁棒推理作为对称性保护的拓扑相
链接:https://arxiv.org/abs/2601.05240

作者:Ilmo Sung
摘要 :大型语言模型饱受"幻觉"之苦,即由语义噪声引起的逻辑不一致。我们提出,当前架构运行在一个"度量相"中,其因果顺序在自发对称性破缺面前十分脆弱。在此,我们将鲁棒推理识别为一个有效的对称性保护拓扑相,其中逻辑运算在形式上同构于非阿贝尔任意子编织,用鲁棒的拓扑不变量取代脆弱的几何插值。在实验上,我们展示了一个尖锐的拓扑相变:Transformer和RNN表现出无能隙衰减,而我们的Holonomic网络揭示了一个宏观的"质量隙",在临界噪声阈值以下保持不变的保真度。此外,在表示符号操作的$S_{10}$($3.6 \times 10^6$个状态)变量绑定任务上,我们展示了holonomic泛化:拓扑模型在外推至训练范围100倍($L=50 \to 5000$)时仍保持完美的保真度,与理论上无限的因果视界相一致,而Transformer则丧失了逻辑连贯性。消融研究表明,这种保护严格来源于非阿贝尔规范对称性。这为逻辑推理的一个新普适类提供了有力证据,将因果稳定性与语义流形的拓扑联系起来。
摘要:Large language models suffer from "hallucinations"-logical inconsistencies induced by semantic noise. We propose that current architectures operate in a "Metric Phase," where causal order is vulnerable to spontaneous symmetry breaking. Here, we identify robust inference as an effective Symmetry-Protected Topological phase, where logical operations are formally isomorphic to non-Abelian anyon braiding, replacing fragile geometric interpolation with robust topological invariants. Empirically, we demonstrate a sharp topological phase transition: while Transformers and RNNs exhibit gapless decay, our Holonomic Network reveals a macroscopic "mass gap," maintaining invariant fidelity below a critical noise threshold. Furthermore, in a variable-binding task on $S_{10}$ ($3.6 \times 10^6$ states) representing symbolic manipulation, we demonstrate holonomic generalization: the topological model maintains perfect fidelity extrapolating $100\times$ beyond training ($L=50 \to 5000$), consistent with a theoretically indefinite causal horizon, whereas Transformers lose logical coherence. Ablation studies indicate this protection emerges strictly from non-Abelian gauge symmetry. This provides strong evidence for a new universality class for logical reasoning, linking causal stability to the topology of the semantic manifold.


【2】RelayLLM: Efficient Reasoning via Collaborative Decoding
标题:RelayLLM:通过协作解码实现高效推理
链接:https://arxiv.org/abs/2601.05167

作者:Chengsong Huang,Tong Zheng,Langlin Huang,Jinyuan Li,Haolin Liu,Jiaxin Huang
摘要:用于复杂推理的大型语言模型(LLM)通常受到高计算成本和延迟的阻碍,而资源高效的小型语言模型(SLM)通常缺乏必要的推理能力。现有的协作方法,例如级联或路由,通过将整个查询卸载到LLM来以粗粒度操作,当SLM能够处理大多数推理步骤时,导致显著的计算浪费。为了解决这个问题,我们提出了RelayLLM,一个新的框架,通过令牌级协作解码的高效推理。与路由器不同,RelayLLM授权SLM充当主动控制器,通过特殊命令动态调用LLM仅用于关键令牌,有效地“中继”生成过程。我们引入了一个两阶段的训练框架,包括热身和组相对策略优化(GRPO)教模型平衡独立性与战略寻求帮助。六个基准测试的实证结果表明,RelayLLM的平均准确度为49.52%,有效地弥补了两个模型之间的性能差距。值得注意的是,这是通过仅为生成的令牌总数的1.07%调用LLM来实现的,与性能匹配的随机路由器相比,成本降低了98.2%。
摘要:Using Large Language Models (LLMs) for complex reasoning is often hindered by high computational costs and latency, while resource-efficient Small Language Models (SLMs) typically lack the necessary reasoning capacity. Existing collaborative approaches, such as cascading or routing, operate at a coarse granularity by offloading entire queries to LLMs, resulting in significant computational waste when the SLM is capable of handling the majority of reasoning steps. To address this, we propose RelayLLM, a novel framework for efficient reasoning via token-level collaborative decoding. Unlike routers, RelayLLM empowers the SLM to act as an active controller that dynamically invokes the LLM only for critical tokens via a special command, effectively "relaying" the generation process. We introduce a two-stage training framework, including warm-up and Group Relative Policy Optimization (GRPO) to teach the model to balance independence with strategic help-seeking. Empirical results across six benchmarks demonstrate that RelayLLM achieves an average accuracy of 49.52%, effectively bridging the performance gap between the two models. Notably, this is achieved by invoking the LLM for only 1.07% of the total generated tokens, offering a 98.2% cost reduction compared to performance-matched random routers.
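RelayLLM的令牌级接力解码可以用一个玩具循环来示意:小模型逐token生成,遇到特殊控制token时,仅把下一个token委托给大模型,并统计LLM token占比。两个"模型"均为写死的桩函数,纯属假设:

```python
CALL = "<call>"

def relay_decode(slm_next, llm_next, prompt, max_tokens=20):
    """Token-level relay sketch: the small model drives decoding and
    emits a special control token to delegate exactly one token to the
    large model. Returns the text and the fraction of LLM tokens."""
    out, llm_tokens = [], 0
    while len(out) < max_tokens:
        tok = slm_next(prompt + out)
        if tok == CALL:
            tok = llm_next(prompt + out)   # delegate one hard token
            llm_tokens += 1
        if tok == "<eos>":
            break
        out.append(tok)
    return out, llm_tokens / max(len(out), 1)

# stub models: the SLM defers only on the "hard" third token
def slm(ctx):
    order = ["the", "answer", CALL, "is", "42", "<eos>"]
    return order[len(ctx)]

def llm(ctx):
    return "really"

text, frac = relay_decode(slm, llm, [])
```

在这个5个token的玩具输出中只有1个来自大模型,对应摘要中"仅1.07%的token调用LLM"这一思路的微缩版。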


【3】Code-Mix Sentiment Analysis on Hinglish Tweets
标题:印度式英语推文的代码组合情绪分析
链接:https://arxiv.org/abs/2601.05091

作者:Aashi Garg,Aneshya Das,Arshi Arya,Anushka Goyal,Aditi
备注:Accepted at the 9th International Conference on Natural Language Processing and Information Retrieval (NLPIR 2025), Fukuoka, Japan
摘要:印度品牌监测的有效性越来越受到印度式英语(印地语和英语的混合体)兴起的挑战,这种英语广泛用于Twitter等平台上的用户生成内容。为单语数据构建的传统自然语言处理(NLP)模型通常无法解释这种代码混合语言的语法和语义复杂性,导致不准确的情绪分析和误导性的市场见解。为了解决这一差距,我们提出了一个高性能的情感分类框架,专门为印度式英语的推文。我们的方法微调mBERT(多语言BERT),利用其多语言功能更好地了解印度社交媒体的语言多样性。我们的方法的一个关键组成部分是使用子词标记化,这使得该模型能够有效地管理拼写变化,俚语,和在罗马化的印度式英语中常见的词汇表外的术语。这项研究为品牌情感跟踪提供了一个生产就绪的人工智能解决方案,并为低资源,代码混合环境中的多语言NLP建立了一个强大的基准。
摘要:The effectiveness of brand monitoring in India is increasingly challenged by the rise of Hinglish--a hybrid of Hindi and English--used widely in user-generated content on platforms like Twitter. Traditional Natural Language Processing (NLP) models, built for monolingual data, often fail to interpret the syntactic and semantic complexity of this code-mixed language, resulting in inaccurate sentiment analysis and misleading market insights. To address this gap, we propose a high-performance sentiment classification framework specifically designed for Hinglish tweets. Our approach fine-tunes mBERT (Multilingual BERT), leveraging its multilingual capabilities to better understand the linguistic diversity of Indian social media. A key component of our methodology is the use of subword tokenization, which enables the model to effectively manage spelling variations, slang, and out-of-vocabulary terms common in Romanized Hinglish. This research delivers a production-ready AI solution for brand sentiment tracking and establishes a strong benchmark for multilingual NLP in low-resource, code-mixed environments.
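摘要中提到的子词标记化(BERT系模型使用的WordPiece)采用贪心最长匹配:从左到右尽量匹配词表中最长的片段,非首片段加 "##" 前缀,完全失配则回退为 [UNK]。以下用一个虚构的玩具词表(并非mBERT真实词表)示意它如何吸收罗马化印地语的拼写变体:

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword tokenization, as used by
    BERT-style models; the vocabulary here is a toy stand-in."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub          # continuation-piece prefix
            if sub in vocab:
                piece = sub
                break
            end -= 1                      # shrink and retry
        if piece is None:
            return [unk]
        pieces.append(piece)
        start = end
    return pieces

# a Romanized-Hinglish spelling variant splits into known pieces
vocab = {"bahut", "acha", "##a", "ach", "##ha", "##aa"}
tokens = wordpiece("achaa", vocab)
```

像 "achaa" 这样拉长元音的俚语拼写被拆成已知片段而非整体落入 [UNK],这正是该模型能稳健处理词表外变体的原因。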


【4】Milestones over Outcome: Unlocking Geometric Reasoning with Sub-Goal Verifiable Reward
标题:成果的里程碑:通过子目标可验证奖励解锁几何推理
链接:https://arxiv.org/abs/2601.05073

作者:Jianlong Chen,Daocheng Fu,Shengze Xu,Jiawei Chen,Yuan Feng,Yue Yang,Junchi Yan,Hongyuan Zha,Renqiu Xia
摘要 :多模态大型语言模型(MLLM)在复杂的几何推理中苦苦挣扎,这主要是因为基于“黑箱”结果的监督无法区分幸运猜测和严格演绎。为了解决这一问题,我们引入了一个范式转变为子目标级的评估和学习。我们首先构建GeoGoal,通过严格的形式化验证数据引擎,将抽象证明转换为可验证的数字子目标的基准综合。这种结构揭示了推理质量和结果准确性之间的关键分歧。利用这一点,我们提出了子目标可验证奖励(SGVR)框架,该框架基于骨架率用密集奖励取代稀疏信号。实验表明,SGVR不仅提高了几何性能(+9.7%),而且还表现出很强的泛化能力,将收益转移到一般数学(+8.0%)和其他一般推理任务(+2.8%),证明了在不同领域的广泛适用性。
摘要:Multimodal Large Language Models (MLLMs) struggle with complex geometric reasoning, largely because "black box" outcome-based supervision fails to distinguish between lucky guesses and rigorous deduction. To address this, we introduce a paradigm shift towards subgoal-level evaluation and learning. We first construct GeoGoal, a benchmark synthesized via a rigorous formal verification data engine, which converts abstract proofs into verifiable numeric subgoals. This structure reveals a critical divergence between reasoning quality and outcome accuracy. Leveraging this, we propose the Sub-Goal Verifiable Reward (SGVR) framework, which replaces sparse signals with dense rewards based on the Skeleton Rate. Experiments demonstrate that SGVR not only enhances geometric performance (+9.7%) but also exhibits strong generalization, transferring gains to general math (+8.0%) and other general reasoning tasks (+2.8%), demonstrating broad applicability across diverse domains.
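SGVR中基于"骨架率"的密集奖励可以示意为:解答复现的可验证数值子目标所占比例,从而把"碰巧答对"与"逐步推对"区分开。子目标字典与数值容差均为假设:

```python
def skeleton_rate(subgoals, predicted):
    """Fraction of verifiable numeric subgoals the solution reproduces;
    used here as a dense reward instead of a 0/1 final-answer check."""
    if not subgoals:
        return 0.0
    hits = sum(1 for name, value in subgoals.items()
               if name in predicted and abs(predicted[name] - value) < 1e-6)
    return hits / len(subgoals)

gold = {"angle_ABC": 40.0, "len_BD": 3.0, "area": 12.0}
good = {"angle_ABC": 40.0, "len_BD": 3.0, "area": 12.0}
lucky = {"area": 12.0}                       # right answer, no derivation
r_good, r_lucky = skeleton_rate(gold, good), skeleton_rate(gold, lucky)
```

纯结果奖励会给两份解答同样的分数,而骨架率只给完整推导满分,这正是摘要所说"黑箱结果监督无法区分幸运猜测与严格演绎"的补救。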


【5】HMVI: Unifying Heterogeneous Attributes with Natural Neighbors for Missing Value Inference
标题:HMVI:将异类属性与自然邻居统一起来以实现缺失值推理
链接:https://arxiv.org/abs/2601.05017

作者:Xiaopeng Luo,Zexi Tan,Zhuowei Wang
备注:Submitted to ICASSP 2026
摘要:缺失值插补是机器智能中的一个基本挑战,严重依赖于数据的完整性。目前的插补方法往往处理数值和分类属性独立,忽略了异质功能之间的关键相互依赖性。为了解决这些局限性,我们提出了一种新的插补方法,明确模型跨类型的功能依赖在一个统一的框架。我们的方法利用完整和不完整的实例,以确保准确和一致的插补表格数据。大量的实验结果表明,所提出的方法实现了优于现有技术的性能,并显着增强了下游机器学习任务,为现实世界的系统与丢失的数据提供了一个强大的解决方案。
摘要:Missing value imputation is a fundamental challenge in machine intelligence, heavily dependent on data completeness. Current imputation methods often handle numerical and categorical attributes independently, overlooking critical interdependencies among heterogeneous features. To address these limitations, we propose a novel imputation approach that explicitly models cross-type feature dependencies within a unified framework. Our method leverages both complete and incomplete instances to ensure accurate and consistent imputation in tabular data. Extensive experimental results demonstrate that the proposed approach achieves superior performance over existing techniques and significantly enhances downstream machine learning tasks, providing a robust solution for real-world systems with missing data.
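标题中的"自然邻居"通常指这样一种无参数邻域:逐步增大k,直到每个点都出现在其他点的k近邻表中(即拥有反向近邻),从而免去为插补邻域手工选k。以下为这一经典搜索过程的示意(O(n^2)暴力实现,并非论文原算法):

```python
def natural_neighbor_k(points):
    """Natural-neighbor sketch: grow k until every point appears in some
    other point's k-nearest-neighbor list (i.e., has a reverse neighbor)."""
    def knn(i, k):
        order = sorted(range(len(points)),
                       key=lambda j: sum((a - b) ** 2
                                         for a, b in zip(points[i], points[j])))
        return [j for j in order if j != i][:k]

    n = len(points)
    for k in range(1, n):
        has_reverse = [False] * n
        for i in range(n):
            for j in knn(i, k):
                has_reverse[j] = True
        if all(has_reverse):
            return k
    return n - 1

# three clustered points plus one outlier force k to grow until it is covered
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
k_nat = natural_neighbor_k(pts)
```

离群点迫使k增长到3才被任何邻居表覆盖,这说明该邻域定义会自适应数据的局部密度,适合作为混合类型插补的邻域基础。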


【6】On the Definition and Detection of Cherry-Picking in Counterfactual Explanations
标题:论反事实解释中择优挑选(Cherry-Picking)的定义与检测
链接:https://arxiv.org/abs/2601.04977

作者:James Hinns,Sofie Goethals,Stephan Van der Veeken,Theodoros Evgeniou,David Martens
摘要:反事实解释被广泛用于说明输入需要如何改变才能使模型改变其预测。对单个实例而言,可能存在许多有效的反事实,这使得解释提供者有机会择优挑选更符合其既定叙事的解释:突出有利的行为,并隐瞒揭示问题行为的例子。我们基于由生成程序指定的可接受解释空间和一个效用函数,正式定义了反事实解释中的择优挑选。随后,我们研究外部审计者在何种程度上能够发现这种操纵。考虑对解释过程的三个访问层级,即完全程序访问、部分程序访问和仅解释访问,我们表明在实践中检测能力极为有限。即使拥有完全程序访问权限,被择优挑选的解释仍然难以与未被挑选的解释区分,因为有效反事实的多样性以及解释规范的灵活性提供了足以掩盖蓄意选择的自由度。在实证上,我们证明这种变异性在接近性、合理性和稀疏性等标准反事实质量指标上往往超过择优挑选本身的影响,使被挑选的解释在统计上与基线解释无法区分。因此,我们认为保障措施应优先考虑可复现性、标准化和程序性约束,而非事后检测,并为算法开发者、解释提供者和审计者给出了建议。
摘要:Counterfactual explanations are widely used to communicate how inputs must change for a model to alter its prediction. For a single instance, many valid counterfactuals can exist, which leaves open the possibility for an explanation provider to cherry-pick explanations that better suit a narrative of their choice, highlighting favourable behaviour and withholding examples that reveal problematic behaviour. We formally define cherry-picking for counterfactual explanations in terms of an admissible explanation space, specified by the generation procedure, and a utility function. We then study to what extent an external auditor can detect such manipulation. Considering three levels of access to the explanation process: full procedural access, partial procedural access, and explanation-only access, we show that detection is extremely limited in practice. Even with full procedural access, cherry-picked explanations can remain difficult to distinguish from non cherry-picked explanations, because the multiplicity of valid counterfactuals and flexibility in the explanation specification provide sufficient degrees of freedom to mask deliberate selection. Empirically, we demonstrate that this variability often exceeds the effect of cherry-picking on standard counterfactual quality metrics such as proximity, plausibility, and sparsity, making cherry-picked explanations statistically indistinguishable from baseline explanations. We argue that safeguards should therefore prioritise reproducibility, standardisation, and procedural constraints over post-hoc detection, and we provide recommendations for algorithm developers, explanation providers, and auditors.
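The paper's definition (an admissible explanation space plus a provider utility function) can be sketched as a selection rule: a cherry-picking provider returns the admissible counterfactual that maximizes its own utility, while a neutral rule might return the closest one. The counterfactuals and utilities below are invented for illustration only.

```python
def cherry_pick(admissible_cfs, provider_utility):
    """Select the explanation the provider most wants the auditor to see."""
    return max(admissible_cfs, key=provider_utility)

def honest_pick(admissible_cfs, distance):
    """A neutral baseline rule: return the minimally distant counterfactual."""
    return min(admissible_cfs, key=distance)

# Each counterfactual: (description, distance-to-instance, favourability to provider)
cfs = [("raise income", 0.2, 0.1),
       ("change zip code", 0.9, 0.9),   # fits the provider's narrative, but far away
       ("reduce debt", 0.3, 0.4)]

picked = cherry_pick(cfs, provider_utility=lambda c: c[2])
neutral = honest_pick(cfs, distance=lambda c: c[1])
# Both selections are valid counterfactuals, which is exactly why an
# explanation-only auditor struggles to detect the manipulation.
print(picked[0], "vs", neutral[0])
```

Since both outputs live in the same admissible space, nothing in a single released explanation reveals which rule produced it, matching the paper's detection-hardness argument.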


【7】Higher-Order Knowledge Representations for Agentic Scientific Reasoning
标题:面向智能体科学推理的高阶知识表示
链接:https://arxiv.org/abs/2601.04878

作者:Isabella A. Stewart,Markus J. Buehler
摘要 :科学探究需要系统级的推理,将异质的实验数据、跨领域知识和机械证据整合到连贯的解释中。虽然大型语言模型(LLM)提供推理能力,但它们通常依赖于缺乏结构深度的检索增强上下文。传统知识图(KG)试图弥合这一差距,但它们的成对约束未能捕捉到支配涌现物理行为的不可约高阶相互作用。为了解决这个问题,我们引入了一种方法来构建基于超图的知识表示,忠实地编码多实体关系。应用于约1,100篇关于生物复合材料支架的手稿的语料库,我们的框架构建了一个161,172个节点和320,201个超边的全局超图,揭示了围绕高度连接的概念中心组织的无标度拓扑(幂律指数约1.23)。这种表示法防止了成对展开式的组合爆炸,并明确地保留了科学公式的同现上下文。我们进一步证明,配备代理系统与超图遍历工具,特别是使用节点交叉约束,使他们能够桥接语义上遥远的概念。通过利用这些高阶途径,该系统成功地为新型复合材料产生了基础的机理假设,例如通过壳聚糖中间体将氧化铈连接到PCL支架上。这项工作建立了一个“无教师”的代理推理系统,超图拓扑结构作为一个可验证的护栏,通过揭示传统的图形方法掩盖的关系,加速科学发现。
摘要:Scientific inquiry requires systems-level reasoning that integrates heterogeneous experimental data, cross-domain knowledge, and mechanistic evidence into coherent explanations. While Large Language Models (LLMs) offer inferential capabilities, they often depend on retrieval-augmented contexts that lack structural depth. Traditional Knowledge Graphs (KGs) attempt to bridge this gap, yet their pairwise constraints fail to capture the irreducible higher-order interactions that govern emergent physical behavior. To address this, we introduce a methodology for constructing hypergraph-based knowledge representations that faithfully encode multi-entity relationships. Applied to a corpus of ~1,100 manuscripts on biocomposite scaffolds, our framework constructs a global hypergraph of 161,172 nodes and 320,201 hyperedges, revealing a scale-free topology (power law exponent ~1.23) organized around highly connected conceptual hubs. This representation prevents the combinatorial explosion typical of pairwise expansions and explicitly preserves the co-occurrence context of scientific formulations. We further demonstrate that equipping agentic systems with hypergraph traversal tools, specifically using node-intersection constraints, enables them to bridge semantically distant concepts. By exploiting these higher-order pathways, the system successfully generates grounded mechanistic hypotheses for novel composite materials, such as linking cerium oxide to PCL scaffolds via chitosan intermediates. This work establishes a "teacherless" agentic reasoning system where hypergraph topology acts as a verifiable guardrail, accelerating scientific discovery by uncovering relationships obscured by traditional graph methods.
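The node-intersection traversal described in the abstract can be sketched with hyperedges as node sets: two hyperedges are adjacent when they share at least one node, and a bridge between semantically distant concepts is a chain of intersecting hyperedges. The tiny hypergraph below loosely mirrors the cerium-oxide, chitosan, PCL example but is invented here for illustration.

```python
from collections import deque

def bridge(hyperedges, src, dst):
    """BFS over hyperedges; returns one chain of hyperedge indices linking src to dst."""
    start = [i for i, e in enumerate(hyperedges) if src in e]
    queue = deque((i, [i]) for i in start)
    seen = set(start)
    while queue:
        i, path = queue.popleft()
        if dst in hyperedges[i]:
            return path
        for j, e in enumerate(hyperedges):
            if j not in seen and hyperedges[i] & e:  # node-intersection constraint
                seen.add(j)
                queue.append((j, path + [j]))
    return None  # no chain of intersecting hyperedges connects the two concepts

H = [{"cerium oxide", "antioxidant", "chitosan"},
     {"chitosan", "PCL", "scaffold"},
     {"collagen", "gelatin"}]
print(bridge(H, "cerium oxide", "PCL"))  # chain through the shared chitosan node
```

Because each hyperedge keeps its full co-occurrence context, the returned chain is itself a candidate mechanistic pathway, which is the "verifiable guardrail" role the abstract assigns to the topology.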


【8】Sci-Reasoning: A Dataset Decoding AI Innovation Patterns
标题:科学推理:解码人工智能创新模式的数据集
链接:https://arxiv.org/abs/2601.04577

作者:Jiachen Liu,Maestro Harmon,Zechen Zhang
备注:22 pages, 9 figures
摘要:虽然人工智能创新正在迅速加速,但突破背后的智力过程,即研究人员如何识别差距、综合先前工作并产生洞见,仍然知之甚少。科学推理结构化数据的缺乏阻碍了对AI研究智能体的系统分析与开发。我们介绍了Sci-Reasoning,这是第一个捕捉高质量AI研究背后智力综合过程的数据集。借助社区验证的质量信号和LLM加速、人工验证的流程,我们将NeurIPS、ICML和ICLR(2023-2025)的Oral与Spotlight论文追溯到其关键前驱工作,并以结构化格式阐明具体的推理链接。我们的分析识别出15种不同的思维模式,其中三种主导策略占52.7%:差距驱动的重构(24.2%)、跨领域综合(18.0%)和表征转换(10.5%)。最强大的创新配方组合了多种模式:差距驱动的重构+表征转换、跨领域综合+表征转换,以及差距驱动的重构+跨领域综合。该数据集支持对科学进展的定量研究,并为训练下一代AI研究智能体提供结构化的推理轨迹。
摘要:While AI innovation accelerates rapidly, the intellectual process behind breakthroughs -- how researchers identify gaps, synthesize prior work, and generate insights -- remains poorly understood. The lack of structured data on scientific reasoning hinders systematic analysis and development of AI research agents. We introduce Sci-Reasoning, the first dataset capturing the intellectual synthesis behind high-quality AI research. Using community-validated quality signals and an LLM-accelerated, human-verified pipeline, we trace Oral and Spotlight papers across NeurIPS, ICML, and ICLR (2023-2025) to its key predecessors, articulating specific reasoning links in a structured format. Our analysis identifies 15 distinct thinking patterns, with three dominant strategies accounting for 52.7%: Gap-Driven Reframing (24.2%), Cross-Domain Synthesis (18.0%), and Representation Shift (10.5%). The most powerful innovation recipes combine multiple patterns: Gap-Driven Reframing + Representation Shift, Cross-Domain Synthesis + Representation Shift, and Gap-Driven Reframing + Cross-Domain Synthesis. This dataset enables quantitative studies of scientific progress and provides structured reasoning trajectories for training the next generation AI research agents.


【9】SampoNLP: A Self-Referential Toolkit for Morphological Analysis of Subword Tokenizers
标题:SampoNLP:一个用于子词令牌器形态学分析的自我参考工具包
链接:https://arxiv.org/abs/2601.04469

作者:Iaroslav Chelombitko,Ekaterina Chelombitko,Aleksey Komissarov
备注:Accepted to the 10th International Workshop on Computational Linguistics for Uralic Languages (IWCLUL 2025), pp. 57-67
摘要:子词分词的质量对大型语言模型至关重要,然而由于缺乏干净的语素词典,对形态丰富的乌拉尔语言的分词器进行评估一直受阻。我们介绍SampoNLP,一个无需语料库的形态词典构建工具包,它使用受MDL启发的自指原子性评分,通过内部结构线索过滤复合词形,适合低资源场景。使用SampoNLP为芬兰语、匈牙利语和爱沙尼亚语生成的高纯度词典,我们对一系列词汇量(8k-256k)的BPE分词器进行了系统评估。我们提出了一个统一指标,即综合性能得分(IPS),用于权衡语素覆盖与过度切分。通过分析IPS曲线,我们确定了收益递减的"拐点",并首次给出了这些语言最优词汇量(k)的有经验依据的建议。我们的研究不仅提供了实践指导,还定量地证明了标准BPE在高度粘着语言上的局限性。SampoNLP库及所有生成的资源均已公开:https://github.com/AragonerUA/SampoNLP
摘要 :The quality of subword tokenization is critical for Large Language Models, yet evaluating tokenizers for morphologically rich Uralic languages is hampered by the lack of clean morpheme lexicons.   We introduce SampoNLP, a corpus-free toolkit for morphological lexicon creation using MDL-inspired Self-Referential Atomicity Scoring, which filters composite forms through internal structural cues - suited for low-resource settings.   Using the high-purity lexicons generated by SampoNLP for Finnish, Hungarian, and Estonian, we conduct a systematic evaluation of BPE tokenizers across a range of vocabulary sizes (8k-256k). We propose a unified metric, the Integrated Performance Score (IPS), to navigate the trade-off between morpheme coverage and over-splitting. By analyzing the IPS curves, we identify the "elbow points" of diminishing returns and provide the first empirically grounded recommendations for optimal vocabulary sizes (k) in these languages. Our study not only offers practical guidance but also quantitatively demonstrates the limitations of standard BPE for highly agglutinative languages. The SampoNLP library and all generated resources are made publicly available: https://github.com/AragonerUA/SampoNLP


【10】Explainable Admission-Level Predictive Modeling for Prolonged Hospital Stay in Elderly Populations: Challenges in Low- and Middle-Income Countries
标题:老年人住院时间延长的可解释入院水平预测模型:低收入和中等收入国家的挑战
链接:https://arxiv.org/abs/2601.04449

作者:Daniel Sierra-Botero,Ana Molina-Taborda,Leonardo Espinosa-Leal,Alexander Karpenko,Alejandro Hernandez,Olga Lopez-Acevedo
备注:23 pages, 6 figures
摘要:住院时间延长(pLoS)是与院内不良事件风险相关的重要因素。我们利用入院层面的患者与医院管理数据,开发并解释了一个pLoS预测模型。该方法包含一种特征选择流程:选取信息值最高的互不相关特征,并利用特征的证据权重(WoE)在图论意义下的团(clique)中选取代表。这项预后研究分析了2017年1月至2022年3月期间Hospital Alma Mater de Antioquia的120,354条住院记录。经清洗后,数据集被划分为训练(67%)、测试(22%)和验证(11%)队列。我们训练了一个逻辑回归模型,将pLoS预测为两类:小于或大于7天。模型性能使用准确率、精确率、灵敏度、特异度和AUC-ROC指标进行评估。特征选择方法返回九个可解释变量,增强了模型的透明度。在验证队列中,pLoS模型的特异度为0.83(95% CI,0.82-0.84),灵敏度为0.64(95% CI,0.62-0.65),准确率为0.76(95% CI,0.76-0.77),精确率为0.67(95% CI,0.66-0.69),AUC-ROC为0.82(95% CI,0.81-0.83)。该模型具有较强的预测性能,并揭示了影响住院时间延长的因素,使其成为医院管理以及设计旨在减少pLoS的未来干预研究的宝贵工具。
摘要:Prolonged length of stay (pLoS) is a significant factor associated with the risk of adverse in-hospital events. We develop and explain a predictive model for pLos using admission-level patient and hospital administrative data. The approach includes a feature selection method by selecting non-correlated features with the highest information value. The method uses features weights of evidence to select a representative within cliques from graph theory. The prognosis study analyzed the records from 120,354 hospital admissions at the Hospital Alma Mater de Antioquia between January 2017 and March 2022. After a cleaning process the dataset was split into training (67%), test (22%), and validation (11%) cohorts. A logistic regression model was trained to predict the pLoS in two classes: less than or greater than 7 days. The performance of the model was evaluated using accuracy, precision, sensitivity, specificity, and AUC-ROC metrics. The feature selection method returns nine interpretable variables, enhancing the models' transparency. In the validation cohort, the pLoS model achieved a specificity of 0.83 (95% CI, 0.82-0.84), sensitivity of 0.64 (95% CI, 0.62-0.65), accuracy of 0.76 (95% CI, 0.76-0.77), precision of 0.67 (95% CI, 0.66-0.69), and AUC-ROC of 0.82 (95% CI, 0.81-0.83). The model exhibits strong predictive performance and offers insights into the factors that influence prolonged hospital stays. This makes it a valuable tool for hospital management and for developing future intervention studies aimed at reducing pLoS.
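The reported metrics all derive from the binary confusion matrix (positive = stay longer than 7 days); a minimal sketch of their definitions, on made-up labels:

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, sensitivity (recall) and specificity from 0/1 labels."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    tn = sum(not t and not p for t, p in zip(y_true, y_pred))
    fp = sum(not t and p for t, p in zip(y_true, y_pred))
    fn = sum(t and not p for t, p in zip(y_true, y_pred))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }

# Illustrative labels only; 1 = prolonged stay (> 7 days).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
m = binary_metrics(y_true, y_pred)
print(m)
```

The abstract's pattern of high specificity (0.83) with lower sensitivity (0.64) corresponds to a model that rarely flags short stays as prolonged but misses a share of the truly prolonged ones.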


【11】Aligned explanations in neural networks
标题:神经网络中的一致解释
链接:https://arxiv.org/abs/2601.04378

作者:Corentin Lobet,Francesca Chiaromonte
摘要:特征归因是解释深度神经网络的主要范式。然而,大多数现有的方法只是松散地反映了模型的预测过程,从而仅仅是白画黑盒子。我们认为,解释对齐是预测任务中可信度的一个关键方面:解释必须直接与预测相关联,而不是作为事后合理化。我们将模型可读性作为实现对齐的设计原则,并将PiNets作为在深度学习环境中追求它的建模框架。PiNet是伪线性网络,在任意特征空间中生成实例线性预测,使其线性可读。我们展示了它们在图像分类和分割任务中的应用,展示了PiNets如何产生除了对齐之外的多个标准的忠实解释。
摘要:Feature attribution is the dominant paradigm for explaining deep neural networks. However, most existing methods only loosely reflect the model's prediction-making process, thereby merely white-painting the black box. We argue that explanatory alignment is a key aspect of trustworthiness in prediction tasks: explanations must be directly linked to predictions, rather than serving as post-hoc rationalizations. We present model readability as a design principle enabling alignment, and PiNets as a modeling framework to pursue it in a deep learning context. PiNets are pseudo-linear networks that produce instance-wise linear predictions in an arbitrary feature space, making them linearly readable. We illustrate their use on image classification and segmentation tasks, demonstrating how PiNets produce explanations that are faithful across multiple criteria in addition to alignment.


【12】Green MLOps: Closed-Loop, Energy-Aware Inference with NVIDIA Triton, FastAPI, and Bio-Inspired Thresholding
标题:绿色MLOps:基于NVIDIA Triton、FastAPI与受生物启发阈值控制的闭环能量感知推理
链接:https://arxiv.org/abs/2601.04250

作者:Mustapha Hamdi,Mourad Jabou
备注:6 pages, 4 figures. Code available at: https://github.com/InnoDeep-repos/Green_MLOps
摘要:能效是人工智能部署中的首要问题,因为长期运行的推理在累积碳影响上可能超过训练。我们提出了一个受生物启发的框架,将蛋白质折叠的能量盆地映射到推理成本地形,并通过一个随时间衰减的闭环阈值控制执行。只有当预期的效用-能量权衡有利(在低边际能量和低拥塞下具有高置信度/效用)时才接纳请求,使系统偏向第一个可接受的局部盆地,而不是追逐代价高昂的全局最小值。我们在RTX 4000 Ada GPU上评估了经由FastAPI配合ONNX Runtime和NVIDIA Triton提供服务的DistilBERT与ResNet-18。消融研究表明,与标准开环执行相比,生物控制器将处理时间减少了42%(A100测试集上为0.50s vs 0.29s),精度损失极小(<0.5%)。此外,我们划定了轻量级本地服务(ORT)与托管批处理(Triton)之间的效率边界。这些结果将生物物理能量模型与绿色MLOps联系起来,为生产环境中的闭环能量感知推理提供了实用且可审计的基础。
摘要 :Energy efficiency is a first-order concern in AI deployment, as long-running inference can exceed training in cumulative carbon impact. We propose a bio-inspired framework that maps protein-folding energy basins to inference cost landscapes and controls execution via a decaying, closed-loop threshold. A request is admitted only when the expected utility-to-energy trade-off is favorable (high confidence/utility at low marginal energy and congestion), biasing operation toward the first acceptable local basin rather than pursuing costly global minima. We evaluate DistilBERT and ResNet-18 served through FastAPI with ONNX Runtime and NVIDIA Triton on an RTX 4000 Ada GPU. Our ablation study reveals that the bio-controller reduces processing time by 42% compared to standard open-loop execution (0.50s vs 0.29s on A100 test set), with a minimal accuracy degradation (<0.5%). Furthermore, we establish the efficiency boundaries between lightweight local serving (ORT) and managed batching (Triton). The results connect biophysical energy models to Green MLOps and offer a practical, auditable basis for closed-loop energy-aware inference in production.
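The decaying closed-loop threshold can be sketched in a few lines: a request runs only while its expected utility-to-energy ratio clears a threshold that decays as the request waits. The constants and function names below are illustrative, not taken from the paper.

```python
def admit(confidence, energy_cost_j, wait_steps, theta0=2.0, decay=0.8):
    """Admit a request when utility per joule clears a threshold that
    decays as the request waits (theta0 and decay are illustrative)."""
    threshold = theta0 * (decay ** wait_steps)
    return (confidence / energy_cost_j) >= threshold

# A marginal request is rejected while the threshold is high, then admitted
# once the threshold has decayed: execution settles into the first
# acceptable energy basin instead of holding out for a global optimum.
print(admit(confidence=0.9, energy_cost_j=1.0, wait_steps=0))  # False: 0.9 < 2.0
print(admit(confidence=0.9, energy_cost_j=1.0, wait_steps=4))  # True: 0.9 >= ~0.82
```

The decay term is what makes the controller closed-loop: congestion feedback raises effective waiting, which lowers the bar, trading a little utility for large energy savings.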


【13】Neural Algorithmic Reasoning for Approximate $k$-Coloring with Recursive Warm Starts
标题:具有递归热启动的近似$k$-着色的神经算法推理
链接:https://arxiv.org/abs/2601.05137

作者:Knut Vanderbush,Melanie Weber
备注:33 pages, 10 figures
摘要:节点着色是指为图的节点分配颜色,使得任意两个相邻节点颜色不同,同时使用尽可能少的颜色。它是图着色中研究最广泛的实例,在图论中具有核心地位;主要成果包括四色定理以及关于Hadwiger-Nelson问题的工作。作为调度、资源分配等经典组合优化任务的抽象,它也有丰富的实际应用。在这里,我们关注一个松弛版本:近似$k$-着色,其任务是为图的节点分配至多$k$种颜色,使得两端颜色相同的边的数量近似最小。虽然经典方法利用数学规划或SAT求解器,但近期研究已开始探索机器学习的使用。我们沿此路线,探索使用图神经网络(GNN)进行节点着色。我们首先提出一个优化的可微算法,通过正交节点特征初始化和一个新的损失函数改进了Schuetz等人先前的方法;该损失函数在冲突边的端点度数更高时施加更重的惩罚,其灵感来自经典结论:一个图是$k$-可着色的当且仅当它的$k$-核是$k$-可着色的。接着,我们介绍了一个轻量级的贪心局部搜索算法,并证明可以通过递归计算$(k-1)$-着色作为热启动来改进它。我们进而表明,将这种递归热启动应用于GNN方法会带来进一步的改进。在一系列不同图结构上的数值实验表明,虽然局部搜索算法在小规模输入上表现最好,但GNN在大规模上表现更优。递归热启动对于图着色之外的组合优化局部搜索方法也可能具有独立的意义。
摘要:Node coloring is the task of assigning colors to the nodes of a graph such that no two adjacent nodes have the same color, while using as few colors as possible. It is the most widely studied instance of graph coloring and of central importance in graph theory; major results include the Four Color Theorem and work on the Hadwiger-Nelson Problem. As an abstraction of classical combinatorial optimization tasks, such as scheduling and resource allocation, it is also rich in practical applications. Here, we focus on a relaxed version, approximate $k$-coloring, which is the task of assigning at most $k$ colors to the nodes of a graph such that the number of edges whose vertices have the same color is approximately minimized. While classical approaches leverage mathematical programming or SAT solvers, recent studies have explored the use of machine learning. We follow this route and explore the use of graph neural networks (GNNs) for node coloring. We first present an optimized differentiable algorithm that improves a prior approach by Schuetz et al. with orthogonal node feature initialization and a loss function that penalizes conflicting edges more heavily when their endpoints have higher degree; the latter inspired by the classical result that a graph is $k$-colorable if and only if its $k$-core is $k$-colorable. Next, we introduce a lightweight greedy local search algorithm and show that it may be improved by recursively computing a $(k-1)$-coloring to use as a warm start. We then show that applying such recursive warm starts to the GNN approach leads to further improvements. Numerical experiments on a range of different graph structures show that while the local search algorithms perform best on small inputs, the GNN exhibits superior performance at scale. The recursive warm start may be of independent interest beyond graph coloring for local search methods for combinatorial optimization.
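The degree-weighted loss idea can be made concrete: soft color assignments are probability vectors, and each edge contributes the probability that its endpoints share a color, scaled here by the sum of the endpoint degrees (the exact weighting in the paper may differ).

```python
from collections import defaultdict

def degree_weighted_conflict_loss(edges, probs):
    """Soft conflict loss: P(endpoints share a color), weighted by endpoint degrees."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    loss = 0.0
    for u, v in edges:
        same_color = sum(pu * pv for pu, pv in zip(probs[u], probs[v]))
        loss += (deg[u] + deg[v]) * same_color
    return loss

# Triangle graph with k = 2 colors: any hard coloring leaves one conflicting edge.
edges = [(0, 1), (1, 2), (0, 2)]
hard = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 0.0]}   # nodes 0 and 2 clash
uniform = {i: [0.5, 0.5] for i in range(3)}

lh = degree_weighted_conflict_loss(edges, hard)
lu = degree_weighted_conflict_loss(edges, uniform)
print(lh, lu)
```

Because the loss is differentiable in the probability vectors, a GNN can output `probs` directly and be trained by gradient descent, with the degree weighting pushing conflicts toward low-degree regions, in line with the $k$-core intuition.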


检测相关(1篇)

【1】Smart IoT-Based Wearable Device for Detection and Monitoring of Common Cow Diseases Using a Novel Machine Learning Technique
标题:基于物联网的智能可穿戴设备使用新型机器学习技术检测和监测常见奶牛疾病
链接:https://arxiv.org/abs/2601.04761

作者:Rupsa Rani Mishra,D. Chandrasekhar Rao,Ajaya Kumar Tripathy
摘要:人工观察和监测个体奶牛以进行疾病检测在大规模养殖操作中提出了重大挑战,因为该过程是劳动密集型的,耗时的,并且容易降低准确性。对人类观察的依赖往往导致识别症状的延迟,因为动物的数量会阻碍对每头奶牛的及时关注。因此,疾病检测的准确性和精确度受到严重影响,可能影响动物健康和整体农场生产力。此外,组织和管理人工观察和监测奶牛健康的人力资源是一项复杂且经济上要求高的任务。它需要熟练人员的参与,从而导致农场维护成本增加和运营效率低下。因此,开发自动化、低成本和可靠的智能系统对于有效应对这些挑战至关重要。虽然在这一领域已经进行了几项研究,但很少有同时考虑检测多种常见疾病的高预测精度。然而,物联网(IoT)、机器学习(ML)和网络物理系统的进步使奶牛健康监测自动化成为可能,提高了准确性并降低了运营成本。该研究提出了一个物联网网络物理系统框架,旨在监测奶牛的日常活动和健康状况。提出了一种新的机器学习算法,利用收集的生理和行为数据诊断常见的奶牛疾病。该算法旨在通过分析一组全面的生理和行为特征来预测多种疾病,从而实现准确有效的健康评估。
摘要 :Manual observation and monitoring of individual cows for disease detection present significant challenges in large-scale farming operations, as the process is labor-intensive, time-consuming, and prone to reduced accuracy. The reliance on human observation often leads to delays in identifying symptoms, as the sheer number of animals can hinder timely attention to each cow. Consequently, the accuracy and precision of disease detection are significantly compromised, potentially affecting animal health and overall farm productivity. Furthermore, organizing and managing human resources for the manual observation and monitoring of cow health is a complex and economically demanding task. It necessitates the involvement of skilled personnel, thereby contributing to elevated farm maintenance costs and operational inefficiencies. Therefore, the development of an automated, low-cost, and reliable smart system is essential to address these challenges effectively. Although several studies have been conducted in this domain, very few have simultaneously considered the detection of multiple common diseases with high prediction accuracy. However, advancements in Internet of Things (IoT), Machine Learning (ML), and Cyber-Physical Systems have enabled the automation of cow health monitoring with enhanced accuracy and reduced operational costs. This study proposes an IoT-enabled Cyber-Physical System framework designed to monitor the daily activities and health status of cow. A novel ML algorithm is proposed for the diagnosis of common cow diseases using collected physiological and behavioral data. The algorithm is designed to predict multiple diseases by analyzing a comprehensive set of recorded physiological and behavioral features, enabling accurate and efficient health assessment.


分类|识别(2篇)

【1】Enhancing Robustness of Asynchronous EEG-Based Movement Prediction using Classifier Ensembles
标题:使用分类器集成增强基于异步脑电的运动预测的鲁棒性
链接:https://arxiv.org/abs/2601.04286

作者:Niklas Kueper,Kartik Chari,Elsa Andrea Kirchner
摘要:目的:脑卒中是导致残疾的主要原因之一。一种有前景的方案是利用自主发起的机器人辅助运动疗法来延伸康复训练。为此,需要检测患者的运动意图以触发机器人设备的辅助。这种运动意图可以从人体表面脑电图(EEG)信号中检测到;然而,在线且异步地进行分类时,解码尤其具有挑战性。在这项工作中,我们研究了分类器集成和滑动窗口后处理技术在提升这种异步分类鲁棒性方面的有效性。方法:为考察分类器集成和滑动窗口后处理的有效性,我们分析了14名健康受试者执行自主手臂运动的两个EEG数据集,并进行了离线和伪在线评估,比较支持向量机(SVM)、多层感知器(MLP)和EEGNet分类模型的集成组合。结果:伪在线评估结果表明,在后处理窗口数量最优时,双模型集成显著优于最佳单一模型。特别地,对单一模型而言,增加后处理窗口数量显著提升了分类性能。有趣的是,在离线评估中,最佳单一模型与分类器集成的性能之间没有显著差异。意义:我们证明了分类器集成和适当的后处理方法能有效提升从EEG信号中异步检测运动意图的能力。特别是,分类器集成方法在在线分类中带来的改进大于离线分类,并减少了误检,即早期假阳性。
摘要:Objective: Stroke is one of the leading causes of disabilities. One promising approach is to extend the rehabilitation with self-initiated robot-assisted movement therapy. To enable this, it is required to detect the patient's intention to move to trigger the assistance of a robotic device. This intention to move can be detected from human surface electroencephalography (EEG) signals; however, it is particularly challenging to decode when classifications are performed online and asynchronously. In this work, the effectiveness of classifier ensembles and a sliding-window postprocessing technique was investigated to enhance the robustness of such asynchronous classification. Approach: To investigate the effectiveness of classifier ensembles and a sliding-window postprocessing, two EEG datasets with 14 healthy subjects who performed self-initiated arm movements were analyzed. Offline and pseudo-online evaluations were conducted to compare ensemble combinations of the support vector machine (SVM), multilayer perceptron (MLP), and EEGNet classification models. Results: The results of the pseudo-online evaluation show that the two model ensembles significantly outperformed the best single model for the optimal number of postprocessing windows. In particular, for single models, an increased number of postprocessing windows significantly improved classification performances. Interestingly, we found no significant improvements between performances of the best single model and classifier ensembles in the offline evaluation. Significance: We demonstrated that classifier ensembles and appropriate postprocessing methods effectively enhance the asynchronous detection of movement intentions from EEG signals. In particular, the classifier ensemble approach yields greater improvements in online classification than in offline classification, and reduces false detections, i.e., early false positives.
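The sliding-window postprocessing can be sketched as follows: a movement intention is only reported when the last `n` per-window decisions agree, which suppresses isolated early false positives. The all-agree aggregation rule is one plausible choice, not necessarily the paper's exact one.

```python
from collections import deque

def postprocess(window_decisions, n=3):
    """Yield one fused decision per incoming per-window decision."""
    recent = deque(maxlen=n)
    fused = []
    for d in window_decisions:
        recent.append(d)
        # Fire only once n consecutive windows all vote "movement".
        fused.append(len(recent) == n and all(recent))
    return fused

# A single spurious positive (index 1) is filtered out, while the sustained
# run of positives near the end is detected.
raw = [0, 1, 0, 0, 1, 1, 1, 1]
fused = postprocess(raw, n=3)
print(fused)
```

The price of the filtering is a detection delay of up to `n - 1` windows, which matches the trade-off the study explores when choosing the number of postprocessing windows.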


【2】Comparison of Maximum Likelihood Classification Before and After Applying Weierstrass Transform
标题:应用维尔斯特拉斯变换前后最大似然分类的比较
链接:https://arxiv.org/abs/2601.04808

作者:Muhammad Shoaib,Zaka Ur Rehman,Muhammad Qasim
摘要:本文旨在通过定性与定量方法将最大似然(ML)分类应用于多光谱数据。最大似然法是一种基于经典贝叶斯定理的监督分类算法,它利用判别函数将像素分配给似然最高的类别。类均值向量和协方差矩阵是该函数的关键输入,可由特定类别的训练像素估计得到。最大似然法在应用于数据之前需要满足一些假设。本文比较应用Weierstrass变换前后最大似然分类(ML)的结果,并考察在高分辨率Quickbird卫星影像训练像素上的精度差异。主成分分析(PCA)被用于降维,也用于检查各波段的变化。结果表明,决策空间中类均值之间的分离是使用Weierstrass变换后最大似然(ML)分类精度高于未使用时的主要因素。
摘要:The aim of this paper is to apply Maximum Likelihood (ML) Classification to multispectral data by means of qualitative and quantitative approaches. Maximum Likelihood is a supervised classification algorithm based on the classical Bayes theorem. It makes use of a discriminant function to assign each pixel to the class with the highest likelihood. Class mean vectors and covariance matrices are the key inputs to the function and can be estimated from the training pixels of a particular class. Maximum Likelihood requires some assumptions to hold before it can be applied to the data. In this paper we compare the results of Maximum Likelihood (ML) Classification before and after applying the Weierstrass Transform, and examine the difference in accuracy on training pixels of a high-resolution Quickbird satellite image. Principal Component Analysis (PCA) is also used for dimension reduction and to check the variation across bands. The results show that the separation between class means in the decision space is the main factor that leads to the higher classification accuracy of Maximum Likelihood (ML) after using the Weierstrass Transform than without it.
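The Weierstrass transform is convolution with a Gaussian kernel; a minimal 1-D discrete sketch of the smoothing that would be applied to band values before classification (kernel width and edge handling are illustrative):

```python
import math

def weierstrass_smooth(signal, t=1.0, radius=3):
    """Discrete Weierstrass transform: convolve with a normalized Gaussian
    kernel proportional to exp(-x^2 / (4t))."""
    kernel = [math.exp(-(x * x) / (4 * t)) for x in range(-radius, radius + 1)]
    total = sum(kernel)
    kernel = [k / total for k in kernel]      # flat signals pass through unchanged
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, k in enumerate(kernel):
            idx = min(max(i + j - radius, 0), len(signal) - 1)  # clamp at the edges
            acc += k * signal[idx]
        out.append(acc)
    return out

noisy = [0.0, 0.0, 10.0, 0.0, 0.0]
sm = weierstrass_smooth(noisy)
print(sm)  # the isolated spike is spread over its neighbours
```

Smoothing pixel values this way suppresses isolated noise and, per the paper's argument, can increase the separation between class means in the decision space that Maximum Likelihood relies on.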


编码器(1篇)

【1】Illumination Angular Spectrum Encoding for Controlling the Functionality of Diffractive Networks
标题:控制折射网络功能的照明角谱编码
链接:https://arxiv.org/abs/2601.04825

作者:Matan Kleiner,Lior Michaeli,Tomer Michaeli
备注:Project's code https://github.com/matankleiner/Angular-Spectrum-Encoding
摘要:衍射神经网络最近成为一个有前途的全光计算框架。然而,这些网络通常只针对单一任务训练,限制了它们在需要多种功能的系统中的应用潜力。现有实现多任务功能的方法要么针对每个任务修改网络的机械配置,要么为每个任务使用不同的照明波长或偏振态。在这项工作中,我们提出一种基于照明角谱的新控制机制。具体而言,我们使用选择性控制其角谱的振幅掩模来塑造照明。我们为不同的网络功能采用不同的照明掩模,使掩模充当独特的任务编码器。有趣的是,我们表明在近轴区域内一个非常窄的角度范围即可实现有效控制。我们通过训练单个衍射网络执行多个图像到图像翻译任务,对所提方法进行了数值演示。特别地,我们演示了将手写数字翻译成不同数值的印刷体数字,以及将手写英文字母翻译成印刷体数字和印刷体希腊字母,其中输出的类型由照明的角度分量决定。如我们所示,所提框架可以在不同相干条件下工作,并且可以与现有控制策略(例如不同波长)相结合。我们的结果确立了照明角谱作为控制衍射网络的一个强大自由度,为多任务全光计算提供了可扩展且通用的框架。
摘要:Diffractive neural networks have recently emerged as a promising framework for all-optical computing. However, these networks are typically trained for a single task, limiting their potential adoption in systems requiring multiple functionalities. Existing approaches to achieving multi-task functionality either modify the mechanical configuration of the network per task or use a different illumination wavelength or polarization state for each task. In this work, we propose a new control mechanism, which is based on the illumination's angular spectrum. Specifically, we shape the illumination using an amplitude mask that selectively controls its angular spectrum. We employ different illumination masks for achieving different network functionalities, so that the mask serves as a unique task encoder. Interestingly, we show that effective control can be achieved over a very narrow angular range, within the paraxial regime. We numerically illustrate the proposed approach by training a single diffractive network to perform multiple image-to-image translation tasks. In particular, we demonstrate translating handwritten digits into typeset digits of different values, and translating handwritten English letters into typeset numbers and typeset Greek letters, where the type of the output is determined by the illumination's angular components. As we show, the proposed framework can work under different coherence conditions, and can be combined with existing control strategies, such as different wavelengths. Our results establish the illumination angular spectrum as a powerful degree of freedom for controlling diffractive networks, enabling a scalable and versatile framework for multi-task all-optical computing.


优化|敛散性(6篇)

【1】Optimal Lower Bounds for Online Multicalibration
标题:在线多校准的最优下界
链接:https://arxiv.org/abs/2601.05245

作者:Natalie Collina,Jiuyao Lu,Georgy Noarov,Aaron Roth
摘要:我们证明了在线多校准的紧下界,建立了其与边际校准的信息论分离。   在组函数可同时依赖上下文和学习者预测的一般设定下,我们仅使用三个不相交的二元组就证明了期望多校准误差的$Ω(T^{2/3})$下界。这与Noarov等人(2025)的上界在对数因子内相匹配,并超过了边际校准的$O(T^{2/3-\varepsilon})$上界(Dagan等人,2025),从而区分了这两个问题。   随后,我们转向更困难情形的下界:组函数可以依赖上下文,但不依赖学习者的预测。在这种情形下,我们通过利用正交函数系构造的规模为$Θ(T)$的组族,建立了在线多校准的$\widetildeΩ(T^{2/3})$下界,同样在对数因子内与上界相匹配。
摘要:We prove tight lower bounds for online multicalibration, establishing an information-theoretic separation from marginal calibration.   In the general setting where group functions can depend on both context and the learner's predictions, we prove an $Ω(T^{2/3})$ lower bound on expected multicalibration error using just three disjoint binary groups. This matches the upper bounds of Noarov et al. (2025) up to logarithmic factors and exceeds the $O(T^{2/3-\varepsilon})$ upper bound for marginal calibration (Dagan et al., 2025), thereby separating the two problems.   We then turn to lower bounds for the more difficult case of group functions that may depend on context but not on the learner's predictions. In this case, we establish an $\widetildeΩ(T^{2/3})$ lower bound for online multicalibration via a $Θ(T)$-sized group family constructed using orthogonal function systems, again matching upper bounds up to logarithmic factors.
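For readers unfamiliar with the quantity being bounded, one standard formalization of online multicalibration error is the following (the paper's exact normalization and expectation may differ):

```latex
% Multicalibration error of predictions p_t against outcomes y_t over T rounds,
% measured with respect to a family \mathcal{G} of group functions g(x, p):
\mathrm{MCE}_T(\mathcal{G}) \;=\; \max_{g \in \mathcal{G}}
  \left| \sum_{t=1}^{T} g(x_t, p_t)\,\bigl(y_t - p_t\bigr) \right|
% Marginal calibration is the special case in which each g ignores the context
% x_t and is the indicator of a prediction level set, g(x, p) = \mathbf{1}[p = v].
```

Under this reading, the $Ω(T^{2/3})$ result says that against an adaptive adversary no online learner can keep this maximum correlation growing slower than $T^{2/3}$, whereas marginal calibration admits a strictly smaller rate.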


【2】GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
标题:GDPO:面向多奖励强化学习优化的组奖励解耦归一化策略优化
链接:https://arxiv.org/abs/2601.05242

作者:Shih-Yang Liu,Xin Dong,Ximing Lu,Shizhe Diao,Peter Belcak,Mingjie Liu,Min-Hung Chen,Hongxu Yin,Yu-Chiang Frank Wang,Kwang-Ting Cheng,Yejin Choi,Jan Kautz,Pavlo Molchanov
备注:NVIDIA-Tech Report
摘要:随着语言模型能力不断增强,用户希望它们不仅给出准确的回复,还能在各种场景中表现出符合多样化人类偏好的行为。为此,强化学习(RL)管线已开始整合多个奖励,每个奖励刻画一种不同的偏好,以引导模型实现这些期望行为。然而,近期工作在多奖励设定下默认沿用组相对策略优化(GRPO),而没有检验其适用性。本文证明,直接用GRPO对不同rollout的奖励组合进行归一化,会使它们塌缩为相同的优势值,降低训练信号的分辨率,导致次优收敛,某些情况下甚至造成训练早期失败。为此,我们提出组奖励解耦归一化策略优化(GDPO),这是一种新的策略优化方法,通过对各个奖励分别归一化来解决上述问题,更忠实地保留它们的相对差异,从而实现更准确的多奖励优化,并大幅提升训练稳定性。我们在工具调用、数学推理和编码推理三个任务上比较了GDPO与GRPO,同时评估正确性指标(准确率、bug比率)和约束遵循指标(格式、长度)。在所有设定中,GDPO均优于GRPO,证明了其在多奖励强化学习优化中的有效性和普适性。
摘要 :As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, Reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to apply Group Relative Policy Optimization (GRPO) under multi-reward setting without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method to resolve these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization, along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
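The collapse the abstract describes is easy to reproduce on a toy example (a sketch based on the abstract, not the paper's exact formulation): three rollouts share the same summed reward, so normalizing the total assigns them one advantage, while per-reward normalization separates the rollout that earned the rarer accuracy reward.

```python
def z_scores(xs):
    """Population z-scores; all zeros if the group has no spread."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    if var == 0:
        return [0.0] * len(xs)
    std = var ** 0.5
    return [(x - mean) / std for x in xs]

def grpo_advantages(reward_lists):
    """GRPO-style: normalize the combined (summed) reward across the group."""
    totals = [sum(rs) for rs in zip(*reward_lists)]
    return z_scores(totals)

def decoupled_advantages(reward_lists):
    """Decoupled, GDPO-spirit: normalize each reward separately, then combine."""
    per_reward = [z_scores(rs) for rs in reward_lists]
    return [sum(col) for col in zip(*per_reward)]

# Four rollouts: the format reward is usually earned, the accuracy reward rarely.
fmt = [1, 1, 0, 1]
acc = [0, 0, 1, 1]

g = grpo_advantages([fmt, acc])
d = decoupled_advantages([fmt, acc])
print(g)  # rollouts 0, 1 and 2 collapse to one advantage (equal totals)
print(d)  # rollout 2 (accurate but ill-formatted) is now distinguished
```

The separation appears precisely because the two rewards have different group statistics (format is easy, accuracy is hard here), which is the information the summed normalization throws away.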


【3】EARL: Energy-Aware Optimization of Liquid State Machines for Pervasive AI
标题:EARL:面向普适人工智能的液体状态机能量感知优化
链接:https://arxiv.org/abs/2601.05205

作者:Zain Iqbal,Lorenzo Valerio
备注:6 pages, 9 figures, 2 Tables, conference [Submitted in PerConAI-2026]
摘要:普适人工智能日益依赖设备端学习系统,这些系统需要在严格的资源约束下提供低延迟、高能效的计算。液体状态机(LSM)为普适与神经形态系统中的低功耗时序处理提供了一条有前景的途径,但由于超参数敏感性高,且传统优化方法忽视能量约束、计算代价大,其部署仍具挑战性。本文提出EARL,一个能量感知强化学习框架,它将贝叶斯优化与基于自适应强化学习的选择策略相结合,以联合优化准确率和能耗。EARL采用代理模型进行全局探索,利用强化学习进行动态候选排序,并通过提前终止机制消除冗余评估,从而大幅降低计算开销。在三个基准数据集上的实验表明,与领先的超参数调优框架相比,EARL的准确率高出6%到15%,能耗降低60%到80%,优化时间最多减少一个数量级。这些结果凸显了能量感知自适应搜索在提升资源受限设备端AI应用中LSM的效率与可扩展性方面的有效性。
摘要:Pervasive AI increasingly depends on on-device learning systems that deliver low-latency and energy-efficient computation under strict resource constraints. Liquid State Machines (LSMs) offer a promising approach for low-power temporal processing in pervasive and neuromorphic systems, but their deployment remains challenging due to high hyperparameter sensitivity and the computational cost of traditional optimization methods that ignore energy constraints. This work presents EARL, an energy-aware reinforcement learning framework that integrates Bayesian optimization with an adaptive reinforcement learning based selection policy to jointly optimize accuracy and energy consumption. EARL employs surrogate modeling for global exploration, reinforcement learning for dynamic candidate prioritization, and an early termination mechanism to eliminate redundant evaluations, substantially reducing computational overhead. Experiments on three benchmark datasets demonstrate that EARL achieves 6 to 15 percent higher accuracy, 60 to 80 percent lower energy consumption, and up to an order of magnitude reduction in optimization time compared to leading hyperparameter tuning frameworks. These results highlight the effectiveness of energy-aware adaptive search in improving the efficiency and scalability of LSMs for resource-constrained on-device AI applications.


【4】A Data-Driven Predictive Framework for Inventory Optimization Using Context-Augmented Machine Learning Models
标题:使用上下文增强机器学习模型的库存优化数据驱动预测框架
链接:https://arxiv.org/abs/2601.05033

作者:Anees Fatima,Mohammad Abdus Salam
摘要:供应链管理中的需求预测对于优化库存、减少浪费和提高客户满意度至关重要。传统方法经常忽视天气、节庆活动和设备故障等外部影响,导致效率低下。本研究调查了使用机器学习(ML)算法来改善零售和自动售货机行业的需求预测。研究采用四种机器学习算法预测库存需求:极端梯度提升(XGBoost)、自回归综合移动平均(ARIMA)、Facebook Prophet和支持向量回归(SVR)。工作日、节假日和销售偏差指标等外部因素被系统地纳入模型以提高精度。XGBoost超越了其他模型,在包含外部变量的情况下达到了最低的平均绝对误差(MAE)22.7。ARIMAX和Fb Prophet表现出显著提升,而SVR表现欠佳。纳入外部因素后,需求预测模型的精度得到很大提高,XGBoost被认为是最有效的算法。这项研究为加强零售和自动售货机系统的库存管理提供了一个强有力的框架。
摘要:Demand forecasting in supply chain management (SCM) is critical for optimizing inventory, reducing waste, and improving customer satisfaction. Conventional approaches frequently neglect external influences like weather, festivities, and equipment breakdowns, resulting in inefficiencies. This research investigates the use of machine learning (ML) algorithms to improve demand prediction in retail and vending machine sectors. Four machine learning algorithms were used to forecast inventory requirements: Extreme Gradient Boosting (XGBoost), Autoregressive Integrated Moving Average (ARIMA), Facebook Prophet (Fb Prophet), and Support Vector Regression (SVR). External factors like weekdays, holidays, and sales deviation indicators were methodically incorporated to enhance precision. XGBoost surpassed other models, reaching the lowest Mean Absolute Error (MAE) of 22.7 with the inclusion of external variables. ARIMAX and Fb Prophet demonstrated noteworthy enhancements, whereas SVR fell short in performance. Incorporating external factors greatly improves the precision of demand forecasting models, and XGBoost is identified as the most efficient algorithm. This study offers a strong framework for enhancing inventory management in retail and vending machine systems.
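The external-factor feature engineering described above can be sketched as follows; the holiday set, window length, and deviation definition are illustrative assumptions, not taken from the paper:

```python
from datetime import date

# Hypothetical holiday calendar; the paper does not specify one.
HOLIDAYS = {date(2024, 1, 1), date(2024, 12, 25)}

def calendar_features(day, sales_history):
    """External-factor features for one day: weekday, weekend and holiday
    flags, plus a sales-deviation indicator vs. the trailing 7-day mean."""
    trailing = sales_history[-7:]
    mean7 = sum(trailing) / len(trailing)
    latest = sales_history[-1]
    return {
        "weekday": day.weekday(),              # 0 = Monday ... 6 = Sunday
        "is_weekend": int(day.weekday() >= 5),
        "is_holiday": int(day in HOLIDAYS),
        "sales_dev": (latest - mean7) / mean7 if mean7 else 0.0,
    }

feats = calendar_features(date(2024, 12, 25),
                          [100, 90, 110, 105, 95, 100, 140])
```

Rows like `feats` would then be joined with lagged sales as inputs to the regressors the paper compares.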


【5】Distributed Online Convex Optimization with Efficient Communication: Improved Algorithm and Lower bounds
标题:具有高效通信的分布式在线凸优化:改进的算法与下界
链接:https://arxiv.org/abs/2601.04907

作者:Sifan Yang,Wenhao Yang,Wei Jiang,Lijun Zhang
摘要:We investigate distributed online convex optimization with compressed communication, where $n$ learners connected by a network collaboratively minimize a sequence of global loss functions using only local information and compressed data from neighbors. Prior work has established regret bounds of $O(\max\{ω^{-2}ρ^{-4}n^{1/2},ω^{-4}ρ^{-8}\}n\sqrt{T})$ and $O(\max\{ω^{-2}ρ^{-4}n^{1/2},ω^{-4}ρ^{-8}\}n\ln{T})$ for convex and strongly convex functions, respectively, where $ω\in(0,1]$ is the compression quality factor ($ω=1$ means no compression) and $ρ<1$ is the spectral gap of the communication matrix. However, these regret bounds suffer from a \emph{quadratic} or even \emph{quartic} dependence on $ω^{-1}$. Moreover, the \emph{super-linear} dependence on $n$ is also undesirable. To overcome these limitations, we propose a novel algorithm that achieves improved regret bounds of $\tilde{O}(ω^{-1/2}ρ^{-1}n\sqrt{T})$ and $\tilde{O}(ω^{-1}ρ^{-2}n\ln{T})$ for convex and strongly convex functions, respectively. The primary idea is to design a \emph{two-level blocking update framework} incorporating two novel ingredients: an online gossip strategy and an error compensation scheme, which collaborate to \emph{achieve a better consensus} among learners. Furthermore, we establish the first lower bounds for this problem, justifying the optimality of our results with respect to both $ω$ and $T$. Additionally, we consider the bandit feedback scenario, and extend our method with the classic gradient estimators to enhance existing regret bounds.
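For intuition, compressed-communication analyses of this kind typically assume a contractive compressor with quality factor ω. A minimal sketch using top-k sparsification (an illustrative operator, not necessarily the compressor analyzed in the paper):

```python
def top_k(x, k):
    """Keep the k largest-magnitude coordinates of x, zero out the rest."""
    keep = set(sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(x)]

x = [3.0, -1.0, 0.5, 2.0]
d, k = len(x), 2
omega = k / d                 # quality factor: omega = 1 means no compression
cx = top_k(x, k)

# Top-k satisfies the standard contraction bound
# ||C(x) - x||^2 <= (1 - omega) * ||x||^2,
# which is the property regret analyses of this type build on.
err = sum((a - b) ** 2 for a, b in zip(cx, x))
norm_sq = sum(v * v for v in x)
```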


【6】Convergence Rates for Learning Pseudo-Differential Operators
标题:学习伪微分算子的收敛速度
链接:https://arxiv.org/abs/2601.04473

作者:Jiaheng Chen,Daniel Sanz-Alonso
备注:72 pages, 1 figure
摘要:本文建立了学习椭圆型伪微分算子的收敛速度,椭圆型伪微分算子是偏微分方程和数学物理中的一类基本算子。在小波Galerkin框架中,我们将此类学习问题表述为具有多尺度稀疏性的结构化无限维回归问题。基于此结构,我们提出了一种稀疏、数据和计算高效的估计量,它利用了一种为该学习任务量身定制的新型矩阵压缩方案,以及用于平衡近似误差与估计误差的嵌套支撑策略。除了得到该估计量的收敛速度外,我们还表明,学习到的算子可导出一个高效且稳定的Galerkin求解器,其数值误差与其统计精度相匹配。因此,我们的结果有助于将算子学习、数据驱动的求解器和小波方法结合起来用于科学计算。
摘要:This paper establishes convergence rates for learning elliptic pseudo-differential operators, a fundamental operator class in partial differential equations and mathematical physics. In a wavelet-Galerkin framework, we formulate learning over this class as a structured infinite-dimensional regression problem with multiscale sparsity. Building on this structure, we propose a sparse, data- and computation-efficient estimator, which leverages a novel matrix compression scheme tailored to the learning task and a nested-support strategy to balance approximation and estimation errors. In addition to obtaining convergence rates for the estimator, we show that the learned operator induces an efficient and stable Galerkin solver whose numerical error matches its statistical accuracy. Our results therefore contribute to bringing together operator learning, data-driven solvers, and wavelet methods in scientific computing.


预测|估计(4篇)

【1】Stock Market Price Prediction using Neural Prophet with Deep Neural Network
标题:基于Neural Prophet与深度神经网络的股市价格预测
链接:https://arxiv.org/abs/2601.05202

作者:Navin Chhibber,Suneel Khemka,Navneet Kumar Tyagi,Rohit Tewari,Bireswar Banerjee,Piyush Ranjan
摘要:Stock market price prediction is a significant interdisciplinary research domain that lies at the intersection of finance, statistics, and economics. Accurately forecasting stock prices has always been a focal point for researchers. However, existing statistical approaches for time-series prediction often fail to effectively forecast the probability range of future stock prices. Hence, to solve this problem, the Neural Prophet with a Deep Neural Network (NP-DNN) is proposed to predict stock market prices. The preprocessing technique used in this research is Z-score normalization, which normalizes stock price data by removing scale differences, making patterns easier to detect. Missing value imputation fills gaps in historical data, enhancing the model's use of complete information for more accurate predictions. The Multi-Layer Perceptron (MLP) learns complex nonlinear relationships among stock market prices and extracts hidden patterns from the input data, thereby creating meaningful feature representations for better prediction accuracy. The proposed NP-DNN model achieved an accuracy of 99.21% compared with other approaches using the Fused Large Language Model. Keywords: deep neural network, forecasting stock prices, multi-layer perceptron, neural prophet, stock market price prediction.
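The two preprocessing steps named above, mean imputation of missing values followed by Z-score normalization, can be sketched in a few lines; the toy price series is invented for illustration:

```python
def impute_mean(series):
    """Fill None gaps with the mean of the observed prices."""
    obs = [v for v in series if v is not None]
    mean = sum(obs) / len(obs)
    return [mean if v is None else v for v in series]

def z_score(series):
    """Standardize to zero mean and unit variance (removes scale differences)."""
    m = sum(series) / len(series)
    sd = (sum((v - m) ** 2 for v in series) / len(series)) ** 0.5
    return [(v - m) / sd for v in series]

prices = [10.0, None, 12.0, 11.0, None, 14.0]  # toy data with two gaps
clean = impute_mean(prices)
normed = z_score(clean)
```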


【2】FibreCastML: An Open Web Platform for Predicting Electrospun Nanofibre Diameter Distributions
标题:FibreCastML:预测电纺纳米纤维直径分布的开放网络平台
链接:https://arxiv.org/abs/2601.04873

作者:Elisa Roldan,Kirstie Andrews,Stephen M. Richardson,Reyhaneh Fatahian,Glen Cooper,Rasool Erfani,Tasneem Sabir,Neil D. Reeves
摘要 :Electrospinning is a scalable technique for producing fibrous scaffolds with tunable micro- and nanoscale architectures for applications in tissue engineering, drug delivery, and wound care. While machine learning (ML) has been used to support electrospinning process optimisation, most existing approaches predict only mean fibre diameters, neglecting the full diameter distribution that governs scaffold performance. This work presents FibreCastML, an open, distribution-aware ML framework that predicts complete fibre diameter spectra from routinely reported electrospinning parameters and provides interpretable insights into process structure relationships.   A meta-dataset comprising 68538 individual fibre diameter measurements extracted from 1778 studies across 16 biomedical polymers was curated. Six standard processing parameters, namely solution concentration, applied voltage, flow rate, tip to collector distance, needle diameter, and collector rotation speed, were used to train seven ML models using nested cross validation with leave one study out external folds. Model interpretability was achieved using variable importance analysis, SHapley Additive exPlanations, correlation matrices, and three dimensional parameter maps.   Non linear models consistently outperformed linear baselines, achieving coefficients of determination above 0.91 for several widely used polymers. Solution concentration emerged as the dominant global driver of fibre diameter distributions. Experimental validation across different electrospinning systems demonstrated close agreement between predicted and measured distributions. FibreCastML enables more reproducible and data driven optimisation of electrospun scaffold architectures.
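A sketch of the leave-one-study-out splitting used for the external folds, where each held-out fold is an entire study so that evaluation reflects generalization to unseen experimental settings; the records and field names are illustrative:

```python
def leave_one_study_out(records):
    """Yield (held_out, train, test) splits: each test fold is one whole study."""
    studies = sorted({r["study"] for r in records})
    for held_out in studies:
        train = [r for r in records if r["study"] != held_out]
        test = [r for r in records if r["study"] == held_out]
        yield held_out, train, test

# Toy fibre-diameter records: study id, solution concentration, diameter (nm).
data = [
    {"study": "A", "conc": 10, "diam": 250},
    {"study": "A", "conc": 12, "diam": 310},
    {"study": "B", "conc": 8,  "diam": 180},
    {"study": "C", "conc": 15, "diam": 420},
]
splits = list(leave_one_study_out(data))
```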


【3】Intraday spatiotemporal PV power prediction at national scale using satellite-based solar forecast models
标题:使用基于卫星的太阳预报模型进行全国范围内的日内时空太阳能发电预测
链接:https://arxiv.org/abs/2601.04751

作者:Luca Lanzilao,Angela Meyer
摘要:We present a novel framework for spatiotemporal photovoltaic (PV) power forecasting and use it to evaluate the reliability, sharpness, and overall performance of seven intraday PV power nowcasting models. The model suite includes satellite-based deep learning and optical-flow approaches and physics-based numerical weather prediction models, covering both deterministic and probabilistic formulations. Forecasts are first validated against satellite-derived surface solar irradiance (SSI). Irradiance fields are then converted into PV power using station-specific machine learning models, enabling comparison with production data from 6434 PV stations across Switzerland. To our knowledge, this is the first study to investigate spatiotemporal PV forecasting at a national scale. We additionally provide the first visualizations of how mesoscale cloud systems shape national PV production on hourly and sub-hourly timescales. Our results show that satellite-based approaches outperform the Integrated Forecast System (IFS-ENS), particularly at short lead times. Among them, SolarSTEPS and SHADECast deliver the most accurate SSI and PV power predictions, with SHADECast providing the most reliable ensemble spread. The deterministic model IrradianceNet achieves the lowest root mean square error, while probabilistic forecasts of SolarSTEPS and SHADECast provide better-calibrated uncertainty. Forecast skill generally decreases with elevation. At a national scale, satellite-based models forecast the daily total PV generation with relative errors below 10% for 82% of the days in 2019-2020, demonstrating robustness and their potential for operational use.


【4】Fast Mining and Dynamic Time-to-Event Prediction over Multi-sensor Data Streams
标题:多传感器数据流上的快速挖掘和动态事件时间预测
链接:https://arxiv.org/abs/2601.04741

作者:Kota Nakamura,Koki Kawabata,Yasuko Matsubara,Yasushi Sakurai
备注:Accepted by KDD 2026
摘要:Given real-time sensor data streams obtained from machines, how can we continuously predict when a machine failure will occur? This work aims to continuously forecast the timing of future events by analyzing multi-sensor data streams. A key characteristic of real-world data streams is their dynamic nature, where the underlying patterns evolve over time. To address this, we present TimeCast, a dynamic prediction framework designed to adapt to these changes and provide accurate, real-time predictions of future event time. Our proposed method has the following properties: (a) Dynamic: it identifies the distinct time-evolving patterns (i.e., stages) and learns individual models for each, enabling us to make adaptive predictions based on pattern shifts. (b) Practical: it finds meaningful stages that capture time-varying interdependencies between multiple sensors and improve prediction performance; (c) Scalable: our algorithm scales linearly with the input size and enables online model updates on data streams. Extensive experiments on real datasets demonstrate that TimeCast provides higher prediction accuracy than state-of-the-art methods while finding dynamic changes in data streams with a great reduction in computational time.


其他神经网络|深度学习|模型|建模(16篇)

【1】Measuring and Fostering Peace through Machine Learning and Artificial Intelligence
标题:通过机器学习和人工智能衡量和促进和平
链接:https://arxiv.org/abs/2601.05232

作者:P. Gilda,P. Dungarwal,A. Thongkham,E. T. Ajayi,S. Choudhary,T. M. Terol,C. Lam,J. P. Araujo,M. McFadyen-Mungalln,L. S. Liebovitch,P. T. Coleman,H. West,K. Sieck,S. Carter
备注:6 pages, 4 figures
摘要:We used machine learning and artificial intelligence: 1) to measure levels of peace in countries from news and social media and 2) to develop on-line tools that promote peace by helping users better understand their own media diet. For news media, we used neural networks to measure levels of peace from text embeddings of on-line news sources. The model, trained on one news media dataset also showed high accuracy when used to analyze a different news dataset. For social media, such as YouTube, we developed other models to measure levels of social dimensions important in peace using word level (GoEmotions) and context level (Large Language Model) methods. To promote peace, we note that 71% of people 20-40 years old daily view most of their news through short videos on social media. Content creators of these videos are biased towards creating videos with emotional activation, making you angry to engage you, to increase clicks. We developed and tested a Chrome extension, MirrorMirror, which provides real-time feedback to YouTube viewers about the peacefulness of the media they are watching. Our long term goal is for MirrorMirror to evolve into an open-source tool for content creators, journalists, researchers, platforms, and individual users to better understand the tone of their media creation and consumption and its effects on viewers. Moving beyond simple engagement metrics, we hope to encourage more respectful, nuanced, and informative communication.


【2】Learning Mixture Models via Efficient High-dimensional Sparse Fourier Transforms
标题:通过高效的高维稀疏傅里叶变换学习混合模型
链接:https://arxiv.org/abs/2601.05157

作者:Alkis Kalavasis,Pravesh K. Kothari,Shuchen Li,Manolis Zampetakis
摘要 :In this work, we give a ${\rm poly}(d,k)$ time and sample algorithm for efficiently learning the parameters of a mixture of $k$ spherical distributions in $d$ dimensions. Unlike all previous methods, our techniques apply to heavy-tailed distributions and include examples that do not even have finite covariances. Our method succeeds whenever the cluster distributions have a characteristic function with sufficiently heavy tails. Such distributions include the Laplace distribution but crucially exclude Gaussians.   All previous methods for learning mixture models relied implicitly or explicitly on the low-degree moments. Even for the case of Laplace distributions, we prove that any such algorithm must use super-polynomially many samples. Our method thus adds to the short list of techniques that bypass the limitations of the method of moments.   Somewhat surprisingly, our algorithm does not require any minimum separation between the cluster means. This is in stark contrast to spherical Gaussian mixtures where a minimum $\ell_2$-separation is provably necessary even information-theoretically [Regev and Vijayaraghavan '17]. Our methods compose well with existing techniques and allow obtaining ''best of both worlds" guarantees for mixtures where every component either has a heavy-tailed characteristic function or has a sub-Gaussian tail with a light-tailed characteristic function.   Our algorithm is based on a new approach to learning mixture models via efficient high-dimensional sparse Fourier transforms. We believe that this method will find more applications to statistical estimation. As an example, we give an algorithm for consistent robust mean estimation against noise-oblivious adversaries, a model practically motivated by the literature on multiple hypothesis testing. It was formally proposed in a recent Master's thesis by one of the authors, and has already inspired follow-up works.
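The dichotomy the algorithm exploits, characteristic functions with heavy versus light tails, can be seen directly from the standard closed forms for the Laplace and Gaussian distributions (textbook formulas, not code from the paper):

```python
import math

def laplace_cf(t, b=1.0):
    """Characteristic function of a centered Laplace(b): 1 / (1 + b^2 t^2).
    Decays only polynomially in t (a heavy tail in frequency)."""
    return 1.0 / (1.0 + (b * t) ** 2)

def gauss_cf(t, sigma=1.0):
    """Characteristic function of a centered Gaussian: exp(-sigma^2 t^2 / 2).
    Decays super-exponentially, so Gaussians are excluded by the method."""
    return math.exp(-(sigma * t) ** 2 / 2)

t = 10.0
ratio = laplace_cf(t) / gauss_cf(t)  # heavy vs light tail at high frequency
```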


【3】Exploring Student Expectations and Confidence in Learning Analytics
标题:探索学生对学习分析的期望和信心
链接:https://arxiv.org/abs/2601.05082

作者:Hayk Asatryan,Basile Tousside,Janis Mohr,Malte Neugebauer,Hildo Bijl,Paul Spiegelberg,Claudia Frohn-Schauf,Jörg Frochte
备注:7 pages, Keywords: Learning Analytics, Survey, Data Protection, Clustering
摘要:Learning Analytics (LA) is nowadays ubiquitous in many educational systems, providing the ability to collect and analyze student data in order to understand and optimize learning and the environments in which it occurs. On the other hand, the collection of data requires to comply with the growing demand regarding privacy legislation. In this paper, we use the Student Expectation of Learning Analytics Questionnaire (SELAQ) to analyze the expectations and confidence of students from different faculties regarding the processing of their data for Learning Analytics purposes. This allows us to identify four clusters of students through clustering algorithms: Enthusiasts, Realists, Cautious and Indifferents. This structured analysis provides valuable insights into the acceptance and criticism of Learning Analytics among students.


【4】Rotation-Robust Regression with Convolutional Model Trees
标题:基于卷积模型树的旋转稳健回归
链接:https://arxiv.org/abs/2601.04899

作者:Hongyi Li,William Ward Armstrong,Jun Xu
摘要:We study rotation-robust learning for image inputs using Convolutional Model Trees (CMTs) [1], whose split and leaf coefficients can be structured on the image grid and transformed geometrically at deployment time. In a controlled MNIST setting with a rotation-invariant regression target, we introduce three geometry-aware inductive biases for split directions -- convolutional smoothing, a tilt dominance constraint, and importance-based pruning -- and quantify their impact on robustness under in-plane rotations. We further evaluate a deployment-time orientation search that selects a discrete rotation maximizing a forest-level confidence proxy without updating model parameters. Orientation search improves robustness under severe rotations but can be harmful near the canonical orientation when confidence is misaligned with correctness. Finally, we observe consistent trends on MNIST digit recognition implemented as one-vs-rest regression, highlighting both the promise and limitations of confidence-based orientation selection for model-tree ensembles.


【5】Excess Description Length of Learning Generalizable Predictors
标题:学习可泛化预测器的超额描述长度
链接:https://arxiv.org/abs/2601.04728

作者:Elizabeth Donoway,Hailey Joren,Fabien Roger,Jan Leike
摘要:Understanding whether fine-tuning elicits latent capabilities or teaches new ones is a fundamental question for language model evaluation and safety. We develop a formal information-theoretic framework for quantifying how much predictive structure fine-tuning extracts from the train dataset and writes into a model's parameters. Our central quantity, Excess Description Length (EDL), is defined via prequential coding and measures the gap between the bits required to encode training labels sequentially using an evolving model (trained online) and the residual encoding cost under the final trained model. We establish that EDL is non-negative in expectation, converges to surplus description length in the infinite-data limit, and provides bounds on expected generalization gain. Through a series of toy models, we clarify common confusions about information in learning: why random labels yield EDL near zero, how a single example can eliminate many bits of uncertainty about the underlying rule(s) that describe the data distribution, why structure learned on rare inputs contributes proportionally little to expected generalization, and how format learning creates early transients distinct from capability acquisition. This framework provides rigorous foundations for the empirical observation that capability elicitation and teaching exhibit qualitatively distinct scaling signatures.
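A toy instantiation of EDL as defined above, prequential codelength under an evolving model minus residual codelength under the final model, using a running-frequency model on binary labels; the data and model choice are ours, for illustration only:

```python
import math

def prequential_bits(labels):
    """Code each label with a Laplace-smoothed running frequency estimate
    (the 'evolving model' trained online on everything seen so far)."""
    bits, ones = 0.0, 0
    for t, y in enumerate(labels):
        p_one = (ones + 1) / (t + 2)          # add-one smoothing
        p = p_one if y == 1 else 1 - p_one
        bits += -math.log2(p)
        ones += y
    return bits

def final_model_bits(labels):
    """Residual cost under the final trained model: the empirical frequency."""
    p_one = sum(labels) / len(labels)
    p_one = min(max(p_one, 1e-9), 1 - 1e-9)   # clip to keep the log finite
    return sum(-math.log2(p_one if y == 1 else 1 - p_one) for y in labels)

structured = [1] * 20       # a learnable rule: the label is always 1
balanced = [0, 1] * 10      # balanced labels: little structure to extract

edl_structured = prequential_bits(structured) - final_model_bits(structured)
edl_balanced = prequential_bits(balanced) - final_model_bits(balanced)
```

The structured sequence costs extra bits early on, while the model is still learning the rule; that gap is the EDL, and it is larger than for the unstructured sequence.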


【6】DeepHalo: A Neural Choice Model with Controllable Context Effects
标题:DeepHalo:一个具有可控上下文效应的神经选择模型
链接:https://arxiv.org/abs/2601.04616

作者:Shuhan Zhang,Zhi Wang,Rui Gao,Shuang Li
摘要 :Modeling human decision-making is central to applications such as recommendation, preference learning, and human-AI alignment. While many classic models assume context-independent choice behavior, a large body of behavioral research shows that preferences are often influenced by the composition of the choice set itself -- a phenomenon known as the context effect or Halo effect. These effects can manifest as pairwise (first-order) or even higher-order interactions among the available alternatives. Recent models that attempt to capture such effects either focus on the featureless setting or, in the feature-based setting, rely on restrictive interaction structures or entangle interactions across all orders, which limits interpretability. In this work, we propose DeepHalo, a neural modeling framework that incorporates features while enabling explicit control over interaction order and principled interpretation of context effects. Our model enables systematic identification of interaction effects by order and serves as a universal approximator of context-dependent choice functions when specialized to a featureless setting. Experiments on synthetic and real-world datasets demonstrate strong predictive performance while providing greater transparency into the drivers of choice.


【7】On the Limitations of Rank-One Model Editing in Answering Multi-hop Questions
标题:论一级模型编辑在回答多跳问题时的局限性
链接:https://arxiv.org/abs/2601.04600

作者:Zhiyuan He,Binghan Chen,Tianxiang Xiong,Ziyang Sun,Mozhao Zhu,Xi Chen
摘要:Recent advances in Knowledge Editing (KE), particularly Rank-One Model Editing (ROME), show superior efficiency over fine-tuning and in-context learning for updating single-hop facts in transformers. However, these methods face significant challenges when applied to multi-hop reasoning tasks requiring knowledge chaining. In this work, we study the effect of editing knowledge with ROME on different layer depths and identify three key failure modes. First, the "hopping-too-late" problem occurs as later layers lack access to necessary intermediate representations. Second, generalization ability deteriorates sharply when editing later layers. Third, the model overfits to edited knowledge, incorrectly prioritizing edited-hop answers regardless of context. To mitigate the issues of "hopping-too-late" and generalisation decay, we propose Redundant Editing, a simple yet effective strategy that enhances multi-hop reasoning. Our experiments demonstrate that this approach can improve accuracy on 2-hop questions by at least 15.5 percentage points, representing a 96% increase over the previous single-edit strategy, while trading off some specificity and language naturalness.


【8】Density Matrix RNN (DM-RNN): A Quantum Information Theoretic Framework for Modeling Musical Context and Polyphony
标题:密度矩阵RNN(DM-RNN):音乐语境和复调建模的量子信息理论框架
链接:https://arxiv.org/abs/2601.04592

作者:Joonwon Seo,Mariana Montiel
备注:Submitted to the 10th International Conference on Mathematics and Computation in Music (MCM 2026)
摘要:Classical Recurrent Neural Networks (RNNs) summarize musical context into a deterministic hidden state vector, imposing an information bottleneck that fails to capture the inherent ambiguity in music. We propose the Density Matrix RNN (DM-RNN), a novel theoretical architecture utilizing the Density Matrix. This allows the model to maintain a statistical ensemble of musical interpretations (a mixed state), capturing both classical probabilities and quantum coherences. We rigorously define the temporal dynamics using Quantum Channels (CPTP maps). Crucially, we detail a parameterization strategy based on the Choi-Jamiolkowski isomorphism, ensuring the learned dynamics remain physically valid (CPTP) by construction. We introduce an analytical framework using Von Neumann Entropy to quantify musical uncertainty and Quantum Mutual Information (QMI) to measure entanglement between voices. The DM-RNN provides a mathematically rigorous framework for modeling complex, ambiguous musical structures.
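Given the density matrix's spectrum, the Von Neumann entropy diagnostic mentioned above is a one-liner; the eigenvalue lists here are illustrative:

```python
import math

def von_neumann_entropy(eigvals):
    """S(rho) = -sum_i lambda_i * log2(lambda_i) over the density-matrix
    spectrum (zero eigenvalues contribute nothing)."""
    return -sum(l * math.log2(l) for l in eigvals if l > 0)

pure_state = [1.0, 0.0]        # one musical interpretation held with certainty
maximally_mixed = [0.5, 0.5]   # two interpretations weighted equally

s_pure = von_neumann_entropy(pure_state)
s_mixed = von_neumann_entropy(maximally_mixed)
```

A pure state has zero entropy (no ambiguity), while the maximally mixed two-level state reaches one bit, the proposed measure of maximal musical uncertainty.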


【9】Towards Spatio-Temporal Extrapolation of Phase-Field Simulations with Convolution-Only Neural Networks
标题:基于纯卷积神经网络的相场模拟时空外推
链接:https://arxiv.org/abs/2601.04510

作者:Christophe Bonneville,Nathan Bieberdorf,Pieterjan Robbe,Mark Asta,Habib Najm,Laurent Capolungo,Cosmin Safta
摘要:Phase-field simulations of liquid metal dealloying (LMD) can capture complex microstructural evolutions but can be prohibitively expensive for large domains and long time horizons. In this paper, we introduce a fully convolutional, conditionally parameterized U-Net surrogate designed to extrapolate far beyond its training data in both space and time. The architecture integrates convolutional self-attention, physically informed padding, and a flood-fill corrector method to maintain accuracy under extreme extrapolation, while conditioning on simulation parameters allows for flexible time-step skipping and adaptation to varying alloy compositions. To remove the need for costly solver-based initialization, we couple the surrogate with a conditional diffusion model that generates synthetic, physically consistent initial conditions. We train our surrogate on simulations generated over small domain sizes and short time spans, but, by taking advantage of the convolutional nature of U-Nets, we are able to run and extrapolate surrogate simulations for longer time horizons than what would be achievable with classic numerical solvers. Across multiple alloy compositions, the framework is able to reproduce the LMD physics accurately. It predicts key quantities of interest and spatial statistics with relative errors typically below 5% in the training regime and under 15% during large-scale, long time-horizon extrapolations. Our framework can also deliver speed-ups of up to 36,000 times, bringing the time to run weeks-long simulations down to a few seconds. This work is a first stepping stone towards high-fidelity extrapolation in both space and time of phase-field simulation for LMD.


【10】When Models Manipulate Manifolds: The Geometry of a Counting Task
标题:当模型操纵流形:计数任务的几何结构
链接:https://arxiv.org/abs/2601.04480

作者:Wes Gurnee,Emmanuel Ameisen,Isaac Kauvar,Julius Tarng,Adam Pearce,Chris Olah,Joshua Batson
摘要 :Language models can perceive visual properties of text despite receiving only sequences of tokens-we mechanistically investigate how Claude 3.5 Haiku accomplishes one such task: linebreaking in fixed-width text. We find that character counts are represented on low-dimensional curved manifolds discretized by sparse feature families, analogous to biological place cells. Accurate predictions emerge from a sequence of geometric transformations: token lengths are accumulated into character count manifolds, attention heads twist these manifolds to estimate distance to the line boundary, and the decision to break the line is enabled by arranging estimates orthogonally to create a linear decision boundary. We validate our findings through causal interventions and discover visual illusions--character sequences that hijack the counting mechanism. Our work demonstrates the rich sensory processing of early layers, the intricacy of attention algorithms, and the importance of combining feature-based and geometric views of interpretability.


【11】Concept Tokens: Learning Behavioral Embeddings Through Concept Definitions
标题:概念标记(Concept Tokens):通过概念定义学习行为嵌入
链接:https://arxiv.org/abs/2601.04465

作者:Ignacio Sastre,Aiala Rosá
摘要:We propose Concept Tokens, a lightweight method that adds a new special token to a pretrained LLM and learns only its embedding from multiple natural language definitions of a target concept, where occurrences of the concept are replaced by the new token. The LLM is kept frozen and the embedding is optimized with the standard language-modeling objective. We evaluate Concept Tokens in three settings. First, we study hallucinations in closed-book question answering on HotpotQA and find a directional effect: negating the hallucination token reduces hallucinated answers mainly by increasing abstentions, whereas asserting it increases hallucinations and lowers precision. Second, we induce recasting, a pedagogical feedback strategy for second language teaching, and observe the same directional effect. Moreover, compared to providing the full definitional corpus in-context, concept tokens better preserve compliance with other instructions (e.g., asking follow-up questions). Finally, we include a qualitative study with the Eiffel Tower and a fictional "Austral Tower" to illustrate what information the learned embeddings capture and where their limitations emerge. Overall, Concept Tokens provide a compact control signal learned from definitions that can steer behavior in frozen LLMs.


【12】When Predictions Shape Reality: A Socio-Technical Synthesis of Performative Predictions in Machine Learning
标题:当预测塑造现实:机器学习中表演性预测的社会技术综合
链接:https://arxiv.org/abs/2601.04447

作者:Gal Fybish,Teo Susnjak
摘要:Machine learning models are increasingly used in high-stakes domains where their predictions can actively shape the environments in which they operate, a phenomenon known as performative prediction. This dynamic, in which the deployment of the model influences the very outcome it seeks to predict, can lead to unintended consequences, including feedback loops, performance issues, and significant societal risks. While the literature in the field has grown rapidly in recent years, a socio-technical synthesis that systemises the phenomenon concepts and provides practical guidance has been lacking. This Systematisation of Knowledge (SoK) addresses this gap by providing a comprehensive review of the literature on performative predictions. We provide an overview of the primary mechanisms through which performativity manifests, present a typology of associated risks, and survey the proposed solutions offered in the literature. Our primary contribution is the ``Performative Strength vs. Impact Matrix" assessment framework. This practical tool is designed to help practitioners assess the potential influence and severity of performativity on their deployed predictive models and select the appropriate level of algorithmic or human intervention.


【13】Learning Multinomial Logits in $O(n \log n)$ time
标题:在 $O(n \log n)$ 时间内学习多项Logit模型
链接:https://arxiv.org/abs/2601.04423

作者:Flavio Chierichetti,Mirko Giacchini,Ravi Kumar,Silvio Lattanzi,Alessandro Panconesi,Erasmo Tani,Andrew Tomkins
摘要:A Multinomial Logit (MNL) model is composed of a finite universe of items $[n]=\{1,..., n\}$, each assigned a positive weight. A query specifies an admissible subset -- called a slate -- and the model chooses one item from that slate with probability proportional to its weight. This query model is also known as the Plackett-Luce model or conditional sampling oracle in the literature. Although MNLs have been studied extensively, a basic computational question remains open: given query access to slates, how efficiently can we learn weights so that, for every slate, the induced choice distribution is within total variation distance $\varepsilon$ of the ground truth? This question is central to MNL learning and has direct implications for modern recommender system interfaces.   We provide two algorithms for this task, one with adaptive queries and one with non-adaptive queries. Each algorithm outputs an MNL $M'$ that induces, for each slate $S$, a distribution $M'_S$ on $S$ that is within $\varepsilon$ total variation distance of the true distribution. Our adaptive algorithm makes $O\left(\frac{n}{\varepsilon^{3}}\log n\right)$ queries, while our non-adaptive algorithm makes $O\left(\frac{n^{2}}{\varepsilon^{3}}\log n \log\frac{n}{\varepsilon}\right)$ queries. Both algorithms query only slates of size two and run in time proportional to their query complexity.   We complement these upper bounds with lower bounds of $\Omega\left(\frac{n}{\varepsilon^{2}}\log n\right)$ for adaptive queries and $\Omega\left(\frac{n^{2}}{\varepsilon^{2}}\log n\right)$ for non-adaptive queries, thus proving that our adaptive algorithm is optimal in its dependence on the support size $n$, while the non-adaptive one is tight within a $\log n$ factor.
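
As an illustrative sketch (ours, not the paper's algorithm), the slate query model and a naive way to learn weights from size-2 slates can be written as follows; the anchoring of item 0 at weight 1 and the query budget are arbitrary choices:

```python
import random

def mnl_choose(weights, slate, rng):
    # One MNL query: pick an item from the slate with probability
    # proportional to its positive weight.
    total = sum(weights[i] for i in slate)
    r = rng.random() * total
    for i in slate:
        r -= weights[i]
        if r <= 0:
            return i
    return slate[-1]

def estimate_weights(weights, n, queries=20000, seed=1):
    # Naive estimator using only slates of size two, as the paper's
    # algorithms do: anchor item 0 at weight 1 and use
    # P(i wins | {0, i}) = w_i / (w_i + 1)  =>  w_i = p / (1 - p).
    rng = random.Random(seed)
    est = [1.0] * n
    for i in range(1, n):
        wins = sum(mnl_choose(weights, [0, i], rng) == i for _ in range(queries))
        p = wins / queries
        est[i] = p / (1.0 - p)
    return est
```

With ground-truth weights such as [1.0, 2.0, 0.5], the estimates concentrate around the true values as the query budget grows; the paper's contribution is achieving this with near-optimal query complexity and total-variation guarantees, which this naive frequency estimator does not attempt.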


【14】Machine Learning Model for Sparse PCM Completion
标题:稀疏成对比较矩阵(PCM)补全的机器学习模型
链接:https://arxiv.org/abs/2601.04366

作者:Selcuk Koyuncu,Ronak Nouri,Stephen Providence
摘要:In this paper, we propose a machine learning model for sparse pairwise comparison matrices (PCMs), combining classical PCM approaches with graph-based learning techniques. Numerical results are provided to demonstrate the effectiveness and scalability of the proposed method.


【15】Predictable Gradient Manifolds in Deep Learning: Temporal Path-Length and Intrinsic Rank as a Complexity Regime
标题:深度学习中的可预测梯度流形:作为复杂度区间的时间路径长度与内在秩
链接:https://arxiv.org/abs/2601.04270

作者:Anherutowa Calvo
备注:12 Pages. Preprint
摘要:Deep learning optimization exhibits structure that is not captured by worst-case gradient bounds. Empirically, gradients along training trajectories are often temporally predictable and evolve within a low-dimensional subspace. In this work we formalize this observation through a measurable framework for predictable gradient manifolds.   We introduce two computable quantities: a prediction-based path length that measures how well gradients can be forecast from past information, and a predictable rank that quantifies the intrinsic temporal dimension of gradient increments. We show how classical online and nonconvex optimization guarantees can be restated so that convergence and regret depend explicitly on these quantities, rather than on worst-case variation.   Across convolutional networks, vision transformers, language models, and synthetic control tasks, we find that gradient trajectories are locally predictable and exhibit strong low-rank structure over time. These properties are stable across architectures and optimizers, and can be diagnosed directly from logged gradients using lightweight random projections.   Our results provide a unifying lens for understanding optimization dynamics in modern deep learning, reframing standard training as operating in a low-complexity temporal regime. This perspective suggests new directions for adaptive optimizers, rank-aware tracking, and prediction-based algorithm design grounded in measurable properties of real training runs.


【16】The Role of Quantum in Hybrid Quantum-Classical Neural Networks: A Realistic Assessment
标题:量子在混合量子经典神经网络中的作用:现实评估
链接:https://arxiv.org/abs/2601.04732

作者:Dominik Freinberger,Philipp Moser
备注:16 pages, 6 figures
摘要:Quantum machine learning has emerged as a promising application domain for near-term quantum hardware, particularly through hybrid quantum-classical models that leverage both classical and quantum processing. Although numerous hybrid architectures have been proposed and demonstrated successfully on benchmark tasks, a significant open question remains regarding the specific contribution of quantum components to the overall performance of these models. In this work, we aim to shed light on the impact of quantum processing within hybrid quantum-classical neural network architectures through a rigorous statistical study. We systematically assess common hybrid models on medical signal data as well as planar and volumetric images, examining the influence attributable to classical and quantum aspects such as encoding schemes, entanglement, and circuit size. We find that in best-case scenarios, hybrid models show performance comparable to their classical counterparts, however, in most cases, performance metrics deteriorate under the influence of quantum components. Our multi-modal analysis provides realistic insights into the contributions of quantum components and advocates for cautious claims and design choices for hybrid models in near-term applications.


其他(32篇)

【1】Sequential Subspace Noise Injection Prevents Accuracy Collapse in Certified Unlearning
标题:顺序子空间噪声注入防止可认证去学习中的准确率崩溃
链接:https://arxiv.org/abs/2601.05134

作者:Polina Dolgova,Sebastian U. Stich
摘要:Certified unlearning based on differential privacy offers strong guarantees but remains largely impractical: the noisy fine-tuning approaches proposed so far achieve these guarantees but severely reduce model accuracy. We propose sequential noise scheduling, which distributes the noise budget across orthogonal subspaces of the parameter space, rather than injecting it all at once. This simple modification mitigates the destructive effect of noise while preserving the original certification guarantees. We extend the analysis of noisy fine-tuning to the subspace setting, proving that the same $(\varepsilon,\delta)$ privacy budget is retained. Empirical results on image classification benchmarks show that our approach substantially improves accuracy after unlearning while remaining robust to membership inference attacks. These results show that certified unlearning can achieve both rigorous guarantees and practical utility.
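
The covariance bookkeeping behind distributing a noise budget over orthogonal subspaces can be sketched in a few lines (an illustrative two-way split of ours, not the paper's schedule or its privacy accounting):

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 4, 0.5

# Build two complementary orthogonal subspaces from an orthonormal basis Q.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
P1 = Q[:, :2] @ Q[:, :2].T   # projector onto the first subspace
P2 = Q[:, 2:] @ Q[:, 2:].T   # projector onto the complement

# Injecting N(0, sigma^2 I) noise restricted to each subspace in sequence
# accumulates covariance Cov(P1 g1 + P2 g2) = sigma^2 (P1 + P2) = sigma^2 I:
# the same total perturbation as one full-space injection, but spread out.
total_cov = sigma**2 * (P1 + P2)
```

Because each per-step perturbation lives in a low-dimensional subspace, every fine-tuning phase sees less noise at once, which is the intuition for mitigating accuracy collapse while the total injected covariance stays unchanged.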


【2】Approximate equivariance via projection-based regularisation
标题:通过基于投影的正则化实现近似等变性
链接:https://arxiv.org/abs/2601.05028

作者:Torben Berndt,Jan Stühmer
摘要:Equivariance is a powerful inductive bias in neural networks, improving generalisation and physical consistency. Recently, however, non-equivariant models have regained attention, due to their better runtime performance and imperfect symmetries that might arise in real-world applications. This has motivated the development of approximately equivariant models that strike a middle ground between respecting symmetries and fitting the data distribution. Existing approaches in this field usually apply sample-based regularisers which depend on data augmentation at training time, incurring a high sample complexity, in particular for continuous groups such as $SO(3)$. This work instead approaches approximate equivariance via a projection-based regulariser which leverages the orthogonal decomposition of linear layers into equivariant and non-equivariant components. In contrast to existing methods, this penalises non-equivariance at an operator level across the full group orbit, rather than point-wise. We present a mathematical framework for computing the non-equivariance penalty exactly and efficiently in both the spatial and spectral domain. In our experiments, our method consistently outperforms prior approximate equivariance approaches in both model performance and efficiency, achieving substantial runtime gains over sample-based regularisers.
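
For a finite group, the orthogonal decomposition of a linear layer into equivariant and non-equivariant components has a closed form: group averaging ("twirling") is the orthogonal projection onto the maps that commute with the action. Below is a minimal sketch of ours with the two-element swap group; the paper targets continuous groups such as $SO(3)$, where the sum becomes an integral:

```python
import numpy as np

def equivariant_projection(W, reps):
    # Orthogonal (Frobenius) projection of W onto the subspace of linear maps
    # commuting with a finite group action: average the conjugates R^T W R.
    # For orthogonal representations, R^{-1} = R^T.
    return sum(R.T @ W @ R for R in reps) / len(reps)

# Cyclic group C2 acting by swapping the two coordinates.
swap = np.array([[0.0, 1.0], [1.0, 0.0]])
reps = [np.eye(2), swap]

W = np.array([[1.0, 2.0], [3.0, 4.0]])
W_eq = equivariant_projection(W, reps)

# Operator-level penalty: distance of the layer from its equivariant part,
# measured across the whole group orbit rather than point-wise.
penalty = np.linalg.norm(W - W_eq) ** 2
```

Here `W_eq` commutes with every group element by construction, and `penalty` is zero exactly when the layer is already equivariant.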


【3】Precision over Diversity: High-Precision Reward Generalizes to Robust Instruction Following
标题:精度胜于多样性:高精度奖励可泛化到稳健的指令遵循
链接:https://arxiv.org/abs/2601.04954

作者:Yirong Zeng,Yufei Liu,Xiao Ding,Yutai Hou,Yuxian Wang,Haonan Song,Wu Ning,Dandan Tu,Qixun Zhang,Bibo Cai,Yuxiang He,Ting Liu
备注:ACL under review 13 pages, 8 figures
摘要:A central belief in scaling reinforcement learning with verifiable rewards for instruction following (IF) tasks is that, a diverse mixture of verifiable hard and unverifiable soft constraints is essential for generalizing to unseen instructions. In this work, we challenge this prevailing consensus through a systematic empirical investigation. Counter-intuitively, we find that models trained on hard-only constraints consistently outperform those trained on mixed datasets. Extensive experiments reveal that reward precision, rather than constraint diversity, is the primary driver of effective alignment. The LLM judge suffers from a low recall rate in detecting false response, which leads to severe reward hacking, thereby undermining the benefits of diversity. Furthermore, analysis of the attention mechanism reveals that high-precision rewards develop a transferable meta-skill for IF. Motivated by these insights, we propose a simple yet effective data-centric refinement strategy that prioritizes reward precision. Evaluated on five benchmarks, our approach outperforms competitive baselines by 13.4\% in performance while achieving a 58\% reduction in training time, maintaining strong generalization beyond instruction following. Our findings advocate for a paradigm shift: moving away from the indiscriminate pursuit of data diversity toward high-precision rewards.


【4】Cardinality augmented loss functions
标题:基数增强损失函数
链接:https://arxiv.org/abs/2601.04941

作者:Miguel O'Malley
备注:12 pages, 3 figures
摘要:Class imbalance is a common and pernicious issue for the training of neural networks. Often, an imbalanced majority class can dominate training to skew classifier performance towards the majority outcome. To address this problem we introduce cardinality augmented loss functions, derived from cardinality-like invariants in modern mathematics literature such as magnitude and the spread. These invariants enrich the concept of cardinality by evaluating the `effective diversity' of a metric space, and as such represent a natural solution to overly homogeneous training data. In this work, we establish a methodology for applying cardinality augmented loss functions in the training of neural networks and report results on both artificially imbalanced datasets as well as a real-world imbalanced material science dataset. We observe significant performance improvement among minority classes, as well as improvement in overall performance metrics.
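
For a finite metric space the magnitude invariant mentioned above has a simple closed form: with similarity matrix $Z_{ij} = e^{-t\,d(x_i, x_j)}$, the magnitude is the sum of the entries of $Z^{-1}$, an "effective number of points". The sketch below (ours, not the paper's loss function) shows that near-duplicate points count as fewer effective points:

```python
import numpy as np

def magnitude(points, t=1.0):
    # Magnitude of a finite metric space: sum of the entries of the inverse
    # similarity matrix Z_ij = exp(-t * d(x_i, x_j)).
    D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    Z = np.exp(-t * D)
    return float(np.linalg.inv(Z).sum())

# Three points where two nearly coincide behave like ~2 effective points,
# while three well-separated points behave like ~3.
clustered = np.array([[0.0, 0.0], [0.001, 0.0], [100.0, 0.0]])
spread = np.array([[0.0, 0.0], [100.0, 0.0], [200.0, 0.0]])
```

A loss term weighted by such an invariant can therefore penalize an overly homogeneous majority class less than its raw sample count would suggest.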


【5】V-FAT: Benchmarking Visual Fidelity Against Text-bias
标题:V-FAT:对抗文本偏差的视觉保真度基准测试
链接:https://arxiv.org/abs/2601.04897

作者:Ziteng Wang,Yujie He,Guanliang Li,Siqi Yang,Jiaqi Xiong,Songxiang Liu
备注:12 pages, 6 figures
摘要:Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive performance on standard visual reasoning benchmarks. However, there is growing concern that these models rely excessively on linguistic shortcuts rather than genuine visual grounding, a phenomenon we term Text Bias. In this paper, we investigate the fundamental tension between visual perception and linguistic priors. We decouple the sources of this bias into two dimensions: Internal Corpus Bias, stemming from statistical correlations in pretraining, and External Instruction Bias, arising from the alignment-induced tendency toward sycophancy. To quantify this effect, we introduce V-FAT (Visual Fidelity Against Text-bias), a diagnostic benchmark comprising 4,026 VQA instances across six semantic domains. V-FAT employs a Three-Level Evaluation Framework that systematically increases the conflict between visual evidence and textual information: (L1) internal bias from atypical images, (L2) external bias from misleading instructions, and (L3) synergistic bias where both coincide. We introduce the Visual Robustness Score (VRS), a metric designed to penalize "lucky" linguistic guesses and reward true visual fidelity. Our evaluation of 12 frontier MLLMs reveals that while models excel in existing benchmarks, they experience significant visual collapse under high linguistic dominance.


【6】Rethinking GNNs and Missing Features: Challenges, Evaluation and a Robust Solution
标题:重新思考GNN与缺失特征:挑战、评估与一种稳健的解决方案
链接:https://arxiv.org/abs/2601.04855

作者:Francesco Ferrini,Veronica Lachi,Antonio Longa,Bruno Lepri,Matono Akiyoshi,Andrea Passerini,Xin Liu,Manfred Jaeger
摘要:Handling missing node features is a key challenge for deploying Graph Neural Networks (GNNs) in real-world domains such as healthcare and sensor networks. Existing studies mostly address relatively benign scenarios, namely benchmark datasets with (a) high-dimensional but sparse node features and (b) incomplete data generated under Missing Completely At Random (MCAR) mechanisms. For (a), we theoretically prove that high sparsity substantially limits the information loss caused by missingness, making all models appear robust and preventing a meaningful comparison of their performance. To overcome this limitation, we introduce one synthetic and three real-world datasets with dense, semantically meaningful features. For (b), we move beyond MCAR and design evaluation protocols with more realistic missingness mechanisms. Moreover, we provide a theoretical background to state explicit assumptions on the missingness process and analyze their implications for different methods. Building on this analysis, we propose GNNmim, a simple yet effective baseline for node classification with incomplete feature data. Experiments show that GNNmim is competitive with respect to specialized architectures across diverse datasets and missingness regimes.


【7】Measurement-Consistent Langevin Corrector: A Remedy for Latent Diffusion Inverse Solvers
标题:测量一致的朗之万修正器:潜在扩散逆解器的补救措施
链接:https://arxiv.org/abs/2601.04791

作者:Lee Hyoseok,Sohwi Lim,Eunju Cha,Tae-Hyun Oh
备注:Under Review
摘要 :With recent advances in generative models, diffusion models have emerged as powerful priors for solving inverse problems in each domain. Since Latent Diffusion Models (LDMs) provide generic priors, several studies have explored their potential as domain-agnostic zero-shot inverse solvers. Despite these efforts, existing latent diffusion inverse solvers suffer from their instability, exhibiting undesirable artifacts and degraded quality. In this work, we first identify the instability as a discrepancy between the solver's and true reverse diffusion dynamics, and show that reducing this gap stabilizes the solver. Building on this, we introduce Measurement-Consistent Langevin Corrector (MCLC), a theoretically grounded plug-and-play correction module that remedies the LDM-based inverse solvers through measurement-consistent Langevin updates. Compared to prior approaches that rely on linear manifold assumptions, which often do not hold in latent space, MCLC operates without this assumption, leading to more stable and reliable behavior. We experimentally demonstrate the effectiveness of MCLC and its compatibility with existing solvers across diverse image restoration tasks. Additionally, we analyze blob artifacts and offer insights into their underlying causes. We highlight that MCLC is a key step toward more robust zero-shot inverse problem solvers.


【8】AgentOCR: Reimagining Agent History via Optical Self-Compression
标题:AgentOCR:通过光学自压缩重新构想代理历史
链接:https://arxiv.org/abs/2601.04786

作者:Lang Feng,Fuchao Yang,Feng Chen,Xin Cheng,Haiyang Xu,Zhenglin Wan,Ming Yan,Bo An
备注:Work in progress
摘要:Recent advances in large language models (LLMs) enable agentic systems trained with reinforcement learning (RL) over multi-turn interaction trajectories, but practical deployment is bottlenecked by rapidly growing textual histories that inflate token budgets and memory usage. We introduce AgentOCR, a framework that exploits the superior information density of visual tokens by representing the accumulated observation-action history as a compact rendered image. To make multi-turn rollouts scalable, AgentOCR proposes segment optical caching. By decomposing history into hashable segments and maintaining a visual cache, this mechanism eliminates redundant re-rendering. Beyond fixed rendering, AgentOCR introduces agentic self-compression, where the agent actively emits a compression rate and is trained with compression-aware reward to adaptively balance task success and token efficiency. We conduct extensive experiments on challenging agentic benchmarks, ALFWorld and search-based QA. Remarkably, results demonstrate that AgentOCR preserves over 95\% of text-based agent performance while substantially reducing token consumption (>50\%), yielding consistent token and memory efficiency. Our further analysis validates a 20x rendering speedup from segment optical caching and the effective strategic balancing of self-compression.


【9】MQ-GNN: A Multi-Queue Pipelined Architecture for Scalable and Efficient GNN Training
标题:MQ-GNN:用于可扩展和高效GNN训练的多队列流水线架构
链接:https://arxiv.org/abs/2601.04707

作者:Irfan Ullah,Young-Koo Lee
摘要:Graph Neural Networks (GNNs) are powerful tools for learning graph-structured data, but their scalability is hindered by inefficient mini-batch generation, data transfer bottlenecks, and costly inter-GPU synchronization. Existing training frameworks fail to overlap these stages, leading to suboptimal resource utilization. This paper proposes MQ-GNN, a multi-queue pipelined framework that maximizes training efficiency by interleaving GNN training stages and optimizing resource utilization. MQ-GNN introduces Ready-to-Update Asynchronous Consistent Model (RaCoM), which enables asynchronous gradient sharing and model updates while ensuring global consistency through adaptive periodic synchronization. Additionally, it employs global neighbor sampling with caching to reduce data transfer overhead and an adaptive queue-sizing strategy to balance computation and memory efficiency. Experiments on four large-scale datasets and ten baseline models demonstrate that MQ-GNN achieves up to $4.6\times$ faster training time and 30% improved GPU utilization while maintaining competitive accuracy. These results establish MQ-GNN as a scalable and efficient solution for multi-GPU GNN training.


【10】Nightmare Dreamer: Dreaming About Unsafe States And Planning Ahead
标题:噩梦梦想家:预想不安全状态并提前规划
链接:https://arxiv.org/abs/2601.04686

作者:Oluwatosin Oseni,Shengjie Wang,Jun Zhu,Micah Corah
备注:RSS'25: Multi-Objective Optimization and Planning in Robotics Workshop: 5 pages, 8 figures
摘要:Reinforcement Learning (RL) has shown remarkable success in real-world applications, particularly in robotics control. However, RL adoption remains limited due to insufficient safety guarantees. We introduce Nightmare Dreamer, a model-based Safe RL algorithm that addresses safety concerns by leveraging a learned world model to predict potential safety violations and plan actions accordingly. Nightmare Dreamer achieves nearly zero safety violations while maximizing rewards. Nightmare Dreamer outperforms model-free baselines on Safety Gymnasium tasks using only image observations, achieving nearly a 20x improvement in efficiency.


【11】Estimating Causal Effects in Gaussian Linear SCMs with Finite Data
标题:有限数据下高斯线性SCM中的因果效应估计
链接:https://arxiv.org/abs/2601.04673

作者:Aurghya Maiti,Prateek Jain
备注:Accepted at the Workshop on Scaling Up Intervention Models at the 42nd International Conference on Machine Learning (ICML 2025)
摘要 :Estimating causal effects from observational data remains a fundamental challenge in causal inference, especially in the presence of latent confounders. This paper focuses on estimating causal effects in Gaussian Linear Structural Causal Models (GL-SCMs), which are widely used due to their analytical tractability. However, parameter estimation in GL-SCMs is often infeasible with finite data, primarily due to overparameterization. To address this, we introduce the class of Centralized Gaussian Linear SCMs (CGL-SCMs), a simplified yet expressive subclass where exogenous variables follow standardized distributions. We show that CGL-SCMs are equally expressive in terms of causal effect identifiability from observational distributions and present a novel EM-based estimation algorithm that can learn CGL-SCM parameters and estimate identifiable causal effects from finite observational samples. Our theoretical analysis is validated through experiments on synthetic data and benchmark causal graphs, demonstrating that the learned models accurately recover causal distributions.


【12】A Vision for Multisensory Intelligence: Sensing, Synergy, and Science
标题:多感官智能愿景:感知、协同和科学
链接:https://arxiv.org/abs/2601.04563

作者:Paul Pu Liang
摘要:Our experience of the world is multisensory, spanning a synthesis of language, sight, sound, touch, taste, and smell. Yet, artificial intelligence has primarily advanced in digital modalities like text, vision, and audio. This paper outlines a research vision for multisensory artificial intelligence over the next decade. This new set of technologies can change how humans and AI experience and interact with one another, by connecting AI to the human senses and a rich spectrum of signals from physiological and tactile cues on the body, to physical and social signals in homes, cities, and the environment. We outline how this field must advance through three interrelated themes of sensing, science, and synergy. Firstly, research in sensing should extend how AI captures the world in richer ways beyond the digital medium. Secondly, developing a principled science for quantifying multimodal heterogeneity and interactions, developing unified modeling architectures and representations, and understanding cross-modal transfer. Finally, we present new technical challenges to learn synergy between modalities and between humans and AI, covering multisensory integration, alignment, reasoning, generation, generalization, and experience. Accompanying this vision paper are a series of projects, resources, and demos of latest advances from the Multisensory Intelligence group at the MIT Media Lab, see https://mit-mi.github.io/.


【13】Timeliness-Oriented Scheduling and Resource Allocation in Multi-Region Collaborative Perception
标题:多区域协同感知中面向时效性的调度与资源分配
链接:https://arxiv.org/abs/2601.04542

作者:Mengmeng Zhu,Yuxuan Sun,Yukuan Jia,Wei Chen,Bo Ai,Sheng Zhou
备注:This work has been submitted to the IEEE for possible publication
摘要:Collaborative perception (CP) is a critical technology in applications like autonomous driving and smart cities. It involves the sharing and fusion of information among sensors to overcome the limitations of individual perception, such as blind spots and range limitations. However, CP faces two primary challenges. First, due to the dynamic nature of the environment, the timeliness of the transmitted information is critical to perception performance. Second, with limited computational power at the sensors and constrained wireless bandwidth, the communication volume must be carefully designed to ensure feature representations are both effective and sufficient. This work studies the dynamic scheduling problem in a multi-region CP scenario, and presents a Timeliness-Aware Multi-region Prioritized (TAMP) scheduling algorithm to trade-off perception accuracy and communication resource usage. Timeliness reflects the utility of information that decays as time elapses, which is manifested by the perception performance in CP tasks. We propose an empirical penalty function that maps the joint impact of Age of Information (AoI) and communication volume to perception performance. Aiming to minimize this timeliness-oriented penalty in the long-term, and recognizing that scheduling decisions have a cumulative effect on subsequent system states, we propose the TAMP scheduling algorithm. TAMP is a Lyapunov-based optimization policy that decomposes the long-term average objective into a per-slot prioritization problem, balancing the scheduling worth against resource cost. We validate our algorithm in both intersection and corridor scenarios with the real-world Roadside Cooperative perception (RCooper) dataset. Extensive simulations demonstrate that TAMP outperforms the best-performing baseline, achieving an Average Precision (AP) improvement of up to 27% across various configurations.


【14】Paradoxical noise preference in RNNs
标题:RNN中的悖论式噪声偏好
链接:https://arxiv.org/abs/2601.04539

作者:Noah Eckstein,Manoj Srinivasan
备注:15 pages, 6 figures
摘要:In recurrent neural networks (RNNs) used to model biological neural networks, noise is typically introduced during training to emulate biological variability and regularize learning. The expectation is that removing the noise at test time should preserve or improve performance. Contrary to this intuition, we find that continuous-time recurrent neural networks (CTRNNs) often perform best at a nonzero noise level, specifically, the same level used during training. This noise preference typically arises when noise is injected inside the neural activation function; networks trained with noise injected outside the activation function perform best with zero noise. Through analyses of simple function approximation, maze navigation, and single neuron regulator tasks, we show that the phenomenon stems from noise-induced shifts of fixed points (stationary distributions) in the underlying stochastic dynamics of the RNNs. These fixed point shifts are noise-level dependent and bias the network outputs when the noise is removed, degrading performance. Analytical and numerical results show that the bias arises when neural states operate near activation function nonlinearities, where noise is asymmetrically attenuated, and that performance optimization incentivizes operation near these nonlinearities. Thus, networks can overfit to the stochastic training environment itself rather than just to the input-output data. The phenomenon is distinct from stochastic resonance, wherein nonzero noise enhances signal processing. Our findings reveal that training noise can become an integral part of the computation learned by recurrent networks, with implications for understanding neural population dynamics and for the design of robust artificial RNNs.
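
The asymmetric-attenuation mechanism can be seen in a single neuron: near saturation, symmetric noise passed *through* tanh produces a biased mean output, while noise added *outside* the nonlinearity does not. The operating point and noise scale below are arbitrary illustrative choices of ours:

```python
import numpy as np

rng = np.random.default_rng(0)
eps = rng.normal(0.0, 0.5, size=200_000)
x = 1.5  # operating point near the tanh knee, where curvature is strong

inside_mean = np.tanh(x + eps).mean()     # noise injected inside the activation
outside_mean = (np.tanh(x) + eps).mean()  # noise injected outside the activation

# Removing the noise at test time shifts the effective operating point by
# roughly |E[tanh(x + eps)] - tanh(x)|, the bias described in the abstract.
bias_inside = abs(inside_mean - np.tanh(x))
bias_outside = abs(outside_mean - np.tanh(x))
```

Because tanh is concave for positive inputs, upward noise excursions are attenuated more than downward ones, so the inside-noise mean sits below the noiseless value, while the outside-noise mean matches it up to sampling error.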


【15】Bridging Distance and Spectral Positional Encodings via Anchor-Based Diffusion Geometry Approximation
标题:通过基于锚点的扩散几何近似桥接距离编码与谱位置编码
链接:https://arxiv.org/abs/2601.04517

作者:Zimo Yan,Zheng Xie,Runfan Duan,Chang Liu,Wumei Du
摘要:Molecular graph learning benefits from positional signals that capture both local neighborhoods and global topology. Two widely used families are spectral encodings derived from Laplacian or diffusion operators and anchor-based distance encodings built from shortest-path information, yet their precise relationship is poorly understood. We interpret distance encodings as a low-rank surrogate of diffusion geometry and derive an explicit trilateration map that reconstructs truncated diffusion coordinates from transformed anchor distances and anchor spectral positions, with pointwise and Frobenius-gap guarantees on random regular graphs. On DrugBank molecular graphs using a shared GNP-based DDI prediction backbone, a distance-driven Nyström scheme closely recovers diffusion geometry, and both Laplacian and distance encodings substantially outperform a no-encoding baseline.
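
The paper's distance-driven Nyström scheme is its own construction, but the generic Nyström idea of reconstructing a full affinity matrix from a few anchor columns can be sketched as follows (the sizes, kernel, and random anchor choice are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 60, 15                     # nodes and anchors (illustrative sizes)
X = rng.standard_normal((n, 2))
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-D2 / 10.0)            # stand-in for a smooth diffusion affinity

anchors = rng.choice(n, size=m, replace=False)
C = K[:, anchors]                 # similarities to anchor nodes only
W = K[np.ix_(anchors, anchors)]   # anchor-anchor block
# Truncate tiny eigenvalues of W for numerical stability.
K_hat = C @ np.linalg.pinv(W, rcond=1e-8) @ C.T

rel_err = np.linalg.norm(K - K_hat) / np.linalg.norm(K)
```

The reconstruction is exact on the anchor block and, when the affinity matrix is approximately low-rank (as diffusion operators on graphs often are), the relative error over the full matrix stays small even though only the anchor columns were used.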


【16】Surface-based Molecular Design with Multi-modal Flow Matching
标题:基于多模态流匹配的表面分子设计
链接:https://arxiv.org/abs/2601.04506

作者:Fang Wu,Zhengyuan Zhou,Shuting Jin,Xiangxiang Zeng,Jure Leskovec,Jinbo Xu
摘要:Therapeutic peptides show promise in targeting previously undruggable binding sites, with recent advancements in deep generative models enabling full-atom peptide co-design for specific protein receptors. However, the critical role of molecular surfaces in protein-protein interactions (PPIs) has been underexplored. To bridge this gap, we propose an omni-design peptides generation paradigm, called SurfFlow, a novel surface-based generative algorithm that enables comprehensive co-design of sequence, structure, and surface for peptides. SurfFlow employs a multi-modality conditional flow matching (CFM) architecture to learn distributions of surface geometries and biochemical properties, enhancing peptide binding accuracy. Evaluated on the comprehensive PepMerge benchmark, SurfFlow consistently outperforms full-atom baselines across all metrics. These results highlight the advantages of considering molecular surfaces in de novo peptide discovery and demonstrate the potential of integrating multiple protein modalities for more effective therapeutic peptide discovery.


【17】Re-Rankers as Relevance Judges
标题:重排序器作为相关性评判者
链接:https://arxiv.org/abs/2601.04455

作者:Chuan Meng,Jiqun Liu,Mohammad Aliannejadi,Fengran Mo,Jeff Dalton,Maarten de Rijke
摘要:Using large language models (LLMs) to predict relevance judgments has shown promising results. Most studies treat this task as a distinct research line, e.g., focusing on prompt design for predicting relevance labels given a query and passage. However, predicting relevance judgments is essentially a form of relevance prediction, a problem extensively studied in tasks such as re-ranking. Despite this potential overlap, little research has explored reusing or adapting established re-ranking methods to predict relevance judgments, leading to potential resource waste and redundant development. To bridge this gap, we reproduce re-rankers in a re-ranker-as-relevance-judge setup. We design two adaptation strategies: (i) using binary tokens (e.g., "true" and "false") generated by a re-ranker as direct judgments, and (ii) converting continuous re-ranking scores into binary labels via thresholding. We perform extensive experiments on TREC-DL 2019 to 2023 with 8 re-rankers from 3 families, ranging from 220M to 32B, and analyse the evaluation bias exhibited by re-ranker-based judges. Results show that re-ranker-based relevance judges, under both strategies, can outperform UMBRELA, a state-of-the-art LLM-based relevance judge, in around 40% to 50% of the cases; they also exhibit strong self-preference towards their own and same-family re-rankers, as well as cross-family bias.
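
Strategy (ii) above is mechanically simple; here is a sketch of thresholding continuous re-ranker scores into binary relevance judgments, with the cutoff picked on a small calibration set of human labels (the calibration step is our illustrative addition, not necessarily the paper's procedure):

```python
def scores_to_judgments(scores, threshold):
    # Strategy (ii): binarize continuous re-ranking scores into 0/1 labels.
    return [1 if s >= threshold else 0 for s in scores]

def pick_threshold(scores, human_labels, grid):
    # Choose the cutoff that maximizes agreement with held-out human labels.
    def agreement(t):
        preds = scores_to_judgments(scores, t)
        return sum(p == h for p, h in zip(preds, human_labels)) / len(human_labels)
    return max(grid, key=agreement)
```

In contrast, strategy (i) skips the threshold entirely and reads the re-ranker's generated "true"/"false" token as the judgment.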


【18】Improving and Accelerating Offline RL in Large Discrete Action Spaces with Structured Policy Initialization
标题:通过结构化策略初始化改进并加速大型离散动作空间中的离线强化学习
链接:https://arxiv.org/abs/2601.04441

作者:Matthew Landers,Taylor W. Killian,Thomas Hartvigsen,Afsaneh Doryab
摘要:Reinforcement learning in discrete combinatorial action spaces requires searching over exponentially many joint actions to simultaneously select multiple sub-actions that form coherent combinations. Existing approaches either simplify policy learning by assuming independence across sub-actions, which often yields incoherent or invalid actions, or attempt to learn action structure and control jointly, which is slow and unstable. We introduce Structured Policy Initialization (SPIN), a two-stage framework that first pre-trains an Action Structure Model (ASM) to capture the manifold of valid actions, then freezes this representation and trains lightweight policy heads for control. On challenging discrete DM Control benchmarks, SPIN improves average return by up to 39% over the state of the art while reducing time to convergence by up to 12.8$\times$.


【19】Distribution-Guided and Constrained Quantum Machine Unlearning
标题:分布引导与约束的量子机器去学习
链接:https://arxiv.org/abs/2601.04413

作者:Nausherwan Malik,Zubair Khalid,Muhammad Faryad
备注:8 pages
摘要 :Machine unlearning aims to remove the influence of specific training data from a learned model without full retraining. While recent work has begun to explore unlearning in quantum machine learning, existing approaches largely rely on fixed, uniform target distributions and do not explicitly control the trade-off between forgetting and retained model behaviour. In this work, we propose a distribution-guided framework for class-level quantum machine unlearning that treats unlearning as a constrained optimization problem. Our method introduces a tunable target distribution derived from model similarity statistics, decoupling the suppression of forgotten-class confidence from assumptions about redistribution among retained classes. We further incorporate an anchor-based preservation constraint that explicitly maintains predictive behaviour on selected retained data, yielding a controlled optimization trajectory that limits deviation from the original model. We evaluate the approach on variational quantum classifiers trained on the Iris and Covertype datasets. Results demonstrate sharp suppression of forgotten-class confidence, minimal degradation of retained-class performance, and closer alignment with the gold retrained model baselines compared to uniform-target unlearning. These findings highlight the importance of target design and constraint-based formulations for reliable and interpretable quantum machine unlearning.


【20】Enhanced-FQL($λ$), an Efficient and Interpretable RL with novel Fuzzy Eligibility Traces and Segmented Experience Replay
标题:增强型FQL($λ$):一种高效且可解释的强化学习方法,融合新颖的模糊资格迹与分段经验回放
链接:https://arxiv.org/abs/2601.04392

作者:Mohsen Jalaeian-Farimani
备注:Submitted to ECC26 conference
摘要:This paper introduces a fuzzy reinforcement learning framework, Enhanced-FQL($λ$), that integrates novel Fuzzified Eligibility Traces (FET) and Segmented Experience Replay (SER) into fuzzy Q-learning with Fuzzified Bellman Equation (FBE) for continuous control tasks. The proposed approach employs an interpretable fuzzy rule base instead of complex neural architectures, while maintaining competitive performance through two key innovations: a fuzzified Bellman equation with eligibility traces for stable multi-step credit assignment, and a memory-efficient segment-based experience replay mechanism for enhanced sample efficiency. Theoretical analysis proves the proposed method's convergence under standard assumptions. Extensive evaluations in continuous control domains demonstrate that Enhanced-FQL($λ$) achieves superior sample efficiency and reduced variance compared to n-step fuzzy TD and fuzzy SARSA($λ$) baselines, while maintaining substantially lower computational complexity than deep RL alternatives such as DDPG. The framework's inherent interpretability, combined with its computational efficiency and theoretical convergence guarantees, makes it particularly suitable for safety-critical applications where transparency and resource constraints are essential.
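The core of a fuzzified eligibility trace is that traces accumulate by rule membership degree rather than by binary state visits. The minimal sketch below shows one TD(λ)-style update under that idea; the variable names and exact update rule are assumptions for illustration, not the paper's implementation.

```python
def fql_lambda_step(q, e, memberships, action, reward, q_next,
                    gamma=0.99, lam=0.9, lr=0.1):
    """One fuzzy Q(lambda) update. q[r][a] and e[r][a] are per-rule Q-values
    and eligibility traces; memberships[r] is the firing degree of rule r
    for the current state. Traces decay by gamma*lam and accumulate the
    membership degree instead of a 0/1 visit indicator."""
    q_sa = sum(m * q[r][action] for r, m in memberships.items())
    delta = reward + gamma * q_next - q_sa              # fuzzified TD error
    for r, m in memberships.items():
        e[r][action] = gamma * lam * e[r][action] + m   # fuzzy trace update
    for r in q:                                         # credit all traced rules
        for a in q[r]:
            q[r][a] += lr * delta * e[r][a]
    return delta

q = {0: {0: 0.0}, 1: {0: 0.0}}
e = {0: {0: 0.0}, 1: {0: 0.0}}
delta = fql_lambda_step(q, e, memberships={0: 0.6, 1: 0.4},
                        action=0, reward=1.0, q_next=0.0)
```

Because credit is spread across all rules with non-zero traces, a single reward updates every recently-active fuzzy region, which is the multi-step credit assignment the abstract refers to.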


【21】Quantifying the Effect of Test Set Contamination on Generative Evaluations
标题:量化测试集污染对生成式评估的影响
链接:https://arxiv.org/abs/2601.04301

作者:Rylan Schaeffer,Joshua Kazdan,Baber Abbasi,Ken Ziyu Liu,Brando Miranda,Ahmed Ahmed,Abhay Puri,Niloofar Mireshghallah,Sanmi Koyejo
摘要:As frontier AI systems are pretrained on web-scale data, test set contamination has become a critical concern for accurately assessing their capabilities. While research has thoroughly investigated the impact of test set contamination on discriminative evaluations like multiple-choice question-answering, comparatively little research has studied the impact of test set contamination on generative evaluations. In this work, we quantitatively assess the effect of test set contamination on generative evaluations through the language model lifecycle. We pretrain language models on mixtures of web data and the MATH benchmark, sweeping model sizes and number of test set replicas contaminating the pretraining corpus; performance improves with contamination and model size. Using scaling laws, we make a surprising discovery: including even a single test set replica enables models to achieve lower loss than the irreducible error of training on the uncontaminated corpus. We then study further training: overtraining with fresh data reduces the effects of contamination, whereas supervised finetuning on the training set can either increase or decrease performance on test data, depending on the amount of pretraining contamination. Finally, at inference, we identify factors that modulate memorization: high sampling temperatures mitigate contamination effects, and longer solutions are exponentially more difficult to memorize than shorter ones, presenting a contrast with discriminative evaluations, where solutions are only a few tokens in length. By characterizing how generation and memorization interact, we highlight a new layer of complexity for trustworthy evaluation of AI systems.
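The abstract's observation that longer solutions are exponentially harder to memorise admits a simple back-of-envelope illustration: if a model reproduces each memorised token correctly with probability p (an assumed simplification treating tokens as independent), an exact n-token recall has probability p**n, which collapses rapidly with n.

```python
# Toy calculation, not from the paper: per-token reproduction probability p,
# exact-sequence recall probability p ** n for solution length n.
p = 0.95
for n in (4, 64, 256):
    print(n, p ** n)
```

Even at p = 0.95 per token, a 256-token solution is recalled verbatim with probability on the order of 1e-6, whereas a few-token discriminative answer (n = 4) survives with probability above 0.8, matching the contrast with discriminative evaluations drawn in the abstract.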


【22】ArtCognition: A Multimodal AI Framework for Affective State Sensing from Visual and Kinematic Drawing Cues
标题:ArtCognition:一个多模态人工智能框架,用于从视觉与运动学绘画线索感知情感状态
链接:https://arxiv.org/abs/2601.04297

作者:Behrad Binaei-Haghighi,Nafiseh Sadat Sajadi,Mehrad Liviyan,Reyhane Akhavan Kharazi,Fatemeh Amirkhani,Behnam Bahrak
备注:12 pages, 7 figures
摘要:The objective assessment of human affective and psychological states presents a significant challenge, particularly through non-verbal channels. This paper introduces digital drawing as a rich and underexplored modality for affective sensing. We present a novel multimodal framework, named ArtCognition, for the automated analysis of the House-Tree-Person (HTP) test, a widely used psychological instrument. ArtCognition uniquely fuses two distinct data streams: static visual features from the final artwork, captured by computer vision models, and dynamic behavioral kinematic cues derived from the drawing process itself, such as stroke speed, pauses, and smoothness. To bridge the gap between low-level features and high-level psychological interpretation, we employ a Retrieval-Augmented Generation (RAG) architecture. This grounds the analysis in established psychological knowledge, enhancing explainability and reducing the potential for model hallucination. Our results demonstrate that the fusion of visual and behavioral kinematic cues provides a more nuanced assessment than either modality alone. We show significant correlations between the extracted multimodal features and standardized psychological metrics, validating the framework's potential as a scalable tool to support clinicians. This work contributes a new methodology for non-intrusive affective state assessment and opens new avenues for technology-assisted mental healthcare.
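The behavioural kinematic cues named in the abstract (stroke speed, pauses) can be derived from timestamped stroke points. The helper below is an illustrative sketch with assumed thresholds and formulas; it is not the framework's feature extractor.

```python
def stroke_kinematics(points, pause_threshold=0.15):
    """Derive simple kinematic cues from one timestamped stroke,
    given as [(x, y, t), ...]: mean drawing speed and pause count.
    A gap between samples longer than pause_threshold seconds is
    counted as a pause. Thresholds are illustrative assumptions."""
    speeds, pauses = [], 0
    for (x0, y0, t0), (x1, y1, t1) in zip(points, points[1:]):
        dt = t1 - t0
        if dt > pause_threshold:
            pauses += 1
        dist = ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
        speeds.append(dist / dt)
    return sum(speeds) / len(speeds), pauses

# One fast segment, then the pen rests in place for a second.
mean_speed, pauses = stroke_kinematics(
    [(0, 0, 0.0), (3, 4, 0.5), (3, 4, 1.5)], pause_threshold=0.75)
```

Features of this kind, aggregated over strokes, are the dynamic stream that the framework fuses with static visual features of the finished drawing.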


【23】Mitigating Position-Shift Failures in Text-Based Modular Arithmetic via Position Curriculum and Template Diversity
标题:通过位置课程与模板多样性缓解基于文本的模运算中的位置偏移失效
链接:https://arxiv.org/abs/2601.04283

作者:Nikolay Yudin
摘要:Building on insights from the grokking literature, we study character-level Transformers trained to compute modular addition from text, and focus on robustness under input-format variation rather than only in-distribution accuracy. We identify a previously under-emphasized failure mode: models that achieve high in-distribution accuracy can fail catastrophically when the same expression is shifted to different absolute character positions ("position shift") or presented under out-of-distribution natural-language templates. Using a disjoint-pair split over all ordered pairs for p=97, we show that a baseline model reaches strong in-distribution performance yet collapses under position shift and template OOD. We then introduce a simple training recipe that combines (i) explicit expression boundary markers, (ii) position curriculum that broadens the range of absolute positions seen during training, (iii) diverse template mixtures, and (iv) consistency training across multiple variants per example. Across three seeds, this intervention substantially improves robustness to position shift and template OOD while maintaining high in-distribution accuracy, whereas an ALiBi-style ablation fails to learn the task under our setup. Our results suggest that steering procedural generalization under noisy supervision benefits from explicitly training invariances that are otherwise absent from the data distribution, and we provide a reproducible evaluation protocol and artifacts.
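Ingredients (i)-(iii) of the recipe are straightforward to sketch as a data generator: random natural-language templates, explicit boundary markers, and random left-padding that broadens the absolute positions seen during training. The templates and marker tokens below are invented for illustration, not the paper's exact artifacts.

```python
import random

TEMPLATES = [
    "compute {a} plus {b} mod {p}",
    "what is ({a}+{b}) % {p} ?",
    "add {a} and {b} modulo {p}",
]

def make_example(a, b, p=97, max_shift=32, rng=random):
    """Render a modular-addition prompt under a random template, wrap it in
    explicit boundary markers, and shift it to a random absolute character
    position via left-padding (the 'position curriculum'). Hypothetical."""
    text = rng.choice(TEMPLATES).format(a=a, b=b, p=p)
    pad = " " * rng.randrange(max_shift)
    return pad + "<expr>" + text + "</expr>", (a + b) % p

rng = random.Random(0)
x, y = make_example(40, 70, rng=rng)
```

Sampling many such variants per (a, b) pair is also what makes the consistency-training ingredient (iv) possible: the model can be penalised for disagreeing with itself across renderings of the same expression.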


【24】LEGATO: Good Identity Unlearning Is Continuous
标题:LEGATO:良好的身份遗忘是连续的
链接:https://arxiv.org/abs/2601.04282

作者:Qiang Chen,Chun-Wun Cheng,Xiu Su,Hongyan Xu,Xi Lin,Shan You,Angelica I. Aviles-Rivero,Yi Chen
摘要:Machine unlearning has come to play a crucial role in enabling generative models trained on large datasets to remove sensitive, private, or copyright-protected data. However, existing machine unlearning methods face three challenges in learning to forget identities in generative models: 1) inefficiency, where identity erasure requires fine-tuning all the model's parameters; 2) limited controllability, where forgetting intensity cannot be controlled and explainability is lacking; 3) catastrophic collapse, where the model's retention capability undergoes drastic degradation as forgetting progresses. Forgetting has typically been handled through discrete and unstable updates, often requiring full-model fine-tuning and leading to catastrophic collapse. In this work, we argue that identity forgetting should be modeled as a continuous trajectory, and introduce LEGATO - Learn to ForgEt Identity in GenerAtive Models via Trajectory-consistent Neural Ordinary Differential Equations. LEGATO augments pre-trained generators with fine-tunable lightweight Neural ODE adapters, enabling smooth, controllable forgetting while keeping the original model weights frozen. This formulation allows forgetting intensity to be precisely modulated via ODE step size, offering interpretability and robustness. To further ensure stability, we introduce trajectory consistency constraints that explicitly prevent catastrophic collapse during unlearning. Extensive experiments across in-domain and out-of-domain identity unlearning benchmarks show that LEGATO achieves state-of-the-art forgetting performance, avoids catastrophic collapse and reduces fine-tuned parameters.


【25】Systems Explaining Systems: A Framework for Intelligence and Consciousness
标题:系统解释系统:智力和意识的框架
链接:https://arxiv.org/abs/2601.04269

作者:Sean Niklas Semmler
备注:This work is presented as a preprint, and the author welcomes constructive feedback and discussion
摘要:This paper proposes a conceptual framework in which intelligence and consciousness emerge from relational structure rather than from prediction or domain-specific mechanisms. Intelligence is defined as the capacity to form and integrate causal connections between signals, actions, and internal states. Through context enrichment, systems interpret incoming information using learned relational structure that provides essential context in an efficient representation that the raw input itself does not contain, enabling efficient processing under metabolic constraints.   Building on this foundation, we introduce the systems-explaining-systems principle, where consciousness emerges when recursive architectures allow higher-order systems to learn and interpret the relational patterns of lower-order systems across time. These interpretations are integrated into a dynamically stabilized meta-state and fed back through context enrichment, transforming internal models from representations of the external world into models of the system's own cognitive processes.   The framework reframes predictive processing as an emergent consequence of contextual interpretation rather than explicit forecasting and suggests that recursive multi-system architectures may be necessary for more human-like artificial intelligence.


【26】Safety-Utility Conflicts Are Not Global: Surgical Alignment via Head-Level Diagnosis
标题:安全-效用冲突并非全局性:通过注意力头级诊断实现精准对齐
链接:https://arxiv.org/abs/2601.04262

作者:Wang Cai,Yilin Wen,Jinchang Hou,Du Su,Guoqiu Wang,Zhonghou Lv,Chenfu Bao,Yunfang Wu
摘要:Safety alignment in Large Language Models (LLMs) inherently presents a multi-objective optimization conflict, often accompanied by an unintended degradation of general capabilities. Existing mitigation strategies typically rely on global gradient geometry to resolve these conflicts, yet they overlook Modular Heterogeneity within Transformers, specifically that the functional sensitivity and degree of conflict vary substantially across different attention heads. Such global approaches impose uniform update rules across all parameters, often resulting in suboptimal trade-offs by indiscriminately updating utility sensitive heads that exhibit intense gradient conflicts. To address this limitation, we propose Conflict-Aware Sparse Tuning (CAST), a framework that integrates head-level diagnosis with sparse fine-tuning. CAST first constructs a pre-alignment conflict map by synthesizing Optimization Conflict and Functional Sensitivity, which then guides the selective update of parameters. Experiments reveal that alignment conflicts in LLMs are not uniformly distributed. We find that the drop in general capabilities mainly comes from updating a small group of ``high-conflict'' heads. By simply skipping these heads during training, we significantly reduce this loss without compromising safety, offering an interpretable and parameter-efficient approach to improving the safety-utility trade-off.
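A head-level conflict map of the kind CAST builds can be sketched as: per head, measure gradient conflict (negative cosine between safety and utility gradients) and weight it by that head's functional sensitivity, then flag the top scorers for skipping. This is a simplified stand-in with toy two-dimensional "gradients"; the actual diagnosis operates on real model gradients.

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    return num / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def high_conflict_heads(safety_grads, utility_grads, sensitivity, k=1):
    """Score each attention head by (gradient conflict) x (sensitivity),
    where conflict is the negative cosine between its safety and utility
    gradients, and return the k highest-conflict heads to exclude from
    fine-tuning. Illustrative sketch of a pre-alignment conflict map."""
    scores = {h: -cosine(safety_grads[h], utility_grads[h]) * sensitivity[h]
              for h in safety_grads}
    return sorted(scores, key=scores.get, reverse=True)[:k]

skip = high_conflict_heads(
    safety_grads={"h0": [1.0, 0.0], "h1": [1.0, 1.0]},
    utility_grads={"h0": [-1.0, 0.0], "h1": [1.0, 0.9]},
    sensitivity={"h0": 0.8, "h1": 0.5},
)
```

Here `h0` has exactly opposed safety/utility gradients and high sensitivity, so it is the head whose update would cost the most general capability, which is the population the abstract proposes to skip.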


【27】Automated Reproducibility Has a Problem Statement Problem
标题:自动化可复现性面临问题陈述难题
链接:https://arxiv.org/abs/2601.04226

作者:Thijs Snelleman,Peter Lundestad Lawrence,Holger H. Hoos,Odd Erik Gundersen
备注:Accepted at RAI Workshop @ AAAI 2026
摘要:Background. Reproducibility is essential to the scientific method, but reproduction is often a laborious task. Recent works have attempted to automate this process and relieve researchers of this workload. However, due to varying definitions of reproducibility, a clear problem statement is missing. Objectives. Create a generalisable problem statement, applicable to any empirical study. We hypothesise that we can represent any empirical study using a structure based on the scientific method and that this representation can be automatically extracted from any publication, and captures the essence of the study. Methods. We apply our definition of reproducibility as a problem statement for the automatisation of reproducibility by automatically extracting the hypotheses, experiments and interpretations of 20 studies and assess the quality based on assessments by the original authors of each study. Results. We create a dataset representing the reproducibility problem, consisting of the representation of 20 studies. The majority of author feedback is positive, for all parts of the representation. In a few cases, our method failed to capture all elements of the study. We also find room for improvement at capturing specific details, such as results of experiments. Conclusions. We conclude that our formulation of the problem is able to capture the concept of reproducibility in empirical AI studies across a wide range of subfields. Authors of original publications generally agree that the produced structure is representative of their work; we believe improvements can be achieved by applying our findings to create a more structured and fine-grained output in future work.


【28】Beyond Interaction Effects: Two Logics for Studying Population Inequalities
标题:超越相互作用效应:研究人口不平等的两种逻辑
链接:https://arxiv.org/abs/2601.04223

作者:Adel Daoud
摘要:When sociologists and other social scientists ask whether the return to college differs by race and gender, they face a choice between two fundamentally different modes of inquiry. Traditional interaction models follow deductive logic: the researcher specifies which variables moderate effects and tests these hypotheses. Machine learning methods follow inductive logic: algorithms search across vast combinatorial spaces to discover patterns of heterogeneity. This article develops a framework for navigating between these approaches. We show that the choice between deduction and induction reflects a tradeoff between interpretability and flexibility, and we demonstrate through simulation when each approach excels. Our framework is particularly relevant for inequality research, where understanding how treatment effects vary across intersecting social subpopulations is substantively central.


【29】ROOFS: RObust biOmarker Feature Selection
标题:ROOFS:稳健的生物标志物特征选择
链接:https://arxiv.org/abs/2601.05151

作者:Anastasiia Bakhmach,Paul Dufossé,Andrea Vaglio,Florence Monville,Laurent Greillier,Fabrice Barlési,Sébastien Benzekry
摘要:Feature selection (FS) is essential for biomarker discovery and in the analysis of biomedical datasets. However, challenges such as high-dimensional feature space, low sample size, multicollinearity, and missing values make FS non-trivial. Moreover, FS performances vary across datasets and predictive tasks. We propose roofs, a Python package available at https://gitlab.inria.fr/compo/roofs, designed to help researchers in the choice of FS method adapted to their problem. Roofs benchmarks multiple FS methods on the user's data and generates reports that summarize a comprehensive set of evaluation metrics, including downstream predictive performance estimated using optimism correction, stability, reliability of individual features, and true positive and false positive rates assessed on semi-synthetic data with a simulated outcome. We demonstrate the utility of roofs on data from the PIONeeR clinical trial, aimed at identifying predictors of resistance to anti-PD-(L)1 immunotherapy in lung cancer. The PIONeeR dataset contained 374 multi-source blood and tumor biomarkers from 435 patients. A reduced subset of 214 features was obtained through iterative variance inflation factor pre-filtering. Of the 34 FS methods gathered in roofs, we evaluated 23 in combination with 11 classifiers (253 models in total) and identified a filter based on the union of Benjamini-Hochberg false discovery rate-adjusted p-values from t-test and logistic regression as the optimal approach, outperforming other methods including the widely used LASSO. We conclude that comprehensive benchmarking with roofs has the potential to improve the robustness and reproducibility of FS discoveries and increase the translational value of clinical models.
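Among the metrics roofs reports is the stability of a feature-selection method across resamples. One common way to score that (the package's exact metric may differ) is the mean pairwise Jaccard similarity between the feature subsets selected on different bootstrap samples:

```python
from itertools import combinations

def selection_stability(selected_sets):
    """Mean pairwise Jaccard similarity between the feature subsets a
    selector picked on different resamples of the data: 1.0 means the
    same features are always chosen, values near 0 mean the selection
    is unstable. Illustrative; feature names are made up."""
    pairs = list(combinations(selected_sets, 2))
    sims = [len(a & b) / len(a | b) for a, b in pairs]
    return sum(sims) / len(sims)

s = selection_stability([{"crp", "ldh"}, {"crp", "ldh"}, {"crp", "alb"}])
```

A method that looks strong on predictive performance but scores poorly on such a stability measure is exactly the kind of trap a benchmark like roofs is designed to expose.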


【30】Gradient-based Optimisation of Modulation Effects
标题:基于梯度的调制效果优化
链接:https://arxiv.org/abs/2601.04867

作者:Alistair Carson,Alec Wright,Stefan Bilbao
备注:Submitted to J. Audio Eng. Soc. Dec. 2025
摘要:Modulation effects such as phasers, flangers and chorus effects are heavily used in conjunction with the electric guitar. Machine learning based emulation of analog modulation units has been investigated in recent years, but most methods have either been limited to one class of effect or suffer from a high computational cost or latency compared to canonical digital implementations. Here, we build on previous work and present a framework for modelling flanger, chorus and phaser effects based on differentiable digital signal processing. The model is trained in the time-frequency domain, but at inference operates in the time-domain, requiring zero latency. We investigate the challenges associated with gradient-based optimisation of such effects, and show that low-frequency weighting of loss functions avoids convergence to local minima when learning delay times. We show that when trained against analog effects units, sound output from the model is in some cases perceptually indistinguishable from the reference, but challenges still remain for effects with long delay times and feedback.
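The effect class being modelled, a delay line whose length is driven by a low-frequency oscillator and mixed with the dry signal, has a canonical zero-latency time-domain form. The sketch below is that canonical flanger, not the paper's differentiable model; parameter values are arbitrary, and it reads the nearest sample rather than interpolating the fractional delay.

```python
import math

def flanger(x, sr=48000, base_delay=0.002, depth=0.001, rate=0.5, mix=0.5):
    """Time-domain flanger: delay-line length modulated by a sinusoidal
    LFO at `rate` Hz around `base_delay` seconds, mixed with the dry
    signal. Nearest-sample read, no feedback; illustrative only."""
    out = []
    for n, s in enumerate(x):
        d = (base_delay + depth * math.sin(2 * math.pi * rate * n / sr)) * sr
        i = n - int(d)                      # modulated read position
        delayed = x[i] if i >= 0 else 0.0   # silence before the delay fills
        out.append((1 - mix) * s + mix * delayed)
    return out

y = flanger([1.0] + [0.0] * 499)  # impulse input
```

The delay-time parameters of such a structure are what the gradient-based training must learn, and it is the wrap of these delays through the loss landscape that motivates the low-frequency loss weighting discussed in the abstract.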


【31】The Minary Primitive of Computational Autopoiesis
标题:计算自创生的Minary原语
链接:https://arxiv.org/abs/2601.04501

作者:Daniel Connor,Colin Defant
备注 :21 pages, 2 figures
摘要:We introduce Minary, a computational framework designed as a candidate for the first formally provable autopoietic primitive. Minary represents interacting probabilistic events as multi-dimensional vectors and combines them via linear superposition rather than multiplicative scalar operations, thereby preserving uncertainty and enabling constructive and destructive interference in the range $[-1,1]$. A fixed set of ``perspectives'' evaluates ``semantic dimensions'' according to hidden competencies, and their interactions drive two discrete-time stochastic processes. We model this system as an iterated random affine map and use the theory of iterated random functions to prove that it converges in distribution to a unique stationary law; we moreover obtain an explicit closed form for the limiting expectation in terms of row, column, and global averages of the competency matrix. We then derive exact formulas for the mean and variance of the normalized consensus conditioned on the activation of a given semantic dimension, revealing how consensus depends on competency structure rather than raw input signals. Finally, we argue that Minary is organizationally closed yet operationally open in the sense of Maturana and Varela, and we discuss implications for building self-maintaining, distributed, and parallelizable computational systems that house a uniquely subjective notion of identity.
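The convergence result rests on the theory of iterated random affine maps, which can be illustrated in the scalar case: for x <- a*x + b with i.i.d. contractive a, taking expectations at stationarity gives E[x] = E[b] / (1 - E[a]). The simulation below checks that closed form numerically; the distributions chosen for a and b are arbitrary toy choices, not the paper's competency-matrix construction.

```python
import random

def stationary_mean_estimate(trials=2000, steps=200, seed=0):
    """Simulate the scalar iterated random affine map x <- a*x + b with
    i.i.d. a ~ U(0.2, 0.6) and b ~ U(0, 1), running each chain long
    enough to reach stationarity, and average the endpoints."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x = 0.0
        for _ in range(steps):
            x = rng.uniform(0.2, 0.6) * x + rng.uniform(0.0, 1.0)
        total += x
    return total / trials

est = stationary_mean_estimate()
exact = 0.5 / (1 - 0.4)  # E[b] / (1 - E[a]) = 0.8333...
```

The paper's explicit closed form for the limiting expectation in terms of row, column, and global averages of the competency matrix is the multi-dimensional analogue of this one-line calculation.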


【32】SpectraFormer: an Attention-Based Raman Unmixing Tool for Accessing the Graphene Buffer-Layer Signature on SiC
标题:SpectraFormer:一种基于注意力的拉曼解混工具,用于获取SiC上石墨烯缓冲层的特征信号
链接:https://arxiv.org/abs/2601.04445

作者:Dmitriy Poteryayev,Pietro Novelli,Annalisa Coriolano,Riccardo Dettori,Valentina Tozzini,Fabio Beltram,Massimiliano Pontil,Antonio Rossi,Stiven Forti,Camilla Coletti
备注:14 pages, 4 figures, 1 table
摘要:Raman spectroscopy is a key tool for graphene characterization, yet its application to graphene grown on silicon carbide (SiC) is strongly limited by the intense and variable second-order Raman response of the substrate. This limitation is critical for buffer layer graphene, a semiconducting interfacial phase, whose vibrational signatures overlap with the SiC background and are challenging to access reliably using conventional reference-based subtraction, due to strong spatial and experimental variability of the substrate signal. Here we present SpectraFormer, a transformer-based deep learning model that reconstructs the SiC Raman substrate contribution directly from post-growth partially masked spectroscopic data without relying on explicit reference measurements. By learning global correlations across the entire Raman shift range, the model captures the statistical structure of the SiC background and enables accurate reconstruction of its contribution in mixed spectra. Subtraction of the reconstructed substrate signal reveals weak vibrational features associated with ZLG that are inaccessible through conventional analysis methods. The extracted spectra are validated by ab initio vibrational calculations, allowing assignment of the resolved features to specific modes and confirming their physical consistency. By leveraging a state-of-the-art attention-based deep learning architecture, this approach establishes a robust, reference-free framework for Raman analysis of graphene on SiC and provides a foundation, compatible with real-time data acquisition, for its integration into automated, closed-loop AI-assisted growth optimization.


机器翻译由腾讯交互翻译提供,仅供参考


本文地址:http://www.python88.com/topic/191493