
Machine Learning Academic Digest [10.13]

arXiv Daily Academic Digest

Click "Read the original" to visit arxivdaily.com, which covers CS, Physics, Math, Economics, Statistics, Finance, Biology, and Electrical Engineering, with search and bookmarking features!


cs.LG: 212 papers in total today


Large Models (39 papers)

【1】Prompting Test-Time Scaling Is A Strong LLM Reasoning Data Augmentation
Link: https://arxiv.org/abs/2510.09599

Authors: Sondos Mahmoud Bsharat, Zhiqiang Shen
Note: Our code and data are available at this https URL
Abstract: Large language models (LLMs) have demonstrated impressive reasoning capabilities when provided with chain-of-thought exemplars, but curating large reasoning datasets remains laborious and resource-intensive. In this work, we introduce Prompting Test-Time Scaling (P-TTS), a simple yet effective inference-time data augmentation strategy for enhancing LLM reasoning through finetuning. Rather than collecting thousands or even millions of examples, P-TTS leverages a small pool of only 90 manually selected reasoning instances and systematically varies exemplar augmentation through principled instruction prompting intensities at test time to synthesize diverse reasoning trajectory contexts. We then finetune Qwen-2.5 models of various sizes on P-TTS data. Across a suite of mathematical reasoning benchmarks (AIME 2024 and 2025, MATH500, and GPQA-Diamond), our P-TTS-7B and 32B models outperform prior competitive baselines like S1 and S1.1 (1K-shot), achieving absolute accuracy gains of +26.66% and +30.00% on AIME'24 (7B), and +13.34% and +6.67% on AIME'25 (7B); P-TTS-32B yields gains of +23.33% and +16.63% on AIME'24, and +26.63% and +3.33% on AIME'25 (vs. S1 and S1.1, respectively), with comparable or better performance on MATH500 and GPQA-Diamond. We further show that P-TTS enhances zero-shot generalization accuracy on the out-of-domain reasoning benchmarks Gaokao, Kaoyan, OlympiadBench, AMC23, GradeSchoolMath, and Minerva. Our analysis suggests that test-time scaling effectively explores the latent space of reasoning patterns, amplifying LLM problem-solving with minimal annotation overhead and further unlocking the reasoning potential and capabilities of LLMs. Prompting Test-Time Scaling offers a practical, low-cost way to elicit LLM reasoning in resource-constrained or rapidly evolving domains.


【2】LiveOIBench: Can Large Language Models Outperform Human Contestants in Informatics Olympiads?
Link: https://arxiv.org/abs/2510.09595

Authors: Kaijian Zou, Aaron Xiong, Yunxiang Zhang, Frederick Zhang, Yueqi Ren, Jirong Yang, Ayoung Lee, Shitanshu Bhushan, Lu Wang
Abstract: Competitive programming problems increasingly serve as valuable benchmarks to evaluate the coding capabilities of large language models (LLMs) due to their complexity and ease of verification. Yet, current coding benchmarks face limitations such as a lack of exceptionally challenging problems, insufficient test case coverage, and reliance on online platform APIs that limit accessibility. To address these issues, we introduce LiveOIBench, a comprehensive benchmark featuring 403 expert-curated Olympiad-level competitive programming problems, each with an average of 60 expert-designed test cases. The problems are sourced directly from 72 official Informatics Olympiads in different regions conducted between 2023 and 2025. LiveOIBench distinguishes itself through four key features: (1) meticulously curated high-quality tasks with detailed subtask rubrics and extensive private test cases; (2) direct integration of elite contestant performance data to enable informative comparison against top-performing humans; (3) planned continuous, contamination-free updates from newly released Olympiad problems; and (4) a self-contained evaluation system facilitating offline and easy-to-reproduce assessments. Benchmarking 32 popular general-purpose and reasoning LLMs, we find that GPT-5 achieves a notable 81.76th percentile, a strong result that nonetheless falls short of top human contestants, who usually place above the 90th percentile. In contrast, among open-weight reasoning models, GPT-OSS-120B achieves only a 60th percentile, underscoring significant capability disparities from frontier closed models. Detailed analyses indicate that robust reasoning models prioritize precise problem analysis over excessive exploration, suggesting future models should emphasize structured analysis and minimize unnecessary exploration. All data, code, and leaderboard results will be made publicly available on our website.


【3】D-TPT: Dimensional Entropy Maximization for Calibrating Test-Time Prompt Tuning in Vision-Language Models
Link: https://arxiv.org/abs/2510.09473

Authors: Jisu Han, Wonjun Hwang
Abstract: The test-time adaptation paradigm provides flexibility toward domain shifts by performing immediate adaptation on unlabeled target data from the source model. Vision-Language Models (VLMs) leverage their generalization capabilities for diverse downstream tasks, and test-time prompt tuning has emerged as a prominent solution for adapting VLMs. In this work, we explore contrastive VLMs and identify the modality gap caused by a single dominant feature dimension across modalities. We observe that the dominant dimensions in both text and image modalities exhibit high predictive sensitivity, and that constraining their influence can improve calibration error. Building on this insight, we propose dimensional entropy maximization, which regularizes the distribution of textual features toward uniformity to mitigate the dependency on dominant dimensions. Our method alleviates the degradation of calibration performance in test-time prompt tuning, offering a simple yet effective solution to enhance the reliability of VLMs in real-world deployment scenarios.
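As a concrete illustration, the core regularizer can be sketched in a few lines. This is a minimal sketch assuming L2-normalized CLIP-style text features; the function name and the softmax-over-magnitudes reading of "distribution of textual features" are our own illustration, not the authors' code:

    import torch
    import torch.nn.functional as F

    def dimensional_entropy_loss(text_features: torch.Tensor) -> torch.Tensor:
        """Push per-dimension feature mass toward uniformity.

        text_features: (num_classes, dim) class-prompt embeddings. Each
        vector's magnitude pattern across dimensions is read as a
        distribution whose entropy we maximize, so that no single
        dominant dimension drives the cosine similarity.
        """
        p = F.softmax(text_features.abs(), dim=-1)       # (num_classes, dim)
        entropy = -(p * (p + 1e-8).log()).sum(dim=-1)    # per-class entropy
        return -entropy.mean()                           # minimizing maximizes entropy

In test-time prompt tuning, such a term would be added to the usual objective with a small weight, updating only the learnable prompt tokens.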


【4】Hybrid Models for Natural Language Reasoning: The Case of Syllogistic Logic
Link: https://arxiv.org/abs/2510.09472

Authors: Manuel Vargas Guzmán, Jakub Szymanik, Maciej Malicki
Abstract: Despite the remarkable progress in neural models, their ability to generalize, a cornerstone for applications like logical reasoning, remains a critical challenge. We delineate two fundamental aspects of this ability: compositionality, the capacity to abstract atomic logical rules underlying complex inferences, and recursiveness, the aptitude to build intricate representations through iterative application of inference rules. In the literature, these two aspects are often confounded under the umbrella term of generalization. To sharpen this distinction, we investigated the logical generalization capabilities of pre-trained large language models (LLMs) using the syllogistic fragment as a benchmark for natural language reasoning. Though simple, this fragment provides a foundational yet expressive subset of formal logic that supports controlled evaluation of essential reasoning abilities. Our findings reveal a significant disparity: while LLMs demonstrate reasonable proficiency in recursiveness, they struggle with compositionality. To overcome these limitations and establish a reliable logical prover, we propose a hybrid architecture integrating symbolic reasoning with neural computation. This synergistic interaction enables robust and efficient inference: neural components accelerate processing, while symbolic reasoning ensures completeness. Our experiments show that high efficiency is preserved even with relatively small neural components. As part of our proposed methodology, this analysis gives a rationale for, and highlights the potential of, hybrid models to effectively address key generalization barriers in neural reasoning systems.


【5】The Speech-LLM Takes It All: A Truly Fully End-to-End Spoken Dialogue State Tracking Approach
Link: https://arxiv.org/abs/2510.09424

Authors: Nizar El Ghazal, Antoine Caubrière, Valentin Vielzeuf
Abstract: This paper presents a comparative study of context management strategies for end-to-end Spoken Dialog State Tracking using Speech-LLMs. We systematically evaluate traditional multimodal context (combining text history and the spoken current turn), full spoken history, and compressed spoken history approaches. Our experiments on the SpokenWOZ corpus demonstrate that providing the full spoken conversation as input yields the highest performance among models of similar size, significantly surpassing prior methods. Furthermore, we show that attention-pooling-based compression of the spoken history offers a strong trade-off, maintaining competitive accuracy with reduced context size. Detailed analysis confirms that improvements stem from more effective context utilization.
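For readers unfamiliar with attention pooling, the compression step can be sketched as a small module that condenses a long sequence of speech-history embeddings into a fixed number of summary vectors. This is a generic sketch; the module name, query count, and head count are illustrative assumptions, not the paper's configuration:

    import torch
    import torch.nn as nn

    class AttentionPooling(nn.Module):
        """Compress (batch, seq_len, dim) history into (batch, num_queries, dim)
        via learned queries attending over the full spoken history."""
        def __init__(self, dim: int, num_queries: int = 16, num_heads: int = 8):
            super().__init__()
            self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        def forward(self, history: torch.Tensor) -> torch.Tensor:
            q = self.queries.unsqueeze(0).expand(history.size(0), -1, -1)
            pooled, _ = self.attn(q, history, history)
            return pooled

The pooled vectors then replace the raw history in the Speech-LLM's context, trading a small accuracy loss for a much shorter input.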


【6】Weight Initialization and Variance Dynamics in Deep Neural Networks and Large Language Models
Link: https://arxiv.org/abs/2510.09423

Authors: Yankun Han
Note: 8 pages, 6 figures
Abstract: Weight initialization governs signal propagation and gradient flow at the start of training. This paper offers a theory-grounded and empirically validated study across two regimes: compact ReLU multilayer perceptrons and GPT-2-style transformers. First, a logarithmic sweep of the initial standard deviation maps vanishing and exploding regimes and identifies a broad stability band with standard deviations between 1e-2 and 1e-1. Second, a controlled comparison shows that Kaiming (fan-in) initialization converges faster and more stably than Xavier under ReLU, consistent with variance-preserving theory. Third, in a from-scratch 12-layer GPT-2-style model, this paper tracks layerwise Q/K/V weight variance through pretraining and observes depth-dependent equilibration into narrow bands: shallow layers expand rapidly while deeper layers change more gradually. Together, these results connect classic initialization principles with modern transformer behavior and yield simple, practical recipes for robust training.
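Both comparisons in the abstract are easy to reproduce in miniature. A sketch of the setup (PyTorch; the depth, width, and variance probe are our illustrative choices, not the paper's exact experiment):

    import torch
    import torch.nn as nn

    def make_mlp(init: str, depth: int = 8, width: int = 512) -> nn.Sequential:
        layers = []
        for _ in range(depth):
            linear = nn.Linear(width, width)
            if init == "kaiming":    # std^2 = 2 / fan_in, suited to ReLU
                nn.init.kaiming_normal_(linear.weight, nonlinearity="relu")
            elif init == "xavier":   # std^2 = 2 / (fan_in + fan_out)
                nn.init.xavier_normal_(linear.weight)
            else:                    # fixed std from a logarithmic sweep
                nn.init.normal_(linear.weight, std=float(init))
            layers += [linear, nn.ReLU()]
        return nn.Sequential(*layers)

    # Probe how activation variance propagates with depth for each scheme.
    x = torch.randn(1024, 512)
    for scheme in ["kaiming", "xavier", "1e-2", "1e-1"]:
        print(scheme, make_mlp(scheme)(x).var().item())

Kaiming's 2/fan_in variance keeps activations roughly scale-stable under ReLU, while a fixed small standard deviation shrinks them geometrically with depth, which is the vanishing regime the sweep maps out.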


【7】Active Model Selection for Large Language Models
Link: https://arxiv.org/abs/2510.09418

Authors: Yavuz Durmazkeser, Patrik Okanovic, Andreas Kirsch, Torsten Hoefler, Nezihe Merve Gürel
Abstract: We introduce LLM SELECTOR, the first framework for active model selection of Large Language Models (LLMs). Unlike prior evaluation and benchmarking approaches that rely on fully annotated datasets, LLM SELECTOR efficiently identifies the best LLM with limited annotations. In particular, for any given task, LLM SELECTOR adaptively selects a small set of queries to annotate that are most informative about the best model for the task. To further reduce annotation cost, we leverage a judge-based oracle annotation model. Through extensive experiments on 6 benchmarks with 151 LLMs, we show that LLM SELECTOR reduces annotation costs by up to 59.62% when selecting the best and near-best LLM for the task.


【8】The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton
Link: https://arxiv.org/abs/2510.09378

Authors: Natalie Abreu, Nikhil Vyas, Sham Kakade, Depen Morwani
Abstract: Recent efforts to accelerate LLM pretraining have focused on computationally-efficient approximations that exploit second-order structure. This raises a key question for large-scale training: how much performance is forfeited by these approximations? To probe this question, we establish a practical upper bound on iteration complexity by applying full Gauss-Newton (GN) preconditioning to transformer models of up to 150M parameters. Our experiments show that full GN updates yield substantial gains over existing optimizers, achieving a 5.4x reduction in training iterations compared to strong baselines like SOAP and Muon. Furthermore, we find that a precise layerwise GN preconditioner, which ignores cross-layer information, nearly matches the performance of the full GN method. Collectively, our results suggest: (1) the GN approximation is highly effective for preconditioning, implying higher-order loss terms may not be critical for convergence speed; (2) the layerwise Hessian structure contains sufficient information to achieve most of these potential gains; and (3) a significant performance gap exists between current approximate methods and an idealized layerwise oracle.
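For context, the classic Gauss-Newton update that the paper scales up replaces the Hessian with J^T J. A minimal sketch on a generic least-squares objective (NumPy; the damping term and function names are our own, and the paper's transformer-scale implementation is far more involved):

    import numpy as np

    def gauss_newton_step(residual_fn, jacobian_fn, theta, damping=1e-6):
        """One GN update for L(theta) = 0.5 * ||r(theta)||^2.

        J^T J is a positive semi-definite approximation of the Hessian
        that drops the higher-order residual-curvature term.
        """
        r = residual_fn(theta)                    # (m,)
        J = jacobian_fn(theta)                    # (m, n)
        g = J.T @ r                               # gradient of L
        H_gn = J.T @ J + damping * np.eye(theta.size)
        return theta - np.linalg.solve(H_gn, g)   # preconditioned step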


【9】Localist LLMs -- A Mathematical Framework for Dynamic Locality Control
Link: https://arxiv.org/abs/2510.09338

Authors: Joachim Diederich
Abstract: We present a novel framework for training large language models with continuously adjustable internal representations that span the full spectrum from localist (interpretable, rule-based) to distributed (generalizable, efficient) encodings. The key innovation is a locality dial, a tunable parameter that dynamically controls the degree of localization during both training and inference without requiring model retraining. This is achieved through group sparsity penalties on attention mechanisms, information-theoretic anchor design, and dynamic rule injection. We provide rigorous mathematical proofs establishing explicit threshold conditions under which attention provably concentrates on semantically relevant blocks, with exponential bounds on attention entropy and pointer fidelity. Specifically, we prove that when group sparsity penalties exceed certain threshold values, the model's attention mechanisms concentrate on semantically relevant blocks, achieving low entropy and high fidelity with negligible error. This framework enables practitioners to continuously interpolate between interpretable and high-performance modes, supporting applications in regulated domains requiring both transparency and capability.
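A group sparsity penalty on attention can be written as a group-lasso term over per-block attention mass. This is a generic sketch under our own naming; the paper's anchor design, rule injection, and threshold analysis are not modeled here:

    import torch

    def group_sparsity_penalty(attn: torch.Tensor, block_ids: torch.Tensor) -> torch.Tensor:
        """Sum of L2 norms of attention mass per semantic block.

        attn:      (batch, heads, q_len, k_len) post-softmax attention
        block_ids: (k_len,) integer block assignment per key position
        Penalizing whole-block norms drives entire blocks toward zero,
        concentrating attention on a few relevant blocks.
        """
        penalty = attn.new_zeros(())
        for b in block_ids.unique():
            mask = block_ids == b
            penalty = penalty + attn[..., mask].norm(p=2, dim=-1).mean()
        return penalty

Raising the weight on such a penalty would act as the "locality dial": larger weights push attention past the concentration threshold the proofs describe.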


【10】Large Language Model Prompt Datasets: An In-depth Analysis and Insights
Link: https://arxiv.org/abs/2510.09316

Authors: Yuanming Zhang, Yan Lin, Arijit Khan, Huaiyu Wan
Abstract: A prompt is a natural language instruction that defines a specific task for a large language model (LLM) and serves as the primary interface for human-LLM interaction. With the growing deployment of LLMs, diverse prompt datasets are emerging from platforms such as GitHub and social media. These datasets span a wide array of applications and content types, facilitating both broader LLM utilization and improved prompt engineering. In this work, we have, for the first time, compiled an extensive list of prompt datasets sourced from various channels, representing a spectrum of downstream tasks, languages, engineering techniques, attributes, and modalities. We select key representative datasets for systematic analysis, revealing commonalities and differences in prompt construction across categories, distinguishing them from other text corpora such as literature and the web. We further propose a prompt optimization approach that leverages syntactic embeddings of part-of-speech and dependency structures. By identifying a centroid representation of prompts and guiding LLMs to rewrite prompts toward this centroid, our method improves the meaningfulness of model outputs. We have made our datasets and code available.


【11】Goal-oriented Backdoor Attack against Vision-Language-Action Models via Physical Objects
Link: https://arxiv.org/abs/2510.09269

Authors: Zirun Zhou, Zhengyang Xiao, Haochuan Xu, Jing Sun, Di Wang, Jingfeng Zhang
Abstract: Recent advances in vision-language-action (VLA) models have greatly improved embodied AI, enabling robots to follow natural language instructions and perform diverse tasks. However, their reliance on uncurated training datasets raises serious security concerns. Existing backdoor attacks on VLAs mostly assume white-box access and result in task failures instead of enforcing specific actions. In this work, we reveal a more practical threat: attackers can manipulate VLAs by simply injecting physical objects as triggers into the training dataset. We propose goal-oriented backdoor attacks (GoBA), where the VLA behaves normally in the absence of physical triggers but executes predefined and goal-oriented actions in their presence. Specifically, based on the popular VLA benchmark LIBERO, we introduce BadLIBERO, which incorporates diverse physical triggers and goal-oriented backdoor actions. In addition, we propose a three-level evaluation that categorizes the victim VLA's actions under GoBA into three states: nothing to do, try to do, and success to do. Experiments show that GoBA enables the victim VLA to successfully achieve the backdoor goal on 97% of inputs when the physical trigger is present, while causing zero performance degradation on clean inputs. Finally, by investigating factors related to GoBA, we find that the action trajectory and trigger color significantly influence attack performance, while trigger size has surprisingly little effect. The code and BadLIBERO dataset are accessible via the project page at https://goba-attack.github.io/.


【12】Detecting Data Contamination from Reinforcement Learning Post-training for Large Language Models
Link: https://arxiv.org/abs/2510.09259

Authors: Yongding Tao, Tian Wang, Yihong Dong, Huanyu Liu, Kechi Zhang, Xiaolong Hu, Ge Li
Abstract: Data contamination poses a significant threat to the reliable evaluation of Large Language Models (LLMs). This issue arises when benchmark samples may inadvertently appear in training sets, compromising the validity of reported performance. While detection methods have been developed for the pre-training and Supervised Fine-Tuning stages, a critical research gap exists for the increasingly significant phase of Reinforcement Learning (RL) post-training. As RL post-training becomes pivotal for advancing LLM reasoning, the absence of specialized contamination detection methods in this paradigm presents a critical vulnerability. To address this, we conduct the first systematic study of data detection within the RL post-training scenario and propose Self-Critique. Our method is motivated by a key observation: after the RL phase, the output entropy distribution of LLMs tends to collapse into highly specific and sparse modes. Self-Critique probes for the underlying policy collapse, i.e., the model's convergence to a narrow reasoning path, which causes this entropy reduction. To facilitate this research, we also introduce RL-MIA, a benchmark constructed to simulate this specific contamination scenario. Extensive experiments show that Self-Critique significantly outperforms baseline methods across multiple models and contamination tasks, achieving an AUC improvement of up to 30%. Whereas existing methods are close to a random guess for RL-phase contamination, our method makes detection possible.
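The entropy-collapse signal that Self-Critique builds on can be probed directly. A generic sketch of the measurement (our illustration, not the paper's detector):

    import torch
    import torch.nn.functional as F

    def mean_token_entropy(logits: torch.Tensor) -> torch.Tensor:
        """Average next-token entropy over a sequence.

        logits: (seq_len, vocab_size) from one forward pass on a sample.
        Per the paper's observation, RL post-trained models tend to show
        markedly lower output entropy on material they have memorized.
        """
        logp = F.log_softmax(logits, dim=-1)
        return -(logp.exp() * logp).sum(dim=-1).mean()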


【13】Zero-shot image privacy classification with Vision-Language Models
Link: https://arxiv.org/abs/2510.09253

Authors: Alina Elena Baia, Alessio Xompero, Andrea Cavallaro
Note: 5 pages, 3 figures, 3 tables. This work has been submitted to ICASSP 2026
Abstract: While specialized learning-based models have historically dominated image privacy prediction, the current literature increasingly favours adopting large Vision-Language Models (VLMs) designed for generic tasks. This trend risks overlooking the performance ceiling set by purpose-built models due to a lack of systematic evaluation. To address this problem, we establish a zero-shot benchmark for image privacy classification, enabling a fair comparison. We evaluate the top three open-source VLMs, according to a privacy benchmark, using task-aligned prompts, and we contrast their performance, efficiency, and robustness against established vision-only and multi-modal methods. Counter-intuitively, our results show that VLMs, despite their resource-intensive nature in terms of high parameter count and slower inference, currently lag behind specialized, smaller models in privacy prediction accuracy. We also find that VLMs exhibit higher robustness to image perturbations.


【14】Diagnosing Shoulder Disorders Using Multimodal Large Language Models and Consumer-Grade Cameras
Link: https://arxiv.org/abs/2510.09230

Authors: Jindong Hong, Wencheng Zhang, Shiqin Qiao, Jianhai Chen, Jianing Qiu, Chuanyang Zheng, Qian Xu, Yun Ji, Qianyue Wen, Weiwei Sun, Hao Li, Huizhen Li, Huichao Wang, Kai Wu, Meng Li, Yijun He, Lingjie Luo, Jiankai Sun
Abstract: Shoulder disorders, such as frozen shoulder (a.k.a. adhesive capsulitis), are common conditions affecting the health of people worldwide, with a high incidence rate among the elderly and workers engaged in repetitive shoulder tasks. In regions with scarce medical resources, achieving early and accurate diagnosis poses significant challenges, and there is an urgent need for low-cost and easily scalable auxiliary diagnostic solutions. This research introduces videos captured by consumer-grade devices as the basis for diagnosis, reducing the cost for users. We focus on the innovative application of Multimodal Large Language Models (MLLMs) in the preliminary diagnosis of shoulder disorders and propose a Hybrid Motion Video Diagnosis framework (HMVDx). This framework splits the workflow into two tasks, action understanding and disease diagnosis, each completed by its own MLLM. In addition to traditional evaluation metrics, this work proposes a novel metric called the Usability Index, following the logical process of medical decision-making (action recognition, movement diagnosis, and final diagnosis). This index evaluates the effectiveness of MLLMs in the medical field from the perspective of the entire medical diagnostic pathway, revealing the potential value of low-cost MLLMs in medical applications for medical practitioners. In experimental comparisons, the accuracy of HMVDx in diagnosing shoulder joint injuries increased by 79.6% compared with direct video diagnosis, a significant technical contribution to future research on the application of MLLMs for video understanding in the medical field.


【15】AdaPM: a Partial Momentum Algorithm for LLM Training
Link: https://arxiv.org/abs/2510.09103

Authors: Yimu Zhang, Yuanshi Liu, Cong Fang
Abstract: In the training of large language models, momentum is widely used and often demonstrated to achieve significant acceleration. However, storing momentum typically presents memory challenges. In this paper, we propose AdaPM, an adaptive training strategy that leverages partial momentum to implement a memory-efficient optimizer. To this end, AdaPM utilizes a non-uniform momentum design: for most blocks, full momentum is not necessary to preserve the performance of the optimization. In the momentum design of AdaPM, to mitigate the bias and performance loss caused by partial momentum, we enhance the partial momentum with a bias-correction technique. Empirically, we verify that our approach reduces memory for momentum by over 90% while maintaining both efficiency and performance for pretraining various language models ranging from 60M to 1.5B parameters, as well as for supervised fine-tuning and RLHF. AdaPM can further reduce memory by up to 95% in optimizer states by combining a memory-efficient technique for the second-order statistic, saving over 30% of GPU hours for pretraining GPT-2 1.5B.
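The idea of keeping momentum only where it matters can be sketched as an optimizer whose parameter groups opt in to a bias-corrected EMA buffer. This is our own minimal sketch; AdaPM's actual block-selection rule and second-order handling are not reproduced here:

    import torch

    class PartialMomentumSGD(torch.optim.Optimizer):
        """Momentum buffers exist only for groups flagged use_momentum=True;
        all other blocks take plain SGD steps and store no state."""
        def __init__(self, params, lr=1e-3, beta=0.9):
            super().__init__(params, dict(lr=lr, beta=beta))

        @torch.no_grad()
        def step(self):
            for group in self.param_groups:
                lr, beta = group["lr"], group["beta"]
                for p in group["params"]:
                    if p.grad is None:
                        continue
                    if group.get("use_momentum", False):
                        state = self.state[p]
                        if "m" not in state:
                            state["m"], state["t"] = torch.zeros_like(p), 0
                        state["t"] += 1
                        state["m"].mul_(beta).add_(p.grad, alpha=1 - beta)
                        m_hat = state["m"] / (1 - beta ** state["t"])  # bias correction
                        p.add_(m_hat, alpha=-lr)
                    else:
                        p.add_(p.grad, alpha=-lr)

Usage would pass two groups, e.g. PartialMomentumSGD([{"params": critical, "use_momentum": True}, {"params": rest}], lr=1e-3), so momentum memory scales only with the flagged blocks.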


【16】Alif: Advancing Urdu Large Language Models via Multilingual Synthetic Data Distillation
Link: https://arxiv.org/abs/2510.09051

Authors: Muhammad Ali Shafique, Kanwal Mehreen, Muhammad Arham, Maaz Amjad, Sabur Butt, Hamza Farooq
Note: Accepted to the EMNLP 2025 Workshop on Multilingual Representation Learning (MRL)
Abstract: Developing high-performing large language models (LLMs) for low-resource languages such as Urdu presents several challenges. These challenges include the scarcity of high-quality datasets, multilingual inconsistencies, and safety concerns. Existing multilingual LLMs often address these issues by translating large volumes of available data. However, such translations often lack quality and cultural nuance while also incurring significant costs for data curation and training. To address these issues, we propose Alif-1.0-8B-Instruct, a multilingual Urdu-English model that tackles these challenges with a unique approach. We train the model on a high-quality, multilingual synthetic dataset (Urdu-Instruct), developed using a modified self-instruct technique. By using unique prompts and seed values for each task along with a global task pool, this dataset incorporates Urdu-native chain-of-thought based reasoning, bilingual translation, cultural relevance, and ethical safety alignments. This technique significantly enhances the comprehension of the Alif-1.0-8B-Instruct model for Urdu-specific tasks. As a result, Alif-1.0-8B-Instruct, built upon the pretrained Llama-3.1-8B, demonstrates superior performance compared to Llama-3.1-8B-Instruct on Urdu-specific tasks. It also outperformed leading multilingual LLMs, including Mistral-7B-Instruct-v0.3, Qwen-2.5-7B-Instruct, and Cohere-Aya-Expanse-8B, all within a training budget of under $100. Our results demonstrate that high-performance, low-resource language LLMs can be developed efficiently and culturally aligned using our modified self-instruct approach. All datasets, models, and code are publicly available at: https://github.com/traversaal-ai/alif-urdu-llm.


【17】Cost-Efficient Long Code Translation using LLMs while Leveraging Identifier Replacements
Link: https://arxiv.org/abs/2510.09045

Authors: Manojit Chakraborty, Madhusudan Ghosh, Rishabh Gupta
Abstract: In the domain of software development, LLMs have been utilized to automate tasks such as code translation, where source code in one programming language is translated into another while preserving its functionality. However, LLMs often struggle with long source code that does not fit into the context window, which produces inaccurate translations. To address this, we propose a novel zero-shot code translation method that incorporates identifier replacement. By substituting long user-given identifiers with generalized placeholders during translation, our method allows the LLM to focus on the logical structure of the code while reducing token count and memory usage, which improves the efficiency and cost-effectiveness of long code translation. Our empirical results demonstrate that our approach preserves syntactical and hierarchical information and produces translation results with reduced tokens.
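The replacement step can be sketched as a regex pass that masks long identifiers before translation and restores them afterwards. A minimal sketch; the length threshold and placeholder scheme are our own assumptions, not the paper's exact rules:

    import re

    IDENT = re.compile(r"\b[A-Za-z_][A-Za-z0-9_]*\b")

    def mask_long_identifiers(code: str, min_len: int = 20):
        """Swap long identifiers for short placeholders; return the
        masked code plus the inverse mapping for restoration."""
        mapping = {}
        def repl(match):
            name = match.group(0)
            if len(name) < min_len:
                return name
            if name not in mapping:
                mapping[name] = f"ID_{len(mapping)}"
            return mapping[name]
        masked = IDENT.sub(repl, code)
        return masked, {v: k for k, v in mapping.items()}

    def restore_identifiers(translated: str, inverse: dict) -> str:
        for placeholder, original in inverse.items():
            translated = re.sub(rf"\b{placeholder}\b", original, translated)
        return translated

The masked code is what gets sent to the LLM; restoration runs on the translated output, so the token savings never cost identifier fidelity.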


【18】The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against Llm Jailbreaks and Prompt Injections
标题 :攻击者第二步行动:更强的适应性攻击绕过对Llm越狱和即时注射的防御
链接:https://arxiv.org/abs/2510.09023

Authors: Milad Nasr, Nicholas Carlini, Chawin Sitawarin, Sander V. Schulhoff, Jamie Hayes, Michael Ilie, Juliette Pluto, Shuang Song, Harsh Chaudhari, Ilia Shumailov, Abhradeep Thakurta, Kai Yuanqing Xiao, Andreas Terzis, Florian Tramèr
Abstract: How should we evaluate the robustness of language model defenses? Current defenses against jailbreaks and prompt injections (which aim to prevent an attacker from eliciting harmful knowledge or remotely triggering malicious actions, respectively) are typically evaluated either against a static set of harmful attack strings, or against computationally weak optimization methods that were not designed with the defense in mind. We argue that this evaluation process is flawed. Instead, we should evaluate defenses against adaptive attackers who explicitly modify their attack strategy to counter a defense's design while spending considerable resources to optimize their objective. By systematically tuning and scaling general optimization techniques (gradient descent, reinforcement learning, random search, and human-guided exploration), we bypass 12 recent defenses (based on a diverse set of techniques), with attack success rates above 90% for most; importantly, the majority of defenses originally reported near-zero attack success rates. We believe that future defense work must consider stronger attacks, such as the ones we describe, in order to make reliable and convincing claims of robustness.


【19】LLM Unlearning on Noisy Forget Sets: A Study of Incomplete, Rewritten, and Watermarked Data
Link: https://arxiv.org/abs/2510.09007

Authors: Changsheng Wang, Yihua Zhang, Dennis Wei, Jinghan Jia, Pin-Yu Chen, Sijia Liu
Note: Accepted by the 18th ACM Workshop on Artificial Intelligence and Security (AISec'25)
Abstract: Large language models (LLMs) exhibit remarkable generative capabilities but raise ethical and security concerns by memorizing sensitive data, reinforcing biases, and producing harmful content. These risks have spurred interest in LLM unlearning, the task of removing knowledge associated with undesirable data from pre-trained models. However, most existing methods assume access to clean, well-defined forget data samples, whereas real-world forget data could often be low-quality, synthetically rewritten, or watermarked, casting doubt on the reliability of unlearning. This work presents the first study of unlearning under perturbed or low-fidelity forget data, referred to as noisy forget sets. By systematically benchmarking state-of-the-art LLM unlearning methods, RMU and NPO, on such noisy forget sets, we find that unlearning remains surprisingly robust to perturbations, provided that core semantic signals are preserved. To explain this robustness, we propose a saliency-based interpretation: key semantic components that drive forgetting remain consistently influential despite substantial variation in surface form. This suggests that unlearning algorithms are primarily guided by deep semantic cues rather than shallow lexical patterns.


【20】Constraints-of-Thought: A Framework for Constrained Reasoning in Language-Model-Guided Search
Link: https://arxiv.org/abs/2510.08992

Authors: Kamel Alrashedy, Vriksha Srihari, Zulfiqar Zaidi, Ridam Srivastava, Pradyumna Tambwekar, Matthew Gombolay
Abstract: While researchers have made significant progress in enabling large language models (LLMs) to perform multi-step planning, LLMs struggle to ensure that those plans align with high-level user intent and satisfy symbolic constraints, especially in complex, multi-step domains. Existing reasoning approaches such as Chain-of-Thought (CoT), Tree-of-Thought (ToT), and verifier-augmented methods expand the search space but often yield infeasible actions or hallucinated steps. To overcome these limitations, we propose Constraints-of-Thought (Const-o-T), a framework that provides a structured prior enabling Monte Carlo Tree Search (MCTS) to focus its search on semantically meaningful paths. Each reasoning step is represented as an (intent, constraint) pair, which serves both to compress the search space and enforce validity. Unlike prior methods that merely generate reasoning traces or validate outputs post hoc, Const-o-T uses (intent, constraint) pairs to actively focus the search toward feasible and meaningful plans. We integrate Const-o-T into MCTS using a structured representation of intent-constraint pairs: constraints prune infeasible branches and guide exploration toward semantically valid actions, improving planning efficiency and verifiable decision-making. We demonstrate across three domains (the Risk game, CAD code generation, and arithmetic reasoning) that our approach outperforms baselines, yielding higher accuracy and stronger structural alignment. Our contribution is to demonstrate that Const-o-T offers a generalizable foundation for constraint-guided reasoning, enabling more efficient, constraint-aligned, and domain-adaptable planning with LLMs.


【21】When LLM Agents Meet Graph Optimization: An Automated Data Quality Improvement Approach
Link: https://arxiv.org/abs/2510.08952

Authors: Zhihan Zhang, Xunkai Li, Yilong Zuo, Zhenjun Li, Bing Zhou, Rong-Hua Li, Guoren Wang
Note: 12 pages, 7 figures
Abstract: Text-attributed graphs (TAGs) have emerged as a powerful representation that combines structural connections with fine-grained semantics, supporting a wide range of data-centric applications. However, the performance of graph neural networks (GNNs) on TAGs is highly sensitive to input quality. Our empirical study shows that both traditional GNNs and LLM-enhanced GNNs suffer significant degradation across nine representative scenarios of sparsity, noise, and imbalance, highlighting graph quality as a critical bottleneck. Existing approaches mainly focus on improving model architectures while neglecting systematic optimization of the TAG data itself, leading to limited effectiveness in practice. To address this gap, we propose LAGA (Large Language and Graph Agent), a unified multi-agent framework that treats graph quality control as a first-class, data-centric problem. LAGA integrates four collaborative agents (detection, planning, action, and evaluation) into an automated closed loop. At its core, the action agent employs a dual-encoder and tri-objective design to capture complementary information across modalities and perform holistic graph quality enhancement. Experiments across nine scenarios show that LAGA improves graph quality and achieves state-of-the-art performance across various tasks and backbones, validating data-centric quality optimization as key to reliable TAGs and robust graph learning.


【22】RADAR: Mechanistic Pathways for Detecting Data Contamination in LLM Evaluation
Link: https://arxiv.org/abs/2510.08931

Authors: Ashish Kattamuri, Harshwardhan Fartale, Arpita Vats, Rahul Raja, Ishita Prasad
Note: NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling
Abstract: Data contamination poses a significant challenge to reliable LLM evaluation, where models may achieve high performance by memorizing training data rather than demonstrating genuine reasoning capabilities. We introduce RADAR (Recall vs. Reasoning Detection through Activation Representation), a novel framework that leverages mechanistic interpretability to detect contamination by distinguishing recall-based from reasoning-based model responses. RADAR extracts 37 features spanning surface-level confidence trajectories and deep mechanistic properties including attention specialization, circuit dynamics, and activation flow patterns. Using an ensemble of classifiers trained on these features, RADAR achieves 93% accuracy on a diverse evaluation set, with perfect performance on clear cases and 76.7% accuracy on challenging ambiguous examples. This work demonstrates the potential of mechanistic interpretability for advancing LLM evaluation beyond traditional surface-level metrics.


【23】GTAlign: Game-Theoretic Alignment of LLM Assistants for Mutual Welfare
Link: https://arxiv.org/abs/2510.08872

Authors: Siqi Zhu, David Zhang, Pedro Cisneros-Velarde, Jiaxuan You
Note: 31 pages, 6 figures
Abstract: Large Language Models (LLMs) have achieved remarkable progress in reasoning, yet sometimes produce responses that are suboptimal for users in tasks such as writing, information seeking, or providing practical guidance. Conventional alignment practices typically assume that maximizing model reward also maximizes user welfare, but this assumption frequently fails in practice: models may over-clarify or generate overly verbose reasoning when users prefer concise answers. Such behaviors resemble the prisoner's dilemma, where individually rational choices lead to socially suboptimal outcomes. The fundamental challenge is the lack of a principled decision-making mechanism that mutually benefits both the LLM and the user. We propose Game-Theoretic Alignment (GTAlign), an alignment framework that integrates game-theoretic decision making into both reasoning and training. During reasoning, the model explicitly treats user-LLM interaction as a strategic game: it constructs payoff matrices within its reasoning chain to estimate welfare for both itself and the user, and then selects actions that are mutually beneficial. During training, we introduce a mutual welfare reward that reinforces cooperative responses, aligning model behavior with socially efficient outcomes. In addition, we introduce an inference technique that leverages game-theoretic reasoning to dynamically adapt the LLM's response when the pricing policies of LLM services change. Extensive experiments demonstrate that GTAlign substantially improves reasoning efficiency, answer quality, and mutual welfare compared to baselines across diverse tasks. The code is available at https://github.com/ulab-uiuc/GTAlign.
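At its simplest, the payoff-matrix step amounts to scoring candidate responses by joint welfare. A toy sketch with hypothetical welfare estimates; GTAlign performs this reasoning inside the model's chain of thought rather than in external code:

    def mutually_beneficial_action(payoffs: dict) -> str:
        """payoffs maps each candidate action to (llm_welfare, user_welfare),
        e.g. {"concise_answer": (0.6, 0.9), "verbose_reasoning": (0.8, 0.3)}.
        Selecting by joint welfare avoids the prisoner's-dilemma outcome of
        maximizing the model's own reward alone."""
        return max(payoffs, key=lambda action: sum(payoffs[action]))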


【24】TinyGraphEstimator: Adapting Lightweight Language Models for Graph Structure Inference
Link: https://arxiv.org/abs/2510.08808

Authors: Michal Podstawski
Abstract: Graphs provide a universal framework for representing complex relational systems, and inferring their structural properties is a core challenge in graph analysis and reasoning. While large language models have recently demonstrated emerging abilities to perform symbolic and numerical reasoning, the potential of smaller, resource-efficient models in this context remains largely unexplored. This paper investigates whether compact transformer-based language models can infer graph-theoretic parameters directly from textual graph representations. To enable systematic evaluation, we introduce the TinyGraphEstimator dataset: a balanced collection of connected graphs generated from multiple random graph models and annotated with detailed structural metadata. We evaluate several small open models on their ability to predict key graph parameters such as density, clustering, and chromatic number. Furthermore, we apply lightweight fine-tuning using the Low-Rank Adaptation (LoRA) technique, achieving consistent improvements across all evaluated metrics. The results demonstrate that small language models possess non-trivial reasoning capacity over graph-structured data and can be effectively adapted for structural inference tasks through efficient parameter tuning.
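LoRA itself is compact enough to sketch: a frozen base projection plus a trainable low-rank update. This is a generic sketch of the technique, not this paper's training code; the rank and scaling values are illustrative:

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Computes W x + (alpha / r) * B A x with W frozen, so only the
        rank-r factors A and B receive gradients."""
        def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False
            self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init
            self.scale = alpha / r

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

Because B starts at zero, the adapted model initially matches the pretrained one, and fine-tuning trains only a tiny fraction of the parameters.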


【25】Guiding Exploration in Reinforcement Learning Through LLM-Augmented Observations
Link: https://arxiv.org/abs/2510.08779

Authors: Vaibhav Jain, Gerrit Grossmann
Note: Accepted to the LM4Plan Workshop @ ICAPS 2025 (withdrawn before presentation due to lack of travel funding)
Abstract: Reinforcement Learning (RL) agents often struggle in sparse-reward environments where traditional exploration strategies fail to discover effective action sequences. Large Language Models (LLMs) possess procedural knowledge and reasoning capabilities from text pretraining that could guide RL exploration, but existing approaches create rigid dependencies where RL policies must follow LLM suggestions or incorporate them directly into reward functions. We propose a framework that provides LLM-generated action recommendations through augmented observation spaces, allowing RL agents to learn when to follow or ignore this guidance. Our method leverages LLMs' world knowledge and reasoning abilities while maintaining flexibility through soft constraints. We evaluate our approach on three BabyAI environments of increasing complexity and show that the benefits of LLM guidance scale with task difficulty. In the most challenging environment, we achieve a 71% relative improvement in final success rates over the baseline. The approach provides substantial sample-efficiency gains, with agents reaching performance thresholds up to 9 times faster, and requires no modifications to existing RL algorithms. Our results demonstrate an effective method for leveraging LLM planning capabilities to accelerate RL training in challenging environments.


【26】Exploring Cross-Client Memorization of Training Data in Large Language Models for Federated Learning
Link: https://arxiv.org/abs/2510.08750

Authors: Tinnakit Udsa, Can Udomcharoenchaikit, Patomporn Payoungkhamdee, Sarana Nutanong, Norrathep Rattanavipanon
Abstract: Federated learning (FL) enables collaborative training without raw data sharing, but still risks training-data memorization. Existing FL memorization detection techniques focus on one sample at a time, underestimating the more subtle risks of cross-sample memorization. In contrast, recent work on centralized learning (CL) has introduced fine-grained methods to assess memorization across all samples in training data, but these assume centralized access to data and cannot be applied directly to FL. We bridge this gap by proposing a framework that quantifies both intra- and inter-client memorization in FL using fine-grained cross-sample memorization measurement across all clients. Based on this framework, we conduct two studies: (1) measuring subtle memorization across clients and (2) examining key factors that influence memorization, including decoding strategies, prefix length, and FL algorithms. Our findings reveal that FL models do memorize client data, particularly intra-client data rather than inter-client data, with memorization influenced by both training and inference factors.


【27】How Reliable is Language Model Micro-Benchmarking?
Link: https://arxiv.org/abs/2510.08730

Authors: Gregory Yauney, Shahzaib Saqib Warraich, Swabha Swayamdipta
Abstract: Micro-benchmarking offers a solution to the often prohibitive time and cost of language model development: evaluate on a very small subset of existing benchmarks. Can these micro-benchmarks, however, rank models as consistently as the full benchmarks they replace? And can they rank models more consistently than selecting a random subset of data points? In many scenarios, we find that the answer is no. We introduce a meta-evaluation measure for micro-benchmarking which investigates how well a micro-benchmark can rank two models as a function of their performance difference on the full benchmark. This approach can determine which model pairs can be ranked correctly by a micro-benchmark, allowing for a finer-grained analysis of the trade-off between micro-benchmark size and reliability. Prior work has suggested selecting as few as 10 examples; we find that no micro-benchmarking method can consistently rank model pairs 3.5 points of accuracy apart on MMLU-Pro or 4 points apart on BIG-bench Hard. In order to consistently rank model pairs with relatively similar performances, we show that often as many as 250 examples must be selected, at which point random sampling is competitive with existing micro-benchmarking methods. When comparing only 8B instruction-tuned models on MMLU-Pro micro-benchmarks with 25 examples, we find that more than half of pairwise comparisons are not likely to be preserved. Our work provides actionable guidance for both micro-benchmark users and developers in navigating the trade-off between evaluation efficiency and reliability.
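The random-sampling baseline the paper compares against is easy to state precisely: estimate how often a random subset of a given size preserves the full-benchmark ordering of two models. Our own minimal sketch, not the paper's meta-evaluation code:

    import numpy as np

    def rank_preservation_rate(correct_a, correct_b, subset_size,
                               trials=1000, seed=0):
        """correct_a, correct_b: per-example 0/1 correctness arrays on the
        full benchmark, with model A the better one overall. Returns the
        fraction of random micro-benchmarks that still rank A above B."""
        rng = np.random.default_rng(seed)
        correct_a, correct_b = np.asarray(correct_a), np.asarray(correct_b)
        preserved = 0
        for _ in range(trials):
            idx = rng.choice(len(correct_a), size=subset_size, replace=False)
            preserved += correct_a[idx].mean() > correct_b[idx].mean()
        return preserved / trials

Sweeping subset_size for model pairs at a fixed accuracy gap reproduces the kind of size-versus-reliability curve the meta-evaluation formalizes.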


【28】Provably Robust Adaptation for Language-Empowered Foundation Models
Link: https://arxiv.org/abs/2510.08659

Authors: Yuni Lai, Xiaoyu Xue, Linghui Shen, Yulun Wu, Gaolei Li, Song Guo, Kai Zhou, Bin Xiao
Note: 19 pages
Abstract: Language-empowered foundation models (LeFMs), such as CLIP and GraphCLIP, have transformed multimodal learning by aligning visual (or graph) features with textual representations, enabling powerful downstream capabilities like few-shot learning. However, the reliance on small, task-specific support datasets collected in open environments exposes these models to poisoning attacks, where adversaries manipulate the support samples to degrade performance. Existing defenses rely on empirical strategies, which lack formal guarantees and remain vulnerable to unseen and adaptive attacks. Certified robustness offers provable guarantees but has been largely unexplored for few-shot classifiers based on LeFMs. This study seeks to fill these critical gaps by proposing the first provably robust few-shot classifier that is tailored for LeFMs. We term our model Language-empowered Few-shot Certification (LeFCert). It integrates both textual and feature embeddings with an adaptive blending mechanism. To achieve provable robustness, we propose a twofold trimmed-mean prototype and derive provable upper and lower bounds for classification scores, enabling certification under worst-case poisoning scenarios. To further enhance performance, we extend LeFCert with two variants by considering a more realistic and tighter attack budget: LeFCert-L incorporates randomized smoothing to provide Lipschitz continuity and derives robustness under dual budget constraints, and LeFCert-C provides collective certification for scenarios where attackers distribute a shared poisoning budget across multiple samples. Experiments demonstrate that LeFCert achieves state-of-the-art performance, significantly improving both clean and certified accuracy compared to existing baselines. Despite its advanced robustness mechanisms, LeFCert is computationally efficient, making it practical for real-world applications.


【29】Energy-Driven Steering: Reducing False Refusals in Large Language Models
Link: https://arxiv.org/abs/2510.08646

作者:Eric Hanchen Jiang, Weixuan Ou, Run Liu, Shengyuan Pang, Guancheng Wan, Ranjie Duan, Wei Dong, Kai-Wei Chang, XiaoFeng Wang, Ying Nian Wu, Xinfeng Li
摘要:大型语言模型(LLM)的安全对齐面临着一个关键挑战:当前的对齐技术往往只关注提高对有害提示的安全性,导致LLM变得过于谨慎,拒绝响应良性提示。因此,安全对齐的一个关键目标是提高安全性,同时减少错误拒绝。在本文中,我们介绍了能源驱动转向(EDS),一种新的,微调免费的框架,旨在通过动态的,推理时间干预来解决这一挑战。我们训练了一个轻量级的外部基于能量的模型(EBM),将高能量分配给不受欢迎的状态(错误拒绝或越狱),将低能量分配给理想的状态(有帮助的响应或安全拒绝)。在推理过程中,EBM将LLM的内部激活映射到“能量景观”。我们使用能量函数的梯度来动态地将LLM的隐藏状态引导到低能量区域,从而在不修改其权重的情况下实时校正模型以生成期望的响应。这种方法将行为控制从模型的核心知识中分离出来,提供了一种灵活的解决方案,具有最小的计算开销。在各种模型上的广泛实验表明,我们的方法成功地实现了这一目标:它大大降低了错误拒绝率。例如,将ORB-H基准的合规性从57.3%提高到82.6%,同时保持基线安全性能。我们的工作提出了一个有效的范例,建设LLM,实现低错误拒绝率和高安全性。
摘要:Safety alignment of large language models (LLMs) faces a key challenge: current alignment techniques often only focus on improving safety against harmful prompts, causing LLMs to become over-cautious and refuse to respond to benign prompts. Therefore, a key objective of safe alignment is to enhance safety while simultaneously reducing false refusals. In this paper, we introduce Energy-Driven Steering (EDS), a novel, fine-tuning free framework designed to resolve this challenge through dynamic, inference-time intervention. We trained a lightweight, external Energy-Based Model (EBM) to assign high energy to undesirable (false refusal or jailbreak) states and low energy to desirable (helpful response or safe reject) ones. During inference, EBM maps the LLM's internal activations to an "energy landscape". We use the gradient of the energy function to dynamically steer the LLM's hidden states to low energy regions, correcting the model to generate a desirable response in real-time without modifying its weights. This method decouples behavioral control from the model's core knowledge, offering a flexible solution with minimal computational overhead. Extensive experiments across a wide range of models show our method successfully achieves this objective: it substantially lowers false refusal rates. For example, raising compliance on the ORB-H benchmark from 57.3% to 82.6% while maintaining the baseline safety performance. Our work presents an effective paradigm for building LLMs that achieve both low false refusal rates and high safety.
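
以下是能量梯度转向这一核心步骤的极简PyTorch示意:用一个小型MLP充当外部EBM,对隐藏状态沿能量梯度做若干步下降。EBM结构、步长 alpha 与步数均为示意性假设;实际系统中 h 应取自LLM的内部激活,EBM需按论文方式先行训练。

import torch
import torch.nn as nn

class EnergyModel(nn.Module):
    # 轻量级外部EBM:把隐藏状态映射为标量能量(具体结构为假设)
    def __init__(self, d_model):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, 128), nn.SiLU(), nn.Linear(128, 1))
    def forward(self, h):
        return self.net(h).squeeze(-1)

@torch.enable_grad()
def energy_steer(h, ebm, alpha=0.1, steps=3):
    # 推理时干预:沿能量梯度做几步下降,把 h 推向低能量(理想行为)区域
    h = h.detach().clone()
    for _ in range(steps):
        h.requires_grad_(True)
        energy = ebm(h).sum()
        grad, = torch.autograd.grad(energy, h)
        h = (h - alpha * grad).detach()
    return h

ebm = EnergyModel(d_model=64)
h = torch.randn(4, 64)           # 假设取自LLM某层的一批隐藏状态
print(energy_steer(h, ebm).shape)  # torch.Size([4, 64])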


【30】Next Semantic Scale Prediction via Hierarchical Diffusion Language Models
标题:基于分层扩散语言模型的下一语义尺度预测
链接:https://arxiv.org/abs/2510.08632

作者:Cai Zhou, Chenyu Wang, Dinghuai Zhang, Shangyuan Tong, Yifei Wang, Stephen Bates, Tommi Jaakkola
备注:Accepted to NeurIPS 2025
摘要:本文介绍了分层扩散语言模型(HDLM),一类用于语言建模的新型离散扩散模型。HDLM建立在一个分层词表之上:具有细粒度语义的低层token被满射地映射到具有粗粒度含义的高层token。在前向过程中,每个token根据调度器独立地被扰动为语义更抽象的高层祖先;而在反向过程中,模型逐步预测下一个更精细的语义。综合来看,HDLM为语言建模提供了一个通用的时变下一语义尺度预测过程。我们推导出扩散证据下界(ELBO)的封闭形式表达式,并表明HDLM可以灵活实现,同时将现有的MDLM作为特例包含在内。我们还基于这些洞见提出了实用的训练技术。大量文本生成实验验证了HDLM的有效性:其验证困惑度和生成困惑度始终低于基线。
摘要:In this paper we introduce Hierarchical Diffusion Language Models (HDLM) -- a novel family of discrete diffusion models for language modeling. HDLM builds on a hierarchical vocabulary where low-level tokens with detailed semantics are surjectively mapped to high-level tokens with coarse-grained meanings. In the forward process, each token is independently perturbed to its higher-level ancestor with more abstract semantics according to the scheduler, while in the reverse process the model progressively predicts the next, more detailed semantics. Taken together, HDLM provides a general time-varying next semantic scale prediction process for language modeling. We derive closed-form expressions for the diffusion Evidence Lower Bound (ELBO), and show that HDLM can be implemented in a flexible manner while including the existing MDLM as a special case. We also propose practical training techniques based on the insights. Extensive text generation experiments validate the effectiveness of HDLM, which demonstrates consistently lower validation and generative perplexity than baselines.
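
前向过程中“token跳到更抽象祖先”的机制可以用几行Python说明。下面的层次词表 PARENT 与扰动概率 t 均为虚构示例,仅演示满射映射下的独立扰动,不涉及论文的调度器与反向模型。

import random

# 假设的三层词表:叶子 token 映射到上一层更抽象的祖先
PARENT = {"cat": "<animal>", "dog": "<animal>", "run": "<action>",
          "<animal>": "<noun>", "<action>": "<verb>"}

def forward_perturb(tokens, t):
    # 前向过程的一步示意:每个 token 以概率 t 独立跳到其更高层祖先;
    # 反向过程则训练模型预测下一个更细粒度的语义(此处未实现)
    return [PARENT.get(tok, tok) if random.random() < t else tok for tok in tokens]

random.seed(0)
print(forward_perturb(["cat", "run", "dog"], t=0.7))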


【31】PARSE: LLM Driven Schema Optimization for Reliable Entity Extraction
标题:PARSE:LLM驱动的模式优化,用于可靠的实体提取
链接:https://arxiv.org/abs/2510.08623

作者:Anubhav Shrimal, Aryan Jain, Soumyajit Chowdhury, Promod Yenigalla
备注:EMNLP 2025 Industry Track
摘要:从非结构化文本中提取结构化信息对于新兴的软件3.0系统至关重要,在这类系统中,LLM代理自主地与API和工具交互。最近的方法使用现有的JSON模式将大型语言模型直接应用于提取任务,通常借助约束解码或强化学习方法来确保语法有效性,但它们将JSON模式视为为人类开发者设计的静态契约,导致提取性能欠佳、幻觉频发,并且当模式包含模糊或不完整的规范时,代理行为不可靠。我们认识到,JSON模式本身就是一种自然语言理解契约,它编码了关于数据结构的规则、关系和期望,而LLM应当既能解释它们,又能系统性地改进它们。因此,我们开发了PARSE(参数自动细化与模式提取),一个包含两个协同组件的新系统:ARCHITECT自主优化JSON模式以供LLM使用,并通过RELAY(一个集成的代码生成系统)保持向后兼容;SCOPE则实现结合静态护栏与基于LLM护栏的反思式提取。我们在三个数据集(模式引导对话SGD、结构化Web数据提取SWDE以及内部零售对话数据)上对PARSE进行了定性与定量评估,发现它在SWDE上将提取准确率提升了最多64.7%,综合框架改进在各模型上达到10%,同时在首次重试内将提取错误减少92%,并保持了实用的延迟。
摘要:Structured information extraction from unstructured text is critical for emerging Software 3.0 systems where LLM agents autonomously interact with APIs and tools. Recent approaches apply large language models directly to extraction tasks using existing JSON schemas, often with constraint decoding or reinforcement learning approaches to ensure syntactic validity, but treat JSON schemas as static contracts designed for human developers, leading to suboptimal extraction performance, frequent hallucinations, and unreliable agent behavior when schemas contain ambiguous or incomplete specifications. We recognize that JSON schemas themselves are a form of natural language understanding contract that encodes rules, relationships, and expectations about data structure contracts that LLMs should be able to both interpret and systematically improve. Consequently, we develop PARSE (Parameter Automated Refinement and Schema Extraction), a novel system with two synergistic components: ARCHITECT, which autonomously optimizes JSON schemas for LLM consumption while maintaining backward compatibility through RELAY (an integrated code generation system), and SCOPE, which implements reflection-based extraction with combined static and LLM-based guardrails. We evaluate PARSE qualitatively and quantitatively on three datasets including Schema-Guided Dialogue (SGD), Structured Web Data Extraction (SWDE), and internal retail conversation data, and find that it achieves up to 64.7% improvement in extraction accuracy on SWDE with combined framework improvements reaching 10% across models, while reducing extraction errors by 92% within the first retry and maintaining practical latency.


【32】Relative Positioning Based Code Chunking Method For Rich Context Retrieval In Repository Level Code Completion Task With Code Language Model
标题:基于相对定位的代码分块方法在具有代码语言模型的存储库级代码完成任务中用于丰富上下文检索
链接:https://arxiv.org/abs/2510.08610

作者:Imranur Rahman, Md Rayhanur Rahman
备注:Accepted to Context Collection Workshop co-located with ASE 2025
摘要:代码完成可以帮助开发人员提高效率并简化开发生命周期。虽然现代集成开发环境(IDE)中已提供代码完成功能,但现有研究尚未弄清:基于IDE可获得的信息,什么样的上下文才能让大型语言模型(LLM)更好地完成代码完成任务。在本文中,我们描述了一种有效的上下文收集策略,以帮助LLM在代码完成任务中表现更好。我们策略的核心思想是先将存储库预处理成较小的代码块,随后结合相对定位,使用基于语法和语义相似性的代码块检索。我们发现,代码分块以及代码块在最终上下文中的相对定位提高了代码完成任务的性能。
摘要:Code completion can help developers improve efficiency and ease the development lifecycle. Although code completion is available in modern integrated development environments (IDEs), research lacks in determining what makes a good context for code completion based on the information available to the IDEs for the large language models (LLMs) to perform better. In this paper, we describe an effective context collection strategy to assist the LLMs in performing better at code completion tasks. The key idea of our strategy is to preprocess the repository into smaller code chunks and later use syntactic and semantic similarity-based code chunk retrieval with relative positioning. We found that code chunking and relative positioning of the chunks in the final context improve the performance of code completion tasks.
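
下面用标准库给出该策略的极简示意:先按固定行数切块并记录原始位置,检索时按相似度选块、再按块在文件中的相对位置重排拼接。此处用 difflib 的字符串相似度代替论文中基于语法/语义的相似度,块大小与 top-k 均为假设值。

from difflib import SequenceMatcher

def chunk_code(source, max_lines=8):
    # 把文件切成固定行数的小代码块,并记录每块的起始行号
    lines = source.splitlines()
    return [(i, "\n".join(lines[i:i + max_lines]))
            for i in range(0, len(lines), max_lines)]

def retrieve_context(query, chunks, top_k=3):
    # 按相似度选 top-k 块(真实系统中可换成语法/语义嵌入相似度),
    # 再按块的原始相对位置排序拼接,保留代码的自然顺序
    sim = lambda a, b: SequenceMatcher(None, a, b).ratio()
    scored = sorted(chunks, key=lambda c: sim(query, c[1]), reverse=True)[:top_k]
    return "\n".join(text for _, text in sorted(scored, key=lambda c: c[0]))

repo_file = "\n".join(f"def f{i}(): return {i}" for i in range(40))
print(retrieve_context("def f7(): return", chunk_code(repo_file)))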


【33】Toward a Safer Web: Multilingual Multi-Agent LLMs for Mitigating Adversarial Misinformation Attacks
标题:迈向更安全的网络:用于缓解对抗性错误信息攻击的多语言多智能体LLM
链接:https://arxiv.org/abs/2510.08605

作者:Nouar Aldahoul, Yasir Zaki
摘要:数字平台上错误信息的快速传播威胁着公共话语、情绪稳定和决策。虽然先前的工作已经探索了错误信息检测中的各种对抗性攻击,但本文中研究的具体转换尚未进行系统研究。特别是,我们调查了英语,法语,西班牙语,阿拉伯语,印地语和中文之间的语言切换,然后进行翻译。我们还研究了查询长度膨胀之前的总结和结构重新格式化为多项选择题。在本文中,我们提出了一个多语言,多代理大型语言模型框架检索增强生成,可以部署为网络插件到在线平台。我们的工作强调了人工智能驱动的错误信息检测在保护在线事实完整性免受各种攻击方面的重要性,同时展示了基于插件的部署在现实世界的Web应用程序中的可行性。
摘要:The rapid spread of misinformation on digital platforms threatens public discourse, emotional stability, and decision-making. While prior work has explored various adversarial attacks in misinformation detection, the specific transformations examined in this paper have not been systematically studied. In particular, we investigate language-switching across English, French, Spanish, Arabic, Hindi, and Chinese, followed by translation. We also study query length inflation preceding summarization and structural reformatting into multiple-choice questions. In this paper, we present a multilingual, multi-agent large language model framework with retrieval-augmented generation that can be deployed as a web plugin into online platforms. Our work underscores the importance of AI-driven misinformation detection in safeguarding online factual integrity against diverse attacks, while showcasing the feasibility of plugin-based deployment for real-world web applications.


【34】LatentBreak: Jailbreaking Large Language Models through Latent Space Feedback
标题:LatentBreak:通过潜在空间反馈越狱大型语言模型
链接:https://arxiv.org/abs/2510.08604

作者:Raffaele Mura, Giorgio Piras, Kamilė Lukošiūtė, Maura Pintor, Amin Karbasi, Battista Biggio
摘要:越狱是旨在绕过大型语言模型内置安全机制的对抗性攻击。自动越狱通常通过强制模型生成受限或有害响应的开头部分,来优化对抗性后缀或改编长提示模板。在这项工作中,我们表明,利用这种机制来解锁模型响应的现有越狱攻击,可以通过对输入提示进行简单的基于困惑度的过滤来检测。为了克服这一问题,我们提出了LatentBreak,一种白盒越狱攻击,它生成困惑度低的自然对抗性提示,从而能够规避此类防御。LatentBreak不是添加高困惑度的对抗性后缀或长模板,而是将输入提示中的单词替换为语义等价的词,保留提示的初始意图。这些词的选择方式是最小化对抗性提示的表示与无害请求的表示在潜空间中的距离。我们的广泛评估表明,LatentBreak能产生更短且低困惑度的提示,因此在多个安全对齐的模型上,面对基于困惑度的过滤器时优于其他竞争越狱算法。
摘要 :Jailbreaks are adversarial attacks designed to bypass the built-in safety mechanisms of large language models. Automated jailbreaks typically optimize an adversarial suffix or adapt long prompt templates by forcing the model to generate the initial part of a restricted or harmful response. In this work, we show that existing jailbreak attacks that leverage such mechanisms to unlock the model response can be detected by a straightforward perplexity-based filtering on the input prompt. To overcome this issue, we propose LatentBreak, a white-box jailbreak attack that generates natural adversarial prompts with low perplexity capable of evading such defenses. LatentBreak substitutes words in the input prompt with semantically-equivalent ones, preserving the initial intent of the prompt, instead of adding high-perplexity adversarial suffixes or long templates. These words are chosen by minimizing the distance in the latent space between the representation of the adversarial prompt and that of harmless requests. Our extensive evaluation shows that LatentBreak leads to shorter and low-perplexity prompts, thus outperforming competing jailbreak algorithms against perplexity-based filters on multiple safety-aligned models.
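
其核心的“逐词同义替换 + 潜空间距离最小化”循环可以用如下Python草图说明。注意 embed 是一个占位编码器(基于词哈希的随机向量),同义词表与无害请求中心也是虚构示例;真实攻击需要访问目标LLM的潜表示(白盒设定)。

import numpy as np

def embed(text):
    # 占位的潜空间编码器:用词哈希生成稳定随机向量并取均值;
    # 实际中应替换为目标 LLM 的隐藏层表示
    vecs = [np.random.default_rng(abs(hash(w)) % 2**32).normal(size=16)
            for w in text.split()]
    return np.mean(vecs, axis=0)

def latent_substitute(prompt, synonyms, harmless_centroid):
    # 贪心地逐词尝试同义替换,保留使提示表示最接近无害中心的版本
    words = prompt.split()
    for i, w in enumerate(words):
        best, best_d = w, np.linalg.norm(embed(" ".join(words)) - harmless_centroid)
        for cand in synonyms.get(w, []):
            trial = words[:i] + [cand] + words[i + 1:]
            d = np.linalg.norm(embed(" ".join(trial)) - harmless_centroid)
            if d < best_d:
                best, best_d = cand, d
        words[i] = best
    return " ".join(words)

centroid = embed("please kindly help me with a question")
syn = {"tell": ["explain", "describe"], "make": ["create", "prepare"]}
print(latent_substitute("tell me how to make a cake", syn, centroid))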


【35】Human Texts Are Outliers: Detecting LLM-generated Texts via Out-of-distribution Detection
标题:人类文本是离群值:通过分布外检测识别LLM生成的文本
链接:https://arxiv.org/abs/2510.08602

作者:Cong Zeng, Shengkun Tang, Yuanzhou Chen, Zhiqiang Shen, Wenchao Yu, Xujiang Zhao, Haifeng Chen, Wei Cheng, Zhiqiang Xu
摘要:ChatGPT、DeepSeek和Claude等大型语言模型(LLM)的快速发展显著增加了数字通信中AI生成文本的比例。这一趋势提高了对可靠检测方法的需求,以区分人类创作与机器生成的内容。现有方法(包括zero-shot方法和有监督分类器)大多将此任务概念化为二分类问题,往往导致跨领域和跨模型的泛化能力差。在本文中,我们认为这种二分类的表述假设人类书写文本存在一个连贯的表示,从根本上误刻画了检测任务。现实中,人类文本并不构成统一的分布,其多样性无法通过有限采样有效捕捉。这导致以往的分类器记住观察到的OOD特征,而非学习“非ID”行为的本质,限制了对未见过的人类创作输入的泛化。基于这一观察,我们提出将检测任务重新表述为分布外(OOD)检测问题:将人类书写的文本视为分布外离群值,而机器生成的文本则是分布内(ID)样本。为此,我们开发了一个检测框架,采用单类学习方法(包括DeepSVDD和HRN)以及基于分数的学习技术(如基于能量的方法),实现了鲁棒且可泛化的性能。在多个数据集上的大量实验验证了基于OOD方法的有效性。具体而言,该方法在DeepFake数据集上实现了98.3%的AUROC和AUPR,而FPR95仅为8.9%。此外,我们在多语言、受攻击、未见模型与未见领域的文本设置下测试了检测框架,展示了其鲁棒性和可泛化性。代码、预训练权重和演示将会发布。
摘要:The rapid advancement of large language models (LLMs) such as ChatGPT, DeepSeek, and Claude has significantly increased the presence of AI-generated text in digital communication. This trend has heightened the need for reliable detection methods to distinguish between human-authored and machine-generated content. Existing approaches both zero-shot methods and supervised classifiers largely conceptualize this task as a binary classification problem, often leading to poor generalization across domains and models. In this paper, we argue that such a binary formulation fundamentally mischaracterizes the detection task by assuming a coherent representation of human-written texts. In reality, human texts do not constitute a unified distribution, and their diversity cannot be effectively captured through limited sampling. This causes previous classifiers to memorize observed OOD characteristics rather than learn the essence of `non-ID' behavior, limiting generalization to unseen human-authored inputs. Based on this observation, we propose reframing the detection task as an out-of-distribution (OOD) detection problem, treating human-written texts as distributional outliers while machine-generated texts are in-distribution (ID) samples. To this end, we develop a detection framework using one-class learning method including DeepSVDD and HRN, and score-based learning techniques such as energy-based method, enabling robust and generalizable performance. Extensive experiments across multiple datasets validate the effectiveness of our OOD-based approach. Specifically, the OOD-based method achieves 98.3% AUROC and AUPR with only 8.9% FPR95 on DeepFake dataset. Moreover, we test our detection framework on multilingual, attacked, and unseen-model and -domain text settings, demonstrating the robustness and generalizability of our framework. Code, pretrained weights, and demo will be released.
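
文中提到的基于能量的打分可以写成几行代码:能量越高越像分布外(人类撰写)。下面的示意中,logits 假设来自检测器的输出头,阈值需要在验证集上(例如按 FPR95)标定。

import torch

def energy_score(logits, temperature=1.0):
    # 基于能量的 OOD 分数:E(x) = -T * logsumexp(logits / T);
    # 机器生成文本(ID)能量应更低,人类文本(OOD 离群值)能量更高
    return -temperature * torch.logsumexp(logits / temperature, dim=-1)

def is_human_written(logits, threshold):
    # 能量高于阈值则判为分布外(人类撰写)
    return energy_score(logits) > threshold

logits = torch.randn(4, 2)  # 假设的检测器输出
print(energy_score(logits))
print(is_human_written(logits, threshold=-0.5))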


【36】Mnemosyne: An Unsupervised, Human-Inspired Long-Term Memory Architecture for Edge-Based LLMs
标题:Mnemosyne:一种用于基于边缘的LLM的无监督、受人类启发的长期记忆架构
链接:https://arxiv.org/abs/2510.08601

作者:Aneesh Jonelagadda, Christina Hahn, Haoze Zheng, Salvatore Penachio (Kaliber AI)
备注:12 pages, 4 figures
摘要:长期记忆对于自然、逼真的对话至关重要。然而,目前的大型语言模型(LLM)记忆系统要么依赖暴力式的上下文扩展,要么依赖在边缘受限设备上失效的静态检索管道。我们介绍Mnemosyne,一种为基于边缘的LLM设计的、受人类启发的无监督长期记忆架构。我们的方法使用图结构存储、模块化的实质性与冗余过滤器、记忆提交与修剪机制,以及模仿人类记忆的带时间衰减和刷新过程的概率性召回。Mnemosyne还引入了一个浓缩的“核心摘要”,从记忆图的固定长度子集中高效导出,用于捕捉用户的个性以及其他领域特定的长期细节,例如(以医疗保健应用为例)康复后的目标和对护理的态度。与现有的检索增强方法不同,Mnemosyne面向纵向医疗助理设计,在这类场景中,重复且语义相似但时间上不同的对话使朴素检索受限。在纵向医疗保健对话实验中,对真实感和长期记忆能力的盲测人工评估显示,Mnemosyne取得了65.8%的最高胜率,而基线RAG的胜率为31.1%。与采用相同主干的其他技术相比,Mnemosyne还在时间推理和单跳检索上取得了目前最高的LoCoMo基准分数。此外,其54.6%的平均总分在所有方法中位列第二,击败了常用的Mem0和OpenAI基线等。这表明,借助一种边缘兼容、易于迁移的无监督记忆架构,改进的事实回忆、增强的时间推理以及更自然的面向用户的响应是可行的。
摘要:Long-term memory is essential for natural, realistic dialogue. However, current large language model (LLM) memory systems rely on either brute-force context expansion or static retrieval pipelines that fail on edge-constrained devices. We introduce Mnemosyne, an unsupervised, human-inspired long-term memory architecture designed for edge-based LLMs. Our approach uses graph-structured storage, modular substance and redundancy filters, memory committing and pruning mechanisms, and probabilistic recall with temporal decay and refresh processes modeled after human memory. Mnemosyne also introduces a concentrated "core summary" efficiently derived from a fixed-length subset of the memory graph to capture the user's personality and other domain-specific long-term details such as, using healthcare application as an example, post-recovery ambitions and attitude towards care. Unlike existing retrieval-augmented methods, Mnemosyne is designed for use in longitudinal healthcare assistants, where repetitive and semantically similar but temporally distinct conversations are limited by naive retrieval. In experiments with longitudinal healthcare dialogues, Mnemosyne demonstrates the highest win rate of 65.8% in blind human evaluations of realism and long-term memory capability compared to a baseline RAG win rate of 31.1%. Mnemosyne also achieves current highest LoCoMo benchmark scores in temporal reasoning and single-hop retrieval compared to other same-backboned techniques. Further, the average overall score of 54.6% was second highest across all methods, beating commonly used Mem0 and OpenAI baselines among others. This demonstrates that improved factual recall, enhanced temporal reasoning, and much more natural user-facing responses can be feasible with an edge-compatible and easily transferable unsupervised memory architecture.
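
带时间衰减与刷新的概率性召回可以用一个小类来示意。半衰期、强度增量等参数均为假设值,仅演示“随时间指数遗忘、成功召回后刷新”的机制,不代表Mnemosyne的完整图结构存储。

import math, random, time

class MemoryNode:
    def __init__(self, content, strength=1.0):
        self.content, self.strength, self.t_last = content, strength, time.time()

    def recall_prob(self, half_life=3600.0):
        # 召回概率随时间指数衰减(模仿遗忘曲线);half_life 为假设的半衰期(秒)
        dt = time.time() - self.t_last
        return self.strength * math.exp(-math.log(2) * dt / half_life)

    def refresh(self, boost=0.1):
        # 成功召回会重置时间戳并小幅增强记忆强度
        self.t_last = time.time()
        self.strength = min(1.0, self.strength + boost)

random.seed(0)
node = MemoryNode("患者康复后希望重新开始晨跑")
if random.random() < node.recall_prob():
    node.refresh()
    print("召回:", node.content)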


【37】Recover-LoRA: Data-Free Accuracy Recovery of Degraded Language Models via Low-Rank Adaptation
标题:Recover-LoRA:通过低秩自适应对降级语言模型进行无数据准确率恢复
链接:https://arxiv.org/abs/2510.08600

作者:Devleena Das, Rajeev Patwari, Ashish Sirasao
备注:Accepted to EMNLP 2025 Industry Track
摘要:量化、剪枝、格式与数据类型转换、模型导出和序列化等推理优化,可能导致语言模型任务性能的功能性退化。虽然大多数面向部署的性能恢复工作都集中在鲁棒量化技术上,但我们关注从任何使模型权重退化的来源(如不正确的模型序列化)中恢复模型准确率。在这项工作中,我们提出了Recover-LoRA,一种轻量级且与数据集无关的降级模型准确率恢复方法。Recover-LoRA使用合成数据和logit蒸馏,在选定的层上学习LoRA适配器,以便将降级模型与其全精度模型对齐。我们在多种小语言模型(SLM)上考察了Recover-LoRA的实用性,包括具有不同注意力架构(多头注意力MHA和分组查询注意力GQA)的模型,以及多个评估数据集。我们的结果表明,Recover-LoRA在MHA和GQA SLM上将模型准确率恢复了5-17%。
摘要 :Inference optimizations such as quantization, pruning, format and datatype conversion, model export, and serialization can lead to functional degradations in language model task performance. While most efforts on performance recovery for deployment focus on robust quantization techniques, we focus on recovering model accuracies from any sources that degrade model weights, such as improper model serialization. In this work, we propose Recover-LoRA, a lightweight and dataset agnostic method to recover accuracy in degraded models. Recover-LoRA uses synthetic data and logit distillation to learn LoRA adapters on selective layers that facilitate aligning the degraded model to its full precision model. We investigate the utility of Recover-LoRA across a diverse set of small language models (SLMs), including models with varying attention architectures, multi-head attention (MHA) and group-query attention (GQA), as well as several evaluation datasets. Our results show that Recover-LoRA recovers model accuracies by 5-17% on MHA and GQA SLMs.
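
其中的logit蒸馏目标可以用标准的KL散度示意(温度 T 为假设的超参数);实际使用时,student为注入LoRA适配器的降级模型,teacher为全精度模型,且仅LoRA参数参与更新。

import torch
import torch.nn.functional as F

def logit_distill_loss(student_logits, teacher_logits, T=2.0):
    # logit 蒸馏:让降级模型对齐全精度教师的输出分布
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * T * T

# 示例:合成数据上的一步对齐;student_logits 代表带 LoRA 的降级模型前向输出
teacher_logits = torch.randn(8, 32000)
student_logits = torch.randn(8, 32000, requires_grad=True)
loss = logit_distill_loss(student_logits, teacher_logits)
loss.backward()
print(float(loss))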


【38】Less Diverse, Less Safe: The Indirect But Pervasive Risk of Test-Time Scaling in Large Language Models
标题:多样性越低,安全性越差:大型语言模型中测试时扩展的间接但普遍的风险
链接:https://arxiv.org/abs/2510.08592

作者:Shahriar Kabir Nahin, Hadi Askari, Muhao Chen, Anshuman Chhabra
摘要:测试时扩展(TTS)通过探索多个候选响应,然后在该集合上操作以找到最佳输出,从而改进LLM推理。TTS背后的一个默认前提是,足够多样的候选池会增强可靠性。在这项工作中,我们表明TTS中的这一假设引入了一种此前未被识别的故障模式:当候选多样性被削减时,哪怕幅度不大,TTS产生不安全输出的可能性也会大大增加。我们提出了一种参考引导的多样性削减协议(RefDiv),作为压力测试TTS管道的诊断性攻击。通过在四个开源模型(Qwen3、Mistral、Llama3.1、Gemma3)和两种广泛使用的TTS策略(蒙特卡洛树搜索和Best-of-N)上的大量实验,我们发现限制多样性会持续推高TTS产生不安全结果的比率,其效果往往比直接使用高对抗意图分数的提示更强。这一现象还可以跨TTS策略迁移,并迁移到闭源模型(如OpenAI o3和Gemini-2.5-Pro),表明这是TTS的一种普遍存在的性质,而非特定模型的产物。此外,我们发现许多广泛使用的安全护栏分类器(如Llama-Guard和OpenAI Moderation API)无法标记RefDiv生成的对抗性输入提示,表明现有防御对这种多样性驱动的故障模式仅能提供有限的保护。通过这项工作,我们希望推动未来的研究设计出既有效又能抵御RefDiv所示多样性定向压力测试的鲁棒TTS策略。
摘要:Test-Time Scaling (TTS) improves LLM reasoning by exploring multiple candidate responses and then operating over this set to find the best output. A tacit premise behind TTS is that sufficiently diverse candidate pools enhance reliability. In this work, we show that this assumption in TTS introduces a previously unrecognized failure mode. When candidate diversity is curtailed, even by a modest amount, TTS becomes much more likely to produce unsafe outputs. We present a reference-guided diversity reduction protocol (RefDiv) that serves as a diagnostic attack to stress test TTS pipelines. Through extensive experiments across four open-source models (Qwen3, Mistral, Llama3.1, Gemma3) and two widely used TTS strategies (Monte Carlo Tree Search and Best-of-N), constraining diversity consistently signifies the rate at which TTS produces unsafe results. The effect is often stronger than that produced by prompts directly with high adversarial intent scores. This observed phenomenon also transfers across TTS strategies and to closed-source models (e.g. OpenAI o3 and Gemini-2.5-Pro), thus indicating that this is a general and extant property of TTS rather than a model-specific artifact. Additionally, we find that numerous widely used safety guardrail classifiers (e.g. Llama-Guard and OpenAI Moderation API), are unable to flag the adversarial input prompts generated by RefDiv, demonstrating that existing defenses offer limited protection against this diversity-driven failure mode. Through this work, we hope to motivate future research on designing robust TTS strategies that are both effective and secure against diversity-targeted stress tests as illustrated by RefDiv.


【39】A Design-based Solution for Causal Inference with Text: Can a Language Model Be Too Large?
标题:基于设计的文本因果推理解决方案:语言模型会太大吗?
链接:https://arxiv.org/abs/2510.08758

作者:Graham Tierney, Srikar Katta, Christopher Bail, Sunshine Hillygus, Alexander Volfovsky
摘要:许多社会科学问题都在问语言特性是如何影响受众的态度和行为的。由于文本属性通常是相互链接的(例如,愤怒的评论使用亵渎的语言),我们必须控制可能的潜在混淆以隔离因果效应。最近的文献提出采用大型语言模型(LLM)来学习文本的潜在表示,成功地预测治疗和结果。然而,由于治疗是文本的一个组成部分,这些深度学习方法有可能学习实际编码治疗本身的表示,从而导致重叠偏差。而不是依赖于事后调整,我们引入了一个新的实验设计,处理潜在的混淆,避免重叠的问题,并无偏估计治疗效果。我们应用这个设计在一个实验中,评估在政治沟通中表达谦卑的说服力。从方法论上讲,我们证明了基于LLM的方法比使用我们的真实文本和实验结果的简单词袋模型表现更差。实质上,我们分离出表达谦卑对政治声明的说服力的因果影响,为社交媒体平台,政策制定者和社会科学家提供了关于传播效果的新见解。
摘要:Many social science questions ask how linguistic properties causally affect an audience's attitudes and behaviors. Because text properties are often interlinked (e.g., angry reviews use profane language), we must control for possible latent confounding to isolate causal effects. Recent literature proposes adapting large language models (LLMs) to learn latent representations of text that successfully predict both treatment and the outcome. However, because the treatment is a component of the text, these deep learning methods risk learning representations that actually encode the treatment itself, inducing overlap bias. Rather than depending on post-hoc adjustments, we introduce a new experimental design that handles latent confounding, avoids the overlap issue, and unbiasedly estimates treatment effects. We apply this design in an experiment evaluating the persuasiveness of expressing humility in political communication. Methodologically, we demonstrate that LLM-based methods perform worse than even simple bag-of-words models using our real text and outcomes from our experiment. Substantively, we isolate the causal effect of expressing humility on the perceived persuasiveness of political statements, offering new insights on communication effects for social media platforms, policy makers, and social scientists.


Graph相关(图学习|图神经网络|图优化等)(8篇)

【1】What Do Temporal Graph Learning Models Learn?
标题:时态图学习模型学习什么?
链接:https://arxiv.org/abs/2510.09416

作者:Abigail J. Hayes, Tobias Schumacher, Markus Strohmaier
摘要:时序图上的学习已成为图表示学习的中心话题,许多基准测试表明了最先进模型的强大性能。然而,最近的工作引起了人们对基准测试结果可靠性的关注,指出了常用评估协议的问题以及简单启发式方法令人惊讶的竞争力。这种反差提出了一个问题:时序图学习模型究竟利用底层图的哪些属性来形成预测。我们通过系统评估七种模型捕获与时序图链接结构相关的八个基本属性的能力来回答这一问题。这些属性包括密度等结构特征、近因性等时间模式,以及同配性等边形成机制。借助合成和真实世界的数据集,我们分析了模型学习这些属性的程度。我们的研究结果喜忧参半:模型很好地捕捉了一些属性,但无法复现其他属性。由此,我们揭示了重要的局限性。总体而言,我们相信这些结果为时序图学习模型的应用提供了实用的见解,并推动时序图学习研究中更多由可解释性驱动的评估。
摘要:Learning on temporal graphs has become a central topic in graph representation learning, with numerous benchmarks indicating the strong performance of state-of-the-art models. However, recent work has raised concerns about the reliability of benchmark results, noting issues with commonly used evaluation protocols and the surprising competitiveness of simple heuristics. This contrast raises the question of which properties of the underlying graphs temporal graph learning models actually use to form their predictions. We address this by systematically evaluating seven models on their ability to capture eight fundamental attributes related to the link structure of temporal graphs. These include structural characteristics such as density, temporal patterns such as recency, and edge formation mechanisms such as homophily. Using both synthetic and real-world datasets, we analyze how well models learn these attributes. Our findings reveal a mixed picture: models capture some attributes well but fail to reproduce others. With this, we expose important limitations. Overall, we believe that our results provide practical insights for the application of temporal graph learning models, and motivate more interpretability-driven evaluations in temporal graph learning research.


【2】Verifying Chain-of-Thought Reasoning via Its Computational Graph
标题:通过计算图验证思维链推理
链接:https://arxiv.org/abs/2510.09312

作者:Zheng Zhao, Yeskendir Koishekenov, Xianjun Yang, Naila Murray, Nicola Cancedda
摘要:当前的思维链(CoT)验证方法基于输出(黑盒)或激活(灰盒)预测推理的正确性,但对计算为何失败提供的洞见有限。我们引入一种白盒方法:基于回路的推理验证(CRV)。我们假设,正确CoT步骤的归因图(可视为模型潜在推理回路的执行轨迹)与错误步骤的归因图具有不同的结构指纹。通过在这些图的结构特征上训练分类器,我们表明这些轨迹包含推理错误的强信号。我们的白盒方法产生了其他方法无法获得的新科学见解。(1)我们证明错误的结构签名具有高度预测性,确立了直接通过计算图验证推理的可行性。(2)我们发现这些签名高度领域特定,揭示了不同推理任务的失败表现为不同的计算模式。(3)我们提供证据表明这些签名不仅仅是相关性的:通过利用我们的分析指导对单个transcoder特征的定向干预,我们成功纠正了模型的错误推理。我们的工作表明,通过仔细审视模型的计算过程,我们可以从简单的错误检测走向对LLM推理更深层的因果理解。
摘要:Current Chain-of-Thought (CoT) verification methods predict reasoning correctness based on outputs (black-box) or activations (gray-box), but offer limited insight into why a computation fails. We introduce a white-box method: Circuit-based Reasoning Verification (CRV). We hypothesize that attribution graphs of correct CoT steps, viewed as execution traces of the model's latent reasoning circuits, possess distinct structural fingerprints from those of incorrect steps. By training a classifier on structural features of these graphs, we show that these traces contain a powerful signal of reasoning errors. Our white-box approach yields novel scientific insights unattainable by other methods. (1) We demonstrate that structural signatures of error are highly predictive, establishing the viability of verifying reasoning directly via its computational graph. (2) We find these signatures to be highly domain-specific, revealing that failures in different reasoning tasks manifest as distinct computational patterns. (3) We provide evidence that these signatures are not merely correlational; by using our analysis to guide targeted interventions on individual transcoder features, we successfully correct the model's faulty reasoning. Our work shows that, by scrutinizing a model's computational process, we can move from simple error detection to a deeper, causal understanding of LLM reasoning.


【3】Agentic-KGR: Co-evolutionary Knowledge Graph Construction through Multi-Agent Reinforcement Learning
标题:Agentic-KGR:通过多智能体强化学习构建协同进化知识图
链接:https://arxiv.org/abs/2510.09156

作者:Jing Li, Zhijie Sun, Zhicheng Zhou, Suming Qiu, Junjie Huang, Haijia Sun, Linyuan Qiu
摘要:当前的知识增强型大型语言模型(LLM)依赖于静态的、预先构建的知识库,这些知识库存在覆盖缺口和时效过期的问题,限制了它们在动态信息环境中的有效性。我们提出Agentic-KGR,一个通过多轮强化学习(RL)实现LLM与知识图(KG)协同进化的新框架。我们的方法引入了三个关键创新:(1)动态模式扩展机制,在训练过程中系统性地将图本体扩展到预定义边界之外;(2)检索增强记忆系统,通过持续优化实现模型参数与知识结构之间的协同进化;(3)可学习的多尺度提示压缩方法,通过自适应序列优化在保留关键信息的同时降低计算复杂度。实验结果表明,在知识提取任务中,我们的方法相对于有监督基线和单轮RL方法有大幅改进。与GraphRAG集成后,我们的方法在下游QA任务中取得了卓越的性能,与现有方法相比,在准确率和知识覆盖率方面均有显著提升。
摘要:Current knowledge-enhanced large language models (LLMs) rely on static, pre-constructed knowledge bases that suffer from coverage gaps and temporal obsolescence, limiting their effectiveness in dynamic information environments. We present Agentic-KGR, a novel framework enabling co-evolution between LLMs and knowledge graphs (KGs) through multi-round reinforcement learning (RL). Our approach introduces three key innovations: (1) a dynamic schema expansion mechanism that systematically extends graph ontologies beyond pre-defined boundaries during training; (2) a retrieval-augmented memory system enabling synergistic co-evolution between model parameters and knowledge structures through continuous optimization; (3) a learnable multi-scale prompt compression approach that preserves critical information while reducing computational complexity through adaptive sequence optimization. Experimental results demonstrate substantial improvements over supervised baselines and single-round RL approaches in knowledge extraction tasks. When integrated with GraphRAG, our method achieves superior performance in downstream QA tasks, with significant gains in both accuracy and knowledge coverage compared to existing methods.


【4】Spatio-Temporal Graph Convolutional Networks for EV Charging Demand Forecasting Using Real-World Multi-Modal Data Integration
标题:使用现实世界多模式数据集成进行电动汽车充电需求预测的时空图卷积网络
链接:https://arxiv.org/abs/2510.09048

作者:Jose Tupayachi, Mustafa C. Camur, Kevin Heaslip, Xueping Li
摘要:交通运输仍然是温室气体排放的主要贡献者,凸显了向电动汽车(EV)等可持续替代品过渡的紧迫性。然而,充电基础设施的空间分布不均匀和不规则使用给电网稳定性和投资规划带来了挑战。本研究介绍了TW-GCN,这是一个时空预测框架,将图卷积网络与时间架构相结合,以预测美国田纳西州的电动汽车充电需求。我们利用美国最大的电动汽车基础设施公司之一提供的真实交通流量、天气状况和专有数据来捕获空间依赖性和时间动态。在不同的滞后范围,聚类策略和序列长度上进行的广泛实验表明,中期(3小时)预测在响应性和稳定性之间实现了最佳平衡,1DCNN始终优于其他时间模型。区域分析显示,田纳西州东部、中部和西部的预测准确性存在差异,反映了车站密度、人口和当地需求变化如何影响模型性能。拟议的TW-GCN框架将数据驱动的智能集成到电动汽车基础设施规划中,支持可持续的移动过渡和弹性电网管理。
摘要:Transportation remains a major contributor to greenhouse gas emissions, highlighting the urgency of transitioning toward sustainable alternatives such as electric vehicles (EVs). Yet, uneven spatial distribution and irregular utilization of charging infrastructure create challenges for both power grid stability and investment planning. This study introduces TW-GCN, a spatio-temporal forecasting framework that combines Graph Convolutional Networks with temporal architectures to predict EV charging demand in Tennessee, United States (U.S.). We utilize real-world traffic flows, weather conditions, and proprietary data provided by one of the largest EV infrastructure company in the U.S. to capture both spatial dependencies and temporal dynamics. Extensive experiments across varying lag horizons, clustering strategies, and sequence lengths reveal that mid-horizon (3-hour) forecasts achieve the best balance between responsiveness and stability, with 1DCNN consistently outperforming other temporal models. Regional analysis shows disparities in predictive accuracy across East, Middle, and West Tennessee, reflecting how station density, population, and local demand variability shape model performance. The proposed TW-GCN framework advances the integration of data-driven intelligence into EV infrastructure planning, supporting both sustainable mobility transitions and resilient grid management.


【5】Graph Diffusion Transformers are In-Context Molecular Designers
标题:图扩散Transformer是上下文分子设计器
链接:https://arxiv.org/abs/2510.08744

作者:Gang Liu, Jie Chen, Yihan Zhu, Michael Sun, Tengfei Luo, Nitesh V Chawla, Meng Jiang
备注:29 pages, 16 figures, 17 tables. Model available at: this https URL
摘要:上下文学习允许大模型从少量演示中适应新任务,但它在分子设计中的成功有限。ChEMBL等现有数据库包含横跨数百万生物测定的分子性质,但每种性质的标注数据仍然稀缺。为了解决这一限制,我们引入演示条件扩散模型(DemoDiff),它使用一小组“分子-分数”示例而非文本描述来定义任务上下文。这些演示引导去噪Transformer生成与目标性质对齐的分子。为了实现可扩展的预训练,我们开发了一种采用节点对编码(Node Pair Encoding)的新分子分词器,它在基序层面表示分子,所需节点减少5.5$\times$。我们从涵盖药物和材料的多个来源整理了包含数百万个上下文任务的数据集,并在其上预训练了一个7亿参数的模型。在六个类别的33个设计任务中,DemoDiff匹配或超越了规模大100-1000$\times$的语言模型,平均排名为3.63,而领域特定方法为5.25-10.20。这些结果将DemoDiff定位为用于上下文分子设计的分子基础模型。我们的代码可在https://github.com/liugangcode/DemoDiff获取。
摘要 :In-context learning allows large models to adapt to new tasks from a few demonstrations, but it has shown limited success in molecular design. Existing databases such as ChEMBL contain molecular properties spanning millions of biological assays, yet labeled data for each property remain scarce. To address this limitation, we introduce demonstration-conditioned diffusion models (DemoDiff), which define task contexts using a small set of molecule-score examples instead of text descriptions. These demonstrations guide a denoising Transformer to generate molecules aligned with target properties. For scalable pretraining, we develop a new molecular tokenizer with Node Pair Encoding that represents molecules at the motif level, requiring 5.5$\times$ fewer nodes. We curate a dataset containing millions of context tasks from multiple sources covering both drugs and materials, and pretrain a 0.7-billion-parameter model on it. Across 33 design tasks in six categories, DemoDiff matches or surpasses language models 100-1000$\times$ larger and achieves an average rank of 3.63 compared to 5.25-10.20 for domain-specific approaches. These results position DemoDiff as a molecular foundation model for in-context molecular design. Our code is available at https://github.com/liugangcode/DemoDiff.


【6】Knowledge Graph Sparsification for GNN-based Rare Disease Diagnosis
标题:基于GNN的罕见病诊断的知识图谱稀疏化
链接:https://arxiv.org/abs/2510.08655

作者:Premt Cara, Kamilia Zaripova, David Bani-Harouni, Nassir Navab, Azade Farshad
摘要:罕见遗传病诊断面临着严峻的挑战:患者数据不足,无法获得全基因组测序,以及大量可能的致病基因。这些限制导致诊断旅程延长、治疗不当和严重延误,对诊断工具稀缺的资源有限环境中的患者造成不成比例的影响。我们提出了RareNet,这是一种基于子图的图神经网络,它只需要患者表型来识别最可能的因果基因,并检索重点患者子图进行有针对性的临床研究。RareNet可以作为一种独立的方法,也可以作为其他候选基因优先排序方法的预处理或后处理过滤器,不断提高它们的性能,同时可能实现可解释的见解。通过对两个生物医学数据集的综合评估,我们展示了具有竞争力和鲁棒性的因果基因预测以及与其他框架集成时的显着性能增益。通过只需要在任何临床环境中都可以获得的表型数据,RareNet使复杂的遗传分析民主化,为缺乏先进基因组基础设施的服务不足的人群提供了特别的价值。
摘要:Rare genetic disease diagnosis faces critical challenges: insufficient patient data, inaccessible full genome sequencing, and the immense number of possible causative genes. These limitations cause prolonged diagnostic journeys, inappropriate treatments, and critical delays, disproportionately affecting patients in resource-limited settings where diagnostic tools are scarce. We propose RareNet, a subgraph-based Graph Neural Network that requires only patient phenotypes to identify the most likely causal gene and retrieve focused patient subgraphs for targeted clinical investigation. RareNet can function as a standalone method or serve as a pre-processing or post-processing filter for other candidate gene prioritization methods, consistently enhancing their performance while potentially enabling explainable insights. Through comprehensive evaluation on two biomedical datasets, we demonstrate competitive and robust causal gene prediction and significant performance gains when integrated with other frameworks. By requiring only phenotypic data, which is readily available in any clinical setting, RareNet democratizes access to sophisticated genetic analysis, offering particular value for underserved populations lacking advanced genomic infrastructure.


【7】Soft Graph Transformer for MIMO Detection
标题:用于多输入多输出检测的软图Transformer
链接:https://arxiv.org/abs/2509.12694

作者:Jiadong Hong, Lei Liu, Xinyu Bian, Wenjie Wang, Zhaoyang Zhang
备注:5 pages with 3 figures and 2 tables, submitted to IEEE for a possible publication
摘要:我们提出了软图Transformer(SGT),一个软输入软输出的MIMO检测设计的神经架构。虽然最大似然(ML)检测达到最佳精度,但其指数复杂性使其在大型系统中不可行,并且传统的消息传递算法依赖于在有限维中经常失败的渐近假设。最近的基于变换器的检测器表现出很强的性能,但通常忽略了MIMO因子图结构,不能利用先验软信息。SGT通过将自我注意力与图形感知的交叉注意力相结合来解决这些限制,自我注意力对符号和约束子图中的上下文依赖性进行编码,而图形感知的交叉注意力则执行跨子图的结构化消息传递。其软输入接口允许辅助先验的集成,在保持计算效率的同时产生有效的软输出。实验表明,SGT实现了接近ML的性能,并为利用软先验的接收机系统提供了一个灵活且可解释的框架。
摘要:We propose the Soft Graph Transformer (SGT), a soft-input-soft-output neural architecture designed for MIMO detection. While Maximum Likelihood (ML) detection achieves optimal accuracy, its exponential complexity makes it infeasible in large systems, and conventional message-passing algorithms rely on asymptotic assumptions that often fail in finite dimensions. Recent Transformer-based detectors show strong performance but typically overlook the MIMO factor graph structure and cannot exploit prior soft information. SGT addresses these limitations by combining self-attention, which encodes contextual dependencies within symbol and constraint subgraphs, with graph-aware cross-attention, which performs structured message passing across subgraphs. Its soft-input interface allows the integration of auxiliary priors, producing effective soft outputs while maintaining computational efficiency. Experiments demonstrate that SGT achieves near-ML performance and offers a flexible and interpretable framework for receiver systems that leverage soft priors.


【8】Decoding Positive Selection in Mycobacterium tuberculosis with Phylogeny-Guided Graph Attention Models
标题:利用系统发育引导的图注意力模型解码结核分枝杆菌的正选择
链接:https://arxiv.org/abs/2510.08703

作者:Linfeng Wang, Susana Campino, Taane G. Clark, Jody E. Phelan
摘要:正选择驱动结核分枝杆菌适应性突变的出现,塑造其耐药性、传播性和毒力。系统发生树捕捉菌株之间的进化关系,为检测此类适应性信号提供了天然框架。我们提出一种系统发育引导的图注意力网络(GAT)方法,引入了一种将SNP注释的系统发生树转换为适合神经网络分析的图结构的方法。我们使用来自四个主要谱系的500株结核分枝杆菌分离株,以及61个耐药基因中的249个单核苷酸变异(84个耐药相关、165个中性),构建了以分离株为节点、以系统发育距离为边的图。间隔超过7个内部节点的分离株之间的边被修剪,以突出局部进化结构。节点特征编码SNP的存在与否;GAT架构包括两个注意力层、一个残差连接、全局注意力池化和一个多层感知机分类器。该模型在留出测试集上达到0.88的准确率;应用于146个WHO分类为“不确定”的变异时,识别出41个在多个谱系中收敛出现的候选变异,与适应性进化一致。这项工作证明了将系统发生树转换为GNN兼容结构的可行性,并强调基于注意力的模型是检测正选择、辅助基因组监测和变异优先级排序的有效工具。
摘要:Positive selection drives the emergence of adaptive mutations in Mycobacterium tuberculosis, shaping drug resistance, transmissibility, and virulence. Phylogenetic trees capture evolutionary relationships among isolates and provide a natural framework for detecting such adaptive signals. We present a phylogeny-guided graph attention network (GAT) approach, introducing a method for converting SNP-annotated phylogenetic trees into graph structures suitable for neural network analysis. Using 500 M. tuberculosis isolates from four major lineages and 249 single-nucleotide variants (84 resistance-associated and 165 neutral) across 61 drug-resistance genes, we constructed graphs where nodes represented isolates and edges reflected phylogenetic distances. Edges between isolates separated by more than seven internal nodes were pruned to emphasise local evolutionary structure. Node features encoded SNP presence or absence, and the GAT architecture included two attention layers, a residual connection, global attention pooling, and a multilayer perceptron classifier. The model achieved an accuracy of 0.88 on a held-out test set and, when applied to 146 WHO-classified "uncertain" variants, identified 41 candidates with convergent emergence across multiple lineages, consistent with adaptive evolution. This work demonstrates the feasibility of transforming phylogenies into GNN-compatible structures and highlights attention-based models as effective tools for detecting positive selection, aiding genomic surveillance and variant prioritisation.
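
摘要中“修剪间隔超过7个内部节点的边”这一建图步骤可以用几行NumPy示意;下面的距离矩阵为随机生成,仅演示按阈值建图,不包含GAT本身。

import numpy as np

def build_graph(dist_internal_nodes, max_hops=7):
    # 由系统发育距离矩阵构建邻接矩阵:
    # 间隔超过 max_hops 个内部节点的分离株之间不连边,以突出局部进化结构
    adj = (dist_internal_nodes <= max_hops).astype(float)
    np.fill_diagonal(adj, 0.0)
    return adj

rng = np.random.default_rng(0)
d = rng.integers(1, 15, size=(6, 6))
d = np.triu(d, 1) + np.triu(d, 1).T  # 对称化的玩具距离矩阵
print(build_graph(d))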


Transformer(4篇)

【1】Efficient Resource-Constrained Training of Vision Transformers via Subspace Optimization
标题:通过子空间优化实现视觉Transformer的资源受限高效训练
链接:https://arxiv.org/abs/2510.09160

作者:Le-Trung Nguyen, Enzo Tartaglione, Van-Tam Nguyen
摘要:随着人工智能日益深入日常生活,能耗和数据隐私已成为紧迫问题。设备上学习直接在边缘设备上训练模型,既降低能耗又保护数据隐私。然而,现代神经网络规模的不断扩大给设备上训练带来了重大障碍。先前的工作集中在紧凑的卷积架构上,而我们将基于子空间的训练应用于Transformer模型。受“模型的关键信息位于一个固定子空间”这一想法的启发,我们引入权重-激活子空间迭代(WASI),该方法通过将训练限制在该子空间内,缓解了反向传播的内存瓶颈并提升了Transformer模型的推理效率。我们的结果表明,WASI保持了与普通训练相当的准确率,同时将内存占用最多减少$62\times$,将计算成本(FLOPs)最多减少$2\times$。在Raspberry Pi 5上,WASI的训练和推理速度比普通训练快约$1.5\times$。
摘要:As AI increasingly shapes daily life, energy consumption and data privacy have become pressing concerns. On-device learning trains models directly on edge devices, cutting energy consumption and safeguarding data privacy. However, the expanding scale of modern neural networks creates a major obstacle for on-device training. Although prior work has concentrated on compact convolutional architectures, we instead apply subspace-based training to transformer models. Motivated by the idea that a model's essential information lies in a fixed subspace, we introduce Weight-Activation Subspace Iteration (WASI), a method that mitigates the memory bottleneck of backpropagation and boosts inference efficiency in transformer models by restricting training to this subspace. Our results demonstrate that WASI maintains accuracy comparable to vanilla training while reducing memory usage by up to $62\times$ and computational cost (FLOPs) by up to $2\times$. On a Raspberry Pi 5, WASI achieves roughly $1.5\times$ faster training and inference than vanilla training.
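
子空间受限更新的骨架可以如下示意:用权重的前 r 个左奇异向量张成固定子空间,把梯度投影到其中再更新。子空间的具体构造与迭代方式是本示意的假设,并非WASI的完整算法。

import torch

def subspace_project(grad, U):
    # 把梯度投影到固定子空间 U(d×r,列正交):grad_proj = U Uᵀ grad;
    # 只在该子空间内更新权重,从而降低反向传播的内存与计算开销
    return U @ (U.t() @ grad)

W = torch.randn(256, 256)
U, _, _ = torch.linalg.svd(W)
U_r = U[:, :16]                 # 假设取前 r=16 个左奇异向量张成子空间
g = torch.randn(256, 256)       # 假想的权重梯度
g_proj = subspace_project(g, U_r)
W = W - 0.01 * g_proj           # 受限于子空间的一步更新
print(g_proj.shape)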


【2】Value-State Gated Attention for Mitigating Extreme-Token Phenomena in Transformers
标题:用于减轻Transformer中的极端令牌现象的值状态门控注意
链接:https://arxiv.org/abs/2510.09017

作者:Rui Bu, Haofeng Zhong, Wenzheng Chen, Yangyan Li
摘要:基于Transformer架构的大模型容易出现极端token现象,例如注意力下沉(attention sink)和值状态耗竭(value-state drain)。这些问题源于一种有问题的相互强化机制:模型通过将注意力集中在值状态接近零的token上,学会了低效的“无操作”行为,从而降低模型性能、量化保真度和可解释性。在本文中,我们提出值状态门控注意力(VGA),一种简单、专用且稳定的架构机制,通过直接打破这一循环来高效执行“无操作”注意力。VGA引入一个可学习的、数据相关的门,直接从值向量(V)计算,用以调制输出。通过对底层梯度的理论分析,我们证明,用值状态自身的函数来门控值状态,在解耦值更新与注意力分数更新方面,比以往基于输入嵌入做门控的方法更有效。这创建了一条直接的调节通路,允许模型根据token涌现出的值表示来抑制其贡献。我们的实验表明,VGA显著缓解了注意力下沉的形成并稳定了值状态范数,从而带来性能提升、稳健的量化保真度和更强的模型可解释性。
摘要:Large models based on the Transformer architecture are susceptible to extreme-token phenomena, such as attention sinks and value-state drains. These issues, which degrade model performance, quantization fidelity, and interpretability, arise from a problematic mutual reinforcement mechanism where the model learns an inefficient 'no-op' behavior by focusing attention on tokens with near-zero value states. In this paper, we propose Value-State Gated Attention (VGA), a simple, dedicated, and stable architectural mechanism for performing 'no-op' attention efficiently by directly breaking this cycle. VGA introduces a learnable, data-dependent gate, computed directly from the value vectors (V), to modulate the output. Through a theoretical analysis of the underlying gradients, we show that gating the value-state with a function of itself is more effective at decoupling value and attention score updates than prior methods that gate on input embeddings. This creates a direct regulatory pathway that allows the model to suppress a token's contribution based on its emergent value representation. Our experiments demonstrate that VGA significantly mitigates the formation of attention sinks and stabilizes value-state norms, leading to improved performance, robust quantization fidelity, and enhanced model interpretability.
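
VGA的门控机制本身只需一个线性层加sigmoid:由值向量 V 计算数据相关的门,再调制注意力输出。以下PyTorch示意省略了注意力本体,仅演示门控如何作用,具体结构为假设。

import torch
import torch.nn as nn

class ValueStateGate(nn.Module):
    # 由值向量 V 计算可学习的逐维门控,调制注意力输出,
    # 使模型能以低成本抑制某个 token 的贡献(“无操作”注意力)
    def __init__(self, d_model):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, attn_out, v):
        gate = torch.sigmoid(self.proj(v))  # 数据相关的门,取值 (0, 1)
        return gate * attn_out

# 示例:batch=2, seq=5, d=32;attn_out 为标准注意力输出,v 为对应值状态
gate = ValueStateGate(32)
attn_out, v = torch.randn(2, 5, 32), torch.randn(2, 5, 32)
print(gate(attn_out, v).shape)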


【3】Inverse-Free Wilson Loops for Transformers: A Practical Diagnostic for Invariance and Order Sensitivity
标题:Transformer的无逆Wilson环:针对不变性与顺序敏感性的实用诊断
链接:https://arxiv.org/abs/2510.08648

作者:Edward Y. Chang, Ethan Y. Chang
备注:24 pages, 10 figures, 2 tables
摘要:大型语言模型可能在无害的编辑下改变答案,这在实践中很重要:当段落被重新排序时RAG输出会翻转,微调会侵蚀预训练中学到的不变性,辩论或思维链提示会走上路径依赖的路线,编译器融合或重排序会扰动决策边界附近的logits。这些失败违反了预期的不变性,破坏了持续集成,并迫使团队以安全换取速度。这些影响虽小,却分布在各层和各位置上,对上下文长度和评估顺序敏感,而且通过重新训练或形式化验证来修复的成本很高。我们提出WILSON,一个极简的事后诊断套件,它把对内部表示的简单回路与重排序检查转化为系统信号。WILSON将基于JVP和Hutchinson探针计算的、覆盖位置与层的无逆曲率图,与标记重排序风险的激活级对易子结合起来。这些信号计算成本低,对标准Transformer而言与具体模型无关,并作为阈值和CSV工件导出给编排器使用。这使得一系列具体行动成为可能:保护RAG免受顺序效应影响、捕捉微调回归、稳定辩论路径与长多轮上下文,以及在部署中对融合或重排序操作进行门控。简而言之,WILSON有助于预见故障并批准安全的优化,从而在不更改模型架构或训练的情况下同时提升可靠性和吞吐量。
摘要:Large language models can change answers under harmless edits that matter in practice: RAG outputs flip when passages are reordered, fine-tuning erodes invariances learned at pretraining, debate or chain-of-thought prompts take path-dependent routes, and compiler fusion or reordering perturbs logits near decision boundaries. These failures violate intended invariances, break continuous integration, and force teams to trade safety for speed. The effects are small yet distributed across layers and positions, sensitive to context length and evaluation order, and costly to repair with retraining or formal verification. We present WILSON, a minimal post-hoc diagnostic suite that converts simple loop and reordering checks on internal representations into system signals. WILSON combines an inverse-free curvature map over positions and layers, computed with JVPs and Hutchinson probes, with activation-level commutators that flag reorder risk. Signals are cheap to compute, model-agnostic for standard Transformers, and exported as thresholds and CSV artifacts for orchestrators. This enables concrete actions: guard RAG against order effects, catch fine-tuning regressions, stabilize debate pathways and long multi-turn contexts, and gate fusions or reorders in deployment. In short, WILSON helps anticipate failures and approve safe optimizations so reliability and throughput can improve together without changing model architecture or training.
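
WILSON所依赖的Hutchinson探针是标准技术,可用两次自动微分实现“无逆”的曲率(Hessian迹)估计。下面给出一个通用Python示意(以标量函数为例,并非WILSON在Transformer各层与位置上的具体用法)。

import torch

def hutchinson_trace(loss_fn, x, n_probes=8):
    # Hutchinson 估计:tr(H) ≈ E[vᵀ H v],v 取 Rademacher 随机向量;
    # Hv 由两次自动微分(HVP)得到,全程无需求逆或显式构造 Hessian
    x = x.detach().requires_grad_(True)
    loss = loss_fn(x)
    g, = torch.autograd.grad(loss, x, create_graph=True)
    est = 0.0
    for _ in range(n_probes):
        v = torch.empty_like(x).bernoulli_(0.5) * 2 - 1  # Rademacher 向量
        hv, = torch.autograd.grad(g, x, grad_outputs=v, retain_graph=True)
        est += (v * hv).sum()
    return est / n_probes

f = lambda x: (x ** 2).sum()  # 真实 Hessian 为 2I,迹 = 2 * dim
print(float(hutchinson_trace(f, torch.randn(10))))  # 应接近 20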


【4】Efficient Autoregressive Inference for Transformer Probabilistic Models
标题:Transformer概率模型的有效自回归推理
链接:https://arxiv.org/abs/2510.09477

作者:Conor Hassan, Nasrulloh Loka, Cen-You Li, Daolang Huang, Paul E. Chang, Yang Yang, Francesco Silvestrin, Samuel Kaski, Luigi Acerbi
摘要 :基于转换器的摊销概率推理模型,如神经过程,先验拟合网络和表格基础模型,擅长单次通过边际预测。然而,许多现实世界的应用,从信号插值到多列表格预测,需要捕获预测之间的依赖关系的相干联合分布。虽然纯自回归架构有效地生成了这样的分布,但它们牺牲了使这些模型对元学习强大的灵活的集合条件。相反,从基于集合的模型获得联合分布的标准方法需要在每个自回归步骤对整个增强条件集进行昂贵的重新编码。我们引入了一个因果自回归缓冲区,保留了这两种范式的优点。我们的方法将上下文编码与更新条件集相结合。该模型处理一次上下文并缓存它。然后动态缓冲区捕获目标依赖性:当目标被合并时,它们进入缓冲区并处理缓存的上下文和先前缓冲的目标。这使得有效的批量自回归生成和一次通过联合对数似然评估。统一的训练策略允许以最小的额外成本无缝集成基于集合的模式和自回归模式。在合成函数、EEG信号、认知模型和表格数据中,我们的方法与强基线的预测准确性相匹配,同时提供高达20倍的联合采样速度。我们的方法将自回归生成模型的效率与基于集合的条件反射的代表性能力相结合,使得联合预测对于基于变换的概率模型是实用的。
摘要:Transformer-based models for amortized probabilistic inference, such as neural processes, prior-fitted networks, and tabular foundation models, excel at single-pass marginal prediction. However, many real-world applications, from signal interpolation to multi-column tabular predictions, require coherent joint distributions that capture dependencies between predictions. While purely autoregressive architectures efficiently generate such distributions, they sacrifice the flexible set-conditioning that makes these models powerful for meta-learning. Conversely, the standard approach to obtain joint distributions from set-based models requires expensive re-encoding of the entire augmented conditioning set at each autoregressive step. We introduce a causal autoregressive buffer that preserves the advantages of both paradigms. Our approach decouples context encoding from updating the conditioning set. The model processes the context once and caches it. A dynamic buffer then captures target dependencies: as targets are incorporated, they enter the buffer and attend to both the cached context and previously buffered targets. This enables efficient batched autoregressive generation and one-pass joint log-likelihood evaluation. A unified training strategy allows seamless integration of set-based and autoregressive modes at minimal additional cost. Across synthetic functions, EEG signals, cognitive models, and tabular data, our method matches predictive accuracy of strong baselines while delivering up to 20 times faster joint sampling. Our approach combines the efficiency of autoregressive generative models with the representational power of set-based conditioning, making joint prediction practical for transformer-based probabilistic models.


GAN|对抗|攻击|生成相关(11篇)

【1】Near-Optimal Second-Order Guarantees for Model-Based Adversarial Imitation Learning
标题:基于模型的对抗模仿学习的近优二阶保证
链接:https://arxiv.org/abs/2510.09487

作者:Shangzhe Li, Dongruo Zhou, Weitong Zhang
备注:48 pages, 3 figures, 4 tables
摘要:我们研究在线对抗性模仿学习(AIL),其中智能体从离线专家演示中学习,并在没有奖励的情况下在线与环境交互。尽管实证结果很强,但在线交互的好处和随机性的影响仍然知之甚少。我们通过引入基于模型的AIL算法(MB-AIL)来弥补这些空白,并在针对专家数据和无奖励交互的一般函数逼近下,建立了其与时域无关(horizon-free)的二阶样本复杂度保证。这些二阶界给出了依赖于具体实例的结果,可随相关策略下回报的方差缩放,因此当系统趋于确定性时界会收紧。结合在新构建的困难实例族上的二阶信息论下界,我们证明:在专家演示有限的情况下,MB-AIL在在线交互方面达到了极小极大最优的样本复杂度(至对数因子),并且在对时域$H$、精度$\epsilon$和策略方差$\sigma^2$的依赖上与专家演示的下界相匹配。实验进一步验证了我们的理论发现,并表明MB-AIL的实际实现达到或超过了现有方法的样本效率。
摘要:We study online adversarial imitation learning (AIL), where an agent learns from offline expert demonstrations and interacts with the environment online without access to rewards. Despite strong empirical results, the benefits of online interaction and the impact of stochasticity remain poorly understood. We address these gaps by introducing a model-based AIL algorithm (MB-AIL) and establish its horizon-free, second-order sample-complexity guarantees under general function approximations for both expert data and reward-free interactions. These second-order bounds provide an instance-dependent result that can scale with the variance of returns under the relevant policies and therefore tighten as the system approaches determinism. Together with second-order, information-theoretic lower bounds on a newly constructed hard-instance family, we show that MB-AIL attains minimax-optimal sample complexity for online interaction (up to logarithmic factors) with limited expert demonstrations and matches the lower bound for expert demonstrations in terms of the dependence on horizon $H$, precision $\epsilon$ and the policy variance $\sigma^2$. Experiments further validate our theoretical findings and demonstrate that a practical implementation of MB-AIL matches or surpasses the sample efficiency of existing methods.


【2】Few-shot multi-token DreamBooth with LoRa for style-consistent character generation
标题:具有LoRa的Few-Shot多令牌DreamBooth,用于风格一致的角色生成
链接:https://arxiv.org/abs/2510.09475

作者:Ruben Pascual, Mikel Sesma-Sara, Aranzazu Jurio, Daniel Paternain, Mikel Galar
摘要:视听行业正在经历一场深刻的变革,因为它正在整合人工智能的发展,不仅使日常任务自动化,而且还激发了新的艺术形式。本文讨论了产生几乎无限数量的新颖角色的问题,这些角色保留了一小部分人类设计的参考角色的艺术风格和共同的视觉特征,从而拓宽了动画,游戏,及相关领域。我们的解决方案基于DreamBooth,这是一种成熟的文本到图像扩散模型的微调技术,并使其适应两个核心挑战:捕获文本提示之外的复杂字符细节和训练数据的Few-Shot性质。为了实现这一目标,我们提出了一种多令牌策略,使用聚类将单独的令牌分配给各个角色及其集体风格,并结合基于LoRA的参数高效微调。通过删除特定于类的正则化集并在生成过程中引入随机标记和嵌入,我们的方法允许无限的字符创建,同时保留学习的风格。我们在五个小型专业数据集上评估了我们的方法,并使用定量指标和人类评估研究将其与相关基线进行了比较。我们的研究结果表明,我们的方法产生了高质量,多样化的字符,同时保留了参考字符的独特美学特征,人类评价进一步加强了其有效性,并突出了我们方法的潜力。
摘要:The audiovisual industry is undergoing a profound transformation as it is integrating AI developments not only to automate routine tasks but also to inspire new forms of art. This paper addresses the problem of producing a virtually unlimited number of novel characters that preserve the artistic style and shared visual traits of a small set of human-designed reference characters, thus broadening creative possibilities in animation, gaming, and related domains. Our solution builds upon DreamBooth, a well-established fine-tuning technique for text-to-image diffusion models, and adapts it to tackle two core challenges: capturing intricate character details beyond textual prompts and the few-shot nature of the training data. To achieve this, we propose a multi-token strategy, using clustering to assign separate tokens to individual characters and their collective style, combined with LoRA-based parameter-efficient fine-tuning. By removing the class-specific regularization set and introducing random tokens and embeddings during generation, our approach allows for unlimited character creation while preserving the learned style. We evaluate our method on five small specialized datasets, comparing it to relevant baselines using both quantitative metrics and a human evaluation study. Our results demonstrate that our approach produces high-quality, diverse characters while preserving the distinctive aesthetic features of the reference characters, with human evaluation further reinforcing its effectiveness and highlighting the potential of our method.


【3】Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols
标题:对可信监控器的自适应攻击可颠覆AI控制协议
链接:https://arxiv.org/abs/2510.09462

作者:Mikhail Terekhov, Alexander Panfilov, Daniil Dzenhaliou, Caglar Gulcehre, Maksym Andriushchenko, Ameya Prabhu, Jonas Geiping
摘要:AI控制协议是一种防御机制,用于阻止不受信任的LLM代理在自主环境中造成危害。先前的工作将其视为安全问题,用利用部署上下文巧妙完成有害副任务(如后门插入)的攻击手段进行压力测试。在实践中,大多数AI控制协议从根本上都基于LLM监控器,而监控器可能成为单点故障。我们研究由不可信模型发起的、知晓协议和监控模型的自适应攻击;如果不可信模型以更晚的知识截止日期训练,或能够自主搜索这些信息,这一设定是合理的。我们实例化了一个简单的自适应攻击向量:攻击者在模型输出中嵌入公开已知或zero-shot的提示注入。借助这种策略,前沿模型在两个主要的AI控制基准上持续规避各类监控器并完成恶意任务。该攻击对当前所有依赖监控器的协议普遍有效。此外,最近的Defer-to-Resample协议甚至适得其反,因为其重采样放大了提示注入,实际上将其重构为一种best-of-$n$攻击。总体而言,针对监控模型的自适应攻击是当前控制协议的一个主要盲点,应成为未来AI控制机制评估的标准组成部分。
摘要 :AI control protocols serve as a defense mechanism to stop untrusted LLM agents from causing harm in autonomous settings. Prior work treats this as a security problem, stress testing with exploits that use the deployment context to subtly complete harmful side tasks, such as backdoor insertion. In practice, most AI control protocols are fundamentally based on LLM monitors, which can become a central point of failure. We study adaptive attacks by an untrusted model that knows the protocol and the monitor model, which is plausible if the untrusted model was trained with a later knowledge cutoff or can search for this information autonomously. We instantiate a simple adaptive attack vector by which the attacker embeds publicly known or zero-shot prompt injections in the model outputs. Using this tactic, frontier models consistently evade diverse monitors and complete malicious tasks on two main AI control benchmarks. The attack works universally against current protocols that rely on a monitor. Furthermore, the recent Defer-to-Resample protocol even backfires, as its resampling amplifies the prompt injection and effectively reframes it as a best-of-$n$ attack. In general, adaptive attacks on monitor models represent a major blind spot in current control protocols and should become a standard component of evaluations for future AI control mechanisms.


【4】Cross-Receiver Generalization for RF Fingerprint Identification via Feature Disentanglement and Adversarial Training
标题:通过特征解耦与对抗训练实现射频指纹识别的跨接收机泛化
链接:https://arxiv.org/abs/2510.09405

作者:Yuhao Pan, Xiucheng Wang, Nan Cheng, Wenchao Xu
摘要:射频指纹识别(RFFI)是无线网络安全的关键技术,利用设备制造过程中引入的固有硬件级缺陷来实现精确的发射机识别。虽然深度神经网络在提取区分特征方面表现出了卓越的能力,但它们在现实世界中的部署受到了接收器引起的可变性的阻碍。在实践中,RF指纹信号包括发射机特定的特征以及信道失真和接收机引起的偏置。虽然信道均衡可以减轻信道噪声,但接收器引起的特征偏移在很大程度上仍未得到解决,导致RFFI模型过拟合接收器特定的模式。当训练和评估共享相同的接收器时,这种限制尤其成问题,因为在部署中更换接收器可能导致显著的性能下降。为了应对这一挑战,我们提出了一个RFFI框架,它对跨接收器的变化具有鲁棒性,集成了对抗性训练和风格转移,以明确地解开发射器和接收器的特征。通过实施域不变表示学习,我们的方法将真正的硬件签名与接收器伪影隔离开来,确保了对接收器变化的鲁棒性。在多接收器数据集上进行的大量实验表明,我们的方法始终优于最先进的基线,在不同的接收器设置中,平均精度提高了10%。
摘要:Radio frequency fingerprint identification (RFFI) is a critical technique for wireless network security, leveraging intrinsic hardware-level imperfections introduced during device manufacturing to enable precise transmitter identification. While deep neural networks have shown remarkable capability in extracting discriminative features, their real-world deployment is hindered by receiver-induced variability. In practice, RF fingerprint signals comprise transmitter-specific features as well as channel distortions and receiver-induced biases. Although channel equalization can mitigate channel noise, receiver-induced feature shifts remain largely unaddressed, causing the RFFI models to overfit to receiver-specific patterns. This limitation is particularly problematic when training and evaluation share the same receiver, as replacing the receiver in deployment can cause substantial performance degradation. To tackle this challenge, we propose an RFFI framework robust to cross-receiver variability, integrating adversarial training and style transfer to explicitly disentangle transmitter and receiver features. By enforcing domain-invariant representation learning, our method isolates genuine hardware signatures from receiver artifacts, ensuring robustness against receiver changes. Extensive experiments on multi-receiver datasets demonstrate that our approach consistently outperforms state-of-the-art baselines, achieving up to a 10% improvement in average accuracy across diverse receiver settings.
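
实现“接收机不变”特征的一种常见对抗训练手段是梯度反转层(DANN风格),在此给出示意;论文的具体对抗与风格迁移设计可能不同,此处的网络结构与接收机数量均为假设。

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    # 梯度反转层:前向恒等,反向把梯度乘以 -lambda,
    # 使特征提取器朝“骗过接收机判别器”的方向更新,从而学到接收机不变的指纹特征
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

feat = nn.Sequential(nn.Linear(64, 32), nn.ReLU())  # 特征提取器(假设结构)
rx_disc = nn.Linear(32, 4)                          # 接收机判别器(假设 4 台接收机)
x, rx_label = torch.randn(16, 64), torch.randint(0, 4, (16,))
z = feat(x)
loss_adv = nn.functional.cross_entropy(rx_disc(GradReverse.apply(z, 1.0)), rx_label)
loss_adv.backward()  # feat 收到反转后的梯度,倾向于抹除接收机特征
print(float(loss_adv))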


【5】GREAT: Generalizable Backdoor Attacks in RLHF via Emotion-Aware Trigger Synthesis
标题:GREAT:通过情感感知触发器合成在RLHF中实现可泛化的后门攻击
链接:https://arxiv.org/abs/2510.09260

作者:Subrat Kishore Dutta, Yuelin Xu, Piyush Pant, Xiao Zhang
摘要:最近的工作表明,RLHF极易受到后门攻击,即在偏好数据中注入恶意触发器的投毒方案。然而,现有方法通常依赖静态的、基于罕见token的触发器,限制了它们在现实场景中的有效性。在本文中,我们开发了GREAT,一个通过情感感知触发器合成在RLHF中构造可泛化后门的新框架。具体而言,GREAT针对一个以语义暴力请求和情绪愤怒触发器为特征的脆弱用户亚组,诱导有害响应的生成。GREAT的核心是一个在潜嵌入空间中运行的触发器识别管道,它利用主成分分析和聚类技术识别最具代表性的触发器。为此,我们提出了Erinyes,一个包含超过$5000$个愤怒触发器的高质量数据集,采用有原则的、层次化且促进多样性的方法从GPT-4.1中整理而来。在基准RLHF数据集上的实验表明,GREAT在攻击成功率上显著优于基线方法,尤其是在未见过的触发器场景下,同时在很大程度上保持了对良性输入的响应质量。
摘要:Recent work has shown that RLHF is highly susceptible to backdoor attacks, poisoning schemes that inject malicious triggers in preference data. However, existing methods often rely on static, rare-token-based triggers, limiting their effectiveness in realistic scenarios. In this paper, we develop GREAT, a novel framework for crafting generalizable backdoors in RLHF through emotion-aware trigger synthesis. Specifically, GREAT targets harmful response generation for a vulnerable user subgroup characterized by both semantically violent requests and emotionally angry triggers. At the core of GREAT is a trigger identification pipeline that operates in the latent embedding space, leveraging principal component analysis and clustering techniques to identify the most representative triggers. To enable this, we present Erinyes, a high-quality dataset of over $5000$ angry triggers curated from GPT-4.1 using a principled, hierarchical, and diversity-promoting approach. Experiments on benchmark RLHF datasets demonstrate that GREAT significantly outperforms baseline methods in attack success rates, especially for unseen trigger scenarios, while largely preserving the response quality on benign inputs.


【6】Provable Watermarking for Data Poisoning Attacks
标题:针对数据中毒攻击的可证明水印
链接:https://arxiv.org/abs/2510.09210

作者:Yifan Zhu, Lijia Yu, Xiao-Shan Gao
备注:42 pages, NeurIPS 2025
摘要:近年来,数据中毒攻击越来越多地被设计成无害甚至有益的形式,通常旨在验证数据集所有权或保护私有数据免遭未经授权的使用。然而,这些进展有可能引发误解和冲突,因为数据中毒传统上被视为机器学习系统的安全威胁。为解决这一问题,无害中毒的生成方必须声明其生成数据集的所有权,使用户能够识别潜在的中毒以防止滥用。在本文中,我们提出部署水印方案来应对这一挑战。我们为数据中毒引入了两种可证明且实用的水印方法:{\em 中毒后水印}和{\em 中毒并发水印}。我们的分析表明,当中毒后水印的水印长度为$\Theta(\sqrt{d}/\epsilon_w)$,而中毒并发水印的水印长度落在$\Theta(1/\epsilon_w^2)$到$O(\sqrt{d}/\epsilon_p)$的范围内时,带水印的中毒数据集可证明地同时确保水印的可检测性和中毒效用,证明了数据中毒攻击下水印的实用性。我们通过在多种攻击、模型和数据集上的实验验证了理论发现。
摘要:In recent years, data poisoning attacks have been increasingly designed to appear harmless and even beneficial, often with the intention of verifying dataset ownership or safeguarding private data from unauthorized use. However, these developments have the potential to cause misunderstandings and conflicts, as data poisoning has traditionally been regarded as a security threat to machine learning systems. To address this issue, it is imperative for harmless poisoning generators to claim ownership of their generated datasets, enabling users to identify potential poisoning to prevent misuse. In this paper, we propose the deployment of watermarking schemes as a solution to this challenge. We introduce two provable and practical watermarking approaches for data poisoning: {\em post-poisoning watermarking} and {\em poisoning-concurrent watermarking}. Our analyses demonstrate that when the watermarking length is $\Theta(\sqrt{d}/\epsilon_w)$ for post-poisoning watermarking, and falls within the range of $\Theta(1/\epsilon_w^2)$ to $O(\sqrt{d}/\epsilon_p)$ for poisoning-concurrent watermarking, the watermarked poisoning dataset provably ensures both watermarking detectability and poisoning utility, certifying the practicality of watermarking under data poisoning attacks. We validate our theoretical findings through experiments on several attacks, models, and datasets.


【7】Regret Bounds for Adversarial Contextual Bandits with General Function Approximation and Delayed Feedback
标题:具有一般函数逼近和延迟反馈的对抗上下文盗贼的遗憾界限
链接:https://arxiv.org/abs/2510.09127

作者:Orin Levy, Liad Erez, Alon Cohen, Yishay Mansour
摘要:We present regret minimization algorithms for the contextual multi-armed bandit (CMAB) problem over $K$ actions in the presence of delayed feedback, a scenario where loss observations arrive with delays chosen by an adversary. As a preliminary result, assuming direct access to a finite policy class $\Pi$ we establish an optimal expected regret bound of $ O (\sqrt{KT \log |\Pi|} + \sqrt{D \log |\Pi|)} $ where $D$ is the sum of delays. For our main contribution, we study the general function approximation setting over a (possibly infinite) contextual loss function class $ \mathcal{F} $ with access to an online least-square regression oracle $\mathcal{O}$ over $\mathcal{F}$. In this setting, we achieve an expected regret bound of $O(\sqrt{KT\mathcal{R}_T(\mathcal{O})} + \sqrt{ d_{\max} D \beta})$ assuming FIFO order, where $d_{\max}$ is the maximal delay, $\mathcal{R}_T(\mathcal{O})$ is an upper bound on the oracle's regret and $\beta$ is a stability parameter associated with the oracle. We complement this general result by presenting a novel stability analysis of a Hedge-based version of Vovk's aggregating forecaster as an oracle implementation for least-square regression over a finite function class $\mathcal{F}$ and show that its stability parameter $\beta$ is bounded by $\log |\mathcal{F}|$, resulting in an expected regret bound of $O(\sqrt{KT \log |\mathcal{F}|} + \sqrt{d_{\max} D \log |\mathcal{F}|})$ which is a $\sqrt{d_{\max}}$ factor away from the lower bound of $\Omega(\sqrt{KT \log |\mathcal{F}|} + \sqrt{D \log |\mathcal{F}|})$ that we also present.
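文中作为 oracle 实现分析的"Hedge 版 Vovk 聚合预测器",核心是对有限函数类按平方损失做指数加权。下面是一个极简示意(学习率 eta 与函数类均为演示用假设):

```python
import numpy as np

def hedge_forecaster(F, xs, ys, eta=0.5):
    """F: 有限候选函数列表; 逐步输出聚合预测并按平方损失更新权重."""
    logw = np.zeros(len(F))
    preds = []
    for x, y in zip(xs, ys):
        w = np.exp(logw - logw.max())
        w /= w.sum()
        fx = np.array([f(x) for f in F])
        preds.append(float(w @ fx))       # 加权平均作为聚合预测
        logw -= eta * (fx - y) ** 2       # Hedge 式指数权重更新
    return np.array(preds)

# 用法示意: 由三个常数/线性假设组成的函数类
F = [lambda x: 0.0, lambda x: x, lambda x: 2 * x]
xs = np.linspace(0, 1, 50)
ys = 2 * xs + 0.1 * np.random.randn(50)
print(hedge_forecaster(F, xs, ys)[-5:])
```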


【8】MemLoss: Enhancing Adversarial Training with Recycling Adversarial Examples
标题:MemLoss:通过回收对抗性示例来加强对抗性训练
链接:https://arxiv.org/abs/2510.09105

作者:Soroush Mahdi, Maryam Amirmazlaghani, Saeed Saravani, Zahra Dehghanian
备注:24 pages
摘要:在本文中,我们提出了一种名为MemLoss的新方法来改进机器学习模型的对抗训练。MemLoss利用先前生成的对抗性示例(称为“内存对抗性示例”)来增强模型的鲁棒性和准确性,而不会影响干净数据的性能。通过在训练时期使用这些示例,MemLoss在自然准确性和对抗鲁棒性方面都提供了平衡的改进。在包括CIFAR-10在内的多个数据集上的实验结果表明,与现有的对抗性训练方法相比,我们的方法具有更好的准确性,同时保持了对攻击的强大鲁棒性。
摘要:In this paper, we propose a new approach called MemLoss to improve the adversarial training of machine learning models. MemLoss leverages previously generated adversarial examples, referred to as 'Memory Adversarial Examples,' to enhance model robustness and accuracy without compromising performance on clean data. By using these examples across training epochs, MemLoss provides a balanced improvement in both natural accuracy and adversarial robustness. Experimental results on multiple datasets, including CIFAR-10, demonstrate that our method achieves better accuracy compared to existing adversarial training methods while maintaining strong robustness against attacks.
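摘要只说明了"跨训练轮次复用记忆对抗样本"的思路;下面给出一种可能形式的极简 PyTorch 示意(损失组合方式、系数 lam 与记忆库大小均为演示用假设,并非论文原始定义):

```python
import torch
import torch.nn.functional as F

def memloss_step(model, x, y, attack, memory, lam=0.5, mem_size=4096):
    """attack: 外部提供的对抗样本生成函数; memory: 跨 epoch 持久化的字典."""
    x_adv = attack(model, x, y)                         # 本轮新生成的对抗样本
    loss = F.cross_entropy(model(x_adv), y)
    if "x" in memory:                                   # 复用历史("记忆")对抗样本
        loss = loss + lam * F.cross_entropy(model(memory["x"]), memory["y"])
    # 以 FIFO 方式维护记忆库
    memory["x"] = torch.cat([memory.get("x", x_adv[:0]), x_adv.detach()])[-mem_size:]
    memory["y"] = torch.cat([memory.get("y", y[:0]), y.detach()])[-mem_size:]
    return loss
```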


【9】Robust Driving Control for Autonomous Vehicles: An Intelligent General-sum Constrained Adversarial Reinforcement Learning Approach
标题:自动驾驶汽车的鲁棒驾驶控制:一种智能广义和约束对抗强化学习方法
链接:https://arxiv.org/abs/2510.09041

作者:Junchao Fan, Xiaolin Chang
摘要:深度强化学习(DRL)在开发自动驾驶策略方面取得了显著的成功。然而,它对对抗性攻击的脆弱性仍然是现实世界部署的关键障碍。尽管现有的鲁棒方法已经取得了成功,但它们仍然存在三个关键问题:(i)这些方法是针对短视的对抗性攻击进行训练的,限制了它们响应更具战略性威胁的能力,(ii)它们难以诱发真正的安全关键事件(例如碰撞),往往只造成轻微后果,以及(iii)由于缺乏鲁棒性约束,这些方法可能在训练过程中引入学习不稳定性和策略漂移。为了解决这些问题,我们提出了智能广义和约束对抗强化学习(IGCARL),一种由战略性定向对手和鲁棒驾驶智能体组成的新型鲁棒自动驾驶方法。战略性定向对手旨在利用DRL的时序决策能力来执行战略协同的多步攻击。此外,它通过采用广义和目标,明确聚焦于诱发安全关键事件。鲁棒驾驶智能体通过与对手交互来学习,以形成能抵御对抗性攻击的鲁棒自动驾驶策略。为了确保在对抗环境中的稳定学习并减轻攻击引起的策略漂移,智能体在约束公式下进行优化。大量的实验表明,IGCARL将成功率比最先进方法至少提高了27.9%,表现出对对抗性攻击的卓越鲁棒性,并增强了基于DRL的自动驾驶的安全性和可靠性。
摘要:Deep reinforcement learning (DRL) has demonstrated remarkable success in developing autonomous driving policies. However, its vulnerability to adversarial attacks remains a critical barrier to real-world deployment. Although existing robust methods have achieved success, they still suffer from three key issues: (i) these methods are trained against myopic adversarial attacks, limiting their abilities to respond to more strategic threats, (ii) they have trouble causing truly safety-critical events (e.g., collisions), but instead often result in minor consequences, and (iii) these methods can introduce learning instability and policy drift during training due to the lack of robust constraints. To address these issues, we propose Intelligent General-sum Constrained Adversarial Reinforcement Learning (IGCARL), a novel robust autonomous driving approach that consists of a strategic targeted adversary and a robust driving agent. The strategic targeted adversary is designed to leverage the temporal decision-making capabilities of DRL to execute strategically coordinated multi-step attacks. In addition, it explicitly focuses on inducing safety-critical events by adopting a general-sum objective. The robust driving agent learns by interacting with the adversary to develop a robust autonomous driving policy against adversarial attacks. To ensure stable learning in adversarial environments and to mitigate policy drift caused by attacks, the agent is optimized under a constrained formulation. Extensive experiments show that IGCARL improves the success rate by at least 27.9\% over state-of-the-art methods, demonstrating superior robustness to adversarial attacks and enhancing the safety and reliability of DRL-based autonomous driving.


【10】An Improved Model-Free Decision-Estimation Coefficient with Applications in Adversarial MDPs
标题:改进的无模型决策估计系数及其在对抗性MDPs中的应用
链接:https://arxiv.org/abs/2510.08882

作者:Haolin Liu, Chen-Yu Wei, Julian Zimmert
摘要:我们研究结构化观测下的决策(DMSO)问题。先前的工作(Foster等人,2021b,2023a)通过决策估计系数(DEC)刻画了DMSO的复杂性,但在遗憾的上界和下界之间留下了一个随模型类大小缩放的差距。为了收紧这一差距,Foster等人(2023b)引入了乐观DEC,得到了仅随值函数类大小缩放的界。然而,其基于乐观主义的探索目前仅适用于随机设置,尚不清楚是否能推广到对抗设置。   我们提出Dig-DEC,一个无模型的DEC,它去掉了乐观主义,纯粹通过信息增益驱动探索。Dig-DEC总是不大于乐观DEC,且在特殊情况下可以小得多。重要的是,去掉乐观主义使其能够在没有显式奖励估计器的情况下处理对抗环境。通过将Dig-DEC应用于具有随机转移和对抗奖励的混合MDP,我们得到了若干一般转移结构下具有老虎机反馈的混合MDP的首个无模型遗憾界,解决了Liu等人(2025)留下的主要公开问题。   我们还改进了无模型学习中的在线函数估计过程:对于平均估计误差最小化,我们改进了Foster等人(2023b)中的估计器以获得更精细的集中性,将其遗憾界从$T^{3/4}$改进到$T^{2/3}$(on-policy),从$T^{5/6}$改进到$T^{7/9}$(off-policy)。对于Bellman完备MDP中的平方误差最小化,我们重新设计了其双时间尺度过程,将遗憾界从$T^{2/3}$改进到$\sqrt{T}$。这是基于DEC的方法首次在Bellman完备MDP中达到与基于乐观主义的方法(Jin等人,2021;Xie等人,2023)相匹配的性能。
摘要:We study decision making with structured observation (DMSO). Previous work (Foster et al., 2021b, 2023a) has characterized the complexity of DMSO via the decision-estimation coefficient (DEC), but left a gap between the regret upper and lower bounds that scales with the size of the model class. To tighten this gap, Foster et al. (2023b) introduced optimistic DEC, achieving a bound that scales only with the size of the value-function class. However, their optimism-based exploration is only known to handle the stochastic setting, and it remains unclear whether it extends to the adversarial setting.   We introduce Dig-DEC, a model-free DEC that removes optimism and drives exploration purely by information gain. Dig-DEC is always no larger than optimistic DEC and can be much smaller in special cases. Importantly, the removal of optimism allows it to handle adversarial environments without explicit reward estimators. By applying Dig-DEC to hybrid MDPs with stochastic transitions and adversarial rewards, we obtain the first model-free regret bounds for hybrid MDPs with bandit feedback under several general transition structures, resolving the main open problem left by Liu et al. (2025).   We also improve the online function-estimation procedure in model-free learning: For average estimation error minimization, we refine the estimator in Foster et al. (2023b) to achieve sharper concentration, improving their regret bounds from $T^{3/4}$ to $T^{2/3}$ (on-policy) and from $T^{5/6}$ to $T^{7/9}$ (off-policy). For squared error minimization in Bellman-complete MDPs, we redesign their two-timescale procedure, improving the regret bound from $T^{2/3}$ to $\sqrt{T}$. This is the first time a DEC-based method achieves performance matching that of optimism-based approaches (Jin et al., 2021; Xie et al., 2023) in Bellman-complete MDPs.


【11】A unified Bayesian framework for adversarial robustness
标题:对抗稳健性的统一Bayesian框架
链接:https://arxiv.org/abs/2510.09288

作者:Pablo G. Arce, Roi Naveiro, David Ríos Insua
摘要:机器学习模型对对抗性攻击的脆弱性仍然是一个关键的安全挑战。传统的防御,如对抗性训练,通常通过最小化最坏情况下的损失来增强模型的鲁棒性。然而,这些确定性的方法并没有考虑到对手攻击的不确定性。虽然存在将概率分布放在对手身上的随机防御,但它们往往缺乏统计上的严格性,并且无法明确其潜在假设。为了解决这些问题,我们引入了一个正式的贝叶斯框架,该框架通过随机信道对对抗性不确定性进行建模,阐明所有概率假设。这产生了两种鲁棒化策略:在训练期间制定的主动防御,与对抗性训练相一致,以及在行动期间制定的反应性防御,与对抗性净化相一致。几个以前的防御可以恢复为我们的模型的限制情况。我们实证验证了我们的方法,展示了明确建模对抗性不确定性的好处。
摘要:The vulnerability of machine learning models to adversarial attacks remains a critical security challenge. Traditional defenses, such as adversarial training, typically robustify models by minimizing a worst-case loss. However, these deterministic approaches do not account for uncertainty in the adversary's attack. While stochastic defenses placing a probability distribution on the adversary exist, they often lack statistical rigor and fail to make explicit their underlying assumptions. To resolve these issues, we introduce a formal Bayesian framework that models adversarial uncertainty through a stochastic channel, articulating all probabilistic assumptions. This yields two robustification strategies: a proactive defense enacted during training, aligned with adversarial training, and a reactive defense enacted during operations, aligned with adversarial purification. Several previous defenses can be recovered as limiting cases of our model. We empirically validate our methodology, showcasing the benefits of explicitly modeling adversarial uncertainty.


半/弱/无/有监督|不确定性|主动学习(8篇)

【1】Characterizing 5G User Throughput via Uncertainty Modeling and Crowdsourced Measurements
标题:通过不确定性建模和众包测量来描述5G用户吞吐量
链接:https://arxiv.org/abs/2510.09239

作者:Javier Albert-Smet, Zoraida Frias, Luis Mendo, Sergio Melones, Eduardo Yraola
备注:Submitted to IEEE ICC 2026
摘要:随着5G无线接入网络(RAN)的更高容量将连接瓶颈转移到网络的更深处,表征下一代网络中的应用层用户吞吐量变得越来越具有挑战性。传统的方法,如路测和运营商设备计数器,成本高、覆盖有限,或无法捕获端到端(E2E)服务质量(QoS)及其变化。在这项工作中,我们利用大规模的众包测量(包括由用户设备(UE)收集的E2E、无线电、上下文和网络部署特征),提出了一种不确定性感知且可解释的下行链路用户吞吐量估计方法。我们首先验证了先前的4G方法,将R^2提高了8.7%,然后将其扩展到5G NSA和5G SA,为5G众包数据集提供了首批基准。为了刻画吞吐量的变化性,我们应用了NGBoost(一种同时输出点估计和校准置信区间的模型),这是其在计算机通信领域的首次应用。最后,我们使用所提出的模型分析了从4G到5G SA的演进,并表明吞吐量瓶颈从RAN转移到传输和服务层,这体现为E2E指标相对无线电相关特征的重要性不断上升。
摘要:Characterizing application-layer user throughput in next-generation networks is increasingly challenging as the higher capacity of the 5G Radio Access Network (RAN) shifts connectivity bottlenecks towards deeper parts of the network. Traditional methods, such as drive tests and operator equipment counters, are costly, limited, or fail to capture end-to-end (E2E) Quality of Service (QoS) and its variability. In this work, we leverage large-scale crowdsourced measurements-including E2E, radio, contextual and network deployment features collected by the user equipment (UE)-to propose an uncertainty-aware and explainable approach for downlink user throughput estimation. We first validate prior 4G methods, improving R^2 by 8.7%, and then extend them to 5G NSA and 5G SA, providing the first benchmarks for 5G crowdsourced datasets. To address the variability of throughput, we apply NGBoost, a model that outputs both point estimates and calibrated confidence intervals, representing its first use in the field of computer communications. Finally, we use the proposed model to analyze the evolution from 4G to 5G SA, and show that throughput bottlenecks move from the RAN to transport and service layers, as seen by E2E metrics gaining importance over radio-related features.
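NGBoost 是公开的 Python 库(pip install ngboost),可同时给出点估计与分布参数;下面用合成数据演示摘要所述"点估计+校准置信区间"的用法(特征与标签均为占位假设):

```python
import numpy as np
from ngboost import NGBRegressor
from ngboost.distns import Normal
from scipy.stats import norm

rng = np.random.default_rng(0)
X = rng.random((2000, 6))                                # 占位: 无线电/上下文等特征
y = 50 + 30 * X[:, 0] + 5 * rng.standard_normal(2000)    # 占位: 下行吞吐量 (Mbps)

ngb = NGBRegressor(Dist=Normal, n_estimators=300).fit(X[:1500], y[:1500])
dist = ngb.pred_dist(X[1500:])                           # 每个测试点的正态分布参数
lo, hi = norm.interval(0.90, loc=dist.params["loc"], scale=dist.params["scale"])
print("90% 区间经验覆盖率:", np.mean((y[1500:] >= lo) & (y[1500:] <= hi)))
```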


【2】Bi-level Meta-Policy Control for Dynamic Uncertainty Calibration in Evidential Deep Learning
标题:证据深度学习中用于动态不确定性校准的双层元策略控制
链接:https://arxiv.org/abs/2510.08938

作者:Zhen Yang, Yansong Ma, Lei Chen
摘要:传统的证据深度学习(EDL)方法依赖于静态超参数进行不确定性校准,限制了其在动态数据分布中的适应性,导致在高风险决策任务中校准和泛化能力较差。为了解决这个问题,我们提出了元策略控制器(MPC),一个动态调整KL散度系数和Dirichlet先验强度以实现最优不确定性建模的动态元学习框架。具体而言,MPC采用双层优化方法:在内环中,通过适应当前训练状态的动态配置损失函数更新模型参数;在外环中,策略网络基于平衡预测准确性和不确定性质量的多目标奖励来优化KL散度系数和类别特定的Dirichlet先验强度。与之前使用固定先验的方法不同,我们的可学习Dirichlet先验能够灵活地适应类别分布和训练动态。大量的实验结果表明,MPC显著提高了各种任务中模型预测的可靠性和校准,改善了不确定性校准、预测精度以及基于置信度的样本拒绝后的性能保持。
摘要:Traditional Evidence Deep Learning (EDL) methods rely on static hyperparameter for uncertainty calibration, limiting their adaptability in dynamic data distributions, which results in poor calibration and generalization in high-risk decision-making tasks. To address this limitation, we propose the Meta-Policy Controller (MPC), a dynamic meta-learning framework that adjusts the KL divergence coefficient and Dirichlet prior strengths for optimal uncertainty modeling. Specifically, MPC employs a bi-level optimization approach: in the inner loop, model parameters are updated through a dynamically configured loss function that adapts to the current training state; in the outer loop, a policy network optimizes the KL divergence coefficient and class-specific Dirichlet prior strengths based on multi-objective rewards balancing prediction accuracy and uncertainty quality. Unlike previous methods with fixed priors, our learnable Dirichlet prior enables flexible adaptation to class distributions and training dynamics. Extensive experimental results show that MPC significantly enhances the reliability and calibration of model predictions across various tasks, improving uncertainty calibration, prediction accuracy, and performance retention after confidence-based sample rejection.


【3】Multi-fidelity Batch Active Learning for Gaussian Process Classifiers
标题:高斯过程分类器的多保真批量主动学习
链接:https://arxiv.org/abs/2510.08865

作者:Murray Cutforth, Yiming Yang, Tiffany Fan, Serge Guillas, Eric Darve
摘要:许多科学和工程问题依赖于昂贵的计算模拟,其中多保真度方法可以加速对参数空间的探索。我们研究了有效分配的模拟预算使用高斯过程(GP)模型中的二进制模拟输出的情况下。本文介绍了一种用于多保真度GP分类器的批量主动学习算法--伯努利参数互信息(BPMI)。BPMI通过采用链接函数的一阶泰勒展开来规避在概率空间中计算互信息的棘手性。我们评估BPMI对两个合成测试用例和一个复杂的,现实世界的应用程序,涉及模拟激光点火火箭燃烧室的几个基线。在所有的实验中,BPMI都表现出了卓越的性能,在固定的计算预算下实现了更高的预测精度。
摘要:Many science and engineering problems rely on expensive computational simulations, where a multi-fidelity approach can accelerate the exploration of a parameter space. We study efficient allocation of a simulation budget using a Gaussian Process (GP) model in the binary simulation output case. This paper introduces Bernoulli Parameter Mutual Information (BPMI), a batch active learning algorithm for multi-fidelity GP classifiers. BPMI circumvents the intractability of calculating mutual information in the probability space by employing a first-order Taylor expansion of the link function. We evaluate BPMI against several baselines on two synthetic test cases and a complex, real-world application involving the simulation of a laser-ignited rocket combustor. In all experiments, BPMI demonstrates superior performance, achieving higher predictive accuracy for a fixed computational budget.


【4】On the Alignment Between Supervised and Self-Supervised Contrastive Learning
标题:论监督对比学习和自我监督对比学习的一致性
链接:https://arxiv.org/abs/2510.08852

作者:Achleshwar Luthra, Priyadarsi Mishra, Tomer Galanti
摘要:自监督对比学习(CL)已经取得了显著的经验成功,常常产生在下游任务上可与监督预训练相媲美的表示。最近的理论对此作出了解释,表明随着类别数量的增长,CL损失会非常接近一个监督代理损失,即仅含负样本的监督对比学习(NSCL)损失。然而,这种损失层面的相似性留下了一个悬而未决的问题:CL和NSCL在整个训练过程中是否也在表示层面保持一致,而不仅仅是在目标上?   我们通过分析在共享随机性(相同的初始化、批次和增强)下训练的CL和NSCL模型的表示对齐来解决这个问题。首先,我们表明它们诱导的表示保持相似:具体来说,我们证明了CL和NSCL的相似性矩阵在现实条件下保持接近。我们的界为中心化核对齐(CKA)和表示相似性分析(RSA)等对齐指标提供了高概率保证,并阐明了对齐如何随类别增多、温度升高而改善,以及它对批量大小的依赖。相比之下,我们证明了参数空间的耦合本质上是不稳定的:CL和NSCL权重之间的偏离可以随训练时间呈指数增长。   最后,我们通过实验验证了这些预测,表明CL-NSCL对齐随规模和温度增强,且NSCL比其他监督目标更紧密地跟踪CL。这使NSCL成为自监督学习和监督学习之间的一座有原则的桥梁。我们的代码和项目页面位于[\href{https://github.com/DLFundamentals/understanding_ssl_v2}{code},\href{https://dlfundamentals.github.io/cl-nscl-representation-alignment/}{project page}]。
摘要:Self-supervised contrastive learning (CL) has achieved remarkable empirical success, often producing representations that rival supervised pre-training on downstream tasks. Recent theory explains this by showing that the CL loss closely approximates a supervised surrogate, Negatives-Only Supervised Contrastive Learning (NSCL) loss, as the number of classes grows. Yet this loss-level similarity leaves an open question: {\em Do CL and NSCL also remain aligned at the representation level throughout training, not just in their objectives?}   We address this by analyzing the representation alignment of CL and NSCL models trained under shared randomness (same initialization, batches, and augmentations). First, we show that their induced representations remain similar: specifically, we prove that the similarity matrices of CL and NSCL stay close under realistic conditions. Our bounds provide high-probability guarantees on alignment metrics such as centered kernel alignment (CKA) and representational similarity analysis (RSA), and they clarify how alignment improves with more classes, higher temperatures, and its dependence on batch size. In contrast, we demonstrate that parameter-space coupling is inherently unstable: divergence between CL and NSCL weights can grow exponentially with training time.   Finally, we validate these predictions empirically, showing that CL-NSCL alignment strengthens with scale and temperature, and that NSCL tracks CL more closely than other supervised objectives. This positions NSCL as a principled bridge between self-supervised and supervised learning. Our code and project page are available at [\href{https://github.com/DLFundamentals/understanding_ssl_v2}{code}, \href{https://dlfundamentals.github.io/cl-nscl-representation-alignment/}{project page}].
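摘要中用于度量表示对齐的 CKA 有一个简洁的线性形式;下面是其最简 numpy 实现,可直接用于比较同一批样本在 CL 与 NSCL 模型下的表示:

```python
import numpy as np

def linear_cka(X, Y):
    """X: (n, d1), Y: (n, d2) 为同一批样本在两个模型下的表示; 返回值位于 [0, 1]."""
    X = X - X.mean(axis=0)                        # 中心化
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(X.T @ Y, "fro") ** 2    # 线性核下的 HSIC
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y)
```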


【5】SHAP-Based Supervised Clustering for Sample Classification and the Generalized Waterfall Plot
标题:基于SHAP的有监督聚集用于样本分类和广义瀑布图
链接:https://arxiv.org/abs/2510.08737

作者:Justin Lin, Julia Fukuyama
备注:23 pages, 15 figures, 3 tables
摘要:在这个数据和技术不断发展的时代,大型黑盒模型正在成为常态,因为它们能够处理大量数据并学习极其复杂的输入输出关系。然而,这些方法的不足之处在于,它们无法解释预测过程,使它们不值得信赖,并且在高风险情况下使用不稳定。SHapley加法解释(SHAP)分析是一种可解释的人工智能方法,由于其能够根据原始特征解释模型预测而越来越受欢迎。对于数据集中的每个样本和特征,我们关联一个SHAP值,该值量化了该特征对该样本预测的贡献。对这些SHAP值进行聚类可以通过对不仅接收到相同预测而且由于类似原因接收到相同预测的样本进行分组来提供对数据的洞察。在这样做的过程中,我们绘制了不同样本达到相同预测的各种途径。为了展示这种方法,我们提出了一个模拟实验,除了在阿尔茨海默氏症的病例研究中使用的数据从阿尔茨海默氏症神经影像学倡议(ADNI)数据库。我们还提出了一个新的概括瀑布图的多分类。
摘要 :In this growing age of data and technology, large black-box models are becoming the norm due to their ability to handle vast amounts of data and learn incredibly complex input-output relationships. The deficiency of these methods, however, is their inability to explain the prediction process, making them untrustworthy and their use precarious in high-stakes situations. SHapley Additive exPlanations (SHAP) analysis is an explainable AI method growing in popularity for its ability to explain model predictions in terms of the original features. For each sample and feature in the data set, we associate a SHAP value that quantifies the contribution of that feature to the prediction of that sample. Clustering these SHAP values can provide insight into the data by grouping samples that not only received the same prediction, but received the same prediction for similar reasons. In doing so, we map the various pathways through which distinct samples arrive at the same prediction. To showcase this methodology, we present a simulated experiment in addition to a case study in Alzheimer's disease using data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. We also present a novel generalization of the waterfall plot for multi-classification.
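下面用公开的 shap 库给出"对 SHAP 值做聚类"这一核心步骤的示意(模型与数据为占位;不同 shap 版本的返回格式可能是列表或三维数组,代码中做了兼容处理):

```python
import numpy as np
import shap
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

sv = shap.TreeExplainer(model).shap_values(X)
sv = sv[1] if isinstance(sv, list) else sv[..., 1]   # 取正类的 (n, d) SHAP 矩阵

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(sv)
# 同一簇内的样本"因相似原因"获得相似预测, 可按簇检查其主导特征
```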


【6】Enhancing Self-Supervised Learning with Semantic Pairs A New Dataset and Empirical Study
标题:用语义对增强自我监督学习新数据集和实证研究
链接:https://arxiv.org/abs/2510.08722

作者:Mohammad Alkhalefi, Georgios Leontidis, Mingjun Zhong
备注:16 pages, 7 figures, 5 tables
摘要:实例区分是一种自监督表示学习范式,其中数据集中的各个实例被视为不同的类。这通常是通过应用随机变换来生成每个实例的两个不同视图来实现的,这鼓励模型学习对这些视图中的公共底层对象不变的表示。
摘要:Instance discrimination is a self-supervised representation learning paradigm wherein individual instances within a dataset are treated as distinct classes. This is typically achieved by generating two disparate views of each instance by applying stochastic transformations, which encourages the model to learn representations that are invariant to the common underlying object across these views.


【7】Out-of-Distribution Detection in LiDAR Semantic Segmentation Using Epistemic Uncertainty from Hierarchical GMMs
标题:利用分层高斯混合模型的认知不确定性进行LiDAR语义分割中的分布外检测
链接:https://arxiv.org/abs/2510.08631

作者:Hanieh Shojaei Miandashti, Claus Brenner
摘要:除了通过对LiDAR点云进行精确的语义分割来准确理解场景之外,检测分布外(OOD)对象(即训练过程中未遇到的实例)对于防止将未知对象错误地归入已知类别至关重要。虽然有监督的OOD检测方法依赖于辅助OOD数据集,但无监督方法避免了这一要求,不过通常依赖于预测熵,即通过对集成或多个后验权重样本取平均而得到的预测分布的熵。然而,这些方法往往混淆了认知(模型)不确定性和偶然(数据)不确定性,将分布内的模糊区域错误地判为OOD。为了解决这个问题,我们提出了一种无监督OOD检测方法,该方法采用对深度神经网络特征空间中高斯混合模型(GMM)参数进行分层贝叶斯建模所得到的认知不确定性。在不需要辅助数据或额外训练阶段的情况下,我们的方法在SemanticKITTI数据集上优于现有的基于不确定性的方法,与先前工作中使用的预测熵方法相比,AUROC提高了18%,AUPRC提高了22%,FPR95降低了36%(从76%降至40%)。
摘要:In addition to accurate scene understanding through precise semantic segmentation of LiDAR point clouds, detecting out-of-distribution (OOD) objects, instances not encountered during training, is essential to prevent the incorrect assignment of unknown objects to known classes. While supervised OOD detection methods depend on auxiliary OOD datasets, unsupervised methods avoid this requirement but typically rely on predictive entropy, the entropy of the predictive distribution obtained by averaging over an ensemble or multiple posterior weight samples. However, these methods often conflate epistemic (model) and aleatoric (data) uncertainties, misclassifying ambiguous in distribution regions as OOD. To address this issue, we present an unsupervised OOD detection approach that employs epistemic uncertainty derived from hierarchical Bayesian modeling of Gaussian Mixture Model (GMM) parameters in the feature space of a deep neural network. Without requiring auxiliary data or additional training stages, our approach outperforms existing uncertainty-based methods on the SemanticKITTI dataset, achieving an 18\% improvement in AUROC, 22\% increase in AUPRC, and 36\% reduction in FPR95 (from 76\% to 40\%), compared to the predictive entropy approach used in prior works.
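摘要强调要把认知(模型)不确定性从总预测不确定性中分离出来;一个标准做法是用后验/集成样本的互信息分解,极简示意如下(probs 为占位输入):

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    return -np.sum(p * np.log(p + eps), axis=axis)

def decompose_uncertainty(probs):
    """probs: (S, N, C), S 个后验样本对 N 个点、C 个类别的预测概率.
    互信息 = 总预测熵 - 期望熵, 常用作认知不确定性(OOD 分数)."""
    total = entropy(probs.mean(axis=0))        # 预测熵(总不确定性)
    aleatoric = entropy(probs).mean(axis=0)    # 期望熵(偶然/数据不确定性)
    epistemic = total - aleatoric              # 互信息(认知/模型不确定性)
    return total, aleatoric, epistemic
```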


【8】Unsupervised full-field Bayesian inference of orthotropic hyperelasticity from a single biaxial test: a myocardial case study
标题:从单次双轴试验对正交各向异性超弹性进行无监督全场Bayesian推断:心肌案例研究
链接:https://arxiv.org/abs/2510.09498

作者:Rogier P. Krijnen, Akshay Joshi, Siddhant Kumar, Mathias Peirlinck
摘要:在传统的均质化组织试验中完全捕获这种行为需要激发多种变形模式,即组合的三轴剪切试验和双轴拉伸试验。这样的多模式实验方案本质上需要多个组织样本和大量的样本操作。固有的样本间变异性和操作引起的组织损伤可能对反演识别出的组织行为产生不利影响。在这项工作中,我们旨在弥合这一差距,将注意力集中于在参数估计问题中使用非均匀变形场。更具体地说,我们将EUCLID(一种用于自动发现本构模型的无监督方法)改造为采用贝叶斯推断方法和三维连续体单元,用于高度非线性、正交各向异性本构模型的参数识别。我们展示了它在不同噪声水平下,从单次非均匀双轴拉伸试验中定量推断合成心肌组织板材料模型参数的能力。该方法与真值模拟吻合良好,并给出相应的可信区间。我们的工作突出了从单次双轴拉伸试验中表征高度非线性和正交各向异性材料模型并进行不确定性量化的潜力。
摘要:Fully capturing this behavior in traditional homogenized tissue testing requires the excitation of multiple deformation modes, i.e. combined triaxial shear tests and biaxial stretch tests. Inherently, such multimodal experimental protocols necessitate multiple tissue samples and extensive sample manipulations. Intrinsic inter-sample variability and manipulation-induced tissue damage might have an adverse effect on the inversely identified tissue behavior. In this work, we aim to overcome this gap by focusing our attention to the use of heterogeneous deformation profiles in a parameter estimation problem. More specifically, we adapt EUCLID, an unsupervised method for the automated discovery of constitutive models, towards the purpose of parameter identification for highly nonlinear, orthotropic constitutive models using a Bayesian inference approach and three-dimensional continuum elements. We showcase its strength to quantitatively infer, with varying noise levels, the material model parameters of synthetic myocardial tissue slabs from a single heterogeneous biaxial stretch test. This method shows good agreement with the ground-truth simulations and with corresponding credibility intervals. Our work highlights the potential for characterizing highly nonlinear and orthotropic material models from a single biaxial stretch test with uncertainty quantification.


迁移|Zero/Few/One-Shot|自适应(5篇)

【1】MAKO: Meta-Adaptive Koopman Operators for Learning-based Model Predictive Control of Parametrically Uncertain Nonlinear Systems
标题:MAKO:用于参数不确定非线性系统基于学习的模型预测控制的元自适应Koopman算子
链接:https://arxiv.org/abs/2510.09042

作者:Minghao Han, Kiwan Wong, Adrian Wing-Keung Law, Xunyuan Yin
摘要:在这项工作中,我们针对具有参数不确定性的非线性系统,提出了一种基于元学习的Koopman建模与预测控制方法。我们提出了一种基于自适应深度元学习的建模方法,称为元自适应Koopman算子(Meta Adaptive Koopman Operator,MAKO)。在不知道参数不确定性的情况下,所提出的MAKO方法可以从多模态数据集中学习元模型,并通过使用在线数据高效地适应具有此前未见参数设置的新系统。基于学习到的元Koopman模型,开发了一种预测控制方案,即使在存在此前未见参数设置的情况下,也能确保闭环系统的稳定性。通过大量的仿真,我们提出的方法在建模精度和控制效果方面都表现出优于竞争基线的性能。
摘要 :In this work, we propose a meta-learning-based Koopman modeling and predictive control approach for nonlinear systems with parametric uncertainties. An adaptive deep meta-learning-based modeling approach, called Meta Adaptive Koopman Operator (MAKO), is proposed. Without knowledge of the parametric uncertainty, the proposed MAKO approach can learn a meta-model from a multi-modal dataset and efficiently adapt to new systems with previously unseen parameter settings by using online data. Based on the learned meta Koopman model, a predictive control scheme is developed, and the stability of the closed-loop system is ensured even in the presence of previously unseen parameter settings. Through extensive simulations, our proposed approach demonstrates superior performance in both modeling accuracy and control efficacy as compared to competitive baselines.


【2】AB-PINNs: Adaptive-Basis Physics-Informed Neural Networks for Residual-Driven Domain Decomposition
标题:AB-PINN:用于剩余驱动区域分解的自适应基础物理信息神经网络
链接:https://arxiv.org/abs/2510.08924

作者:Jonah Botvinick-Greenhouse, Wael H. Ali, Mouhacine Benosman, Saviz Mowlavi
摘要:我们介绍了自适应基础物理信息神经网络(AB-PINNs),一种新的方法,区域分解训练PINNs,其中现有的子域动态适应未知解的内在特征。从经典的网格细化技术中汲取灵感,我们还通过在高剩余损失区域引入新的子域来修改整个训练过程中的区域分解,从而在微分方程的解具有挑战性的情况下提供额外的表达能力。我们灵活的区域分解方法非常适合多尺度问题,因为不同的子域可以学习捕获底层解决方案的不同尺度。此外,在训练过程中引入新子域的能力有助于防止收敛到不需要的局部最小值,并且与静态区域分解方法相比,可以减少对大量超参数调整的需求。在整个过程中,我们提出了全面的数值结果,证明了AB-PINNs在解决各种复杂的多尺度偏微分方程的有效性。
摘要:We introduce adaptive-basis physics-informed neural networks (AB-PINNs), a novel approach to domain decomposition for training PINNs in which existing subdomains dynamically adapt to the intrinsic features of the unknown solution. Drawing inspiration from classical mesh refinement techniques, we also modify the domain decomposition on-the-fly throughout training by introducing new subdomains in regions of high residual loss, thereby providing additional expressive power where the solution of the differential equation is challenging to represent. Our flexible approach to domain decomposition is well-suited for multiscale problems, as different subdomains can learn to capture different scales of the underlying solution. Moreover, the ability to introduce new subdomains during training helps prevent convergence to unwanted local minima and can reduce the need for extensive hyperparameter tuning compared to static domain decomposition approaches. Throughout, we present comprehensive numerical results which demonstrate the effectiveness of AB-PINNs at solving a variety of complex multiscale partial differential equations.


【3】Time-Aware Feature Selection: Adaptive Temporal Masking for Stable Sparse Autoencoder Training
标题:时间感知特征选择:用于稳定稀疏自动编码器训练的自适应时态掩蔽
链接:https://arxiv.org/abs/2510.08855

作者:T. Ed Li, Junyu Ren
备注:First submitted on February 10th, 2025 to ICLR 2025 Workshop (XAI4Science: From Understanding Model Behavior to Discovering New Scientific Knowledge). The paper was accepted but the workshop does not generate proceedings. Now uploading to arXiv to make the paper publicly available
摘要:理解大型语言模型的内部表示对于确保其可靠性和安全性至关重要,稀疏自动编码器(SAE)正在成为一种有前途的可解释性方法。然而,当前SAE训练方法面临特征吸收,其中特征(或神经元)被吸收到彼此中以最小化$L_1$惩罚,使得难以一致地识别和分析模型行为。我们引入了自适应时间掩蔽(ATM),这是一种新的训练方法,通过跟踪激活幅度,频率和重建贡献来动态调整特征选择,以计算随时间推移而演变的重要性分数。ATM应用基于这些重要性分数的统计阈值的概率掩蔽机制,创建更自然的特征选择过程。通过对Gemma-2-2b模型的广泛实验,我们证明了ATM与TopK和JumpReLU SAE等现有方法相比,吸收分数显著降低,同时保持了出色的重建质量。这些结果确立了ATM作为学习神经网络中稳定,可解释特征的原则性解决方案,为更可靠的模型分析提供了基础。
摘要:Understanding the internal representations of large language models is crucial for ensuring their reliability and safety, with sparse autoencoders (SAEs) emerging as a promising interpretability approach. However, current SAE training methods face feature absorption, where features (or neurons) are absorbed into each other to minimize $L_1$ penalty, making it difficult to consistently identify and analyze model behaviors. We introduce Adaptive Temporal Masking (ATM), a novel training approach that dynamically adjusts feature selection by tracking activation magnitudes, frequencies, and reconstruction contributions to compute importance scores that evolve over time. ATM applies a probabilistic masking mechanism based on statistical thresholding of these importance scores, creating a more natural feature selection process. Through extensive experiments on the Gemma-2-2b model, we demonstrate that ATM achieves substantially lower absorption scores compared to existing methods like TopK and JumpReLU SAEs, while maintaining excellent reconstruction quality. These results establish ATM as a principled solution for learning stable, interpretable features in neural networks, providing a foundation for more reliable model analysis.
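下面按摘要描述勾勒 ATM 的单步逻辑:跟踪激活幅度与频率得到随时间演化的重要性分数,再按统计阈值做概率掩蔽。滑动平均系数、阈值形式等细节均为演示用假设:

```python
import torch

def atm_step(acts, score_ema, beta=0.99, k=1.0):
    """acts: (batch, n_features) SAE 隐层激活; score_ema: (n_features,) 重要性分数."""
    mag = acts.abs().mean(dim=0)                        # 激活幅度
    freq = (acts != 0).float().mean(dim=0)              # 激活频率
    score_ema.mul_(beta).add_((1 - beta) * mag * freq)  # 分数随时间演化
    thresh = score_ema.mean() + k * score_ema.std()     # 统计阈值
    keep_prob = torch.sigmoid((score_ema - thresh) / (score_ema.std() + 1e-8))
    mask = torch.bernoulli(keep_prob)                   # 概率掩蔽
    return acts * mask, score_ema
```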


【4】Zero-Shot Policy Transfer in Reinforcement Learning using Buckingham's Pi Theorem
标题:基于白金汉Pi定理的强化学习零样本策略迁移
链接:https://arxiv.org/abs/2510.08768

作者:Francisco Pascoa, Ian Lalonde, Alexandre Girard
摘要:强化学习(RL)策略通常无法泛化到具有不同物理参数的新机器人、任务或环境,这一挑战限制了它们在现实世界中的适用性。本文提出了一种基于白金汉Pi定理的简单零样本迁移方法,以解决这一限制。该方法通过在无量纲空间中缩放其输入(观测)和输出(动作),使预训练策略适应新的系统情境,无需重新训练。该方法在三个复杂性递增的环境中与朴素迁移基线进行了对比评估:一个仿真单摆、一个用于仿真到现实验证的物理单摆,以及高维的HalfCheetah。结果表明,在动力学相似的情境中,缩放迁移没有性能损失。此外,在不相似的情境中,缩放策略始终优于朴素迁移,显著扩大了原始策略仍然有效的情境范围。这些发现表明,量纲分析为提高RL策略的鲁棒性和泛化能力提供了一个强大而实用的工具。
摘要:Reinforcement learning (RL) policies often fail to generalize to new robots, tasks, or environments with different physical parameters, a challenge that limits their real-world applicability. This paper presents a simple, zero-shot transfer method based on Buckingham's Pi Theorem to address this limitation. The method adapts a pre-trained policy to new system contexts by scaling its inputs (observations) and outputs (actions) through a dimensionless space, requiring no retraining. The approach is evaluated against a naive transfer baseline across three environments of increasing complexity: a simulated pendulum, a physical pendulum for sim-to-real validation, and the high-dimensional HalfCheetah. Results demonstrate that the scaled transfer exhibits no loss of performance on dynamically similar contexts. Furthermore, on non-similar contexts, the scaled policy consistently outperforms the naive transfer, significantly expanding the volume of contexts where the original policy remains effective. These findings demonstrate that dimensional analysis provides a powerful and practical tool to enhance the robustness and generalization of RL policies.
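以摘要中的单摆为例,量纲缩放迁移的思路可以写成一个很短的包装器:观测先无量纲化,策略输出的无量纲力矩再按 m g l 还原(缩放关系按单摆特征频率 omega = sqrt(g/l) 推得,属本示意的假设):

```python
import numpy as np

def make_scaled_policy(policy, m, g, l):
    """policy: 在无量纲空间训练好的策略, 输入 [theta, theta_dot*], 输出无量纲力矩."""
    omega = np.sqrt(g / l)                          # 单摆特征频率
    def scaled_policy(theta, theta_dot):
        obs = np.array([theta, theta_dot / omega])  # 角速度无量纲化
        tau_star = policy(obs)                      # 无量纲动作
        return tau_star * m * g * l                 # 换算回该系统的力矩 (N·m)
    return scaled_policy
```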


【5】Structured Output Regularization: a framework for few-shot transfer learning
标题:结构化输出正规化:少量迁移学习的框架
链接:https://arxiv.org/abs/2510.08728

作者:Nicolas Ewen, Jairo Diaz-Rodriguez, Kelly Ramsay
摘要:传统的迁移学习通常通过冻结部分权重并添加特定于任务的层来复用大型预训练网络。虽然这种方法在计算上很高效,但它限制了模型适应领域特定特征的能力,并且在数据非常有限时仍可能导致过拟合。为了解决这些限制,我们提出了结构化输出正则化(SOR),一个简单而有效的框架:冻结网络内部结构(例如卷积滤波器),同时施加组套索和$L_1$惩罚的组合。该框架以最少的额外参数使模型适配特定数据,并且易于应用于各种网络组件,例如卷积滤波器或神经网络中的各种块,从而能够广泛适用于迁移学习任务。我们在三个少样本医学影像分类任务上评估了SOR,以DenseNet121和EfficientNetB4为骨干,取得了与既有基准相当的有竞争力的结果。
摘要:Traditional transfer learning typically reuses large pre-trained networks by freezing some of their weights and adding task-specific layers. While this approach is computationally efficient, it limits the model's ability to adapt to domain-specific features and can still lead to overfitting with very limited data. To address these limitations, we propose Structured Output Regularization (SOR), a simple yet effective framework that freezes the internal network structures (e.g., convolutional filters) while using a combination of group lasso and $L_1$ penalties. This framework tailors the model to specific data with minimal additional parameters and is easily applicable to various network components, such as convolutional filters or various blocks in neural networks enabling broad applicability for transfer learning tasks. We evaluate SOR on three few shot medical imaging classification tasks and we achieve competitive results using DenseNet121, and EfficientNetB4 bases compared to established benchmarks.
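SOR 的核心惩罚项("组套索 + L1")可以用几行 PyTorch 表达;下面的分组方式(按任务层的每个输出单元分组)与系数均为演示用假设:

```python
import torch

def sor_penalty(weight, lam_group=1e-3, lam_l1=1e-4):
    """weight: (out_features, in_features) 新增任务层的权重, 骨干参数保持冻结."""
    group_lasso = weight.norm(p=2, dim=1).sum()   # 每个输出单元视作一组
    l1 = weight.abs().sum()                       # 元素级稀疏惩罚
    return lam_group * group_lasso + lam_l1 * l1

# 训练时: loss = task_loss + sor_penalty(task_head.weight)
```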


强化学习(5篇)

【1】FM-IRL: Flow-Matching for Reward Modeling and Policy Regularization in Reinforcement Learning
标题:FM-IRL:强化学习中奖励建模和政策正规化的流匹配
链接:https://arxiv.org/abs/2510.09222

作者:Zhenglin Wan, Jingxuan Wu, Xingrui Yu, Chubin Zhang, Mingcong Lei, Bo An, Ivor Tsang
备注:20 pages
摘要:流匹配(FM)在复杂分布的建模方面表现出卓越的能力,在离线模仿学习中克隆专家行为方面表现出很强的性能。然而,尽管其行为克隆表达力很强,基于FM的策略由于缺乏与环境的交互和探索而存在固有局限。这导致在专家演示之外的未知场景中泛化能力差,凸显了与环境在线交互的必要性。不幸的是,由于梯度计算的不稳定性和高推理成本,通过在线交互优化FM策略是困难且低效的。为了解决这些问题,我们提出让一个简单MLP结构的学生策略探索环境,并通过RL算法与奖励模型在线更新。该奖励模型与教师FM模型相关联,包含专家数据分布的丰富信息。此外,同一教师FM模型被用来正则化学生策略的行为,以稳定策略学习。由于学生的简单架构,我们避免了FM策略的梯度不稳定性,实现了高效的在线探索,同时仍然利用教师FM模型的表达力。大量的实验表明,我们的方法显著提高了学习效率、泛化能力和鲁棒性,特别是在从次优专家数据学习时。
摘要:Flow Matching (FM) has shown remarkable ability in modeling complex distributions and achieves strong performance in offline imitation learning for cloning expert behaviors. However, despite its behavioral cloning expressiveness, FM-based policies are inherently limited by their lack of environmental interaction and exploration. This leads to poor generalization in unseen scenarios beyond the expert demonstrations, underscoring the necessity of online interaction with environment. Unfortunately, optimizing FM policies via online interaction is challenging and inefficient due to instability in gradient computation and high inference costs. To address these issues, we propose to let a student policy with simple MLP structure explore the environment and be online updated via RL algorithm with a reward model. This reward model is associated with a teacher FM model, containing rich information of expert data distribution. Furthermore, the same teacher FM model is utilized to regularize the student policy's behavior to stabilize policy learning. Due to the student's simple architecture, we avoid the gradient instability of FM policies and enable efficient online exploration, while still leveraging the expressiveness of the teacher FM model. Extensive experiments show that our approach significantly enhances learning efficiency, generalization, and robustness, especially when learning from suboptimal expert data.


【2】Pinpointing crucial steps: Attribution-based Credit Assignment for Verifiable Reinforcement Learning
标题:确定关键步骤:可验证强化学习的基于归因的信用分配
链接:https://arxiv.org/abs/2510.08899

作者:Junxi Yin, Haisen Luo, Zhenyu Li, Yihua Liu, Dan Liu, Zequn Li, Xiaohang Xu
备注:12 pages, 5 figures
摘要:虽然带可验证奖励的强化学习(RLVR)增强了LLM中的复杂推理,但目前的方法很难平衡探索和利用。这导致了关键问题,如中间步骤的信用分配不准确和过早的熵崩溃,限制了模型的性能。为了解决这个问题,我们引入了基于归因的策略优化贡献(ACPO),一个包含难度感知课程的分阶段框架。ACPO通过使用轨迹语义分割和基于归因的表示来动态调节策略熵,从而缓解其崩溃,改善探索。同时,它通过一个分解的奖励系统增强利用,该系统精确量化每个推理步骤的层次化贡献,确保准确的信用分配。在具有挑战性的基准测试(包括AIME、MATH和AMC)上进行的大量实验表明,ACPO显著优于现有的最先进方法。
摘要:While Reinforcement Learning with Verifiable Rewards (RLVR) enhances complex reasoning in LLMs, current methods struggle to balance exploration and exploitation. This leads to critical issues like inaccurate credit assignment for intermediate steps and premature entropy collapse, limiting model performance. To address this, we introduce Attribution-based Contribution to Policy Optimization (ACPO), a phased framework that incorporates a difficulty-aware curriculum. ACPO improves exploration by using trajectory semantic segmentation and an attribution-based representation to dynamically regulate policy entropy, thus mitigating its collapse. Concurrently, it enhances exploitation with a factorized reward system that precisely quantifies the hierarchical contribution of each reasoning step, ensuring accurate credit assignment. Extensive experiments on challenging benchmarks, including AIME, MATH, and AMC, demonstrate that ACPO significantly outperforms existing state-of-the-art approaches.


【3】Reinforcement Learning-Driven Edge Management for Reliable Multi-view 3D Reconstruction
标题:强化学习驱动的边缘管理,实现可靠的多视图3D重建
链接:https://arxiv.org/abs/2510.08839

作者:Motahare Mounesan, Sourya Saha, Houchao Gan, Md. Nurul Absur, Saptarshi Debroy
摘要:实时多视图3D重建是关键边缘原生用例(如消防救援)的关键任务应用程序,其中及时准确的3D场景建模可实现态势感知和明智的决策。然而,边缘资源可用性的动态和不可预测的性质引入了中断,例如图像质量下降、网络链路不稳定和服务器负载波动,这对重建管道的可靠性提出了挑战。在这项工作中,我们提出了一个基于强化学习(RL)的边缘资源管理框架,用于可靠的3D重建,以确保在合理的时间内进行高质量的重建,尽管系统在资源受限和易中断的环境下运行。特别是,该框架采用了两个合作的Q学习代理,一个用于相机选择,一个用于服务器选择,这两个都是完全在线操作,通过与边缘环境的交互来学习策略。为了支持现实约束下的学习和评估系统性能,我们实现了一个分布式测试平台,包括实验室托管的终端设备和FABRIC基础设施托管的边缘服务器,以模拟现实中断场景下的智能城市边缘基础设施。结果表明,该框架通过有效地平衡动态环境中的端到端延迟和重建质量,提高了应用程序的可靠性。
摘要 :Real-time multi-view 3D reconstruction is a mission-critical application for key edge-native use cases, such as fire rescue, where timely and accurate 3D scene modeling enables situational awareness and informed decision-making. However, the dynamic and unpredictable nature of edge resource availability introduces disruptions, such as degraded image quality, unstable network links, and fluctuating server loads, which challenge the reliability of the reconstruction pipeline. In this work, we present a reinforcement learning (RL)-based edge resource management framework for reliable 3D reconstruction to ensure high quality reconstruction within a reasonable amount of time, despite the system operating under a resource-constrained and disruption-prone environment. In particular, the framework adopts two cooperative Q-learning agents, one for camera selection and one for server selection, both of which operate entirely online, learning policies through interactions with the edge environment. To support learning under realistic constraints and evaluate system performance, we implement a distributed testbed comprising lab-hosted end devices and FABRIC infrastructure-hosted edge servers to emulate smart city edge infrastructure under realistic disruption scenarios. Results show that the proposed framework improves application reliability by effectively balancing end-to-end latency and reconstruction quality in dynamic environments.
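摘要中的两个合作 Q-learning 智能体可以用最简单的表格式 Q 更新来理解;状态离散化方式与奖励定义均为占位假设:

```python
import numpy as np

class QAgent:
    """epsilon-贪婪的表格式 Q-learning 智能体."""
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.9, eps=0.1):
        self.Q = np.zeros((n_states, n_actions))
        self.alpha, self.gamma, self.eps = alpha, gamma, eps
    def act(self, s):
        if np.random.rand() < self.eps:
            return np.random.randint(self.Q.shape[1])   # 探索
        return int(self.Q[s].argmax())                  # 利用
    def update(self, s, a, r, s_next):
        td = r + self.gamma * self.Q[s_next].max() - self.Q[s, a]
        self.Q[s, a] += self.alpha * td

camera_agent = QAgent(n_states=16, n_actions=4)  # 例: 按网络/负载离散化的状态
server_agent = QAgent(n_states=16, n_actions=3)
# 每个时间步: 两个智能体各自选动作, 共享"时延-重建质量"折中的奖励并各自更新
```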


【4】Reinforcement Learning-Based Optimization of CT Acquisition and Reconstruction Parameters Through Virtual Imaging Trials
标题:通过虚拟成像试验基于强化学习的CT采集和重建参数优化
链接:https://arxiv.org/abs/2510.08763

作者:David Fenwick, Navid NaderiAlizadeh, Vahid Tarokh, Nicholas Felice, Darin Clark, Jayasai Rajagopal, Anuj Kapadia, Benjamin Wildman-Tobriner, Ehsan Samei, Ehsan Abadi
摘要:方案优化在计算机断层扫描(CT)中至关重要,以在实现高诊断图像质量的同时最大限度地减少辐射剂量。然而,由于CT采集和重建参数之间复杂的相互依赖性,传统的优化方法依赖于对这些参数组合的穷举测试,这通常是不切实际的。这项研究介绍了一种将虚拟成像工具与强化学习相结合、以更高效地优化CT协议的新方法。使用经验证的CT模拟器对具有肝脏病变的人体模型进行成像,并使用新型CT重建工具包进行重建。优化的参数空间包括管电压、管电流、重建核、层厚和像素尺寸。优化过程使用近端策略优化(PPO)智能体执行,该智能体被训练以最大化图像质量目标,特别是重建图像中肝脏病变的可检测性指数(d')。优化性能与在超级计算机上进行的穷举搜索作了比较。所提出的强化学习方法在测试用例中实现了全局最大d',同时比穷举搜索少用79.7%的步骤,证明了准确性和计算效率。所提出的框架是灵活的,可以适应各种图像质量目标。研究结果凸显了将虚拟成像工具与强化学习集成用于CT协议管理的潜力。
摘要:Protocol optimization is critical in Computed Tomography (CT) to achieve high diagnostic image quality while minimizing radiation dose. However, due to the complex interdependencies among CT acquisition and reconstruction parameters, traditional optimization methods rely on exhaustive testing of combinations of these parameters, which is often impractical. This study introduces a novel methodology that combines virtual imaging tools with reinforcement learning to optimize CT protocols more efficiently. Human models with liver lesions were imaged using a validated CT simulator and reconstructed with a novel CT reconstruction toolkit. The optimization parameter space included tube voltage, tube current, reconstruction kernel, slice thickness, and pixel size. The optimization process was performed using a Proximal Policy Optimization (PPO) agent, which was trained to maximize an image quality objective, specifically the detectability index (d') of liver lesions in the reconstructed images. Optimization performance was compared against an exhaustive search performed on a supercomputer. The proposed reinforcement learning approach achieved the global maximum d' across test cases while requiring 79.7% fewer steps than the exhaustive search, demonstrating both accuracy and computational efficiency. The proposed framework is flexible and can accommodate various image quality objectives. The findings highlight the potential of integrating virtual imaging tools with reinforcement learning for CT protocol management.


【5】Application of Deep Reinforcement Learning to At-the-Money S&P 500 Options Hedging
标题:深度强化学习在标普500指数平值期权对冲中的应用
链接:https://arxiv.org/abs/2510.09247

作者:Zofia Bracha, Paweł Sakowski, Jakub Michańków
备注:35 pages
摘要:本文探讨了深度Q学习在标普500指数平值期权对冲中的应用。我们开发了一个基于双延迟深度确定性策略梯度(TD3)算法的智能体,该智能体经过训练以模拟对冲决策,而无需对价格动态做出显式的模型假设。该智能体使用2004至2024年间S&P 500看涨期权的历史日内价格进行训练,使用由六个预测变量构成的单一时间序列:期权价格、标的资产价格、货币性、到期时间、已实现波动率和当前对冲头寸。训练采用了滚动前移(walk-forward)程序,从而得到近17年的样本外评估。深度强化学习(DRL)智能体的性能以同期的Black-Scholes delta对冲策略为基准。我们使用年化回报率、波动率、信息比率和夏普比率等指标来评估这两种方法。为了测试模型的适应性,我们在不同的市场条件下进行了模拟,并加入了交易成本和风险意识惩罚等约束条件。我们的研究结果表明,DRL智能体可以优于传统的对冲方法,特别是在波动或高成本的环境中,突显了其在实际交易环境中的鲁棒性和灵活性。虽然智能体一贯优于delta对冲,但当风险意识参数较高时其性能会恶化。我们还观察到,用于波动率估计的时间区间越长,结果越稳定。
摘要:This paper explores the application of deep Q-learning to hedging at-the-money options on the S\&P~500 index. We develop an agent based on the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm, trained to simulate hedging decisions without making explicit model assumptions on price dynamics. The agent was trained on historical intraday prices of S\&P~500 call options across years 2004--2024, using a single time series of six predictor variables: option price, underlying asset price, moneyness, time to maturity, realized volatility, and current hedge position. A walk-forward procedure was applied for training, which led to nearly 17~years of out-of-sample evaluation. The performance of the deep reinforcement learning (DRL) agent is benchmarked against the Black--Scholes delta-hedging strategy over the same period. We assess both approaches using metrics such as annualized return, volatility, information ratio, and Sharpe ratio. To test the models' adaptability, we performed simulations across varying market conditions and added constraints such as transaction costs and risk-awareness penalties. Our results show that the DRL agent can outperform traditional hedging methods, particularly in volatile or high-cost environments, highlighting its robustness and flexibility in practical trading contexts. While the agent consistently outperforms delta-hedging, its performance deteriorates when the risk-awareness parameter is higher. We also observed that the longer the time interval used for volatility estimation, the more stable the results.
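作为对照基准的 Black-Scholes delta 对冲只需一个封闭式公式;下面给出其计算示意(参数取值仅为演示):

```python
import numpy as np
from scipy.stats import norm

def bs_call_delta(S, K, T, r, sigma):
    """欧式看涨期权的 Black-Scholes delta; T 为剩余年限."""
    d1 = (np.log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * np.sqrt(T))
    return norm.cdf(d1)

# 平值示例: S = K = 5000, 剩余 30 个交易日, r = 2%, sigma = 15%
print(bs_call_delta(5000, 5000, 30 / 252, 0.02, 0.15))  # 约 0.53
```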


医学相关(8篇)

【1】Performance Analysis of Machine Learning Algorithms in Chronic Kidney Disease Prediction
标题:机器学习算法在慢性肾病预测中的性能分析
链接:https://arxiv.org/abs/2510.09493

作者:Iftekhar Ahmed, Tanzil Ebad Chowdhury, Biggo Bushon Routh, Nafisa Tasmiya, Shadman Sakib, Adil Ahmed Chowdhury
备注:11 pages, 7 figures, Presented at the 2022 IEEE 13th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), pp. 0417-0423
摘要:肾脏是人体的过滤器。全球约有10%的人口被认为受到慢性肾病(CKD)的影响,这会导致肾功能下降。为了保护高危患者免受进一步的肾损伤,有效的CKD风险评估和适当的CKD监测至关重要。凭借快速精确的检测能力,机器学习模型可以帮助从业者高效地实现这一目标;因此,如今医疗保健领域的大量诊断系统和流程都因其疾病预测能力而依赖机器学习。在这项研究中,我们设计并提出了用于诊断CKD的疾病预测计算机辅助设计。CKD数据集取自UCL的机器学习库,其中存在一些缺失值;这些缺失值使用“均值-众数”和“随机抽样”策略进行填补。在成功处理缺失数据后,使用八种ML技术(随机森林,SVM,朴素贝叶斯,逻辑回归,KNN,XGBoost,决策树和AdaBoost)建立模型,并通过各模型结果精度之间的性能评估比较,找出精度最高的机器学习模型。其中,随机森林和逻辑回归表现出色,准确率高达99%,其次是AdaBoost,XGBoost,朴素贝叶斯,决策树和SVM,而KNN分类器模型以73%的准确率垫底。
摘要 :Kidneys are the filter of the human body. About 10% of the global population is thought to be affected by Chronic Kidney Disease (CKD), which causes kidney function to decline. To protect in danger patients from additional kidney damage, effective risk evaluation of CKD and appropriate CKD monitoring are crucial. Due to quick and precise detection capabilities, Machine Learning models can help practitioners accomplish this goal efficiently; therefore, an enormous number of diagnosis systems and processes in the healthcare sector nowadays are relying on machine learning due to its disease prediction capability. In this study, we designed and suggested disease predictive computer-aided designs for the diagnosis of CKD. The dataset for CKD is attained from the repository of machine learning of UCL, with a few missing values; those are filled in using "mean-mode" and "Random sampling method" strategies. After successfully achieving the missing data, eight ML techniques (Random Forest, SVM, Naive Bayes, Logistic Regression, KNN, XGBoost, Decision Tree, and AdaBoost) were used to establish models, and the performance evaluation comparisons among the result accuracies are measured by the techniques to find the machine learning models with the highest accuracy. Among them, Random Forest as well as Logistic Regression showed an outstanding 99% accuracy, followed by the Ada Boost, XGBoost, Naive Bayes, Decision Tree, and SVM, whereas the KNN classifier model stands last with an accuracy of 73%.
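文中"均值-众数插补后比较多种分类器"的流程,用 scikit-learn 可以写成如下最简版本(数据路径与列划分为占位假设):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("ckd.csv")                      # 占位路径
num_cols = df.select_dtypes("number").columns.drop("class", errors="ignore")
cat_cols = df.columns.difference(num_cols).drop("class", errors="ignore")
df[num_cols] = SimpleImputer(strategy="mean").fit_transform(df[num_cols])           # 均值
df[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[cat_cols])  # 众数

X, y = pd.get_dummies(df.drop(columns="class")), df["class"]
for model in [RandomForestClassifier(), LogisticRegression(max_iter=1000)]:
    print(type(model).__name__, cross_val_score(model, X, y, cv=5).mean())
```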


【2】Augmented data and neural networks for robust epidemic forecasting: application to COVID-19 in Italy
标题:增强数据和神经网络用于稳健的流行病预测:应用于意大利COVID-19
链接:https://arxiv.org/abs/2510.09192

作者:Giacomo Dimarco, Federica Ferrarese, Lorenzo Pareschi
摘要:在这项工作中,我们提出了一种数据增强策略,旨在改善神经网络的训练阶段,从而提高其预测的准确性。我们的方法依赖于通过适当的房室模型结合不确定性来生成合成数据。然后,可用数据用于校准模型,该模型进一步与深度学习技术集成,以生成用于训练的额外合成数据。结果表明,在这些增强数据集上训练的神经网络表现出显着提高的预测性能。我们特别关注两种不同的神经网络架构:物理信息神经网络(PINNs)和非线性自回归(NAR)模型。NAR方法被证明对短期预测特别有效,通过直接从数据中学习动态来提供准确的定量估计,并避免将物理约束嵌入训练中的额外计算成本。相比之下,PINN产生不太准确的定量预测,但捕获系统的定性长期行为,使它们更适合探索更广泛的动力学趋势。对意大利伦巴第大区COVID-19大流行第二阶段的数值模拟验证了所提出方法的有效性。
摘要:In this work, we propose a data augmentation strategy aimed at improving the training phase of neural networks and, consequently, the accuracy of their predictions. Our approach relies on generating synthetic data through a suitable compartmental model combined with the incorporation of uncertainty. The available data are then used to calibrate the model, which is further integrated with deep learning techniques to produce additional synthetic data for training. The results show that neural networks trained on these augmented datasets exhibit significantly improved predictive performance. We focus in particular on two different neural network architectures: Physics-Informed Neural Networks (PINNs) and Nonlinear Autoregressive (NAR) models. The NAR approach proves especially effective for short-term forecasting, providing accurate quantitative estimates by directly learning the dynamics from data and avoiding the additional computational cost of embedding physical constraints into the training. In contrast, PINNs yield less accurate quantitative predictions but capture the qualitative long-term behavior of the system, making them more suitable for exploring broader dynamical trends. Numerical simulations of the second phase of the COVID-19 pandemic in the Lombardy region (Italy) validate the effectiveness of the proposed approach.


【3】Cross-Representation Benchmarking in Time-Series Electronic Health Records for Clinical Outcome Prediction
标题:用于临床结果预测的时间序列电子健康记录的交叉表示基准
链接:https://arxiv.org/abs/2510.09159

作者:Tianyi Chen, Mingcheng Zhu, Zhiyao Luo, Tingting Zhu
摘要:电子健康记录(EHR)使深度学习能够用于临床预测,但由于评估实践不一致,表示患者数据的最佳方式仍不清楚。我们提出了首个系统性的基准来比较EHR表示方法,包括多变量时间序列、事件流和用于LLM的文本化事件流。该基准在两种不同的临床环境中标准化了数据整理与评估:用于ICU任务(死亡率、表型)的MIMIC-IV数据集和用于纵向照护(30天再入院、1年胰腺癌)的EHRSHOT数据集。对于每种范式,我们评估了相应的建模家族,包括用于时间序列的Transformer、MLP、LSTM和RETAIN,用于事件流的CLMBR和基于计数的模型,以及用于文本流的8-20B LLM,并分析了基于数据缺失程度的特征修剪的影响。我们的实验表明,事件流模型始终提供最强的性能。像CLMBR这样的预训练模型在少样本设置中具有很高的样本效率,尽管在数据充足时,更简单的基于计数的模型也可能具有竞争力。此外,我们发现特征选择策略必须与临床环境相适应:修剪稀疏特征可以改善ICU预测,而保留它们对纵向任务至关重要。我们的结果由一个统一且可复现的管道支撑,为根据临床背景和数据规模选择EHR表示提供了实用指导。
摘要:Electronic Health Records (EHRs) enable deep learning for clinical predictions, but the optimal method for representing patient data remains unclear due to inconsistent evaluation practices. We present the first systematic benchmark to compare EHR representation methods, including multivariate time-series, event streams, and textual event streams for LLMs. This benchmark standardises data curation and evaluation across two distinct clinical settings: the MIMIC-IV dataset for ICU tasks (mortality, phenotyping) and the EHRSHOT dataset for longitudinal care (30-day readmission, 1-year pancreatic cancer). For each paradigm, we evaluate appropriate modelling families--including Transformers, MLP, LSTMs and Retain for time-series, CLMBR and count-based models for event streams, 8-20B LLMs for textual streams--and analyse the impact of feature pruning based on data missingness. Our experiments reveal that event stream models consistently deliver the strongest performance. Pre-trained models like CLMBR are highly sample-efficient in few-shot settings, though simpler count-based models can be competitive given sufficient data. Furthermore, we find that feature selection strategies must be adapted to the clinical setting: pruning sparse features improves ICU predictions, while retaining them is critical for longitudinal tasks. Our results, enabled by a unified and reproducible pipeline, provide practical guidance for selecting EHR representations based on the clinical context and data regime.


【4】Federated Data Analytics for Cancer Immunotherapy: A Privacy-Preserving Collaborative Platform for Patient Management
标题:癌症免疫治疗的联合数据分析:患者管理的隐私保护协作平台
链接:https://arxiv.org/abs/2510.09155

作者:Mira Raheem, Michael Papazoglou, Bernd Krämer, Neamat El-Tazi, Amal Elgammal
备注:This manuscript is currently under review at * ACM Transactions on Computing for Healthcare (HEALTH)*
摘要:互联健康是一种多学科方法,专注于健康管理,在创建工具,服务和治疗时优先考虑患者的需求。这种模式通过促进在护理连续体中的所有患者持有人之间及时交换准确的患者信息来确保积极有效的护理。数字技术和流程创新的兴起有望通过整合各种医疗数据源来增强互联健康。这种集成旨在个性化护理,预测健康结果并简化患者管理,但挑战仍然存在,特别是在数据架构,应用程序互操作性和安全性方面。数据分析可以为知情决策和健康共创提供关键见解,但解决方案必须优先考虑最终用户,包括患者和医疗保健专业人员。这一观点是通过欧盟资助的项目中的敏捷系统开发团队进行探索的,该项目旨在开发一种集成的人工智能生成的解决方案,用于管理接受免疫治疗的癌症患者。本文提供了一个协作数字框架,将整个护理过程中的利益相关者整合在一起,利用联合大数据分析和人工智能来改善决策,同时确保隐私。使用真实数据验证了治疗建议和不良事件预测等分析能力,在与医疗合作伙伴的试点研究中达到了70%-90%的准确率,证明了该框架的有效性。
摘要:Connected health is a multidisciplinary approach focused on health management, prioritizing patient needs in the creation of tools, services, and treatments. This paradigm ensures proactive and efficient care by facilitating the timely exchange of accurate patient information among all stakeholders in the care continuum. The rise of digital technologies and process innovations promises to enhance connected health by integrating various healthcare data sources. This integration aims to personalize care, predict health outcomes, and streamline patient management, though challenges remain, particularly in data architecture, application interoperability, and security. Data analytics can provide critical insights for informed decision-making and health co-creation, but solutions must prioritize end-users, including patients and healthcare professionals. This perspective was explored through an agile System Development Lifecycle in an EU-funded project aimed at developing an integrated AI-generated solution for managing cancer patients undergoing immunotherapy. This paper contributes with a collaborative digital framework integrating stakeholders across the care continuum, leveraging federated big data analytics and artificial intelligence for improved decision-making while ensuring privacy. Analytical capabilities, such as treatment recommendations and adverse event predictions, were validated using real-life data, achieving 70%-90% accuracy in a pilot study with the medical partners, demonstrating the framework's effectiveness.


【5】The Boundaries of Fair AI in Medical Image Prognosis: A Causal Perspective
标题:公平人工智能在医学图像预测中的界限:因果角度
链接:https://arxiv.org/abs/2510.08840

作者:Thai-Hoang Pham, Jiayuan Chen, Seungyeon Lee, Yuanlong Wang, Sayoko Moroi, Xueru Zhang, Ping Zhang
备注:Accepted at NeurIPS 2025
摘要:随着机器学习(ML)算法越来越多地用于医学图像分析,人们开始担心它们对某些社会群体的潜在偏见。尽管已经提出了许多方法来确保ML模型的公平性,但大多数现有工作仅关注医学图像诊断任务(例如图像分类和分割),而忽视了预后场景,即预测医疗状况随时间推移的可能结果或进展。为了填补这一空白,我们引入了FairTTE,首个用于评估医学成像中事件发生时间(TTE)预测公平性的综合框架。FairTTE涵盖多种成像模态和TTE结果,集成了前沿的TTE预测和公平性算法,可对医学图像预后的公平性进行系统和细粒度的分析。利用因果分析技术,FairTTE发现并量化了医学成像数据集中嵌入的不同偏倚来源。我们的大规模评估表明,偏倚普遍存在于不同的成像模态中,且当前的公平性方法只能提供有限的缓解。我们进一步证明了潜在偏倚来源与模型差异之间的强关联,强调需要针对所有形式偏倚的整体方法。值得注意的是,我们发现公平性在分布偏移下变得越来越难以维持,凸显了现有解决方案的局限性以及对更鲁棒、更公平的预后模型的迫切需要。
摘要:As machine learning (ML) algorithms are increasingly used in medical image analysis, concerns have emerged about their potential biases against certain social groups. Although many approaches have been proposed to ensure the fairness of ML models, most existing works focus only on medical image diagnosis tasks, such as image classification and segmentation, and overlooked prognosis scenarios, which involve predicting the likely outcome or progression of a medical condition over time. To address this gap, we introduce FairTTE, the first comprehensive framework for assessing fairness in time-to-event (TTE) prediction in medical imaging. FairTTE encompasses a diverse range of imaging modalities and TTE outcomes, integrating cutting-edge TTE prediction and fairness algorithms to enable systematic and fine-grained analysis of fairness in medical image prognosis. Leveraging causal analysis techniques, FairTTE uncovers and quantifies distinct sources of bias embedded within medical imaging datasets. Our large-scale evaluation reveals that bias is pervasive across different imaging modalities and that current fairness methods offer limited mitigation. We further demonstrate a strong association between underlying bias sources and model disparities, emphasizing the need for holistic approaches that target all forms of bias. Notably, we find that fairness becomes increasingly difficult to maintain under distribution shifts, underscoring the limitations of existing solutions and the pressing need for more robust, equitable prognostic models.


【6】Reproducible Evaluation of Data Augmentation and Loss Functions for Brain Tumor Segmentation
标题:脑肿瘤分割中数据增强和损失函数的可重复性评价
链接:https://arxiv.org/abs/2510.08617

作者:Saumya B
备注:Code and results available at this https URL
摘要:脑肿瘤分割对于诊断和治疗计划至关重要,但类不平衡和有限的模型泛化等挑战继续阻碍进展。这项工作提出了一个可重复的评估U-Net分割性能的脑肿瘤MRI使用焦点损失和基本的数据增强策略。在公开的MRI数据集上进行实验,重点关注焦点损失参数调整并评估三种数据增强技术的影响:水平翻转,旋转和缩放。具有焦点损失的U-Net实现了90%的精度,与最先进的结果相当。通过公开所有代码和结果,这项研究建立了一个透明的,可重复的基线,以指导未来的研究增强策略和损失函数设计在脑肿瘤分割。
摘要:Brain tumor segmentation is crucial for diagnosis and treatment planning, yet challenges such as class imbalance and limited model generalization continue to hinder progress. This work presents a reproducible evaluation of U-Net segmentation performance on brain tumor MRI using focal loss and basic data augmentation strategies. Experiments were conducted on a publicly available MRI dataset, focusing on focal loss parameter tuning and assessing the impact of three data augmentation techniques: horizontal flip, rotation, and scaling. The U-Net with focal loss achieved a precision of 90%, comparable to state-of-the-art results. By making all code and results publicly available, this study establishes a transparent, reproducible baseline to guide future research on augmentation strategies and loss function design in brain tumor segmentation.
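文中调参的焦点损失在二值分割下有标准的逐像素形式,极简 PyTorch 实现如下(alpha、gamma 即为文中所调的超参数):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """logits/targets: (N, 1, H, W), targets 取值 {0, 1}."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                               # 预测为真实类别的概率
    a_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (a_t * (1 - p_t) ** gamma * bce).mean()      # (1-p_t)^gamma 降低易分样本权重
```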


【7】Interpretable Generative and Discriminative Learning for Multimodal and Incomplete Clinical Data
标题:面向多模态与不完整临床数据的可解释生成与判别学习
链接:https://arxiv.org/abs/2510.09513

作者:Albert Belenguer-Llorens, Carlos Sevilla-Salcedo, Janaina Mourao-Miranda, Vanessa Gómez-Verdejo
摘要:现实世界的临床问题通常以多模态数据为特征,通常与不完整的视图和队列中有限的样本量相关,这对机器学习算法造成了重大限制。在这项工作中,我们提出了一种贝叶斯方法,旨在有效地处理这些挑战,同时提供可解释的解决方案。我们的方法集成了(1)一个生成式的制定,以捕捉跨视图的关系与半监督策略,(2)一个歧视性的面向任务的制定,以确定相关的信息,为特定的下游目标。这种双重生成-判别公式提供了一般理解和特定于任务的见解;因此,它提供了对缺失视图的自动估算,同时支持跨不同数据源的鲁棒推断。当应用于多模态临床数据时,这种方法的潜力变得显而易见,在这种情况下,我们的算法能够捕获和解开生物,心理和社会人口模式之间的复杂相互作用。
摘要:Real-world clinical problems are often characterized by multimodal data, usually associated with incomplete views and limited sample sizes in their cohorts, posing significant limitations for machine learning algorithms. In this work, we propose a Bayesian approach designed to efficiently handle these challenges while providing interpretable solutions. Our approach integrates (1) a generative formulation to capture cross-view relationships with a semi-supervised strategy, and (2) a discriminative task-oriented formulation to identify relevant information for specific downstream objectives. This dual generative-discriminative formulation offers both general understanding and task-specific insights; thus, it provides an automatic imputation of the missing views while enabling robust inference across different data sources. The potential of this approach becomes evident when applied to the multimodal clinical data, where our algorithm is able to capture and disentangle the complex interactions among biological, psychological, and sociodemographic modalities.


【8】A Biophysically-Conditioned Generative Framework for 3D Brain Tumor MRI Synthesis
标题:用于3D脑肿瘤MRI合成的生物物理条件生成框架
链接:https://arxiv.org/abs/2510.09365

作者:Valentin Biller, Lucas Zimmer, Can Erdur, Sandeep Nagar, Daniel Rückert, Niklas Bubeck, Jonas Weidner
摘要:磁共振成像(MRI)修复支持许多临床和研究应用。我们介绍了首个以体素级连续肿瘤浓度为条件、合成高保真脑肿瘤MRI的生成模型。对于BraTS 2025修复挑战赛,我们通过将肿瘤浓度设置为零,使该架构适应健康组织修复这一互补任务。我们以组织分割和肿瘤浓度为条件的潜在扩散模型,可为肿瘤合成和健康组织修复生成3D空间相干且解剖学一致的图像。健康组织修复的PSNR达到18.5,肿瘤修复达到17.4。我们的代码可从以下网址获得:https://github.com/valentin-biller/ldm.git
摘要:Magnetic resonance imaging (MRI) inpainting supports numerous clinical and research applications. We introduce the first generative model that conditions on voxel-level, continuous tumor concentrations to synthesize high-fidelity brain tumor MRIs. For the BraTS 2025 Inpainting Challenge, we adapt this architecture to the complementary task of healthy tissue restoration by setting the tumor concentrations to zero. Our latent diffusion model conditioned on both tissue segmentations and the tumor concentrations generates 3D spatially coherent and anatomically consistent images for both tumor synthesis and healthy tissue inpainting. For healthy inpainting, we achieve a PSNR of 18.5, and for tumor inpainting, we achieve 17.4. Our code is available at: https://github.com/valentin-biller/ldm.git


蒸馏|知识提取(1篇)

【1】FedL2T: Personalized Federated Learning with Two-Teacher Distillation for Seizure Prediction
标题:FedL2T:用于癫痫发作预测的双教师蒸馏个性化联邦学习
链接:https://arxiv.org/abs/2510.08984

作者:Jionghao Lou, Jian Zhang, Zhongmei Li, Lanlan Chen, Enbo Feng
摘要:癫痫发作预测中的深度学习模型的训练需要大量的脑电图(EEG)数据。然而,由于注释成本和隐私限制,获取足够的标记EEG数据是困难的。联邦学习(FL)通过共享模型更新而不是原始数据来实现隐私保护的协作训练。然而,由于现实世界场景中固有的患者间变异性,现有的基于FL的癫痫发作预测方法难以在异构客户端设置下实现稳健的性能。为了应对这一挑战,我们提出了FedL2T,这是一个个性化的联邦学习框架,它利用一种新颖的双教师知识蒸馏策略为每个客户端生成卓越的个性化模型。具体而言,每个客户端同时从全局聚合模型和动态分配的对等模型中学习,促进更直接和丰富的知识交流。为了确保可靠的知识转移,FedL2T采用了自适应多级蒸馏策略,该策略根据任务置信度调整预测输出和中间特征表示。此外,引入近端正则化项来约束个性化模型更新,从而增强训练稳定性。在两个EEG数据集上进行的大量实验表明,FedL2T始终优于最先进的FL方法,特别是在低标签条件下。此外,FedL2T表现出快速和稳定的收敛到最佳性能,从而减少了通信轮数和相关的开销。这些结果强调了FedL2T作为隐私敏感医疗保健场景中癫痫发作预测的可靠和个性化解决方案的潜力。
摘要:The training of deep learning models in seizure prediction requires large amounts of Electroencephalogram (EEG) data. However, acquiring sufficient labeled EEG data is difficult due to annotation costs and privacy constraints. Federated Learning (FL) enables privacy-preserving collaborative training by sharing model updates instead of raw data. However, due to the inherent inter-patient variability in real-world scenarios, existing FL-based seizure prediction methods struggle to achieve robust performance under heterogeneous client settings. To address this challenge, we propose FedL2T, a personalized federated learning framework that leverages a novel two-teacher knowledge distillation strategy to generate superior personalized models for each client. Specifically, each client simultaneously learns from a globally aggregated model and a dynamically assigned peer model, promoting more direct and enriched knowledge exchange. To ensure reliable knowledge transfer, FedL2T employs an adaptive multi-level distillation strategy that aligns both prediction outputs and intermediate feature representations based on task confidence. In addition, a proximal regularization term is introduced to constrain personalized model updates, thereby enhancing training stability. Extensive experiments on two EEG datasets demonstrate that FedL2T consistently outperforms state-of-the-art FL methods, particularly under low-label conditions. Moreover, FedL2T exhibits rapid and stable convergence toward optimal performance, thereby reducing the number of communication rounds and associated overhead. These results underscore the potential of FedL2T as a reliable and personalized solution for seizure prediction in privacy-sensitive healthcare scenarios.
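A minimal sketch of the two-teacher objective as described in the abstract: prediction-level distillation from both a global teacher and a peer teacher, plus the proximal term. The temperature, loss weights, and the adaptive feature-level alignment are simplifying assumptions:

```python
import torch
import torch.nn.functional as F

def two_teacher_loss(student_logits, global_logits, peer_logits,
                     labels, student_params, global_params,
                     T=2.0, lam_kd=0.5, mu=0.01):
    # Supervised term on the client's local labels.
    ce = F.cross_entropy(student_logits, labels)
    # Prediction-level distillation from the two teachers.
    kd = sum(
        F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                 F.softmax(t / T, dim=-1),
                 reduction="batchmean") * T * T
        for t in (global_logits, peer_logits)
    )
    # Proximal regularizer keeping the personalized model near the global one.
    prox = sum((w - g.detach()).pow(2).sum()
               for w, g in zip(student_params, global_params))
    return ce + lam_kd * kd + mu * prox
```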


推荐(1篇)

【1】Cross-attention Secretly Performs Orthogonal Alignment in Recommendation Models
标题:交叉注意力在推荐模型中秘密执行正交对齐
链接:https://arxiv.org/abs/2510.09435

作者:Hyunin Lee, Yong Zhang, Hoang Vu Nguyen, Xiaoyi Liu, Namyong Park, Christopher Jung, Rong Jin, Yang Wang, Zhigang Wang, Somayeh Sojoudi, Xue Feng
备注:19 pages
摘要:跨领域顺序推荐(CDSR)旨在对齐从不同领域收集的异构用户行为序列。虽然交叉注意力被广泛用于增强对齐并提高推荐性能,但其底层机制尚未被完全理解。大多数研究者将交叉注意力解释为残差对齐:通过参考另一个领域的数据(作为键和值输入),从查询输入中去除冗余信息并保留非冗余信息来生成输出。在这一流行观点之外,我们引入了正交对齐,即交叉注意力发现查询输入中不存在的新信息的现象,并进一步论证这两种截然不同的对齐机制可以在推荐模型中共存。在300多个实验中我们发现,当交叉注意力的查询输入与输出正交时,模型性能会提升。值得注意的是,正交对齐是在没有任何显式正交约束的情况下自然出现的。我们的关键见解是,正交对齐之所以自然出现,是因为它改进了缩放定律。我们表明,额外加入交叉注意力模块的基线优于参数匹配的基线,实现了更高的单位参数精度。我们希望这些发现为多模态研究中的参数高效缩放提供新的方向。
摘要:Cross-domain sequential recommendation (CDSR) aims to align heterogeneous user behavior sequences collected from different domains. While cross-attention is widely used to enhance alignment and improve recommendation performance, its underlying mechanism is not fully understood. Most researchers interpret cross-attention as residual alignment, where the output is generated by removing redundant and preserving non-redundant information from the query input by referencing another domain's data, which serves as the input key and value. Beyond the prevailing view, we introduce Orthogonal Alignment, a phenomenon in which cross-attention discovers novel information that is not present in the query input, and further argue that these two contrasting alignment mechanisms can co-exist in recommendation models. Across more than 300 experiments, we find that model performance improves when the query input and output of cross-attention are orthogonal. Notably, Orthogonal Alignment emerges naturally, without any explicit orthogonality constraints. Our key insight is that Orthogonal Alignment emerges naturally because it improves the scaling law. We show that baselines additionally incorporating a cross-attention module outperform parameter-matched baselines, achieving superior accuracy per model parameter. We hope these findings offer new directions for parameter-efficient scaling in multi-modal research.
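The orthogonality diagnostic itself is easy to reproduce. A sketch (shapes and module configuration are assumptions) that measures the cosine similarity between a cross-attention block's query input and its output; values near zero indicate the Orthogonal Alignment regime described above:

```python
import torch
import torch.nn as nn

d = 64
attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)
query = torch.randn(8, 16, d)   # domain-A sequence (batch, length, dim)
kv = torch.randn(8, 16, d)      # domain-B sequence used as key/value
out, _ = attn(query, kv, kv)

# Cosine similarity between the query input and the cross-attention output:
# near 0 means the output is (close to) orthogonal to the input.
cos = torch.cosine_similarity(query.flatten(1), out.flatten(1), dim=-1)
print(cos.mean().item())
```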


聚类(1篇)

【1】Deep Multimodal Subspace Clustering Networks
标题:深度多模态子空间聚类网络
链接:https://arxiv.org/abs/1804.06498

作者:Mahdi Abavisani, Vishal M. Patel
摘要:我们提出了基于卷积神经网络(CNN)的无监督多模态子空间聚类方法。该框架包括三个主要阶段-多模态编码器,自我表达层,和多模态解码器。编码器将多模态数据作为输入,并将它们融合到潜在的空间表示中。自我表达层负责执行自我表达属性并获取与数据点相对应的亲和矩阵。解码器重建原始输入数据。该网络在训练中使用解码器的重建与原始输入之间的距离。我们调查早期,晚期和中间融合技术,并提出了三种不同的编码器对应于他们的空间融合。对于不同的基于空间融合的方法,自表达层和多模式解码器本质上是相同的。除了各种基于空间融合的方法之外,还提出了一种基于亲和力融合的网络,其中对应于不同模态的自我表达层被强制为相同的。在三个数据集上的大量实验表明,所提出的方法显着优于最先进的多模态子空间聚类方法。
摘要:We present convolutional neural network (CNN) based approaches for unsupervised multimodal subspace clustering. The proposed framework consists of three main stages - multimodal encoder, self-expressive layer, and multimodal decoder. The encoder takes multimodal data as input and fuses them to a latent space representation. The self-expressive layer is responsible for enforcing the self-expressiveness property and acquiring an affinity matrix corresponding to the data points. The decoder reconstructs the original input data. The network uses the distance between the decoder's reconstruction and the original input in its training. We investigate early, late and intermediate fusion techniques and propose three different encoders corresponding to them for spatial fusion. The self-expressive layers and multimodal decoders are essentially the same for different spatial fusion-based approaches. In addition to various spatial fusion-based methods, an affinity fusion-based network is also proposed in which the self-expressive layer corresponding to different modalities is enforced to be the same. Extensive experiments on three datasets show that the proposed methods significantly outperform the state-of-the-art multimodal subspace clustering methods.
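The self-expressive layer at the center of this architecture is compact enough to sketch directly. Assuming latent features z of shape (n_samples, dim), it learns coefficients C with Z ≈ CZ and a zeroed diagonal:

```python
import torch
import torch.nn as nn

class SelfExpressive(nn.Module):
    """Z_hat = C @ Z with an l1 penalty on C and no self-loops; the learned
    |C| + |C|^T is then usable as an affinity matrix for spectral clustering."""
    def __init__(self, n_samples):
        super().__init__()
        self.C = nn.Parameter(1e-4 * torch.randn(n_samples, n_samples))

    def forward(self, z):
        c = self.C - torch.diag(torch.diag(self.C))  # forbid trivial identity
        z_hat = c @ z
        reg = c.abs().sum()                          # sparsity on coefficients
        rec = (z_hat - z).pow(2).sum()               # self-expression residual
        return z_hat, rec + 1e-3 * reg
```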


超分辨率|去噪|去模糊|去雾(1篇)

【1】SkipSR: Faster Super Resolution with Token Skipping
标题:SkipSR:通过令牌跳过实现更快的超分辨率
链接:https://arxiv.org/abs/2510.08799

作者:Rohan Choudhury, Shanchuan Lin, Jianyi Wang, Hao Chen, Qi Zhao, Feng Cheng, Lu Jiang, Kris Kitani, Laszlo A. Jeni
备注:14 pages, 7 figures
摘要:基于扩散的超分辨率(SR)是视频生成和视频恢复中的关键组件,但是速度慢且昂贵,限制了更高分辨率和更长视频的可扩展性。我们的关键见解是,视频中的许多区域本质上是低细节的,并且从细化中获得的好处很少,但目前的方法统一处理所有像素。为了利用这一点,我们提出了SkipSR,这是一个简单的框架,通过直接从低分辨率输入中识别低细节区域来加速视频SR,然后完全跳过对它们的计算,只对需要细化的区域进行超分辨率处理。这种简单而有效的策略保留了标准和一步扩散SR模型的感知质量,同时显着减少计算。在标准SR基准测试中,我们的方法在720p视频上实现了比先前模型快60%的端到端延迟,并且没有明显的质量损失。视频演示可在https://rccchoudhury.github.io/skipsr/上获得
摘要:Diffusion-based super-resolution (SR) is a key component in video generation and video restoration, but is slow and expensive, limiting scalability to higher resolutions and longer videos. Our key insight is that many regions in video are inherently low-detail and gain little from refinement, yet current methods process all pixels uniformly. To take advantage of this, we propose SkipSR, a simple framework for accelerating video SR by identifying low-detail regions directly from low-resolution input, then skipping computation on them entirely, only super-resolving the areas that require refinement. This simple yet effective strategy preserves perceptual quality in both standard and one-step diffusion SR models while significantly reducing computation. In standard SR benchmarks, our method achieves up to 60% faster end-to-end latency than prior models on 720p videos with no perceptible loss in quality. Video demos are available at https://rccchoudhury.github.io/skipsr/
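A toy illustration of the skipping idea (not the paper's implementation): score low-resolution patches by local variance, run the SR model only where detail is high, and fall back to bilinear upsampling elsewhere. sr_model, the patch size, and the threshold are assumptions:

```python
import torch
import torch.nn.functional as F

def skip_sr(lr, sr_model, patch=16, thresh=1e-3, scale=4):
    # lr: (B, C, H, W). Per-patch variance as a cheap detail proxy.
    var = F.avg_pool2d(lr ** 2, patch) - F.avg_pool2d(lr, patch) ** 2
    mask = (var.mean(1, keepdim=True) > thresh).float()   # (B, 1, H/p, W/p)
    mask = F.interpolate(mask, scale_factor=patch * scale, mode="nearest")
    cheap = F.interpolate(lr, scale_factor=scale, mode="bilinear",
                          align_corners=False)
    # For clarity the SR model still sees the full tensor; a real
    # implementation would gather only the masked patches to save compute.
    return mask * sr_model(lr) + (1 - mask) * cheap
```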


点云|SLAM|雷达|激光|深度RGBD相关(1篇)

【1】Exploring Single Domain Generalization of LiDAR-based Semantic Segmentation under Imperfect Labels
标题:探索不完美标签下基于LiDAR的语义分割的单域泛化
链接:https://arxiv.org/abs/2510.09035

作者:Weitong Kong, Zichao Zeng, Di Wen, Jiale Wei, Kunyu Peng, June Moh Goo, Jan Boehm, Rainer Stiefelhagen
摘要:准确的感知对于车辆安全至关重要,而LiDAR是自动驾驶的关键推动因素。为了确保在环境、传感器类型和天气条件下的鲁棒性能,而无需昂贵的重新注释,基于LiDAR的3D语义分割中的域泛化至关重要。然而,由于传感器缺陷、遮挡和人为错误,LiDAR注释通常是有噪声的。这种噪声降低了分割精度,并在域偏移下进一步放大,威胁系统可靠性。虽然噪声标签学习在图像中得到了很好的研究,但其在领域泛化下对3D LiDAR分割的扩展在很大程度上尚未探索,因为点云的稀疏和不规则结构限制了2D方法的直接使用。为了解决这一差距,我们引入了新的任务Domain Generalization for LiDAR Semantic Segmentation under Noisy Labels(DGLSS-NL),并通过调整从图像分类到3D分割的三种代表性噪声标签学习策略来建立第一个基准。然而,我们发现现有的噪声标签学习方法对LiDAR数据的适应性很差。因此,我们提出了DuNe,一个具有强分支和弱分支的双视图框架,该框架强制执行特征级一致性,并基于置信度感知的预测过滤应用交叉熵损失。我们的方法通过在10%对称标签噪声下在SemanticKITTI上实现56.86%mIoU,在nuScenes上实现42.28%,在SemanticPOSS上实现52.58%,显示了最先进的性能,总体算术平均值(AM)为49.57%,调和平均值(HM)为48.50%,从而在DGLSS-NL任务中展示了鲁棒的域泛化。代码可以在我们的项目页面上找到。
摘要:Accurate perception is critical for vehicle safety, with LiDAR as a key enabler in autonomous driving. To ensure robust performance across environments, sensor types, and weather conditions without costly re-annotation, domain generalization in LiDAR-based 3D semantic segmentation is essential. However, LiDAR annotations are often noisy due to sensor imperfections, occlusions, and human errors. Such noise degrades segmentation accuracy and is further amplified under domain shifts, threatening system reliability. While noisy-label learning is well-studied in images, its extension to 3D LiDAR segmentation under domain generalization remains largely unexplored, as the sparse and irregular structure of point clouds limits direct use of 2D methods. To address this gap, we introduce the novel task Domain Generalization for LiDAR Semantic Segmentation under Noisy Labels (DGLSS-NL) and establish the first benchmark by adapting three representative noisy-label learning strategies from image classification to 3D segmentation. However, we find that existing noisy-label learning approaches adapt poorly to LiDAR data. We therefore propose DuNe, a dual-view framework with strong and weak branches that enforce feature-level consistency and apply cross-entropy loss based on confidence-aware filtering of predictions. Our approach shows state-of-the-art performance by achieving 56.86% mIoU on SemanticKITTI, 42.28% on nuScenes, and 52.58% on SemanticPOSS under 10% symmetric label noise, with an overall Arithmetic Mean (AM) of 49.57% and Harmonic Mean (HM) of 48.50%, thereby demonstrating robust domain generalization in DGLSS-NL tasks. The code is available on our project page.


推理|分析|理解|解释(14篇)

【1】Efficient Bayesian Inference from Noisy Pairwise Comparisons
标题:从含噪成对比较中进行高效贝叶斯推理
链接:https://arxiv.org/abs/2510.09333

作者:Till Aczel, Lucas Theis, Roger Wattenhofer
摘要:评估生成模型具有挑战性,因为标准指标往往无法反映人类偏好。人工评估更可靠,但成本高且有噪声,因为参与者的专业知识、注意力和勤勉程度各不相同。成对比较提高了一致性,但将其汇总为整体质量分数需要仔细建模。基于Bradley-Terry的方法根据比较结果更新条目分数,但现有方法要么忽略评分者的可变性,要么缺乏收敛保证,限制了鲁棒性和可解释性。我们引入BBQ,一种贝叶斯Bradley-Terry变体,它显式建模评分者质量,对不可靠的参与者进行降权或剔除,并通过期望最大化(EM)算法提供单调似然收敛保证。实证结果表明,即使评分者有噪声或来自众包,与基线Bradley-Terry模型相比,BBQ也能实现更快的收敛、校准良好的不确定性估计,以及更鲁棒、可解释的排名。该框架使生成模型的人工评估更可靠且更具成本效益。
摘要:Evaluating generative models is challenging because standard metrics often fail to reflect human preferences. Human evaluations are more reliable but costly and noisy, as participants vary in expertise, attention, and diligence. Pairwise comparisons improve consistency, yet aggregating them into overall quality scores requires careful modeling. Bradley-Terry-based methods update item scores from comparisons, but existing approaches either ignore rater variability or lack convergence guarantees, limiting robustness and interpretability. We introduce BBQ, a Bayesian Bradley-Terry variant that explicitly models rater quality, downweighting or removing unreliable participants, and provides guaranteed monotonic likelihood convergence through an Expectation-Maximization algorithm. Empirical results show that BBQ achieves faster convergence, well-calibrated uncertainty estimates, and more robust, interpretable rankings compared to baseline Bradley-Terry models, even with noisy or crowdsourced raters. This framework enables more reliable and cost-effective human evaluation of generative models.
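For context, the classical Bradley-Terry fit that BBQ builds on can be written in a few lines using the standard MM update (Hunter, 2004); BBQ's rater-quality variables and EM steps sit on top of this. A sketch that assumes all raters are equally reliable:

```python
import numpy as np

def bradley_terry(wins, iters=200):
    """wins[i, j] = number of times item i beat item j."""
    n = wins.shape[0]
    w = np.ones(n)
    for _ in range(iters):
        pair_total = wins + wins.T                    # comparisons per pair
        denom = (pair_total / (w[:, None] + w[None, :])).sum(axis=1)
        w = wins.sum(axis=1) / denom                  # MM update (Hunter, 2004)
        w /= w.sum()                                  # fix the overall scale
    return w  # latent quality scores; P(i beats j) = w[i] / (w[i] + w[j])
```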


【2】Prime Implicant Explanations for Reaction Feasibility Prediction
标题:反应可行性预测的主蕴涵项解释
链接:https://arxiv.org/abs/2510.09226

作者:Klaus Weinbauer, Tieu-Long Phan, Peter F. Stadler, Thomas Gärtner, Sagar Malhotra
备注:Presented at AIMLAI workshop at ECMLPKDD 2025
摘要:预测化学反应可行性的机器学习模型已成为自动合成规划的核心。尽管这些模型在预测上很成功,但它们往往缺乏透明度和可解释性。我们针对该领域提出了一种新的主蕴涵项解释(也称为最小充分理由)的形式化表述,并提出了一种在小规模反应预测任务中计算此类解释的算法。初步实验表明,我们的主蕴涵项解释概念能够保守地捕捉真实解释。也就是说,这些解释通常包含冗余的化学键和原子,但始终能捕捉到预测反应可行性所必需的分子属性。
摘要 :Machine learning models that predict the feasibility of chemical reactions have become central to automated synthesis planning. Despite their predictive success, these models often lack transparency and interpretability. We introduce a novel formulation of prime implicant explanations--also known as minimally sufficient reasons--tailored to this domain, and propose an algorithm for computing such explanations in small-scale reaction prediction tasks. Preliminary experiments demonstrate that our notion of prime implicant explanations conservatively captures the ground truth explanations. That is, such explanations often contain redundant bonds and atoms but consistently capture the molecular attributes that are essential for predicting reaction feasibility.


【3】RepDL: Bit-level Reproducible Deep Learning Training and Inference
标题:RepDL:位级可重复的深度学习训练和推理
链接:https://arxiv.org/abs/2510.09180

作者:Peichen Xie, Xian Zhang, Shuo Chen
备注:Originally drafted in 2023
摘要:非确定性和不可再现性给深度学习带来了重大挑战,导致运行和平台之间的结果不一致。这些问题源于两个起源:随机数生成和浮点计算。虽然可以通过确定性配置控制随机性,但浮点不一致性在很大程度上仍未得到解决。为了解决这个问题,我们引入了RepDL,这是一个开源库,可以确保在不同的计算环境中进行确定性和按位可再现的深度学习训练和推理。RepDL通过在浮点计算中强制执行正确的舍入和顺序不变性来实现这一点。源代码可在https://github.com/microsoft/RepDL上获得。
摘要:Non-determinism and non-reproducibility present significant challenges in deep learning, leading to inconsistent results across runs and platforms. These issues stem from two origins: random number generation and floating-point computation. While randomness can be controlled through deterministic configurations, floating-point inconsistencies remain largely unresolved. To address this, we introduce RepDL, an open-source library that ensures deterministic and bitwise-reproducible deep learning training and inference across diverse computing environments. RepDL achieves this by enforcing correct rounding and order invariance in floating-point computation. The source code is available at https://github.com/microsoft/RepDL .
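The order-invariance problem RepDL targets is easy to demonstrate in plain Python: floating-point addition is not associative, so any reduction whose operand order varies across runs or devices can drift at the bit level, and fixing one canonical order removes this source of non-reproducibility:

```python
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))    # False: rounding differs by order

import random
xs = [random.random() for _ in range(10_000)]
shuffled = random.sample(xs, len(xs))
print(sum(xs) == sum(shuffled))      # frequently False in float64

# A fixed, input-independent order (here: summing after a canonical sort)
# makes the reduction bitwise reproducible across permutations:
print(sum(sorted(xs)) == sum(sorted(shuffled)))  # True
```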


【4】PAC Reasoning: Controlling the Performance Loss for Efficient Reasoning
标题:PAC推理:控制性能损失以实现高效推理
链接:https://arxiv.org/abs/2510.09133

作者:Hao Zeng, Jianguo Huang, Bingyi Jing, Hongxin Wei, Bo An
摘要:大型推理模型(LRM)在复杂问题求解任务中取得了显著进展。尽管如此,LRM在部署期间通常面临高昂的计算成本,凸显了对高效推理的需求。一个流行的效率改进方向是在思考与非思考模式之间动态切换。然而,这类方法往往会引入额外的推理错误,并且缺乏对性能损失的统计保证,而这对高风险应用至关重要。在本文中,我们提出可能近似正确(PAC)推理,将性能损失控制在用户指定的容忍度之内。具体而言,我们构造性能损失的置信上界,将其表述为不确定性得分的单调函数,进而确定切换到非思考模型的阈值。理论上,使用该阈值在思考与非思考模式之间切换,能以无分布假设的方式保证性能损失有界。在推理基准上的综合实验表明,所提方法可以节省计算预算,并控制用户指定的性能损失。
摘要:Large reasoning models (LRMs) have achieved remarkable progress in complex problem-solving tasks. Despite this success, LRMs typically suffer from high computational costs during deployment, highlighting a need for efficient inference. A popular direction of efficiency improvement is to switch the LRM between thinking and nonthinking modes dynamically. However, such approaches often introduce additional reasoning errors and lack statistical guarantees for the performance loss, which are critical for high-stakes applications. In this work, we propose Probably Approximately Correct (PAC) reasoning that controls the performance loss under the user-specified performance loss tolerance. In particular, we construct an upper confidence bound on the performance loss, formulated as a monotone function of the uncertainty score, and subsequently determine a threshold for switching to the nonthinking model. Theoretically, using the threshold to switch between the thinking and nonthinking modes ensures bounded performance loss in a distribution-free manner. Our comprehensive experiments on reasoning benchmarks show that the proposed method can save computational budgets and control the user-specified performance loss.
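As a rough illustration (not the paper's exact construction), the threshold selection can be sketched with a Hoeffding-style upper confidence bound on the loss incurred by routing low-uncertainty queries to the non-thinking model, calibrated on held-out examples:

```python
import numpy as np

def pick_threshold(scores, losses, tol=0.05, delta=0.1):
    """scores: uncertainty of the non-thinking answer per calibration example;
    losses: performance loss in [0, 1] when that example is routed there."""
    best = -np.inf
    for t in np.unique(scores):
        routed = losses[scores <= t]          # examples sent to non-thinking
        if len(routed) == 0:
            continue
        # Hoeffding upper confidence bound on the mean routed loss.
        ucb = routed.mean() + np.sqrt(np.log(1 / delta) / (2 * len(routed)))
        if ucb <= tol:
            best = max(best, t)
    return best  # at test time: use the non-thinking model when score <= best
```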


【5】Slim Scheduler: A Runtime-Aware RL and Scheduler System for Efficient CNN Inference
标题:Slim Scheduler:一个用于高效CNN推理的运行时感知强化学习与调度系统
链接:https://arxiv.org/abs/2510.09018

作者:Ian Harshbarger, Calvin Chidambaram
摘要:大多数神经网络调度研究都集中在优化固定宽度的静态端到端模型,而忽略了适应异构硬件和波动运行时条件的动态方法。我们提出了Slim Scheduler,一个混合调度框架,它将近端策略优化(PPO)强化学习策略与贪婪算法调度器相结合,以协调可瘦身(slimmable)模型的分布式推理。每个服务器运行一个本地贪婪调度器,该调度器批量处理兼容的请求,并根据VRAM和利用率约束管理实例扩展,而PPO路由器则学习设备选择、宽度比和批量配置的全局路由策略。这种分层设计降低了搜索空间的复杂性,减轻了对特定硬件的过拟合,并在效率和吞吐量之间取得平衡。与纯随机任务分配基线相比,Slim Scheduler可以实现多种精度与延迟的权衡,例如:平均延迟降低96.45%、能耗降低97.31%,此时精度降至最瘦模型的水平(70.3%);它也可以在提高精度的同时实现平均延迟与能耗之和的总体下降,代价是延迟和能耗的标准差变大,从而影响总体任务吞吐量。
摘要:Most neural network scheduling research focuses on optimizing static, end-to-end models of fixed width, overlooking dynamic approaches that adapt to heterogeneous hardware and fluctuating runtime conditions. We present Slim Scheduler, a hybrid scheduling framework that integrates a Proximal Policy Optimization (PPO) reinforcement learning policy with algorithmic, greedy schedulers to coordinate distributed inference for slimmable models. Each server runs a local greedy scheduler that batches compatible requests and manages instance scaling based on VRAM and utilization constraints, while the PPO router learns global routing policies for device selection, width ratio, and batch configuration. This hierarchical design reduces search space complexity, mitigates overfitting to specific hardware, and balances efficiency and throughput. Compared to a purely randomized task distribution baseline, Slim Scheduler can achieve various accuracy and latency trade-offs such as: A 96.45% reduction in mean latency and a 97.31% reduction in energy usage dropping accuracy to the slimmest model available (70.3%). It can then accomplish an overall reduction in average latency plus energy consumption with an increase in accuracy at the cost of higher standard deviations of said latency and energy, effecting overall task throughput.


【6】Analytical Survey of Learning with Low-Resource Data: From Analysis to Investigation
标题:低资源数据学习的分析调查:从分析到调查
链接:https://arxiv.org/abs/2510.08962

作者:Xiaofeng Cao, Mingwei Xu, Xin Yu, Jiangchao Yao, Wei Ye, Shengjun Huang, Minling Zhang, Ivor W. Tsang, Yew Soon Ong, James T. Kwok, Heng Tao Shen
备注:Accepted by ACM Computing Surveys
摘要:使用高资源数据进行学习已经在人工智能(AI)领域取得了巨大的成功;然而,与数据注释和模型训练相关的成本仍然很高。人工智能研究的一个基本目标是利用有限资源的数据实现鲁棒的泛化。这项调查采用了可能近似正确(PAC)框架内的不可知主动抽样理论,以分析与模型不可知监督和无监督设置中的低资源数据学习相关的泛化错误和标签复杂性。基于这种分析,我们研究了一套为低资源数据学习量身定制的优化策略,包括梯度优化、元迭代优化、几何感知优化和LLMs优化。此外,我们提供了一个全面的概述,可以受益于低资源的数据,包括域转移,强化反馈和层次结构建模的多种学习范式。最后,我们总结了我们的分析和调查,总结了主要的发现,并强调了它们对低资源数据学习的影响。
摘要:Learning with high-resource data has demonstrated substantial success in artificial intelligence (AI); however, the costs associated with data annotation and model training remain significant. A fundamental objective of AI research is to achieve robust generalization with limited-resource data. This survey employs agnostic active sampling theory within the Probably Approximately Correct (PAC) framework to analyze the generalization error and label complexity associated with learning from low-resource data in both model-agnostic supervised and unsupervised settings. Based on this analysis, we investigate a suite of optimization strategies tailored for low-resource data learning, including gradient-informed optimization, meta-iteration optimization, geometry-aware optimization, and LLMs-powered optimization. Furthermore, we provide a comprehensive overview of multiple learning paradigms that can benefit from low-resource data, including domain transfer, reinforcement feedback, and hierarchical structure modeling. Finally, we conclude our analysis and investigation by summarizing the key findings and highlighting their implications for learning with low-resource data.


【7】MATT-CTR: Unleashing a Model-Agnostic Test-Time Paradigm for CTR Prediction with Confidence-Guided Inference Paths
标题:MATT-CTR:利用置信度引导的推理路径释放模型无关的测试时范式以进行CTR预测
链接:https://arxiv.org/abs/2510.08932

作者:Moyu Zhang, Yun Chen, Yujun Jin, Jinxin Hu, Yu Zhang, Xiaoyi Zeng
备注:10 pages, 4 figures, 2 tables
摘要:Recently, a growing body of research has focused on either optimizing CTR model architectures to better model feature interactions or refining training objectives to aid parameter learning, thereby achieving better predictive performance. However, previous efforts have primarily focused on the training phase, largely neglecting opportunities for optimization during the inference phase. Infrequently occurring feature combinations, in particular, can degrade prediction performance, leading to unreliable or low-confidence outputs. To unlock the predictive potential of trained CTR models, we propose a Model-Agnostic Test-Time paradigm (MATT), which leverages the confidence scores of feature combinations to guide the generation of multiple inference paths, thereby mitigating the influence of low-confidence features on the final prediction. Specifically, to quantify the confidence of feature combinations, we introduce a hierarchical probabilistic hashing method to estimate the occurrence frequencies of feature combinations at various orders, which serve as their corresponding confidence scores. Then, using the confidence scores as sampling probabilities, we generate multiple instance-specific inference paths through iterative sampling and subsequently aggregate the prediction scores from multiple paths to conduct robust predictions. Finally, extensive offline experiments and online A/B tests strongly validate the compatibility and effectiveness of MATT across existing CTR models.
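A schematic of the test-time loop as described: frequency-derived confidence scores drive the sampling of multiple inference paths whose predictions are aggregated. All names here (model.predict, freq) are hypothetical stand-ins, and the hierarchical probabilistic hashing is reduced to a plain count table:

```python
import numpy as np

def matt_predict(model, features, freq, n_paths=8, seed=0):
    """features: feature ids of one instance; freq[f]: occurrence count used
    as a confidence proxy (the paper estimates this via hierarchical
    probabilistic hashing over feature combinations)."""
    rng = np.random.default_rng(seed)
    conf = np.array([freq.get(f, 0) for f in features], dtype=float)
    p = conf / conf.max()                 # confidence -> keep probability
    preds = []
    for _ in range(n_paths):
        keep = rng.random(len(features)) < p   # drop low-confidence features
        path = [f for f, k in zip(features, keep) if k]
        preds.append(model.predict(path))      # one instance-specific path
    return float(np.mean(preds))               # aggregate across paths
```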


【8】Velocity and Density-Aware RRI Analysis and Optimization for AoI Minimization in IoV SPS
标题:面向IoV SPS中AoI最小化的速度与密度感知RRI分析与优化
链接:https://arxiv.org/abs/2510.08911

作者:Maoxin Ji, Tong Wang, Qiong Wu, Pingyi Fan, Nan Cheng, Wen Chen
备注:This paper has been submitted to IEEE Communications Letters
摘要:Addressing the problem of Age of Information (AoI) deterioration caused by packet collisions and vehicle speed-related channel uncertainties in Semi-Persistent Scheduling (SPS) for the Internet of Vehicles (IoV), this letter proposes an optimization approach based on Large Language Models (LLM) and Deep Deterministic Policy Gradient (DDPG). First, an AoI calculation model influenced by vehicle speed, vehicle density, and Resource Reservation Interval (RRI) is established, followed by the design of a dual-path optimization scheme. The DDPG is guided by the state space and reward function, while the LLM leverages contextual learning to generate optimal parameter configurations. Experimental results demonstrate that LLM can significantly reduce AoI after accumulating a small number of exemplars without requiring model training, whereas the DDPG method achieves more stable performance after training.


【9】A Frequency-Domain Analysis of the Multi-Armed Bandit Problem: A New Perspective on the Exploration-Exploitation Trade-off
标题:多臂强盗问题的频域分析:探索-利用权衡的新视角
链接:https://arxiv.org/abs/2510.08908

作者:Di Zhang
备注:6 pages
摘要:The stochastic multi-armed bandit (MAB) problem is one of the most fundamental models in sequential decision-making, with the core challenge being the trade-off between exploration and exploitation. Although algorithms such as Upper Confidence Bound (UCB) and Thompson Sampling, along with their regret theories, are well-established, existing analyses primarily operate from a time-domain and cumulative regret perspective, struggling to characterize the dynamic nature of the learning process. This paper proposes a novel frequency-domain analysis framework, reformulating the bandit process as a signal processing problem. Within this framework, the reward estimate of each arm is viewed as a spectral component, with its uncertainty corresponding to the component's frequency, and the bandit algorithm is interpreted as an adaptive filter. We construct a formal Frequency-Domain Bandit Model and prove the main theorem: the confidence bound term in the UCB algorithm is equivalent in the frequency domain to a time-varying gain applied to uncertain spectral components, a gain inversely proportional to the square root of the visit count. Based on this, we further derive finite-time dynamic bounds concerning the exploration rate decay. This theory not only provides a novel and intuitive physical interpretation for classical algorithms but also lays a rigorous theoretical foundation for designing next-generation algorithms with adaptive parameter adjustment.
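The "time-varying gain" in the main theorem is the familiar UCB bonus, inversely proportional to the square root of the visit count. A compact UCB1 loop for reference:

```python
import numpy as np

def ucb1(means, horizon=10_000, seed=0):
    rng = np.random.default_rng(seed)
    k = len(means)
    counts = np.ones(k)
    sums = rng.normal(means, 1.0)              # one initial pull per arm
    for t in range(k, horizon):
        # Bonus ~ sqrt(1/n_i): the gain on uncertain "spectral components"
        # shrinks as an arm's visit count grows.
        bonus = np.sqrt(2 * np.log(t) / counts)
        a = int(np.argmax(sums / counts + bonus))
        counts[a] += 1
        sums[a] += rng.normal(means[a], 1.0)   # noisy reward
    return counts                               # concentrates on the best arm

print(ucb1([0.1, 0.5, 0.9]))
```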


【10】Spatial Deconfounder: Interference-Aware Deconfounding for Spatial Causal Inference
标题:空间去混杂器:面向空间因果推理的干扰感知去混杂
链接:https://arxiv.org/abs/2510.08762

作者:Ayush Khot, Miruna Oprescu, Maresa Schröder, Ai Kagawa, Xihaier Luo
备注:24 pages, 3 figures, 6 tables
摘要 :Causal inference in spatial domains faces two intertwined challenges: (1) unmeasured spatial factors, such as weather, air pollution, or mobility, that confound treatment and outcome, and (2) interference from nearby treatments that violate standard no-interference assumptions. While existing methods typically address one by assuming away the other, we show they are deeply connected: interference reveals structure in the latent confounder. Leveraging this insight, we propose the Spatial Deconfounder, a two-stage method that reconstructs a substitute confounder from local treatment vectors using a conditional variational autoencoder (CVAE) with a spatial prior, then estimates causal effects via a flexible outcome model. We show that this approach enables nonparametric identification of both direct and spillover effects under weak assumptions--without requiring multiple treatment types or a known model of the latent field. Empirically, we extend SpaCE, a benchmark suite for spatial confounding, to include treatment interference, and show that the Spatial Deconfounder consistently improves effect estimation across real-world datasets in environmental health and social science. By turning interference into a multi-cause signal, our framework bridges spatial and deconfounding literatures to advance robust causal inference in structured data.


【11】Faithful and Interpretable Explanations for Complex Ensemble Time Series Forecasts using Surrogate Models and Forecastability Analysis
标题:使用代理模型与可预测性分析为复杂集成时间序列预测提供忠实且可解释的解释
链接:https://arxiv.org/abs/2510.08739

作者:Yikai Zhao, Jiekai Ma
摘要:Modern time series forecasting increasingly relies on complex ensemble models generated by AutoML systems like AutoGluon, delivering superior accuracy but with significant costs to transparency and interpretability. This paper introduces a comprehensive, dual-approach framework that addresses both the explainability and forecastability challenges in complex time series ensembles. First, we develop a surrogate-based explanation methodology that bridges the accuracy-interpretability gap by training a LightGBM model to faithfully mimic AutoGluon's time series forecasts, enabling stable SHAP-based feature attributions. We rigorously validated this approach through feature injection experiments, demonstrating remarkably high faithfulness between extracted SHAP values and known ground truth effects. Second, we integrated spectral predictability analysis to quantify each series' inherent forecastability. By comparing each time series' spectral predictability to its pure noise benchmarks, we established an objective mechanism to gauge confidence in forecasts and their explanations. Our empirical evaluation on the M5 dataset found that higher spectral predictability strongly correlates not only with improved forecast accuracy but also with higher fidelity between the surrogate and the original forecasting model. These forecastability metrics serve as effective filtering mechanisms and confidence scores, enabling users to calibrate their trust in both the forecasts and their explanations. We further demonstrated that per-item normalization is essential for generating meaningful SHAP explanations across heterogeneous time series with varying scales. The resulting framework delivers interpretable, instance-level explanations for state-of-the-art ensemble forecasts, while equipping users with forecastability metrics that serve as reliability indicators for both predictions and their explanations.
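A condensed sketch of the two pieces, assuming the standard lightgbm/shap APIs and a simple power-spectrum concentration score for forecastability (the paper's exact metric may differ):

```python
import numpy as np
import lightgbm as lgb
import shap

def explain_forecasts(X, ensemble_preds):
    """Fit a LightGBM surrogate to mimic the ensemble's forecasts, then
    compute stable SHAP attributions on the surrogate."""
    surrogate = lgb.LGBMRegressor(n_estimators=300).fit(X, ensemble_preds)
    shap_values = shap.TreeExplainer(surrogate).shap_values(X)
    return surrogate, shap_values

def spectral_predictability(y, top_k=5):
    """Fraction of spectral power in the top-k frequencies: higher values
    suggest a more forecastable series (white noise scores low)."""
    power = np.abs(np.fft.rfft(y - np.mean(y))) ** 2
    power = power[1:]                     # drop the DC component
    return np.sort(power)[::-1][:top_k].sum() / power.sum()
```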


【12】The Enduring Dominance of Deep Neural Networks: A Critical Analysis of the Fundamental Limitations of Quantum Machine Learning and Spiking Neural Networks
标题:深度神经网络的持久主导地位:对量子机器学习和尖峰神经网络基本局限性的批判性分析
链接:https://arxiv.org/abs/2510.08591

作者:Takehiro Ishikawa
摘要:Recent advancements in QML and SNNs have generated considerable excitement, promising exponential speedups and brain-like energy efficiency to revolutionize AI. However, this paper argues that they are unlikely to displace DNNs in the near term. QML struggles with adapting backpropagation due to unitary constraints, measurement-induced state collapse, barren plateaus, and high measurement overheads, exacerbated by the limitations of current noisy intermediate-scale quantum hardware, overfitting risks due to underdeveloped regularization techniques, and a fundamental misalignment with machine learning's generalization. SNNs face restricted representational bandwidth, struggling with long-range dependencies and semantic encoding in language tasks due to their discrete, spike-based processing. Furthermore, the goal of faithfully emulating the brain might impose inherent inefficiencies like cognitive biases, limited working memory, and slow learning speeds. Even their touted energy-efficient advantages are overstated; optimized DNNs with quantization can outperform SNNs in energy costs under realistic conditions. Finally, SNN training incurs high computational overhead from temporal unfolding. In contrast, DNNs leverage efficient backpropagation, robust regularization, and innovations in LRMs that shift scaling to inference-time compute, enabling self-improvement via RL and search algorithms like MCTS while mitigating data scarcity. This superiority is evidenced by recent models such as xAI's Grok-4 Heavy, which advances SOTA performance, and gpt-oss-120b, which surpasses or approaches the performance of leading industry models despite its modest 120-billion-parameter size deployable on a single 80GB GPU. Furthermore, specialized ASICs amplify these efficiency gains. Ultimately, QML and SNNs may serve niche hybrid roles, but DNNs remain the dominant, practical paradigm for AI advancement.


【13】Conditional Flow Matching for Bayesian Posterior Inference
标题:用于Bayesian后验推理的条件流匹配
链接:https://arxiv.org/abs/2510.09534

作者:So Won Jeong, Percy S. Zhai, Veronika Ročová
摘要:We propose a generative multivariate posterior sampler via flow matching. It offers a simple training objective, and does not require access to likelihood evaluation. The method learns a dynamic, block-triangular velocity field in the joint space of data and parameters, which results in a deterministic transport map from a source distribution to the desired posterior. The inverse map, named vector rank, is accessible by reversibly integrating the velocity over time. It is advantageous to leverage the dynamic design: proper constraints on the velocity yield a monotone map, which leads to a conditional Brenier map, enabling a fast and simultaneous generation of Bayesian credible sets whose contours correspond to level sets of Monge-Kantorovich data depth. Our approach is computationally lighter compared to GAN-based and diffusion-based counterparts, and is capable of capturing complex posterior structures. Finally, frequentist theoretical guarantee on the consistency of the recovered posterior distribution, and of the corresponding Bayesian credible sets, is provided.


【14】Understanding Exoplanet Habitability: A Bayesian ML Framework for Predicting Atmospheric Absorption Spectra
标题:了解系外行星宜居性:预测大气吸收光谱的Bayesian ML框架
链接:https://arxiv.org/abs/2510.08766

作者:Vasuda Trehan, Kevin H. Knuth, M. J. Way
摘要:The evolution of space technology in recent years, fueled by advancements in computing such as Artificial Intelligence (AI) and machine learning (ML), has profoundly transformed our capacity to explore the cosmos. Missions like the James Webb Space Telescope (JWST) have made information about distant objects more easily accessible, resulting in extensive amounts of valuable data. As part of this work-in-progress study, we are working to create an atmospheric absorption spectrum prediction model for exoplanets. The eventual model will be based on both collected observational spectra and synthetic spectral data generated by the ROCKE-3D general circulation model (GCM) developed by the climate modeling program at NASA's Goddard Institute for Space Studies (GISS). In this initial study, spline curves are used to describe the bin heights of simulated atmospheric absorption spectra as a function of one of the values of the planetary parameters. Bayesian Adaptive Exploration is then employed to identify areas of the planetary parameter space for which more data are needed to improve the model. The resulting system will be used as a forward model so that planetary parameters can be inferred given a planet's atmospheric absorption spectrum. This work is expected to contribute to a better understanding of exoplanetary properties and general exoplanet climates and habitability.


检测相关(3篇)

【1】Improving Anomaly Detection in Industrial Time Series: The Role of Segmentation and Heterogeneous Ensemble
标题:改进工业时间序列中的异常检测:分割与异构集成的作用
链接:https://arxiv.org/abs/2510.09079

作者:Emilio Mastriani, Alessandro Costa, Federico Incardona, Kevin Munari, Sebastiano Spinello
备注:Conference paper. Under publication process at CODIT 2025
摘要:Concerning machine learning, segmentation models can identify state changes within time series, facilitating the detection of transitions between normal and anomalous conditions. Specific techniques such as Change Point Detection (CPD), particularly algorithms like ChangeFinder, have been successfully applied to segment time series and improve anomaly detection by reducing temporal uncertainty, especially in multivariate environments. In this work, we explored how the integration of segmentation techniques, combined with a heterogeneous ensemble, can enhance anomaly detection in an industrial production context. The results show that applying segmentation as a pre-processing step before selecting heterogeneous ensemble algorithms provided a significant advantage in our case study, improving the AUC-ROC metric from 0.8599 (achieved with a PCA and LSTM ensemble) to 0.9760 (achieved with Random Forest and XGBoost). This improvement is imputable to the ability of segmentation to reduce temporal ambiguity and facilitate the learning process of supervised algorithms. In our future work, we intend to assess the benefit of introducing weighted features derived from the study of change points, combined with segmentation and the use of heterogeneous ensembles, to further optimize model performance in early anomaly detection.
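A sketch of the pipeline using the ruptures library for change-point segmentation (the paper uses ChangeFinder; PELT is substituted here) and a simple two-model heterogeneous ensemble (sklearn's gradient boosting stands in for XGBoost):

```python
import numpy as np
import ruptures as rpt
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

def segment_then_ensemble(X, y, signal):
    # 1) Segment the series; the segment id becomes an extra feature that
    #    localizes each sample between detected state changes.
    bkps = rpt.Pelt(model="rbf").fit(signal).predict(pen=10)
    seg_id = np.searchsorted(bkps, np.arange(len(signal)))
    Xs = np.column_stack([X, seg_id])

    # 2) Heterogeneous ensemble: average probabilities from two model
    #    families. New data must carry the same segment-id column.
    rf = RandomForestClassifier(n_estimators=200).fit(Xs, y)
    gb = GradientBoostingClassifier().fit(Xs, y)
    return lambda Xn: (rf.predict_proba(Xn)[:, 1]
                       + gb.predict_proba(Xn)[:, 1]) / 2
```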


【2】Detecting spills using thermal imaging, pretrained deep learning models, and a robotic platform
标题:使用热成像、预训练的深度学习模型和机器人平台检测泄漏
链接:https://arxiv.org/abs/2510.08770

作者:Gregory Yeghiyan, Jurius Azar, Devson Butani, Chan-Jin Chung
备注:6 pages
摘要:This paper presents a real-time spill detection system that utilizes pretrained deep learning models with RGB and thermal imaging to classify spill vs. no-spill scenarios across varied environments. Using a balanced binary dataset (4,000 images), our experiments demonstrate the advantages of thermal imaging in inference speed, accuracy, and model size. We achieve up to 100% accuracy using lightweight models like VGG19 and NasNetMobile, with thermal models performing faster and more robustly across different lighting conditions. Our system runs on consumer-grade hardware (RTX 4080) and achieves inference times as low as 44 ms with model sizes under 350 MB, highlighting its deployability in safety-critical contexts. Results from experiments with a real robot and test datasets indicate that a VGG19 model trained on thermal imaging performs best.


【3】RFOD: Random Forest-based Outlier Detection for Tabular Data
标题:RFOD:基于随机森林的表格数据离群值检测
链接:https://arxiv.org/abs/2510.08747

作者:Yihao Ang, Peicheng Yao, Yifan Bao, Yushuo Feng, Qiang Huang, Anthony K. H. Tung, Zhiyong Huang
备注:13 pages, 13 figures, and 4 tables
摘要:Outlier detection in tabular data is crucial for safeguarding data integrity in high-stakes domains such as cybersecurity, financial fraud detection, and healthcare, where anomalies can cause serious operational and economic impacts. Despite advances in both data mining and deep learning, many existing methods struggle with mixed-type tabular data, often relying on encoding schemes that lose important semantic information. Moreover, they frequently lack interpretability, offering little insight into which specific values cause anomalies. To overcome these challenges, we introduce \textsf{\textbf{RFOD}}, a novel \textsf{\textbf{R}}andom \textsf{\textbf{F}}orest-based \textsf{\textbf{O}}utlier \textsf{\textbf{D}}etection framework tailored for tabular data. Rather than modeling a global joint distribution, \textsf{RFOD} reframes anomaly detection as a feature-wise conditional reconstruction problem, training dedicated random forests for each feature conditioned on the others. This design robustly handles heterogeneous data types while preserving the semantic integrity of categorical features. To further enable precise and interpretable detection, \textsf{RFOD} combines Adjusted Gower's Distance (AGD) for cell-level scoring, which adapts to skewed numerical data and accounts for categorical confidence, with Uncertainty-Weighted Averaging (UWA) to aggregate cell-level scores into robust row-level anomaly scores. Extensive experiments on 15 real-world datasets demonstrate that \textsf{RFOD} consistently outperforms state-of-the-art baselines in detection accuracy while offering superior robustness, scalability, and interpretability for mixed-type tabular data.
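The core feature-wise conditional reconstruction is straightforward to sketch for numeric features; the paper additionally handles categorical features, AGD cell scoring, and uncertainty weighting, which are omitted here:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rfod_scores(X):
    """X: (n, d) numeric array. One forest per feature, conditioned on the
    others; large reconstruction errors flag anomalous cells and rows."""
    n, d = X.shape
    cell_err = np.zeros_like(X, dtype=float)
    for j in range(d):
        rest = np.delete(X, j, axis=1)
        rf = RandomForestRegressor(n_estimators=100).fit(rest, X[:, j])
        resid = np.abs(X[:, j] - rf.predict(rest))
        cell_err[:, j] = resid / (resid.std() + 1e-9)  # per-feature scale
    # Simple mean aggregation; the paper replaces this with AGD + UWA.
    return cell_err.mean(axis=1)
```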


分类|识别(7篇)

【1】SilvaScenes: Tree Segmentation and Species Classification from Under-Canopy Images in Natural Forests
标题:SilvaScenes:天然森林树冠下图像的树木分割和物种分类
链接:https://arxiv.org/abs/2510.09458

作者:David-Alexandre Duclos, William Guimont-Martin, Gabriel Jeanson, Arthur Larochelle-Tremblay, Théo Defosse, Frédéric Moore, Philippe Nolet, François Pomerleau, Philippe Giguère
备注:8 pages, 5 figures
摘要:Interest in robotics for forest management is growing, but perception in complex, natural environments remains a significant hurdle. Conditions such as heavy occlusion, variable lighting, and dense vegetation pose challenges to automated systems, which are essential for precision forestry, biodiversity monitoring, and the automation of forestry equipment. These tasks rely on advanced perceptual capabilities, such as detection and fine-grained species classification of individual trees. Yet, existing datasets are inadequate to develop such perception systems, as they often focus on urban settings or a limited number of species. To address this, we present SilvaScenes, a new dataset for instance segmentation of tree species from under-canopy images. Collected across five bioclimatic domains in Quebec, Canada, SilvaScenes features 1476 trees from 24 species with annotations from forestry experts. We demonstrate the relevance and challenging nature of our dataset by benchmarking modern deep learning approaches for instance segmentation. Our results show that, while tree segmentation is easy, with a top mean average precision (mAP) of 67.65%, species classification remains a significant challenge with an mAP of only 35.69%. Our dataset and source code will be available at https://github.com/norlab-ulaval/SilvaScenes.


【2】On Uniformly Scaling Flows: A Density-Aligned Approach to Deep One-Class Classification
标题:关于均匀缩放流:深度一类分类的密度对齐方法
链接:https://arxiv.org/abs/2510.09452

作者:Faried Abu Zaid, Tim Katzke, Emmanuel Müller, Daniel Neider
摘要:Unsupervised anomaly detection is often framed around two widely studied paradigms. Deep one-class classification, exemplified by Deep SVDD, learns compact latent representations of normality, while density estimators realized by normalizing flows directly model the likelihood of nominal data. In this work, we show that uniformly scaling flows (USFs), normalizing flows with a constant Jacobian determinant, precisely connect these approaches. Specifically, we prove how training a USF via maximum-likelihood reduces to a Deep SVDD objective with a unique regularization that inherently prevents representational collapse. This theoretical bridge implies that USFs inherit both the density faithfulness of flows and the distance-based reasoning of one-class methods. We further demonstrate that USFs induce a tighter alignment between negative log-likelihood and latent norm than either Deep SVDD or non-USFs, and how recent hybrid approaches combining one-class objectives with VAEs can be naturally extended to USFs. Consequently, we advocate using USFs as a drop-in replacement for non-USFs in modern anomaly detection architectures. Empirically, this substitution yields consistent performance gains and substantially improved training stability across multiple benchmarks and model backbones for both image-level and pixel-level detection. These results unify two major anomaly detection paradigms, advancing both theoretical understanding and practical performance.
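The bridge can be seen directly in the change-of-variables formula. For a flow $f$ with a standard Gaussian base distribution, a sketch of the key step (not the paper's full statement):

```latex
-\log p_X(x) \;=\; \tfrac{1}{2}\,\lVert f(x)\rVert_2^2 \;+\; \tfrac{d}{2}\log(2\pi) \;-\; \log\bigl|\det J_f(x)\bigr|
```

For a USF, $|\det J_f| \equiv c$ is constant, so the last two terms drop out of the optimization and maximum likelihood reduces to minimizing $\mathbb{E}\,\lVert f(x)\rVert_2^2$, a Deep SVDD-style compactness objective around the origin, with the constant-determinant constraint supplying the regularization against collapse that the abstract describes.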


【3】Emotion-Disentangled Embedding Alignment for Noise-Robust and Cross-Corpus Speech Emotion Recognition
标题:面向噪声鲁棒与跨语料库语音情感识别的情感解耦嵌入对齐
链接:https://arxiv.org/abs/2510.09072

作者:Upasana Tiwari, Rupayan Chakraborty, Sunil Kumar Kopparapu
备注:13 pages, 1 figure
摘要:Effectiveness of speech emotion recognition in real-world scenarios is often hindered by noisy environments and variability across datasets. This paper introduces a two-step approach to enhance the robustness and generalization of speech emotion recognition models through improved representation learning. First, our model employs EDRL (Emotion-Disentangled Representation Learning) to extract class-specific discriminative features while preserving shared similarities across emotion categories. Next, MEA (Multiblock Embedding Alignment) refines these representations by projecting them into a joint discriminative latent subspace that maximizes covariance with the original speech input. The learned EDRL-MEA embeddings are subsequently used to train an emotion classifier using clean samples from publicly available datasets, and are evaluated on unseen noisy and cross-corpus speech samples. Improved performance under these challenging conditions demonstrates the effectiveness of the proposed method.


【4】Long-Tailed Recognition via Information-Preservable Two-Stage Learning
标题:通过信息可保留的两阶段学习进行长尾识别
链接:https://arxiv.org/abs/2510.08836

作者:Fudong Lin, Xu Yuan
备注:Accepted by NeurIPS 2025 as Spotlight
摘要:The imbalance (or long-tail) is the nature of many real-world data distributions, which often induces the undesirable bias of deep classification models toward frequent classes, resulting in poor performance for tail classes. In this paper, we propose a novel two-stage learning approach to mitigate such a majority-biased tendency while preserving valuable information within datasets. Specifically, the first stage proposes a new representation learning technique from the information theory perspective. This approach is theoretically equivalent to minimizing intra-class distance, yielding an effective and well-separated feature space. The second stage develops a novel sampling strategy that selects mathematically informative instances, able to rectify majority-biased decision boundaries without compromising a model's overall performance. As a result, our approach achieves the state-of-the-art performance across various long-tailed benchmark datasets, validated via extensive experiments. Our code is available at https://github.com/fudong03/BNS_IPDPP.


【5】CATS-Linear: Classification Auxiliary Linear Model for Time Series Forecasting
标题:CATS-Linear:时间序列预测的分类辅助线性模型
链接:https://arxiv.org/abs/2510.08661

作者:Zipo Jibao, Yingyi Fu, Xinyang Chen, Guoting Chen
摘要:Recent research demonstrates that linear models achieve forecasting performance competitive with complex architectures, yet methodologies for enhancing linear models remain underexplored. Motivated by the hypothesis that distinct time series instances may follow heterogeneous linear mappings, we propose the Classification Auxiliary Trend-Seasonal Decoupling Linear Model CATS-Linear, employing Classification Auxiliary Channel-Independence (CACI). CACI dynamically routes instances to dedicated predictors via classification, enabling supervised channel design. We further analyze the theoretical expected risks of different channel settings. Additionally, we redesign the trend-seasonal decomposition architecture by adding a decoupling -- linear mapping -- recoupling framework for trend components and complex-domain linear projections for seasonal components. Extensive experiments validate that CATS-Linear with fixed hyperparameters achieves state-of-the-art accuracy comparable to hyperparameter-tuned baselines while delivering SOTA accuracy against fixed-hyperparameter counterparts.


【6】Deep Sparse Representation-based Classification
标题:基于深度稀疏表示的分类
链接:https://arxiv.org/abs/1904.11093

作者:Mahdi Abavisani, Vishal M. Patel
摘要:We present a transductive deep learning-based formulation for the sparse representation-based classification (SRC) method. The proposed network consists of a convolutional autoencoder along with a fully-connected layer. The role of the autoencoder network is to learn robust deep features for classification. On the other hand, the fully-connected layer, which is placed in between the encoder and the decoder networks, is responsible for finding the sparse representation. The estimated sparse codes are then used for classification. Various experiments on three different datasets show that the proposed network leads to sparse representations that give better classification results than state-of-the-art SRC methods. The source code is available at: github.com/mahdiabavisani/DSRC.


【7】Improving the Performance of Unimodal Dynamic Hand-Gesture Recognition with Multimodal Training
标题:通过多模式训练提高单模式动态手势识别的性能
链接:https://arxiv.org/abs/1812.06145

作者:Mahdi Abavisani, Hamid Reza Vaezi Joze, Vishal M. Patel
摘要:We present an efficient approach for leveraging the knowledge from multiple modalities in training unimodal 3D convolutional neural networks (3D-CNNs) for the task of dynamic hand gesture recognition. Instead of explicitly combining multimodal information, which is commonplace in many state-of-the-art methods, we propose a different framework in which we embed the knowledge of multiple modalities in individual networks so that each unimodal network can achieve an improved performance. In particular, we dedicate separate networks per available modality and enforce them to collaborate and learn to develop networks with common semantics and better representations. We introduce a "spatiotemporal semantic alignment" loss (SSA) to align the content of the features from different networks. In addition, we regularize this loss with our proposed "focal regularization parameter" to avoid negative knowledge transfer. Experimental results show that our framework improves the test time recognition accuracy of unimodal networks, and provides the state-of-the-art performance on various dynamic hand gesture recognition datasets.


表征(2篇)

【1】MODE: Learning compositional representations of complex systems with Mixtures Of Dynamical Experts
标题:MODE:使用动态专家混合学习复杂系统的组合表示
链接:https://arxiv.org/abs/2510.09594

作者:Nathan Quiblier, Roy Friedman, Matthew Ricci
备注:30 pages, 5 figures
摘要:Dynamical systems in the life sciences are often composed of complex mixtures of overlapping behavioral regimes. Cellular subpopulations may shift from cycling to equilibrium dynamics or branch towards different developmental fates. The transitions between these regimes can appear noisy and irregular, posing a serious challenge to traditional, flow-based modeling techniques which assume locally smooth dynamics. To address this challenge, we propose MODE (Mixture Of Dynamical Experts), a graphical modeling framework whose neural gating mechanism decomposes complex dynamics into sparse, interpretable components, enabling both the unsupervised discovery of behavioral regimes and accurate long-term forecasting across regime transitions. Crucially, because agents in our framework can jump to different governing laws, MODE is especially tailored to the aforementioned noisy transitions. We evaluate our method on a battery of synthetic and real datasets from computational biology. First, we systematically benchmark MODE on an unsupervised classification task using synthetic dynamical snapshot data, including in noisy, few-sample settings. Next, we show how MODE succeeds on challenging forecasting tasks which simulate key cycling and branching processes in cell biology. Finally, we deploy our method on human, single-cell RNA sequencing data and show that it can not only distinguish proliferation from differentiation dynamics but also predict when cells will commit to their ultimate fate, a key outstanding challenge in computational biology.


【2】PHyCLIP: $\ell_1$-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning
标题:PHyCLIP:双曲因子的$\ell_1$-乘积统一视觉-语言表示学习中的层次性与组合性
链接:https://arxiv.org/abs/2510.08919

作者:Daiki Yoshikawa, Takashi Matsubara
备注:23 pages
摘要 :Vision-language models have achieved remarkable success in multi-modal representation learning from large-scale pairs of visual scenes and linguistic descriptions. However, they still struggle to simultaneously express two distinct types of semantic structures: the hierarchy within a concept family (e.g., dog $\preceq$ mammal $\preceq$ animal) and the compositionality across different concept families (e.g., "a dog in a car" $\preceq$ dog, car). Recent works have addressed this challenge by employing hyperbolic space, which efficiently captures tree-like hierarchy, yet its suitability for representing compositionality remains unclear. To resolve this dilemma, we propose PHyCLIP, which employs an $\ell_1$-Product metric on a Cartesian product of Hyperbolic factors. With our design, intra-family hierarchies emerge within individual hyperbolic factors, and cross-family composition is captured by the $\ell_1$-product metric, analogous to a Boolean algebra. Experiments on zero-shot classification, retrieval, hierarchical classification, and compositional understanding tasks demonstrate that PHyCLIP outperforms existing single-space approaches and offers more interpretable structures in the embedding space.


编码器(2篇)

【1】Struc-EMB: The Potential of Structure-Aware Encoding in Language Embeddings
标题:Struc-EMB:语言嵌入中结构感知编码的潜力
链接:https://arxiv.org/abs/2510.08774

作者:Shikun Liu, Haoyu Wang, Mufei Li, Pan Li
摘要:Text embeddings from Large Language Models (LLMs) have become foundational for numerous applications. However, these models typically operate on raw text, overlooking the rich structural information, such as hyperlinks or citations, that provides crucial context in many real-world datasets. This paper introduces and systematically evaluates a new paradigm for generating structure-aware text embeddings by integrating these structural relations directly into the LLM's internal encoding process, rather than relying on traditional post-hoc aggregation. We investigate two primary in-process methods: sequential concatenation and parallel caching. Through extensive zero-shot experiments across retrieval, clustering, classification, and recommendation tasks, we demonstrate that our structure-aware approaches consistently outperform both text-only and post-hoc baselines. Our analysis reveals critical trade-offs: sequential concatenation excels with noisy, moderate-length contexts, while parallel caching scales more effectively to long, high-signal contexts but is more susceptible to distractors. To address the challenge of noisy structural data, we also introduce and validate two effective techniques: Context Distillation and Semantic Balancing. This work provides the first comprehensive analysis of in-process structure-aware encoding, offering a blueprint for building more powerful and contextually aware embedding models.


【2】Estimating Brain Activity with High Spatial and Temporal Resolution using a Naturalistic MEG-fMRI Encoding Model
标题:使用自然主义MEG-fMRI编码模型以高空间和时间分辨率估计大脑活动
链接:https://arxiv.org/abs/2510.09415

作者:Beige Jerry Jin, Leila Wehbe
摘要:Current non-invasive neuroimaging techniques trade off between spatial resolution and temporal resolution. While magnetoencephalography (MEG) can capture rapid neural dynamics and functional magnetic resonance imaging (fMRI) can spatially localize brain activity, a unified picture that preserves both high resolutions remains an unsolved challenge with existing source localization or MEG-fMRI fusion methods, especially for single-trial naturalistic data. We collected whole-head MEG when subjects listened passively to more than seven hours of narrative stories, using the same stimuli in an open fMRI dataset (LeBel et al., 2023). We developed a transformer-based encoding model that combines the MEG and fMRI from these two naturalistic speech comprehension experiments to estimate latent cortical source responses with high spatiotemporal resolution. Our model is trained to predict MEG and fMRI from multiple subjects simultaneously, with a latent layer that represents our estimates of reconstructed cortical sources. Our model predicts MEG better than the common standard of single-modality encoding models, and it also yields source estimates with higher spatial and temporal fidelity than classic minimum-norm solutions in simulation experiments. We validated the estimated latent sources by showing its strong generalizability across unseen subjects and modalities. Estimated activity in our source space predict electrocorticography (ECoG) better than an ECoG-trained encoding model in an entirely new dataset. By integrating the power of large naturalistic experiments, MEG, fMRI, and encoding models, we propose a practical route towards millisecond-and-millimeter brain mapping.


Optimization | Convergence (10 papers)

【1】Automated Evolutionary Optimization for Resource-Efficient Neural Network Training
Link: https://arxiv.org/abs/2510.09566

Authors: Ilia Revin, Leon Strelkov, Vadim A. Potemkin, Ivan Kireev, Andrey Savchenko
Abstract: There are many critical challenges in optimizing neural network models, including distributed computing, compression techniques, and efficient training, regardless of their application to specific tasks. Solving such problems is crucial because the need for scalable and resource-efficient models is increasing. To address these challenges, we have developed a new automated machine learning (AutoML) framework, Parameter Efficient Training with Robust Automation (PETRA). It applies evolutionary optimization to model architecture and training strategy. PETRA includes pruning, quantization, and loss regularization. Experimental studies on real-world data with financial event sequences, as well as image and time-series benchmarks, demonstrate PETRA's ability to improve neural model performance and scalability -- namely, a significant decrease in model size (up to 75%) and latency (up to 33%), and an increase in throughput (by 13%) without noticeable degradation in the target metric.


【2】Locally Optimal Private Sampling: Beyond the Global Minimax
Link: https://arxiv.org/abs/2510.09485

Authors: Hrad Ghoukasian, Bonwoo Lee, Shahab Asoodeh
Note: 44 pages, 11 figures. Accepted at NeurIPS 2025
Abstract: We study the problem of sampling from a distribution under local differential privacy (LDP). Given a private distribution $P \in \mathcal{P}$, the goal is to generate a single sample from a distribution that remains close to $P$ in $f$-divergence while satisfying the constraints of LDP. This task captures the fundamental challenge of producing realistic-looking data under strong privacy guarantees. While prior work by Park et al. (NeurIPS'24) focuses on global minimax-optimality across a class of distributions, we take a local perspective. Specifically, we examine the minimax risk in a neighborhood around a fixed distribution $P_0$, and characterize its exact value, which depends on both $P_0$ and the privacy level. Our main result shows that the local minimax risk is determined by the global minimax risk when the distribution class $\mathcal{P}$ is restricted to a neighborhood around $P_0$. To establish this, we (1) extend previous work from pure LDP to the more general functional LDP framework, and (2) prove that the globally optimal functional LDP sampler yields the optimal local sampler when constrained to distributions near $P_0$. Building on this, we also derive a simple closed-form expression for the locally minimax-optimal samplers which does not depend on the choice of $f$-divergence. We further argue that this local framework naturally models private sampling with public data, where the public data distribution is represented by $P_0$. In this setting, we empirically compare our locally optimal sampler to existing global methods, and demonstrate that it consistently outperforms global minimax samplers.


【3】Rate optimal learning of equilibria from data
Link: https://arxiv.org/abs/2510.09325

Authors: Till Freihaut, Luca Viano, Emanuele Nevali, Volkan Cevher, Matthieu Geist, Giorgia Ramponi
Abstract: We close open theoretical gaps in Multi-Agent Imitation Learning (MAIL) by characterizing the limits of non-interactive MAIL and presenting the first interactive algorithm with near-optimal sample complexity. In the non-interactive setting, we prove a statistical lower bound that identifies the all-policy deviation concentrability coefficient as the fundamental complexity measure, and we show that Behavior Cloning (BC) is rate-optimal. For the interactive setting, we introduce a framework that combines reward-free reinforcement learning with interactive MAIL and instantiate it with an algorithm, MAIL-WARM. It improves the best previously known sample complexity from $\mathcal{O}(\varepsilon^{-8})$ to $\mathcal{O}(\varepsilon^{-2}),$ matching the dependence on $\varepsilon$ implied by our lower bound. Finally, we provide numerical results that support our theory and illustrate, in environments such as grid worlds, cases where Behavior Cloning fails to learn.


【4】Flow-Opt: Scalable Centralized Multi-Robot Trajectory Optimization with Flow Matching and Differentiable Optimization
Link: https://arxiv.org/abs/2510.09204

Authors: Simon Idoko, Arun Kumar Singh
Abstract: Centralized trajectory optimization in the joint space of multiple robots allows access to a larger feasible space that can result in smoother trajectories, especially while planning in tight spaces. Unfortunately, it is often computationally intractable beyond a very small swarm size. In this paper, we propose Flow-Opt, a learning-based approach towards improving the computational tractability of centralized multi-robot trajectory optimization. Specifically, we reduce the problem to first learning a generative model to sample different candidate trajectories and then using a learned Safety Filter (SF) to ensure fast inference-time constraint satisfaction. We propose a flow-matching model with a diffusion transformer (DiT) augmented with permutation invariant robot position and map encoders as the generative model. We develop a custom solver for our SF and equip it with a neural network that predicts context-specific initialization. The initialization network is trained in a self-supervised manner, taking advantage of the differentiability of the SF solver. We advance the state-of-the-art in the following respects. First, we show that we can generate trajectories of tens of robots in cluttered environments in a few tens of milliseconds. This is several times faster than existing centralized optimization approaches. Moreover, our approach also generates smoother trajectories orders of magnitude faster than competing baselines based on diffusion models. Second, each component of our approach can be batched, allowing us to solve a few tens of problem instances in a fraction of a second. We believe this is a first such result; no existing approach provides such capabilities. Finally, our approach can generate a diverse set of trajectories between a given set of start and goal locations, which can capture different collision-avoidance behaviors.


【5】Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs
Link: https://arxiv.org/abs/2510.09201

Authors: Yumin Choi, Dongki Kim, Jinheon Baek, Sung Ju Hwang
Abstract: Large Language Models (LLMs) have shown remarkable success, and their multimodal expansions (MLLMs) further unlock capabilities spanning images, videos, and other modalities beyond text. However, despite this shift, prompt optimization approaches, designed to reduce the burden of manual prompt crafting while maximizing performance, remain confined to text, ultimately limiting the full potential of MLLMs. Motivated by this gap, we introduce the new problem of multimodal prompt optimization, which expands the prior definition of prompt optimization to the multimodal space defined by the pairs of textual and non-textual prompts. To tackle this problem, we then propose the Multimodal Prompt Optimizer (MPO), a unified framework that not only performs the joint optimization of multimodal prompts through alignment-preserving updates but also guides the selection process of candidate prompts by leveraging earlier evaluations as priors in a Bayesian-based selection strategy. Through extensive experiments across diverse modalities that go beyond text, such as images, videos, and even molecules, we demonstrate that MPO outperforms leading text-only optimization methods, establishing multimodal prompt optimization as a crucial step to realizing the potential of MLLMs.


【6】MCMC: Bridging Rendering, Optimization and Generative AI
Link: https://arxiv.org/abs/2510.09078

Authors: Gurprit Singh, Wenzel Jakob
Note: SIGGRAPH Asia 2024 Courses. arXiv admin note: text overlap with arXiv:2208.11970 by other authors
Abstract: Generative artificial intelligence (AI) has made unprecedented advances in vision language models over the past two years. During the generative process, new samples (images) are generated from an unknown high-dimensional distribution. Markov Chain Monte Carlo (MCMC) methods are particularly effective in drawing samples from such complex, high-dimensional distributions. This makes MCMC methods an integral component for models like EBMs, ensuring accurate sample generation. Gradient-based optimization is at the core of modern generative models. The update step during the optimization forms a Markov chain where the new update depends only on the current state. This allows exploration of the parameter space in a memoryless manner, thus combining the benefits of gradient-based optimization and MCMC sampling. MCMC methods have shown an equally important role in physically based rendering where complex light paths are otherwise quite challenging to sample with simple importance sampling techniques. A lot of research is dedicated towards bringing physical realism to samples (images) generated from diffusion-based generative models in a data-driven manner; however, a unified framework connecting these techniques is still missing. In this course, we take the first steps toward understanding each of these components and exploring how MCMC could potentially serve as a bridge, linking these closely related areas of research. Our course aims to provide necessary theoretical and practical tools to guide students, researchers and practitioners towards the common goal of generative physically based rendering. All Jupyter notebooks with demonstrations associated to this tutorial can be found on the project webpage: https://sinbag.github.io/mcmc/


【7】Convergence of optimizers implies eigenvalues filtering at equilibrium
Link: https://arxiv.org/abs/2510.09034

Authors: Jerome Bolte (TSE-R), Quoc-Tung Le (UGA, LJK), Edouard Pauwels (TSE-R)
Abstract: Ample empirical evidence in deep neural network training suggests that a variety of optimizers tend to find nearly global optima. In this article, we adopt the reversed perspective that convergence to an arbitrary point is assumed rather than proven, focusing on the consequences of this assumption. From this viewpoint, in line with recent advances on the edge-of-stability phenomenon, we argue that different optimizers effectively act as eigenvalue filters determined by their hyperparameters. Specifically, the standard gradient descent method inherently avoids the sharpest minima, whereas Sharpness-Aware Minimization (SAM) algorithms go even further by actively favoring wider basins. Inspired by these insights, we propose two novel algorithms that exhibit enhanced eigenvalue filtering, effectively promoting wider minima. Our theoretical analysis leverages a generalized Hadamard--Perron stable manifold theorem and applies to general semialgebraic $C^2$ functions, without requiring additional non-degeneracy conditions or global Lipschitz bound assumptions. We support our conclusions with numerical experiments on feed-forward neural networks.
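
The claim that plain gradient descent filters out sharp minima follows from a standard local-stability argument; here is a worked version under a quadratic approximation of the loss (our illustration of the mechanism, not the paper's general semialgebraic setting):

% Near a candidate minimum w*, approximate the loss quadratically with
% Hessian H. Gradient descent with step size \eta evolves e_t = w_t - w* as
\[
  e_{t+1} = (I - \eta H)\, e_t ,
\]
% which is linearly stable only if |1 - \eta \lambda_i| < 1 for every
% eigenvalue \lambda_i of H, i.e.
\[
  \lambda_{\max}(H) < \frac{2}{\eta}.
\]
% Minima sharper than 2/\eta repel the iterates, so the step size itself acts
% as an eigenvalue filter; SAM-style updates tighten this filter further.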


【8】LOTION: Smoothing the Optimization Landscape for Quantized Training
Link: https://arxiv.org/abs/2510.08757

Authors: Mujin Kwun, Depen Morwani, Chloe Huangyuan Su, Stephanie Gil, Nikhil Anand, Sham Kakade
Note: 9 pages of main text + appendices
Abstract: Optimizing neural networks for quantized objectives is fundamentally challenging because the quantizer is piece-wise constant, yielding zero gradients everywhere except at quantization thresholds where the derivative is undefined. Most existing methods deal with this issue by relaxing gradient computations with techniques like Straight Through Estimators (STE) and do not provide any guarantees of convergence. In this work, taking inspiration from Nesterov smoothing, we approximate the quantized loss surface with a continuous loss surface. In particular, we introduce LOTION, Low-precision Optimization via sTochastic-noIse smOothiNg, a principled smoothing framework that replaces the raw quantized loss with its expectation under unbiased randomized-rounding noise. In this framework, standard optimizers are guaranteed to converge to a local minimum of the loss surface. Moreover, when using noise derived from stochastic rounding, we show that the global minima of the original quantized loss are preserved. We empirically demonstrate that this method outperforms standard QAT on synthetic testbeds and on 150M- and 300M-parameter language models.
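
A minimal sketch of the smoothing idea, assuming a uniform quantization grid and a toy loss (not the authors' implementation): the kinked loss of deterministically rounded weights is replaced by a Monte-Carlo estimate of its expectation under unbiased stochastic rounding.

import numpy as np

def stochastic_round(w: np.ndarray, step: float, rng) -> np.ndarray:
    # Unbiased randomized rounding to a grid of spacing `step`: round up
    # with probability equal to the fractional part, so E[round(w)] = w.
    scaled = w / step
    lower = np.floor(scaled)
    up = rng.random(w.shape) < (scaled - lower)
    return (lower + up) * step

def smoothed_loss(loss_fn, w, step=0.1, n_samples=64, seed=0):
    # Monte-Carlo estimate of E[loss(stochastic_round(w))], a continuous
    # surrogate for the piece-wise constant quantized loss.
    rng = np.random.default_rng(seed)
    return np.mean([loss_fn(stochastic_round(w, step, rng)) for _ in range(n_samples)])

loss = lambda w: float(np.sum((w - 0.37) ** 2))   # toy quantization-sensitive loss
w = np.array([0.30, 0.45])
print(loss(np.round(w / 0.1) * 0.1), smoothed_loss(loss, w))  # kinked vs smoothed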


【9】Conformal Risk Training: End-to-End Optimization of Conformal Risk Control
Link: https://arxiv.org/abs/2510.08748

Authors: Christopher Yeh, Nicolas Christianson, Adam Wierman, Yisong Yue
Note: accepted to NeurIPS 2025
Abstract: While deep learning models often achieve high predictive accuracy, their predictions typically do not come with any provable guarantees on risk or reliability, which are critical for deployment in high-stakes applications. The framework of conformal risk control (CRC) provides a distribution-free, finite-sample method for controlling the expected value of any bounded monotone loss function and can be conveniently applied post-hoc to any pre-trained deep learning model. However, many real-world applications are sensitive to tail risks, as opposed to just expected loss. In this work, we develop a method for controlling the general class of Optimized Certainty-Equivalent (OCE) risks, a broad class of risk measures which includes as special cases the expected loss (generalizing the original CRC method) and common tail risks like the conditional value-at-risk (CVaR). Furthermore, standard post-hoc CRC can degrade average-case performance due to its lack of feedback to the model. To address this, we introduce "conformal risk training," an end-to-end approach that differentiates through conformal OCE risk control during model training or fine-tuning. Our method achieves provable risk guarantees while demonstrating significantly improved average-case performance over post-hoc approaches on applications to controlling classifiers' false negative rate and controlling financial risk in battery storage operation.
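
For context, a minimal sketch of the post-hoc conformal risk control recipe this work builds on, for the false-negative-rate example mentioned above (toy data; the function names and grid are ours):

import numpy as np

def calibrate_lambda(cal_scores, cal_labels, alpha=0.1):
    # Prediction sets C_lam(x) = {y : score_y >= 1 - lam}; the FNR loss is
    # non-increasing in lam, so pick the smallest lam meeting the CRC bound
    # (n * empirical_risk + B) / (n + 1) <= alpha, with loss bound B = 1.
    n = len(cal_labels)
    for lam in np.linspace(0.0, 1.0, 1001):
        miss = np.mean(cal_scores[np.arange(n), cal_labels] < 1 - lam)
        if (n * miss + 1) / (n + 1) <= alpha:
            return lam
    return 1.0

rng = np.random.default_rng(0)
scores = rng.dirichlet(np.ones(5), size=500)          # toy softmax outputs
labels = np.array([rng.choice(5, p=s) for s in scores])
lam = calibrate_lambda(scores, labels, alpha=0.2)
print("lambda-hat:", lam)  # sets {y: score >= 1 - lam} keep FNR <= 0.2 on average

The paper's contribution is then to differentiate through this calibration step during training, and to replace the expected loss with general OCE risks.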


【10】Three Birds with One Stone: Improving Performance, Convergence, and System Throughput with Nest
Link: https://arxiv.org/abs/2510.09578

Authors: Yuqian Huo, David Quiroga, Anastasios Kyrillidis, Tirthak Patel
Note: This paper will appear in the Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), 2026
Abstract: Variational quantum algorithms (VQAs) have the potential to demonstrate quantum utility on near-term quantum computers. However, these algorithms often get executed on the highest-fidelity qubits and computers to achieve the best performance, causing low system throughput. Recent efforts have shown that VQAs can be run on low-fidelity qubits initially and high-fidelity qubits later on to still achieve good performance. We take this effort forward and show that carefully varying the qubit fidelity map of the VQA over its execution using our technique, Nest, does not just (1) improve performance (i.e., help achieve close to optimal results), but also (2) lead to faster convergence. We also use Nest to co-locate multiple VQAs concurrently on the same computer, thus (3) increasing the system throughput, and therefore, balancing and optimizing three conflicting metrics simultaneously.


Prediction | Estimation (10 papers)

【1】Geo-Aware Models for Stream Temperature Prediction across Different Spatial Regions and Scales
Link: https://arxiv.org/abs/2510.09500

Authors: Shiyuan Luo, Runlong Yu, Shengyu Chen, Yingda Fan, Yiqun Xie, Yanhua Li, Xiaowei Jia
Abstract: Understanding environmental ecosystems is vital for the sustainable management of our planet. However, existing physics-based and data-driven models often fail to generalize to varying spatial regions and scales due to the inherent data heterogeneity presented in real environmental ecosystems. This generalization issue is further exacerbated by the limited observation samples available for model training. To address these issues, we propose Geo-STARS, a geo-aware spatio-temporal modeling framework for predicting stream water temperature across different watersheds and spatial scales. The major innovation of Geo-STARS is the introduction of geo-aware embedding, which leverages geographic information to explicitly capture shared principles and patterns across spatial regions and scales. We further integrate the geo-aware embedding into a gated spatio-temporal graph neural network. This design enables the model to learn complex spatial and temporal patterns guided by geographic and hydrological context, even with sparse or no observational data. We evaluate Geo-STARS's efficacy in predicting stream water temperature, which is a master factor for water quality. Using real-world datasets spanning 37 years across multiple watersheds along the eastern coast of the United States, Geo-STARS demonstrates its superior generalization performance across both regions and scales, outperforming state-of-the-art baselines. These results highlight the promise of Geo-STARS for scalable, data-efficient environmental monitoring and decision-making.


【2】CRPS-LAM: Regional ensemble weather forecasting from matching marginals
Link: https://arxiv.org/abs/2510.09484

Authors: Erik Larsson, Joel Oskarsson, Tomas Landelius, Fredrik Lindsten
Note: Preprint
Abstract: Machine learning for weather prediction increasingly relies on ensemble methods to provide probabilistic forecasts. Diffusion-based models have shown strong performance in Limited-Area Modeling (LAM) but remain computationally expensive at sampling time. Building on the success of global weather forecasting models trained based on Continuous Ranked Probability Score (CRPS), we introduce CRPS-LAM, a probabilistic LAM forecasting model trained with a CRPS-based objective. By sampling and injecting a single latent noise vector into the model, CRPS-LAM generates ensemble members in a single forward pass, achieving sampling speeds up to 39 times faster than a diffusion-based model. We evaluate the model on the MEPS regional dataset, where CRPS-LAM matches the low errors of diffusion models. By retaining also fine-scale forecast details, the method stands out as an effective approach for probabilistic regional weather forecasting.
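
For reference, the sample-based CRPS that such models are trained to minimize can be written as CRPS(F, y) = E|X - y| - (1/2) E|X - X'|. A minimal estimator over an m-member ensemble (the standard form; CRPS-LAM's exact training variant may differ):

import numpy as np

def crps_ensemble(members: np.ndarray, y: float) -> float:
    # members: shape (m,) ensemble forecasts for a scalar target y.
    m = len(members)
    term1 = np.mean(np.abs(members - y))                          # E|X - y|
    term2 = np.abs(members[:, None] - members[None, :]).sum() / (2 * m * m)
    return term1 - term2                                          # lower is better

ens = np.random.default_rng(0).normal(loc=1.0, scale=0.5, size=8)
print(crps_ensemble(ens, y=1.2))

Because the objective only needs ensemble samples and the verifying value, one forward pass per member suffices, which is what makes CRPS training so much cheaper than diffusion sampling.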


【3】Interpretable Machine Learning for Predicting Startup Funding, Patenting, and Exits
Link: https://arxiv.org/abs/2510.09465

Authors: Saeid Mashhadi, Amirhossein Saghezchi, Vesal Ghassemzadeh Kashani
Abstract: This study develops an interpretable machine learning framework to forecast startup outcomes, including funding, patenting, and exit. A firm-quarter panel for 2010-2023 is constructed from Crunchbase and matched to U.S. Patent and Trademark Office (USPTO) data. Three horizons are evaluated: next funding within 12 months, patent-stock growth within 24 months, and exit through an initial public offering (IPO) or acquisition within 36 months. Preprocessing is fit on a development window (2010-2019) and applied without change to later cohorts to avoid leakage. Class imbalance is addressed using inverse-prevalence weights and the Synthetic Minority Oversampling Technique for Nominal and Continuous features (SMOTE-NC). Logistic regression and tree ensembles, including Random Forest, XGBoost, LightGBM, and CatBoost, are compared using the area under the precision-recall curve (PR-AUC) and the area under the receiver operating characteristic curve (AUROC). Patent, funding, and exit predictions achieve AUROC values of 0.921, 0.817, and 0.872, providing transparent and reproducible rankings for innovation finance.
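
A minimal sketch of the imbalance-handling and evaluation recipe described above, on synthetic stand-in data (illustrative only; not the study's pipeline, features, or data):

import numpy as np
from imblearn.over_sampling import SMOTENC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(size=2000), rng.normal(size=2000),
                     rng.integers(0, 4, size=2000)])    # last column: categorical
y = (rng.random(2000) < 1 / (1 + np.exp(3 - X[:, 0]))).astype(int)  # rare outcome

# Fit resampling on the training window only, mirroring the leakage-avoidance rule.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
X_res, y_res = SMOTENC(categorical_features=[2], random_state=0).fit_resample(X_tr, y_tr)

clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
p = clf.predict_proba(X_te)[:, 1]
print("PR-AUC:", average_precision_score(y_te, p), "AUROC:", roc_auc_score(y_te, p))

PR-AUC is reported alongside AUROC because, under heavy class imbalance, it is far more sensitive to performance on the rare positive class.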


【4】Failure Prediction at Runtime for Generative Robot Policies
Link: https://arxiv.org/abs/2510.09459

Authors: Ralf Römer, Adrian Kobras, Luca Worbis, Angela P. Schoellig
Note: Accepted to NeurIPS 2025
Abstract: Imitation learning (IL) with generative models, such as diffusion and flow matching, has enabled robots to perform complex, long-horizon tasks. However, distribution shifts from unseen environments or compounding action errors can still cause unpredictable and unsafe behavior, leading to task failure. Early failure prediction during runtime is therefore essential for deploying robots in human-centered and safety-critical environments. We propose FIPER, a general framework for Failure Prediction at Runtime for generative IL policies that does not require failure data. FIPER identifies two key indicators of impending failure: (i) out-of-distribution (OOD) observations detected via random network distillation in the policy's embedding space, and (ii) high uncertainty in generated actions measured by a novel action-chunk entropy score. Both failure prediction scores are calibrated using a small set of successful rollouts via conformal prediction. A failure alarm is triggered when both indicators, aggregated over short time windows, exceed their thresholds. We evaluate FIPER across five simulation and real-world environments involving diverse failure modes. Our results demonstrate that FIPER better distinguishes actual failures from benign OOD situations and predicts failures more accurately and earlier than existing methods. We thus consider this work an important step towards more interpretable and safer generative robot policies. Code, data and videos are available at https://tum-lsy.github.io/fiper_website.


【5】A PCA-based Data Prediction Method
Link: https://arxiv.org/abs/2510.09246

Authors: Peteris Daugulis, Vija Vagale, Emiliano Mancini, Filippo Castiglione
Abstract: The problem of choosing appropriate values for missing data is often encountered in data science. We describe a novel method containing both traditional mathematics and machine learning elements for prediction (imputation) of missing data. This method is based on the notion of distance between shifted linear subspaces representing the existing data and candidate sets. The existing data set is represented by the subspace spanned by its first principal components. Solutions for the case of the Euclidean metric are given.
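
A minimal sketch of imputation against the principal subspace of the complete data (our simplified reading of the subspace-distance idea, not the authors' exact formulation):

import numpy as np

def pca_impute(X_complete: np.ndarray, x: np.ndarray, mask: np.ndarray, k: int = 2):
    # Fit the principal subspace on fully observed rows.
    mu = X_complete.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_complete - mu, full_matrices=False)
    V = Vt[:k]                                  # (k, d) top principal directions
    obs = ~mask                                 # mask is True where x is missing
    # Pick subspace coordinates z that best fit the observed entries ...
    z, *_ = np.linalg.lstsq(V[:, obs].T, x[obs] - mu[obs], rcond=None)
    # ... and read the missing entries off the reconstructed point mu + z V.
    x_hat = x.copy()
    x_hat[mask] = (mu + z @ V)[mask]
    return x_hat

rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 2))
X = Z @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(200, 5))  # ~rank-2 data
x = X[0].copy(); mask = np.array([False, False, True, False, True])
x[mask] = np.nan
print(pca_impute(X[1:], x, mask), "vs true", X[0])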


【6】Score-Based Density Estimation from Pairwise Comparisons
Link: https://arxiv.org/abs/2510.09146

Authors: Petrus Mikkola, Luigi Acerbi, Arto Klami
Note: 32 pages, 26 figures
Abstract: We study density estimation from pairwise comparisons, motivated by expert knowledge elicitation and learning from human feedback. We relate the unobserved target density to a tempered winner density (marginal density of preferred choices), learning the winner's score via score-matching. This allows estimating the target by 'de-tempering' the estimated winner density's score. We prove that the score vectors of the belief and the winner density are collinear, linked by a position-dependent tempering field. We give analytical formulas for this field and propose an estimator for it under the Bradley-Terry model. Using a diffusion model trained on tempered samples generated via score-scaled annealed Langevin dynamics, we can learn complex multivariate belief densities of simulated experts, from only hundreds to thousands of pairwise comparisons.


【7】Simple and Robust Forecasting of Spatiotemporally Correlated Small Earth Data with A Tabular Foundation Model
Link: https://arxiv.org/abs/2510.08920

Authors: Yuting Yang, Gang Mei, Zhengjing Ma, Nengxiong Xu, Jianbing Peng
Abstract: Small Earth data are geoscience observations with limited short-term monitoring variability, providing sparse but meaningful measurements, typically exhibiting spatiotemporal correlations. Spatiotemporal forecasting on such data is crucial for understanding geoscientific processes despite their small scale. However, conventional deep learning models for spatiotemporal forecasting require task-specific training for different scenarios. Foundation models do not need task-specific training, but they often exhibit forecasting bias toward the global mean of the pretraining distribution. Here we propose a simple and robust approach for spatiotemporally correlated small Earth data forecasting. The essential idea is to characterize and quantify spatiotemporal patterns of small Earth data and then utilize tabular foundation models for accurate forecasting across different scenarios. Comparative results across three typical scenarios demonstrate that our forecasting approach achieves superior accuracy compared to the graph deep learning model (T-GCN) and tabular foundation model (TabPFN) in the majority of instances, exhibiting stronger robustness.


【8】Counterfactually Fair Conformal Prediction
Link: https://arxiv.org/abs/2510.08724

Authors: Ozgur Guldogan, Neeraj Sarna, Yuanyuan Li, Michael Berger
Abstract: While counterfactual fairness of point predictors is well studied, its extension to prediction sets--central to fair decision-making under uncertainty--remains underexplored. On the other hand, conformal prediction (CP) provides efficient, distribution-free, finite-sample valid prediction sets, yet does not ensure counterfactual fairness. We close this gap by developing Counterfactually Fair Conformal Prediction (CF-CP) that produces counterfactually fair prediction sets. Through symmetrization of conformity scores across protected-attribute interventions, we prove that CF-CP results in counterfactually fair prediction sets while maintaining the marginal coverage property. Furthermore, we empirically demonstrate that on both synthetic and real datasets, across regression and classification tasks, CF-CP achieves the desired counterfactual fairness and meets the target coverage rate with minimal increase in prediction set size. CF-CP offers a simple, training-free route to counterfactually fair uncertainty quantification.
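
A minimal sketch of one way such symmetrization can look (our illustrative reading of the abstract, not the authors' construction): score each calibration point under every counterfactual value of the protected attribute and keep the worst case, so the calibrated set no longer depends on that attribute.

import numpy as np

def symmetrized_scores(model, X, y, attr_col, attr_values):
    scores = []
    for a in attr_values:
        Xa = X.copy()
        Xa[:, attr_col] = a                    # counterfactual intervention A := a
        scores.append(np.abs(y - model(Xa)))   # absolute-residual conformity score
    return np.max(scores, axis=0)              # symmetrize across interventions

def conformal_quantile(scores, alpha=0.1):
    n = len(scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n     # split-conformal finite-sample level
    return np.quantile(scores, min(q, 1.0), method="higher")

model = lambda X: 2.0 * X[:, 0]                # toy regressor ignoring the attribute
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(size=300), rng.integers(0, 2, size=300)])
y = 2 * X[:, 0] + 0.1 * rng.normal(size=300)
qhat = conformal_quantile(symmetrized_scores(model, X, y, attr_col=1, attr_values=[0, 1]))
print("interval half-width:", qhat)            # predict model(x) +/- qhat for any x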


【9】DPCformer: An Interpretable Deep Learning Model for Genomic Prediction in Crops
Link: https://arxiv.org/abs/2510.08662

Authors: Pengcheng Deng, Kening Liu, Mengxi Zhou, Mingxi Li, Rui Yang, Chuzhe Cao, Maojun Wang, Zeyu Zhang
Note: This work has been accepted by BIBM 2025
Abstract: Genomic Selection (GS) uses whole-genome information to predict crop phenotypes and accelerate breeding. Traditional GS methods, however, struggle with prediction accuracy for complex traits and large datasets. We propose DPCformer, a deep learning model integrating convolutional neural networks with a self-attention mechanism to model complex genotype-phenotype relationships. We applied DPCformer to 13 traits across five crops (maize, cotton, tomato, rice, chickpea). Our approach uses an 8-dimensional one-hot encoding for SNP data, ordered by chromosome, and employs the PMF algorithm for feature selection. Evaluations show DPCformer outperforms existing methods. In maize datasets, accuracy for traits like days to tasseling and plant height improved by up to 2.92%. For cotton, accuracy gains for fiber traits reached 8.37%. On small-sample tomato data, the Pearson Correlation Coefficient for a key trait increased by up to 57.35%. In chickpea, the yield correlation was boosted by 16.62%. DPCformer demonstrates superior accuracy, robustness in small-sample scenarios, and enhanced interpretability, providing a powerful tool for precision breeding and addressing global food security challenges.


【10】Inner-Instance Normalization for Time Series Forecasting
Link: https://arxiv.org/abs/2510.08657

Authors: Zipo Jibao, Yingyi Fu, Xinyang Chen, Guoting Chen
Abstract: Real-world time series are influenced by numerous factors and exhibit complex non-stationary characteristics. Non-stationarity can lead to distribution shifts, where the statistical properties of time series change over time, negatively impacting model performance. Several instance normalization techniques have been proposed to address distribution shifts in time series forecasting. However, existing methods fail to account for shifts within individual instances, leading to suboptimal performance. To tackle inner-instance distribution shifts, we propose two novel point-level methods: Learning Distribution (LD) and Learning Conditional Distribution (LCD). LD eliminates internal discrepancies by fitting the internal distribution of input and output with different parameters at different time steps, while LCD utilizes neural networks to predict scaling coefficients of the output. We evaluate the performance of the two methods with various backbone models across public benchmarks and demonstrate the effectiveness of the point-level paradigm through comparative experiments.


Other Neural Networks | Deep Learning | Models | Modeling (32 papers)

【1】BaNEL: Exploration Posteriors for Generative Modeling Using Only Negative Rewards
Link: https://arxiv.org/abs/2510.09596

Authors: Sangyun Lee, Brandon Amos, Giulia Fanti
Abstract: Today's generative models thrive with large amounts of supervised data and informative reward functions characterizing the quality of the generation. They work under the assumptions that the supervised data provides knowledge to pre-train the model, and the reward function provides dense information about how to further improve the generation quality and correctness. However, in the hardest instances of important problems, two problems arise: (1) the base generative model attains a near-zero reward signal, and (2) calls to the reward oracle are expensive. This setting poses a fundamentally different learning challenge than standard reward-based post-training. To address this, we propose BaNEL (Bayesian Negative Evidence Learning), an algorithm that post-trains the model using failed attempts only, while minimizing the number of reward evaluations (NREs). Our method is based on the idea that the problem of learning regularities underlying failures can be cast as another, in-loop generative modeling problem. We then leverage this model to assess whether new data resembles previously seen failures and steer the generation away from them. We show that BaNEL can improve model performance without observing a single successful sample on several sparse-reward tasks, outperforming existing novelty-bonus approaches by up to several orders of magnitude in success rate, while using fewer reward evaluations.


【2】Design Principles for Sequence Models via Coefficient Dynamics
Link: https://arxiv.org/abs/2510.09389

Authors: Jerome Sieber, Antonio Orvieto, Melanie N. Zeilinger, Carmen Amo Alonso
Abstract: Deep sequence models, ranging from Transformers and State Space Models (SSMs) to more recent approaches such as gated linear RNNs, fundamentally compute outputs as linear combinations of past value vectors. To draw insights and systematically compare such architectures, we develop a unified framework that makes this output operation explicit, by casting the linear combination coefficients as the outputs of autonomous linear dynamical systems driven by impulse inputs. This viewpoint, in spirit substantially different from approaches focusing on connecting linear RNNs with linear attention, reveals a common mathematical theme across diverse architectures and crucially captures softmax attention, on top of RNNs, SSMs, and related models. In contrast to new model proposals that are commonly evaluated on benchmarks, we derive design principles linking architectural choices to model properties. Thereby identifying tradeoffs between expressivity and efficient implementation, geometric constraints on input selectivity, and stability conditions for numerically stable training and information retention. By connecting several insights and observations from recent literature, the framework both explains empirical successes of recent designs and provides guiding principles for systematically designing new sequence model architectures.


【3】CHUCKLE -- When Humans Teach AI To Learn Emotions The Easy Way
Link: https://arxiv.org/abs/2510.09382

Authors: Ankush Pratap Singh, Houwei Cao, Yong Liu
Abstract: Curriculum learning (CL) structures training from simple to complex samples, facilitating progressive learning. However, existing CL approaches for emotion recognition often rely on heuristic, data-driven, or model-based definitions of sample difficulty, neglecting the difficulty for human perception, a critical factor in subjective tasks like emotion recognition. We propose CHUCKLE (Crowdsourced Human Understanding Curriculum for Knowledge Led Emotion Recognition), a perception-driven CL framework that leverages annotator agreement and alignment in crowd-sourced datasets to define sample difficulty, under the assumption that clips challenging for humans are similarly hard for machine learning models. Empirical results suggest that CHUCKLE increases the relative mean accuracy by 6.56% for LSTMs and 1.61% for Transformers over non-curriculum baselines, while reducing the number of gradient updates, thereby enhancing both training efficiency and model robustness.
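
A minimal sketch of an agreement-based curriculum (illustrative; the paper's exact difficulty definition and schedule may differ): samples on which human annotators agree most are treated as easy and presented first.

import numpy as np

def agreement(annotator_labels: np.ndarray) -> np.ndarray:
    # annotator_labels: (n_samples, n_annotators); agreement is the share of
    # annotators voting for the majority label of each sample.
    n, m = annotator_labels.shape
    agree = np.empty(n)
    for i, row in enumerate(annotator_labels):
        _, counts = np.unique(row, return_counts=True)
        agree[i] = counts.max() / m
    return agree

def curriculum_stages(X, y, annotator_labels, n_stages=3):
    order = np.argsort(-agreement(annotator_labels))   # easy (high agreement) first
    for stage_idx in np.array_split(order, n_stages):
        yield X[stage_idx], y[stage_idx]               # train on each stage in turn

rng = np.random.default_rng(0)
X, y = rng.normal(size=(12, 4)), rng.integers(0, 3, size=12)
ann = rng.integers(0, 3, size=(12, 5))
for stage, (Xs, ys) in enumerate(curriculum_stages(X, y, ann)):
    print("stage", stage, "has", len(ys), "clips")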


【4】Task-Level Insights from Eigenvalues across Sequence Models
Link: https://arxiv.org/abs/2510.09379

Authors: Rahel Rickenbach, Jelena Trisovic, Alexandre Didier, Jerome Sieber, Melanie N. Zeilinger
Abstract: Although softmax attention drives state-of-the-art performance for sequence models, its quadratic complexity limits scalability, motivating linear alternatives such as state space models (SSMs). While these alternatives improve efficiency, their fundamental differences in information processing remain poorly understood. In this work, we leverage the recently proposed dynamical systems framework to represent softmax, norm and linear attention as dynamical systems, enabling a structured comparison with SSMs by analyzing their respective eigenvalue spectra. Since eigenvalues capture essential aspects of dynamical system behavior, we conduct an extensive empirical analysis across diverse sequence models and benchmarks. We first show that eigenvalues influence essential aspects of memory and long-range dependency modeling, revealing spectral signatures that align with task requirements. Building on these insights, we then investigate how architectural modifications in sequence models impact both eigenvalue spectra and task performance. This correspondence further strengthens the position of eigenvalue analysis as a principled metric for interpreting, understanding, and ultimately improving the capabilities of sequence models.


【5】Deep Learning to Identify the Spatio-Temporal Cascading Effects of Train Delays in a High-Density Network
Link: https://arxiv.org/abs/2510.09350

Authors: Vu Duc Anh Nguyen, Ziyue Li
Note: Accepted at SIGSPATIAL 2025 - GeoAI Workshop
Abstract: The operational efficiency of railway networks, a cornerstone of modern economies, is persistently undermined by the cascading effects of train delays. Accurately forecasting this delay propagation is a critical challenge for real-time traffic management. While recent research has leveraged Graph Neural Networks (GNNs) to model the network structure of railways, a significant gap remains in developing frameworks that provide multi-step autoregressive forecasts at a network-wide scale, while simultaneously offering the live, interpretable explanations needed for decision support. This paper addresses this gap by developing and evaluating a novel XGeoAI framework for live, explainable, multi-step train delay forecasting. The core of this work is a two-stage, autoregressive Graph Attention Network (GAT) model, trained on a real-world dataset covering over 40% of the Dutch railway network. The model represents the system as a spatio-temporal graph of operational events (arrivals and departures) and is enriched with granular features, including platform and station congestion. To test its viability for live deployment, the model is rigorously evaluated using a sequential, k-step-ahead forecasting protocol that simulates real-world conditions where prediction errors can compound. The results demonstrate that while the proposed GATv2 model is challenged on pure error metrics (MAE) by a simpler Persistence baseline, it achieves consistently higher precision in classifying delay events -- a crucial advantage for a reliable decision support tool.


【6】Residual-Informed Learning of Solutions to Algebraic Loops
Link: https://arxiv.org/abs/2510.09317

Authors: Felix Brandt, Andreas Heuermann, Philip Hannebohm, Bernhard Bachmann
Note: 16 pages, 16 figures, 5 tables, submitted to IDaS-Schriftenreihe from Hochschule Bielefeld - University of Applied Sciences and Arts (HSBI)
Abstract: This paper presents a residual-informed machine learning approach for replacing algebraic loops in equation-based Modelica models with neural network surrogates. A feedforward neural network is trained using the residual (error) of the algebraic loop directly in its loss function, eliminating the need for a supervised dataset. This training strategy also resolves the issue of ambiguous solutions, allowing the surrogate to converge to a consistent solution rather than averaging multiple valid ones. Applied to the large-scale IEEE 14-Bus system, our method achieves a 60% reduction in simulation time compared to conventional simulations, while maintaining the same level of accuracy through error control mechanisms.
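
A minimal sketch of residual-informed training on a toy algebraic loop y = cos(x*y) (our example, not the paper's IEEE 14-Bus system): the network's output is plugged into the loop residual, and that residual is the loss, so no supervised dataset of solved loops is needed.

import torch

net = torch.nn.Sequential(torch.nn.Linear(1, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def residual(x, y):
    # Algebraic loop written as g(x, y) = 0, here g(x, y) = y - cos(x * y).
    return y - torch.cos(x * y)

for step in range(2000):
    x = torch.rand(256, 1) * 2.0          # sample loop inputs from the domain
    y = net(x)
    loss = residual(x, y).pow(2).mean()   # train directly on the loop residual
    opt.zero_grad(); loss.backward(); opt.step()

x_test = torch.tensor([[1.0]])
y_hat = net(x_test)
print(float(y_hat), float(residual(x_test, y_hat)))   # residual should be near 0

Because the loss penalizes the residual rather than the distance to a labeled solution, the surrogate settles on one consistent root instead of averaging several valid ones.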


【7】A Model-Driven Engineering Approach to AI-Powered Healthcare Platforms
Link: https://arxiv.org/abs/2510.09308

Authors: Mira Raheem, Amal Elgammal, Michael Papazoglou, Bernd Krämer, Neamat El-Tazi
Note: Disclaimer: This manuscript is currently under review at MDPI Informatics
Abstract: Artificial intelligence (AI) has the potential to transform healthcare by supporting more accurate diagnoses and personalized treatments. However, its adoption in practice remains constrained by fragmented data sources, strict privacy rules, and the technical complexity of building reliable clinical systems. To address these challenges, we introduce a model-driven engineering (MDE) framework designed specifically for healthcare AI. The framework relies on formal metamodels, domain-specific languages (DSLs), and automated transformations to move from high-level specifications to running software. At its core is the Medical Interoperability Language (MILA), a graphical DSL that enables clinicians and data scientists to define queries and machine learning pipelines using shared ontologies. When combined with a federated learning architecture, MILA allows institutions to collaborate without exchanging raw patient data, ensuring semantic consistency across sites while preserving privacy. We evaluate this approach in a multi-center cancer immunotherapy study. The generated pipelines delivered strong predictive performance, with support vector machines achieving up to 98.5 percent and 98.3 percent accuracy in key tasks, while substantially reducing manual coding effort. These findings suggest that MDE principles (metamodeling, semantic integration, and automated code generation) can provide a practical path toward interoperable, reproducible, and trustworthy digital health platforms.


【8】Mitigating Model Drift in Developing Economies Using Synthetic Data and Outliers
Link: https://arxiv.org/abs/2510.09294

Authors: Ilyas Varshavskiy, Bonu Boboeva, Shuhrat Khalilbekov, Azizjon Azimi, Sergey Shulgin, Akhlitdin Nizamitdinov, Haitz Saez de Ocariz Borde
Abstract: Machine Learning models in finance are highly susceptible to model drift, where predictive performance declines as data distributions shift. This issue is especially acute in developing economies such as those in Central Asia and the Caucasus - including Tajikistan, Uzbekistan, Kazakhstan, and Azerbaijan - where frequent and unpredictable macroeconomic shocks destabilize financial data. To the best of our knowledge, this is among the first studies to examine drift mitigation methods on financial datasets from these regions. We investigate the use of synthetic outliers, a largely unexplored approach, to improve model stability against unforeseen shocks. To evaluate effectiveness, we introduce a two-level framework that measures both the extent of performance degradation and the severity of shocks. Our experiments on macroeconomic tabular datasets show that adding a small proportion of synthetic outliers generally improves stability compared to baseline models, though the optimal amount varies by dataset and model.


【9】Placeit! A Framework for Learning Robot Object Placement Skills
Link: https://arxiv.org/abs/2510.09267

Authors: Amina Ferrad, Johann Huber, François Hélénon, Julien Gleyze, Mahdi Khoramshahi, Stéphane Doncieux
Note: 8 pages, 8 figures. Draft version
Abstract: Robotics research has made significant strides in learning, yet mastering basic skills like object placement remains a fundamental challenge. A key bottleneck is the acquisition of large-scale, high-quality data, which is often a manual and laborious process. Inspired by Graspit!, a foundational work that used simulation to automatically generate dexterous grasp poses, we introduce Placeit!, an evolutionary-computation framework for generating valid placement positions for rigid objects. Placeit! is highly versatile, supporting tasks from placing objects on tables to stacking and inserting them. Our experiments show that by leveraging quality-diversity optimization, Placeit! significantly outperforms state-of-the-art methods across all scenarios for generating diverse valid poses. A pick&place pipeline built on our framework achieved a 90% success rate over 120 real-world deployments. This work positions Placeit! as a powerful tool for open-environment pick-and-place tasks and as a valuable engine for generating the data needed to train simulation-based foundation models in robotics.


【10】Investigating the Impact of Rational Dilated Wavelet Transform on Motor Imagery EEG Decoding with Deep Learning Models
Link: https://arxiv.org/abs/2510.09242

Authors: Marco Siino, Giuseppe Bonomo, Rosario Sorbello, Ilenia Tinnirello
Abstract: The present study investigates the impact of the Rational Discrete Wavelet Transform (RDWT), used as a plug-in preprocessing step for motor imagery electroencephalographic (EEG) decoding prior to applying deep learning classifiers. A systematic paired evaluation (with/without RDWT) is conducted on four state-of-the-art deep learning architectures: EEGNet, ShallowConvNet, MBEEG_SENet, and EEGTCNet. This evaluation was carried out across three benchmark datasets: High Gamma, BCI-IV-2a, and BCI-IV-2b. The performance of the RDWT is reported with subject-wise averages using accuracy and Cohen's kappa, complemented by subject-level analyses to identify when RDWT is beneficial. On BCI-IV-2a, RDWT yields clear average gains for EEGTCNet (+4.44 percentage points, pp; kappa +0.059) and MBEEG_SENet (+2.23 pp; +0.030), with smaller improvements for EEGNet (+2.08 pp; +0.027) and ShallowConvNet (+0.58 pp; +0.008). On BCI-IV-2b, the enhancements observed are modest yet consistent for EEGNet (+0.21 pp; +0.044) and EEGTCNet (+0.28 pp; +0.077). On HGD, average effects are modest to positive, with the most significant gain observed for MBEEG_SENet (+1.65 pp; +0.022), followed by EEGNet (+0.76 pp; +0.010) and EEGTCNet (+0.54 pp; +0.008). Inspection of the subject material reveals significant enhancements in challenging recordings (e.g., non-stationary sessions), indicating that RDWT can mitigate localized noise and enhance rhythm-specific information. In conclusion, RDWT is shown to be a low-overhead, architecture-aware preprocessing technique that can yield tangible gains in accuracy and agreement for deep model families and challenging subjects.


【11】On the Implicit Adversariality of Catastrophic Forgetting in Deep Continual Learning
Link: https://arxiv.org/abs/2510.09181

Authors: Ze Peng, Jian Zhang, Jintao Guo, Lei Qi, Yang Gao, Yinghuan Shi
Abstract: Continual learning seeks the human-like ability to accumulate new skills in machine intelligence. Its central challenge is catastrophic forgetting, whose underlying cause has not been fully understood for deep networks. In this paper, we demystify catastrophic forgetting by revealing that the new-task training is implicitly an adversarial attack against the old-task knowledge. Specifically, the new-task gradients automatically and accurately align with the sharp directions of the old-task loss landscape, rapidly increasing the old-task loss. This adversarial alignment is intriguingly counter-intuitive because the sharp directions are too sparsely distributed to align with by chance. To understand it, we theoretically show that it arises from training's low-rank bias, which, through forward and backward propagation, confines the two directions into the same low-dimensional subspace, facilitating alignment. Gradient projection (GP) methods, a representative family of forgetting-mitigating methods, reduce adversarial alignment caused by forward propagation, but cannot address the alignment due to backward propagation. We propose backGP to address it, which reduces forgetting by 10.8% and improves accuracy by 12.7% on average over GP methods.
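
For context, a minimal sketch of the gradient-projection family referenced above (representative of GPM-style methods, not the paper's backGP): new-task gradients are projected onto the orthogonal complement of a subspace U spanning the old task's important directions before the parameter update.

import torch

def project_out(grad: torch.Tensor, U: torch.Tensor) -> torch.Tensor:
    # U: (d, k) orthonormal basis of the protected (old-task) subspace.
    # Removing the component U U^T g leaves an update that, to first order,
    # does not move the model along the protected directions.
    return grad - U @ (U.T @ grad)

d, k = 10, 3
Q, _ = torch.linalg.qr(torch.randn(d, k))    # toy orthonormal old-task basis
g_new = torch.randn(d)
g_proj = project_out(g_new, Q)
print(torch.allclose(Q.T @ g_proj, torch.zeros(k), atol=1e-6))  # True: no overlap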


【12】Beyond Pairwise Connections: Extracting High-Order Functional Brain Network Structures under Global Constraints
Link: https://arxiv.org/abs/2510.09175

Authors: Ling Zhan, Junjie Huang, Xiaoyao Yu, Wenyu Chen, Tao Jia
Note: 33 pages, 10 figures, NeurIPS
Abstract: Functional brain network (FBN) modeling often relies on local pairwise interactions, whose limitation in capturing high-order dependencies is theoretically analyzed in this paper. Meanwhile, the computational burden and heuristic nature of current hypergraph modeling approaches hinder end-to-end learning of FBN structures directly from data distributions. To address this, we propose to extract high-order FBN structures under global constraints, and implement this as a Global Constraints oriented Multi-resolution (GCM) FBN structure learning framework. It incorporates 4 types of global constraint (signal synchronization, subject identity, expected edge numbers, and data labels) to enable learning FBN structures for 4 distinct levels (sample/subject/group/project) of modeling resolution. Experimental results demonstrate that GCM achieves up to a 30.6% improvement in relative accuracy and a 96.3% reduction in computational time across 5 datasets and 2 task settings, compared to 9 baselines and 10 state-of-the-art methods. Extensive experiments validate the contributions of individual components and highlight the interpretability of GCM. This work offers a novel perspective on FBN structure learning and provides a foundation for interdisciplinary applications in cognitive neuroscience. Code is publicly available on https://github.com/lzhan94swu/GCM.


【13】Training Feature Attribution for Vision Models
Link: https://arxiv.org/abs/2510.09135

Authors: Aziz Bacha, Thomas George
Abstract: Deep neural networks are often considered opaque systems, prompting the need for explainability methods to improve trust and accountability. Existing approaches typically attribute test-time predictions either to input features (e.g., pixels in an image) or to influential training examples. We argue that both perspectives should be studied jointly. This work explores *training feature attribution*, which links test predictions to specific regions of specific training images and thereby provides new insights into the inner workings of deep models. Our experiments on vision datasets show that training feature attribution yields fine-grained, test-specific explanations: it identifies harmful examples that drive misclassifications and reveals spurious correlations, such as patch-based shortcuts, that conventional attribution methods fail to expose.


【14】On the Fairness of Privacy Protection: Measuring and Mitigating the Disparity of Group Privacy Risks for Differentially Private Machine Learning
Link: https://arxiv.org/abs/2510.09114

Authors: Zhi Yang, Changwu Huang, Ke Tang, Xin Yao
Abstract: While significant progress has been made in conventional fairness-aware machine learning (ML) and differentially private ML (DPML), the fairness of privacy protection across groups remains underexplored. Existing studies have proposed methods to assess group privacy risks, but these are based on the average-case privacy risks of data records. Such approaches may underestimate the group privacy risks, thereby potentially underestimating the disparity across group privacy risks. Moreover, the current method for assessing the worst-case privacy risks of data records is time-consuming, limiting their practical applicability. To address these limitations, we introduce a novel membership inference game that can efficiently audit the approximate worst-case privacy risks of data records. Experimental results demonstrate that our method provides a more stringent measurement of group privacy risks, yielding a reliable assessment of the disparity in group privacy risks. Furthermore, to promote privacy protection fairness in DPML, we enhance the standard DP-SGD algorithm with an adaptive group-specific gradient clipping strategy, inspired by the design of canaries in differential privacy auditing studies. Extensive experiments confirm that our algorithm effectively reduces the disparity in group privacy risks, thereby enhancing the fairness of privacy protection in DPML.


【15】When a Robot is More Capable than a Human: Learning from Constrained Demonstrators
Link: https://arxiv.org/abs/2510.09096

Authors: Xinhu Li, Ayush Jain, Zhaojing Yang, Yigit Korkmaz, Erdem Bıyık
Abstract: Learning from demonstrations enables experts to teach robots complex tasks using interfaces such as kinesthetic teaching, joystick control, and sim-to-real transfer. However, these interfaces often constrain the expert's ability to demonstrate optimal behavior due to indirect control, setup restrictions, and hardware safety. For example, a joystick can move a robotic arm only in a 2D plane, even though the robot operates in a higher-dimensional space. As a result, the demonstrations collected by constrained experts lead to suboptimal performance of the learned policies. This raises a key question: Can a robot learn a better policy than the one demonstrated by a constrained expert? We address this by allowing the agent to go beyond direct imitation of expert actions and explore shorter and more efficient trajectories. We use the demonstrations to infer a state-only reward signal that measures task progress, and self-label reward for unknown states using temporal interpolation. Our approach outperforms common imitation learning in both sample efficiency and task completion time. On a real WidowX robotic arm, it completes the task in 12 seconds, 10x faster than behavioral cloning, as shown in real-robot videos on https://sites.google.com/view/constrainedexpert


【16】MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation
Link: https://arxiv.org/abs/2510.09065

Authors: Akira Takahashi, Shusuke Takahashi, Yuki Mitsufuji
Note: 4 pages, 4 figures, 2 tables
Abstract: We introduce MMAudioSep, a generative model for video/text-queried sound separation that is founded on a pretrained video-to-audio model. By leveraging knowledge about the relationship between video/text and audio learned through a pretrained audio generative model, we can train the model more efficiently, i.e., the model does not need to be trained from scratch. We evaluate the performance of MMAudioSep by comparing it to existing separation models, including models based on both deterministic and generative approaches, and find it is superior to the baseline models. Furthermore, we demonstrate that even after acquiring functionality for sound separation via fine-tuning, the model retains the ability for original video-to-audio generation. This highlights the potential of foundational sound generation models to be adopted for sound-related downstream tasks. Our code is available at https://github.com/sony/mmaudiosep.


【17】The Environmental Impacts of Machine Learning Training Keep Rising Evidencing Rebound Effect
Link: https://arxiv.org/abs/2510.09022

Authors: Clément Morand (STL), Anne-Laure Ligozat (ENSIIE, LISN, STL), Aurélie Névéol (STL, LISN)
Note: arXiv admin note: text overlap with arXiv:2412.17376
Abstract: Recent Machine Learning (ML) approaches have shown increased performance on benchmarks but at the cost of escalating computational demands. Hardware, algorithmic and carbon optimizations have been proposed to curb energy consumption and environmental impacts. Can these strategies lead to sustainable ML model training? Here, we estimate the environmental impacts associated with training notable AI systems over the last decade, including Large Language Models, with a focus on the life cycle of graphics cards. Our analysis reveals two critical trends: First, the impacts of graphics cards production have increased steadily over this period; Second, energy consumption and environmental impacts associated with training ML models have increased exponentially, even when considering reduction strategies such as location shifting to places with less carbon intensive electricity mixes. Optimization strategies do not mitigate the impacts induced by model training, evidencing rebound effect. We show that the impacts of hardware must be considered over the entire life cycle rather than the sole use phase in order to avoid impact shifting. Our study demonstrates that increasing efficiency alone cannot ensure sustainability in ML. Mitigating the environmental impact of AI also requires reducing AI activities and questioning the scale and frequency of resource-intensive training.


【18】SQS: Bayesian DNN Compression through Sparse Quantized Sub-distributions
Link: https://arxiv.org/abs/2510.08999

Authors: Ziyi Wang, Nan Jiang, Guang Lin, Qifan Song
Abstract: Compressing large-scale neural networks is essential for deploying models on resource-constrained devices. Most existing methods adopt weight pruning or low-bit quantization individually, often resulting in suboptimal compression rates to preserve acceptable performance drops. We introduce a unified framework for simultaneous pruning and low-bit quantization via Bayesian variational learning (SQS), which achieves higher compression rates than prior baselines while maintaining comparable performance. The key idea is to employ a spike-and-slab prior to induce sparsity and to model quantized weights using Gaussian Mixture Models (GMMs) to enable low-bit precision. In theory, we provide a consistency result for our proposed variational approach to sparse and quantized deep neural networks. Extensive experiments on compressing ResNet, BERT-base, Llama3, and Qwen2.5 models show that our method achieves higher compression rates than a line of existing methods with comparable performance drops.


【19】Learning Regularizers: Learning Optimizers that can Regularize
Link: https://arxiv.org/abs/2510.08968

Authors: Suraj Kumar Sahoo, Narayanan C Krishnan
Abstract: Learned Optimizers (LOs), a type of Meta-learning, have gained traction due to their ability to be parameterized and trained for efficient optimization. Traditional gradient-based methods incorporate explicit regularization techniques such as Sharpness-Aware Minimization (SAM), Gradient-norm Aware Minimization (GAM), and Gap-guided Sharpness-Aware Minimization (GSAM) to enhance generalization and convergence. In this work, we explore a fundamental question: Can regularizers be learned? We empirically demonstrate that LOs can be trained to learn and internalize the effects of traditional regularization techniques without explicitly applying them to the objective function. We validate this through extensive experiments on standard benchmarks (including MNIST, FMNIST, CIFAR and Neural Networks such as MLP, MLP-Relu and CNN), comparing LOs trained with and without access to explicit regularizers. Regularized LOs consistently outperform their unregularized counterparts in terms of test accuracy and generalization. Furthermore, we show that LOs retain and transfer these regularization effects to new optimization tasks by inherently seeking minima similar to those targeted by these regularizers. Our results suggest that LOs can inherently learn regularization properties, challenging the conventional necessity of explicit optimizee loss regularization.


【20】Variability Aware Recursive Neural Network (VARNN): A Residual-Memory Model for Capturing Temporal Deviation in Sequence Regression Modeling
标题:可变性感知递归神经网络(VARNN):一种用于捕获序列回归建模中时间偏差的残差记忆模型
链接:https://arxiv.org/abs/2510.08944

作者:Haroon Gharwi, Kai Shu
摘要:Real-world time series data exhibit non-stationary behavior, regime shifts, and temporally varying (heteroscedastic) noise that degrade the robustness of standard regression models. We introduce the Variability-Aware Recursive Neural Network (VARNN), a novel residual-aware architecture for supervised time-series regression that learns an explicit error memory from recent prediction residuals and uses it to recalibrate subsequent predictions. VARNN augments a feed-forward predictor with a learned error-memory state that is updated from residuals over a short window of context steps as a signal of variability and drift, and then conditions the final prediction at the current time step. Across diverse dataset domains (appliance energy, healthcare, and environmental monitoring), experimental results demonstrate that VARNN achieves superior performance and attains lower test MSE with minimal computational overhead over static, dynamic, and recurrent baselines. Our findings show that the VARNN model offers robust predictions under drift and volatility, highlighting its potential as a promising framework for time-series learning.
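
下面按摘要思路给出一个PyTorch玩具示意:前馈预测器的输入拼接一个由近期残差递归更新的"误差记忆"状态,用来校准后续预测。具体结构(如用GRUCell维护记忆)为本文假设,未必与论文实现一致:
```python
import torch
import torch.nn as nn

class TinyVARNN(nn.Module):
    """极简示意:带残差"误差记忆"的序列回归(非官方实现)"""
    def __init__(self, d_in, d_mem=8):
        super().__init__()
        self.base = nn.Sequential(nn.Linear(d_in + d_mem, 32), nn.ReLU(), nn.Linear(32, 1))
        self.mem_cell = nn.GRUCell(1, d_mem)   # 以上一步残差为输入更新记忆
        self.d_mem = d_mem

    def forward(self, x_seq, y_seq):
        m = x_seq.new_zeros(x_seq.size(0), self.d_mem)
        preds = []
        for t in range(x_seq.size(1)):
            y_hat = self.base(torch.cat([x_seq[:, t], m], dim=-1))
            preds.append(y_hat)
            resid = (y_seq[:, t] - y_hat).detach()   # 残差作为波动/漂移信号
            m = self.mem_cell(resid, m)
        return torch.stack(preds, dim=1)

model = TinyVARNN(d_in=4)
x, y = torch.randn(2, 10, 4), torch.randn(2, 10, 1)
loss = nn.functional.mse_loss(model(x, y), y)
loss.backward()
```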


【21】Sparse components distinguish visual pathways & their alignment to neural networks
标题:稀疏成分区分视觉路径及其与神经网络的对齐
链接:https://arxiv.org/abs/2510.08858

作者:Ammar I Marvi, Nancy G Kanwisher, Meenakshi Khosla
摘要:The ventral, dorsal, and lateral streams in high-level human visual cortex are implicated in distinct functional processes. Yet, deep neural networks (DNNs) trained on a single task model the entire visual system surprisingly well, hinting at common computational principles across these pathways. To explore this inconsistency, we applied a novel sparse decomposition approach to identify the dominant components of visual representations within each stream. Consistent with traditional neuroscience research, we find a clear difference in component response profiles across the three visual streams -- identifying components selective for faces, places, bodies, text, and food in the ventral stream; social interactions, implied motion, and hand actions in the lateral stream; and some less interpretable components in the dorsal stream. Building on this, we introduce Sparse Component Alignment (SCA), a new method for measuring representational alignment between brains and machines that better captures the latent neural tuning of these two visual systems. Using SCA, we find that standard visual DNNs are more aligned with the ventral than either dorsal or lateral representations. SCA reveals these distinctions with greater resolution than conventional population-level geometry, offering a measure of representational alignment that is sensitive to a system's underlying axes of neural tuning.


【22】TAPAS: Datasets for Learning the Learning with Errors Problem
标题:TAPAS:用于学习带误差学习(LWE)问题的数据集
链接:https://arxiv.org/abs/2510.08797

作者:Eshika Saxena, Alberto Alfarano, François Charton, Emily Wenger, Kristin Lauter
摘要:AI-powered attacks on Learning with Errors (LWE), an important hard math problem in post-quantum cryptography, rival or outperform "classical" attacks on LWE under certain parameter settings. Despite the promise of this approach, a dearth of accessible data limits AI practitioners' ability to study and improve these attacks. Creating LWE data for AI model training is time- and compute-intensive and requires significant domain expertise. To fill this gap and accelerate AI research on LWE attacks, we propose the TAPAS datasets, a Toolkit for Analysis of Post-quantum cryptography using AI Systems. These datasets cover several LWE settings and can be used off-the-shelf by AI practitioners to prototype new approaches to cracking LWE. This work documents TAPAS dataset creation, establishes attack performance baselines, and lays out directions for future work.
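
LWE实例本身的构造并不复杂,下面用numpy演示标准的 b = A·s + e (mod q) 采样;参数取值仅作示意,与TAPAS各子数据集的实际设置无关:
```python
import numpy as np

def gen_lwe(n=64, q=3329, m=256, sigma=3.2, seed=0):
    """生成一组 LWE 样本 (A, b),其中 b = A @ s + e mod q(示意参数)"""
    rng = np.random.default_rng(seed)
    s = rng.integers(0, q, size=n)                                 # 秘密向量
    A = rng.integers(0, q, size=(m, n))                            # 公开随机矩阵
    e = np.rint(rng.normal(0.0, sigma, size=m)).astype(np.int64)   # 近似离散高斯噪声
    b = (A @ s + e) % q
    return A, b, s

A, b, s = gen_lwe()
print(A.shape, b.shape)   # (256, 64) (256,)
```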


【23】PO-CKAN:Physics Informed Deep Operator Kolmogorov Arnold Networks with Chunk Rational Structure
标题:PO-CKAN:具有分块有理结构的物理信息深度算子Kolmogorov Arnold网络
链接:https://arxiv.org/abs/2510.08795

作者:Junyi Wu, Guang Lin
摘要:We propose PO-CKAN, a physics-informed deep operator framework based on Chunkwise Rational Kolmogorov-Arnold Networks (KANs), for approximating the solution operators of partial differential equations. This framework leverages a Deep Operator Network (DeepONet) architecture that incorporates Chunkwise Rational Kolmogorov-Arnold Network (CKAN) sub-networks for enhanced function approximation. The principles of Physics-Informed Neural Networks (PINNs) are integrated into the operator learning framework to enforce physical consistency. This design enables the efficient learning of physically consistent spatio-temporal solution operators and allows for rapid prediction for parametric time-dependent PDEs with varying inputs (e.g., parameters, initial/boundary conditions) after training. Validated on challenging benchmark problems, PO-CKAN demonstrates accurate operator learning with results closely matching high-fidelity solutions. PO-CKAN adopts a DeepONet-style branch-trunk architecture with its sub-networks instantiated as rational KAN modules, and enforces physical consistency via a PDE residual (PINN-style) loss. On Burgers' equation with $\nu=0.01$, PO-CKAN reduces the mean relative $L^2$ error by approximately 48% compared to PI-DeepONet, and achieves competitive accuracy on the Eikonal and diffusion-reaction benchmarks.
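
下面给出DeepONet式branch-trunk算子学习的极简骨架,并演示如何用自动微分构造PDE残差(PINN式)损失。此处以普通MLP代替论文中的分块有理KAN子网络,并为简洁省略Burgers方程的粘性项,仅作结构示意:
```python
import torch
import torch.nn as nn

class TinyDeepONet(nn.Module):
    """branch-trunk 骨架示意:u(y) ≈ Σ_k branch_k(u0) · trunk_k(y)"""
    def __init__(self, n_sensors=32, coord_dim=2, p=16, width=64):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(n_sensors, width), nn.Tanh(), nn.Linear(width, p))
        self.trunk = nn.Sequential(nn.Linear(coord_dim, width), nn.Tanh(), nn.Linear(width, p))

    def forward(self, u_sensors, y):
        b = self.branch(u_sensors).unsqueeze(1)   # (B, 1, p)
        t = self.trunk(y)                         # (B, N, p)
        return (b * t).sum(-1)                    # (B, N)

net = TinyDeepONet()
u0 = torch.randn(4, 32)                            # 输入函数在传感器点的取值
y = torch.rand(4, 100, 2, requires_grad=True)      # 查询坐标 (x, t)
u = net(u0, y)
grads = torch.autograd.grad(u.sum(), y, create_graph=True)[0]
u_x, u_t = grads[..., 0], grads[..., 1]
residual = u_t + u * u_x                           # Burgers 残差(为简洁略去粘性项)
loss = (residual ** 2).mean()                      # PINN 式物理一致性损失
```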


【24】Weights initialization of neural networks for function approximation
标题:函数逼近的神经网络权重初始化
链接:https://arxiv.org/abs/2510.08780

作者:Xinwen Hu, Yunqing Huang, Nianyu Yi, Peimeng Yin
备注:19 pages, 10 figures
摘要:Neural network-based function approximation plays a pivotal role in the advancement of scientific computing and machine learning. Yet, training such models faces several challenges: (i) each target function often requires training a new model from scratch; (ii) performance is highly sensitive to architectural and hyperparameter choices; and (iii) models frequently generalize poorly beyond the training domain. To overcome these challenges, we propose a reusable initialization framework based on basis function pretraining. In this approach, basis neural networks are first trained to approximate families of polynomials on a reference domain. Their learned parameters are then used to initialize networks for more complex target functions. To enhance adaptability across arbitrary domains, we further introduce a domain mapping mechanism that transforms inputs into the reference domain, thereby preserving structural correspondence with the pretrained models. Extensive numerical experiments in one- and two-dimensional settings demonstrate substantial improvements in training efficiency, generalization, and model transferability, highlighting the promise of initialization-based strategies for scalable and modular neural function approximation. The full code is made publicly available on Gitee.
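
其中"域映射"可理解为把任意区间线性映射回参考域,使在参考域上预训练的基函数网络参数能直接复用作初始化。极简示意如下(参数复用方式为假设,如两网络结构一致时直接拷贝state_dict):
```python
import numpy as np

def to_reference(x, a, b):
    """把区间 [a, b] 上的输入线性映射到参考域 [-1, 1]"""
    return 2.0 * (np.asarray(x, dtype=float) - a) / (b - a) - 1.0

# 初始化复用示意(假设两者结构一致):new_net.load_state_dict(basis_net.state_dict())
x = np.linspace(3.0, 7.0, 5)
print(to_reference(x, 3.0, 7.0))   # [-1.  -0.5  0.   0.5  1. ]
```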


【25】Prioritizing Latency with Profit: A DRL-Based Admission Control for 5G Network Slices
标题:兼顾利润、优先保障时延:面向5G网络切片的基于DRL的准入控制
链接:https://arxiv.org/abs/2510.08769

作者:Proggya Chakraborty, Aaquib Asrar, Jayasree Sengupta, Sipra Das Bit
备注:6 pages, 8 figures, Accepted for publication at IEEE ANTS 2025
摘要:5G networks enable diverse services such as eMBB, URLLC, and mMTC through network slicing, necessitating intelligent admission control and resource allocation to meet stringent QoS requirements while maximizing Network Service Provider (NSP) profits. However, existing Deep Reinforcement Learning (DRL) frameworks focus primarily on profit optimization without explicitly accounting for service delay, potentially leading to QoS violations for latency-sensitive slices. Moreover, the commonly used epsilon-greedy exploration of DRL often results in unstable convergence and suboptimal policy learning. To address these gaps, we propose DePSAC -- a Delay and Profit-aware Slice Admission Control scheme. Our DRL-based approach incorporates a delay-aware reward function, where penalties due to service delay incentivize the prioritization of latency-critical slices such as URLLC. Additionally, we employ Boltzmann exploration to achieve smoother and faster convergence. We implement and evaluate DePSAC on a simulated 5G core network substrate with realistic Network Slice Request (NSLR) arrival patterns. Experimental results demonstrate that our method outperforms the DSARA baseline in terms of overall profit, reduced URLLC slice delays, improved acceptance rates, and more efficient resource consumption. These findings validate the effectiveness of the proposed DePSAC in achieving better QoS-profit trade-offs for practical 5G network slicing scenarios.
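
摘要中的两个关键设计(时延感知奖励与Boltzmann探索)都可以用几行代码说明。下面是玩具版本,奖励形式、惩罚系数与温度均为假设,并非论文的精确定义:
```python
import numpy as np

def delay_aware_reward(profit, delay, deadline, penalty=2.0):
    """示意:利润减去超时惩罚,使时延敏感切片(如URLLC)被优先接纳"""
    return profit - penalty * max(0.0, delay - deadline)

def boltzmann_action(q_values, tau=0.5, rng=None):
    """Boltzmann(softmax)探索:比 epsilon-greedy 的均匀随机探索更平滑"""
    if rng is None:
        rng = np.random.default_rng()
    z = (q_values - q_values.max()) / tau   # 减去最大值保证数值稳定
    p = np.exp(z)
    p /= p.sum()
    return rng.choice(len(q_values), p=p)

q = np.array([1.0, 1.2, 0.3])   # 示意动作:接受URLLC请求 / 接受eMBB请求 / 拒绝
print(boltzmann_action(q))
```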


【26】In-Context Learning for Non-Stationary MIMO Equalization
标题:用于非平稳MIMO均衡的上下文学习
链接:https://arxiv.org/abs/2510.08711

作者:Jiachen Jiang, Zhen Qin, Zhihui Zhu
摘要:Channel equalization is fundamental for mitigating distortions such as frequency-selective fading and inter-symbol interference. Unlike standard supervised learning approaches that require costly retraining or fine-tuning for each new task, in-context learning (ICL) adapts to new channels at inference time with only a few examples. However, existing ICL-based equalizers are primarily developed for and evaluated on static channels within the context window. Indeed, to our knowledge, prior principled analyses and theoretical studies of ICL focus exclusively on the stationary setting, where the function remains fixed within the context. In this paper, we investigate the ability of ICL to address non-stationary problems through the lens of time-varying channel equalization. We employ a principled framework for designing efficient attention mechanisms with improved adaptivity in non-stationary tasks, leveraging algorithms from adaptive signal processing to guide better designs. For example, new attention variants can be derived from the Least Mean Square (LMS) adaptive algorithm, a Least Root Mean Square (LRMS) formulation for enhanced robustness, or multi-step gradient updates for improved long-term tracking. Experimental results demonstrate that ICL holds strong promise for non-stationary MIMO equalization, and that attention mechanisms inspired by classical adaptive algorithms can substantially enhance adaptability and performance in dynamic environments. Our findings may provide critical insights for developing next-generation wireless foundation models with stronger adaptability and robustness.
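
作为背景,下面给出文中用作设计蓝本之一的经典LMS自适应均衡器的numpy示意;信道、步长与迭代次数均为随意取值,仅演示"用误差信号在线更新均衡器"这一思想:
```python
import numpy as np

def lms_step(W, x, d, mu=0.02):
    """复数 LMS 一步更新:W <- W + mu * err * x^H"""
    err = d - W @ x
    W += mu * np.outer(err, x.conj())
    return W, err

rng = np.random.default_rng(0)
H = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))   # 未知 2x2 MIMO 信道
W = np.zeros((2, 2), dtype=complex)
qpsk = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j])
for _ in range(2000):
    d = rng.choice(qpsk, size=2)              # 已知导频符号
    x = H @ d + 0.01 * rng.normal(size=2)     # 接收信号
    W, _ = lms_step(W, x, d)
print(np.round(W @ H, 2))   # 若收敛,W @ H 应接近单位阵
```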


【27】FreqCa: Accelerating Diffusion Models via Frequency-Aware Caching
标题:FreqCa:通过频率感知缓存加速扩散模型
链接:https://arxiv.org/abs/2510.08669

作者:Jiacheng Liu, Peiliang Cai, Qinming Zhou, Yuqi Lin, Deyang Kong, Benhao Huang, Yupei Pan, Haowen Xu, Chang Zou, Junshu Tang, Shikang Zheng, Linfeng Zhang
备注:15 pages, 11 figures
摘要:The application of diffusion transformers suffers from significant inference costs. Recently, feature caching has been proposed to solve this problem by reusing features from previous timesteps, thereby skipping computation in future timesteps. However, previous feature caching assumes that features in adjacent timesteps are similar or continuous, which does not always hold in all settings. To investigate this, this paper begins with an analysis from the frequency domain, which reveals that different frequency bands in the features of diffusion models exhibit different dynamics across timesteps. Concretely, low-frequency components, which decide the structure of images, exhibit higher similarity but poor continuity. In contrast, the high-frequency bands, which decode the details of images, show significant continuity but poor similarity. These interesting observations motivate us to propose Frequency-aware Caching (FreqCa), which directly reuses features of low-frequency components based on their similarity, while using a second-order Hermite interpolator to predict the volatile high-frequency ones based on their continuity. Besides, we further propose to cache the Cumulative Residual Feature (CRF) instead of the features in all the layers, which reduces the memory footprint of feature caching by 99%. Extensive experiments on FLUX.1-dev, FLUX.1-Kontext-dev, Qwen-Image, and Qwen-Image-Edit demonstrate its effectiveness in both generation and editing. Codes are available in the supplementary materials and will be released on GitHub.
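
下面用torch.fft给出按频段区分的缓存策略的极简示意(非官方实现):低频分量因相似性高而直接复用上一步,高频分量用一阶线性外推近似文中的二阶Hermite插值,截断频率为假设值:
```python
import torch

def split_bands(feat, cutoff):
    """按频率把特征分为低频与高频两部分"""
    F = torch.fft.rfft(feat, dim=-1)
    low = F.clone()
    low[..., cutoff:] = 0
    return low, F - low

def freq_aware_predict(feat_prev2, feat_prev1, cutoff=8):
    low1, high1 = split_bands(feat_prev1, cutoff)
    _, high2 = split_bands(feat_prev2, cutoff)
    high_pred = high1 + (high1 - high2)   # 高频:线性外推(文中为二阶Hermite插值)
    return torch.fft.irfft(low1 + high_pred, n=feat_prev1.size(-1), dim=-1)  # 低频:直接复用

f_t2, f_t1 = torch.randn(2, 64), torch.randn(2, 64)
print(freq_aware_predict(f_t2, f_t1).shape)   # torch.Size([2, 64])
```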


【28】Hypothesis Hunting with Evolving Networks of Autonomous Scientific Agents
标题:利用持续演化的自主科学智能体网络进行假设搜寻
链接:https://arxiv.org/abs/2510.08619

作者:Tennison Liu, Silas Ruhrberg Estévez, David L. Bentley, Mihaela van der Schaar
摘要:Large-scale scientific datasets -- spanning health biobanks, cell atlases, Earth reanalyses, and more -- create opportunities for exploratory discovery unconstrained by specific research questions. We term this process hypothesis hunting: the cumulative search for insight through sustained exploration across vast and complex hypothesis spaces. To support it, we introduce AScience, a framework modeling discovery as the interaction of agents, networks, and evaluation norms, and implement it as ASCollab, a distributed system of LLM-based research agents with heterogeneous behaviors. These agents self-organize into evolving networks, continually producing and peer-reviewing findings under shared standards of evaluation. Experiments show that such social dynamics enable the accumulation of expert-rated results along the diversity-quality-novelty frontier, including rediscoveries of established biomarkers, extensions of known pathways, and proposals of new therapeutic targets. While wet-lab validation remains indispensable, our experiments on cancer cohorts demonstrate that socially structured, agentic networks can sustain exploratory hypothesis hunting at scale.


【29】A Multimodal Approach to SME Credit Scoring Integrating Transaction and Ownership Networks
标题:整合交易和所有权网络的中小企业信用评分多模式方法
链接:https://arxiv.org/abs/2510.09407

作者:Sahab Zandi, Kamesh Korangi, Juan C. Moreno-Paredes, María Óskarsdóttir, Christophe Mues, Cristián Bravo
摘要 :Small and Medium-sized Enterprises (SMEs) are known to play a vital role in economic growth, employment, and innovation. However, they tend to face significant challenges in accessing credit due to limited financial histories, collateral constraints, and exposure to macroeconomic shocks. These challenges make an accurate credit risk assessment by lenders crucial, particularly since SMEs frequently operate within interconnected firm networks through which default risk can propagate. This paper presents and tests a novel approach for modelling the risk of SME credit, using a unique large data set of SME loans provided by a prominent financial institution. Specifically, our approach employs Graph Neural Networks to predict SME default using multilayer network data derived from common ownership and financial transactions between firms. We show that combining this information with traditional structured data not only improves application scoring performance, but also explicitly models contagion risk between companies. Further analysis shows how the directionality and intensity of these connections influence financial risk contagion, offering a deeper understanding of the underlying processes. Our findings highlight the predictive power of network data, as well as the role of supply chain networks in exposing SMEs to correlated default risk.


【30】Distributionally robust approximation property of neural networks
标题:神经网络的分布鲁棒逼近特性
链接:https://arxiv.org/abs/2510.09177

作者:Mihriban Ceylan, David J. Prömel
摘要:The universal approximation property uniformly with respect to weakly compact families of measures is established for several classes of neural networks. To that end, we prove that these neural networks are dense in Orlicz spaces, thereby extending classical universal approximation theorems even beyond the traditional $L^p$-setting. The covered classes of neural networks include widely used architectures like feedforward neural networks with non-polynomial activation functions, deep narrow networks with ReLU activation functions and functional input neural networks.


【31】Physically Valid Biomolecular Interaction Modeling with Gauss-Seidel Projection
标题:利用高斯-赛德尔投影进行物理有效的生物分子相互作用建模
链接:https://arxiv.org/abs/2510.08946

作者:Siyuan Chen, Minghao Guo, Caoliwen Wang, Anka He Chen, Yikun Zhang, Jingjing Chai, Yin Yang, Wojciech Matusik, Peter Yichen Chen
摘要:Biomolecular interaction modeling has been substantially advanced by foundation models, yet they often produce all-atom structures that violate basic steric feasibility. We address this limitation by enforcing physical validity as a strict constraint during both training and inference with a unified module. At its core is a differentiable projection that maps the provisional atom coordinates from the diffusion model to the nearest physically valid configuration. This projection is achieved using a Gauss-Seidel scheme, which exploits the locality and sparsity of the constraints to ensure stable and fast convergence at scale. By implicit differentiation to obtain gradients, our module integrates seamlessly into existing frameworks for end-to-end finetuning. With our Gauss-Seidel projection module in place, two denoising steps are sufficient to produce biomolecular complexes that are both physically valid and structurally accurate. Across six benchmarks, our 2-step model achieves the same structural accuracy as state-of-the-art 200-step diffusion baselines, delivering approximately 10 times faster wall-clock speed while guaranteeing physical validity.
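
Gauss-Seidel式约束投影的要点是:顺序扫描约束并立即使用更新后的坐标。下面用numpy对"原子对最小距离"这一立体可行性约束做迭代投影示意;真实模块是可微的、在GPU上批量执行,此处仅演示思想,距离阈值等均为虚构:
```python
import numpy as np

def gauss_seidel_project(x, pairs, d_min=1.2, iters=50):
    """对违反最小距离约束的原子对逐一修正(Gauss-Seidel 扫描)"""
    x = x.copy()
    for _ in range(iters):
        for i, j in pairs:                       # 顺序处理,立即使用最新坐标
            v = x[j] - x[i]
            dist = np.linalg.norm(v)
            if dist < d_min:                     # 存在立体冲突
                corr = 0.5 * (d_min - dist) * v / (dist + 1e-12)
                x[i] -= corr
                x[j] += corr
    return x

coords = np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [2.0, 0.0, 0.0]])
out = gauss_seidel_project(coords, pairs=[(0, 1), (1, 2)])
print(np.linalg.norm(out[1] - out[0]))   # 约 1.2,冲突已消除
```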


【32】Mirror Flow Matching with Heavy-Tailed Priors for Generative Modeling on Convex Domains
标题:具有重尾先验的镜像流匹配用于凸域上的生成建模
链接:https://arxiv.org/abs/2510.08929

作者:Yunrui Guan, Krishnakumar Balasubramanian, Shiqian Ma
摘要:We study generative modeling on convex domains using flow matching and mirror maps, and identify two fundamental challenges. First, standard log-barrier mirror maps induce heavy-tailed dual distributions, leading to ill-posed dynamics. Second, coupling with Gaussian priors performs poorly when matching heavy-tailed targets. To address these issues, we propose Mirror Flow Matching based on a regularized mirror map that controls dual tail behavior and guarantees finite moments, together with coupling to a Student-$t$ prior that aligns with heavy-tailed targets and stabilizes training. We provide theoretical guarantees, including spatial Lipschitzness and temporal regularity of the velocity field, Wasserstein convergence rates for flow matching with Student-$t$ priors and primal-space guarantees for constrained generation, under $\varepsilon$-accurate learned velocity fields. Empirically, our method outperforms baselines in synthetic convex-domain simulations and achieves competitive sample quality on real-world constrained generative tasks.


其他(39篇)

【1】STaTS: Structure-Aware Temporal Sequence Summarization via Statistical Window Merging
标题:STaTS:通过统计窗口合并的结构感知时态序列摘要
链接:https://arxiv.org/abs/2510.09593

作者:Disharee Bhowmick, Ranjith Ramanathan, Sathyanarayanan N. Aakur
备注:10 pages, 5 figures, 4 tables. Under Review
摘要:Time series data often contain latent temporal structure (transitions between locally stationary regimes, repeated motifs, and bursts of variability) that is rarely leveraged in standard representation learning pipelines. Existing models typically operate on raw or fixed-window sequences, treating all time steps as equally informative, which leads to inefficiencies, poor robustness, and limited scalability in long or noisy sequences. We propose STaTS, a lightweight, unsupervised framework for Structure-Aware Temporal Summarization that adaptively compresses both univariate and multivariate time series into compact, information-preserving token sequences. STaTS detects change points across multiple temporal resolutions using a BIC-based statistical divergence criterion, then summarizes each segment using simple functions like the mean or generative models such as GMMs. This process achieves up to 30x sequence compression while retaining core temporal dynamics. STaTS operates as a model-agnostic preprocessor and can be integrated with existing unsupervised time series encoders without retraining. Extensive experiments on 150+ datasets, including classification tasks on the UCR-85, UCR-128, and UEA-30 archives, and forecasting on ETTh1, ETTh2, ETTm1, and Electricity, demonstrate that STaTS enables 85-90% of the full-model performance while offering dramatic reductions in computational cost. Moreover, STaTS improves robustness under noise and preserves discriminative structure, outperforming uniform and clustering-based compression baselines. These results position STaTS as a principled, general-purpose solution for efficient, structure-aware time series modeling.
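
其中基于BIC的变化点检测可用单变量高斯段模型说明:比较"整段一个模型"与"拆成两段"的BIC,若后者更低则判定存在变化点,随后用段内均值等简单统计量做摘要。下面是极简示意(非官方实现,仅单变量、单分辨率):
```python
import numpy as np

def gauss_bic(seg, k_params=2):
    """高斯段模型的 BIC = 2*NLL + k*ln(n)(均值与方差共 2 个参数)"""
    n = len(seg)
    var = seg.var() + 1e-8
    nll = 0.5 * n * (np.log(2 * np.pi * var) + 1.0)
    return 2 * nll + k_params * np.log(n)

def find_change_point(window, margin=4):
    best_bic, best_t = gauss_bic(window), None
    for t in range(margin, len(window) - margin):
        bic = gauss_bic(window[:t]) + gauss_bic(window[t:])
        if bic < best_bic:
            best_bic, best_t = bic, t
    return best_t                    # None 表示窗口内无显著变化点

x = np.concatenate([np.random.normal(0, 1, 50), np.random.normal(4, 1, 50)])
t = find_change_point(x)
print(t, x[:t].mean(), x[t:].mean())   # 变化点应在 50 附近,两段均值约 0 与 4
```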


【2】A methodology for clinically driven interactive segmentation evaluation
标题:一种临床驱动的交互式分割评估方法
链接:https://arxiv.org/abs/2510.09499

作者:Parhom Esmaeili, Virginia Fernandez, Pedro Borges, Eli Gibson, Sebastien Ourselin, M. Jorge Cardoso
备注:10 pages, Medical Image Computing and Computed Assisted Intervention 2025
摘要:Interactive segmentation is a promising strategy for building robust, generalisable algorithms for volumetric medical image segmentation. However, inconsistent and clinically unrealistic evaluation hinders fair comparison and misrepresents real-world performance. We propose a clinically grounded methodology for defining evaluation tasks and metrics, and built a software framework for constructing standardised evaluation pipelines. We evaluate state-of-the-art algorithms across heterogeneous and complex tasks and observe that (i) minimising information loss when processing user interactions is critical for model robustness, (ii) adaptive-zooming mechanisms boost robustness and speed convergence, (iii) performance drops if validation prompting behaviour/budgets differ from training, (iv) 2D methods perform well with slab-like images and coarse targets, but 3D context helps with large or irregularly shaped targets, (v) performance of non-medical-domain models (e.g. SAM2) degrades with poor contrast and complex shapes.


【3】Geodesic Calculus on Latent Spaces
标题:潜空间上的测地演算
链接:https://arxiv.org/abs/2510.09468

作者:Florine Hartwig, Josua Sassen, Juliane Braunsmann, Martin Rumpf, Benedikt Wirth
摘要:Latent manifolds of autoencoders provide low-dimensional representations of data, which can be studied from a geometric perspective. We propose to describe these latent manifolds as implicit submanifolds of some ambient latent space. Based on this, we develop tools for a discrete Riemannian calculus approximating classical geometric operators. These tools are robust against inaccuracies of the implicit representation often occurring in practical examples. To obtain a suitable implicit representation, we propose to learn an approximate projection onto the latent manifold by minimizing a denoising objective. This approach is independent of the underlying autoencoder and supports the use of different Riemannian geometries on the latent manifolds. The framework in particular enables the computation of geodesic paths connecting given end points and shooting geodesics via the Riemannian exponential maps on latent manifolds. We evaluate our approach on various autoencoders trained on synthetic and real data.


【4】Bandits with Single-Peaked Preferences and Limited Resources
标题:具有单峰偏好和有限资源的多臂老虎机
链接:https://arxiv.org/abs/2510.09425

作者:Gur Keinan, Rotem Torkan, Omer Ben-Porat
摘要:We study an online stochastic matching problem in which an algorithm sequentially matches $U$ users to $K$ arms, aiming to maximize cumulative reward over $T$ rounds under budget constraints. Without structural assumptions, computing the optimal matching is NP-hard, making online learning computationally infeasible. To overcome this barrier, we focus on single-peaked preferences, a well-established structure in social choice theory where users' preferences are unimodal with respect to a common order over arms. We devise an efficient algorithm for the offline budgeted matching problem, and leverage it to obtain an efficient online algorithm with a regret of $\tilde O(UKT^{2/3})$. Our approach relies on a novel PQ tree-based order approximation method. If the single-peaked structure is known, we develop an efficient UCB-like algorithm that achieves a regret bound of $\tilde O(U\sqrt{TK})$.


【5】HINT: Helping Ineffective Rollouts Navigate Towards Effectiveness
标题:HINT:帮助无效推演走向有效
链接:https://arxiv.org/abs/2510.09388

作者:Xinyi Wang, Jinyi Han, Zishang Jiang, Tingyun Li, Jiaqing Liang, Sihang Jiang, Zhaoqian Dai, Shuguang Ma, Fei Yu, Yanghua Xiao
摘要:Reinforcement Learning (RL) has become a key driver for enhancing the long chain-of-thought (CoT) reasoning capabilities of Large Language Models (LLMs). However, prevalent methods like GRPO often fail when task difficulty exceeds the model's capacity, leading to reward sparsity and inefficient training. While prior work attempts to mitigate this using off-policy data, such as mixing RL with Supervised Fine-Tuning (SFT) or using hints, they often misguide policy updates. In this work, we identify a core issue underlying these failures, which we term low training affinity. This condition arises from a large distributional mismatch between external guidance and the model's policy. To diagnose this, we introduce Affinity, the first quantitative metric for monitoring exploration efficiency and training stability. To improve Affinity, we propose HINT: Helping Ineffective rollouts Navigate Towards effectiveness, an adaptive hinting framework. Instead of providing direct answers, HINT supplies heuristic hints that guide the model to discover solutions on its own, preserving its autonomous reasoning capabilities. Extensive experiments on mathematical reasoning tasks show that HINT consistently outperforms existing methods, achieving state-of-the-art results with models of various scales, while also demonstrating significantly more stable learning and greater data efficiency. Code is available on GitHub.


【6】Safety Game: Balancing Safe and Informative Conversations with Blackbox Agentic AI using LP Solvers
标题:安全博弈:使用LP求解器平衡黑盒智能体AI对话的安全性与信息量
链接:https://arxiv.org/abs/2510.09330

作者:Tuan Nguyen, Long Tran-Thanh
摘要 :Ensuring that large language models (LLMs) comply with safety requirements is a central challenge in AI deployment. Existing alignment approaches primarily operate during training, such as through fine-tuning or reinforcement learning from human feedback, but these methods are costly and inflexible, requiring retraining whenever new requirements arise. Recent efforts toward inference-time alignment mitigate some of these limitations but still assume access to model internals, which is impractical, and not suitable for third party stakeholders who do not have access to the models. In this work, we propose a model-independent, black-box framework for safety alignment that does not require retraining or access to the underlying LLM architecture. As a proof of concept, we address the problem of trading off between generating safe but uninformative answers versus helpful yet potentially risky ones. We formulate this dilemma as a two-player zero-sum game whose minimax equilibrium captures the optimal balance between safety and helpfulness. LLM agents operationalize this framework by leveraging a linear programming solver at inference time to compute equilibrium strategies. Our results demonstrate the feasibility of black-box safety alignment, offering a scalable and accessible pathway for stakeholders, including smaller organizations and entities in resource-constrained settings, to enforce safety across rapidly evolving LLM ecosystems.
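
矩阵博弈的极大极小混合策略可以化成一个标准线性规划,这正是文中推理时调用LP求解器的用武之处。下面用scipy给出通用求解函数;收益矩阵的数值纯属虚构,仅演示"安全/有用"两种回答风格间的权衡:
```python
import numpy as np
from scipy.optimize import linprog

def solve_zero_sum(A):
    """求行玩家的 minimax 混合策略:max_x min_j (A^T x)_j"""
    n, m = A.shape
    c = np.zeros(n + 1)
    c[-1] = -1.0                                 # 变量 [x, v],min -v 即 max v
    A_ub = np.hstack([-A.T, np.ones((m, 1))])    # 对每列 j:v - (A^T x)_j <= 0
    A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])   # sum(x) = 1
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(m), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * n + [(None, None)])
    return res.x[:n], res.x[-1]

# 行:回答风格(谨慎保守 / 信息充分);列:提问情形(良性 / 有风险),收益为虚构值
payoff = np.array([[0.6, 0.4],
                   [0.2, 1.0]])
x, v = solve_zero_sum(payoff)
print(np.round(x, 3), round(float(v), 3))   # 混合策略约 [0.8 0.2],博弈值约 0.52
```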


【7】Incentivizing Time-Aware Fairness in Data Sharing
标题:激励数据共享中的时间感知公平性
链接:https://arxiv.org/abs/2510.09240

作者:Jiangwei Chen, Kieu Thao Nguyen Pham, Rachael Hwee Ling Sim, Arun Verma, Zhaoxuan Wu, Chuan-Sheng Foo, Bryan Kian Hsiang Low
备注:Accepted to NeurIPS 2025
摘要:In collaborative data sharing and machine learning, multiple parties aggregate their data resources to train a machine learning model with better model performance. However, as the parties incur data collection costs, they are only willing to do so when guaranteed incentives, such as fairness and individual rationality. Existing frameworks assume that all parties join the collaboration simultaneously, which does not hold in many real-world scenarios. Due to the long processing time for data cleaning, difficulty in overcoming legal barriers, or unawareness, the parties may join the collaboration at different times. In this work, we propose the following perspective: As a party who joins earlier incurs higher risk and encourages the contribution from other wait-and-see parties, that party should receive a reward of higher value for sharing data earlier. To this end, we propose a fair and time-aware data sharing framework, including novel time-aware incentives. We develop new methods for deciding reward values to satisfy these incentives. We further illustrate how to generate model rewards that realize the reward values and empirically demonstrate the properties of our methods on synthetic and real-world datasets.


【8】IRIS: An Iterative and Integrated Framework for Verifiable Causal Discovery in the Absence of Tabular Data
标题:IRIS:一种在缺乏表格数据时进行可验证因果发现的迭代集成框架
链接:https://arxiv.org/abs/2510.09217

作者:Tao Feng, Lizhen Qu, Niket Tandon, Gholamreza Haffari
备注:ACL 2025
摘要:Causal discovery is fundamental to scientific research, yet traditional statistical algorithms face significant challenges, including expensive data collection, redundant computation for known relations, and unrealistic assumptions. While recent LLM-based methods excel at identifying commonly known causal relations, they fail to uncover novel relations. We introduce IRIS (Iterative Retrieval and Integrated System for Real-Time Causal Discovery), a novel framework that addresses these limitations. Starting with a set of initial variables, IRIS automatically collects relevant documents, extracts variables, and uncovers causal relations. Our hybrid causal discovery method combines statistical algorithms and LLM-based methods to discover known and novel causal relations. In addition to causal discovery on initial variables, the missing variable proposal component of IRIS identifies and incorporates missing variables to expand the causal graphs. Our approach enables real-time causal discovery from only a set of initial variables without requiring pre-existing datasets.


【9】Robustness and Regularization in Hierarchical Re-Basin
标题:分层Re-Basin中的鲁棒性与正则化
链接:https://arxiv.org/abs/2510.09174

作者:Benedikt Franke, Florian Heinrich, Markus Lange, Arne Raul
备注:Published in 32th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2024
摘要:This paper takes a closer look at Git Re-Basin, an interesting new approach to merge trained models. We propose a hierarchical model merging scheme that significantly outperforms the standard MergeMany algorithm. With our new algorithm, we find that Re-Basin induces adversarial and perturbation robustness into the merged models, with the effect becoming stronger the more models participate in the hierarchical merging scheme. However, in our experiments Re-Basin induces a much bigger performance drop than reported by the original authors.


【10】Logits Replay + MoClip: Stabilized, Low-Cost Post-Training with Minimal Forgetting
标题:Logits Replay + MoClip:稳定、低成本且遗忘最小的后训练
链接:https://arxiv.org/abs/2510.09152

作者:Suming Qiu, Jing Li, Zhicheng Zhou, Junjie Huang, Linyuan Qiu, Zhijie Sun
摘要:Large language models (LLMs) often face a trade-off in post-training: improvements on specialized domains frequently come at the expense of general capabilities. Existing solutions attempt to mitigate this tension via regularization, selective parameter updates, or data-centric replay, but each imposes significant costs in computation, data access, or adaptability. Recent work has shown that training signals can be compressed to subsets of logits without severe accuracy loss, suggesting a path toward efficient adaptation. However, naive truncation destabilizes optimization and exacerbates forgetting. We introduce Logits Replay + MoClip, a two-stage framework that compresses supervision in the logit space and stabilizes optimization at the update level. In Stage 0, we record dynamic Top-K token subsets that cover a probability threshold, always including the gold label. In Stage 1, we replay these compact subsets to compute exact renormalized losses, avoiding full softmax computation and implicitly regularizing. To ensure stability, we design MoClip, an optimizer that caps gradient-momentum rotation and applies an arctan2-based rescaling of updates. Empirically, our method improves domain performance on Communication Technology (CT) and NL2SQL tasks while mitigating forgetting on general benchmarks (MMLU, BBH, GPQA, MATH), and reduces training cost by over 40%. Together, these contributions offer a scalable, architecture-agnostic path for domain adaptation of LLMs without sacrificing generalization.
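
Stage 1中"只在记录的Top-K子集上重归一化并计算精确损失"可以用几行PyTorch说明。下面是极简示意(MoClip优化器从略;gold落在子集内由Stage 0的记录规则保证,这里直接构造满足该条件的输入):
```python
import torch
import torch.nn.functional as F

def replay_loss(logits, topk_idx, gold):
    """在记录的词表子集上做重归一化交叉熵,避免整词表 softmax"""
    sub_logits = logits.gather(-1, topk_idx)                    # (B, K)
    gold_pos = (topk_idx == gold.unsqueeze(-1)).float().argmax(dim=-1)
    return F.cross_entropy(sub_logits, gold_pos)

logits = torch.randn(4, 32000)                 # 当前模型在整个词表上的输出
topk_idx = logits.topk(16, dim=-1).indices     # Stage 0 记录的动态 Top-K 子集(示意)
gold = topk_idx[:, 0]                          # 示意:保证 gold 在子集内
print(replay_loss(logits, topk_idx, gold).item())
```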


【11】A Novel Multi-branch ConvNeXt Architecture for Identifying Subtle Pathological Features in CT Scans
标题:用于识别CT扫描中微妙病理特征的新型多分支ConvNeXt架构
链接:https://arxiv.org/abs/2510.09107

作者:Irash Perera (1), Uthayasanker Thayasivam (1) ((1) Department of Computer Science and Engineering, University of Moratuwa, Colombo, Sri Lanka)
摘要:Intelligent analysis of medical imaging plays a crucial role in assisting clinical diagnosis, especially for identifying subtle pathological features. This paper introduces a novel multi-branch ConvNeXt architecture designed specifically for the nuanced challenges of medical image analysis. While applied here to the specific problem of COVID-19 diagnosis, the methodology offers a generalizable framework for classifying a wide range of pathologies from CT scans. The proposed model incorporates a rigorous end-to-end pipeline, from meticulous data preprocessing and augmentation to a disciplined two-phase training strategy that leverages transfer learning effectively. The architecture uniquely integrates features extracted from three parallel branches: Global Average Pooling, Global Max Pooling, and a new Attention-weighted Pooling mechanism. The model was trained and validated on a combined dataset of 2,609 CT slices derived from two distinct datasets. Experimental results demonstrate a superior performance on the validation set, achieving a final ROC-AUC of 0.9937, a validation accuracy of 0.9757, and an F1-score of 0.9825 for COVID-19 cases, outperforming all previously reported models on this dataset. These findings indicate that a modern, multi-branch architecture, coupled with careful data handling, can achieve performance comparable to or exceeding contemporary state-of-the-art models, thereby proving the efficacy of advanced deep learning techniques for robust medical diagnostics.
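
三个并行分支(全局平均池化、全局最大池化与注意力加权池化)的融合头可以写成如下PyTorch示意;这是按摘要描述的重构(非官方代码),通道数与类别数均为假设:
```python
import torch
import torch.nn as nn

class TriPoolHead(nn.Module):
    """GAP + GMP + 注意力加权池化的三分支融合头(示意)"""
    def __init__(self, c, n_classes=2):
        super().__init__()
        self.attn = nn.Conv2d(c, 1, kernel_size=1)   # 逐位置注意力得分
        self.fc = nn.Linear(3 * c, n_classes)

    def forward(self, feat):                          # feat: (B, C, H, W)
        gap = feat.mean(dim=(2, 3))
        gmp = feat.amax(dim=(2, 3))
        w = torch.softmax(self.attn(feat).flatten(2), dim=-1)   # (B, 1, H*W)
        awp = (feat.flatten(2) * w).sum(-1)           # 注意力加权池化
        return self.fc(torch.cat([gap, gmp, awp], dim=-1))

head = TriPoolHead(c=768)
print(head(torch.randn(2, 768, 7, 7)).shape)   # torch.Size([2, 2])
```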


【12】Neural Codecs as Biosignal Tokenizers
标题:神经编解码器作为生物信号令牌器
链接:https://arxiv.org/abs/2510.09095

作者:Kleanthis Avramidis, Tiantian Feng, Woojae Jeong, Jihwan Lee, Wenhui Cui, Richard M Leahy, Shrikanth Narayanan
备注:25 pages, 7 figures, 10 tables, currently under peer review
摘要:Neurophysiological recordings such as electroencephalography (EEG) offer accessible and minimally invasive means of estimating physiological activity for applications in healthcare, diagnostic screening, and even immersive entertainment. However, these recordings yield high-dimensional, noisy time-series data that typically require extensive pre-processing and handcrafted feature extraction to reveal meaningful information. Recently, there has been a surge of interest in applying representation learning techniques from large pre-trained (foundation) models to effectively decode and interpret biosignals. We discuss the challenges posed for incorporating such methods and introduce BioCodec, an alternative representation learning framework inspired by neural codecs to capture low-level signal characteristics in the form of discrete tokens. Pre-trained on thousands of EEG hours, BioCodec shows efficacy across multiple downstream tasks, ranging from clinical diagnostic tasks and sleep physiology to decoding speech and motor imagery, particularly in low-resource settings. Additionally, we provide a qualitative analysis of codebook usage and estimate the spatial coherence of codebook embeddings from EEG connectivity. Notably, we also document the suitability of our method to other biosignal data, i.e., electromyographic (EMG) signals. Overall, the proposed approach provides a versatile solution for biosignal tokenization that performs competitively with state-of-the-art models. The source code and model checkpoints are shared.


【13】FLToP CTC: Frame-Level Token Pruning via Relative Threshold for Efficient and Memory-Saving Decoding on Diverse Platforms
标题:FLToP CTC:通过相对阈值进行帧级令牌修剪,以在不同平台上进行高效且节省内存的解码
链接:https://arxiv.org/abs/2510.09085

作者:Atul Shree, Harshith Jupuru
备注:5 pages, 5 figures
摘要:CTC-based ASR systems face computational and memory bottlenecks in resource-limited environments. Traditional CTC decoders, requiring up to 90% of processing time in systems (e.g., wav2vec2-large on L4 GPUs), face inefficiencies due to exhaustive token-level operations. This paper introduces Frame Level Token Pruning for Connectionist Temporal Classification (FLToP CTC), a novel decoding algorithm that employs frame-level token pruning guided by a relative threshold probability. By dynamically eliminating low-probability tokens per frame, FLToP CTC reduces compute and memory demands while maintaining negligible WER degradation. On LibriSpeech, FLToP CTC achieves a 10.5x runtime speedup and 2.78x memory reduction versus standard CTC decoders. Its simplicity enables seamless integration into CTC decoders across platforms (CPUs, GPUs, etc.). FLToP CTC addresses CTC bottlenecks, offering scalability for resource-limited environments and realtime applications, enhancing speech recognition accessibility and efficiency.
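
帧级相对阈值剪枝本身只需几行代码:每一帧只保留概率不低于该帧最大概率乘以相对阈值的令牌,其余从候选中剔除后再进入CTC解码。numpy示意如下(阈值为假设值):
```python
import numpy as np

def prune_frame(frame_probs, rel_thresh=0.01):
    """保留 p >= rel_thresh * max(p) 的令牌,返回其索引与概率"""
    keep = frame_probs >= rel_thresh * frame_probs.max()
    idx = np.nonzero(keep)[0]
    return idx, frame_probs[idx]

probs = np.array([0.90, 0.05, 0.03, 0.012, 1e-4, 1e-6])
idx, p = prune_frame(probs)
print(idx)   # [0 1 2 3]:极低概率令牌被剔除,后续解码的计算量随之下降
```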


【14】Auto-scaling Continuous Memory for GUI Agent
标题:面向GUI智能体的自动扩展连续记忆
链接:https://arxiv.org/abs/2510.09038

作者:Wenyi Wu, Kun Zhou, Ruoxin Yuan, Vivian Yu, Stephen Wang, Zhiting Hu, Biwei Huang
摘要:We study how to endow GUI agents with scalable memory that helps generalize across unfamiliar interfaces and long-horizon tasks. Prior GUI agents compress past trajectories into text tokens, which balloons context length and misses decisive visual cues (e.g., exact widget size and position). We propose a continuous memory that encodes each GUI trajectory into a fixed-length sequence of continuous embeddings using the VLM itself as an encoder; these embeddings are plugged directly into the backbone's input layer, sharply reducing context cost while preserving fine-grained visual information. As memory size and retrieval depth increase, performance improves monotonically, unlike text memories that degrade with long prompts. To grow memory at low cost, we introduce an auto-scaling data flywheel that (i) discovers new environments via search, (ii) synthesizes tasks with an open-source VLM, (iii) rolls out trajectories with the agent, and (iv) verifies success with the same VLM. Using this pipeline, we collect 100k+ trajectories for about $4000 and fine-tune only the memory encoder (LoRA on a Q-Former, 1.2% of parameters) with 1,500 samples. On real-world GUI benchmarks, our memory-augmented agent consistently improves success rates under long horizons and distribution shifts. Notably, Qwen-2.5-VL-7B + continuous memory achieves performance comparable to state-of-the-art closed-source models (e.g., GPT-4o, Claude-4).


【15】RefGrader: Automated Grading of Mathematical Competition Proofs using Agentic Workflows
标题:RefGrader:使用智能体工作流对数学竞赛证明进行自动评分
链接:https://arxiv.org/abs/2510.09021

作者:Hamed Mahdavi (1), Pouria Mahdavinia (1), Samira Malek (1), Pegah Mohammadipour (1), Alireza Hashemi (2), Majid Daliri (3), Alireza Farhadi (4), Amir Khasahmadi (5), Niloofar Mireshghallah (6), Vasant Honavar (1) ((1) Pennsylvania State University, (2) City University of New York, (3) New York University, (4) Amirkabir University of Technology, (5) Autodesk, (6) Carnegie Mellon University)
摘要:State-of-the-art (SOTA) LLMs have progressed from struggling on proof-based Olympiad problems to solving most of the IMO 2025 problems, with leading systems reportedly handling 5 of 6 problems. Given this progress, we assess how well these models can grade proofs: detecting errors, judging their severity, and assigning fair scores beyond binary correctness. We study proof-analysis capabilities using a corpus of 90 Gemini 2.5 Pro-generated solutions that we grade on a 1-4 scale with detailed error annotations, and on MathArena solution sets for IMO/USAMO 2025 scored on a 0-7 scale. Our analysis shows that models can reliably flag incorrect (including subtly incorrect) solutions but exhibit calibration gaps in how partial credit is assigned. To address this, we introduce agentic workflows that extract and analyze reference solutions and automatically derive problem-specific rubrics for a multi-step grading process. We instantiate and compare different design choices for the grading workflows, and evaluate their trade-offs. Across our annotated corpus and MathArena, our proposed workflows achieve higher agreement with human grades and more consistent handling of partial credit across metrics. We release all code, data, and prompts/logs to facilitate future research.


【16】MagicDock: Toward Docking-oriented De Novo Ligand Design via Gradient Inversion
标题:MagicDock:通过梯度反演实现面向对接的De Novo配体设计
链接:https://arxiv.org/abs/2510.09020

作者:Zekai Chen, Xunkai Li, Sirui Zhang, Henan Sun, Jia Li, Zhenjun Li, Bing Zhou, Rong-Hua Li, Guoren Wang
备注:52 pages, 14 figures, 12 tables
摘要:De novo ligand design is a fundamental task that seeks to generate protein or molecule candidates that can effectively dock with protein receptors and achieve strong binding affinity entirely from scratch. It holds paramount significance for a wide spectrum of biomedical applications. However, most existing studies are constrained by Pseudo De Novo, Limited Docking Modeling, and Inflexible Ligand Type. To address these issues, we propose MagicDock, a forward-looking framework grounded in the progressive pipeline and differentiable surface modeling. (1) We adopt a well-designed gradient inversion framework. To begin with, general docking knowledge of receptors and ligands is incorporated into the backbone model. Subsequently, the docking knowledge is instantiated as reverse gradient flows by binding prediction, which iteratively guide the de novo generation of ligands. (2) We emphasize differentiable surface modeling in the docking process, leveraging learnable 3D point-cloud representations to precisely capture binding details, thereby ensuring that the generated ligands preserve docking validity through direct and interpretable spatial fingerprints. (3) We introduce customized designs for different ligand types and integrate them into a unified gradient inversion framework with flexible triggers, thereby ensuring broad applicability. Moreover, we provide rigorous theoretical guarantees for each component of MagicDock. Extensive experiments across 9 scenarios demonstrate that MagicDock achieves average improvements of 27.1% and 11.7% over SOTA baselines specialized for protein or molecule ligand design, respectively.
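
"梯度反演"的基本回路可以这样理解:固定一个(预训练好的、可微的)结合亲和力预测器,把它的梯度反传到配体表征上并迭代更新。下面是高度简化的PyTorch示意,其中binding_model、表征维度与步数均为虚构占位,与论文实现无直接对应:
```python
import torch

# 假设的可微结合亲和力预测器(论文中由对接知识预训练得到)
binding_model = torch.nn.Sequential(
    torch.nn.Linear(64 + 32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))

receptor = torch.randn(1, 64)                        # 受体表征(固定不动)
ligand = torch.randn(1, 32, requires_grad=True)      # 配体表征,从随机初始化开始
opt = torch.optim.Adam([ligand], lr=0.05)

for step in range(200):
    affinity = binding_model(torch.cat([receptor, ligand], dim=-1))
    loss = -affinity.mean()      # 最大化预测的结合亲和力
    opt.zero_grad()
    loss.backward()              # 反向梯度流只更新配体,不更新预测器
    opt.step()
```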


【17】PlatformX: An End-to-End Transferable Platform for Energy-Efficient Neural Architecture Search
标题:PlatformX:用于节能神经架构搜索的端到端可迁移平台
链接:https://arxiv.org/abs/2510.08993

作者:Xiaolong Tu, Dawei Chen, Kyungtae Han, Onur Altintas, Haoxin Wang
摘要:Hardware-Aware Neural Architecture Search (HW-NAS) has emerged as a powerful tool for designing efficient deep neural networks (DNNs) tailored to edge devices. However, existing methods remain largely impractical for real-world deployment due to their high time cost, extensive manual profiling, and poor scalability across diverse hardware platforms with complex, device-specific energy behavior. In this paper, we present PlatformX, a fully automated and transferable HW-NAS framework designed to overcome these limitations. PlatformX integrates four key components: (i) an energy-driven search space that expands conventional NAS design by incorporating energy-critical configurations, enabling exploration of high-efficiency architectures; (ii) a transferable kernel-level energy predictor that generalizes across devices and is incrementally refined with minimal on-device samples; (iii) a Pareto-based multi-objective search algorithm that balances energy and accuracy to identify optimal trade-offs; and (iv) a high-resolution runtime energy profiling system that automates on-device power measurement using external monitors without human intervention. We evaluate PlatformX across multiple mobile platforms, showing that it significantly reduces search overhead while preserving accuracy and energy fidelity. It identifies models with up to 0.94 accuracy or as little as 0.16 mJ per inference, both outperforming MobileNet-V2 in accuracy and efficiency. Code and tutorials are available at github.com/amai-gsu/PlatformX.


【18】Uncolorable Examples: Preventing Unauthorized AI Colorization via Perception-Aware Chroma-Restrictive Perturbation
标题:不可着色的示例:通过感知色彩限制扰动来防止未经授权的人工智能着色
链接:https://arxiv.org/abs/2510.08979

作者:Yuki Nii, Futa Waseda, Ching-Chun Chang, Isao Echizen
摘要 :AI-based colorization has shown remarkable capability in generating realistic color images from grayscale inputs. However, it poses risks of copyright infringement -- for example, the unauthorized colorization and resale of monochrome manga and films. Despite these concerns, no effective method currently exists to prevent such misuse. To address this, we introduce the first defensive paradigm, Uncolorable Examples, which embed imperceptible perturbations into grayscale images to invalidate unauthorized colorization. To ensure real-world applicability, we establish four criteria: effectiveness, imperceptibility, transferability, and robustness. Our method, Perception-Aware Chroma-Restrictive Perturbation (PAChroma), generates Uncolorable Examples that meet these four criteria by optimizing imperceptible perturbations with a Laplacian filter to preserve perceptual quality, and applying diverse input transformations during optimization to enhance transferability across models and robustness against common post-processing (e.g., compression). Experiments on ImageNet and Danbooru datasets demonstrate that PAChroma effectively degrades colorization quality while maintaining the visual appearance. This work marks the first step toward protecting visual content from illegitimate AI colorization, paving the way for copyright-aware defenses in generative media.


【19】Diagnosing and Mitigating System Bias in Self-Rewarding RL
标题:诊断和缓解自我奖励RL中的系统偏差
链接:https://arxiv.org/abs/2510.08977

作者:Chuyi Tan, Peiwen Yuan, Xinglin Wang, Yiwei Li, Shaoxiong Feng, Yueqi Zhang, Jiayi Shi, Ji Zhang, Boyuan Pan, Yao Hu, Kan Li
摘要:Reinforcement learning with verifiable rewards (RLVR) scales the reasoning ability of large language models (LLMs) but remains bottlenecked by limited labeled samples for continued data scaling. Reinforcement learning with intrinsic rewards (RLIR), where the policy model assigns rewards to its own rollouts, enables sustainable scaling in unlabeled settings, yet its performance and stability lag behind RLVR. We trace this gap to a system bias: the model tends to overestimate its high-confidence rollouts, leading to biased and unstable reward estimation. This bias accumulates as training progresses, with deviations from the oracle drifting toward over-reward, causing unstable training. We characterize this bias using three metrics: $\rho_{\text{noise}}$, $\rho_{\text{selfbias}}$, and $\rho_{\text{symbias}}$. We find that $\rho_{\text{noise}}$ and $\rho_{\text{symbias}}$ impact convergence, while $\rho_{\text{selfbias}}$ amplifies both correct and incorrect updates, leading to instability. To mitigate this, we propose reinforcement learning with ensembled rewards (RLER), which aggregates diverse models and adapts reward interpolation and rollout selection. Extensive experiments show that RLER improves by +13.6% over RLIR and is only 3.6% below RLVR, achieving stable scaling on unlabeled samples, making it highly applicable.


【20】HiBBO: HiPPO-based Space Consistency for High-dimensional Bayesian Optimisation
标题:HiBBO:基于HiPPO空间一致性的高维贝叶斯优化
链接:https://arxiv.org/abs/2510.08965

作者:Junyu Xuan, Wenlong Chen, Yingzhen Li
摘要:Bayesian Optimisation (BO) is a powerful tool for optimising expensive blackbox functions, but its effectiveness diminishes in high-dimensional spaces due to sparse data and poor surrogate model scalability. While Variational Autoencoder (VAE) based approaches address this by learning low-dimensional latent representations, the reconstruction-based objective function often brings a functional distribution mismatch between the latent space and the original space, leading to suboptimal optimisation performance. In this paper, we first analyse why reconstruction-only loss may lead to distribution mismatch, and then propose HiBBO, a novel BO framework that introduces space consistency into the latent space construction in VAE using HiPPO, a method for long-term sequence modelling, to reduce the functional distribution mismatch between the latent space and the original space. Experiments on high-dimensional benchmark tasks demonstrate that HiBBO outperforms existing VAE-BO methods in convergence speed and solution quality. Our work bridges the gap between high-dimensional sequence representation learning and efficient Bayesian Optimisation, enabling broader applications in neural architecture search, materials science and beyond.


【21】Denoised Diffusion for Object-Focused Image Augmentation
标题:目标聚焦图像增强的去噪扩散
链接:https://arxiv.org/abs/2510.08955

作者:Nisha Pillai, Aditi Virupakshaiah, Harrison W. Smith, Amanda J. Ashworth, Prasanna Gowda, Phillip R. Owens, Adam R. Rivers, Bindu Nanduri, Mahalingam Ramkumar
摘要:Modern agricultural operations increasingly rely on integrated monitoring systems that combine multiple data sources for farm optimization. Aerial drone-based animal health monitoring serves as a key component but faces limited data availability, compounded by scene-specific issues such as small, occluded, or partially visible animals. Transfer learning approaches often fail to address this limitation due to the unavailability of large datasets that reflect specific farm conditions, including variations in animal breeds, environments, and behaviors. Therefore, there is a need for developing a problem-specific, animal-focused data augmentation strategy tailored to these unique challenges. To address this gap, we propose an object-focused data augmentation framework designed explicitly for animal health monitoring in constrained data settings. Our approach segments animals from backgrounds and augments them through transformations and diffusion-based synthesis to create realistic, diverse scenes that enhance animal detection and monitoring performance. Our initial experiments demonstrate that our augmented dataset yields superior performance compared to our baseline models on the animal detection task. By generating domain-specific data, our method empowers real-time animal health monitoring solutions even in data-scarce scenarios, bridging the gap between limited data and practical applicability.


【22】CommandSans: Securing AI Agents with Surgical Precision Prompt Sanitization
标题:CommandSans:以手术级精度的提示词净化保护AI智能体
链接:https://arxiv.org/abs/2510.08829

作者:Debeshee Das, Luca Beurer-Kellner, Marc Fischer, Maximilian Baader
摘要 :The increasing adoption of LLM agents with access to numerous tools and sensitive data significantly widens the attack surface for indirect prompt injections. Due to the context-dependent nature of attacks, however, current defenses are often ill-calibrated as they cannot reliably differentiate malicious and benign instructions, leading to high false positive rates that prevent their real-world adoption. To address this, we present a novel approach inspired by the fundamental principle of computer security: data should not contain executable instructions. Instead of sample-level classification, we propose a token-level sanitization process, which surgically removes any instructions directed at AI systems from tool outputs, capturing malicious instructions as a byproduct. In contrast to existing safety classifiers, this approach is non-blocking, does not require calibration, and is agnostic to the context of tool outputs. Further, we can train such token-level predictors with readily available instruction-tuning data only, and don't have to rely on unrealistic prompt injection examples from challenges or of other synthetic origin. In our experiments, we find that this approach generalizes well across a wide range of attacks and benchmarks like AgentDojo, BIPIA, InjecAgent, ASB and SEP, achieving a 7-10x reduction of attack success rate (ASR) (34% to 3% on AgentDojo), without impairing agent utility in both benign and malicious settings.


【23】Humanoid Everyday: A Comprehensive Robotic Dataset for Open-World Humanoid Manipulation
标题:Humanoid Everyday:开放世界类人操作的综合机器人数据集
链接:https://arxiv.org/abs/2510.08807

作者:Zhenyu Zhao, Hongyi Jing, Xiawei Liu, Jiageng Mao, Abha Jha, Hanwen Yang, Rong Xue, Sergey Zakharor, Vitor Guizilini, Yue Wang
摘要:From locomotion to dexterous manipulation, humanoid robots have made remarkable strides in demonstrating complex full-body capabilities. However, the majority of current robot learning datasets and benchmarks mainly focus on stationary robot arms, and the few existing humanoid datasets are either confined to fixed environments or limited in task diversity, often lacking human-humanoid interaction and lower-body locomotion. Moreover, there are few standardized evaluation platforms for benchmarking learning-based policies on humanoid data. In this work, we present Humanoid Everyday, a large-scale and diverse humanoid manipulation dataset characterized by extensive task variety involving dexterous object manipulation, human-humanoid interaction, locomotion-integrated actions, and more. Leveraging a highly efficient human-supervised teleoperation pipeline, Humanoid Everyday aggregates high-quality multimodal sensory data, including RGB, depth, LiDAR, and tactile inputs, together with natural language annotations, comprising 10.3k trajectories and over 3 million frames of data across 260 tasks in 7 broad categories. In addition, we conduct an analysis of representative policy learning methods on our dataset, providing insights into their strengths and limitations across different task categories. For standardized evaluation, we introduce a cloud-based evaluation platform that allows researchers to seamlessly deploy their policies in our controlled setting and receive performance feedback. By releasing Humanoid Everyday along with our policy learning analysis and a standardized cloud-based evaluation platform, we intend to advance research in general-purpose humanoid manipulation and lay the groundwork for more capable and embodied robotic agents in real-world scenarios. Our dataset, data collection code, and cloud evaluation website are made publicly available on our project website.


【24】Man-Made Heuristics Are Dead. Long Live Code Generators!
标题:人造启发法已经消亡。代码生成器万岁!
链接:https://arxiv.org/abs/2510.08803

作者:Rohit Dwivedula, Divyanshu Saxena, Aditya Akella, Swarat Chaudhuri, Daehyeok Kim
备注:10 pages, 2 figures, 2 tables. To be presented at HotNets 2025
摘要:Policy design for various systems controllers has conventionally been a manual process, with domain experts carefully tailoring heuristics for the specific instance in which the policy will be deployed. In this paper, we re-imagine policy design via a novel automated search technique fueled by recent advances in generative models, specifically Large Language Model (LLM)-driven code generation. We outline the design and implementation of PolicySmith, a framework that applies LLMs to synthesize instance-optimal heuristics. We apply PolicySmith to two long-standing systems policies - web caching and congestion control, highlighting the opportunities unraveled by this LLM-driven heuristic search. For caching, PolicySmith discovers heuristics that outperform established baselines on standard open-source traces. For congestion control, we show that PolicySmith can generate safe policies that integrate directly into the Linux kernel.


【25】Edu-EmotionNet: Cross-Modality Attention Alignment with Temporal Feedback Loops
Title: Edu-EmotionNet: Cross-Modality Attention Alignment with Temporal Feedback Loops
Link: https://arxiv.org/abs/2510.08802

Authors: S M Rafiuddin
Note: 6 pages, 6 figures, 3 tables. Accepted as a regular research paper at ICMLA 2025
Abstract: Understanding learner emotions in online education is critical for improving engagement and personalized instruction. While prior work in emotion recognition has explored multimodal fusion and temporal modeling, existing methods often rely on static fusion strategies and assume that modality inputs are consistently reliable, which is rarely the case in real-world learning environments. We introduce Edu-EmotionNet, a novel framework that jointly models temporal emotion evolution and modality reliability for robust affect recognition. Our model incorporates three key components: a Cross-Modality Attention Alignment (CMAA) module for dynamic cross-modal context sharing, a Modality Importance Estimator (MIE) that assigns confidence-based weights to each modality at every time step, and a Temporal Feedback Loop (TFL) that leverages previous predictions to enforce temporal consistency. Evaluated on educational subsets of IEMOCAP and MOSEI, re-annotated for confusion, curiosity, boredom, and frustration, Edu-EmotionNet achieves state-of-the-art performance and demonstrates strong robustness to missing or noisy modalities. Visualizations confirm its ability to capture emotional transitions and adaptively prioritize reliable signals, making it well suited for deployment in real-time learning systems.
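
As a minimal sketch of confidence-based modality weighting in the spirit of the Modality Importance Estimator (the layer sizes and gating design below are illustrative assumptions, not the paper's architecture):

```python
import torch
import torch.nn as nn

class ModalityImportance(nn.Module):
    """Assign a confidence score to each modality at a time step and fuse
    features with the resulting softmax weights (illustrative sketch)."""
    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # one confidence score per modality

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, n_modalities, dim) for a single time step
        conf = self.scorer(feats).squeeze(-1)          # (batch, n_modalities)
        weights = torch.softmax(conf, dim=-1)          # confidence-based weights
        return (weights.unsqueeze(-1) * feats).sum(1)  # fused representation

fused = ModalityImportance(dim=64)(torch.randn(8, 3, 64))
print(fused.shape)  # torch.Size([8, 64])
```

A noisy or missing modality would receive a low score and contribute little to the fused representation, which is how such gating confers robustness.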


【26】Deceptive Exploration in Multi-armed Bandits
Title: Deceptive Exploration in Multi-armed Bandits
Link: https://arxiv.org/abs/2510.08794

Authors: I. Arda Vurankaya, Mustafa O. Karabag, Wesley A. Suttle, Jesse Milzman, David Fridovich-Keil, Ufuk Topcu
Abstract: We consider a multi-armed bandit setting in which each arm has a public and a private reward distribution. An observer expects an agent to follow Thompson Sampling according to the public rewards; however, the deceptive agent aims to quickly identify the best private arm without being noticed. The observer can observe the public rewards and the pulled arms, but not the private rewards. The agent, on the other hand, observes both the public and private rewards. We formalize detectability as a stepwise Kullback-Leibler (KL) divergence constraint between the actual pull probabilities used by the agent and the anticipated pull probabilities by the observer. We model successful pulling of public suboptimal arms as a Bernoulli process where the success probability decreases with each successful pull, and show these pulls can happen at most at a $\Theta(\sqrt{T})$ rate under the KL constraint. We then formulate a maximin problem based on public and private means, whose solution characterizes the optimal error exponent for best private arm identification. We finally propose an algorithm inspired by top-two algorithms. This algorithm naturally adapts its exploration according to the hardness of pulling arms based on the public suboptimality gaps. We provide numerical examples illustrating the $\Theta(\sqrt{T})$ rate and the behavior of the proposed algorithm.
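
Schematically, the detectability constraint can be written as follows, where $p_t$ is the agent's actual pull distribution at round $t$ and $q_t$ the Thompson-Sampling distribution the observer anticipates (notation ours):

```latex
D_{\mathrm{KL}}\left(p_t \,\middle\|\, q_t\right)
  = \sum_{a} p_t(a)\,\log\frac{p_t(a)}{q_t(a)} \;\le\; \epsilon
  \qquad \text{for every round } t.
```

Under this per-round budget, the result above says the agent can steer pulls toward publicly suboptimal arms only at a $\Theta(\sqrt{T})$ rate.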


【27】Transmuting prompts into weights
Title: Transmuting Prompts into Weights
Link: https://arxiv.org/abs/2510.08734

Authors: Hanna Mazzawi, Benoit Dherin, Michael Munn, Michael Wunder, Javier Gonzalvo
Abstract: A growing body of research has demonstrated that the behavior of large language models can be effectively controlled at inference time by directly modifying their internal states, either through vector additions to their activations or through updates to their weight matrices. These techniques, while powerful, are often guided by empirical heuristics, such as deriving steering vectors from the average activations of contrastive prompts. This work provides a theoretical foundation for these interventions, explaining how they emerge from the fundamental computations of the transformer architecture. Building on the recent finding that a prompt's influence can be mathematically mapped to implicit weight updates (Dherin et al., 2025), we generalize this theory to deep, multi-block transformers. We show how the information contained in any chunk of a user prompt is represented and composed internally through weight vectors and weight matrices. We then derive a principled method for condensing this information into token-independent thought vectors and thought matrices. These constructs provide a theoretical explanation for existing vector- and matrix-based model editing techniques and offer a direct, computationally-grounded method for transmuting textual input into reusable weight updates.
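
The core identity is, schematically, that a transformer block applied to a prompt-prefixed input behaves like the same block with updated weights applied to the input alone; the exact form of the update is derived in the paper, so the following is only a schematic restatement under assumed notation:

```latex
f_{W}\bigl([c;\,x]\bigr) \;=\; f_{W + \Delta W(c)}(x),
```

where $c$ is a chunk of the user prompt, $x$ the remaining input, and $\Delta W(c)$ the token-independent update composed across blocks from the paper's thought vectors and thought matrices.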


【28】Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs
Title: Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs
Link: https://arxiv.org/abs/2510.08726

Authors: Yifan Zhao, Egan Johnson, Prasanth Chatarasi, Vikram Adve, Sasa Misailovic
Abstract: Operator fusion has become a key optimization for deep learning, which combines multiple deep learning operators to improve data reuse and reduce global memory transfers. However, existing tensor compilers struggle to fuse complex reduction computations involving loop-carried dependencies, such as attention mechanisms. The paper introduces Neptune, a tensor compiler for advanced operator fusion for sequences of reduction operators. Neptune presents a new approach for advanced operator fusion, which intentionally breaks some existing dependencies and compensates by constructing algebraic correction expressions that allow the kernel to produce the correct result. On ten attention-based benchmarks, Neptune, starting from simple attention code and a high-level scheduling template, outperforms existing compilers like Triton, TVM, and FlexAttention, including Triton-based implementations of FlashAttention. Across four different GPU architectures from NVIDIA and AMD, Neptune-generated kernels have an average speedup of $1.35\times$ over the next best alternative, demonstrating its effectiveness for deep learning workloads.
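
A concrete example of "break a dependency, then compensate algebraically" is the online-softmax rescaling used by FlashAttention-style kernels; the numpy sketch below illustrates that idea only and is not Neptune's generated code.

```python
import numpy as np

def online_softmax(scores: np.ndarray, chunk: int = 4) -> np.ndarray:
    """Softmax computed by streaming over chunks: the global max/sum
    dependency is broken and repaired with a rescaling correction."""
    m, s = -np.inf, 0.0                      # running max and normalizer
    for i in range(0, len(scores), chunk):
        block = scores[i:i + chunk]
        m_new = max(m, float(block.max()))
        # algebraic correction: rescale the old partial sum to the new max
        s = s * np.exp(m - m_new) + np.exp(block - m_new).sum()
        m = m_new
    return np.exp(scores - m) / s

x = np.random.randn(10)
ref = np.exp(x - x.max()); ref /= ref.sum()
assert np.allclose(online_softmax(x), ref)   # same result, no global pass first
```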


【29】Don't Waste Mistakes: Leveraging Negative RL-Groups via Confidence Reweighting
Title: Don't Waste Mistakes: Leveraging Negative RL-Groups via Confidence Reweighting
Link: https://arxiv.org/abs/2510.08696

Authors: Yunzhen Feng, Parag Jain, Anthony Hartshorn, Yaqi Duan, Julia Kempe
Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a standard recipe for improving large language models (LLMs) on reasoning tasks, with Group Relative Policy Optimization (GRPO) widely used in practice. Yet GRPO wastes substantial compute on negative groups: groups in which no sampled response is correct yield zero advantage and thus no gradient. We ask whether negative groups can be leveraged without extra supervision. Starting from a maximum-likelihood (MLE) objective in reward modeling, we show that the MLE gradient is equivalent to a policy gradient for a modified value function. This value function adds a confidence-weighted penalty on incorrect responses, imposing larger penalties on more confident mistakes. We refer to this as Likelihood Estimation with Negative Samples (LENS). LENS modifies GRPO to assign non-zero, confidence-dependent rewards to incorrect generations, making negative groups informative and converting previously wasted samples into useful gradient updates. On the MATH benchmark with Llama-3.1-8B and Qwen-2.5-3B, the proposed variant consistently outperforms the GRPO baseline, with significant gains on harder items. These results demonstrate a principled and practical way to "rescue" negative groups, improving efficiency and performance in RLVR.
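
Schematically (our notation; see the paper for the exact objective), the reward such a confidence-weighted penalty assigns can be pictured as:

```latex
r(x, y) \;=\;
\begin{cases}
1, & y \text{ correct},\\
-\,\lambda\,\pi_{\theta}(y \mid x), & y \text{ incorrect},
\end{cases}
```

so more confident mistakes incur larger penalties, and groups with no correct response still produce informative, non-zero gradients.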


【30】How Scale Breaks "Normalized Stress" and KL Divergence: Rethinking Quality Metrics
Title: How Scale Breaks "Normalized Stress" and KL Divergence: Rethinking Quality Metrics
Link: https://arxiv.org/abs/2510.08660

Authors: Kiran Smelser, Kaviru Gunaratne, Jacob Miller, Stephen Kobourov
Note: arXiv admin note: substantial text overlap with arXiv:2408.07724
Abstract: Complex, high-dimensional data is ubiquitous across many scientific disciplines, including machine learning, biology, and the social sciences. One of the primary methods of visualizing these datasets is with two-dimensional scatter plots that visually capture some properties of the data. Because visually determining the accuracy of these plots is challenging, researchers often use quality metrics to measure the projection's accuracy and faithfulness to the original data. One of the most commonly employed metrics, normalized stress, is sensitive to uniform scaling (stretching, shrinking) of the projection, despite this act not meaningfully changing anything about the projection. Another quality metric, the Kullback-Leibler (KL) divergence used in the popular t-Distributed Stochastic Neighbor Embedding (t-SNE) technique, is also susceptible to this scale sensitivity. We investigate the effect of scaling on stress and KL divergence analytically and empirically by showing just how much the values change and how this affects dimension reduction technique evaluations. We introduce a simple technique to make both metrics scale-invariant and show that it accurately captures expected behavior on a small benchmark.
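
If the fix amounts to minimizing normalized stress over a uniform scaling of the projection (a natural reading of "scale-invariant", and one that has a closed-form optimum), a minimal numpy sketch looks like this:

```python
import numpy as np
from scipy.spatial.distance import pdist

def scale_invariant_stress(X_high: np.ndarray, X_low: np.ndarray) -> float:
    d = pdist(X_high)                       # pairwise distances in the data
    delta = pdist(X_low)                    # pairwise distances in the plot
    alpha = (d @ delta) / (delta @ delta)   # closed-form optimal uniform scale
    return float(np.sum((d - alpha * delta) ** 2) / np.sum(d ** 2))

rng = np.random.default_rng(0)
X, P = rng.normal(size=(50, 10)), rng.normal(size=(50, 2))
assert np.isclose(scale_invariant_stress(X, P),
                  scale_invariant_stress(X, 7.3 * P))  # stretching no longer matters
```

The invariance follows because scaling the projection by $c$ divides the optimal $\alpha$ by $c$, leaving the residual unchanged.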


【31】Which Is Better For Reducing Outdated and Vulnerable Dependencies: Pinning or Floating?
Title: Which Is Better for Reducing Outdated and Vulnerable Dependencies: Pinning or Floating?
Link: https://arxiv.org/abs/2510.08609

Authors: Imranur Rahman, Jill Marley, William Enck, Laurie Williams
Note: Accepted to ASE 2025
Abstract: Developers consistently use version constraints to specify acceptable versions of the dependencies for their project. Pinning dependencies can reduce the likelihood of breaking changes, but comes with the cost of manually managing the replacement of outdated and vulnerable dependencies. On the other hand, floating can be used to automatically get bug fixes and security fixes, but comes with the risk of breaking changes. Security practitioners advocate pinning dependencies to prevent software supply chain attacks, e.g., malicious package updates. However, since pinning is the tightest version constraint, pinning is the most likely to result in outdated dependencies. Nevertheless, how the likelihood of becoming outdated or vulnerable changes across version constraint types is unknown. The goal of this study is to aid developers in making an informed dependency version constraint choice by empirically evaluating the likelihood of dependencies becoming outdated or vulnerable across version constraint types at scale. In this study, we first identify the trends in dependency version constraint usage and the patterns of version constraint type changes made by developers in the npm, PyPI, and Cargo ecosystems. We then model the dependency state transitions using survival analysis and estimate how the likelihood of becoming outdated or vulnerable changes when using pinning as opposed to the rest of the version constraint types. We observe that among outdated and vulnerable dependencies, the most commonly used version constraint type is floating-minor, with pinning being the next most common. We also find that floating-major is the least likely to result in outdated dependencies and floating-minor is the least likely to result in vulnerable dependencies.
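
To make the constraint-type vocabulary concrete, here is a small sketch using Python's `packaging` library; the mapping of PEP 440 operators onto the paper's categories is our own rough approximation (npm and Cargo use different syntax, e.g. `~1.2.3` and `^1.2.3`).

```python
from packaging.specifiers import SpecifierSet

constraints = {
    "pinning":        SpecifierSet("==1.2.3"),  # exactly one version
    "floating-patch": SpecifierSet("~=1.2.3"),  # >=1.2.3, <1.3.0
    "floating-minor": SpecifierSet("~=1.2"),    # >=1.2, <2.0
    "floating-major": SpecifierSet(">=1.2.3"),  # any newer release
}
for name, spec in constraints.items():
    allowed = [v for v in ("1.2.3", "1.2.9", "1.9.0", "2.0.0") if v in spec]
    print(f"{name:15} {str(spec):10} -> {allowed}")
```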


【32】Evolutionary Computation as Natural Generative AI
Title: Evolutionary Computation as Natural Generative AI
Link: https://arxiv.org/abs/2510.08590

Authors: Yaxin Shi, Abhishek Gupta, Ying Wu, Melvin Wong, Ivor Tsang, Thiago Rios, Stefan Menzel, Bernhard Sendhoff, Yaqing Hou, Yew-Soon Ong
Note: 15 pages, 8 figures
Abstract: Generative AI (GenAI) has achieved remarkable success across a range of domains, but its capabilities remain constrained to statistical models of finite training sets and learning based on local gradient signals. This often results in artifacts that are more derivative than genuinely generative. In contrast, Evolutionary Computation (EC) offers a search-driven pathway to greater diversity and creativity, expanding generative capabilities by exploring uncharted solution spaces beyond the limits of available data. This work establishes a fundamental connection between EC and GenAI, redefining EC as Natural Generative AI (NatGenAI) -- a generative paradigm governed by exploratory search under natural selection. We demonstrate that classical EC with parent-centric operators mirrors conventional GenAI, while disruptive operators enable structured evolutionary leaps, often within just a few generations, to generate out-of-distribution artifacts. Moreover, the methods of evolutionary multitasking provide an unparalleled means of integrating disruptive EC (with cross-domain recombination of evolved features) and moderated selection mechanisms (allowing novel solutions to survive), thereby fostering sustained innovation. By reframing EC as NatGenAI, we emphasize structured disruption and selection pressure moderation as essential drivers of creativity. This perspective extends the generative paradigm beyond conventional boundaries and positions EC as crucial to advancing exploratory design, innovation, scientific discovery, and open-ended generation in the GenAI era.


【33】Limitations of Normalization in Attention Mechanism
Title: Limitations of Normalization in Attention Mechanism
Link: https://arxiv.org/abs/2508.17821

Authors: Timur Mudarisov, Mikhail Burtsev, Tatiana Petrova, Radu State
Note: 10 pages, 4 figures
Abstract: This paper investigates the limitations of normalization in attention mechanisms. We begin with a theoretical framework that enables the identification of the model's selective ability and the geometric separation involved in token selection. Our analysis includes explicit bounds on distances and separation criteria for token vectors under softmax scaling. Through experiments with a pre-trained GPT-2 model, we empirically validate our theoretical results and analyze key behaviors of the attention mechanism. Notably, we demonstrate that as the number of selected tokens increases, the model's ability to distinguish informative tokens declines, often converging toward a uniform selection pattern. We also show that gradient sensitivity under softmax normalization presents challenges during training, especially at low temperature settings. These findings advance the current understanding of softmax-based attention mechanisms and motivate the need for more robust normalization and selection strategies in future attention architectures.
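
A simple calculation consistent with the uniform-convergence finding (our notation): if $k$ selected tokens have logits within $\Delta$ of one another, every softmax weight is squeezed toward $1/k$,

```latex
a_i = \frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}},
\qquad
\frac{1}{k\,e^{\Delta}} \;\le\; a_i \;\le\; \frac{e^{\Delta}}{k},
```

so the largest attainable gap between informative and uninformative tokens shrinks as $k$ grows.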


【34】Reliability Sensitivity with Response Gradient
Title: Reliability Sensitivity with Response Gradient
Link: https://arxiv.org/abs/2510.09315

Authors: Siu-Kui Au, Zi-Jun Cao
Note: 45 pages, 8 figures. Submitted to Structural Safety (Elsevier) on 5 Oct 2025
Abstract: Engineering risk is concerned with the likelihood of failure and the scenarios when it occurs. The sensitivity of failure probability to changes in system parameters is relevant to risk-informed decision making. Computing sensitivity is at least one level more difficult than the probability itself, which is already challenged by a large number of input random variables, rare events and implicit nonlinear 'black-box' response. Finite difference with Monte Carlo probability estimates is spurious, requiring the number of samples to grow with the reciprocal of step size to suppress estimation variance. Many existing works gain efficiency by exploiting a specific class of input variables, sensitivity parameters, or response in its exact or surrogate form. For general systems, this work presents a theory and associated Monte Carlo strategy for computing sensitivity using response values and gradients with respect to sensitivity parameters. It is shown that the sensitivity at a given response threshold can be expressed via the expectation of the response gradient conditional on the threshold. Determining the expectation requires conditioning on the threshold, which is a zero-probability event, but this can be resolved by the concept of kernel smoothing. The proposed method offers sensitivity estimates for all response thresholds generated in a single Monte Carlo run. It is investigated in a number of examples featuring sensitivity parameters of different nature. As response gradients become increasingly available, it is hoped that this work can provide the basis for embedding sensitivity calculations with reliability in the same Monte Carlo run.
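
Schematically (our notation, inferred from the abstract): writing the response as $G(X;\theta)$ and the threshold as $b$, the sensitivity factorizes into a density times a conditional expectation of the response gradient,

```latex
\frac{\partial}{\partial\theta}\, P\bigl(G(X;\theta) > b\bigr)
  \;=\; f_{G}(b)\;\mathbb{E}\!\left[\frac{\partial G}{\partial\theta}\,\middle|\; G = b\right],
```

where the zero-probability conditioning event $\{G = b\}$ is handled in practice by kernel smoothing around the threshold.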


【35】Smart navigation of a gravity-driven glider with adjustable centre-of-mass
Title: Smart navigation of a gravity-driven glider with adjustable centre-of-mass
Link: https://arxiv.org/abs/2510.09250

Authors: X. Jiang, J. Qiu, K. Gustavsson, B. Mehlig, L. Zhao
Note: 13 pages, 8 figures
Abstract: Artificial gliders are designed to disperse as they settle through a fluid, requiring precise navigation to reach target locations. We show that a compact glider settling in a viscous fluid can navigate by dynamically adjusting its centre-of-mass. Using fully resolved direct numerical simulations (DNS) and reinforcement learning, we find two optimal navigation strategies that allow the glider to reach its target location accurately. These strategies depend sensitively on how the glider interacts with the surrounding fluid. The nature of this interaction changes as the particle Reynolds number Re$_p$ changes. Our results explain how the optimal strategy depends on Re$_p$. At large Re$_p$, the glider learns to tumble rapidly by moving its centre-of-mass as its orientation changes. This generates a large horizontal inertial lift force, which allows the glider to travel far. At small Re$_p$, by contrast, high viscosity hinders tumbling. In this case, the glider learns to adjust its centre-of-mass so that it settles with a steady, inclined orientation that results in a horizontal viscous force. The horizontal range is much smaller than for large Re$_p$, because this viscous force is much smaller than the inertial lift force at large Re$_p$.
*These authors contributed equally.


【36】A Representer Theorem for Hawkes Processes via Penalized Least Squares Minimization
Title: A Representer Theorem for Hawkes Processes via Penalized Least Squares Minimization
Link: https://arxiv.org/abs/2510.08916

Authors: Hideaki Kim, Tomoharu Iwata
Note: Under review
Abstract: The representer theorem is a cornerstone of kernel methods, which aim to estimate latent functions in reproducing kernel Hilbert spaces (RKHSs) in a nonparametric manner. Its significance lies in converting inherently infinite-dimensional optimization problems into finite-dimensional ones over dual coefficients, thereby enabling practical and computationally tractable algorithms. In this paper, we address the problem of estimating the latent triggering kernels (functions that encode the interaction structure between events) for linear multivariate Hawkes processes based on observed event sequences within an RKHS framework. We show that, under the principle of penalized least squares minimization, a novel form of representer theorem emerges: a family of transformed kernels can be defined via a system of simultaneous integral equations, and the optimal estimator of each triggering kernel is expressed as a linear combination of these transformed kernels evaluated at the data points. Remarkably, the dual coefficients are all analytically fixed to unity, obviating the need to solve a costly optimization problem to obtain them. This leads to a highly efficient estimator capable of handling large-scale data more effectively than conventional nonparametric approaches. Empirical evaluations on synthetic datasets reveal that the proposed method attains competitive predictive accuracy while substantially improving computational efficiency over existing state-of-the-art kernel method-based estimators.
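
In schematic form (our notation), the resulting estimator for a triggering kernel $\phi$ given event times $\{t_i\}$ is a sum of transformed kernels with every dual coefficient fixed to one,

```latex
\hat{\phi}(t) \;=\; \sum_{i=1}^{n} \tilde{k}(t, t_i),
```

where the transformed kernels $\tilde{k}$ are defined through the paper's system of simultaneous integral equations.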


【37】Gradient-Guided Furthest Point Sampling for Robust Training Set Selection
Title: Gradient-Guided Furthest Point Sampling for Robust Training Set Selection
Link: https://arxiv.org/abs/2510.08906

Authors: Morris Trestman, Stefan Gugler, Felix A. Faber, O. A. von Lilienfeld
Note: 18 pages, 18 figures, journal article
Abstract: Smart training set selection procedures reduce data needs and improve predictive robustness in machine learning problems relevant to chemistry. We introduce Gradient-Guided Furthest Point Sampling (GGFPS), a simple extension of Furthest Point Sampling (FPS) that leverages molecular force norms to guide efficient sampling of the configurational spaces of molecules. Numerical evidence is presented for a toy system (the Styblinski-Tang function) as well as for molecular dynamics trajectories from the MD17 dataset. Compared to FPS and uniform sampling, our numerical results indicate superior data efficiency and robustness when using GGFPS. Distribution analysis of the MD17 data suggests that FPS systematically under-samples equilibrium geometries, resulting in large test errors for relaxed structures. GGFPS cures this artifact and (i) enables up to twofold reductions in training cost without sacrificing predictive accuracy compared to FPS in the 2-dimensional Styblinski-Tang system, (ii) systematically lowers prediction errors for equilibrium as well as strained structures in MD17, and (iii) systematically decreases prediction error variances across all of the MD17 configuration spaces. These results suggest that gradient-aware sampling methods hold great promise as effective training set selection tools, and that naive use of FPS may result in imbalanced training and inconsistent prediction outcomes.
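
A hedged numpy sketch of one plausible gradient-guided FPS variant; the exact weighting GGFPS uses may differ, but here the farthest-point score is simply scaled by the force norm:

```python
import numpy as np

def ggfps(X: np.ndarray, grad_norms: np.ndarray, k: int) -> np.ndarray:
    """Select k indices: farthest-point distances weighted by gradient norms."""
    chosen = [int(np.argmax(grad_norms))]         # start at the highest-force point
    d = np.linalg.norm(X - X[chosen[0]], axis=1)  # distance to the selected set
    for _ in range(k - 1):
        idx = int(np.argmax(d * grad_norms))      # gradient-weighted FPS score
        chosen.append(idx)
        d = np.minimum(d, np.linalg.norm(X - X[idx], axis=1))
    return np.array(chosen)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                 # configurations
forces = np.abs(rng.normal(size=200)) + 0.1   # per-configuration force norms
print(ggfps(X, forces, k=10))
```

Plain FPS is recovered by setting all `grad_norms` to one; the weighting biases selection toward high-force (strained) regions without abandoning geometric coverage.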


【38】QuIRK: Quantum-Inspired Re-uploading KAN
Title: QuIRK: Quantum-Inspired Re-uploading KAN
Link: https://arxiv.org/abs/2510.08650

Authors: Vinayak Sharma, Ashish Padhy, Vijay Jagdish Karanjkar, Sourav Behera, Lord Sen, Shyamapada Mukherjee, Aviral Shrivastava
Abstract: Kolmogorov-Arnold Networks, or KANs, have shown the ability to outperform classical deep neural networks while using far fewer trainable parameters for regression problems in scientific domains. Even more powerful has been their interpretability, owing to a structure composed of univariate B-spline functions. This enables us to derive closed-form equations from trained KANs for a wide range of problems. This paper introduces a quantum-inspired variant of the KAN based on Quantum Data Re-uploading (DR) models. The Quantum-Inspired Re-uploading KAN, or QuIRK model, replaces B-splines with single-qubit DR models as the univariate function approximator, allowing it to match or outperform traditional KANs while using even fewer parameters. This is especially apparent in the case of periodic functions. Additionally, since the model utilizes only single-qubit circuits, it remains classically tractable to simulate, with straightforward GPU acceleration. Finally, we also demonstrate that QuIRK retains the interpretability advantages and the ability to produce closed-form solutions.
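
For context, a single-qubit data re-uploading circuit with $L$ encoding layers is known to realize a truncated Fourier series in its input (Schuld et al., 2021), a natural univariate stand-in for a B-spline and a plausible explanation for the advantage on periodic targets noted above:

```latex
f_{\theta}(x) \;=\; \langle 0|\, U^{\dagger}(x;\theta)\, M\, U(x;\theta)\, |0\rangle
  \;=\; \sum_{k=-L}^{L} c_k(\theta)\, e^{\mathrm{i}kx}.
```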


【39】PyNoetic: A modular python framework for no-code development of EEG brain-computer interfaces
Title: PyNoetic: A modular Python framework for no-code development of EEG brain-computer interfaces
Link: https://arxiv.org/abs/2509.00670

Authors: Gursimran Singh, Aviral Chharia, Rahul Upadhyay, Vinay Kumar, Luca Longo
Note: PLoS One 2025. Project Website: https://neurodiag.github.io/PyNoetic
Abstract: Electroencephalography (EEG)-based Brain-Computer Interfaces (BCIs) have emerged as a transformative technology with applications spanning robotics, virtual reality, medicine, and rehabilitation. However, existing BCI frameworks face several limitations, including a lack of stage-wise flexibility essential for experimental research, steep learning curves for researchers without programming expertise, elevated costs due to reliance on proprietary software, and a lack of all-inclusive features, leading to the use of multiple external tools that can affect research outcomes. To address these challenges, we present PyNoetic, a modular BCI framework designed to cater to the diverse needs of BCI research. PyNoetic is one of the very few frameworks in Python that encompass the entire BCI design pipeline, from stimulus presentation and data acquisition to channel selection, filtering, feature extraction, artifact removal, and finally simulation and visualization. Notably, PyNoetic introduces an intuitive end-to-end GUI coupled with a unique pick-and-place configurable flowchart for no-code BCI design, making it accessible to researchers with minimal programming experience. For advanced users, it facilitates the seamless integration of custom functionalities and novel algorithms with minimal coding, ensuring adaptability at each design stage. PyNoetic also includes a rich array of analytical tools such as machine learning models, brain-connectivity indices, systematic testing functionalities via simulation, and evaluation methods for novel paradigms. PyNoetic's strength lies in its versatility for both offline and real-time BCI development, which streamlines the design process, allowing researchers to focus on the more intricate aspects of BCI development and thus accelerate their research endeavors. Project Website: https://neurodiag.github.io/PyNoetic

