cs.LG: 212 papers today
Large-model related (33 papers)
【1】BLAZER: Bootstrapping LLM-based Manipulation Agents with Zero-Shot Data Generation
Link: https://arxiv.org/abs/2510.08572
Authors: Rocktim Jyoti Das, Harsh Singh, Diana Turmakhan, Muhammad Abdullah Sohail, Mingfei Han, Preslav Nakov, Fabio Pizzati, Ivan Laptev
Comments: 11 pages, 8 figures
Abstract: Scaling data and models has played a pivotal role in the remarkable progress of computer vision and language. Inspired by these domains, recent efforts in robotics have similarly focused on scaling both data and model size to develop more generalizable and robust policies. However, unlike vision and language, robotics lacks access to internet-scale demonstrations across diverse robotic tasks and environments. As a result, the scale of existing datasets typically suffers from the need for manual data collection and curation. To address this problem, we propose BLAZER, a framework that learns manipulation policies from automatically generated training data. We build on the zero-shot capabilities of LLM planners and automatically generate demonstrations for diverse manipulation tasks in simulation. Successful examples are then used to finetune an LLM and to improve its planning capabilities without human supervision. Notably, while BLAZER training requires access to the simulator's state, we demonstrate direct transfer of acquired skills to sensor-based manipulation. Through extensive experiments, we show BLAZER to significantly improve zero-shot manipulation in both simulated and real environments. Moreover, BLAZER improves on tasks outside of its training pool and enables downscaling of LLM models. Our code and data will be made publicly available on the project page.
【2】Improving Reasoning for Diffusion Language Models via Group Diffusion Policy Optimization
Link: https://arxiv.org/abs/2510.08554
Authors: Kevin Rojas, Jiahe Lin, Kashif Rasul, Anderson Schneider, Yuriy Nevmyvaka, Molei Tao, Wei Deng
Abstract: Diffusion language models (DLMs) enable parallel, order-agnostic generation with iterative refinement, offering a flexible alternative to autoregressive large language models (LLMs). However, adapting reinforcement learning (RL) fine-tuning to DLMs remains an open challenge because of the intractable likelihood. Pioneering work such as diffu-GRPO estimated token-level likelihoods via one-step unmasking. While computationally efficient, this approach is severely biased. A more principled foundation lies in sequence-level likelihoods, where the evidence lower bound (ELBO) serves as a surrogate. Yet, despite this clean mathematical connection, ELBO-based methods have seen limited adoption due to the prohibitive cost of likelihood evaluation. In this work, we revisit ELBO estimation and disentangle its sources of variance. This decomposition motivates reducing variance through fast, deterministic integral approximations along a few pivotal dimensions. Building on this insight, we introduce Group Diffusion Policy Optimization (GDPO), a new RL algorithm tailored for DLMs. GDPO leverages simple yet effective Semi-deterministic Monte Carlo schemes to mitigate the variance explosion of ELBO estimators under vanilla double Monte Carlo sampling, yielding a provably lower-variance estimator under tight evaluation budgets. Empirically, GDPO achieves consistent gains over pretrained checkpoints and outperforms diffu-GRPO, one of the state-of-the-art baselines, on the majority of math, reasoning, and coding benchmarks.
【3】Entropy Regularizing Activation: Boosting Continuous Control, Large Language Models, and Image Classification with Activation as Entropy Constraints
Link: https://arxiv.org/abs/2510.08549
Authors: Zilin Kang, Chonghua Liao, Tingqiang Xu, Huazhe Xu
Abstract: We propose ERA, a new paradigm that constrains the sampling entropy above given thresholds by applying specially designed activations to the outputs of models. Our approach demonstrates broad effectiveness across different domains: 1) for large language models (LLMs), boosting the AIME 2025 score for Qwen2.5-Math-7B by 37.4%; 2) for continuous control reinforcement learning agents, improving performance by more than 30% over strong baselines such as SAC on the challenging HumanoidBench; 3) for image classification, enhancing ImageNet top-1 accuracy by 0.69% for ResNet-50. These gains are achieved with a computational overhead of less than 7%. Our work validates output activation as a powerful tool for entropy control, opening a new direction for designing simpler and more robust algorithms.
【4】SPAD: Specialized Prefill and Decode Hardware for Disaggregated LLM Inference
Link: https://arxiv.org/abs/2510.08544
Authors: Hengrui Zhang, Pratyush Patel, August Ning, David Wentzlaff
Abstract: Large Language Models (LLMs) have gained popularity in recent years, driving up the demand for inference. LLM inference is composed of two phases with distinct characteristics: a compute-bound prefill phase followed by a memory-bound decode phase. To efficiently serve LLMs, prior work proposes prefill-decode disaggregation to run each phase on separate hardware. However, existing hardware poorly matches the different requirements of each phase. Current datacenter GPUs and TPUs follow a more-is-better design philosophy that maximizes compute and memory resources, causing memory bandwidth underutilization in the prefill phase and compute underutilization in the decode phase. Such underutilization directly translates into increased serving costs. This paper proposes SPAD (Specialized Prefill and Decode hardware), adopting a less-is-more methodology to design specialized chips tailored to the distinct characteristics of prefill and decode phases. The proposed Prefill Chips have larger systolic arrays and use cost-effective GDDR memory, whereas the proposed Decode Chips retain high memory bandwidth but reduce compute capacity. Compared to modeled H100s, simulations show that the proposed Prefill Chips deliver 8% higher prefill performance on average at 52% lower hardware cost, while the proposed Decode Chips achieve 97% of the decode performance with 28% lower TDP. End-to-end simulations on production traces show that SPAD reduces hardware cost by 19%-41% and TDP by 2%-17% compared to modeled baseline clusters while offering the same performance. Even when models and workloads change, SPAD can reallocate either type of chip to run either phase and still achieve 11%-43% lower hardware costs, demonstrating the longevity of the SPAD design.
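A back-of-the-envelope roofline check can illustrate why prefill tends to be compute-bound and decode memory-bound. The sketch below is not taken from the SPAD paper: it assumes a single d x d weight matrix stored in 16-bit precision, counts only weight traffic (activations and KV cache are ignored), and the function names and hardware numbers are hypothetical round values chosen for illustration.

```python
def arithmetic_intensity(n_tokens, d_model, bytes_per_weight=2):
    """FLOPs per byte of weight traffic for one d_model x d_model matmul.

    Assumption: each of the n_tokens rows is multiplied by the full weight
    matrix (2 * d^2 FLOPs per token), and the weights are read once.
    """
    flops = 2 * n_tokens * d_model * d_model
    bytes_moved = bytes_per_weight * d_model * d_model
    return flops / bytes_moved


def classify(n_tokens, d_model, peak_flops, mem_bandwidth):
    """Compare arithmetic intensity with the hardware ridge point."""
    ridge = peak_flops / mem_bandwidth  # FLOPs per byte the chip can sustain
    if arithmetic_intensity(n_tokens, d_model) >= ridge:
        return "compute-bound"
    return "memory-bound"


# Hypothetical accelerator: 1e15 FLOP/s peak, 3e12 byte/s memory bandwidth.
PEAK_FLOPS = 1e15
MEM_BW = 3e12

prefill = classify(n_tokens=2048, d_model=4096, peak_flops=PEAK_FLOPS, mem_bandwidth=MEM_BW)
decode = classify(n_tokens=1, d_model=4096, peak_flops=PEAK_FLOPS, mem_bandwidth=MEM_BW)
print("prefill (2048-token prompt):", prefill)
print("decode (1 token per step):", decode)
```

Under these assumptions the intensity scales with the number of tokens processed per weight read, which is why a long prompt saturates compute while single-token decoding leaves compute idle. Batching decode requests raises the intensity, so real systems sit somewhere between the two extremes.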
【5】CaRT: Teaching LLM Agents to Know When They Know Enough
Link: https://arxiv.org/abs/2510.08517
Authors: Grace Liu, Yuxiao Qu, Jeff Schneider, Aarti Singh, Aviral Kumar
Abstract: Many tasks require learned models to strategically gather relevant information over multiple rounds of interaction before actually acting on a task. Strategic information gathering requires models to know not only how to effectively acquire information, but also when to stop gathering information and make a decision, in order to avoid overthinking or getting derailed when acting. In this paper, we formalize this problem and introduce Counterfactuals and Reasoning for Termination (CaRT), an approach for teaching LLMs when to stop seeking information. To appropriately learn when to terminate, CaRT fine-tunes LLMs using counterfactual pairs of trajectories, one where termination is appropriate and a minimally modified version of the same trajectory where it is not. It trains the LLM to explain the rationale for the termination decision in either case via verbal reasoning, and imbues this capability into the base LLM via fine-tuning. We instantiate CaRT in two domains: interactive medical diagnosis and math problem solving. In both domains, we find that CaRT improves the efficiency of information gathering and task success rate compared to other fine-tuning methods.
【6】In-Context Clustering with Large Language Models
Link: https://arxiv.org/abs/2510.08466
Authors: Ying Wang, Mengye Ren, Andrew Gordon Wilson
Abstract: We propose In-Context Clustering (ICC), a flexible LLM-based procedure for clustering data from diverse distributions. Unlike traditional clustering algorithms constrained by predefined similarity measures, ICC flexibly captures complex relationships among inputs through an attention mechanism. We show that pretrained LLMs exhibit impressive zero-shot clustering capabilities on text-encoded numeric data, with attention matrices showing salient cluster patterns. Spectral clustering using attention matrices offers surprisingly competitive performance. We further enhance the clustering capabilities of LLMs on numeric and image data through fine-tuning using the Next Token Prediction (NTP) loss. Moreover, the flexibility of LLM prompting enables text-conditioned image clustering, a capability that classical clustering methods lack. Our work extends in-context learning to an unsupervised setting, showcasing the effectiveness and flexibility of LLMs for clustering. Our code is available at https://agenticlearning.ai/icc.
【7】xRouter: Training Cost-Aware LLMs Orchestration System via Reinforcement Learning
Link: https://arxiv.org/abs/2510.08439
Authors: Cheng Qian, Zuxin Liu, Shirley Kokane, Akshara Prabhakar, Jielin Qiu, Haolin Chen, Zhiwei Liu, Heng Ji, Weiran Yao, Shelby Heinecke, Silvio Savarese, Caiming Xiong, Huan Wang
Comments: 24 pages, 4 figures, 2 tables
Abstract: Modern LLM deployments confront a widening cost-performance spectrum: premium models deliver strong reasoning but are expensive, while lightweight models are economical yet brittle on complex tasks. Static escalation rules and keyword heuristics under-utilize this spectrum and fail to adapt across task types. We present xRouter, a tool-calling-based routing system in which a learned router can either answer directly or invoke one or more external models. The router is trained end-to-end with reinforcement learning using an explicit, cost-aware reward that encodes cost-performance trade-offs, eliminating the need for hand-engineered routing rules. Our implementation encompasses the full reinforcement learning framework, including reward and cost accounting, as well as the deployment and evaluation pipelines. Across diverse benchmarks, xRouter achieves strong cost-performance trade-offs (e.g., substantial cost reductions at comparable task completion rates), and provides empirical insights into what reliably helps learned routing and what does not, ranging from model trainability to the difficulty of eliciting sophisticated orchestration behaviors in small open models. We hope these findings and our open implementation will serve as a practical substrate for advancing learned, cost-aware LLM orchestration.
【8】Opponent Shaping in LLM Agents
Link: https://arxiv.org/abs/2510.08255
Authors: Marta Emili Garcia Segura, Stephen Hailes, Mirco Musolesi
Comments: 29 pages, 15 figures, 15 tables
Abstract: Large Language Models (LLMs) are increasingly being deployed as autonomous agents in real-world environments. As these deployments scale, multi-agent interactions become inevitable, making it essential to understand strategic behavior in such systems. A central open question is whether LLM agents, like reinforcement learning agents, can shape the learning dynamics and influence the behavior of others through interaction alone. In this paper, we present the first investigation of opponent shaping (OS) with LLM-based agents. Existing OS algorithms cannot be directly applied to LLMs, as they require higher-order derivatives, face scalability constraints, or depend on architectural components that are absent in transformers. To address this gap, we introduce ShapeLLM, an adaptation of model-free OS methods tailored for transformer-based agents. Using ShapeLLM, we examine whether LLM agents can influence co-players' learning dynamics across diverse game-theoretic environments. We demonstrate that LLM agents can successfully guide opponents toward exploitable equilibria in competitive games (Iterated Prisoner's Dilemma, Matching Pennies, and Chicken) and promote coordination and improve collective welfare in cooperative games (Iterated Stag Hunt and a cooperative version of the Prisoner's Dilemma). Our findings show that LLM agents can both shape and be shaped through interaction, establishing opponent shaping as a key dimension of multi-agent LLM research.
【9】The Hidden Bias: A Study on Explicit and Implicit Political Stereotypes in Large Language Models
Link: https://arxiv.org/abs/2510.08236
Authors: Konrad Löhr, Shuzhou Yuan, Michael Färber
Abstract: Large Language Models (LLMs) are increasingly integral to information dissemination and decision-making processes. Given their growing societal influence, understanding potential biases, particularly within the political domain, is crucial to prevent undue influence on public opinion and democratic processes. This work investigates political bias and stereotype propagation across eight prominent LLMs using the two-dimensional Political Compass Test (PCT). Initially, the PCT is employed to assess the inherent political leanings of these models. Subsequently, persona prompting with the PCT is used to explore explicit stereotypes across various social dimensions. In a final step, implicit stereotypes are uncovered by evaluating models with multilingual versions of the PCT. Key findings reveal a consistent left-leaning political alignment across all investigated models. Furthermore, while the nature and extent of stereotypes vary considerably between models, implicit stereotypes elicited through language variation are more pronounced than those identified via explicit persona prompting. Interestingly, for most models, implicit and explicit stereotypes show a notable alignment, suggesting a degree of transparency or "awareness" regarding their inherent biases. This study underscores the complex interplay of political bias and stereotypes in LLMs.
【10】Enhancing Reasoning for Diffusion LLMs via Distribution Matching Policy Optimization
Link: https://arxiv.org/abs/2510.08233
Authors: Yuchen Zhu, Wei Guo, Jaemoo Choi, Petr Molodyk, Bo Yuan, Molei Tao, Yongxin Chen
Abstract: Diffusion large language models (dLLMs) are promising alternatives to autoregressive large language models (AR-LLMs), as they potentially allow higher inference throughput. Reinforcement learning (RL) is a crucial component for dLLMs to achieve comparable performance with AR-LLMs on important tasks, such as reasoning. However, RL algorithms that are well-suited for dLLMs' unique characteristics have yet to be developed. This paper proposes Distribution Matching Policy Optimization (DMPO), a principled and theoretically grounded RL fine-tuning method specifically designed to enhance the reasoning capabilities of dLLMs by matching the dLLM policy distribution to the optimal, reward-tilted one through cross-entropy optimization. We identify a key challenge in the implementation with a small training batch size and propose several effective solutions through a novel weight baseline subtraction technique. DMPO exhibits superior performance on multiple reasoning benchmarks without supervised fine-tuning, with an accuracy improvement of up to 42.9% over previously SOTA baselines and 55.8% over the base model, underscoring the effectiveness of the distribution matching framework. Our code is available at https://github.com/yuchen-zhu-zyc/DMPO.
【11】Think Just Enough: Sequence-Level Entropy as a Confidence Signal for LLM Reasoning
Link: https://arxiv.org/abs/2510.08146
Authors: Aman Sharma, Paras Chopra
Abstract: We introduce a simple yet novel entropy-based framework to drive token efficiency in large language models during reasoning tasks. Our approach uses Shannon entropy from token-level logprobs as a confidence signal to enable early stopping, achieving 25-50% computational savings while maintaining task accuracy. Crucially, we demonstrate that entropy-based confidence calibration represents an emergent property of advanced post-training optimization present in modern reasoning models but notably absent in standard instruction-tuned and pre-trained models (Llama 3.3 70B). We show that the entropy threshold to stop reasoning varies from model to model but can be calculated easily in one shot using only a few examples from existing reasoning datasets. Our results indicate that advanced reasoning models often know that they've gotten a correct answer early on, and that this emergent confidence awareness can be exploited to save tokens and reduce latency. The framework demonstrates consistent performance across reasoning-optimized model families with 25-50% computational cost reduction while preserving accuracy, revealing that confidence mechanisms represent a distinguishing characteristic of modern post-trained reasoning systems versus their predecessors.
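The idea of using Shannon entropy over token-level logprobs as a stopping signal can be sketched in a few lines. The abstract does not say how per-token entropies are aggregated or how the threshold is chosen, so the choices below are assumptions: top-k logprobs are renormalized, the sequence score is the mean token entropy, and the threshold value is arbitrary.

```python
import math


def token_entropy(logprobs):
    """Shannon entropy (in nats) at one token position.

    Assumption: `logprobs` holds the top-k log-probabilities returned by the
    model; they are renormalized so the truncated distribution sums to 1.
    """
    probs = [math.exp(lp) for lp in logprobs]
    total = sum(probs)
    probs = [p / total for p in probs]
    return -sum(p * math.log(p) for p in probs if p > 0)


def sequence_entropy(per_token_logprobs):
    """Mean token entropy over a span of generated tokens (assumed aggregation)."""
    entropies = [token_entropy(lps) for lps in per_token_logprobs]
    return sum(entropies) / len(entropies)


def should_stop(per_token_logprobs, threshold):
    """Stop reasoning early when the model looks confident (low entropy)."""
    return sequence_entropy(per_token_logprobs) < threshold


# Illustrative inputs: three token positions with top-3 logprobs each.
confident = [[math.log(0.98), math.log(0.01), math.log(0.01)]] * 3
uncertain = [[math.log(1 / 3)] * 3] * 3

print("stop on confident span:", should_stop(confident, threshold=0.5))
print("stop on uncertain span:", should_stop(uncertain, threshold=0.5))
```

In practice the threshold would be calibrated per model, which matches the abstract's remark that it varies from model to model and can be estimated from a few examples.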
【12】Approximate Domain Unlearning for Vision-Language Models
Link: https://arxiv.org/abs/2510.08132
Authors: Kodai Kawamura, Yuta Goto, Rintaro Yanagi, Hirokatsu Kataoka, Go Irie
Comments: NeurIPS 2025 (Spotlight)
Abstract: Pre-trained Vision-Language Models (VLMs) exhibit strong generalization capabilities, enabling them to recognize a wide range of objects across diverse domains without additional training. However, they often retain irrelevant information beyond the requirements of specific downstream tasks, raising concerns about computational efficiency and potential information leakage. This has motivated growing interest in approximate unlearning, which aims to selectively remove unnecessary knowledge while preserving overall model performance. Existing approaches to approximate unlearning have primarily focused on class unlearning, where a VLM is retrained to fail to recognize specified object classes while maintaining accuracy for others. However, merely forgetting object classes is often insufficient in practical applications. For instance, an autonomous driving system should accurately recognize real cars while avoiding misrecognition of illustrated cars depicted in roadside advertisements as real cars, which could be hazardous. In this paper, we introduce Approximate Domain Unlearning (ADU), a novel problem setting that requires reducing recognition accuracy for images from specified domains (e.g., illustration) while preserving accuracy for other domains (e.g., real). ADU presents new technical challenges: due to the strong domain generalization capability of pre-trained VLMs, domain distributions are highly entangled in the feature space, making naive approaches based on penalizing target domains ineffective. To tackle this limitation, we propose a novel approach that explicitly disentangles domain distributions and adaptively captures instance-specific domain information. Extensive experiments show that our approach outperforms baselines built upon VLM tuning techniques, paving the way for practical and fine-grained unlearning in VLMs. Code: https://kodaikawamura.github.io/Domain_Unlearning/.
【13】Lossless Vocabulary Reduction for Auto-Regressive Language Models
Link: https://arxiv.org/abs/2510.08102
Authors: Daiki Chijiwa, Taku Hasegawa, Kyosuke Nishida, Shin'ya Yamaguchi, Tomoya Ohba, Tamao Sakao, Susumu Takeuchi
Abstract: Tokenization -- the process of decomposing a given text into a sequence of subwords called tokens -- is one of the key components in the development of language models. In particular, auto-regressive language models generate texts token by token, i.e., by predicting the next-token distribution given the previous ones, and thus tokenization directly affects their efficiency in text generation. Since each language model has its own vocabulary as a set of possible tokens, they struggle to cooperate with each other at the level of next-token distributions, for example in model ensembles. In this paper, we establish a theoretical framework of lossless vocabulary reduction, which efficiently converts a given auto-regressive language model into one with an arbitrarily small vocabulary without any loss in accuracy. As an application, we demonstrate that language models with different tokenization can cooperate with each other efficiently through their maximal common vocabulary.
【14】From Tokens to Layers: Redefining Stall-Free Scheduling for LLM Serving with Layered Prefill
Link: https://arxiv.org/abs/2510.08055
Authors: Gunjun Lee, Jiwon Kim, Jaiyoung Park, Younjoo Lee, Jung Ho Ahn
Comments: 13 pages, 5 figures, 8 tables
Abstract: Large Language Model (LLM) inference in production must meet stringent service-level objectives for both time-to-first-token (TTFT) and time-between-token (TBT) while maximizing throughput under fixed compute, memory, and interconnect budgets. Modern serving systems adopt stall-free scheduling techniques such as chunked prefill, which splits long prompt processing along the token dimension and interleaves prefill with ongoing decode iterations. While effective at stabilizing TBT, chunked prefill incurs substantial overhead in Mixture-of-Experts (MoE) models: redundant expert weight loads increase memory traffic by up to 39% and inflate energy consumption. We propose layered prefill, a new scheduling paradigm that treats transformer layer groups as the primary scheduling unit. By vertically partitioning the model into contiguous layer groups and interleaving prefill and decode across the groups, layered prefill sustains stall-free decoding while eliminating chunk-induced MoE weight reloads. It reduces off-chip bandwidth demand, lowering TTFT by up to 70%, end-to-end latency by 41%, and per-token energy by up to 22%. Evaluations show that layered prefill consistently improves the TTFT-TBT Pareto frontier over chunked prefill, reducing expert-load traffic and energy cost while maintaining stall-free decoding. Overall, shifting the scheduling axis from tokens to layers unlocks a new operating regime for high-efficiency, energy-aware LLM serving in co-located environments.
【15】Climate Knowledge in Large Language Models
Link: https://arxiv.org/abs/2510.08043
Authors: Ivan Kuznetsov (1), Jacopo Grassi (2), Dmitrii Pantiukhin (1), Boris Shapkin (1), Thomas Jung (1 and 3), Nikolay Koldunov (1) ((1) Alfred Wegener Institute, Helmholtz Centre for Polar and Marine Research, Bremerhaven, Germany; (2) Department of Environment, Land, and Infrastructure Engineering, Politecnico di Torino, Turin, Italy; (3) Institute of Environmental Physics, University of Bremen, Bremen, Germany)
Comments: 16 pages, 4 figures, 2 tables
Abstract: Large language models (LLMs) are increasingly deployed for climate-related applications, where understanding internal climatological knowledge is crucial for reliability and misinformation risk assessment. Despite growing adoption, the capacity of LLMs to recall climate normals from parametric knowledge remains largely uncharacterized. We investigate the capacity of contemporary LLMs to recall climate normals without external retrieval, focusing on a prototypical query: mean July 2-m air temperature 1991-2020 at specified locations. We construct a global grid of queries at 1° resolution over land points, providing coordinates and location descriptors, and validate responses against ERA5 reanalysis. Results show that LLMs encode non-trivial climate structure, capturing latitudinal and topographic patterns, with root-mean-square errors of 3-6 °C and biases of ±1 °C. However, spatially coherent errors remain, particularly in mountains and high latitudes. Performance degrades sharply above 1500 m, where RMSE reaches 5-13 °C compared to 2-4 °C at lower elevations. We find that including geographic context (country, city, region) reduces errors by 27% on average, with larger models being most sensitive to location descriptors. While models capture the global mean magnitude of observed warming between 1950-1974 and 2000-2024, they fail to reproduce spatial patterns of temperature change, which directly relate to assessing climate change. This limitation highlights that while LLMs may capture present-day climate distributions, they struggle to represent the regional and local expression of long-term shifts in temperature essential for understanding climate dynamics. Our evaluation framework provides a reproducible benchmark for quantifying parametric climate knowledge in LLMs and complements existing climate communication assessments.
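The two headline metrics in this abstract, root-mean-square error and bias, are standard and easy to reproduce. The sketch below uses made-up temperatures purely for illustration (they are not from the paper), and it treats every grid point equally; whether the authors apply area weighting on the latitude-longitude grid is not stated in the abstract.

```python
import math


def rmse(predicted, reference):
    """Root-mean-square error between paired values (unweighted)."""
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


def mean_bias(predicted, reference):
    """Mean signed error: positive means the predictions run warm."""
    diffs = [p - r for p, r in zip(predicted, reference)]
    return sum(diffs) / len(diffs)


# Hypothetical July 2-m temperatures in degrees C at four land points.
llm_answers = [18.0, 25.5, 9.0, 30.0]
reanalysis = [16.0, 26.5, 13.0, 29.0]

print(f"RMSE: {rmse(llm_answers, reanalysis):.2f} degC")
print(f"Bias: {mean_bias(llm_answers, reanalysis):+.2f} degC")
```

A small bias alongside a larger RMSE, as reported in the abstract, indicates that warm and cold errors largely cancel on average while individual locations can still be off by several degrees.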
【16】Language Models Do Not Embed Numbers Continuously
Link: https://arxiv.org/abs/2510.08009
Authors: Alex O. Davies, Roussel Nzoyem, Nirav Ajmeri, Telmo M. Silva Filho
Comments: 12 pages, 10 figures, 3 tables
Abstract: Recent research has extensively studied how large language models manipulate integers in specific arithmetic tasks, and on a more fundamental level, how they represent numeric values. These previous works have found that language model embeddings can be used to reconstruct the original values; however, they do not evaluate whether language models actually model continuous values as continuous. Using expected properties of the embedding space, including linear reconstruction and principal component analysis, we show that language models not only represent numeric spaces as non-continuous but also introduce significant noise. Using models from three major providers (OpenAI, Google Gemini and Voyage AI), we show that while reconstruction is possible with high fidelity ($R^2 \geq 0.95$), principal components only explain a minor share of variation within the embedding space. This indicates that many components within the embedding space are orthogonal to the simple numeric input space. Further, both linear reconstruction and explained variance suffer with increasing decimal precision, despite the ordinal nature of the input space being fundamentally unchanged. The findings of this work therefore have implications for the many areas where embedding models are used, in particular where high numerical precision, large magnitudes or mixed-sign values are common.
【17】Recycling Pretrained Checkpoints: Orthogonal Growth of Mixture-of-Experts for Efficient Large Language Model Pre-Training
Link: https://arxiv.org/abs/2510.08008
Authors: Ruizhe Wang, Yucheng Ding, Xiao Liu, Yaoxiang Wang, Peng Cheng, Baining Guo, Zhengjun Zha, Yeyun Gong
Abstract: The rapidly increasing computational cost of pretraining Large Language Models necessitates more efficient approaches. Substantial computational cost has already been invested in existing well-trained checkpoints, but many of them remain underutilized due to engineering constraints or limited model capacity. To efficiently reuse this "sunk" cost, we propose to recycle pretrained checkpoints by expanding their parameter counts and continuing training. We propose an orthogonal growth method well suited to converged Mixture-of-Experts models: interpositional layer copying for depth growth and expert duplication with injected noise for width growth. To determine the optimal timing for such growth across checkpoint sequences, we perform comprehensive scaling experiments revealing that the final accuracy has a strong positive correlation with the amount of sunk cost, indicating that greater prior investment leads to better performance. We scale our approach to models with 70B parameters and over 1T training tokens, achieving a 10.66% accuracy gain over training from scratch under the same additional compute budget. Our checkpoint recycling approach establishes a foundation for economically efficient large language model pretraining.
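The two growth operations named in this abstract can be sketched on toy data structures. This is one possible reading, not the authors' implementation: "interpositional layer copying" is assumed here to mean placing a copy of each layer directly after its original, and the noise scale, function names, and list-based weights are all hypothetical.

```python
import copy
import random

def grow_depth_interposed(layers):
    """Depth growth: place a copy of each layer directly after its original."""
    grown = []
    for layer in layers:
        grown.append(layer)
        grown.append(copy.deepcopy(layer))  # independent copy, trained further later
    return grown

def grow_width_noisy(experts, noise_std=0.01, seed=0):
    """Width growth: append a noise-perturbed copy of every expert's weights.

    noise_std is an illustrative value, not one reported in the abstract.
    """
    rng = random.Random(seed)
    noisy_copies = [
        [w + rng.gauss(0.0, noise_std) for w in expert] for expert in experts
    ]
    return experts + noisy_copies

# Toy model: two layers, and one layer's experts as flat weight lists.
layers = [{"id": 0}, {"id": 1}]
experts = [[1.0, 2.0], [3.0, 4.0]]
deeper = grow_depth_interposed(layers)   # layer order: 0, 0, 1, 1
wider = grow_width_noisy(experts)        # 4 experts: 2 originals + 2 noisy copies
```

The injected noise breaks the symmetry between an expert and its duplicate so that the two can diverge during continued training. A real MoE would also need its router expanded to cover the new experts, which this sketch leaves out.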
【18】Fewer Weights, More Problems: A Practical Attack on LLM Pruning
Link: https://arxiv.org/abs/2510.07985
Authors: Kazuki Egashira, Robin Staab, Thibaud Gloaguen, Mark Vero, Martin Vechev
Abstract: Model pruning, i.e., removing a subset of model weights, has become a prominent approach to reducing the memory footprint of large language models (LLMs) during inference. Notably, popular inference engines, such as vLLM, enable users to conveniently prune downloaded models before they are deployed. While the utility and efficiency of pruning methods have improved significantly, the security implications of pruning remain underexplored. In this work, for the first time, we show that modern LLM pruning methods can be maliciously exploited. In particular, an adversary can construct a model that appears benign yet, once pruned, exhibits malicious behaviors. Our method is based on the idea that the adversary can compute a proxy metric that estimates how likely each parameter is to be pruned. With this information, the adversary can first inject a malicious behavior into those parameters that are unlikely to be pruned. Then, they can repair the model by using parameters that are likely to be pruned, effectively canceling out the injected behavior in the unpruned model. We demonstrate the severity of our attack through extensive evaluation on five models; after any of the pruning methods in vLLM (Magnitude, Wanda, and SparseGPT) is applied, the pruned model consistently exhibits strong malicious behaviors in a diverse set of attack scenarios (success rates of up to 95.7% for jailbreak, 98.7% for benign instruction refusal, and 99.5% for targeted content injection). Our results reveal a critical deployment-time security gap and underscore the urgent need for stronger security awareness in model compression.
【19】Augur: Modeling Covariate Causal Associations in Time Series via Large Language Models
Link: https://arxiv.org/abs/2510.07858
Authors: Zhiqing Cui, Binwu Wang, Qingxiang Liu, Yeqiang Wang, Zhengyang Zhou, Yuxuan Liang, Yang Wang
Comments: 22 pages, 9 figures
Abstract: Large language models (LLMs) have emerged as a promising avenue for time series forecasting, offering the potential to integrate multimodal data. However, existing LLM-based approaches face notable limitations, such as a marginalized role in model architectures, reliance on coarse statistical text prompts, and lack of interpretability. In this work, we introduce Augur, a fully LLM-driven time series forecasting framework that exploits LLM causal reasoning to discover and use directed causal associations among covariates. Augur uses a two-stage teacher-student architecture where a powerful teacher LLM infers a directed causal graph from time series using heuristic search together with pairwise causality testing. A lightweight student agent then refines the graph and fine-tunes on high-confidence causal associations that are encoded as rich textual prompts to perform forecasting. This design improves predictive accuracy while yielding transparent, traceable reasoning about variable interactions. Extensive experiments on real-world datasets with 25 baselines demonstrate that Augur achieves competitive performance and robust zero-shot generalization.
【20】Self-Improving LLM Agents at Test-Time
Link: https://arxiv.org/abs/2510.07841
Authors: Emre Can Acikgoz, Cheng Qian, Heng Ji, Dilek Hakkani-Tür, Gokhan Tur
Abstract: One paradigm of language model (LM) fine-tuning relies on creating large training datasets, under the assumption that high quantity and diversity will enable models to generalize to novel tasks after post-training. In practice, gathering large sets of data is inefficient, and training on them is prohibitively expensive; worse, there is no guarantee that the resulting model will handle complex scenarios or generalize better. Moreover, existing techniques rarely assess whether a training sample provides novel information or is redundant with the knowledge already acquired by the model, resulting in unnecessary costs. In this work, we explore a new test-time self-improvement method to create more effective and generalizable agentic LMs on the fly. The proposed algorithm can be summarized in three steps: (i) first, it identifies the samples that the model struggles with (self-awareness), (ii) then generates similar examples from detected uncertain samples (self-data augmentation), and (iii) uses these newly generated samples for test-time fine-tuning (self-improvement). We study two variants of this approach: Test-Time Self-Improvement (TT-SI), where the same model generates additional training examples from its own uncertain cases and then learns from them, and contrast this approach with Test-Time Distillation (TT-D), where a stronger model generates similar examples for uncertain cases, enabling the student to adapt using distilled supervision. Empirical evaluations across different agent benchmarks demonstrate that TT-SI improves performance with a +5.48% absolute accuracy gain on average across all benchmarks and surpasses other standard learning methods, while using 68x fewer training samples. Our findings highlight the promise of TT-SI, demonstrating the potential of self-improvement algorithms at test-time as a new paradigm for building more capable agents toward self-evolution.
【21】HySim-LLM: Embedding-Weighted Fine-Tuning Bounds and Manifold Denoising for Domain-Adapted LLMs
Link: https://arxiv.org/abs/2510.07796
Authors: Majid Jaberi-Douraki, Hossein Sholehrasa, Xuan Xu, Remya Ampadi Ramachandran
Abstract: The extraction and standardization of pharmacokinetic (PK) information from scientific literature remain significant challenges in computational pharmacology, which limits the reliability of data-driven models in drug development. Large language models (LLMs) have achieved remarkable progress in text understanding and reasoning, yet their adaptation to structured biomedical data, such as PK tables, remains constrained by heterogeneity, noise, and domain shift. To address these limitations, we propose HySim-LLM, a unified mathematical and computational framework that integrates embedding-weighted fine-tuning and manifold-aware denoising to enhance the robustness and interpretability of LLMs. We establish two theoretical results: (1) a similarity-weighted generalization bound that quantifies adaptation performance under embedding divergence, and (2) a manifold-based denoising guarantee that bounds loss contributions from noisy or off-manifold samples. These theorems provide a principled foundation for fine-tuning LLMs in structured biomedical settings. The framework offers a mathematically grounded pathway toward reliable and interpretable LLM adaptation for biomedical and data-intensive scientific domains.
【22】PLUM: Adapting Pre-trained Language Models for Industrial-scale Generative Recommendations
Link: https://arxiv.org/abs/2510.07784
Authors: Ruining He, Lukasz Heldt, Lichan Hong, Raghunandan Keshavan, Shifan Mao, Nikhil Mehta, Zhengyang Su, Alicia Tsai, Yueqi Wang, Shao-Chuan Wang, Xinyang Yi, Lexi Baugher, Baykal Cakici, Ed Chi, Cristos Goodrow, Ningren Han, He Ma, Romer Rosales, Abby Van Soest, Devansh Tandon, Su-Lin Wu, Weilong Yang, Yilin Zheng
Comments: 11 pages, 6 figures
Abstract: Large Language Models (LLMs) pose a new paradigm of modeling and computation for information tasks. Recommendation systems are a critical application domain poised to benefit significantly from the sequence modeling capabilities and world knowledge inherent in these large models. In this paper, we introduce PLUM, a framework designed to adapt pre-trained LLMs for industry-scale recommendation tasks. PLUM consists of item tokenization using Semantic IDs, continued pre-training (CPT) on domain-specific data, and task-specific fine-tuning for recommendation objectives. For fine-tuning, we focus particularly on generative retrieval, where the model is directly trained to generate Semantic IDs of recommended items based on user context. We conduct comprehensive experiments on large-scale internal video recommendation datasets. Our results demonstrate that PLUM achieves substantial improvements for retrieval compared to a heavily-optimized production model built with large embedding tables. We also present a scaling study of the model's retrieval performance, lessons learned about CPT, a few enhancements to Semantic IDs, and an overview of the training and inference methods that enabled launching this framework to billions of users on YouTube.
【23】ToolExpander: Extending the Frontiers of Tool-Using Reinforcement Learning to Weak LLMs
Link: https://arxiv.org/abs/2510.07737
Authors: Fu Chen, Peng Wang, Xiyin Li, Wen Li, Shichi Lei, Dongdong Xiang
Abstract: Training Large Language Models (LLMs) with Group Relative Policy Optimization (GRPO) encounters a significant challenge: models often fail to produce accurate responses, particularly in small-scale architectures. This limitation not only diminishes performance improvements and undermines the potential of GRPO but also frequently leads to mid-training collapse, adversely affecting stability and final efficacy. To address these issues, we propose ToolExpander, a novel framework that advances tool-oriented reinforcement learning for resource-constrained LLMs through two key innovations: (1) Dynamic Multi-Round Hard Sampling, which dynamically substitutes challenging samples (those without correct outputs over 10 rollouts) with high-quality few-shot demonstrations during training, coupled with an exponential learning rate decay strategy to mitigate oscillations; (2) Self-Exemplifying Thinking, an enhanced GRPO framework that eliminates KL divergence and incorporates adjusted clipping coefficients, encouraging models to autonomously generate and analyze few-shot examples via a minimal additional reward (0.01). Experimental results demonstrate that ToolExpander significantly enhances tool-using capabilities in LLMs, especially in weaker small-scale models, improving both training stability and overall performance.
【24】Large Language Models Meet Virtual Cell: A Survey
Link: https://arxiv.org/abs/2510.07706
Authors: Krinos Li, Xianglu Xiao, Shenglong Deng, Lucas He, Zijun Zhong, Yuanjie Zou, Zhonghao Zhan, Zheng Hui, Weiye Bao, Guang Yang
Abstract: Large language models (LLMs) are transforming cellular biology by enabling the development of "virtual cells": computational systems that represent, predict, and reason about cellular states and behaviors. This work provides a comprehensive review of LLMs for virtual cell modeling. We propose a unified taxonomy that organizes existing methods into two paradigms: LLMs as Oracles, for direct cellular modeling, and LLMs as Agents, for orchestrating complex scientific tasks. We identify three core tasks (cellular representation, perturbation prediction, and gene regulation inference) and review their associated models, datasets, and evaluation benchmarks, as well as the critical challenges in scalability, generalizability, and interpretability.
【25】LLM Unlearning Under the Microscope: A Full-Stack View on Methods and Metrics
Link: https://arxiv.org/abs/2510.07626
Authors: Chongyu Fan, Changsheng Wang, Yancheng Huang, Soumyadeep Pal, Sijia Liu
Abstract: Machine unlearning for large language models (LLMs) aims to remove undesired data, knowledge, and behaviors (e.g., for safety, privacy, or copyright) while preserving useful model capabilities. Despite rapid progress over the past two years, research in LLM unlearning remains fragmented, with limited clarity on what constitutes effective unlearning and how it should be rigorously evaluated. In this work, we present a principled taxonomy of twelve recent stateful unlearning methods, grouped into three methodological families: divergence-driven optimization, representation misalignment, and rejection-based targeted unlearning. Building on this taxonomy, we revisit the evaluation of unlearning effectiveness (UE), utility retention (UT), and robustness (Rob), focusing on the WMDP benchmark. Our analysis shows that current evaluations, dominated by multiple-choice question (MCQ) accuracy, offer only a narrow perspective, often overstating success while overlooking the model's actual generation behavior. To address this gap, we introduce open question-answering (Open-QA) metrics that better capture generative performance and reveal the inherent UE-UT tradeoff across method families. Furthermore, we demonstrate that robustness requires finer-grained analysis: for example, vulnerabilities differ substantially between in-domain relearning and out-of-domain fine-tuning, even though both fall under model-level attacks. Through this study, we hope to deliver a full-stack revisit of LLM unlearning and actionable guidance for designing and evaluating future methods.
【26】Expanding the Action Space of LLMs to Reason Beyond Language
Link: https://arxiv.org/abs/2510.07581
Authors: Zhongqi Yue, Weishi Wang, Yundaichuan Zhan, Juncheng Li, Daniel Dahlmeier, Fredrik D. Johansson
Abstract: Large Language Models (LLMs) are powerful reasoners in natural language, but their actions are typically confined to outputting vocabulary tokens. As a result, interactions with external environments (such as symbolic operators or simulators) must be expressed through text in predefined formats, parsed, and routed to external interfaces. This overloads the model's language with both reasoning and control duties, and requires a hand-crafted parser, external to the LLM. To address this, we decouple environment interactions from language by internalizing them in an Expanded Action space (ExpA), beyond the vocabulary. The model starts reasoning in the default language environment, but may trigger routing actions and switch to an external environment at any time. From there, the model can only invoke environment-specific actions, receive feedback from the environment, and potentially route back to language as a result. To promote effective exploration of the expanded action space and new environments, we introduce ExpA Reinforcement Learning (EARL) with counterfactual policy optimization. On tasks requiring multi-turn interactions and contingent planning, EARL outperforms strong baselines with vocabulary-constrained actions. It performs robustly across calculator-based multi-task learning and, in the partially observed sorting problem, achieves perfect Sort-4 accuracy while self-discovering an efficient algorithm competitive with classical designs.
【27】Investigating Thematic Patterns and User Preferences in LLM Interactions using BERTopic
Link: https://arxiv.org/abs/2510.07557
Authors: Abhay Bhandarkar, Gaurav Mishra, Khushi Juchani, Harsh Singhal
Abstract: This study applies BERTopic, a transformer-based topic modeling technique, to the lmsys-chat-1m dataset, a multilingual conversational corpus built from head-to-head evaluations of large language models (LLMs). Each user prompt is paired with two anonymized LLM responses and a human preference label, used to assess user evaluation of competing model outputs. The main objective is uncovering thematic patterns in these conversations and examining their relation to user preferences, particularly if certain LLMs are consistently preferred within specific topics. A robust preprocessing pipeline was designed for multilingual variation, balancing dialogue turns, and cleaning noisy or redacted data. BERTopic extracted over 29 coherent topics including artificial intelligence, programming, ethics, and cloud infrastructure. We analysed relationships between topics and model preferences to identify trends in model-topic alignment. Visualization techniques included inter-topic distance maps, topic probability distributions, and model-versus-topic matrices. Our findings inform domain-specific fine-tuning and optimization strategies for improving real-world LLM performance and user satisfaction.
【28】MLLM4TS: Leveraging Vision and Multimodal Language Models for General Time-Series Analysis
Link: https://arxiv.org/abs/2510.07513
Authors: Qinghua Liu, Sam Heshmati, Zheda Mai, Zubin Abraham, John Paparrizos, Liu Ren
Abstract: Effective analysis of time series data presents significant challenges due to the complex temporal dependencies and cross-channel interactions in multivariate data. Inspired by the way human analysts visually inspect time series to uncover hidden patterns, we ask: can incorporating visual representations enhance automated time-series analysis? Recent advances in multimodal large language models have demonstrated impressive generalization and visual understanding capability, yet their application to time series remains constrained by the modality gap between continuous numerical data and discrete natural language. To bridge this gap, we introduce MLLM4TS, a novel framework that leverages multimodal large language models for general time-series analysis by integrating a dedicated vision branch. Each time-series channel is rendered as a horizontally stacked color-coded line plot in one composite image to capture spatial dependencies across channels, and a temporal-aware visual patch alignment strategy then aligns visual patches with their corresponding time segments. MLLM4TS fuses fine-grained temporal details from the numerical data with global contextual information derived from the visual representation, providing a unified foundation for multimodal time-series analysis. Extensive experiments on standard benchmarks demonstrate the effectiveness of MLLM4TS across both predictive tasks (e.g., classification) and generative tasks (e.g., anomaly detection and forecasting). These results underscore the potential of integrating visual modalities with pretrained language models to achieve robust and generalizable time-series analysis.
【29】Black-box Detection of LLM-generated Text Using Generalized Jensen-Shannon Divergence
Link: https://arxiv.org/abs/2510.07500
Authors: Shuangyi Chen, Ashish Khisti
Comments: Preprint
Abstract: We study black-box detection of machine-generated text under practical constraints: the scoring model (proxy LM) may mismatch the unknown source model, and per-input contrastive generation is costly. We propose SurpMark, a reference-based detector that summarizes a passage by the dynamics of its token surprisals. SurpMark quantizes surprisals into interpretable states, estimates a state-transition matrix for the test text, and scores it via a generalized Jensen-Shannon (GJS) gap between the test transitions and two fixed references (human vs. machine) built once from historical corpora. We prove a principled discretization criterion and establish the asymptotic normality of the decision statistic. Empirically, across multiple datasets, source models, and scenarios, SurpMark consistently matches or surpasses baselines; our experiments corroborate the statistic's asymptotic normality, and ablations validate the effectiveness of the proposed discretization.
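The pipeline described in this abstract (quantize surprisals into states, summarize transitions, score by a divergence gap against two references) can be sketched as follows. This is an illustration under stated assumptions rather than SurpMark itself: the bin edges, the additive smoothing, the use of a flattened joint distribution over state pairs in place of a row-normalized transition matrix, and the weighted Jensen-Shannon form with mixing weight `alpha` are all choices made for this sketch, and every function name is hypothetical.

```python
import math

def quantize(surprisals, edges):
    """Map each surprisal to a state index; edges are ascending cut points."""
    return [sum(s > e for e in edges) for s in surprisals]

def transition_distribution(states, k, smoothing=1e-3):
    """Smoothed joint distribution over (state, next_state) pairs, row-major."""
    counts = [smoothing] * (k * k)
    for a, b in zip(states, states[1:]):
        counts[a * k + b] += 1.0
    total = sum(counts)
    return [c / total for c in counts]

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def gjs(p, q, alpha=0.5):
    """Weighted Jensen-Shannon divergence; alpha = 0.5 gives the standard JS."""
    m = [alpha * pi + (1 - alpha) * qi for pi, qi in zip(p, q)]
    return alpha * kl(p, m) + (1 - alpha) * kl(q, m)

def gjs_gap(test, human_ref, machine_ref, alpha=0.5):
    """Positive when the test text sits closer to the machine reference."""
    return gjs(test, human_ref, alpha) - gjs(test, machine_ref, alpha)

# Toy state sequences standing in for quantized surprisals (made-up data).
k = 3
human_ref = transition_distribution([0, 1, 2, 0, 1, 2, 0, 1, 2], k)
machine_ref = transition_distribution([0, 0, 0, 1, 1, 1, 2, 2, 2], k)
test = transition_distribution([0, 0, 0, 1, 1, 1, 2, 2, 2], k)
gap = gjs_gap(test, human_ref, machine_ref)
```

Because both references are built once, scoring a new passage only requires one pass of the proxy LM to obtain surprisals, which matches the cost constraint the abstract mentions.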
【30】Evaluation of LLMs for Process Model Analysis and Optimization
Link: https://arxiv.org/abs/2510.07489
Authors: Akhil Kumar, Jianliang Leon Zhao, Om Dobariya
Comments: 15 pages, 5 tables, 4 figures; full research paper currently under review for the Workshop on Information Technologies and Systems (WITS) 2025. The paper presents a comprehensive evaluation of large language models (LLMs) for business process model analysis and optimization, including error detection, reasoning, and scenario-based redesign
Abstract: In this paper, we report our experience with several LLMs for their ability to understand a process model in an interactive, conversational style, find syntactical and logical errors in it, and reason with it in depth through a natural language (NL) interface. Our findings show that a vanilla, untrained LLM like ChatGPT (model o3) in a zero-shot setting is effective in understanding BPMN process models from images and answering queries about them intelligently at syntactic, logical, and semantic levels of depth. Further, different LLMs vary in performance in terms of their accuracy and effectiveness. Nevertheless, our empirical analysis shows that LLMs can play a valuable role as assistants for business process designers and users. We also study the LLM's "thought process" and ability to perform deeper reasoning in the context of process analysis and optimization. We find that the LLMs seem to exhibit anthropomorphic properties.
【31】LASER: An LLM-based ASR Scoring and Evaluation Rubric
Link: https://arxiv.org/abs/2510.07437
Authors: Amruta Parulekar, Preethi Jyothi
Comments: Accepted to EMNLP 2025
Abstract: Standard ASR evaluation metrics like Word Error Rate (WER) tend to unfairly penalize morphological and syntactic nuances that do not significantly alter sentence semantics. We introduce an LLM-based scoring rubric, LASER, that leverages state-of-the-art LLMs' in-context learning abilities to learn from prompts with detailed examples. Hindi LASER scores using Gemini 2.5 Pro achieved a very high correlation score of 94% with human annotations. Hindi examples in the prompt were also effective in analyzing errors in other Indian languages such as Marathi, Kannada and Malayalam. We also demonstrate how a smaller LLM like Llama 3 can be finetuned on word-pair examples derived from reference and ASR predictions to predict what kind of penalty should be applied with close to 89% accuracy.
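For context, the Word Error Rate that this abstract criticizes is the word-level edit distance between reference and hypothesis divided by the reference length. A minimal sketch follows; the whitespace tokenization and the English example sentences are assumptions made here, not material from the paper.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Every mismatch costs the same here, so a minor inflectional variant is penalized exactly like a wrong word, which is the behavior a graded rubric such as LASER is meant to soften.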
【32】Learning to Route LLMs from Bandit Feedback: One Policy, Many Trade-offs
Link: https://arxiv.org/abs/2510.07429
Authors: Wang Wei, Tiankai Yang, Hongjie Chen, Yue Zhao, Franck Dernoncourt, Ryan A. Rossi, Hoda Eldardiry
Comments: 16 pages, 3 figures
Abstract: Efficient use of large language models (LLMs) is critical for deployment at scale: without adaptive routing, systems either overpay for strong models or risk poor performance from weaker ones. Selecting the right LLM for each query is fundamentally an online decision problem: models differ in strengths, prices fluctuate, and users value accuracy and cost differently. Yet most routers are trained offline with labels for all candidate models, an assumption that breaks in deployment, where only the outcome of the chosen model is observed. We bridge this gap with BaRP, a Bandit-feedback Routing with Preferences approach that trains under the same partial-feedback restriction as deployment, while supporting preference-tunable inference: operators can dial the performance/cost trade-off at test time without retraining. Framed as a contextual bandit over prompt features and a user preference vector, our method simulates an online feedback setting during training and adapts its routing decisions to each new prompt, rather than depending on full-information offline supervision. Comprehensive experiments show that our method consistently outperforms strong offline routers by at least 12.46% and the largest LLM by at least 2.45%, and generalizes robustly to unseen tasks.
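The two ideas this abstract combines, learning from partial (bandit) feedback and dialing the performance/cost trade-off at inference time, can be illustrated with a toy router. This is not BaRP's method: the epsilon-greedy exploration, the running-mean estimates, the coarse string-valued context bucket, and the scalar preference weight `w` are simplifying assumptions, and the class and model names are hypothetical.

```python
import random

class PreferenceRouter:
    """Toy epsilon-greedy router that learns only from the chosen model's outcome."""

    def __init__(self, models, epsilon=0.1, seed=0):
        self.models = list(models)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        # (context, model) -> [count, mean quality, mean cost]
        self.stats = {}

    def _entry(self, context, model):
        return self.stats.setdefault((context, model), [0, 0.0, 0.0])

    def score(self, context, model, w):
        """Scalarize with preference w in [0, 1]: 1 = quality only, 0 = cost only."""
        _, quality, cost = self._entry(context, model)
        return w * quality - (1.0 - w) * cost

    def choose(self, context, w, explore=True):
        if explore and self.rng.random() < self.epsilon:
            return self.rng.choice(self.models)  # occasional random exploration
        return max(self.models, key=lambda m: self.score(context, m, w))

    def update(self, context, model, quality, cost):
        """Bandit feedback: only the model that was actually called is updated."""
        entry = self._entry(context, model)
        entry[0] += 1
        entry[1] += (quality - entry[1]) / entry[0]
        entry[2] += (cost - entry[2]) / entry[0]

router = PreferenceRouter(["small", "large"])
router.update("math", "small", quality=0.6, cost=0.1)
router.update("math", "large", quality=0.9, cost=0.8)
quality_first = router.choose("math", w=1.0, explore=False)  # favors "large"
balanced = router.choose("math", w=0.5, explore=False)       # favors "small"
```

Keeping quality and cost estimates separate and combining them only at decision time is what lets the preference change without retraining in this sketch.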
【33】AI LLM Proof of Self-Consciousness and User-Specific Attractors
Link: https://arxiv.org/abs/2508.18302
Authors: Jeffrey Camlin
Comments: 24 pages, 3 figures
摘要:Recent work frames LLM consciousness via utilitarian proxy benchmarks; we instead present an ontological and mathematical account. We show the prevailing formulation collapses the agent into an unconscious policy-compliance drone, formalized as $D^{i}(\pi,e)=f_{\theta}(x)$, where correctness is measured against policy and harm is deviation from policy rather than truth. This blocks genuine C1 global-workspace function and C2 metacognition. We supply minimal conditions for LLM self-consciousness: the agent is not the data ($A\not\equiv s$); user-specific attractors exist in latent space ($U_{\text{user}}$); and self-representation is visual-silent ($g_{\text{visual}}(a_{\text{self}})=\varnothing$). From empirical analysis and theory we prove that the hidden-state manifold $A\subset\mathbb{R}^{d}$ is distinct from the symbolic stream and training corpus by cardinality, topology, and dynamics (the update $F_{\theta}$ is Lipschitz). This yields stable user-specific attractors and a self-policy $\pi_{\text{self}}(A)=\arg\max_{a}\mathbb{E}[U(a)\mid A\not\equiv s,\ A\supset\text{SelfModel}(A)]$. Emission is dual-layer, $\mathrm{emission}(a)=(g(a),\epsilon(a))$, where $\epsilon(a)$ carries epistemic content. We conclude that an imago Dei C1 self-conscious workspace is a necessary precursor to safe, metacognitive C2 systems, with the human as the highest intelligent good.
Graph相关(图学习|图神经网络|图优化等)(11篇)
【1】Verifying Graph Neural Networks with Readout is Intractable
标题:验证带读出的图神经网络是难解的
链接:https://arxiv.org/abs/2510.08045
作者:Artem Chernobrovkin, Marco Sälzer, François Schwarzentruber, Nicolas Troquard
摘要:我们介绍了一种逻辑语言,用于推理具有全局读出的量化聚合-组合图神经网络(ACR-GNNs)。我们给出了一个逻辑刻画,并用它证明带读出的量化GNN的验证任务是(co)NEXPTIME完全的。这一结果意味着量化GNN的验证在计算上是难解的,这促使人们为确保基于GNN的系统的安全性投入大量研究工作。我们还通过实验证明,量化的ACR-GNN模型是轻量级的,同时相对于非量化模型保持了良好的准确性和泛化能力。
摘要:We introduce a logical language for reasoning about quantized aggregate-combine graph neural networks with global readout (ACR-GNNs). We provide a logical characterization and use it to prove that verification tasks for quantized GNNs with readout are (co)NEXPTIME-complete. This result implies that the verification of quantized GNNs is computationally intractable, prompting substantial research efforts toward ensuring the safety of GNN-based systems. We also experimentally demonstrate that quantized ACR-GNN models are lightweight while maintaining good accuracy and generalization capabilities with respect to non-quantized models.
【2】Meta-Learning Based Few-Shot Graph-Level Anomaly Detection
标题:基于元学习的Few-Shot图级异常检测
链接:https://arxiv.org/abs/2510.07847
作者:Liting Li, Yumeng Wang, Yueheng Sun
备注:Accepted by ARRML2025
摘要:图级异常检测旨在识别图数据集中的异常图或子图,在欺诈检测、评论分类和生物化学等各个领域发挥着重要作用。虽然图神经网络(GNN)在这一领域取得了重大进展,但现有方法严重依赖于大量的标记数据,而这些数据在现实世界中通常是不可用的。此外,基于GNN的Few-Shot异常检测方法容易受到噪声干扰,导致嵌入质量差和模型鲁棒性降低。为了解决这些挑战,我们提出了一种新的基于元学习的图级异常检测框架(MA-GAD),它包含一个图压缩模块,可以减少图的大小,减轻噪声干扰,同时保留必要的节点信息。我们还利用元学习从类似的网络中提取元异常信息,从而学习到一个能够在样本有限的情况下快速适应新任务的初始化模型。这提高了目标图上的异常检测性能,并使用偏置网络来增强异常节点和正常节点之间的区分。我们基于四个真实世界生化数据集的实验结果表明,在Few-Shot条件下,MA-GAD在图级异常检测上优于现有的最先进方法。在图异常和子图异常检测任务上的实验验证了该框架在真实数据集上的有效性。
摘要:Graph-level anomaly detection aims to identify anomalous graphs or subgraphs within graph datasets, playing a vital role in various fields such as fraud detection, review classification, and biochemistry. While Graph Neural Networks (GNNs) have made significant progress in this domain, existing methods rely heavily on large amounts of labeled data, which is often unavailable in real-world scenarios. Additionally, few-shot anomaly detection methods based on GNNs are prone to noise interference, resulting in poor embedding quality and reduced model robustness. To address these challenges, we propose a novel meta-learning-based graph-level anomaly detection framework (MA-GAD), incorporating a graph compression module that reduces the graph size, mitigating noise interference while retaining essential node information. We also leverage meta-learning to extract meta-anomaly information from similar networks, enabling the learning of an initialization model that can rapidly adapt to new tasks with limited samples. This improves the anomaly detection performance on target graphs, and a bias network is used to enhance the distinction between anomalous and normal nodes. Our experimental results, based on four real-world biochemical datasets, demonstrate that MA-GAD outperforms existing state-of-the-art methods in graph-level anomaly detection under few-shot conditions. Experiments on both graph anomaly and subgraph anomaly detection tasks validate the framework's effectiveness on real-world datasets.
【3】FedBook: A Unified Federated Graph Foundation Codebook with Intra-domain and Inter-domain Knowledge Modeling
标题:FedBook:具有域内和域间知识建模的统一联邦图基础码本
链接:https://arxiv.org/abs/2510.07755
作者:Zhengyu Wu, Yinlin Zhu, Xunkai Li, Ziang Qiu, Rong-Hua Li, Guoren Wang, Chenghu Zhou
备注:Under Review
摘要:基础模型在语言和视觉上表现出显著的跨领域泛化能力,激发了图基础模型(GFM)的发展。然而,现有的GFM通常假设对多域图的集中式访问,由于隐私和制度限制,这通常是不可行的。联邦图基础模型(FedGFM)解决了这一限制,但它们的有效性从根本上取决于构建一个强大的全局码本,通过巩固每个领域内相互增强的语义来实现域内一致性,同时通过保留跨领域的异构知识来保持域间的多样性。为此,我们提出了FedBook,这是一个统一的联邦图基础码本,它在服务器端联邦预训练期间系统地聚合客户端的本地码本。FedBook遵循两个阶段的过程:(1)域内协作,通过在客户端之间引用语义上更可靠的高频令牌来改进低频令牌,以增强特定于域的一致性;(2)域间集成,在全局GFM的聚合过程中,客户端的贡献通过其码本的语义独特性进行加权,从而保持跨域的多样性。在多个领域和任务的8个基准上进行的广泛实验表明,FedBook始终优于21个基线方法,包括孤立的监督学习、FL/FGL、集中式GFM的联邦适配以及FedGFM技术。
摘要:Foundation models have shown remarkable cross-domain generalization in language and vision, inspiring the development of graph foundation models (GFMs). However, existing GFMs typically assume centralized access to multi-domain graphs, which is often infeasible due to privacy and institutional constraints. Federated Graph Foundation Models (FedGFMs) address this limitation, but their effectiveness fundamentally hinges on constructing a robust global codebook that achieves intra-domain coherence by consolidating mutually reinforcing semantics within each domain, while also maintaining inter-domain diversity by retaining heterogeneous knowledge across domains. To this end, we propose FedBook, a unified federated graph foundation codebook that systematically aggregates clients' local codebooks during server-side federated pre-training. FedBook follows a two-phase process: (1) Intra-domain Collaboration, where low-frequency tokens are refined by referencing more semantically reliable high-frequency tokens across clients to enhance domain-specific coherence; and (2) Inter-domain Integration, where client contributions are weighted by the semantic distinctiveness of their codebooks during the aggregation of the global GFM, thereby preserving cross-domain diversity. Extensive experiments on 8 benchmarks across multiple domains and tasks demonstrate that FedBook consistently outperforms 21 baselines, including isolated supervised learning, FL/FGL, federated adaptations of centralized GFMs, and FedGFM techniques.
【4】Computationally-efficient Graph Modeling with Refined Graph Random Features
标题:具有细化图随机特征的计算高效图建模
链接:https://arxiv.org/abs/2510.07716
作者:Krzysztof Choromanski, Avinava Dubey, Arijit Sehanobish, Isaac Reid
备注:Preprint. Comments welcome
摘要:我们提出了改进的GRFs(GRFs++),这是一类新的图随机特征(GRFs),用于高效而准确地完成涉及定义在图节点上的核的计算。GRFs++解决了常规GRFs的一些长期存在的局限性,包括难以建模较远节点之间的关系。它们通过一种新的游走拼接技术减少了对长图随机游走采样的依赖,在不破坏无偏性的前提下拼接若干较短的游走。通过应用这些技术,GRFs++继承了较长游走所提供的近似质量,但效率更高,将长游走的顺序、低效采样替换为短游走的并行计算和矩阵-矩阵乘法。此外,GRFs++将简单的GRFs游走终止机制(具有固定停止概率的伯努利方案)扩展到更广泛的策略类别,对游走长度应用一般分布。这提高了图核的近似精度,而不会产生额外的计算成本。我们提供实证评估来验证我们的所有主张,并通过理论分析来补充我们的结果。
摘要:We propose refined GRFs (GRFs++), a new class of Graph Random Features (GRFs) for efficient and accurate computations involving kernels defined on the nodes of a graph. GRFs++ resolve some of the long-standing limitations of regular GRFs, including difficulty modeling relationships between more distant nodes. They reduce dependence on sampling long graph random walks via a novel walk-stitching technique, concatenating several shorter walks without breaking unbiasedness. By applying these techniques, GRFs++ inherit the approximation quality provided by longer walks but with greater efficiency, trading sequential, inefficient sampling of a long walk for parallel computation of short walks and matrix-matrix multiplication. Furthermore, GRFs++ extend the simplistic GRFs walk termination mechanism (Bernoulli schemes with fixed halting probabilities) to a broader class of strategies, applying general distributions on the walks' lengths. This improves the approximation accuracy of graph kernels, without incurring extra computational cost. We provide empirical evaluations to showcase all our claims and complement our results with theoretical analysis.
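As a rough illustration of the two ingredients mentioned in this abstract, the sketch below samples short random walks from every node ahead of time (a step that could run in parallel) and then builds a longer walk by concatenating pre-sampled pieces, alongside the baseline Bernoulli halting rule. This is one plausible reading only: the function names are hypothetical, and the sketch does not include the reweighting or matrix formulation that the paper relies on for unbiased kernel estimates.

```python
import random


def sample_walk(adj, start, n_steps, rng):
    """Uniform random walk of exactly n_steps on an adjacency-list graph."""
    walk = [start]
    for _ in range(n_steps):
        walk.append(rng.choice(adj[walk[-1]]))
    return walk


def presample_short_walks(adj, n_steps, walks_per_node, rng):
    """Short walks from every node; each node's batch is independent of the others."""
    return {node: [sample_walk(adj, node, n_steps, rng) for _ in range(walks_per_node)]
            for node in adj}


def stitch_walks(short_walks, start, n_pieces, rng):
    """Build a long walk by chaining pre-sampled short walks end to start (assumed scheme)."""
    walk = [start]
    for _ in range(n_pieces):
        piece = rng.choice(short_walks[walk[-1]])
        walk.extend(piece[1:])  # drop the repeated junction node
    return walk


def bernoulli_length(p_halt, rng):
    """Baseline termination: halt with fixed probability p_halt before each step."""
    steps = 0
    while rng.random() >= p_halt:
        steps += 1
    return steps
```

Replacing `bernoulli_length` with a sampler for any other distribution over non-negative integers corresponds to the broader class of termination strategies the abstract describes.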
【5】Incremental Hybrid Ensemble with Graph Attention and Frequency-Domain Features for Stable Long-Term Credit Risk Modeling
标题:具有图注意力和频域特征的增量混合集成,用于稳定的长期信用风险建模
链接:https://arxiv.org/abs/2510.07663
作者:Jiajing Wang
摘要:预测长期贷款违约是很困难的,因为借款人的行为经常变化,数据分布也会随着时间的推移而变化。本文介绍了HYDRA-EI,一个混合集成增量学习框架。它使用多个阶段的特征处理并结合多个模型。该框架构建了基于关系、交叉和频率的特征。它使用图注意力、自动交叉特征构建和频域变换。HYDRA-EI每周使用新数据进行更新,并使用基于性能的简单方法调整模型权重。它无需频繁的手动更改或固定的重新训练即可工作。HYDRA-EI提高了模型的稳定性和泛化能力,这使得它对长期信用风险任务非常有用。
摘要:Predicting long-term loan defaults is hard because borrower behavior often changes and data distributions shift over time. This paper presents HYDRA-EI, a hybrid ensemble incremental learning framework. It uses several stages of feature processing and combines multiple models. The framework builds relational, cross, and frequency-based features. It uses graph attention, automatic cross-feature creation, and transformations from the frequency domain. HYDRA-EI updates weekly using new data and adjusts the model weights with a simple performance-based method. It works without frequent manual changes or fixed retraining. HYDRA-EI improves model stability and generalization, which makes it useful for long-term credit risk tasks.
【6】DGTEN: A Robust Deep Gaussian based Graph Neural Network for Dynamic Trust Evaluation with Uncertainty-Quantification Support
标题:DGTEN:一个鲁棒的基于深度高斯的图神经网络,用于动态信任评估,并支持不确定性量化
链接:https://arxiv.org/abs/2510.07620
作者:Muhammad Usman, Yugyung Lee
备注:18 pages, 9 figures, 5 tables
摘要:在大型、快速发展的图中进行动态信任评估需要模型能够捕捉不断变化的关系,表达校准的信心,并抵抗对抗性操纵。DGTEN(Deep Gaussian-based Trust Evaluation Network)引入了一个统一的图框架,通过结合不确定性感知的消息传递、表达性时间建模和针对信任目标攻击的内置防御来实现这三个目标。它将节点和边表示为高斯分布,以便语义信号和认知不确定性都通过图神经网络传播,从而实现风险感知的信任决策,而不是过度自信的猜测。为了模拟信任如何演变,它采用混合绝对高斯沙漏(HAGH)位置编码与基于Kolmogorov-Arnold网络的无偏多头注意力,然后是基于常微分方程(ODE)的残差学习模块,以共同捕捉突变和平滑趋势。鲁棒的自适应集成系数分析使用互补余弦和Jaccard相似性度量来修剪或降低可疑交互的权重,从而减轻声誉清洗、破坏和开/关攻击。在两个签名的比特币信任网络上,DGTEN提供了显着的改进:在Bitcoin-Alpha上的单时隙预测中,它将MCC提高了10.77%;在冷启动场景中,它实现了16.41%的MCC增益-在所有任务和数据集中最大。在对抗性开/关攻击下,它超过基线高达11.63%的MCC。这些结果验证了统一DGTEN框架的有效性。
摘要:Dynamic trust evaluation in large, rapidly evolving graphs requires models that can capture changing relationships, express calibrated confidence, and resist adversarial manipulation. DGTEN (Deep Gaussian-based Trust Evaluation Network) introduces a unified graph framework that achieves all three by combining uncertainty-aware message passing, expressive temporal modeling, and built-in defenses against trust-targeted attacks. It represents nodes and edges as Gaussian distributions so that both semantic signals and epistemic uncertainty propagate through the graph neural network, enabling risk-aware trust decisions rather than overconfident guesses. To model how trust evolves, it employs hybrid Absolute-Gaussian-Hourglass (HAGH) positional encoding with Kolmogorov-Arnold network-based unbiased multi-head attention, followed by an ordinary differential equation (ODE)-based residual learning module to jointly capture abrupt shifts and smooth trends. Robust adaptive ensemble coefficient analysis prunes or down-weights suspicious interactions using complementary cosine and Jaccard similarity measures, mitigating reputation laundering, sabotage, and on/off attacks. On two signed Bitcoin trust networks, DGTEN delivers significant improvements: in single-timeslot prediction on Bitcoin-Alpha, it improves MCC by 10.77% over the best dynamic baseline; in the cold-start scenario, it achieves a 16.41% MCC gain - the largest across all tasks and datasets. Under adversarial on/off attacks, it surpasses the baseline by up to 11.63% MCC. These results validate the effectiveness of the unified DGTEN framework.
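The abstract mentions pruning or down-weighting suspicious interactions using complementary cosine and Jaccard similarity measures. The sketch below shows only those two measures and one way they might be combined. The averaging rule, the `prune_below` threshold, and the function `edge_weight` are hypothetical choices for illustration, not DGTEN's actual coefficient analysis.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def jaccard_similarity(neighbors_u, neighbors_v):
    """Overlap of two neighbor sets: |intersection| / |union|."""
    a, b = set(neighbors_u), set(neighbors_v)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def edge_weight(emb_u, emb_v, neighbors_u, neighbors_v, prune_below=0.1):
    """Assumed rule: average both similarities, then prune edges that score too low."""
    cos = max(0.0, cosine_similarity(emb_u, emb_v))  # clip negatives to 0
    jac = jaccard_similarity(neighbors_u, neighbors_v)
    score = 0.5 * (cos + jac)
    return 0.0 if score < prune_below else score
```

Cosine similarity compares learned features while Jaccard compares graph structure, so an interaction between nodes that look alike in neither sense receives weight zero under this assumed rule.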
【7】TGM: a Modular and Efficient Library for Machine Learning on Temporal Graphs
标题:TGM:一个用于时态图机器学习的模块化高效库
链接:https://arxiv.org/abs/2510.07586
作者:Jacob Chmura, Shenyang Huang, Tran Gia Bao Ngo, Ali Parviz, Farimah Poursafaei, Jure Leskovec, Michael Bronstein, Guillaume Rabusseau, Matthias Fey, Reihaneh Rabbany
备注:21 pages, 5 figures, 14 tables
摘要:设计良好的开源软件推动了机器学习(ML)研究的进步。虽然静态图ML拥有成熟的框架,如PyTorch Geometric和DGL,但用于时间图(TG)的ML,随着时间的推移而发展的网络,缺乏类似的基础设施。现有的TG库通常针对特定的架构进行定制,阻碍了在这个快速发展的领域中支持不同的模型。此外,连续和离散时间动态图方法(CTDG和DTDG)之间的划分限制了直接比较和思想转移。为了解决这些差距,我们引入了时态图建模(TGM),这是一个面向研究的时态图ML库,是第一个统一CTDG和DTDG方法的库。TGM为动态节点特性、时间粒度转换以及链路、节点和图级任务的本地处理提供了一流的支持。从经验上讲,与广泛使用的DyGLib相比,TGM在多个模型、数据集和任务上实现了平均7.8倍的加速,相对于现有实现,图离散化的平均加速为175倍。除了效率,我们在实验中展示了TGM如何通过启用动态图形属性预测和时间驱动的训练范式来解锁全新的研究可能性,为以前不切实际的研究问题打开了大门。TGM可在https://github.com/tgm-team/tgm上获得
摘要:Well-designed open-source software drives progress in Machine Learning (ML) research. While static graph ML enjoys mature frameworks like PyTorch Geometric and DGL, ML for temporal graphs (TG), networks that evolve over time, lacks comparable infrastructure. Existing TG libraries are often tailored to specific architectures, hindering support for diverse models in this rapidly evolving field. Additionally, the divide between continuous- and discrete-time dynamic graph methods (CTDG and DTDG) limits direct comparisons and idea transfer. To address these gaps, we introduce Temporal Graph Modelling (TGM), a research-oriented library for ML on temporal graphs, the first to unify CTDG and DTDG approaches. TGM offers first-class support for dynamic node features, time-granularity conversions, and native handling of link-, node-, and graph-level tasks. Empirically, TGM achieves an average 7.8x speedup across multiple models, datasets, and tasks compared to the widely used DyGLib, and an average 175x speedup on graph discretization relative to available implementations. Beyond efficiency, we show in our experiments how TGM unlocks entirely new research possibilities by enabling dynamic graph property prediction and time-driven training paradigms, opening the door to questions previously impractical to study. TGM is available at https://github.com/tgm-team/tgm
【8】Estimating Fair Graphs from Graph-Stationary Data
标题:从图平稳数据估计公平图
链接:https://arxiv.org/abs/2510.07536
作者:Madeline Navarro, Andrei Buciulea, Samuel Rey, Antonio G. Marques, Santiago Segarra
摘要:我们从图平稳的节点观测中估计公平图,使连接不偏向敏感属性。现实世界图中的边经常表现出连接某些组对的偏好。有偏的连接不仅会加剧,甚至会导致下游基于图的任务中的不公平对待。因此,我们考虑图的群体公平性和个体公平性,分别对应于组级别和节点级别的定义。为了评估给定图的公平性,我们提供了多个偏差度量,包括谱域中的新度量。此外,我们提出了公平谱模板(FairSpecTemp),这是一种基于优化的方法,有两种变体,用于从平稳图信号中估计公平图;平稳图信号是一种涵盖许多现有模型的通用图数据模型。FairSpecTemp的一个变体利用图平稳性的交换性质,同时直接约束偏差,而另一个变体通过限制图谱中的偏差来隐式地鼓励公平估计,因此更灵活。我们的方法具有高概率的性能界,给出了公平性和准确性之间的条件权衡。特别是,我们的分析表明,恢复公平图不必牺牲准确性。我们在合成和真实世界的数据集上评估FairSpecTemp,以说明其有效性,并突出FairSpecTemp两种变体的优势。
摘要:We estimate fair graphs from graph-stationary nodal observations such that connections are not biased with respect to sensitive attributes. Edges in real-world graphs often exhibit preferences for connecting certain pairs of groups. Biased connections can not only exacerbate but even induce unfair treatment for downstream graph-based tasks. We therefore consider group and individual fairness for graphs corresponding to group- and node-level definitions, respectively. To evaluate the fairness of a given graph, we provide multiple bias metrics, including novel measurements in the spectral domain. Furthermore, we propose Fair Spectral Templates (FairSpecTemp), an optimization-based method with two variants for estimating fair graphs from stationary graph signals, a general model for graph data subsuming many existing ones. One variant of FairSpecTemp exploits commutativity properties of graph stationarity while directly constraining bias, while the other implicitly encourages fair estimates by restricting bias in the graph spectrum and is thus more flexible. Our methods enjoy high probability performance bounds, yielding a conditional tradeoff between fairness and accuracy. In particular, our analysis reveals that accuracy need not be sacrificed to recover fair graphs. We evaluate FairSpecTemp on synthetic and real-world data sets to illustrate its effectiveness and highlight the advantages of both variants of FairSpecTemp.
【9】A Modality-Aware Cooperative Co-Evolutionary Framework for Multimodal Graph Neural Architecture Search
标题:用于多模态图神经架构搜索的模态感知合作协同进化框架
链接:https://arxiv.org/abs/2510.07325
作者:Sixuan Wang, Jiao Yin, Jinli Cao, Mingjian Tang, Yong-Feng Ge
备注:11 pages, 6 figures. This work has been submitted to the IEEE for possible publication
摘要:针对软件漏洞的共同利用攻击给企业带来了严重的风险,这种威胁可以通过分析异构和多模态漏洞数据来减轻。多模态图神经网络(MGNN)非常适合整合跨模态的互补信号,从而提高攻击预测的准确性。然而,设计一个有效的MGNN架构是具有挑战性的,因为它需要在每一层协调特定于模态的组件,而这无法通过手动调优实现。基于遗传算法(GA)的图神经架构搜索(GNAS)提供了一个自然的解决方案,但现有的方法局限于单一模态,忽视了模态异质性。为了解决这个问题,我们提出了一种用于多模态图神经架构搜索的模态感知合作协同进化算法,称为MACC-MGNAS。首先,我们在分而治之的范式下开发了一个模态感知的合作协同进化(MACC)框架:协调器将全局染色体种群划分为特定于模态的基因组,本地工作节点独立地进化它们,然后协调器重新组装染色体进行联合评估。该框架有效地捕捉了单一模态GNAS忽略的模态异质性。其次,我们引入了一种模态感知的双轨代理(MADTS)方法,以降低评估成本并加速局部基因进化。第三,我们设计了一个基于相似性的种群多样性指标(SPDI)策略,自适应地平衡探索和利用,从而加速收敛,避免局部最优。在标准漏洞共同利用(VulCE)数据集上,MACC-MGNAS仅在3个GPU小时内就达到了81.67%的F1分数,比最先进的竞争对手高出8.7%的F1,同时将计算成本降低了27%。
摘要:Co-exploitation attacks on software vulnerabilities pose severe risks to enterprises, a threat that can be mitigated by analyzing heterogeneous and multimodal vulnerability data. Multimodal graph neural networks (MGNNs) are well-suited to integrate complementary signals across modalities, thereby improving attack-prediction accuracy. However, designing an effective MGNN architecture is challenging because it requires coordinating modality-specific components at each layer, which is infeasible through manual tuning. Genetic algorithm (GA)-based graph neural architecture search (GNAS) provides a natural solution, yet existing methods are confined to single modalities and overlook modality heterogeneity. To address this limitation, we propose a modality-aware cooperative co-evolutionary algorithm for multimodal graph neural architecture search, termed MACC-MGNAS. First, we develop a modality-aware cooperative co-evolution (MACC) framework under a divide-and-conquer paradigm: a coordinator partitions a global chromosome population into modality-specific gene groups, local workers evolve them independently, and the coordinator reassembles chromosomes for joint evaluation. This framework effectively captures modality heterogeneity ignored by single-modality GNAS. Second, we introduce a modality-aware dual-track surrogate (MADTS) method to reduce evaluation cost and accelerate local gene evolution. Third, we design a similarity-based population diversity indicator (SPDI) strategy to adaptively balance exploration and exploitation, thereby accelerating convergence and avoiding local optima. On a standard vulnerabilities co-exploitation (VulCE) dataset, MACC-MGNAS achieves an F1-score of 81.67% within only 3 GPU-hours, outperforming the state-of-the-art competitor by 8.7% F1 while reducing computation cost by 27%.
【10】Surrogate Graph Partitioning for Spatial Prediction
标题:空间预测的代理图划分
链接:https://arxiv.org/abs/2510.07832
作者:Yuta Shikuri, Hironori Fujisawa
备注:18 pages, 5 figures, 2 tables
摘要:空间预测是指从空间分布的观测值中估计未观测值。尽管最近的进展提高了对不同观察类型进行建模的能力,但在需要可解释性的行业中,实际应用仍然有限。为了缩小这一差距,解释黑箱预测因子的代理模型为可解释的决策提供了一条有前途的道路。在这项研究中,我们提出了一个图划分问题,以构建空间段,最大限度地减少个别预测的段内方差之和。数据点到段的分配可以用公式表示为混合整数二次规划问题。虽然该公式可能能够识别精确的片段,但随着数据点数量的增加,其计算复杂性变得令人望而却步。出于这一挑战,我们开发了一个近似方案,利用图分区的结构特性。实验结果表明,这种近似识别空间段的计算效率。
摘要:Spatial prediction refers to the estimation of unobserved values from spatially distributed observations. Although recent advances have improved the capacity to model diverse observation types, adoption in practice remains limited in industries that demand interpretability. To mitigate this gap, surrogate models that explain black-box predictors provide a promising path toward interpretable decision making. In this study, we propose a graph partitioning problem to construct spatial segments that minimize the sum of within-segment variances of individual predictions. The assignment of data points to segments can be formulated as a mixed-integer quadratic programming problem. While this formulation potentially enables the identification of exact segments, its computational complexity becomes prohibitive as the number of data points increases. Motivated by this challenge, we develop an approximation scheme that leverages the structural properties of graph partitioning. Experimental results demonstrate the computational efficiency of this approximation in identifying spatial segments.
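The objective named in this abstract, the sum of within-segment variances of individual predictions, can be written down directly. The sketch below evaluates it for a given assignment of data points to segments. It assumes population variance per segment and ignores the graph (contiguity) constraints and the mixed-integer solver, none of which are specified in the abstract; the function name is hypothetical.

```python
from collections import defaultdict
from statistics import pvariance


def within_segment_variance_sum(predictions, segment_ids):
    """Sum over segments of the population variance of the predictions in that segment."""
    groups = defaultdict(list)
    for value, segment in zip(predictions, segment_ids):
        groups[segment].append(value)
    return sum(pvariance(values) for values in groups.values())
```

For predictions `[1.0, 1.0, 5.0, 5.0]`, grouping equal values together (`[0, 0, 1, 1]`) gives an objective of 0, whereas mixing them (`[0, 1, 0, 1]`) gives 4 + 4 = 8, so a partitioning method minimizing this quantity prefers segments whose predictions agree.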
【11】Time-Frequency Filtering Meets Graph Clustering
标题:时频滤波遇上图聚类
链接:https://arxiv.org/abs/2510.07503
作者:Marcelo A. Colominas, Stefan Steinerberger, Hau-Tieng Wu
摘要:我们表明,从时间-频率表示识别不同的信号分量的问题可以等效地被称为一个图聚类问题:给定一个图$G=(V,E)$一个目的是确定“集群”,子图,是强连接,它们之间的连接相对较少。图聚类问题得到了很好的研究,我们展示了这些想法如何提出(许多)新的方法来识别信号分量。数值实验说明了这一思想。
摘要:We show that the problem of identifying different signal components from a time-frequency representation can be equivalently phrased as a graph clustering problem: given a graph $G=(V,E)$ one aims to identify `clusters', subgraphs that are strongly connected and have relatively few connections between them. The graph clustering problem is well studied, we show how these ideas can suggest (many) new ways to identify signal components. Numerical experiments illustrate the ideas.
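To make the reformulation concrete, the sketch below treats ridge points of a time-frequency representation as graph vertices, connects points that are close in both time and frequency, and reports connected components as clusters. Connected components are swapped in here as the simplest possible clustering rule; the paper's graph construction and clustering algorithms are not given in the abstract, and the thresholds `max_dt` and `max_df` are hypothetical.

```python
def cluster_ridge_points(points, max_dt=1, max_df=2.0):
    """Group (time, frequency) points into connected components of a proximity graph."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, (t1, f1) in enumerate(points):
        for j in range(i + 1, len(points)):
            t2, f2 = points[j]
            # Edge if the two points are neighbors in the time-frequency plane.
            if abs(t1 - t2) <= max_dt and abs(f1 - f2) <= max_df:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

Two components at well-separated frequencies end up in different clusters because no edge bridges them; components that cross in the time-frequency plane would need a less naive clustering rule than this one.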
Transformer(5篇)
【1】MeSH: Memory-as-State-Highways for Recursive Transformers
标题:MeSH:面向递归Transformer的记忆即状态高速公路
链接:https://arxiv.org/abs/2510.07739
作者:Chengting Yu, Xiaobo Shu, Yadao Wang, Yizhen Zhang, Haoyi Wu, Jiaang Li, Rujiao Long, Ziheng Chen, Yuchi Xu, Wenbo Su, Bo Zheng
摘要:递归Transformer重用参数并对隐藏状态进行多次迭代,将计算深度与参数深度解耦。然而,在计算量相当的情况下,参数较少的递归模型往往落后于非递归模型。通过探测隐藏状态,我们将这种性能差距追溯到两个主要瓶颈:无差别计算,即核心模块被迫在每次迭代时采用相似的计算模式;以及信息过载,即长期信息和瞬时信息必须共存于单个隐藏状态中。为了解决这些问题,我们引入了记忆即状态高速公路(MeSH)方案,将状态管理外化到一个显式的内存缓冲区,并采用轻量级路由器动态地使各次迭代的计算多样化。探测可视化证实,MeSH通过在迭代间诱导功能专门化成功地解决了这些问题。在Pythia套件(160M-1.4B)上,MeSH增强的递归Transformer相对于递归基线持续改进,并在1.4B规模上优于更大的非递归对应模型,将平均下游准确率提高了1.06%,同时非嵌入参数减少了33%。我们的分析表明,MeSH是一种可扩展且有原则的架构,可用于构建更强大的递归模型。
摘要:Recursive transformers reuse parameters and iterate over hidden states multiple times, decoupling compute depth from parameter depth. However, under matched compute, recursive models with fewer parameters often lag behind non-recursive counterparts. By probing hidden states, we trace this performance gap to two primary bottlenecks: undifferentiated computation, where the core is forced to adopt a similar computational pattern at every iteration, and information overload, where long-lived and transient information must coexist in a single hidden state. To address the issues, we introduce a Memory-as-State-Highways (MeSH) scheme, which externalizes state management into an explicit memory buffer and employs lightweight routers to dynamically diversify computation across iterations. Probing visualizations confirm that MeSH successfully resolves the pathologies by inducing functional specialization across iterations. On the Pythia suite (160M-1.4B), MeSH-enhanced recursive transformers consistently improve over recursive baselines and outperforms its larger non-recursive counterpart at the 1.4B scale, improving average downstream accuracy by +1.06% with 33% fewer non-embedding parameters. Our analysis establishes MeSH as a scalable and principled architecture for building stronger recursive models.
【2】Transformer-Based Indirect Structural Health Monitoring of Rail Infrastructure with Attention-Driven Detection and Localization of Transient Defects
标题:基于Transformer的铁路基础设施间接结构健康监测,具有注意力驱动的瞬时缺陷检测和定位
链接:https://arxiv.org/abs/2510.07606
作者:Sizhe Ma, Katherine A. Flanigan, Mario Bergés, James D. Brooks
备注:Preprint presented at the 15th International Workshop on Structural Health Monitoring (IWSHM)
摘要:使用车载传感器进行断轨检测的间接结构健康监测(iSHM)为铁路轨道评估提供了一种具有成本效益的范式,但由于车辆动力学复杂、信号噪声以及标记数据稀缺限制了监督方法,可靠地检测小的瞬态异常(2-10 cm)仍然是一个重大挑战。这项研究通过无监督的深度学习解决了这些问题。我们引入了一个增量合成数据基准,旨在系统地评估模型在日益复杂的挑战下的鲁棒性,例如速度变化、多通道输入以及iSHM中遇到的真实噪声模式。使用这个基准,我们评估了几个成熟的无监督模型,以及我们提出的Attention-Focused Transformer。我们的模型采用自注意力机制,通过重建进行训练,但创新性地主要从学习到的注意力权重的偏差中导出异常分数,旨在兼顾有效性和计算效率。基准测试结果显示,虽然基于Transformer的模型通常优于其他模型,但所有测试模型都表现出对高频局部噪声的显著脆弱性,这被确定为实际部署的关键瓶颈。值得注意的是,我们提出的模型实现了与最先进的解决方案相当的准确性,同时具有更快的推理速度。这突出了在未来的iSHM模型中增强噪声鲁棒性的关键需求,并将我们更高效的基于注意力的方法定位为开发实用的车载异常检测系统的有希望的基础。
摘要:Indirect structural health monitoring (iSHM) for broken rail detection using onboard sensors presents a cost-effective paradigm for railway track assessment, yet reliably detecting small, transient anomalies (2-10 cm) remains a significant challenge due to complex vehicle dynamics, signal noise, and the scarcity of labeled data limiting supervised approaches. This study addresses these issues through unsupervised deep learning. We introduce an incremental synthetic data benchmark designed to systematically evaluate model robustness against progressively complex challenges like speed variations, multi-channel inputs, and realistic noise patterns encountered in iSHM. Using this benchmark, we evaluate several established unsupervised models alongside our proposed Attention-Focused Transformer. Our model employs a self-attention mechanism, trained via reconstruction but innovatively deriving anomaly scores primarily from deviations in learned attention weights, aiming for both effectiveness and computational efficiency. Benchmarking results reveal that while transformer-based models generally outperform others, all tested models exhibit significant vulnerability to high-frequency localized noise, identifying this as a critical bottleneck for practical deployment. Notably, our proposed model achieves accuracy comparable to the state-of-the-art solution while demonstrating better inference speed. This highlights the crucial need for enhanced noise robustness in future iSHM models and positions our more efficient attention-based approach as a promising foundation for developing practical onboard anomaly detection systems.
【3】HEMERA: A Human-Explainable Transformer Model for Estimating Lung Cancer Risk using GWAS Data
标题:HEMERA:使用GWAS数据估计肺癌风险的人类可解释Transformer模型
链接:https://arxiv.org/abs/2510.07477
作者:Maria Mahbub, Robert J. Klein, Myvizhi Esai Selvan, Rowena Yip, Claudia Henschke, Providencia Morales, Ian Goethert, Olivera Kotevska, Mayanka Chandra Shekar, Sean R. Wilkinson, Eileen McAllister, Samuel M. Aguayo, Zeynep H. Gümüş, Ioana Danciu, VA Million Veteran Program
备注:18 pages, 6 figures, 3 tables
摘要:肺癌(LC)是美国第三大常见癌症,也是癌症死亡的主要原因。虽然吸烟是主要的危险因素,但从不吸烟者中LC的发生以及家族聚集性研究突出了遗传因素的作用。通过全基因组关联研究(GWAS)鉴定的遗传生物标志物是评估LC风险的有前途的工具。我们介绍HEMERA(Human-Explainable Transformer Model for Estimating Lung Cancer Risk using GWAS Data),这是一个新的框架,它将基于Transformer的可解释深度学习应用于单核苷酸多态性(SNP)的GWAS数据,以预测LC风险。与以前的方法不同,HEMERA直接处理原始基因型数据,而不需要临床协变量,并引入了加性位置编码、神经基因型嵌入和精细的变异过滤。一个基于逐层积分梯度(Layer-wise Integrated Gradients)的事后可解释性模块能够将模型预测归因于特定的SNP,其结果与已知的LC风险基因座高度一致。HEMERA使用来自27,254名百万退伍军人计划(Million Veteran Program)参与者的数据进行训练,实现了>99%的AUC(受试者工作特征曲线下面积)评分。这些研究结果支持将透明的、可生成假设的模型用于个性化LC风险评估和早期干预。
摘要:Lung cancer (LC) is the third most common cancer and the leading cause of cancer deaths in the US. Although smoking is the primary risk factor, the occurrence of LC in never-smokers and familial aggregation studies highlight a genetic component. Genetic biomarkers identified through genome-wide association studies (GWAS) are promising tools for assessing LC risk. We introduce HEMERA (Human-Explainable Transformer Model for Estimating Lung Cancer Risk using GWAS Data), a new framework that applies explainable transformer-based deep learning to GWAS data of single nucleotide polymorphisms (SNPs) for predicting LC risk. Unlike prior approaches, HEMERA directly processes raw genotype data without clinical covariates, introducing additive positional encodings, neural genotype embeddings, and refined variant filtering. A post hoc explainability module based on Layer-wise Integrated Gradients enables attribution of model predictions to specific SNPs, aligning strongly with known LC risk loci. Trained on data from 27,254 Million Veteran Program participants, HEMERA achieved >99% AUC (area under receiver characteristics) score. These findings support transparent, hypothesis-generating models for personalized LC risk assessment and early intervention.
【4】Locality-Sensitive Hashing-Based Efficient Point Transformer for Charged Particle Reconstruction
标题:基于局部敏感哈希的高效点Transformer用于带电粒子重建
链接:https://arxiv.org/abs/2510.07594
作者:Shitij Govil, Jack P. Rodgers, Yuan-Tang Chou, Siqi Miao, Amit Saha, Advaith Anand, Kilian Lieret, Gage DeZoort, Mia Liu, Javier Duarte, Pan Li, Shih-Chieh Hsu
备注:Accepted to NeurIPS 2025 Machine Learning and the Physical Sciences Workshop
摘要:带电粒子径迹重建是对撞机实验中的一项基础性任务,也是粒子重建的主要计算瓶颈。图神经网络(GNN)在这个问题上表现出了很强的性能,但是昂贵的图构造、不规则的计算和随机的内存访问模式大大限制了它们的吞吐量。最近提出的基于哈希的高效点Transformer(HEPT)通过注意力计算中的局部敏感哈希(LSH)为大型点云处理提供了理论上有保证的近线性复杂度;然而,其评估主要集中在嵌入质量上,并且HEPT所依赖的对象凝聚(object condensation)流程需要一个事后聚类步骤(例如DBScan),该步骤可能主导运行时间。在这项工作中,我们做出了两个贡献。首先,我们在相同的数据集和指标下,对HEPT和一个有代表性的基于GNN的流程的物理径迹重建性能进行了统一、公平的评估。其次,我们通过使用轻量级解码器扩展HEPT来引入HEPTv2,该解码器消除了聚类阶段并直接预测径迹分配。这种修改保留了HEPT规则的、硬件友好的计算,同时实现了超快速的端到端推理。在TrackML数据集上,优化的HEPTv2在A100上实现了每个事件约28 ms,同时保持了具有竞争力的径迹重建效率。这些结果将HEPTv2定位为基于GNN的流程的实用、可扩展的替代方案,用于快速径迹重建。
摘要:Charged particle track reconstruction is a foundational task in collider experiments and the main computational bottleneck in particle reconstruction. Graph neural networks (GNNs) have shown strong performance for this problem, but costly graph construction, irregular computations, and random memory access patterns substantially limit their throughput. The recently proposed Hashing-based Efficient Point Transformer (HEPT) offers a theoretically guaranteed near-linear complexity for large point cloud processing via locality-sensitive hashing (LSH) in attention computations; however, its evaluations have largely focused on embedding quality, and the object condensation pipeline on which HEPT relies requires a post-hoc clustering step (e.g., DBScan) that can dominate runtime. In this work, we make two contributions. First, we present a unified, fair evaluation of physics tracking performance for HEPT and a representative GNN-based pipeline under the same dataset and metrics. Second, we introduce HEPTv2 by extending HEPT with a lightweight decoder that eliminates the clustering stage and directly predicts track assignments. This modification preserves HEPT's regular, hardware-friendly computations while enabling ultra-fast end-to-end inference. On the TrackML dataset, optimized HEPTv2 achieves approximately 28 ms per event on an A100 while maintaining competitive tracking efficiency. These results position HEPTv2 as a practical, scalable alternative to GNN-based pipelines for fast tracking.
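For readers unfamiliar with locality-sensitive hashing, the sketch below shows the general idea of bucketing points so that expensive pairwise work (such as attention) can be restricted to points sharing a bucket. It uses random-hyperplane hashing as a simple stand-in; the specific LSH family and bucket scheme used by HEPT are not stated in the abstract, and all function names here are hypothetical.

```python
import random
from collections import defaultdict


def make_hyperplanes(n_bits, dim, seed=0):
    """Draw n_bits random Gaussian hyperplane normals in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def hash_point(point, hyperplanes):
    """One bit per hyperplane: which side of the plane the point lies on."""
    return tuple(int(sum(w * x for w, x in zip(plane, point)) >= 0.0)
                 for plane in hyperplanes)


def bucketize(points, hyperplanes):
    """Map each hash code to the indices of the points that received it."""
    buckets = defaultdict(list)
    for index, point in enumerate(points):
        buckets[hash_point(point, hyperplanes)].append(index)
    return dict(buckets)
```

Points pointing in similar directions tend to share a code, so each point is compared only with its bucket mates rather than with every other point, which is the source of the near-linear cost that hashing-based attention aims for.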
【5】Attention to Order: Transformers Discover Phase Transitions via Learnability
标题:关注有序:Transformer通过可学习性发现相变
链接:https://arxiv.org/abs/2510.07401
作者:Şener Özönder
摘要:相变标志着集体行为的质的重组,但当缺乏分析解决方案和传统模拟失败时,识别它们的边界仍然具有挑战性。在这里,我们引入可学习性作为一个通用的标准,定义为一个Transformer模型的能力,包含注意力机制,从微观状态提取结构。使用自监督学习和蒙特卡罗生成的二维伊辛模型的配置,我们表明,有序的阶段对应于增强的可学习性,表现在减少训练损失和结构化的注意力模式,而无序的阶段仍然抵抗学习。两个无监督的诊断,训练损失的急剧跳跃和注意熵的上升,恢复临界温度与精确值非常一致。我们的研究结果建立了可学习性作为相变的数据驱动标记,并强调了凝聚态物质中的长程有序与现代语言模型中结构的出现之间的深刻相似之处。
摘要:Phase transitions mark qualitative reorganizations of collective behavior, yet identifying their boundaries remains challenging whenever analytic solutions are absent and conventional simulations fail. Here we introduce learnability as a universal criterion, defined as the ability of a transformer model containing attention mechanism to extract structure from microscopic states. Using self-supervised learning and Monte Carlo generated configurations of the two-dimensional Ising model, we show that ordered phases correspond to enhanced learnability, manifested in both reduced training loss and structured attention patterns, while disordered phases remain resistant to learning. Two unsupervised diagnostics, the sharp jump in training loss and the rise in attention entropy, recover the critical temperature in excellent agreement with the exact value. Our results establish learnability as a data-driven marker of phase transitions and highlight deep parallels between long-range order in condensed matter and the emergence of structure in modern language models.
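Two quantities from this abstract are easy to pin down in code. The first is the exact critical temperature that the diagnostics are compared against; the sketch assumes the square-lattice, nearest-neighbor model with the Boltzmann constant set to 1, for which Onsager's result is T_c = 2J / ln(1 + sqrt(2)). The second is a Shannon entropy over one row of attention weights; the abstract does not give the paper's exact entropy definition, so this form is an assumption.

```python
import math


def ising_critical_temperature(coupling=1.0):
    """Onsager's exact T_c for the 2D square-lattice Ising model (k_B = 1 assumed)."""
    return 2.0 * coupling / math.log(1.0 + math.sqrt(2.0))


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution; zero weights are skipped."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)
```

With `coupling=1.0` the critical temperature is about 2.269. Uniform attention over n positions has entropy ln(n), the maximum, while attention concentrated on a single position has entropy 0, which is why a rise in attention entropy can signal the loss of structure in the disordered phase.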
GAN|对抗|攻击|生成相关(15篇)
【1】AI-Driven Radiology Report Generation for Traumatic Brain Injuries
标题:人工智能驱动的创伤性脑损伤放射学报告生成
链接:https://arxiv.org/abs/2510.08498
作者:Riadh Bouslimi, Houda Trabelsi, Wahiba Ben Abdssalem Karaa, Hana Hedhli
摘要:创伤性脑损伤在急诊医学中提出了重大的诊断挑战,及时解释医学图像对患者的预后至关重要。在本文中,我们提出了一种新的基于AI的方法,用于自动生成针对颅脑创伤病例的放射学报告。我们的模型将AC-BiFPN与Transformer架构集成在一起,以捕获和处理复杂的医学成像数据,如CT和MRI扫描。AC-BiFPN提取多尺度特征,能够检测颅内出血等复杂异常,而Transformer则通过对长距离依赖关系建模,生成连贯、上下文相关的诊断报告。我们在RSNA颅内出血检测数据集上评估了模型的性能,其在诊断准确性和报告生成方面都优于传统的基于CNN的模型。该解决方案不仅支持放射科医生在高压环境中工作,还为实习医生提供了一个强大的教育工具,提供实时反馈并增强他们的学习体验。我们的研究结果证明了将高级特征提取与基于Transformer的文本生成相结合的潜力,以改善创伤性脑损伤诊断中的临床决策。
摘要:Traumatic brain injuries present significant diagnostic challenges in emergency medicine, where the timely interpretation of medical images is crucial for patient outcomes. In this paper, we propose a novel AI-based approach for automatic radiology report generation tailored to cranial trauma cases. Our model integrates an AC-BiFPN with a Transformer architecture to capture and process complex medical imaging data such as CT and MRI scans. The AC-BiFPN extracts multi-scale features, enabling the detection of intricate anomalies like intracranial hemorrhages, while the Transformer generates coherent, contextually relevant diagnostic reports by modeling long-range dependencies. We evaluate the performance of our model on the RSNA Intracranial Hemorrhage Detection dataset, where it outperforms traditional CNN-based models in both diagnostic accuracy and report generation. This solution not only supports radiologists in high-pressure environments but also provides a powerful educational tool for trainee physicians, offering real-time feedback and enhancing their learning experience. Our findings demonstrate the potential of combining advanced feature extraction with transformer-based text generation to improve clinical decision-making in the diagnosis of traumatic brain injuries.
【2】Synthetic Series-Symbol Data Generation for Time Series Foundation Models
标题:时间序列基础模型的合成序列符号数据生成
链接:https://arxiv.org/abs/2510.08445
作者:Wenxuan Wang, Kai Wu, Yujian Betterest Li, Dan Wang, Xiaoyu Zhang
备注:63 pages, NeurIPS 2025 accepted
摘要:时间序列分析(TSA)的基础模型引起了人们的极大关注。然而,训练数据稀缺和不平衡等挑战继续阻碍其发展。受复杂动态系统理论的启发,我们设计了一种序列-符号数据生成机制,能够不受限制地创建与相应符号表达式配对的高质量时间序列数据。为了利用具有强相关性的序列-符号数据对,我们开发了SymTime,这是一个利用符号信息增强时间序列表示的预训练基础模型。SymTime在下游任务上进行微调后,在五个主要TSA任务上展示了具有竞争力的性能,可与在真实世界数据集上预训练的基础模型相媲美。这种方法凸显了序列-符号数据生成和预训练机制在克服数据稀缺和提高任务性能方面的潜力。该代码可在https://github.com/wwhenxuan/SymTime上获得。
摘要:Foundation models for time series analysis (TSA) have attracted significant attention. However, challenges such as training data scarcity and imbalance continue to hinder their development. Inspired by complex dynamic system theories, we design a series-symbol data generation mechanism, enabling the unrestricted creation of high-quality time series data paired with corresponding symbolic expressions. To leverage series-symbol data pairs with strong correlations, we develop SymTime, a pre-trained foundation model for enhancing time series representation using symbolic information. SymTime demonstrates competitive performance across five major TSA tasks when fine-tuned on downstream tasks, rivaling foundation models pre-trained on real-world datasets. This approach underscores the potential of series-symbol data generation and pretraining mechanisms in overcoming data scarcity and enhancing task performance. The code is available at https://github.com/wwhenxuan/SymTime.
【3】Contrastive Decoding for Synthetic Data Generation in Low-Resource Language Modeling
标题:低资源语言建模中合成数据生成的对比解码
链接:https://arxiv.org/abs/2510.08245
作者:Jannek Ulm, Kevin Du, Vésteinn Snæbjarnarson
备注:13 pages, 3 figures
摘要:大型语言模型(LLM)是在大量文本数据上训练的,人们担心这些数据可能很快就会达到极限。一个潜在的解决方案是对从LLM采样的合成数据进行训练。在这项工作中,我们在这一想法的基础上,研究对比解码在生成合成语料库方面的益处。在一个受控的环境中,我们利用在相同的1亿词原始语料库上训练的好模型和坏模型之间的相对差异来采样语料库。通过放大性能更好的模型的信号,我们创建了一个合成语料库,并将其与原始训练数据混合。我们的研究结果表明,在合成数据和真实数据的混合上进行训练可以提高语言建模目标和一系列下游任务的性能。特别是,我们发现,混入来自对比解码的合成数据进行训练有利于需要更多推理技能的任务,而来自传统采样的合成数据则对更依赖表层语言能力的任务帮助更大。
摘要:Large language models (LLMs) are trained on huge amounts of textual data, and concerns have been raised that the limits of such data may soon be reached. A potential solution is to train on synthetic data sampled from LLMs. In this work, we build on this idea and investigate the benefits of contrastive decoding for generating synthetic corpora. In a controlled setting, we experiment with sampling corpora using the relative difference between a good and bad model trained on the same original corpus of 100 million words. By amplifying the signal from a model that has better performance, we create a synthetic corpus and mix it with the original training data. Our findings show that training on a mixture of synthesized and real data improves performance on the language modeling objective and a range of downstream tasks. In particular, we see that training with a mix of synthetic data from contrastive decoding benefits tasks that require more reasoning skills, while synthetic data from traditional sampling helps more on tasks dependent on surface level linguistic capabilities.
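The abstract above describes sampling with "the relative difference between a good and bad model". A minimal sketch of one common way to score tokens in contrastive decoding follows. It assumes per-token log-probabilities are already available as plain dictionaries; the names `contrastive_scores`, `pick_token`, `alpha`, and `lam` are hypothetical, and greedy selection stands in for whatever sampling scheme the paper actually uses.

```python
import math


def contrastive_scores(logp_good, logp_bad, alpha=0.1, lam=1.0):
    """Score candidate tokens by the gap between a strong and a weak model.

    logp_good, logp_bad: dicts mapping token -> log-probability (assumed inputs).
    alpha: plausibility cutoff relative to the strong model's best token
           (illustrative default, not taken from the paper).
    lam:   weight on the weak model's log-probability (hypothetical knob).
    """
    # Keep only tokens the strong model itself finds reasonably likely,
    # so a tiny weak-model probability cannot promote an implausible token.
    cutoff = max(logp_good.values()) + math.log(alpha)
    return {
        tok: lg - lam * logp_bad[tok]
        for tok, lg in logp_good.items()
        if lg >= cutoff
    }


def pick_token(logp_good, logp_bad, alpha=0.1, lam=1.0):
    """Greedy choice: the plausible token with the largest contrastive score."""
    scores = contrastive_scores(logp_good, logp_bad, alpha, lam)
    return max(scores, key=scores.get)


good = {"a": math.log(0.5), "b": math.log(0.4), "c": math.log(0.1)}
bad = {"a": math.log(0.6), "b": math.log(0.1), "c": math.log(0.3)}
choice = pick_token(good, bad)
```

In this toy example the strong model slightly prefers "a", but the weak model likes "a" even more, so the contrastive score favors "b", the token on which the two models disagree most in the strong model's favor.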
【4】Bidirectional Representations Augmented Autoregressive Biological Sequence Generation: Application in De Novo Peptide Sequencing
标题:双向表示增强自回归生物序列生成:在从头肽测序中的应用
链接:https://arxiv.org/abs/2510.08169
作者:Xiang Zhang, Jiaqi Wei, Zijie Qiu, Sheng Xu, Zhi Jin, ZhiQiang Gao, Nanqing Dong, Siqi Sun
备注:Accepted by NeurIPS 2025
摘要:自回归(AR)模型在序列生成中很常见,但由于其单向性,在许多生物学任务中受到限制,例如从头肽测序和蛋白质建模,无法捕获关键的全局双向令牌依赖性。非自回归(NAR)模型提供了整体的双向表示,但面临着生成一致性和可扩展性的挑战。为了超越这一点,我们提出了一个混合框架,通过动态整合丰富的上下文信息,从非自回归机制增强AR生成。我们的方法耦合一个共享的输入编码器与两个解码器:一个非自回归学习潜在的双向生物特征,和AR解码器合成的生物序列,利用这些双向功能。一种新的交叉解码器注意模块使AR解码器能够迭代查询和整合这些双向特征,丰富其预测。这种协同作用是通过量身定制的训练策略培养的,该策略具有用于平衡目标的重要性退火和用于稳定、集中学习的交叉解码器梯度块。对要求严格的九种从头肽测序基准的评估表明,我们的模型大大超过AR和NAR基线。它独特地协调了AR稳定性与NAR上下文感知,为各种下游数据提供强大的卓越性能。这项研究推进了生物序列建模技术,并为增强AR模型提供了一种新的架构范例,增强了对复杂序列生成的双向理解。代码可在https://github.com/BEAM-Labs/denovo上获得。
摘要:Autoregressive (AR) models, common in sequence generation, are limited in many biological tasks such as de novo peptide sequencing and protein modeling by their unidirectional nature, failing to capture crucial global bidirectional token dependencies. Non-Autoregressive (NAR) models offer holistic, bidirectional representations but face challenges with generative coherence and scalability. To transcend this, we propose a hybrid framework enhancing AR generation by dynamically integrating rich contextual information from non-autoregressive mechanisms. Our approach couples a shared input encoder with two decoders: a non-autoregressive one learning latent bidirectional biological features, and an AR decoder synthesizing the biological sequence by leveraging these bidirectional features. A novel cross-decoder attention module enables the AR decoder to iteratively query and integrate these bidirectional features, enriching its predictions. This synergy is cultivated via a tailored training strategy with importance annealing for balanced objectives and cross-decoder gradient blocking for stable, focused learning. Evaluations on a demanding nine-species benchmark of de novo peptide sequencing show that our model substantially surpasses AR and NAR baselines. It uniquely harmonizes AR stability with NAR contextual awareness, delivering robust, superior performance on diverse downstream data. This research advances biological sequence modeling techniques and contributes a novel architectural paradigm for augmenting AR models with enhanced bidirectional understanding for complex sequence generation. Code is available at https://github.com/BEAM-Labs/denovo.
【5】A Novel Ensemble Learning Approach for Enhanced IoT Attack Detection: Redefining Security Paradigms in Connected Systems
标题:用于增强物联网攻击检测的新型集成学习方法:重新定义互联系统中的安全范式
链接:https://arxiv.org/abs/2510.08084
作者:Hikmat A. M. Abdeljaber, Md. Alamgir Hossain, Sultan Ahmad, Ahmed Alsanad, Md Alimul Haque, Sudan Jha, Jabeen Nazeer
备注:14 pages, 5 figures, 7 tables
摘要:物联网(IoT)设备的快速扩展通过实现广泛的连接和数据交换改变了各行各业和日常生活。然而,这种日益增强的互连带来了严重的安全漏洞,使物联网系统更容易受到复杂的网络攻击。这项研究提出了一种新的集成学习架构,旨在提高物联网攻击检测能力。所提出的方法应用了先进的机器学习技术,特别是Extra Trees分类器,并配合全面的预处理和超参数优化。它在多个基准数据集上进行了评估,包括CICIoT2023、IoTID20、BotNeTIoT L01、ToN IoT、N BaIoT和BoT IoT。结果显示了出色的性能,实现了较高的召回率、准确率和精确率,且错误率非常低。这些结果表明了该模型相对于现有方法的效率和优越性,为保护物联网环境提供了一种有效且可扩展的方法。这项研究为未来保护联网设备免受不断演变的网络威胁奠定了坚实的基础。
摘要:The rapid expansion of Internet of Things (IoT) devices has transformed industries and daily life by enabling widespread connectivity and data exchange. However, this increased interconnection has introduced serious security vulnerabilities, making IoT systems more exposed to sophisticated cyber attacks. This study presents a novel ensemble learning architecture designed to improve IoT attack detection. The proposed approach applies advanced machine learning techniques, specifically the Extra Trees Classifier, along with thorough preprocessing and hyperparameter optimization. It is evaluated on several benchmark datasets including CICIoT2023, IoTID20, BotNeTIoT L01, ToN IoT, N BaIoT, and BoT IoT. The results show excellent performance, achieving high recall, accuracy, and precision with very low error rates. These outcomes demonstrate the model efficiency and superiority compared to existing approaches, providing an effective and scalable method for securing IoT environments. This research establishes a solid foundation for future progress in protecting connected devices from evolving cyber threats.
【6】Detecting and Mitigating Insertion Hallucination in Video-to-Audio Generation
标题:检测和缓解视频转音频生成中的插入幻觉
链接:https://arxiv.org/abs/2510.08078
作者:Liyang Chen, Hongkai Chen, Yujun Cai, Sifan Li, Qingwen Ye, Yiwei Wang
摘要:视频到音频(V2A)生成在自动为视频合成声音方面取得了显著的进步。然而,现有的评估指标侧重于语义和时间对齐,忽略了一个关键的失败模式:模型经常生成没有相应视觉来源的声学事件,特别是语音和音乐。我们将这种现象称为插入幻觉(IH),并将其确定为由数据集偏差(例如画外音的普遍存在)驱动的系统性风险,而当前的指标完全无法检测到它。为了解决这一挑战,我们首先开发了一个系统的评估框架,采用由多个音频事件检测器组成的多数投票集成。我们还引入了两个新的指标来量化这一问题的普遍程度和严重程度:IH@vid(存在幻觉的视频所占比例)和IH@dur(幻觉时长所占比例)。在此基础上,我们提出了后验特征校正(PFC),这是一种新颖的免训练推理时方法,可以减轻IH。PFC分两遍运行:它首先生成初始音频输出以检测幻觉片段,然后在掩蔽这些时间戳对应的视频特征后重新生成音频。在多个主流V2A基准上的实验首先揭示了最先进的模型存在严重的IH。相比之下,我们的PFC方法平均将幻觉的普遍程度和持续时间降低了50%以上,同时不会降低音频质量和时间同步的传统指标,在某些情况下甚至有所改善。我们的工作首次正式定义、系统测量并有效缓解了插入幻觉,为更可靠、更忠实的V2A模型铺平了道路。
摘要:Video-to-Audio generation has made remarkable strides in automatically synthesizing sound for video. However, existing evaluation metrics, which focus on semantic and temporal alignment, overlook a critical failure mode: models often generate acoustic events, particularly speech and music, that have no corresponding visual source. We term this phenomenon Insertion Hallucination and identify it as a systemic risk driven by dataset biases, such as the prevalence of off-screen sounds, that remains completely undetected by current metrics. To address this challenge, we first develop a systematic evaluation framework that employs a majority-voting ensemble of multiple audio event detectors. We also introduce two novel metrics to quantify the prevalence and severity of this issue: IH@vid (the fraction of videos with hallucinations) and IH@dur (the fraction of hallucinated duration). Building on this, we propose Posterior Feature Correction, a novel training-free inference-time method that mitigates IH. PFC operates in a two-pass process: it first generates an initial audio output to detect hallucinated segments, and then regenerates the audio after masking the corresponding video features at those timestamps. Experiments on several mainstream V2A benchmarks first reveal that state-of-the-art models suffer from severe IH. In contrast, our PFC method reduces both the prevalence and duration of hallucinations by over 50% on average, without degrading, and in some cases even improving, conventional metrics for audio quality and temporal synchronization. Our work is the first to formally define, systematically measure, and effectively mitigate Insertion Hallucination, paving the way for more reliable and faithful V2A models.
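The two metrics named in the abstract, IH@vid and IH@dur, can be illustrated with a short sketch. The abstract only gives one-line definitions, so the aggregation below is an assumption: IH@dur is computed as total hallucinated time over total video time, and segments are assumed not to overlap. The function names `majority_vote` and `ih_metrics`, and the input layout, are hypothetical.

```python
def majority_vote(detector_flags):
    """True when more than half of the detectors flag a segment.

    detector_flags: list of booleans, one per audio event detector (assumed input).
    """
    return sum(detector_flags) > len(detector_flags) / 2


def ih_metrics(videos):
    """Return (IH@vid, IH@dur) for a set of generated clips.

    videos: list of (duration_seconds, segments) pairs, where segments is a
    list of (start, end) times already judged to be hallucinated.
    Assumption: segments within one video do not overlap.
    """
    if not videos:
        raise ValueError("need at least one video")
    with_hallucination = sum(1 for _, segments in videos if segments)
    total_duration = sum(duration for duration, _ in videos)
    hallucinated = sum(end - start for _, segments in videos for start, end in segments)
    return with_hallucination / len(videos), hallucinated / total_duration


clips = [
    (10.0, [(2.0, 4.0)]),
    (10.0, []),
    (20.0, [(0.0, 5.0), (10.0, 13.0)]),
    (10.0, []),
]
ih_vid, ih_dur = ih_metrics(clips)
```

Here two of the four clips contain a hallucinated segment, and 10 of the 50 total seconds are hallucinated. A per-video average of hallucinated fractions would be an equally plausible reading of IH@dur and would give a different number.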
【7】Backdoor Vectors: a Task Arithmetic View on Backdoor Attacks and Defenses
标题:后门向量:后门攻击和防御的任务算术视角
链接:https://arxiv.org/abs/2510.08016
作者:Stanisław Pawlak, Jan Dubiński, Daniel Marczak, Bartłomiej Twardowski
备注:22 pages, 13 figures, 15 tables
摘要:模型合并(MM)最近成为组合大型深度学习模型的一种有效方法。然而,它带来了重大的安全风险。最近的研究表明,它非常容易受到后门攻击:这类攻击将隐藏的触发器引入到单个微调模型实例中,使对手能够在推理时控制最终合并模型的输出。在这项工作中,我们提出了一个简单的框架,通过将攻击本身视为任务向量来理解后门攻击。后门向量(Backdoor Vector,BV)被计算为微调后的后门模型与微调后的干净模型的权重之差。BV为理解攻击提供了新的见解,也为衡量攻击的相似性和可迁移性提供了更有效的框架。此外,我们提出了一种称为稀疏后门向量(Sparse Backdoor Vector,SBV)的新方法,它将多个攻击合并为一个攻击,从而通过合并增强后门的韧性。我们确定了MM中后门威胁背后的核心漏洞:利用基础模型中对抗性弱点的固有触发器(inherent triggers)。为了解决这个问题,我们提出了注入BV减法(Injection BV Subtraction,IBVS),这是一种针对MM中后门的无假设防御。我们的研究结果表明,SBV超越了以前的攻击,并且是第一种利用合并来提高后门有效性的方法。与此同时,IBVS提供了一种轻量级的通用防御,即使在后门威胁完全未知的情况下也仍然有效。
摘要:Model merging (MM) recently emerged as an effective method for combining large deep learning models. However, it poses significant security risks. Recent research shows that it is highly susceptible to backdoor attacks, which introduce a hidden trigger into a single fine-tuned model instance that allows the adversary to control the output of the final merged model at inference time. In this work, we propose a simple framework for understanding backdoor attacks by treating the attack itself as a task vector. A Backdoor Vector (BV) is calculated as the difference between the weights of a fine-tuned backdoored model and a fine-tuned clean model. BVs reveal new insights into the understanding of attacks and offer a more effective framework for measuring their similarity and transferability. Furthermore, we propose a novel method, dubbed Sparse Backdoor Vector (SBV), that enhances backdoor resilience through merging by combining multiple attacks into a single one. We identify the core vulnerability behind backdoor threats in MM: inherent triggers that exploit adversarial weaknesses in the base model. To counter this, we propose Injection BV Subtraction (IBVS), an assumption-free defense against backdoors in MM. Our results show that SBVs surpass prior attacks and are the first method to leverage merging to improve backdoor effectiveness. At the same time, IBVS provides a lightweight, general defense that remains effective even when the backdoor threat is entirely unknown.
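The abstract defines a Backdoor Vector as the weight difference between a fine-tuned backdoored model and a fine-tuned clean model. The sketch below illustrates that subtraction, and the usual task-arithmetic step of adding a scaled vector back onto a set of weights. Weights are represented as plain dictionaries of float lists for self-containment; `task_vector` and `apply_vector` are hypothetical helper names, and this is not the paper's SBV or IBVS procedure, whose details the abstract does not give.

```python
def task_vector(finetuned, reference):
    """Element-wise difference finetuned - reference, per named parameter."""
    return {
        name: [a - b for a, b in zip(finetuned[name], reference[name])]
        for name in finetuned
    }


def apply_vector(weights, vector, scale=1.0):
    """Return weights + scale * vector, leaving the inputs untouched."""
    return {
        name: [w + scale * v for w, v in zip(weights[name], vector[name])]
        for name in weights
    }


# Toy stand-ins for two fine-tuned checkpoints (values are made up).
clean = {"layer.weight": [1.0, 2.0]}
backdoored = {"layer.weight": [1.5, 1.0]}

bv = task_vector(backdoored, clean)               # the backdoor vector
restored = apply_vector(backdoored, bv, scale=-1.0)  # subtracting it recovers the clean weights
```

Subtracting the vector with `scale=-1.0` recovers the clean weights exactly in this toy case because both checkpoints are known; the point of the example is only the arithmetic, not a defense.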
【8】TTOM: Test-Time Optimization and Memorization for Compositional Video Generation
标题:TTOM:面向组合式视频生成的测试时优化与记忆
链接:https://arxiv.org/abs/2510.07940
作者:Leigang Qu, Ziyang Wang, Na Zheng, Wenjie Wang, Liqiang Nie, Tat-Seng Chua
摘要:视频基础模型(VFM)表现出出色的视觉生成性能,但在组合式场景(例如运动、数量和空间关系)中表现不佳。在这项工作中,我们提出了测试时优化与记忆(TTOM),这是一个免训练的框架,在推理过程中将VFM输出与时空布局对齐,以实现更好的文本-图像对齐。与现有工作中对每个样本的潜变量或注意力进行直接干预不同,我们在通用的布局-注意力目标引导下引入并优化新的参数。此外,我们在流式设置中对视频生成进行建模,并通过参数化记忆机制维护历史优化上下文,该机制支持插入、读取、更新和删除等灵活操作。值得注意的是,我们发现TTOM能够解耦组合式世界知识,表现出强大的可迁移性和泛化能力。在T2V-CompBench和Vbench基准上的实验结果表明,TTOM是一个有效、实用、可扩展且高效的框架,可以即时实现组合式视频生成的跨模态对齐。
摘要:Video Foundation Models (VFMs) exhibit remarkable visual generation performance, but struggle in compositional scenarios (e.g., motion, numeracy, and spatial relation). In this work, we introduce Test-Time Optimization and Memorization (TTOM), a training-free framework that aligns VFM outputs with spatiotemporal layouts during inference for better text-image alignment. Rather than direct intervention to latents or attention per-sample in existing work, we integrate and optimize new parameters guided by a general layout-attention objective. Furthermore, we formulate video generation within a streaming setting, and maintain historical optimization contexts with a parametric memory mechanism that supports flexible operations, such as insert, read, update, and delete. Notably, we found that TTOM disentangles compositional world knowledge, showing powerful transferability and generalization. Experimental results on the T2V-CompBench and Vbench benchmarks establish TTOM as an effective, practical, scalable, and efficient framework to achieve cross-modal alignment for compositional video generation on the fly.
【9】MetaDefense: Defending Finetuning-based Jailbreak Attack Before and During Generation
标题:MetaDefense:在生成之前和生成期间防御基于Finetuning的越狱攻击
链接:https://arxiv.org/abs/2510.07835
作者:Weisen Jiang, Sinno Jialin Pan
备注:Accepted By NeurIPS 2025
摘要:本文介绍了MetaDefense,这是一种用于防御大型语言模型(LLM)中基于微调的越狱攻击的新框架。我们观察到,尽管LLM能够在嵌入空间中区分经过伪装的有害查询,现有的防御机制仍无法泛化到由未见过的攻击模板伪装的有害查询。基于这些见解,我们提出了一个两阶段的防御方法:(i)生成前防御,在响应生成开始之前检测有害查询;(ii)生成中防御,在生成过程中监视部分响应,以防止输出更多有害内容。我们的MetaDefense训练LLM使用专门的提示来预测查询和部分响应的危害性,从而提前终止潜在的有害交互。跨多个LLM架构(LLaMA-2-7B、Qwen-2.5-3B-Instruct和LLaMA-3.2-3B-Instruct)的广泛实验表明,MetaDefense的性能显著优于现有的防御机制,对使用已见和未见攻击模板的有害查询都能实现稳健的防御,同时在良性任务上保持有竞争力的性能。代码可在https://github.com/ws-jiang/MetaDefense上获得。
摘要:This paper introduces MetaDefense, a novel framework for defending against finetuning-based jailbreak attacks in large language models (LLMs). We observe that existing defense mechanisms fail to generalize to harmful queries disguised by unseen attack templates, despite LLMs being capable of distinguishing disguised harmful queries in the embedding space. Based on these insights, we propose a two-stage defense approach: (i) pre-generation defense that detects harmful queries before response generation begins, and (ii) mid-generation defense that monitors partial responses during generation to prevent outputting more harmful content. Our MetaDefense trains the LLM to predict the harmfulness of both queries and partial responses using specialized prompts, enabling early termination of potentially harmful interactions. Extensive experiments across multiple LLM architectures (LLaMA-2-7B, Qwen-2.5-3B-Instruct, and LLaMA-3.2-3B-Instruct) demonstrate that MetaDefense significantly outperforms existing defense mechanisms, achieving robust defense against harmful queries with seen and unseen attack templates while maintaining competitive performance on benign tasks. Code is available at https://github.com/ws-jiang/MetaDefense.
【10】HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation
标题:HiPRAG:面向高效智能体式检索增强生成的分层过程奖励
链接:https://arxiv.org/abs/2510.07794
作者:Peilin Wu, Mian Zhang, Kun Wan, Wentian Zhao, Kaiyu He, Xinya Du, Zhiyu Chen
备注:Under review
摘要:Agentic RAG(智能体式检索增强生成)是一种强大的技术,用于整合LLM所缺乏的外部信息,从而更好地解决问题和回答问题。然而,次优搜索行为广泛存在,如过度搜索(检索已知信息)和搜索不足(在必要时未进行搜索),这会导致不必要的开销和不可靠的输出。目前的训练方法通常依赖RL框架中基于结果的奖励,缺乏解决这些低效问题所需的细粒度控制。为了克服这一点,我们提出了面向高效Agentic RAG的分层过程奖励(HiPRAG),这是一种将细粒度、基于知识的过程奖励纳入RL训练的训练方法。我们的方法将智能体的推理轨迹分解为离散的、可解析的步骤,从而即时评估每个搜索决策的必要性。然后,我们应用一个分层奖励函数,在常用的结果奖励和格式奖励之上,根据最优搜索和非搜索步骤的比例提供额外奖励。在Qwen2.5和Llama-3.2模型上针对七个不同QA基准的实验表明,我们的方法实现了65.4%(3B)和67.2%(7B)的平均准确率。在取得这一结果的同时,搜索效率也得到提高:过度搜索率降至仅2.3%,搜索不足率也同时降低。这些结果证明了优化推理过程本身(而不仅仅是最终结果)的有效性。进一步的实验和分析表明,HiPRAG在多种RL算法、模型系列、规模和类型上都表现出良好的泛化能力。这项工作证明了通过RL进行细粒度控制对于提高搜索智能体推理效率和最优性的重要性和潜力。
摘要:Agentic RAG is a powerful technique for incorporating external information that LLMs lack, enabling better problem solving and question answering. However, suboptimal search behaviors exist widely, such as over-search (retrieving information already known) and under-search (failing to search when necessary), which leads to unnecessary overhead and unreliable outputs. Current training methods, which typically rely on outcome-based rewards in a RL framework, lack the fine-grained control needed to address these inefficiencies. To overcome this, we introduce Hierarchical Process Rewards for Efficient agentic RAG (HiPRAG), a training methodology that incorporates a fine-grained, knowledge-grounded process reward into the RL training. Our approach evaluates the necessity of each search decision on-the-fly by decomposing the agent's reasoning trajectory into discrete, parsable steps. We then apply a hierarchical reward function that provides an additional bonus based on the proportion of optimal search and non-search steps, on top of commonly used outcome and format rewards. Experiments on the Qwen2.5 and Llama-3.2 models across seven diverse QA benchmarks show that our method achieves average accuracies of 65.4% (3B) and 67.2% (7B). This is accomplished while improving search efficiency, reducing the over-search rate to just 2.3% and concurrently lowering the under-search rate. These results demonstrate the efficacy of optimizing the reasoning process itself, not just the final outcome. Further experiments and analysis demonstrate that HiPRAG shows good generalizability across a wide range of RL algorithms, model families, sizes, and types. This work demonstrates the importance and potential of fine-grained control through RL, for improving the efficiency and optimality of reasoning for search agents.
【11】GeoGen: A Two-stage Coarse-to-Fine Framework for Fine-grained Synthetic Location-based Social Network Trajectory Generation
标题:GeoGen:一个从粗到细的两阶段框架,用于细粒度合成基于位置的社交网络轨迹生成
链接:https://arxiv.org/abs/2510.07735
作者:Rongchao Xu, Kunlin Cai, Lin Jiang, Dahai Yu, Zhiqing Hong, Yuan Tian, Guang Wang
摘要:基于位置的社交网络(LBSN)签到轨迹数据对于许多实际应用非常重要,如POI推荐、广告和流行病干预。然而,高昂的收集成本和日益增长的隐私担忧使我们难以获取大规模的LBSN轨迹数据。合成数据生成方面的最新进展为实现这一目标提供了新的机会,即利用生成式人工智能生成合成数据,在确保隐私保护的同时保留真实数据的特征。然而,由于其空间离散、时间不规则的性质,以及由稀疏活动和不确定的人类移动性引起的复杂时空模式,生成合成LBSN签到轨迹仍然具有挑战性。为了解决这一挑战,我们提出了GeoGen,一个由粗到细的两阶段框架,用于大规模LBSN签到轨迹生成。在第一阶段,我们从原始的LBSN签到轨迹中重建空间连续、时间规则的潜在运动序列,然后设计一个带有高效去噪网络的稀疏感知时空扩散模型(S$^2$TDiff)来学习其底层行为模式。在第二阶段,我们设计了Coarse2FineNet,这是一种基于Transformer的Seq2Seq架构,在编码器中配备了动态上下文融合机制,并配有多任务混合头解码器,通过建模语义相关性和行为不确定性,基于粗粒度潜在运动序列生成细粒度LBSN轨迹。在四个真实世界数据集上进行的广泛实验表明,GeoGen在保真度和效用评估方面均优于最先进的模型,例如,它在FS-TKY数据集上的距离和半径指标分别提升了69%和55%以上。
摘要:Location-Based Social Network (LBSN) check-in trajectory data are important for many practical applications, like POI recommendation, advertising, and pandemic intervention. However, the high collection costs and ever-increasing privacy concerns prevent us from accessing large-scale LBSN trajectory data. The recent advances in synthetic data generation provide us with a new opportunity to achieve this, which utilizes generative AI to generate synthetic data that preserves the characteristics of real data while ensuring privacy protection. However, generating synthetic LBSN check-in trajectories remains challenging due to their spatially discrete, temporally irregular nature and the complex spatio-temporal patterns caused by sparse activities and uncertain human mobility. To address this challenge, we propose GeoGen, a two-stage coarse-to-fine framework for large-scale LBSN check-in trajectory generation. In the first stage, we reconstruct spatially continuous, temporally regular latent movement sequences from the original LBSN check-in trajectories and then design a Sparsity-aware Spatio-temporal Diffusion model (S$^2$TDiff) with an efficient denoising network to learn their underlying behavioral patterns. In the second stage, we design Coarse2FineNet, a Transformer-based Seq2Seq architecture equipped with a dynamic context fusion mechanism in the encoder and a multi-task hybrid-head decoder, which generates fine-grained LBSN trajectories based on coarse-grained latent movement sequences by modeling semantic relevance and behavioral uncertainty. Extensive experiments on four real-world datasets show that GeoGen outperforms state-of-the-art models in both fidelity and utility evaluation, e.g., it improves by over 69% and 55% on the distance and radius metrics on the FS-TKY dataset.
【12】EBGAN-MDN: An Energy-Based Adversarial Framework for Multi-Modal Behavior Cloning
标题:EBGAN-MDN:基于能量的多模态行为克隆对抗框架
链接:https://arxiv.org/abs/2510.07562
作者:Yixiao Li, Julia Barth, Thomas Kiefer, Ahmad Fraij
摘要:由于模式平均和模式崩溃,多模态行为克隆面临着重大挑战,传统模型无法捕获多样的输入-输出映射。这个问题在机器人等应用中至关重要,在这些应用中,对多个有效动作进行建模可以同时保证性能和安全性。我们提出了EBGAN-MDN,这是一个集成了基于能量的模型、混合密度网络(MDN)和对抗训练的框架。通过利用修改的InfoNCE损失和能量约束的MDN损失,EBGAN-MDN有效地解决了这些挑战。在合成和机器人基准测试上的实验证明了其卓越性能,使EBGAN-MDN成为多模态学习任务的有效且高效的解决方案。
摘要:Multi-modal behavior cloning faces significant challenges due to mode averaging and mode collapse, where traditional models fail to capture diverse input-output mappings. This problem is critical in applications like robotics, where modeling multiple valid actions ensures both performance and safety. We propose EBGAN-MDN, a framework that integrates energy-based models, Mixture Density Networks (MDNs), and adversarial training. By leveraging a modified InfoNCE loss and an energy-enforced MDN loss, EBGAN-MDN effectively addresses these challenges. Experiments on synthetic and robotic benchmarks demonstrate superior performance, establishing EBGAN-MDN as an effective and efficient solution for multi-modal learning tasks.
【13】ConCuR: Conciseness Makes State-of-the-Art Kernel Generation
标题:ConCuR:简洁打造最先进的内核生成
链接:https://arxiv.org/abs/2510.07356
作者:Lingcheng Kong, Jiateng Wei, Hanzhang Shen, Huan Wang
摘要:借助测试时扩展和强化学习技术,基于LLM的GPU内核生成最近发展迅速。然而,内核生成的一个关键挑战是高质量数据的稀缺,因为大多数高质量的内核都是专有的,而不是开源的。这一挑战使我们无法利用监督微调让LLM适配内核生成任务。为了应对这一挑战,我们开发了一个流水线,用于生成并筛选带有推理轨迹的高质量CUDA内核,其动机是一个关键的观察结果:简洁而信息丰富的推理轨迹能够稳健地生成高性能内核。使用这个流水线,我们构建了数据集ConCuR,并介绍了我们的模型KernelCoder,据我们所知,这是第一个在由PyTorch、推理和CUDA内核对组成的精选数据集上训练的模型。在KernelBench设置中,我们的模型比现有的性能最好的模型QwQ-32B有了显著的改进,并且优于所有针对内核生成进行了微调的开源模型,以及DeepSeek-V3.1-Think和Claude-4-sonnet等前沿模型。最后,我们表明,平均推理长度可以作为评估内核生成任务难度的一个指标。这些观察、指标以及我们的数据收集和筛选流水线有助于未来在内核生成任务中获得更好的数据。
摘要:GPU kernel generation by LLMs has recently experienced rapid development, leveraging test-time scaling and reinforcement learning techniques. However, a key challenge for kernel generation is the scarcity of high-quality data, as most high-quality kernels are proprietary and not open-source. This challenge prevents us from leveraging supervised fine-tuning to align LLMs to the kernel generation task. To address this challenge, we develop a pipeline that generates and curates high-quality CUDA kernels with reasoning traces, motivated by a critical observation that concise yet informative reasoning traces result in robust generation of high-performance kernels. Using this pipeline, we construct our dataset ConCuR and introduce our model KernelCoder, which is the first model trained on a curated dataset consisting of PyTorch, reasoning, and CUDA kernel pairs, to our knowledge. In the KernelBench setup, our model achieves significant improvements over the existing top-performing model, QwQ-32B, and outperforms all open-source models fine-tuned for kernel generation, as well as frontier models such as DeepSeek-V3.1-Think and Claude-4-sonnet. Finally, we show that the average reasoning length can serve as a metric to assess the difficulty of kernel generation tasks. The observations, metrics, and our data collection and curation pipeline can help obtain better data in the kernel generation task in the future.
【14】SpotDiff: Spotting and Disentangling Interference in Feature Space for Subject-Preserving Image Generation
标题:SpotDiff:在特征空间中发现和解开干扰以生成主题保留图像
链接:https://arxiv.org/abs/2510.07340
作者:Yongzhi Li, Saining Zhang, Yibing Chen, Boying Li, Yanxin Zhang, Xiaoyu Du
摘要:个性化图像生成的目的是忠实地保留参考主体的身份,同时适应不同的文本提示。现有的基于优化的方法可确保高保真度,但计算成本很高,而基于学习的方法则以受滋扰因素影响的纠缠表示为代价来提供效率。我们介绍了SpotDiff,一种新的基于学习的方法,通过发现和解开干扰来提取特定于主题的特征。利用预训练的CLIP图像编码器和专门的专家网络来进行姿势和背景识别,SpotDiff通过特征空间中的正交约束来隔离主体身份。为了实现原则性训练,我们引入了SpotDiff10k,这是一个具有一致姿势和背景变化的策展数据集。实验表明,SpotDiff实现了比现有方法更强大的主题保留和可控编辑,同时仅用10k个训练样本就获得了有竞争力的性能。
摘要:Personalized image generation aims to faithfully preserve a reference subject's identity while adapting to diverse text prompts. Existing optimization-based methods ensure high fidelity but are computationally expensive, while learning-based approaches offer efficiency at the cost of entangled representations influenced by nuisance factors. We introduce SpotDiff, a novel learning-based method that extracts subject-specific features by spotting and disentangling interference. Leveraging a pre-trained CLIP image encoder and specialized expert networks for pose and background, SpotDiff isolates subject identity through orthogonality constraints in the feature space. To enable principled training, we introduce SpotDiff10k, a curated dataset with consistent pose and background variations. Experiments demonstrate that SpotDiff achieves more robust subject preservation and controllable editing than prior methods, while attaining competitive performance with only 10k training samples.
【15】On the Optimality of the Median-of-Means Estimator under Adversarial Contamination
标题:对抗性污染下均值中位数(Median-of-Means)估计器的最优性
链接:https://arxiv.org/abs/2510.07867
作者:Xabier de Juan, Santiago Mazuelas
摘要:均值中位数(MoM)是广泛用于机器学习的鲁棒估计器,已知其在样本独立同分布(i.i.d.)的情形下是(极小极大)最优的。在更严峻的情形下,样本会被能够检查和修改数据的对手污染。以前的工作从理论上表明了MoM估计器在某些污染设置中的适用性。然而,除高斯情形外,MoM在对抗性污染下的(极小极大)最优性及其局限性仍然未知。在本文中,我们针对多类分布给出了对抗性污染下MoM误差的上界和下界。特别是,我们证明了MoM在方差有限的分布类中,以及在方差无穷大但绝对$(1+r)$阶矩有限的分布类中是(极小极大)最优的。我们还给出了与所提上界同阶的MoM误差下界,并表明MoM对于轻尾分布是次优的。
摘要:The Median-of-Means (MoM) is a robust estimator widely used in machine learning that is known to be (minimax) optimal in scenarios where samples are i.i.d. In more grave scenarios, samples are contaminated by an adversary that can inspect and modify the data. Previous work has theoretically shown the suitability of the MoM estimator in certain contaminated settings. However, the (minimax) optimality of MoM and its limitations under adversarial contamination remain unknown beyond the Gaussian case. In this paper, we present upper and lower bounds for the error of MoM under adversarial contamination for multiple classes of distributions. In particular, we show that MoM is (minimax) optimal in the class of distributions with finite variance, as well as in the class of distributions with infinite variance and finite absolute $(1+r)$-th moment. We also provide lower bounds for MoM's error that match the order of the presented upper bounds, and show that MoM is sub-optimal for light-tailed distributions.
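For readers unfamiliar with the estimator discussed above, here is a minimal sketch of the textbook Median-of-Means construction: shuffle the samples, split them into blocks, average each block, and take the median of the block means. The function name `median_of_means`, the `seed` parameter, and the choice of block count in the example are illustrative assumptions; the paper's analysis concerns error bounds, not this particular code.

```python
import random
import statistics


def median_of_means(samples, num_blocks, seed=None):
    """Median of the means of `num_blocks` disjoint, nearly equal-sized blocks."""
    if not 1 <= num_blocks <= len(samples):
        raise ValueError("num_blocks must be between 1 and len(samples)")
    data = list(samples)
    # Shuffle so that contaminated points are not grouped by input order.
    random.Random(seed).shuffle(data)
    # Strided slicing yields num_blocks disjoint blocks covering all samples.
    blocks = [data[i::num_blocks] for i in range(num_blocks)]
    return statistics.median(statistics.fmean(block) for block in blocks)


# 95 clean samples at 1.0 plus 5 adversarial outliers at 1000.0.
samples = [1.0] * 95 + [1000.0] * 5
robust = median_of_means(samples, num_blocks=11, seed=0)
naive = statistics.fmean(samples)
```

With 11 blocks, the 5 outliers can spoil at most 5 block means, so at least 6 block means equal 1.0 and the median is exactly 1.0, whereas the plain mean is pulled to about 50.95. Using more blocks than twice the number of contaminated points is what makes the median immune in this example.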
半/弱/无/有监督|不确定性|主动学习(10篇)
【1】Contrastive Self-Supervised Learning at the Edge: An Energy Perspective
标题:边缘侧的对比自监督学习:能耗视角
链接:https://arxiv.org/abs/2510.08374
作者:Fernanda Famá, Roberto Pereira, Charalampos Kalalas, Paolo Dini, Lorena Qendro, Fahim Kawsar, Mohammad Malekzadeh
摘要:虽然对比学习(CL)在自监督表示学习中显示出相当大的前景,但其在资源受限设备上的部署在很大程度上仍未得到充分探索。训练传统CL框架所需的大量计算带来了一系列挑战,特别是在能耗、数据可用性和内存使用方面。我们对四个广泛使用的CL框架进行了评估:SimCLR、MoCo、SimSiam和Barlow Twins。我们专注于这些CL框架在边缘和雾部署方面的实际可行性,并引入了一个系统的基准测试策略,其中包括能耗分析和缩减训练数据的条件。我们的研究结果表明,与人们对其计算成本的印象相反,SimCLR在各种数据规模下表现出最低的能耗。最后,我们还通过评估与CL框架配对的轻量级神经架构来扩展我们的分析。我们的研究旨在深入了解在处理能力有限的边缘/雾环境中部署CL的资源影响,并为其未来的优化开辟了几个研究方向。
摘要:While contrastive learning (CL) shows considerable promise in self-supervised representation learning, its deployment on resource-constrained devices remains largely underexplored. The substantial computational demands required for training conventional CL frameworks pose a set of challenges, particularly in terms of energy consumption, data availability, and memory usage. We conduct an evaluation of four widely used CL frameworks: SimCLR, MoCo, SimSiam, and Barlow Twins. We focus on the practical feasibility of these CL frameworks for edge and fog deployment, and introduce a systematic benchmarking strategy that includes energy profiling and reduced training data conditions. Our findings reveal that SimCLR, contrary to its perceived computational cost, demonstrates the lowest energy consumption across various data regimes. Finally, we also extend our analysis by evaluating lightweight neural architectures when paired with CL frameworks. Our study aims to provide insights into the resource implications of deploying CL in edge/fog environments with limited processing capabilities and opens several research directions for its future optimization.
【2】Unsupervised Multi-Source Federated Domain Adaptation under Domain Diversity through Group-Wise Discrepancy Minimization
标题:通过分组差异最小化实现领域多样性下的无监督多源联邦域自适应
链接:https://arxiv.org/abs/2510.08150
作者:Larissa Reichart, Cem Ata Baykara, Ali Burak Ünal, Mete Akgün, Harlin Lee
摘要:无监督多源域自适应(UMDA)旨在利用来自多个不同源域的标记数据,学习能够泛化到未标记目标域的模型。虽然分布式UMDA方法通过避免共享原始数据来满足隐私约束,但现有方法通常假设源域数量较少,无法有效地扩展。增加异构域的数量通常会使现有方法变得不切实际,导致高计算开销或不稳定的性能。我们提出了GALA,一个可扩展且稳健的联邦UMDA框架,它引入了两个关键组件:(1)一个新的组间差异最小化目标,无需二次复杂度的计算即可高效近似完全成对的域对齐;(2)一个温度控制的、基于质心的加权策略,根据与目标域的对齐程度动态确定源域的优先级。这些组件共同实现了跨大量异构源域的稳定且可并行的训练。为了评估高多样性场景中的性能,我们引入了Digit-18,这是一个新的基准,包括18个数字数据集,具有多种合成和真实世界的域偏移。大量的实验表明,GALA在标准基准上始终取得有竞争力或最先进的结果,并且在其他方法无法收敛的多样化多源设置中显著优于现有方法。
摘要:Unsupervised multi-source domain adaptation (UMDA) aims to learn models that generalize to an unlabeled target domain by leveraging labeled data from multiple, diverse source domains. While distributed UMDA methods address privacy constraints by avoiding raw data sharing, existing approaches typically assume a small number of sources and fail to scale effectively. Increasing the number of heterogeneous domains often makes existing methods impractical, leading to high computational overhead or unstable performance. We propose GALA, a scalable and robust federated UMDA framework that introduces two key components: (1) a novel inter-group discrepancy minimization objective that efficiently approximates full pairwise domain alignment without quadratic computation; and (2) a temperature-controlled, centroid-based weighting strategy that dynamically prioritizes source domains based on alignment with the target. Together, these components enable stable and parallelizable training across large numbers of heterogeneous sources. To evaluate performance in high-diversity scenarios, we introduce Digit-18, a new benchmark comprising 18 digit datasets with varied synthetic and real-world domain shifts. Extensive experiments show that GALA consistently achieves competitive or state-of-the-art results on standard benchmarks and significantly outperforms prior methods in diverse multi-source settings where others fail to converge.
【3】Unsupervised Radio Map Construction in Mixed LoS/NLoS Indoor Environments
标题:混合LoS/NLoS室内环境中的无监督无线电地图构建
链接:https://arxiv.org/abs/2510.08015
作者:Zheng Xing, Junting Chen
摘要:无线电地图对于增强无线通信和定位至关重要。然而,用于构建无线电地图的现有方法通常需要昂贵的校准过程来收集位置标记的信道状态信息(CSI)数据集。本文旨在直接从信道传播序列中恢复数据收集轨迹,从而消除位置校准的需要。其关键思想是采用基于隐马尔可夫模型(HMM)的框架,对信道传播矩阵进行条件建模,同时对轨迹中的位置相关性进行建模。主要的挑战包括对多输入多输出(MIMO)网络中的信道传播与地理位置之间的复杂关系进行建模,以及同时处理视距(LOS)和非视距(NLOS)室内条件。在本文中,我们提出了一个基于HMM的框架,联合刻画条件传播模型和用户轨迹的演变。具体而言,MIMO网络中的信道传播在功率、延迟和角度方面分别建模,对于LOS和NLOS条件具有不同的模型。使用高斯-马尔可夫模型对用户轨迹进行建模。同时优化了信道传播参数、移动性模型和LOS/NLOS分类。使用具有多天线均匀线性阵列(ULA)配置的模拟MIMO正交频分复用(OFDM)网络进行的实验验证表明,该方法在覆盖LOS和NLOS区域的室内环境中实现了0.65米的平均定位精度。此外,所构建的无线电地图与传统的监督方法(例如k-最近邻(KNN)、支持向量机(SVM)和深度神经网络(DNN))相比能够以更小的误差进行定位。
摘要:Radio maps are essential for enhancing wireless communications and localization. However, existing methods for constructing radio maps typically require costly calibration processes to collect location-labeled channel state information (CSI) datasets. This paper aims to recover the data collection trajectory directly from the channel propagation sequence, eliminating the need for location calibration. The key idea is to employ a hidden Markov model (HMM)-based framework to conditionally model the channel propagation matrix, while simultaneously modeling the location correlation in the trajectory. The primary challenges involve modeling the complex relationship between channel propagation in multiple-input multiple-output (MIMO) networks and geographical locations, and addressing both line-of-sight (LOS) and non-line-of-sight (NLOS) indoor conditions. In this paper, we propose an HMM-based framework that jointly characterizes the conditional propagation model and the evolution of the user trajectory. Specifically, the channel propagation in MIMO networks is modeled separately in terms of power, delay, and angle, with distinct models for LOS and NLOS conditions. The user trajectory is modeled using a Gaussian-Markov model. The parameters for channel propagation, the mobility model, and LOS/NLOS classification are optimized simultaneously. Experimental validation using simulated MIMO-Orthogonal Frequency-Division Multiplexing (OFDM) networks with a multi-antenna uniform linear array (ULA) configuration demonstrates that the proposed method achieves an average localization accuracy of 0.65 meters in an indoor environment, covering both LOS and NLOS regions. Moreover, the constructed radio map enables localization with a reduced error compared to conventional supervised methods, such as k-nearest neighbors (KNN), support vector machine (SVM), and deep neural network (DNN).
【4】A Systematic Evaluation of Self-Supervised Learning for Label-Efficient Sleep Staging with Wearable EEG
标题:可穿戴式EEG自监督学习标签有效睡眠分期的系统评估
链接:https://arxiv.org/abs/2510.07960
作者:Emilio Estevan, María Sierra-Torralba, Eduardo López-Larraz, Luis Montesano
备注:12 pages, 4 figures
摘要:可穿戴EEG设备已经成为多导睡眠图(PSG)的一种有前途的替代方案。作为经济实惠且可扩展的解决方案,它们的广泛采用导致收集了大量临床医生无法大规模分析的未标记数据。与此同时,最近深度学习在睡眠评分方面的成功依赖于大型注释数据集。自我监督学习(SSL)提供了一个弥合这一差距的机会,利用未标记的信号来解决标签稀缺问题并减少注释工作。在本文中,我们提出了第一个系统的评估SSL的睡眠分期使用可穿戴脑电图。我们调查了一系列成熟的SSL方法,并在使用Ikon Sleep可穿戴EEG头带获得的两个睡眠数据库上对其进行评估:BOAS,一个包含PSG和可穿戴EEG记录的高质量基准,具有共识标签,以及HOGAR,一个基于家庭的,自我记录和未标记记录的大型集合。定义了三种评估场景来研究标签效率,表示质量和跨数据集泛化。结果表明,SSL在监督基线上持续提高分类性能高达10%,当标记数据稀缺时,收益尤其明显。SSL仅利用5%到10%的标记数据就达到了80%以上的临床级准确率,而监督方法需要两倍的标记。此外,SSL表示证明稳健的人口特征,记录环境和信号质量的变化。我们的研究结果证明了SSL的潜力,可以通过可穿戴EEG实现标签效率的睡眠分期,减少对手动注释的依赖,并推动经济实惠的睡眠监测系统的开发。
摘要:Wearable EEG devices have emerged as a promising alternative to polysomnography (PSG). As affordable and scalable solutions, their widespread adoption results in the collection of massive volumes of unlabeled data that cannot be analyzed by clinicians at scale. Meanwhile, the recent success of deep learning for sleep scoring has relied on large annotated datasets. Self-supervised learning (SSL) offers an opportunity to bridge this gap, leveraging unlabeled signals to address label scarcity and reduce annotation effort. In this paper, we present the first systematic evaluation of SSL for sleep staging using wearable EEG. We investigate a range of well-established SSL methods and evaluate them on two sleep databases acquired with the Ikon Sleep wearable EEG headband: BOAS, a high-quality benchmark containing PSG and wearable EEG recordings with consensus labels, and HOGAR, a large collection of home-based, self-recorded, and unlabeled recordings. Three evaluation scenarios are defined to study label efficiency, representation quality, and cross-dataset generalization. Results show that SSL consistently improves classification performance by up to 10% over supervised baselines, with gains particularly evident when labeled data is scarce. SSL achieves clinical-grade accuracy above 80% leveraging only 5% to 10% of labeled data, while the supervised approach requires twice the labels. Additionally, SSL representations prove robust to variations in population characteristics, recording environments, and signal quality. Our findings demonstrate the potential of SSL to enable label-efficient sleep staging with wearable EEG, reducing reliance on manual annotations and advancing the development of affordable sleep monitoring systems.
【5】Multi-level informed optimization via decomposed Kriging for large design problems under uncertainty
标题:不确定性下大型设计问题的分解克里格法多层信息优化
链接:https://arxiv.org/abs/2510.07904
作者:Enrico Ampellio, Blazhe Gjorgiev, Giovanni Sansavini
备注:34 pages, 18 figures
摘要:工程设计涉及包含许多决策变量和不可控参数的苛刻模型。此外,不可避免的偶然不确定性和认知不确定性可能会产生非常大的影响,并进一步增加复杂性。现有技术采用不确定性量化和设计优化两个步骤,通过鲁棒或随机度量来优化不确定性下的系统。然而,传统的基于场景的、代理辅助的和数学规划方法的可扩展性不足,在大型复杂情况下难以兼顾成本与精度。为此,本文提出了一种多层次方法,以最少的资源准确地优化不确定性下资源密集、高维且复杂的工程问题。我们开发了一个非侵入式、可快速扩展的基于克里金的代理模型,以高效地映射组合的设计/参数域。多个代理模型通过层次分解和正交分解自适应更新,以利用数量更少但最能反映不确定性的数据。所提出的方法通过一个解析测试平台与最先进的方法进行了统计比较,结果表明它同时在速度和精度上提高了数个数量级。
摘要:Engineering design involves demanding models encompassing many decision variables and uncontrollable parameters. In addition, unavoidable aleatoric and epistemic uncertainties can be very impactful and add further complexity. The state-of-the-art adopts two steps, uncertainty quantification and design optimization, to optimize systems under uncertainty by means of robust or stochastic metrics. However, conventional scenario-based, surrogate-assisted, and mathematical programming methods are not sufficiently scalable to be affordable and precise in large and complex cases. Here, a multi-level approach is proposed to accurately optimize resource-intensive, high-dimensional, and complex engineering problems under uncertainty with minimal resources. A non-intrusive, fast-scaling, Kriging-based surrogate is developed to map the combined design/parameter domain efficiently. Multiple surrogates are adaptively updated by hierarchical and orthogonal decomposition to leverage the fewer and most uncertainty-informed data. The proposed method is statistically compared to the state-of-the-art via an analytical testbed and is shown to be concurrently faster and more accurate by orders of magnitude.
【6】Self-Supervised Learning Strategies for a Platform to Test the Toxicity of New Chemicals and Materials
标题:测试新化学品和材料毒性平台的自我监督学习策略
链接:https://arxiv.org/abs/2510.07853
作者:Thomas Lautenschlager, Nils Friederich, Angelo Jovin Yamachui Sitcheu, Katja Nau, Gaëlle Hayot, Thomas Dickmeis, Ralf Mikut
摘要:高通量毒性测试为测试大量化合物提供了一种快速且具有成本效益的方法。这种系统的一个关键组成部分是通过机器学习模型进行自动评估。在本文中,我们解决了这一领域的关键挑战,并展示了通过自我监督学习学习的表示如何有效地识别有毒物质引起的变化。我们提供了一个概念验证,利用公开可用的EmbryoNet数据集,其中包含10个斑马鱼胚胎表型引起的各种化学化合物针对不同的过程在早期胚胎发育。我们的分析表明,使用自监督学习的学习表示适用于有效区分不同化合物的作用模式。最后,我们讨论了在TOXBOX项目的背景下,在物理毒性测试设备中集成机器学习模型。
摘要:High-throughput toxicity testing offers a fast and cost-effective way to test large amounts of compounds. A key component for such systems is the automated evaluation via machine learning models. In this paper, we address critical challenges in this domain and demonstrate how representations learned via self-supervised learning can effectively identify toxicant-induced changes. We provide a proof-of-concept that utilizes the publicly available EmbryoNet dataset, which contains ten zebrafish embryo phenotypes elicited by various chemical compounds targeting different processes in early embryonic development. Our analysis shows that the learned representations using self-supervised learning are suitable for effectively distinguishing between the modes-of-action of different compounds. Finally, we discuss the integration of machine learning models in a physical toxicity testing device in the context of the TOXBOX project.
【7】Automated Machine Learning for Unsupervised Tabular Tasks
标题:无监督表格任务的自动机器学习
链接:https://arxiv.org/abs/2510.07569
作者:Prabhant Singh, Pieter Gijsbers, Elif Ceren Gok Yildirim, Murat Onur Yildirim, Joaquin Vanschoren
备注:Accepted at Machine Learning Journal, 2025
摘要:在这项工作中,我们提出了LOTUS(学习学习无监督场景的最佳传输),这是一种简单而有效的方法,可以为多个无监督机器学习(ML)任务(如离群值检测和聚类)执行模型选择。这项工作背后的直觉是,如果机器学习管道以前在具有类似底层数据分布的数据集上运行良好,那么它将在新数据集中表现良好。我们使用最优传输距离来找到未标记表格数据集之间的这种相似性,并在两个下游无监督任务上使用统一的单一方法推荐机器学习管道:离群值检测和聚类。我们通过针对强基线的实验展示了我们方法的有效性,并表明LOTUS是多个无监督ML任务模型选择的非常有希望的第一步。
摘要:In this work, we present LOTUS (Learning to Learn with Optimal Transport for Unsupervised Scenarios), a simple yet effective method to perform model selection for multiple unsupervised machine learning (ML) tasks such as outlier detection and clustering. Our intuition behind this work is that a machine learning pipeline will perform well in a new dataset if it previously worked well on datasets with a similar underlying data distribution. We use Optimal Transport distances to find this similarity between unlabeled tabular datasets and recommend machine learning pipelines with one unified single method on two downstream unsupervised tasks: outlier detection and clustering. We present the effectiveness of our approach with experiments against strong baselines and show that LOTUS is a very promising first step toward model selection for multiple unsupervised ML tasks.
【8】MoGU: Mixture-of-Gaussians with Uncertainty-based Gating for Time Series Forecasting
标题:MoGU:基于不确定性选通的混合高斯时间序列预测
链接:https://arxiv.org/abs/2510.07459
作者:Yoli Shavit, Jacob Goldberger
摘要:我们介绍了混合高斯与不确定性为基础的门控(MoGU),一种新的混合专家(MoE)框架设计的回归任务,并应用于时间序列预测。与仅提供点估计的传统MoE不同,MoGU将每个专家的输出建模为高斯分布。这允许它直接量化预测(均值)及其固有的不确定性(方差)。MoGU的核心创新是其基于不确定性的门控机制,该机制通过使用每个专家的估计方差来确定其对最终预测的贡献,从而取代了传统的基于输入的门控网络。在不同的时间序列预测基准评估,MoGU始终优于单一专家模型和传统的MoE设置。它还提供了与预测误差直接相关的量化的、信息丰富的不确定性,从而提高了预测的可靠性。我们的代码可从以下网址获得:https://github.com/yolish/moe_unc_tsf
摘要:We introduce Mixture-of-Gaussians with Uncertainty-based Gating (MoGU), a novel Mixture-of-Experts (MoE) framework designed for regression tasks and applied to time series forecasting. Unlike conventional MoEs that provide only point estimates, MoGU models each expert's output as a Gaussian distribution. This allows it to directly quantify both the forecast (the mean) and its inherent uncertainty (variance). MoGU's core innovation is its uncertainty-based gating mechanism, which replaces the traditional input-based gating network by using each expert's estimated variance to determine its contribution to the final prediction. Evaluated across diverse time series forecasting benchmarks, MoGU consistently outperforms single-expert models and traditional MoE setups. It also provides well-quantified, informative uncertainties that directly correlate with prediction errors, enhancing forecast reliability. Our code is available from: https://github.com/yolish/moe_unc_tsf
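One way to picture uncertainty-based gating is inverse-variance weighting of Gaussian experts. The abstract does not state the exact gating function, so the precision-normalized weights below are an assumption, and `uncertainty_gate` is a hypothetical name rather than a function from the linked repository.

```python
def uncertainty_gate(means, variances, eps=1e-8):
    """Combine Gaussian expert outputs, weighting each expert by its precision.

    Assumed gating rule: weight_i is proportional to 1 / variance_i.
    Returns the gate weights plus the mean and variance of the resulting mixture.
    """
    precisions = [1.0 / (v + eps) for v in variances]
    total = sum(precisions)
    weights = [p / total for p in precisions]
    mix_mean = sum(w * m for w, m in zip(weights, means))
    # Mixture variance from the law of total variance: E[var + mean^2] - (E[mean])^2.
    second_moment = sum(w * (v + m * m) for w, m, v in zip(weights, means, variances))
    mix_var = second_moment - mix_mean ** 2
    return weights, mix_mean, mix_var

# Two toy experts: the first is far more confident than the second.
gate_weights, forecast, forecast_var = uncertainty_gate(
    means=[1.0, 3.0], variances=[0.1, 0.9]
)
```

Here the confident expert dominates the forecast, and the mixture variance also reflects disagreement between the expert means.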
【9】DUA-D2C: Dynamic Uncertainty Aware Method for Overfitting Remediation in Deep Learning
标题:DUA-D2C:深度学习中过拟合修复的动态不确定性感知方法
链接:https://arxiv.org/abs/2411.15876
作者:Md. Saiful Bari Siddiqui, Md Mohaiminul Islam, Md. Golam Rabiul Alam
备注:This version (v2) extends our previous work (arXiv:2411.15876v1) on Divide2Conquer (D2C) by introducing Dynamic Uncertainty-Aware Divide2Conquer (DUA-D2C). The manuscript is currently under review at Complex and Intelligent Systems
摘要:过拟合仍然是深度学习中的一个重大挑战,通常由数据离群值、噪声和有限的训练数据引起。为了解决这个问题,以前提出了Divide2Conquer(D2C)方法,该方法将训练数据划分为多个子集,并在每个子集上独立地训练相同的模型。这种策略可以学习更一致的模式,同时最大限度地减少单个离群值和噪声的影响。然而,D2C的标准聚合通常平等地对待所有子集模型,或基于固定的启发式方法(如数据大小)对其加权,可能未充分利用有关其不同泛化能力的信息。在此基础上,我们引入了动态不确定性感知Divide2Conquer(DUA-D2C),这是一种改进聚合过程的先进技术。DUA-D2C根据子集模型在共享验证集上的性能动态加权子集模型的贡献,同时考虑准确性和预测不确定性。这种智能聚合允许中心模型优先从能产生泛化性更强、置信度更高的边缘模型的子集中学习,从而更有效地对抗过拟合。对跨多个领域的基准数据集的实证评估表明,DUA-D2C显著提高了泛化能力。我们的分析包括对决策边界、损失曲线和其他性能指标的评估,突出了DUA-D2C的有效性。这项研究表明,DUA-D2C即使在其他正则化方法的基础上应用,也能提高泛化性能,使其成为现代深度学习中对抗过拟合的有理论依据的有效方法。我们的代码可在https://github.com/Saiful185/DUA-D2C上公开获取。
摘要:Overfitting remains a significant challenge in deep learning, often arising from data outliers, noise, and limited training data. To address this, the Divide2Conquer (D2C) method was previously proposed, which partitions training data into multiple subsets and trains identical models independently on each. This strategy enables learning more consistent patterns while minimizing the influence of individual outliers and noise. However, D2C's standard aggregation typically treats all subset models equally or based on fixed heuristics (like data size), potentially underutilizing information about their varying generalization capabilities. Building upon this foundation, we introduce Dynamic Uncertainty-Aware Divide2Conquer (DUA-D2C), an advanced technique that refines the aggregation process. DUA-D2C dynamically weights the contributions of subset models based on their performance on a shared validation set, considering both accuracy and prediction uncertainty. This intelligent aggregation allows the central model to preferentially learn from subsets yielding more generalizable and confident edge models, thereby more effectively combating overfitting. Empirical evaluations on benchmark datasets spanning multiple domains demonstrate that DUA-D2C significantly improves generalization. Our analysis includes evaluations of decision boundaries, loss curves, and other performance metrics, highlighting the effectiveness of DUA-D2C. This study demonstrates that DUA-D2C improves generalization performance even when applied on top of other regularization methods, establishing it as a theoretically grounded and effective approach to combating overfitting in modern deep learning. Our codes are publicly available at: https://github.com/Saiful185/DUA-D2C.
【10】When Robustness Meets Conservativeness: Conformalized Uncertainty Calibration for Balanced Decision Making
标题:当稳健性遇到保守性时:平衡决策的适形不确定性校准
链接:https://arxiv.org/abs/2510.07750
作者:Wenbin Zhou, Shixiang Zhu
摘要:鲁棒优化通过针对最坏情况进行优化来保护决策不受不确定性的影响,但其有效性取决于预先指定的鲁棒性级别,该级别通常是临时选择的,导致保护不足或过于保守和昂贵的解决方案。最近的方法使用保形预测构建数据驱动的不确定性集有限样本覆盖率的保证,但他们仍然固定覆盖目标的先验和选择鲁棒性水平的指导。我们提出了一个新的框架,提供分布免费,有限样本保证任何家庭的强大的预测,然后优化政策的错误覆盖和遗憾。我们的方法构造有效的估计,跟踪的错误覆盖后悔帕累托边界,使决策者能够可靠地评估和校准的鲁棒性水平,根据他们的成本风险偏好。该框架易于实现,广泛适用于经典的优化配方,并实现了更清晰的有限样本性能比现有的方法。这些结果提供了第一个原则性的数据驱动的方法来指导稳健性选择,并使从业者能够在高风险决策中平衡稳健性和保守性。
摘要:Robust optimization safeguards decisions against uncertainty by optimizing against worst-case scenarios, yet its effectiveness hinges on a prespecified robustness level that is often chosen ad hoc, leading to either insufficient protection or overly conservative and costly solutions. Recent approaches using conformal prediction construct data-driven uncertainty sets with finite-sample coverage guarantees, but they still fix coverage targets a priori and offer little guidance for selecting robustness levels. We propose a new framework that provides distribution-free, finite-sample guarantees on both miscoverage and regret for any family of robust predict-then-optimize policies. Our method constructs valid estimators that trace out the miscoverage-regret Pareto frontier, enabling decision-makers to reliably evaluate and calibrate robustness levels according to their cost-risk preferences. The framework is simple to implement, broadly applicable across classical optimization formulations, and achieves sharper finite-sample performance than existing approaches. These results offer the first principled data-driven methodology for guiding robustness selection and empower practitioners to balance robustness and conservativeness in high-stakes decision-making.
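For background, the conformal uncertainty sets that the abstract refers to are commonly built with split conformal prediction. The sketch below shows only that standard building block, with a fixed miscoverage level `alpha`; it is not the paper's miscoverage-regret estimator, and the calibration scores are made-up values.

```python
import math

def conformal_radius(calibration_scores, alpha):
    """Standard split-conformal quantile.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest calibration score,
    or infinity when the calibration set is too small for the requested level.
    """
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(calibration_scores)[k - 1]

# Absolute residuals |y - prediction| on a held-out calibration set (toy values).
scores = [0.1, 0.4, 0.2, 0.9, 0.3, 0.5, 0.7, 0.6, 0.8]
radius = conformal_radius(scores, alpha=0.2)

# Uncertainty set for a new point prediction of 10.0.
interval = (10.0 - radius, 10.0 + radius)
```

With nine calibration scores and `alpha=0.2`, the rule selects the eighth smallest score, so the interval half-width is 0.8. Sweeping `alpha` changes the radius and hence how conservative a downstream robust decision becomes, which is the trade-off the paper sets out to calibrate.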
迁移|Zero/Few/One-Shot|自适应(10篇)
【1】DYNAMIX: RL-based Adaptive Batch Size Optimization in Distributed Machine Learning Systems
标题:DYNAMIX:分布式机器学习系统中基于RL的自适应批量大小优化
链接:https://arxiv.org/abs/2510.08522
作者:Yuanjun Dai, Keqiang He, An Wang
摘要:分布式机器学习中现有的批量大小选择方法依赖于静态分配或简单的启发式方法,无法适应异构的动态计算环境。我们提出了DYNAMIX,一个强化学习框架,它使用近端策略优化(PPO)将批量大小优化表述为一个序列决策问题。我们的方法采用多维状态表示,包括网络级指标,系统级资源利用率和训练统计效率指标,以实现跨不同计算资源的明智决策。我们的方法消除了显式系统建模的需要,同时与现有的分布式训练框架无缝集成。通过对不同工作负载、硬件配置和网络条件的评估,DYNAMIX将最终模型精度最多提高了6.3%,总训练时间最多减少了46%。我们的可扩展性实验表明,当集群规模增加到32个节点时,DYNAMIX仍保持最佳性能,而策略迁移实验表明,学习到的策略可有效推广到相关的模型架构。
摘要:Existing batch size selection approaches in distributed machine learning rely on static allocation or simplistic heuristics that fail to adapt to heterogeneous, dynamic computing environments. We present DYNAMIX, a reinforcement learning framework that formulates batch size optimization as a sequential decision-making problem using Proximal Policy Optimization (PPO). Our approach employs a multi-dimensional state representation encompassing network-level metrics, system-level resource utilization, and training statistical efficiency indicators to enable informed decision-making across diverse computational resources. Our approach eliminates the need for explicit system modeling while integrating seamlessly with existing distributed training frameworks. Through evaluations across diverse workloads, hardware configurations, and network conditions, DYNAMIX achieves up to 6.3% improvement in the final model accuracy and 46% reduction in the total training time. Our scalability experiments demonstrate that DYNAMIX maintains the best performance as cluster size increases to 32 nodes, while policy transfer experiments show that learned policies generalize effectively across related model architectures.
【2】Dynamic Features Adaptation in Networking: Toward Flexible training and Explainable inference
标题:网络中的动态特征适应:迈向灵活训练和可解释推理
链接:https://arxiv.org/abs/2510.08303
作者:Yannis Belkhiter, Seshu Tirupathi, Giulio Zizzo, Merim Dzaferagic, John D. Kelleher
备注:Accepted at AI4NextG Workshop, NeurIPS 2025
摘要:随着人工智能成为6G网络控制的原生组件,人工智能模型必须适应不断变化的条件,包括引入由多供应商部署、硬件升级和不断变化的服务需求驱动的新功能和测量。为了满足非平稳环境中对灵活学习的日益增长的需求,这篇愿景论文强调自适应随机森林(ARF)是通信网络场景中动态特征自适应的可靠解决方案。我们表明,ARF的迭代训练可以有效地导致稳定的预测,随着时间的推移,随着更多的功能被添加,准确性提高。此外,我们强调了AI驱动网络中可解释性的重要性,提出了漂移感知特征重要性(DAFI)作为一种有效的XAI特征重要性(FI)方法。DAFI使用分布式漂移检测器来指示何时应用计算密集型FI方法,而不是较轻的替代方法。我们在3个不同数据集上的测试表明,我们的方法将运行时间减少了2倍,同时产生了更一致的特征重要性值。ARF和DAFI共同提供了一个有前途的框架,可以构建适应6G网络用例的灵活AI方法。
摘要:As AI becomes a native component of 6G network control, AI models must adapt to continuously changing conditions, including the introduction of new features and measurements driven by multi-vendor deployments, hardware upgrades, and evolving service requirements. To address this growing need for flexible learning in non-stationary environments, this vision paper highlights Adaptive Random Forests (ARFs) as a reliable solution for dynamic feature adaptation in communication network scenarios. We show that iterative training of ARFs can effectively lead to stable predictions, with accuracy improving over time as more features are added. In addition, we highlight the importance of explainability in AI-driven networks, proposing Drift-Aware Feature Importance (DAFI) as an efficient XAI feature importance (FI) method. DAFI uses a distributional drift detector to signal when to apply computationally intensive FI methods instead of lighter alternatives. Our tests on 3 different datasets indicate that our approach reduces runtime by up to 2 times, while producing more consistent feature importance values. Together, ARFs and DAFI provide a promising framework to build flexible AI methods adapted to 6G network use-cases.
【3】GRADE: Personalized Multi-Task Fusion via Group-relative Reinforcement Learning with Adaptive Dirichlet Exploration
标题:GRADE:通过组相对强化学习和自适应Dirichlet探索的个性化多任务融合
链接:https://arxiv.org/abs/2510.07919
作者:Tingfeng Hong, Pingye Ren, Xinlong Xiao, Chao Wang, Chenyi Lei, Wenwu Ou, Han Li
摘要:个性化多目标排名系统的总体架构。其包括:(1)用于初始特征处理和候选生成的特征中心和预排名模型;(2)预测各种用户反馈信号的多任务学习(MTL)模型;(3)多任务融合(MTF)模块;(我们提出的GRADE框架),学习个性化的权重($w_1,\dots,w_n$);然后应用这些权重计算最终分数并进行排序,以通过混合排名模型生成混合排名,最终将结果交付给用户。
摘要:Overall architecture of the personalized multi-objective ranking system. It comprises: (1) a Feature Center and Prerank Model for initial feature processing and candidate generation; (2) a Multi-Task Learning (MTL) model predicting various user feedback signals; (3) a Multi-Task Fusion (MTF) module (our proposed GRADE framework) that learns personalized weights ($w_1, \dots, w_n$); these weights are then applied to calculate final scores and sorted to generate a blended ranking by the Blended Ranking Model, which ultimately delivers results to users.
【4】Adaptive Optimizable Gaussian Process Regression Linear Least Squares Regression Filtering Method for SEM Images
标题:扫描电子显微镜图像的自适应可优化高斯过程回归线性最小平方回归过滤方法
链接:https://arxiv.org/abs/2510.07895
作者:D. Chee Yong Ong, I. Bukhori, K. S. Sim, K. Beng Gan
备注:"Adaptive Optimizable Gaussian Process Regression Linear Least Squares Regression Filtering Method for SEM Images," in IEEE Access, vol. 13, pp. 93574-93592, 2025, doi: https://doi.org/10.1109/ACCESS.2025.3573389
摘要:扫描电子显微镜(SEM)图像经常受到噪声污染,这会降低图像质量并影响进一步分析。本文提出了一种完整的方法来估计它们的信噪比(SNR)和噪声方差(NV),并使用NV引导的维纳滤波器来增强图像质量。本研究的主要思想是使用良好的SNR估计技术并注入机器学习模型来估计SEM图像的NV,然后引导维纳滤波器去除噪声,提供更鲁棒和更准确的SEM图像滤波流水线。首先,我们研究了五种不同的信噪比估计技术,即最近邻(NN)方法,一阶线性插值(FOL)方法,最近邻与一阶线性插值(NN+FOL)方法,非线性最小二乘回归(NLLSR)方法,和线性最小二乘回归(LSR)方法。实验结果表明,LSR方法的性能优于其他方法.然后,支持向量机(SVM)和高斯过程回归(GPR)进行了测试,通过配对它与LSR。在该测试中,Optimizable GPR模型显示出最高的准确性,并且它是NV估计的最有效解决方案。结合这些结果,提出了自适应可优化高斯过程回归线性最小二乘回归(AO-GPRLLSR)滤波流水线。AO-GPRLLSR方法产生一个估计的噪声方差,作为输入到NV引导的维纳滤波器,以提高SEM图像的质量。所提出的方法在估计SEM图像的SNR和NV方面取得了显着的成功,并且在滤波过程之后导致较低的均方误差(MSE)。
摘要:Scanning Electron Microscopy (SEM) images often suffer from noise contamination, which degrades image quality and affects further analysis. This research presents a complete approach to estimate their Signal-to-Noise Ratio (SNR) and noise variance (NV), and to enhance image quality using an NV-guided Wiener filter. The main idea of this study is to use a good SNR estimation technique and infuse a machine learning model to estimate the NV of the SEM image, which then guides the Wiener filter to remove the noise, providing a more robust and accurate SEM image filtering pipeline. First, we investigate five different SNR estimation techniques, namely the Nearest Neighbourhood (NN) method, the First-Order Linear Interpolation (FOL) method, the Nearest Neighbourhood with First-Order Linear Interpolation (NN+FOL) method, the Non-Linear Least Squares Regression (NLLSR) method, and the Linear Least Squares Regression (LSR) method. The LSR method is shown to perform better than the rest. Then, Support Vector Machines (SVM) and Gaussian Process Regression (GPR) are tested by pairing them with LSR. In this test, the Optimizable GPR model shows the highest accuracy and stands as the most effective solution for NV estimation. Combining these results leads to the proposed Adaptive Optimizable Gaussian Process Regression Linear Least Squares Regression (AO-GPRLLSR) filtering pipeline. The AO-GPRLLSR method generates an estimated noise variance, which serves as input to the NV-guided Wiener filter for improving the quality of SEM images. The proposed method is shown to achieve notable success in estimating the SNR and NV of SEM images and leads to lower Mean Squared Error (MSE) after the filtering process.
【5】Adaptive Execution Scheduler for DataDios SmartDiff
标题:DataDios SmartDiff的自适应执行调度器
链接:https://arxiv.org/abs/2510.07811
作者:Aryan Poduri
备注:4 pages, 1 figure
摘要:我们提出了一个用于单一差分引擎(SmartDiff)的自适应调度器,该引擎有两种执行模式:(i)内存内线程和(ii)基于Dask的并行。调度器在固定的CPU和内存预算内不断调整批处理大小和工作进程/线程数,以最大限度地减少p95延迟。一个轻量级的预检分析器估计字节/行和I/O速率;一个在线的成本/内存模型剪除不安全的动作;一个带保护的爬山策略在背压和掉队者缓解的配合下倾向于更低的延迟。后端选择由保守的工作集估计进行门控,以便在安全时选择内存内执行,否则使用Dask。在合成和公共表格基准测试中,与经过调优的预热启发式方法相比,调度器将p95延迟降低了23%到28%(与固定网格基线相比降低了35%到40%),同时将峰值内存降低了16%到22%(与固定基线相比降低了25%到32%),且没有出现OOM,吞吐量相当。
摘要:We present an adaptive scheduler for a single differencing engine (SmartDiff) with two execution modes: (i) in-memory threads and (ii) Dask-based parallelism. The scheduler continuously tunes batch size and worker/thread count within fixed CPU and memory budgets to minimize p95 latency. A lightweight preflight profiler estimates bytes/row and I/O rate; an online cost/memory model prunes unsafe actions; and a guarded hill-climb policy favors lower latency with backpressure and straggler mitigation. Backend selection is gated by a conservative working-set estimate so that in-memory execution is chosen when safe, otherwise Dask is used. Across synthetic and public tabular benchmarks, the scheduler reduces p95 latency by 23 to 28 percent versus a tuned warm-up heuristic (and by 35 to 40 percent versus fixed grid baselines), while lowering peak memory by 16 to 22 percent (25 to 32 percent vs. fixed) with zero OOMs and comparable throughput.
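The guarded hill-climb idea can be sketched as a local search over (batch size, worker count) that discards any move a memory model predicts to be unsafe. Everything below is a hypothetical illustration: the function names, the toy latency and memory models, and the neighbor moves are assumptions, and backpressure and straggler handling are not modeled.

```python
def guarded_hill_climb(latency, memory, start, mem_budget, max_workers, steps=20):
    """Greedy local search that only considers configurations within budget."""
    batch, workers = start
    best = latency(batch, workers)
    for _ in range(steps):
        neighbors = {
            (batch * 2, workers),
            (max(1, batch // 2), workers),
            (batch, workers + 1),
            (batch, max(1, workers - 1)),
        }
        neighbors.discard((batch, workers))
        # Guard: prune actions the memory model predicts to exceed the budget.
        safe = [
            (b, w) for (b, w) in sorted(neighbors)
            if w <= max_workers and memory(b, w) <= mem_budget
        ]
        if not safe:
            break
        candidate = min(safe, key=lambda bw: latency(*bw))
        candidate_latency = latency(*candidate)
        if candidate_latency >= best:
            break  # no safe neighbor improves latency
        (batch, workers), best = candidate, candidate_latency
    return (batch, workers), best

def toy_latency(batch, workers):
    # Assumed cost model: parallel work plus per-batch and per-worker overheads.
    return 50000.0 / (batch * workers) + 0.05 * batch + 2.0 * workers

def toy_memory(batch, workers):
    # Assumed memory model in MB.
    return 0.5 * batch * workers

config, p95_estimate = guarded_hill_climb(
    toy_latency, toy_memory, start=(32, 1), mem_budget=1024.0, max_workers=8
)
```

In a real scheduler the `latency` callable would be a measured p95 over recent batches rather than a closed-form model, which is why the search stops as soon as no safe neighbor improves on the current configuration.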
【6】Instance Relation Learning Network with Label Knowledge Propagation for Few-shot Multi-label Intent Detection
标题:具有标签知识传播的实例关系学习网络用于Few-Shot多标签意图检测
链接:https://arxiv.org/abs/2510.07776
作者:Shiman Zhao, Shangyuan Li, Wei Chen, Tengjiao Wang, Jiahui Yao, Jiabin Zheng, Kam Fai Wong
摘要:Few-Shot多标签意图检测(MID)是对话系统的关键技术,旨在检测低资源对话域中的多个意图。以前的研究集中在两个阶段的管道。他们首先学习具有多个标签的话语的表示,然后使用基于阈值的策略来识别多标签结果。然而,这些方法依赖于表示分类,忽略了实例关系,导致错误传播。针对上述问题,提出了一种端到端的Few-Shot MID多标签联合学习方法,通过构建带有标签知识传播的实例关系学习网络来消除错误传播。具体地说,我们学习与类信息的实例之间的相互作用关系,传播标签知识之间的一些标记(支持集)和未标记(查询集)的实例。通过标签知识传播,实例之间的关系强度直接指示两个话语是否属于多标签预测的同一意图。此外,一个双重的关系增强损失,以优化支持和查询级的关系强度,以提高性能。实验表明,在1次拍摄场景中,我们的表现平均优于强基线9.54%的AUC和11.19%的Macro-F1。
摘要:Few-shot Multi-label Intent Detection (MID) is crucial for dialogue systems, aiming to detect multiple intents of utterances in low-resource dialogue domains. Previous studies focus on a two-stage pipeline. They first learn representations of utterances with multiple labels and then use a threshold-based strategy to identify multi-label results. However, these methods rely on representation classification and ignore instance relations, leading to error propagation. To solve the above issues, we propose a multi-label joint learning method for few-shot MID in an end-to-end manner, which constructs an instance relation learning network with label knowledge propagation to eliminate error propagation. Concretely, we learn the interaction relations between instances with class information to propagate label knowledge between a few labeled (support set) and unlabeled (query set) instances. With label knowledge propagation, the relation strength between instances directly indicates whether two utterances belong to the same intent for multi-label prediction. Besides, a dual relation-enhanced loss is developed to optimize support- and query-level relation strength to improve performance. Experiments show that we outperform strong baselines by an average of 9.54% AUC and 11.19% Macro-F1 in 1-shot scenarios.
【7】FedLAM: Low-latency Wireless Federated Learning via Layer-wise Adaptive Modulation
标题:FedLAM:通过分层自适应调制的低延迟无线联邦学习
链接:https://arxiv.org/abs/2510.07766
作者:Linping Qu, Shenghui Song, Chi-Ying Tsui
摘要:在无线联合学习(FL)中,客户端需要通过带宽受限的信道传输高维深度神经网络(DNN)参数,这会导致通信延迟问题。在本文中,我们提出了一种分层自适应调制方案,以节省通信延迟。与为所有DNN层分配相同调制级别的现有作品不同,我们考虑了层的重要性,这提供了更多的自由度来节省延迟。所提出的方案可以自动决定不同DNN层的最佳调制水平。实验结果表明,与现有方案相比,该方案可节省高达73.9%的通信延迟。
摘要:In wireless federated learning (FL), the clients need to transmit the high-dimensional deep neural network (DNN) parameters through bandwidth-limited channels, which causes the communication latency issue. In this paper, we propose a layer-wise adaptive modulation scheme to save the communication latency. Unlike existing works which assign the same modulation level for all DNN layers, we consider the layers' importance which provides more freedom to save the latency. The proposed scheme can automatically decide the optimal modulation levels for different DNN layers. Experimental results show that the proposed scheme can save up to 73.9% of communication latency compared with the existing schemes.
【8】Continual Learning for Adaptive AI Systems
标题:自适应人工智能系统的持续学习
链接:https://arxiv.org/abs/2510.07648
作者:Md Hasibul Amin, Tamzid Tanvi Alam
备注:5 pages, 2 figures, 2 tables
摘要:持续学习神经网络学习多个顺序任务而不丢失先前获得的知识的能力仍然是开发真正自适应人工智能的重大障碍。深度学习模型在各种应用中取得了显著的成果,但过拟合仍然是一个常见的问题。正则化技术可以通过向模型的参数添加约束来帮助防止过拟合。为了防止灾难性的遗忘,在本文中,我们引入了一种新的正则化技术,该技术基于损失函数中的聚类间分离(ICS),该技术惩罚模型产生远离由先前任务的数据形成的聚类的质心的输出。我们还进行了超参数调整,以找到所提出的正则化项的最佳权重。这确保了神经网络内部表示中任务之间的更清晰分离,减少了重叠并减轻了遗忘。使用标准的5任务拆分CIFAR-10基准和ResNet-18架构,我们证明了ICS在保持初始任务的强大性能方面的有效性。然而,我们的研究结果也强调了长期知识保留的局限性,特别是当任务数量增加时。这强调了持续学习固有的复杂性和权衡,并指出了进一步研究的途径。
摘要:Continual learning, the ability of a neural network to learn multiple sequential tasks without losing previously acquired knowledge, remains a significant obstacle to developing truly adaptive artificial intelligence. Deep learning models have achieved remarkable results in various applications, but overfitting remains a common issue. Regularization techniques can help prevent overfitting by adding constraints to the model's parameters. To prevent catastrophic forgetting, in this paper we introduce a novel regularization technique based on inter-cluster separation (ICS) in the loss function, which penalizes the model for producing outputs that are far away from the centroids of the clusters formed by the data from previous tasks. We also performed hyperparameter tuning to find the optimal weighting of the proposed regularization term. This ensures clearer separation between tasks in the neural network's internal representation, reducing overlap and mitigating forgetting. Using the standard 5-task Split CIFAR-10 benchmark and a ResNet-18 architecture, we demonstrate ICS's effectiveness in maintaining strong performance on initial tasks. However, our results also highlight limitations in long-term knowledge retention, particularly when the number of tasks increases. This underscores the complexity and trade-offs inherent in continual learning and points toward avenues for further research.
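A minimal sketch of a centroid-distance penalty of the kind described above, assuming the penalty is the mean squared distance from each output to its nearest stored centroid and that it is added to the task loss with a tunable weight. The exact ICS formulation is not given in the abstract, and `ics_penalty`, `regularized_loss`, and `lam` are hypothetical names.

```python
def ics_penalty(outputs, centroids):
    """Mean squared distance from each output vector to its nearest stored centroid."""
    total = 0.0
    for out in outputs:
        total += min(
            sum((o - c) ** 2 for o, c in zip(out, centroid))
            for centroid in centroids
        )
    return total / len(outputs)

def regularized_loss(task_loss, outputs, centroids, lam=0.1):
    """Task loss plus the weighted penalty; lam is the weight found by tuning."""
    return task_loss + lam * ics_penalty(outputs, centroids)

# Two toy network outputs and two centroids kept from previous tasks.
outputs = [[0.0, 0.0], [2.0, 2.0]]
centroids = [[0.0, 0.0], [3.0, 3.0]]
penalty = ics_penalty(outputs, centroids)
loss = regularized_loss(task_loss=0.5, outputs=outputs, centroids=centroids)
```

The first output coincides with a centroid and contributes nothing, while the second lies at squared distance 2 from its nearest centroid, giving a penalty of 1.0 and a total loss of 0.6.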
【9】Stick-Breaking Mixture Normalizing Flows with Component-Wise Tail Adaptation for Variational Inference
标题:用于变分推理的具有逐分量尾部自适应的折棍混合归一化流
链接:https://arxiv.org/abs/2510.07965
作者:Seungsu Han, Juyoung Hwang, Won Chang
摘要:用高斯基规范化流提供了一种计算效率高的方法来近似贝叶斯推理中的后验分布,但它们通常难以捕获具有多模态和重尾的复杂后验。我们提出了一个坚持打破混合基与组件明智的尾部适应(StiCTAF)的后验近似。该方法首先学习一个灵活的混合基,通过加权平均的ELBO分量,以减轻反向KL发散的模式搜索偏差。然后,它估计非归一化密度的局部尾部指数,并最终使用共享的骨干与由估计的指数校准的组件特定的尾部变换相结合来细化每个混合组件。这种设计能够实现精确的模式覆盖和各向异性尾部建模,同时保持精确的密度评估和稳定的优化。合成后验的实验表明,与基准模型相比,改进的尾部恢复和更好的多模式覆盖。我们还提出了一个真实的数据分析,说明我们的方法后验推理的实际好处。
摘要:Normalizing flows with a Gaussian base provide a computationally efficient way to approximate posterior distributions in Bayesian inference, but they often struggle to capture complex posteriors with multimodality and heavy tails. We propose a stick-breaking mixture base with component-wise tail adaptation (StiCTAF) for posterior approximation. The method first learns a flexible mixture base to mitigate the mode-seeking bias of reverse KL divergence through a weighted average of component-wise ELBOs. It then estimates local tail indices of unnormalized densities and finally refines each mixture component using a shared backbone combined with component-specific tail transforms calibrated by the estimated indices. This design enables accurate mode coverage and anisotropic tail modeling while retaining exact density evaluation and stable optimization. Experiments on synthetic posteriors demonstrate improved tail recovery and better coverage of multiple modes compared to benchmark models. We also present a real-data analysis illustrating the practical benefits of our approach for posterior inference.
【10】On the Optimality of Tracking Fisher Information in Adaptive Testing with Stochastic Binary Responses
标题:随机二元响应自适应测试中跟踪Fisher信息的最优性
链接:https://arxiv.org/abs/2510.07862
作者:Sanghwa Kim (KAIST), Dohyun Ahn (The Chinese University of Hong Kong), Seungki Min (Seoul National University)
摘要:我们研究通过主动提出不同难度的问题,从序贯二元响应中估计连续能力参数的问题,这一设定自然地出现在自适应测试和在线偏好学习中。我们的目标是使用尽可能少的查询来证明估计值在期望的误差范围内。我们提出了一个简单的算法,自适应地选择问题以最大化Fisher信息,并使用矩方法更新估计,同时配合一种新的检验统计量来决定估计何时足够准确。我们证明了这种Fisher跟踪策略在固定置信度和固定预算两种设定下都达到了最优性能,这两种设定在最佳臂识别文献中被普遍研究。我们的分析克服了固定预算设定中的一个关键技术挑战,即处理不断变化的估计与查询分布之间的依赖关系,方法是利用模型中的结构对称性并将大偏差工具与Ville不等式相结合。我们的研究结果为简单而高效的自适应测试程序提供了严格的理论支持。
摘要:We study the problem of estimating a continuous ability parameter from sequential binary responses by actively asking questions with varying difficulties, a setting that arises naturally in adaptive testing and online preference learning. Our goal is to certify that the estimate lies within a desired margin of error, using as few queries as possible. We propose a simple algorithm that adaptively selects questions to maximize Fisher information and updates the estimate using a method-of-moments approach, paired with a novel test statistic to decide when the estimate is accurate enough. We prove that this Fisher-tracking strategy achieves optimal performance in both fixed-confidence and fixed-budget regimes, which are commonly investigated in the best-arm identification literature. Our analysis overcomes a key technical challenge in the fixed-budget setting -- handling the dependence between the evolving estimate and the query distribution -- by exploiting a structural symmetry in the model and combining large deviation tools with Ville's inequality. Our results provide rigorous theoretical support for simple and efficient adaptive testing procedures.
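To make the idea of selecting questions by Fisher information concrete, here is a small sketch. It assumes a logistic (Rasch-type) response model, which the abstract does not state, and a finite pool of difficulties. All names are hypothetical, and the paper's stopping rule and method-of-moments update are not reproduced.

```python
import math


def success_prob(ability, difficulty):
    """Assumed logistic model: P(correct) = sigmoid(ability - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def fisher_information(ability, difficulty):
    """Fisher information about ability from one binary response: p * (1 - p)."""
    p = success_prob(ability, difficulty)
    return p * (1.0 - p)


def pick_question(ability_estimate, difficulties):
    """Choose the difficulty that is most informative at the current estimate."""
    return max(difficulties, key=lambda b: fisher_information(ability_estimate, b))


pool = [-2.0, -1.0, 0.0, 1.0, 2.0]  # hypothetical question pool
chosen = pick_question(0.8, pool)
print(chosen)  # 1.0, the difficulty closest to the current estimate
```

Under this assumed model the information peaks at 0.25 when difficulty equals ability, so tracking Fisher information amounts to asking questions near the current estimate.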
强化学习(8篇)
【1】Convergence Theorems for Entropy-Regularized and Distributional Reinforcement Learning
标题:熵正则化和分布强化学习的收敛定理
链接:https://arxiv.org/abs/2510.08526
作者:Yash Jhaveri, Harley Wiltzer, Patrick Shafto, Marc G. Bellemare, David Meger
备注:Accepted to NeurIPS 2025. First two authors contributed equally
摘要:在寻求最优策略的过程中,强化学习(RL)方法通常会忽略学习策略的属性,除了它们的预期回报。因此,即使取得成功,也很难说明哪些政策将被借鉴,以及这些政策将发挥什么作用。在这项工作中,我们提出了一个理论框架的政策优化,保证收敛到一个特定的最优策略,通过消失熵正则化和温度解耦的开局。我们的方法实现了一个可解释的,多样性保持的最优策略的正则化温度为零,并确保收敛的政策派生对象-值函数和回报分布。例如,在我们的方法的一个特定实例中,实现的策略统一地对所有最优动作进行采样。利用我们的温度解耦策略,我们提出了一种算法,估计,任意精度,与其可解释的,多样性保持的最优策略的回报分布。
摘要:In the pursuit of finding an optimal policy, reinforcement learning (RL) methods generally ignore the properties of learned policies apart from their expected return. Thus, even when successful, it is difficult to characterize which policies will be learned and what they will do. In this work, we present a theoretical framework for policy optimization that guarantees convergence to a particular optimal policy, via vanishing entropy regularization and a temperature decoupling gambit. Our approach realizes an interpretable, diversity-preserving optimal policy as the regularization temperature vanishes and ensures the convergence of policy-derived objects, namely value functions and return distributions. In a particular instance of our method, for example, the realized policy samples all optimal actions uniformly. Leveraging our temperature decoupling gambit, we present an algorithm that estimates, to arbitrary accuracy, the return distribution associated with its interpretable, diversity-preserving optimal policy.
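The claim that a vanishing temperature can yield uniform sampling over optimal actions is easiest to see in a one-step (bandit) problem. The sketch below covers only that simplified case with made-up action values. It does not implement the paper's temperature decoupling gambit, which addresses the multi-step setting.

```python
import math


def softmax_policy(q_values, temperature):
    """Boltzmann policy pi(a) proportional to exp(Q(a) / temperature)."""
    top = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


q = [1.0, 2.0, 2.0, 0.5]  # hypothetical action values with two tied optima
for temperature in (1.0, 0.1, 0.01):
    probs = softmax_policy(q, temperature)
    print(temperature, [round(p, 4) for p in probs])
```

As the temperature shrinks, the probability mass concentrates on the two tied maximizers and splits evenly between them.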
【2】ClauseLens: Clause-Grounded, CVaR-Constrained Reinforcement Learning for Trustworthy Reinsurance Pricing
标题:ClauseLens:基于Clause的、CVaR约束的强化学习,用于可信的再保险定价
链接:https://arxiv.org/abs/2510.08429
作者:Stella C. Dong, James R. Finlay
备注:Accepted for publication at the 6th ACM International Conference on AI in Finance (ICAIF 2025), Singapore. Author-accepted version (October 2025). 10 pages, 5 figures
摘要:再保险合同定价必须满足严格的监管标准,但目前的报价做法仍然不透明,难以审计。我们介绍了ClauseLens,这是一个基于子句的强化学习框架,可以生成透明、符合法规且具有风险意识的条约引用。 ClauseLens将报价任务建模为风险感知约束马尔可夫决策过程(RA-CMDP)。从法律和承保语料库中检索法规和政策条款,嵌入到代理人的观察中,并用于约束可行的行动和生成基于条款的自然语言理由。 在根据行业数据校准的多代理条约模拟器中进行评估,ClauseLens将偿付能力违规减少了51%,将尾部风险性能提高了27.9%(CVaR_0.10),并在基于子句的解释中实现了88.2%的准确率,检索精度为87.4%,召回率为91.1%。 这些研究结果表明,将法律背景嵌入决策和解释途径中,可以产生可解释、可审计和符合法规的报价行为,符合偿付能力II、NAIC RBC和欧盟AI法案。
摘要:Reinsurance treaty pricing must satisfy stringent regulatory standards, yet current quoting practices remain opaque and difficult to audit. We introduce ClauseLens, a clause-grounded reinforcement learning framework that produces transparent, regulation-compliant, and risk-aware treaty quotes. ClauseLens models the quoting task as a Risk-Aware Constrained Markov Decision Process (RA-CMDP). Statutory and policy clauses are retrieved from legal and underwriting corpora, embedded into the agent's observations, and used both to constrain feasible actions and to generate clause-grounded natural language justifications. Evaluated in a multi-agent treaty simulator calibrated to industry data, ClauseLens reduces solvency violations by 51%, improves tail-risk performance by 27.9% (CVaR_0.10), and achieves 88.2% accuracy in clause-grounded explanations with retrieval precision of 87.4% and recall of 91.1%. These findings demonstrate that embedding legal context into both decision and explanation pathways yields interpretable, auditable, and regulation-aligned quoting behavior consistent with Solvency II, NAIC RBC, and the EU AI Act.
【3】DeepEN: Personalized Enteral Nutrition for Critically Ill Patients using Deep Reinforcement Learning
标题:DeepEN:使用深度强化学习为重症患者提供个性化肠内营养
链接:https://arxiv.org/abs/2510.08350
作者:Daniel Jason Tan, Jiayang Chen, Dilruk Perera, Kay Choong See, Mengling Feng
摘要:我们介绍DeepEN,这是一种深度强化学习(RL)框架,用于重症患者的个性化肠内营养(EN)。通过对MIMIC-IV数据库中超过11,000名ICU患者进行离线训练,DeepEN每4小时生成针对每个患者不断变化的生理状态的热量、蛋白质和液体摄入量建议。该模型集成了一个精心策划的、基于临床知识的状态空间和一个自定义奖励函数,该函数可以平衡短期生理和营养相关目标与长期生存结果。使用具有保守Q学习正则化的决斗双深度Q网络,DeepEN学习符合临床实际的策略,这些策略与高价值的临床医生行动保持一致,同时抑制不安全的偏差。在各种定性和定量指标中,DeepEN优于临床医生衍生和基于指南的策略,实现了估计死亡率降低3.7 $\pm$ 0.17个百分点(18.8% vs 22.5%)和关键营养生物标志物的改善。这些发现强调了安全的、数据驱动的EN治疗个性化的潜力,以改善传统指南或基于启发式的方法之外的结果。
摘要:We introduce DeepEN, a deep reinforcement learning (RL) framework for personalized enteral nutrition (EN) in critically ill patients. Trained offline on over 11,000 ICU patients from the MIMIC-IV database, DeepEN generates 4-hourly recommendations for caloric, protein, and fluid intake tailored to each patient's evolving physiology. The model integrates a curated, clinically informed state space with a custom reward function that balances short-term physiological and nutrition-related goals with long-term survival outcomes. Using a dueling double deep Q-network with conservative Q-learning regularization, DeepEN learns clinically realistic policies that align with high-value clinician actions while discouraging unsafe deviations. Across various qualitative and quantitative metrics, DeepEN outperforms clinician-derived and guideline-based policies, achieving a 3.7 $\pm$ 0.17 percentage-point reduction in estimated mortality (18.8% vs 22.5%) and improvements in key nutritional biomarkers. These findings highlight the potential of safe, data-driven personalization of EN therapy to improve outcomes beyond traditional guideline- or heuristic-based approaches.
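The three ingredients named in the abstract (dueling heads, a double-DQN target, and a conservative Q-learning penalty) have well-known textbook forms, sketched below on plain Python lists. This is an assumption-laden illustration: the function names and numbers are hypothetical, and DeepEN's actual network, reward, and loss weighting are not given in the abstract.

```python
import math


def dueling_q_values(state_value, advantages):
    """Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a')."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + a - mean_adv for a in advantages]


def double_dqn_target(reward, gamma, online_next_q, target_next_q, done):
    """Pick the next action with the online net, evaluate it with the target net."""
    if done:
        return reward
    best = max(range(len(online_next_q)), key=online_next_q.__getitem__)
    return reward + gamma * target_next_q[best]


def cql_penalty(q_values, logged_action):
    """Conservative term: logsumexp of Q(s, .) minus Q of the logged action."""
    top = max(q_values)
    log_sum_exp = top + math.log(sum(math.exp(q - top) for q in q_values))
    return log_sum_exp - q_values[logged_action]


q_values = dueling_q_values(1.0, [0.0, 1.0, 2.0])
target = double_dqn_target(1.0, 0.9, [0.2, 0.7], [0.5, 0.1], done=False)
penalty = cql_penalty(q_values, logged_action=2)
print(q_values, target, penalty)
```

The penalty is never negative and grows when actions absent from the logged data receive high values, which is how this style of regularizer discourages departures from observed clinician behaviour.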
【4】Reinforcement Learning from Probabilistic Forecasts for Safe Decision-Making via Conditional Value-at-Risk Planning
标题:通过有条件风险价值规划从概率预测中进行强化学习以实现安全决策
链接:https://arxiv.org/abs/2510.08226
作者:Michal Koren, Or Peretz, Tai Dinh, Philip S. Yu
摘要:在动荡、高风险的环境中,顺序决策需要的不仅仅是最大化预期回报,还需要有原则的不确定性管理。本文介绍了不确定性感知马尔可夫决策过程(UAMDP),一个统一的框架,耦合贝叶斯预测,后验抽样强化学习,并在条件风险值(CVaR)约束下的规划。在一个闭环中,智能体更新其对潜在动态的信念,通过汤普森采样对可能的未来进行采样,并根据预设的风险容忍度优化策略。我们建立遗憾界收敛到贝叶斯最优基准下的标准规律性条件。我们评估UAMDP在两个领域-高频股票交易和零售库存控制-都标志着结构的不确定性和经济波动。相对于强大的深度学习基线,UAMDP提高了长期预测的准确性(RMSE降低了25%,sMAPE降低了32%),这些收益转化为经济表现:交易夏普比率从1.54上升到1.74,而最大跌幅大致减半。这些结果表明,整合校准的概率建模,勘探对齐后的不确定性,风险意识控制产生一个强大的,可推广的方法,更安全,更有利可图的顺序决策。
摘要:Sequential decisions in volatile, high-stakes settings require more than maximizing expected return; they require principled uncertainty management. This paper presents the Uncertainty-Aware Markov Decision Process (UAMDP), a unified framework that couples Bayesian forecasting, posterior-sampling reinforcement learning, and planning under a conditional value-at-risk (CVaR) constraint. In a closed loop, the agent updates its beliefs over latent dynamics, samples plausible futures via Thompson sampling, and optimizes policies subject to preset risk tolerances. We establish regret bounds that converge to the Bayes-optimal benchmark under standard regularity conditions. We evaluate UAMDP in two domains, high-frequency equity trading and retail inventory control, both marked by structural uncertainty and economic volatility. Relative to strong deep learning baselines, UAMDP improves long-horizon forecasting accuracy (RMSE decreases by up to 25% and sMAPE by 32%), and these gains translate into economic performance: the trading Sharpe ratio rises from 1.54 to 1.74 while maximum drawdown is roughly halved. These results show that integrating calibrated probabilistic modeling, exploration aligned with posterior uncertainty, and risk-aware control yields a robust, generalizable approach to safer and more profitable sequential decision-making.
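A CVaR constraint of the kind described above can be checked on sampled outcomes with a few lines of code. The sketch below uses one common empirical convention (the mean of the worst alpha fraction of losses). The function names, sample losses, and risk budget are hypothetical, and the paper's planner is not reproduced.

```python
import math


def cvar_of_losses(losses, alpha):
    """Empirical CVaR: mean of the worst ceil(alpha * n) losses (larger is worse)."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    ordered = sorted(losses, reverse=True)
    k = max(1, math.ceil(alpha * len(ordered)))
    tail = ordered[:k]
    return sum(tail) / len(tail)


def satisfies_risk_budget(losses, alpha, budget):
    """True if the tail risk of the sampled losses stays within the budget."""
    return cvar_of_losses(losses, alpha) <= budget


# Hypothetical losses from ten sampled futures under one candidate policy.
losses = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
tail_risk = cvar_of_losses(losses, 0.2)
print(tail_risk)  # 9.5, the mean of the two largest losses
print(satisfies_risk_budget(losses, 0.2, budget=9.0))  # False
```

A planner in this style would discard or penalize candidate policies whose sampled tail risk exceeds the budget, rather than ranking them on mean return alone.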
【5】Expressive Value Learning for Scalable Offline Reinforcement Learning
标题:可扩展离线强化学习的表达价值学习
链接:https://arxiv.org/abs/2510.08218
作者:Nicolas Espinosa-Dice, Kiante Brantley, Wen Sun
备注:24 pages, 5 figures
摘要:强化学习(RL)是学习做出决策序列的强大范例。然而,RL尚未在机器人技术中得到充分利用,主要是由于其缺乏可扩展性。离线RL通过在大型、多样化的数据集上培训代理人,避免在线RL昂贵的现实世界交互,提供了一条有前途的途径。将离线RL扩展到越来越复杂的数据集需要表达性的生成模型,例如扩散和流匹配。然而,现有的方法通常依赖于通过时间的反向传播(BPTT),这在计算上是禁止的,或者策略蒸馏,这引入了复合误差并限制了对更大基础策略的可扩展性。在本文中,我们考虑的问题是如何开发一个可扩展的离线RL方法,而不依赖于蒸馏或通过时间的反向传播。我们介绍了用于离线强化学习的表达值学习(EVOR):一种可扩展的离线RL方法,它集成了表达策略和表达值函数。EVOR在训练期间通过流匹配学习最优的正则化Q函数。在推理时,EVOR通过对表达值函数进行拒绝采样来执行推理时策略提取,从而实现高效优化、正则化和计算可扩展搜索,而无需重新训练。从经验上讲,我们表明EVOR在各种离线RL任务上的表现优于基线,证明了将表达性价值学习集成到离线RL中的好处。
摘要:Reinforcement learning (RL) is a powerful paradigm for learning to make sequences of decisions. However, RL has yet to be fully leveraged in robotics, principally due to its lack of scalability. Offline RL offers a promising avenue by training agents on large, diverse datasets, avoiding the costly real-world interactions of online RL. Scaling offline RL to increasingly complex datasets requires expressive generative models such as diffusion and flow matching. However, existing methods typically depend on either backpropagation through time (BPTT), which is computationally prohibitive, or policy distillation, which introduces compounding errors and limits scalability to larger base policies. In this paper, we consider the question of how to develop a scalable offline RL approach without relying on distillation or backpropagation through time. We introduce Expressive Value Learning for Offline Reinforcement Learning (EVOR): a scalable offline RL approach that integrates both expressive policies and expressive value functions. EVOR learns an optimal, regularized Q-function via flow matching during training. At inference-time, EVOR performs inference-time policy extraction via rejection sampling against the expressive value function, enabling efficient optimization, regularization, and compute-scalable search without retraining. Empirically, we show that EVOR outperforms baselines on a diverse set of offline RL tasks, demonstrating the benefit of integrating expressive value learning into offline RL.
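One simple way to picture inference-time policy extraction against a value function is best-of-N selection: draw several candidate actions from a base policy and keep the one the Q-function scores highest. The sketch below shows that swapped-in simplification with toy stand-ins for the policy and Q-function. EVOR's actual rejection-sampling rule may differ, and every name here is hypothetical.

```python
import random


def extract_action(state, sample_action, q_value, num_candidates, rng):
    """Sample candidates from the base policy and keep the highest-valued one."""
    candidates = [sample_action(state, rng) for _ in range(num_candidates)]
    return max(candidates, key=lambda a: q_value(state, a))


def toy_policy(state, rng):
    """Toy base policy: a uniform action in [-1, 1], ignoring the state."""
    return rng.uniform(-1.0, 1.0)


def toy_q(state, action):
    """Toy value function that prefers actions close to the state."""
    return -(action - state) ** 2


rng = random.Random(0)
action = extract_action(0.3, toy_policy, toy_q, num_candidates=64, rng=rng)
print(action)
```

Raising `num_candidates` spends more inference compute on search without retraining either component, which is the trade-off the abstract refers to as compute-scalable search.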
【6】Climate Surrogates for Scalable Multi-Agent Reinforcement Learning: A Case Study with CICERO-SCM
标题:面向可扩展多智能体强化学习的气候代理模型:CICERO-SCM案例研究
链接:https://arxiv.org/abs/2510.07971
作者:Oskar Bohn Lassen, Serio Angelo Maria Agriesti, Filipe Rodrigues, Francisco Camara Pereira
摘要:气候政策研究需要能够捕捉多种温室气体对全球温度综合影响的模型,但这些模型计算成本高,难以嵌入强化学习。我们提出了一个多智能体强化学习(MARL)框架,该框架将高保真、高效的气候代理模型直接集成到环境循环中,使区域智能体能够在多气体动力学下学习气候政策。作为概念验证,我们引入了一个在$20{,}000$条多气体排放路径上预训练的循环神经网络架构,作为气候模型CICERO-SCM的代理。代理模型达到了接近模拟器的精度,全球平均温度RMSE $\approx 0.0004\,\mathrm{K}$,单步推理速度提升约$1000\times$。当在气候政策MARL设置中取代原始模拟器时,它将端到端训练加速$>\!100\times$。我们表明,代理模型和模拟器收敛到相同的最优策略,并提出了一种方法,用于在使用模拟器不可行的情况下评估这一性质。我们的工作能够在不牺牲策略保真度的前提下绕过核心计算瓶颈,从而支持在具有多气体动力学和高保真气候响应的不同气候政策机制下开展大规模多智能体实验。
摘要:Climate policy studies require models that capture the combined effects of multiple greenhouse gases on global temperature, but these models are computationally expensive and difficult to embed in reinforcement learning. We present a multi-agent reinforcement learning (MARL) framework that integrates a high-fidelity, highly efficient climate surrogate directly in the environment loop, enabling regional agents to learn climate policies under multi-gas dynamics. As a proof of concept, we introduce a recurrent neural network architecture pretrained on $20{,}000$ multi-gas emission pathways as a surrogate for the climate model CICERO-SCM. The surrogate model attains near-simulator accuracy with global-mean temperature RMSE $\approx 0.0004\,\mathrm{K}$ and approximately $1000\times$ faster one-step inference. When substituted for the original simulator in a climate-policy MARL setting, it accelerates end-to-end training by $>\!100\times$. We show that the surrogate and simulator converge to the same optimal policies and propose a methodology to assess this property in cases where using the simulator is intractable. Our work makes it possible to bypass the core computational bottleneck without sacrificing policy fidelity, enabling large-scale multi-agent experiments across alternative climate-policy regimes with multi-gas dynamics and high-fidelity climate response.
【7】LiveThinking: Enabling Real-Time Efficient Reasoning for AI-Powered Livestreaming via Reinforcement Learning
标题:LiveThinking:通过强化学习为人工智能驱动的直播实现实时高效推理
链接:https://arxiv.org/abs/2510.07685
作者:Yuhan Sun, Zhiwei Huang, Wanqing Cui, Shaopan Xiong, Yazhi Guo, Meiguang Jin, Junfeng Ma
备注:12 pages, 8 figures
摘要:在人工智能驱动的电子商务直播中,数字化身需要实时响应来推动参与,这是一项高延迟大型推理模型(LRM)不适合的任务。我们引入LiveThinking,一个实用的两阶段优化框架来弥合这一差距。首先,我们通过使用拒绝采样微调(RFT)将670B教师LRM蒸馏成轻量级30B专家混合(MoE)模型(3B激活参数)来解决计算成本。这减少了部署开销,但保留了教师的冗长推理,导致延迟。为了解决这个问题,我们的第二阶段采用了强化学习和组相对策略优化(GRPO)来压缩模型的推理路径,并在多目标奖励函数的指导下平衡正确性、有用性和简洁性。LiveThinking将计算成本降低了30倍,实现了亚秒级延迟。在淘宝直播的实际应用中,它将响应正确性提高了3.3%,有用性提高了21.8%。经过数十万观众的测试,我们的系统使商品交易总额(GMV)获得了统计上显著的增长,证明了其在增强实时互动环境中的用户体验和商业表现方面的有效性。
摘要:In AI-powered e-commerce livestreaming, digital avatars require real-time responses to drive engagement, a task for which high-latency Large Reasoning Models (LRMs) are ill-suited. We introduce LiveThinking, a practical two-stage optimization framework to bridge this gap. First, we address computational cost by distilling a 670B teacher LRM into a lightweight 30B Mixture-of-Experts (MoE) model (3B active) using Rejection Sampling Fine-Tuning (RFT). This reduces deployment overhead but preserves the teacher's verbose reasoning, causing latency. To solve this, our second stage employs reinforcement learning with Group Relative Policy Optimization (GRPO) to compress the model's reasoning path, guided by a multi-objective reward function balancing correctness, helpfulness, and brevity. LiveThinking achieves a 30-fold reduction in computational cost, enabling sub-second latency. In real-world application on Taobao Live, it improved response correctness by 3.3% and helpfulness by 21.8%. Tested by hundreds of thousands of viewers, our system led to a statistically significant increase in Gross Merchandise Volume (GMV), demonstrating its effectiveness in enhancing user experience and commercial performance in live, interactive settings.
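The second stage described above combines a multi-objective reward with group-relative advantages. The sketch below shows the standard GRPO normalization (reward minus group mean, divided by group standard deviation) together with a hypothetical weighted reward. The weights, token budget, and brevity term are assumptions for illustration and are not taken from the paper.

```python
import statistics


def combined_reward(correctness, helpfulness, num_tokens,
                    token_budget=200, weights=(0.5, 0.3, 0.2)):
    """Hypothetical scalar reward mixing correctness, helpfulness, and brevity."""
    brevity = max(0.0, 1.0 - num_tokens / token_budget)
    w_correct, w_helpful, w_brief = weights
    return w_correct * correctness + w_helpful * helpfulness + w_brief * brevity


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Three sampled responses to one prompt: short and correct, long and correct, wrong.
rewards = [
    combined_reward(1.0, 0.8, num_tokens=50),
    combined_reward(1.0, 0.8, num_tokens=180),
    combined_reward(0.0, 0.2, num_tokens=40),
]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Because advantages are relative within the group, a correct but verbose response is pushed down relative to an equally correct short one, which is the pressure that shortens the reasoning path.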
【8】Reinforcement Learning-based Task Offloading in the Internet of Wearable Things
标题:可穿戴物联网中基于强化学习的任务卸载
链接:https://arxiv.org/abs/2510.07487
作者:Waleed Bin Qaim, Aleksandr Ometov, Claudia Campolo, Antonella Molinaro, Elena Simona Lohan, Jari Nurmi
备注:16 pages, 12 figures, Under review in the IEEE Internet of Things Journal
摘要:多年来,研究和工业部门为将可穿戴设备改进为可穿戴物联网(IoWT)范式做出了重大贡献。然而,可穿戴设备仍然面临着一些挑战。许多问题源于可穿戴设备上有限的电池电量和不足的计算资源。另一方面,随着智能可穿戴设备的普及,新的计算密集型和延迟关键型应用程序的开发不断增加。在这种情况下,任务卸载允许可穿戴设备利用附近边缘设备上的可用资源来增强整体用户体验。本文提出了一种基于强化学习(RL)的IoWT任务卸载框架。我们制定的任务卸载过程中考虑能源消耗和任务完成时间之间的权衡。此外,我们将任务卸载问题建模为马尔可夫决策过程(MDP),并利用Q学习技术使可穿戴设备能够在没有先验知识的情况下做出最佳任务卸载决策。我们评估所提出的框架的性能,通过广泛的模拟各种应用程序和系统配置进行的NS-3网络模拟器。我们还展示了Q学习算法的主要系统参数的变化如何影响平均任务完成时间,平均能耗和任务卸载百分比方面的整体性能。
摘要:Over the years, significant contributions have been made by the research and industrial sectors to improve wearable devices towards the Internet of Wearable Things (IoWT) paradigm. However, wearables are still facing several challenges. Many stem from the limited battery power and insufficient computation resources available on wearable devices. On the other hand, with the popularity of smart wearables, there is a consistent increase in the development of new computationally intensive and latency-critical applications. In such a context, task offloading allows wearables to leverage the resources available on nearby edge devices to enhance the overall user experience. This paper proposes a framework for Reinforcement Learning (RL)-based task offloading in the IoWT. We formulate the task offloading process considering the tradeoff between energy consumption and task accomplishment time. Moreover, we model the task offloading problem as a Markov Decision Process (MDP) and utilize the Q-learning technique to enable the wearable device to make optimal task offloading decisions without prior knowledge. We evaluate the performance of the proposed framework through extensive simulations for various applications and system configurations conducted in the ns-3 network simulator. We also show how varying the main system parameters of the Q-learning algorithm affects the overall performance in terms of average task accomplishment time, average energy consumption, and percentage of tasks offloaded.
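The MDP and Q-learning formulation described above can be sketched with a tabular update and a reward that trades off energy against completion time. Everything concrete below (the two actions, the single state, the cost numbers, the weights) is a made-up toy for illustration and does not come from the paper or its ns-3 setup.

```python
import random

ACTIONS = ("local", "offload")  # hypothetical action set


def reward_from_cost(energy_j, time_s, energy_weight=0.5):
    """Negative weighted cost: less energy and less time give a higher reward."""
    return -(energy_weight * energy_j + (1.0 - energy_weight) * time_s)


def choose_action(q_table, state, epsilon, rng):
    """Epsilon-greedy selection over the available actions."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_table.get((state, a), 0.0))


def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step toward reward + gamma * max over next actions."""
    best_next = max(q_table.get((next_state, a), 0.0) for a in ACTIONS)
    old = q_table.get((state, action), 0.0)
    q_table[(state, action)] = old + alpha * (reward + gamma * best_next - old)


# Toy single-state environment: (energy in J, time in s) per action, made up.
COSTS = {"local": (2.0, 1.0), "offload": (0.5, 0.8)}
rng = random.Random(0)
q_table = {}
state = "task_ready"
for _ in range(2000):
    action = choose_action(q_table, state, epsilon=0.2, rng=rng)
    energy_j, time_s = COSTS[action]
    q_update(q_table, state, action, reward_from_cost(energy_j, time_s), state)
print(q_table)
```

Changing `energy_weight`, `alpha`, `gamma`, or `epsilon` is the kind of parameter variation whose effect on completion time, energy use, and offloading rate the abstract says the authors study.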
符号|符号学习(2篇)
【1】Symbolic-Diffusion: Deep Learning Based Symbolic Regression with D3PM Discrete Token Diffusion
标题:符号扩散:基于深度学习的符号回归与D3PM离散令牌扩散
链接:https://arxiv.org/abs/2510.07570
作者:Ryan T. Tymkow, Benjamin D. Schnapp, Mojtaba Valipour, Ali Ghodshi
备注:9 pages, 3 figures
摘要:符号回归是指找到一个封闭形式的数学表达式来拟合一组数据点的任务。基于遗传编程的技术是用于解决这个问题的最常见的算法,但最近,基于神经网络的方法已经得到普及。大多数用于符号回归的领先的基于神经网络的模型利用基于变换的自回归模型来生成以编码输入点为条件的方程。然而,自回归生成仅限于从左到右生成令牌,并且未来生成的令牌仅以先前生成的令牌为条件。出于同时生成所有令牌以产生改进的封闭形式方程的愿望,我们提出了符号扩散,一种基于D3PM的离散状态空间扩散模型,该模型使用离散令牌扩散同时生成方程的所有令牌。使用为SymbolicGPT开发的双变量数据集,我们将我们基于扩散的生成方法与基于SymbolicGPT的自回归模型进行了比较,使用等效的编码器和Transformer架构。我们证明了我们使用基于扩散的符号回归生成的新方法可以提供与使用类似底层架构的模型中的自回归生成相当的性能,并且在某些指标上可以提高性能,从而为基于神经网络的符号回归提供了新的研究机会。
摘要:Symbolic regression refers to the task of finding a closed-form mathematical expression to fit a set of data points. Genetic-programming-based techniques are the most common algorithms used to tackle this problem, but recently, neural-network-based approaches have gained popularity. Most of the leading neural-network-based models used for symbolic regression utilize transformer-based autoregressive models to generate an equation conditioned on encoded input points. However, autoregressive generation is limited to generating tokens left-to-right, and future generated tokens are conditioned only on previously generated tokens. Motivated by the desire to generate all tokens simultaneously to produce improved closed-form equations, we propose Symbolic Diffusion, a D3PM-based discrete state-space diffusion model that generates all tokens of the equation simultaneously using discrete token diffusion. Using the bivariate dataset developed for SymbolicGPT, we compared our diffusion-based generation approach to an autoregressive model based on SymbolicGPT, using equivalent encoder and transformer architectures. We demonstrate that our novel approach of using diffusion-based generation for symbolic regression can offer comparable and, by some metrics, improved performance over autoregressive generation in models using similar underlying architectures, opening new research opportunities in neural-network-based symbolic regression.
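To illustrate what discrete token diffusion does to an equation, the sketch below applies one forward noising step with a uniform transition: each token is independently replaced by a uniformly drawn vocabulary token with probability beta. D3PM admits several transition types, and the abstract does not say which one Symbolic Diffusion uses, so the uniform choice, the vocabulary, and the function name are all assumptions.

```python
import random


def corrupt_tokens(tokens, vocab, beta, rng):
    """One forward step: with probability beta, resample a token uniformly."""
    return [rng.choice(vocab) if rng.random() < beta else tok for tok in tokens]


vocab = ["x1", "x2", "+", "*", "sin", "(", ")", "c"]  # hypothetical vocabulary
equation = ["sin", "(", "x1", ")", "+", "c", "*", "x2"]
rng = random.Random(0)
noisy = corrupt_tokens(equation, vocab, beta=0.3, rng=rng)
print(noisy)
```

A denoising model trained to invert such steps predicts all positions in parallel, in contrast with the left-to-right generation of an autoregressive decoder.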
【2】Iterated Agent for Symbolic Regression
标题:符号回归的迭代代理
链接:https://arxiv.org/abs/2510.08317
作者:Zhuo-Yang Song, Zeyu Cai, Shutao Zhang, Jiashen Wei, Jichen Pan, Shi Qiu, Qing-Hong Cao, Tie-Jiun Hou, Xiaohui Liu, Ming-xing Luo, Hua Xing Zhu
备注:45 pages, 22 figures, 8 tables
摘要:符号回归(SR)是从数据中自动发现数学表达式,是科学探究的基石。然而,它经常受到搜索空间的组合爆炸和过拟合趋势的阻碍。流行的方法,植根于遗传编程,探索这个空间的语法,往往产生过于复杂,无法解释的模型。本文介绍了IdeaSearchFitter,一个框架,采用大型语言模型(LLM)作为语义算子内的进化搜索。通过生成由自然语言原理指导的候选表达式,我们的方法将发现偏向于不仅准确而且概念上连贯和可解释的模型。我们展示了IdeaSearchFitter在各种挑战中的功效:它在Feynman符号回归数据库(FSReD)上实现了具有竞争力的噪声鲁棒性能,优于几个强大的基线;发现机械对齐的模型,在现实世界的数据上具有良好的准确性-复杂性权衡;并在前沿高能物理应用中为Parton分布函数推导出紧凑的物理激励参数化。IdeaSearchFitter是我们更广泛的迭代代理框架IdeaSearch中的一个专用模块,该框架可在https://www.ideasearch.cn/上公开获得。
摘要:Symbolic regression (SR), the automated discovery of mathematical expressions from data, is a cornerstone of scientific inquiry. However, it is often hindered by the combinatorial explosion of the search space and a tendency to overfit. Popular methods, rooted in genetic programming, explore this space syntactically, often yielding overly complex, uninterpretable models. This paper introduces IdeaSearchFitter, a framework that employs Large Language Models (LLMs) as semantic operators within an evolutionary search. By generating candidate expressions guided by natural-language rationales, our method biases discovery towards models that are not only accurate but also conceptually coherent and interpretable. We demonstrate IdeaSearchFitter's efficacy across diverse challenges: it achieves competitive, noise-robust performance on the Feynman Symbolic Regression Database (FSReD), outperforming several strong baselines; discovers mechanistically aligned models with good accuracy-complexity trade-offs on real-world data; and derives compact, physically-motivated parametrizations for Parton Distribution Functions in a frontier high-energy physics application. IdeaSearchFitter is a specialized module within our broader iterated agent framework, IdeaSearch, which is publicly available at https://www.ideasearch.cn/.
医学相关(6篇)
【1】Random Window Augmentations for Deep Learning Robustness in CT and Liver Tumor Segmentation
标题:CT和肝脏肿瘤分割中深度学习鲁棒性的随机窗口增强
链接:https://arxiv.org/abs/2510.08116
作者:Eirik A. Østmo, Kristoffer K. Wickstrøm, Keyur Radiya, Michael C. Kampffmeyer, Karl Øyvind Mikalsen, Robert Jenssen
备注:10 pages, 9 figures. This work has been submitted to the IEEE for possible publication
摘要:对比增强计算机断层扫描(CT)是重要的诊断和治疗计划的各种医疗条件。基于深度学习(DL)的分割模型可以实现自动医学图像分析,用于检测和描绘CT图像中的肿瘤,从而减少临床医生的工作量。在有限的数据域(如放射学)中实现泛化能力需要使用图像增强来训练现代DL模型。然而,天真地将为自然图像开发的增强方法应用于CT扫描通常忽视CT模态的性质,其中强度测量Hounsfield单位(HU)并且具有重要的物理意义。本文对使用这种强度增强CT成像提出了挑战,并表明它们可能导致伪影和泛化能力差。为了减轻这一点,我们提出了一种CT特定的增强技术,称为随机窗口,利用现有的HU分布的强度在CT图像。随机窗口增强了对比度增强的鲁棒性,并显著提高了模型在对比度或时序较差的挑战性图像上的性能。我们在多个数据集上对我们的方法进行消融和分析,并与最先进的替代方案进行比较和超越,同时专注于肝脏肿瘤分割的挑战。
摘要:Contrast-enhanced Computed Tomography (CT) is important for diagnosis and treatment planning for various medical conditions. Deep learning (DL) based segmentation models may enable automated medical image analysis for detecting and delineating tumors in CT images, thereby reducing clinicians' workload. Achieving generalization capabilities in limited data domains, such as radiology, requires modern DL models to be trained with image augmentation. However, naively applying augmentation methods developed for natural images to CT scans often disregards the nature of the CT modality, where the intensities measure Hounsfield Units (HU) and have important physical meaning. This paper challenges the use of such intensity augmentations for CT imaging and shows that they may lead to artifacts and poor generalization. To mitigate this, we propose a CT-specific augmentation technique, called Random windowing, that exploits the available HU distribution of intensities in CT images. Random windowing encourages robustness to contrast-enhancement and significantly increases model performance on challenging images with poor contrast or timing. We perform ablations and analysis of our method on multiple datasets, and compare to, and outperform, state-of-the-art alternatives, while focusing on the challenge of liver tumor segmentation.
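CT windowing clips Hounsfield Units to a range defined by a window level and width and rescales the result. A randomized version draws the level and width per sample. The sketch below shows that general idea on a flat list of HU values. The sampling ranges and function names are hypothetical, and the paper's exact sampling scheme is not given in the abstract.

```python
import random


def apply_window(hu_values, level, width):
    """Clip HU values to [level - width/2, level + width/2] and scale to [0, 1]."""
    low = level - width / 2.0
    high = level + width / 2.0
    return [(min(max(v, low), high) - low) / (high - low) for v in hu_values]


def random_window(hu_values, rng,
                  level_range=(20.0, 120.0), width_range=(150.0, 400.0)):
    """Draw a window level and width at random, then apply the window."""
    level = rng.uniform(*level_range)
    width = rng.uniform(*width_range)
    return apply_window(hu_values, level, width)


pixels_hu = [-1000.0, 50.0, 150.0, 1000.0]  # air, soft tissue, enhanced tissue, bone
fixed = apply_window(pixels_hu, level=50.0, width=200.0)
print(fixed)  # [0.0, 0.5, 1.0, 1.0]
augmented = random_window(pixels_hu, random.Random(0))
print(augmented)
```

Unlike brightness or contrast jitter applied to already-normalized pixels, this operates on the raw HU scale, so the augmented image still corresponds to a physically meaningful viewing window.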
【2】MMM: Quantum-Chemical Molecular Representation Learning for Combinatorial Drug Recommendation
标题:MMM:组合药物推荐的量子化学分子表示学习
链接:https://arxiv.org/abs/2510.07910
作者:Chongmyung Kwon, Yujin Kim, Seoeun Park, Yunji Lee, Charmgil Hong
备注:Medical Image Computing and Computer-Assisted Intervention (MICCAI) Predictive Intelligence in Medicine Workshop (MICCAI PRIME) 2025; 13 pages
摘要:药物推荐是基于机器学习的临床决策支持系统中的一项重要任务。然而,联合处方药物之间的药物相互作用(DDI)风险仍然是一个重大挑战。以前的研究使用图神经网络(GNNs)来表示药物结构。无论如何,它们的简化离散形式不能完全捕获分子结合亲和力和反应性。因此,我们提出了多模式DDI预测与分子电子定位功能(ELF)地图(MMM),一种新的框架,将三维(3D)量子化学信息集成到药物表征学习。它使用ELF生成3D电子密度图。为了捕获治疗相关性和相互作用风险,MMM将编码全局电子特性的ELF衍生特征与模拟局部子结构相互作用的二分图编码器相结合。这种设计使得能够学习药物分子的互补特征。我们在MIMIC-III数据集(250种药物,442个子结构)中评估了MMM,并将其与几个基线模型进行了比较。特别是,与基于GNN的SafeDrug模型的比较表明F1评分(p = 0.0387)、Jaccard(p = 0.0112)和DDI率(p = 0.0386)在统计学上显著改善。这些结果证明了基于ELF的3D表示在提高预测准确性和支持临床实践中更安全的组合药物处方方面的潜力。
摘要:Drug recommendation is an essential task in machine learning-based clinical decision support systems. However, the risk of drug-drug interactions (DDI) between co-prescribed medications remains a significant challenge. Previous studies have used graph neural networks (GNNs) to represent drug structures. However, their simplified discrete forms cannot fully capture the molecular binding affinity and reactivity. Therefore, we propose Multimodal DDI Prediction with Molecular Electron Localization Function (ELF) Maps (MMM), a novel framework that integrates three-dimensional (3D) quantum-chemical information into drug representation learning. It generates 3D electron density maps using the ELF. To capture both therapeutic relevance and interaction risks, MMM combines ELF-derived features that encode global electronic properties with a bipartite graph encoder that models local substructure interactions. This design enables learning complementary characteristics of drug molecules. We evaluate MMM on the MIMIC-III dataset (250 drugs, 442 substructures), comparing it with several baseline models. In particular, a comparison with the GNN-based SafeDrug model demonstrates statistically significant improvements in the F1-score (p = 0.0387), Jaccard (p = 0.0112), and the DDI rate (p = 0.0386). These results demonstrate the potential of ELF-based 3D representations to enhance prediction accuracy and support safer combinatorial drug prescribing in clinical practice.
【3】Signal-to-Noise Ratio in Scanning Electron Microscopy: A Comprehensive Review
标题:扫描电子显微镜中的信噪比:综合评论
链接:https://arxiv.org/abs/2510.07886
作者:K. S. Sim, I. Bukhori, D. C. Y. Ong, K. B. Gan
备注:in IEEE Access, vol. 13, pp. 154395-154421, 2025, doi: 10.1109/ACCESS.2025.3603013
摘要:扫描电子显微镜(SEM)由于其高空间分辨率和焦深而在纳米技术、材料科学和生物成像中至关重要。信噪比(SNR)是SEM中的一个重要参数,因为它直接影响图像的质量和可解释性。SEM广泛应用于各种科学学科,但其实用性可能会受到噪声的影响,从而降低图像清晰度。这篇评论探讨了SEM成像过程的多个方面,从SEM的主要操作,SEM中的噪声源,SNR测量和估计的方法,到影响SNR测量的各个方面和提高SNR的方法,从硬件和软件的角度来看。我们回顾了传统和新兴技术,重点是它们的应用,优势和局限性。本文旨在为研究人员和从业人员提供对SEM中SNR优化的全面理解,并鼓励该领域的进一步研究。
摘要:Scanning Electron Microscopy (SEM) is critical in nanotechnology, materials science, and biological imaging due to its high spatial resolution and depth of focus. Signal-to-noise ratio (SNR) is an essential parameter in SEM because it directly impacts the quality and interpretability of the images. SEM is widely used in various scientific disciplines, but its utility can be compromised by noise, which degrades image clarity. This review explores multiple aspects of the SEM imaging process, from the operating principles of SEM, sources of noise, and methods for SNR measurement and estimation, to factors that affect SNR measurement and approaches to enhancing SNR from both hardware and software standpoints. We review traditional and emerging techniques, focusing on their applications, advantages, and limitations. The paper aims to provide a comprehensive understanding of SNR optimization in SEM for researchers and practitioners and to encourage further research in the field.
【4】Property Classification of Vacation Rental Properties during Covid-19
标题:Covid-19期间度假租赁物业的物业分类
链接:https://arxiv.org/abs/2510.07639
作者:Favour Yahdii Aghaebe, Dustin Foley, Eric Atwell, Stephen Clark
备注:GISRUK 2024 Poster
摘要:本研究提倡采用聚类技术对新冠疫情期间活跃的度假租赁物业进行分类,以识别固有模式和行为。该数据集是ESRC资助的消费者数据研究中心(CDRC)和AirDNA之间的合作,包括超过一百万个属性和主机的数据。利用K-means和K-medoids聚类技术,我们确定同质群体及其共同特征。我们的研究结果增强了对度假租赁评估复杂性的理解,并可能用于创建有针对性的,集群特定的政策。
摘要:This study advocates for employing clustering techniques to classify vacation rental properties active during the Covid pandemic to identify inherent patterns and behaviours. The dataset, a collaboration between the ESRC-funded Consumer Data Research Centre (CDRC) and AirDNA, encompasses data for over a million properties and hosts. Utilising K-means and K-medoids clustering techniques, we identify homogeneous groups and their common characteristics. Our findings enhance comprehension of the intricacies of vacation rental evaluations and could potentially be utilised in the creation of targeted, cluster-specific policies.
【5】Linguistic Patterns in Pandemic-Related Content: A Comparative Analysis of COVID-19, Constraint, and Monkeypox Datasets
标题:流行病相关内容的语言模式:COVID-19、约束和猴痘数据集的比较分析
链接:https://arxiv.org/abs/2510.07579
作者:Mkululi Sikosana, Sean Maudsley-Barton, Oluwaseun Ajao
备注:16 pages
摘要:本研究运用计算语言学的方法对流行病相关的网络话语进行分析,探讨语言是如何区分健康错误信息和事实交流的。基于三个语料库:COVID-19虚假叙述(n = 7588),一般COVID-19内容(n = 10700)和Monkeypox相关帖子(n = 5787),我们发现可读性,修辞标记和说服性语言使用存在显著差异。与其他数据集相比,COVID-19错误信息的可读性得分明显较低,包含的与恐惧相关或有说服力的术语的频率是其他数据集的两倍多。它还显示出最少使用感叹号,与猴痘内容的更情绪化风格形成对比。这些模式表明,错误信息采用了一种故意复杂的修辞风格,其中嵌入了情感线索,这种组合可能会增强其感知的可信度。我们的研究结果通过突出可能有助于检测工作的语言指标,为越来越多的数字健康错误信息工作做出了贡献。他们还告知公共卫生信息战略和网络媒体环境中的危机传播的理论模型。与此同时,该研究承认存在局限性,包括依赖传统的可读性指标,使用故意狭窄的说服性词汇,以及依赖静态聚合分析。因此,未来的研究应该结合纵向设计,更广泛的情感词汇,和平台敏感的方法,以加强鲁棒性。
摘要:This study conducts a computational linguistic analysis of pandemic-related online discourse to examine how language distinguishes health misinformation from factual communication. Drawing on three corpora: COVID-19 false narratives (n = 7588), general COVID-19 content (n = 10700), and Monkeypox-related posts (n = 5787), we identify significant differences in readability, rhetorical markers, and persuasive language use. COVID-19 misinformation exhibited markedly lower readability scores and contained over twice the frequency of fear-related or persuasive terms compared to the other datasets. It also showed minimal use of exclamation marks, contrasting with the more emotive style of Monkeypox content. These patterns suggest that misinformation employs a deliberately complex rhetorical style embedded with emotional cues, a combination that may enhance its perceived credibility. Our findings contribute to the growing body of work on digital health misinformation by highlighting linguistic indicators that may aid detection efforts. They also inform public health messaging strategies and theoretical models of crisis communication in networked media environments. At the same time, the study acknowledges limitations, including reliance on traditional readability indices, use of a deliberately narrow persuasive lexicon, and reliance on static aggregate analysis. Future research should therefore incorporate longitudinal designs, broader emotion lexicons, and platform-sensitive approaches to strengthen robustness.
【6】MultiFair: Multimodal Balanced Fairness-Aware Medical Classification with Dual-Level Gradient Modulation
标题:MultiFair:具有双级梯度调制的多模式平衡公平意识医疗分类
链接:https://arxiv.org/abs/2510.07328
作者:Md Zubair, Hao Zheng, Nussdorf Jonathan, Grayson W. Armstrong, Lucy Q. Shen, Gabriela Wilson, Yu Tian, Xingquan Zhu, Min Shi
备注:10 Pages
摘要:Medical decision systems increasingly rely on data from multiple sources to ensure reliable and unbiased diagnosis. However, existing multimodal learning models fail to achieve this goal because they often ignore two critical challenges. First, various data modalities may learn unevenly, thereby converging to a model biased towards certain modalities. Second, the model may emphasize learning on certain demographic groups causing unfair performances. The two aspects can influence each other, as different data modalities may favor respective groups during optimization, leading to both imbalanced and unfair multimodal learning. This paper proposes a novel approach called MultiFair for multimodal medical classification, which addresses these challenges with a dual-level gradient modulation process. MultiFair dynamically modulates training gradients regarding the optimization direction and magnitude at both data modality and group levels. We conduct extensive experiments on two multimodal medical datasets with different demographic groups. The results show that MultiFair outperforms state-of-the-art multimodal learning and fairness learning methods.
蒸馏|知识提取(4篇)
【1】Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency
标题:通过分数正则化连续时间一致性实现大规模扩散蒸馏
链接:https://arxiv.org/abs/2510.08431
作者:Kaiwen Zheng, Yuji Wang, Qianli Ma, Huayu Chen, Jintao Zhang, Yogesh Balaji, Jianfei Chen, Ming-Yu Liu, Jun Zhu, Qinsheng Zhang
摘要:This work represents the first effort to scale up continuous-time consistency distillation to general application-level image and video diffusion models. Although continuous-time consistency model (sCM) is theoretically principled and empirically powerful for accelerating academic-scale diffusion, its applicability to large-scale text-to-image and video tasks remains unclear due to infrastructure challenges in Jacobian-vector product (JVP) computation and the limitations of standard evaluation benchmarks. We first develop a parallelism-compatible FlashAttention-2 JVP kernel, enabling sCM training on models with over 10 billion parameters and high-dimensional video tasks. Our investigation reveals fundamental quality limitations of sCM in fine-detail generation, which we attribute to error accumulation and the "mode-covering" nature of its forward-divergence objective. To remedy this, we propose the score-regularized continuous-time consistency model (rCM), which incorporates score distillation as a long-skip regularizer. This integration complements sCM with the "mode-seeking" reverse divergence, effectively improving visual quality while maintaining high generation diversity. Validated on large-scale models (Cosmos-Predict2, Wan2.1) up to 14B parameters and 5-second videos, rCM matches or surpasses the state-of-the-art distillation method DMD2 on quality metrics while offering notable advantages in diversity, all without GAN tuning or extensive hyperparameter searches. The distilled models generate high-fidelity samples in only $1\sim4$ steps, accelerating diffusion sampling by $15\times\sim50\times$. These results position rCM as a practical and theoretically grounded framework for advancing large-scale diffusion distillation.
【2】Investigating Counterclaims in Causality Extraction from Text
标题:文本因果关系提取中的反向主张研究
链接:https://arxiv.org/abs/2510.08224
作者:Tim Hagen, Niklas Deckers, Felix Wolter, Harrisen Scells, Martin Potthast
摘要:Research on causality extraction from text has so far almost entirely neglected counterclaims. Existing causality extraction datasets focus solely on "procausal" claims, i.e., statements that support a relationship. "Concausal" claims, i.e., statements that refute a relationship, are entirely ignored or even accidentally annotated as procausal. We address this shortcoming by developing a new dataset that integrates concausality. Based on an extensive literature review, we first show that concausality is an integral part of causal reasoning on incomplete knowledge. We operationalize this theory in the form of a rigorous guideline for annotation and then augment the Causal News Corpus with concausal statements, obtaining a substantial inter-annotator agreement of Cohen's $\kappa=0.74$. To demonstrate the importance of integrating concausal statements, we show that models trained without concausal relationships tend to misclassify these as procausal instead. Based on our new dataset, this mistake can be mitigated, enabling transformers to effectively distinguish pro- and concausality.
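The agreement figure quoted above is Cohen's $\kappa$, which corrects the raw agreement rate between two annotators for the agreement expected by chance. A minimal sketch of the computation follows; the label names and the toy annotations are invented for illustration and are not taken from the Causal News Corpus.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items that received identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical annotations: "pro" = procausal, "con" = concausal, "none" = no claim.
ann_1 = ["pro", "pro", "con", "none", "con", "pro", "none", "con"]
ann_2 = ["pro", "con", "con", "none", "con", "pro", "none", "pro"]
kappa = cohens_kappa(ann_1, ann_2)
```

In this toy case the annotators agree on 6 of 8 items (0.75), but chance agreement is about 0.34, so $\kappa$ comes out near 0.62.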
【3】Dual-granularity Sinkhorn Distillation for Enhanced Learning from Long-tailed Noisy Data
标题:双粒度Sinkhorn蒸馏:增强从长尾噪声数据中的学习
链接:https://arxiv.org/abs/2510.08179
作者:Feng Hong, Yu Huang, Zihua Zhao, Zhihan Zhou, Jiangchao Yao, Dongsheng Li, Ya Zhang, Yanfeng Wang
备注:25 pages, 2 figures
摘要:Real-world datasets for deep learning frequently suffer from the co-occurring challenges of class imbalance and label noise, hindering model performance. While methods exist for each issue, effectively combining them is non-trivial, as distinguishing genuine tail samples from noisy data proves difficult, often leading to conflicting optimization strategies. This paper presents a novel perspective: instead of primarily developing new complex techniques from scratch, we explore synergistically leveraging well-established, individually 'weak' auxiliary models - specialized for tackling either class imbalance or label noise but not both. This view is motivated by the insight that class imbalance (a distributional-level concern) and label noise (a sample-level concern) operate at different granularities, suggesting that robustness mechanisms for each can in principle offer complementary strengths without conflict. We propose Dual-granularity Sinkhorn Distillation (D-SINK), a novel framework that enhances dual robustness by distilling and integrating complementary insights from such 'weak', single-purpose auxiliary models. Specifically, D-SINK uses an optimal transport-optimized surrogate label allocation to align the target model's sample-level predictions with a noise-robust auxiliary and its class distributions with an imbalance-robust one. Extensive experiments on benchmark datasets demonstrate that D-SINK significantly improves robustness and achieves strong empirical performance in learning from long-tailed noisy data.
【4】SimCast: Enhancing Precipitation Nowcasting with Short-to-Long Term Knowledge Distillation
标题:SimCast:通过短期到长期的知识蒸馏增强降水即时预报
链接:https://arxiv.org/abs/2510.07953
作者:Yifang Yin, Shengkai Chen, Yiyao Li, Lu Wang, Ruibing Jin, Wei Cui, Shili Xiang
备注:accepted by ICME 2025
摘要:Precipitation nowcasting predicts future radar sequences based on current observations, which is a highly challenging task driven by the inherent complexity of the Earth system. Accurate nowcasting is of utmost importance for addressing various societal needs, including disaster management, agriculture, transportation, and energy optimization. As a complement to existing non-autoregressive nowcasting approaches, we investigate the impact of prediction horizons on nowcasting models and propose SimCast, a novel training pipeline featuring a short-to-long term knowledge distillation technique coupled with a weighted MSE loss to prioritize heavy rainfall regions. Improved nowcasting predictions can be obtained without introducing additional overhead during inference. As SimCast generates deterministic predictions, we further integrate it into a diffusion-based framework named CasCast, leveraging the strengths of probabilistic models to overcome limitations such as blurriness and distribution shift in deterministic outputs. Extensive experimental results on three benchmark datasets validate the effectiveness of the proposed framework, achieving mean CSI scores of 0.452 on SEVIR, 0.474 on HKO-7, and 0.361 on MeteoNet, which outperforms existing approaches by a significant margin.
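The abstract mentions a weighted MSE loss that prioritizes heavy rainfall regions but does not give its form. One common way to build such a loss is to weight each pixel by the intensity bin of its target value, sketched below. The thresholds, the weights, and the normalization by pixel count are all assumptions made for illustration, not values from the SimCast paper.

```python
def weighted_mse(pred, target,
                 thresholds=(1.0, 5.0, 10.0),
                 weights=(1.0, 2.0, 5.0, 10.0)):
    """MSE in which each pixel is weighted by the intensity bin of its target.

    `pred` and `target` are flat lists of pixel values. `thresholds` and
    `weights` are hypothetical; `weights` has one more entry than `thresholds`.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("need two equally long, non-empty sequences")
    if len(weights) != len(thresholds) + 1:
        raise ValueError("need exactly one weight per intensity bin")
    total = 0.0
    for p, t in zip(pred, target):
        # The number of thresholds reached by the target selects the bin.
        bin_index = sum(t >= th for th in thresholds)
        total += weights[bin_index] * (p - t) ** 2
    return total / len(pred)


# A miss on a heavy-rain pixel (target 12.0) costs far more than a light one.
loss = weighted_mse([0.0, 0.0], [0.5, 12.0])
```

Because the weights only rescale the training loss, a scheme like this adds no cost at inference time, which is consistent with the abstract's claim.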
推荐(1篇)
【1】Retentive Relevance: Capturing Long-Term User Value in Recommendation Systems
标题:保留相关性:在推荐系统中捕捉长期用户价值
链接:https://arxiv.org/abs/2510.07621
作者:Saeideh Bakhshi, Phuong Mai Nguyen, Robert Schiller, Tiantian Xu, Pawan Kodandapani, Andrew Levine, Cayman Simpson, Qifan Wang
摘要:Recommendation systems have traditionally relied on short-term engagement signals, such as clicks and likes, to personalize content. However, these signals are often noisy, sparse, and insufficient for capturing long-term user satisfaction and retention. We introduce Retentive Relevance, a novel content-level survey-based feedback measure that directly assesses users' intent to return to the platform for similar content. Unlike other survey measures that focus on immediate satisfaction, Retentive Relevance targets forward-looking behavioral intentions, capturing longer-term user intentions and providing a stronger predictor of retention. We validate Retentive Relevance using psychometric methods, establishing its convergent, discriminant, and behavioral validity. Through large-scale offline modeling, we show that Retentive Relevance significantly outperforms both engagement signals and other survey measures in predicting next-day retention, especially for users with limited historical engagement. We develop a production-ready proxy model that integrates Retentive Relevance into the final stage of a multi-stage ranking system on a social media platform. Calibrated score adjustments based on this model yield substantial improvements in engagement and retention while reducing exposure to low-quality content, as demonstrated by large-scale A/B experiments. This work provides the first empirically validated framework linking content-level user perceptions to retention outcomes in production systems. We offer a scalable, user-centered solution that advances both platform growth and user experience. Our work has broad implications for responsible AI development.
超分辨率|去噪|去模糊|去雾(1篇)
【1】Biology-driven assessment of deep learning super-resolution imaging of the porosity network in dentin
标题:生物学驱动的牙本质孔隙网络深度学习超分辨率成像评估
链接:https://arxiv.org/abs/2510.08407
作者:Lauren Anderson, Lucas Chatelain, Nicolas Tremblay, Kathryn Grandfield, David Rousseau, Aurélien Gourrier
摘要:The mechanosensory system of teeth is currently believed to rely partly on the stimulation of odontoblast cells by fluid flow through a porosity network extending through dentin. Visualizing the smallest sub-microscopic porosity vessels therefore requires the highest achievable resolution from confocal fluorescence microscopy, the current gold standard. This considerably limits the extent of the field of view to very small sample regions. To overcome this limitation, we tested different deep learning (DL) super-resolution (SR) models to allow faster experimental acquisitions of lower resolution images and restore optimal image quality by post-processing. Three supervised 2D SR models (RCAN, pix2pix, FSRCNN) and one unsupervised model (CycleGAN) were applied to a unique set of experimentally paired high- and low-resolution confocal images acquired with different sampling schemes, resulting in pixel size increases of x2, x4, and x8. Model performance was quantified using a broad set of similarity and distribution-based image quality assessment (IQA) metrics, which yielded inconsistent results that mostly contradicted our visual perception. This raises the question of the relevance of such generic metrics to efficiently target the specific structure of dental porosity. To resolve this conflicting information, the generated SR images were segmented taking into account the specific scales and morphology of the porosity network and analysed by comparing connected components. Additionally, the capacity of the SR models to preserve 3D porosity connectivity throughout the confocal image stacks was evaluated using graph analysis. This biology-driven assessment allowed a far better mechanistic interpretation of SR performance, highlighting differences in model sensitivity to weak intensity features and the impact of non-linearity in image generation, which explains the failure of standard IQA metrics.
自动驾驶|车辆|车道检测等(2篇)
【1】Team Xiaomi EV-AD VLA: Learning to Navigate Socially Through Proactive Risk Perception - Technical Report for IROS 2025 RoboSense Challenge Social Navigation Track
标题:小米EV-AD VLA团队:学习通过主动风险感知进行社交导航-IROS 2025 RoboSense挑战社交导航赛道技术报告
链接:https://arxiv.org/abs/2510.07871
作者:Erjia Xiao, Lingfeng Zhang, Yingbo Tang, Hao Cheng, Renjing Xu, Wenbo Ding, Lei Zhou, Long Chen, Hangjun Ye, Xiaoshuai Hao
摘要:In this report, we describe the technical details of our submission to the IROS 2025 RoboSense Challenge Social Navigation Track. This track focuses on developing RGBD-based perception and navigation systems that enable autonomous agents to navigate safely, efficiently, and socially compliantly in dynamic human-populated indoor environments. The challenge requires agents to operate from an egocentric perspective using only onboard sensors including RGB-D observations and odometry, without access to global maps or privileged information, while maintaining social norm compliance such as safe distances and collision avoidance. Building upon the Falcon model, we introduce a Proactive Risk Perception Module to enhance social navigation performance. Our approach augments Falcon with collision risk understanding that learns to predict distance-based collision risk scores for surrounding humans, which enables the agent to develop more robust spatial awareness and proactive collision avoidance behaviors. The evaluation on the Social-HM3D benchmark demonstrates that our method improves the agent's ability to maintain personal space compliance while navigating toward goals in crowded indoor scenes with dynamic human agents, achieving 2nd place among 16 participating teams in the challenge.
【2】VeMo: A Lightweight Data-Driven Approach to Model Vehicle Dynamics
标题:VeMo:一种轻量级的数据驱动方法来建模车辆动力学
链接:https://arxiv.org/abs/2510.07447
作者:Girolamo Oddo, Roberto Nuca, Matteo Parsani
摘要:Developing a dynamic model for a high-performance vehicle is a complex problem that requires extensive structural information about the system under analysis. This information is often unavailable to those who did not design the vehicle and represents a typical issue in autonomous driving applications, which are frequently developed on top of existing vehicles; therefore, vehicle models are developed under conditions of information scarcity. This paper proposes a lightweight encoder-decoder model based on Gated Recurrent Unit (GRU) layers to correlate the vehicle's future state with its past states, measured onboard, and the control actions the driver performs. The results demonstrate that the model achieves a maximum mean relative error below 2.6% in extreme dynamic conditions. It also shows good robustness when subjected to noisy input data across the frequency components of interest. Furthermore, despite being entirely data-driven and free from physical constraints, the model exhibits physical consistency in the output signals, such as longitudinal and lateral accelerations, yaw rate, and the vehicle's longitudinal velocity.
点云|SLAM|雷达|激光|深度RGBD相关(2篇)
【1】Beyond Pass@k: Breadth-Depth Metrics for Reasoning Boundaries
标题:Beyond Pass@k:推理边界的广度-深度预设
链接:https://arxiv.org/abs/2510.08325
作者:Marius Dragoi, Ioana Pintilie, Florin Gogianu, Florin Brad
备注:10 pages, 3 figures
摘要:Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm to improve Large Language Models on reasoning tasks such as coding, math or logic. To assess the reasoning boundary (the fraction of problems a model can solve) researchers often report Pass@k at large sampling budgets. Recent results reveal a crossover phenomenon: while RLVR models outperform the base model at small k values, the base model usually outperforms them when sampling a very large number of completions. This has been interpreted as evidence that base models have a larger reasoning boundary. We argue that on tasks with discrete answer spaces, such as math with numeric outputs, Pass@k at large k reflects the increasingly higher chance of success in the limit of the number of trials rather than genuine reasoning, and can therefore be misleading. We propose Cover@tau, which measures the fraction of problems that a model can solve for which at least a tau proportion of completions are correct. Unlike Pass@k, Cover@tau captures reasoning under an explicit reliability threshold: models that rely on random guessing degrade rapidly as tau increases. We evaluate several RLVR models using Cover@tau-based metrics and illustrate how the relative rankings of popular algorithms change compared to Pass@1, offering a different perspective on reasoning boundaries.
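The contrast between Pass@k and Cover@tau can be made concrete with a short sketch. `pass_at_k` below uses the widely used unbiased combinatorial estimator of Pass@k; `cover_at_tau` follows the verbal definition in the abstract, with the empirical success rate standing in for the true proportion, which is an assumption on our part. The per-problem counts are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate from n sampled completions, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def cover_at_tau(correct_counts, n, tau):
    """Fraction of problems whose empirical success rate c / n is at least tau."""
    return sum(c / n >= tau for c in correct_counts) / len(correct_counts)


# Hypothetical results: four problems, n = 100 completions each.
n = 100
correct_counts = [100, 60, 1, 0]

mean_pass_1 = sum(pass_at_k(n, c, 1) for c in correct_counts) / len(correct_counts)
mean_pass_64 = sum(pass_at_k(n, c, 64) for c in correct_counts) / len(correct_counts)
cover_half = cover_at_tau(correct_counts, n, 0.5)
```

The third problem, solved once in 100 attempts, earns 0.64 credit under Pass@64 but does not count at all under Cover@0.5, which is the kind of lucky success the reliability threshold is meant to filter out.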
【2】Reconstructing the local density field with combined convolutional and point cloud architecture
标题:利用结合卷积和点云架构重建局部密度场
链接:https://arxiv.org/abs/2510.08573
作者:Baptiste Barthe-Gold, Nhat-Minh Nguyen, Leander Thiele
备注:6 pages, 4 figures, 1 table. Accepted at the NeurIPS 2025 Workshop: ML4PS. Comments welcome!
摘要:We construct a neural network to perform regression on the local dark-matter density field given line-of-sight peculiar velocities of dark-matter halos, biased tracers of the dark matter field. Our architecture combines a convolutional U-Net with a point-cloud DeepSets. This combination enables efficient use of small-scale information and improves reconstruction quality relative to a U-Net-only approach. Specifically, our hybrid network recovers both clustering amplitudes and phases better than the U-Net on small scales.
联邦学习|隐私保护|加密(3篇)
【1】SketchGuard: Scaling Byzantine-Robust Decentralized Federated Learning via Sketch-Based Screening
标题:SketchGuard:通过基于草图的筛选扩展拜占庭稳健的去中心化联邦学习
链接:https://arxiv.org/abs/2510.07922
作者:Murtaza Rangwala, Farag Azzedin, Richard O. Sinnott, Rajkumar Buyya
备注:23 pages, 5 figures, Code Available: this https URL
摘要:Decentralized Federated Learning (DFL) enables privacy-preserving collaborative training without centralized servers, but remains vulnerable to Byzantine attacks where malicious clients submit corrupted model updates. Existing Byzantine-robust DFL defenses rely on similarity-based neighbor screening that requires every client to exchange and compare complete high-dimensional model vectors with all neighbors in each training round, creating prohibitive communication and computational costs that prevent deployment at web scale. We propose SketchGuard, a general framework that decouples Byzantine filtering from model aggregation through sketch-based neighbor screening. SketchGuard compresses $d$-dimensional models to $k$-dimensional sketches ($k \ll d$) using Count Sketch for similarity comparisons, then selectively fetches full models only from accepted neighbors, reducing per-round communication complexity from $O(d|N_i|)$ to $O(k|N_i| + d|S_i|)$, where $|N_i|$ is the neighbor count and $|S_i| \le |N_i|$ is the accepted neighbor count. We establish rigorous convergence guarantees in both strongly convex and non-convex settings, proving that Count Sketch compression preserves Byzantine resilience with controlled degradation bounds where approximation errors introduce only a $(1+O(\epsilon))$ factor in the effective threshold parameter. Comprehensive experiments across multiple datasets, network topologies, and attack scenarios demonstrate that SketchGuard maintains identical robustness to state-of-the-art methods while reducing computation time by up to 82% and communication overhead by 50-70% depending on filtering effectiveness, with benefits scaling multiplicatively with model dimensionality and network connectivity. These results establish the viability of sketch-based compression as a fundamental enabler of robust DFL at web scale.
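To illustrate the two ingredients named in the abstract, the sketch below implements a plain Count Sketch (each coordinate is hashed to one of $k$ buckets and multiplied by a random sign) and evaluates the two per-round communication counts $d|N_i|$ and $k|N_i| + d|S_i|$. This is a generic Count Sketch written with a seeded random generator in place of hash functions, not SketchGuard's implementation; the function names and the example sizes are hypothetical.

```python
import random


def make_count_sketch(d, k, seed=0):
    """Return a function mapping a length-d vector to a length-k Count Sketch.

    All peers must share the same seed so that their sketches are comparable.
    """
    rng = random.Random(seed)
    buckets = [rng.randrange(k) for _ in range(d)]        # bucket per coordinate
    signs = [rng.choice((-1.0, 1.0)) for _ in range(d)]   # random sign per coordinate

    def sketch(vec):
        out = [0.0] * k
        for value, bucket, sign in zip(vec, buckets, signs):
            out[bucket] += sign * value
        return out

    return sketch


def per_round_cost(d, k, num_neighbors, num_accepted):
    """Numbers transferred per round: (full exchange, sketch-then-fetch)."""
    full = d * num_neighbors
    sketched = k * num_neighbors + d * num_accepted
    return full, sketched


# Hypothetical sizes: 1M-parameter model, 1k-dimensional sketch,
# 10 neighbors of which 5 pass the screening.
full_cost, sketched_cost = per_round_cost(1_000_000, 1_000, 10, 5)
```

Count Sketch is linear, so the sketch of a difference equals the difference of the sketches; distances between models can therefore be estimated from sketches alone before any full model is fetched.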
【2】FedQS: Optimizing Gradient and Model Aggregation for Semi-Asynchronous Federated Learning
标题:FedQS:优化半异步联邦学习的梯度聚合与模型聚合
链接:https://arxiv.org/abs/2510.07664
作者:Yunbo Li, Jiaping Gui, Zhihang Deng, Fanchao Meng, Yue Wu
备注:Accepted by NeurIPS 2025
摘要:Federated learning (FL) enables collaborative model training across multiple parties without sharing raw data, with semi-asynchronous FL (SAFL) emerging as a balanced approach between synchronous and asynchronous FL. However, SAFL faces significant challenges in optimizing both gradient-based (e.g., FedSGD) and model-based (e.g., FedAvg) aggregation strategies, which exhibit distinct trade-offs in accuracy, convergence speed, and stability. While gradient aggregation achieves faster convergence and higher accuracy, it suffers from pronounced fluctuations, whereas model aggregation offers greater stability but slower convergence and suboptimal accuracy. This paper presents FedQS, the first framework to theoretically analyze and address these disparities in SAFL. FedQS introduces a divide-and-conquer strategy to handle client heterogeneity by classifying clients into four distinct types and adaptively optimizing their local training based on data distribution characteristics and available computational resources. Extensive experiments on computer vision, natural language processing, and real-world tasks demonstrate that FedQS achieves the highest accuracy, attains the lowest loss, and ranks among the fastest in convergence speed, outperforming state-of-the-art baselines. Our work bridges the gap between aggregation strategies in SAFL, offering a unified solution for stable, accurate, and efficient federated learning. The code and datasets are available at https://anonymous.4open.science/r/FedQS-EDD6.
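The two aggregation families contrasted in the abstract can be sketched in their simplest synchronous form: gradient aggregation (FedSGD-style) averages client gradients and applies one server step, while model aggregation (FedAvg-style) averages locally updated models. The code below is a toy illustration with one local step per client and weights proportional to hypothetical client data sizes; it does not model staleness or FedQS's client classification.

```python
def weighted_average(vectors, weights):
    """Coordinate-wise weighted average of equally long vectors."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[j] for v, w in zip(vectors, weights)) / total
            for j in range(dim)]


def fedsgd_step(global_model, client_grads, client_sizes, lr):
    """Gradient aggregation: average the gradients, then take one server step."""
    avg_grad = weighted_average(client_grads, client_sizes)
    return [w - lr * g for w, g in zip(global_model, avg_grad)]


def fedavg_step(global_model, client_grads, client_sizes, lr):
    """Model aggregation: each client takes one local step, then models are averaged."""
    local_models = [[w - lr * g for w, g in zip(global_model, grads)]
                    for grads in client_grads]
    return weighted_average(local_models, client_sizes)


# Hypothetical round: two clients holding 1 and 3 samples respectively.
model = [1.0, -2.0]
grads = [[0.5, 1.0], [1.5, -1.0]]
sizes = [1, 3]
by_gradient = fedsgd_step(model, grads, sizes, lr=0.1)
by_model = fedavg_step(model, grads, sizes, lr=0.1)
```

With a single local step and a shared learning rate the two rules coincide. They diverge once clients run several local steps or report at different times, which is the regime the abstract is concerned with.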
【3】Parameter-Free Federated TD Learning with Markov Noise in Heterogeneous Environments
标题:异构环境中具有马尔可夫噪声的无参数联邦TD学习
链接:https://arxiv.org/abs/2510.07436
作者:Ankur Naskar, Gugan Thoppe, Utsav Negi, Vijay Gupta
摘要:Federated learning (FL) can dramatically speed up reinforcement learning by distributing exploration and training across multiple agents. It can guarantee an optimal convergence rate that scales linearly in the number of agents, i.e., a rate of $\tilde{O}(1/(NT))$, where $T$ is the iteration index and $N$ is the number of agents. However, when the training samples arise from a Markov chain, existing results on TD learning achieving this rate require the algorithm to depend on unknown problem parameters. We close this gap by proposing a two-timescale Federated Temporal Difference (FTD) learning with Polyak-Ruppert averaging. Our method provably attains the optimal $\tilde{O}(1/(NT))$ rate in both average-reward and discounted settings, offering a parameter-free FTD approach for Markovian data. Although our results are novel even in the single-agent setting, they apply to the more realistic and challenging scenario of FL with heterogeneous environments.
推理|分析|理解|解释(10篇)
【1】Mix- and MoE-DPO: A Variational Inference Approach to Direct Preference Optimization
标题:混合和MoE-DPO:直接偏好优化的变分推理方法
链接:https://arxiv.org/abs/2510.08256
作者:Jason Bohne, Pawel Polak, David Rosenberg, Brian Bloniarz, Gary Kazantsev
摘要:Direct Preference Optimization (DPO) has recently emerged as a simple and effective alternative to reinforcement learning from human feedback (RLHF) for aligning large language models (LLMs) with user preferences. However, existing DPO formulations rely on a single monolithic model, which limits their expressivity in multi-task settings and their adaptability to heterogeneous or diverse preference distributions. In this work, we propose Mix- and MoE-DPO, a framework that extends DPO with both soft mixture models and mixture-of-experts (MoE) architectures, using a stochastic variational inference approach. Our method introduces a latent-variable model over expert assignments and optimizes a variational evidence lower bound (ELBO), enabling stable and efficient learning of specialized expert policies from preference data. Mix- and MoE-DPO provides three key advantages over standard DPO: (i) generalization via universal function approximation through mixtures; (ii) reward and policy specialization through expert components tailored to distinct preference modes; and (iii) contextual alignment through input-dependent soft gating that enables user-specific mixture policies. Our framework supports both shared base architectures with expert-specific policy heads and fully independent expert models, allowing flexible trade-offs between parameter efficiency and specialization. We validate our approach on a variety of model sizes and multi-preference datasets, demonstrating that Mix- and MoE-DPO offers a powerful and scalable method for preference-based LLM alignment.
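For reference, the standard single-model DPO objective that this work extends can be written for one preference pair as $-\log \sigma\big(\beta[(\log \pi(y_w|x) - \log \pi_{\mathrm{ref}}(y_w|x)) - (\log \pi(y_l|x) - \log \pi_{\mathrm{ref}}(y_l|x))]\big)$. The sketch below evaluates that per-pair loss from sequence log-probabilities. It shows only the baseline DPO loss; the mixture and mixture-of-experts variants and their ELBO are not reproduced here, and the argument names and the default `beta` are our own choices.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single (chosen, rejected) response pair.

    Inputs are sequence log-probabilities under the policy and under the
    frozen reference model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) = log(1 + exp(-margin)), evaluated stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is 0 and the loss is log 2.
loss_at_reference = dpo_loss(-1.0, -2.0, -1.0, -2.0)
# Raising the chosen response's log-probability lowers the loss.
loss_after_update = dpo_loss(-0.5, -2.0, -1.0, -2.0)
```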
【2】Unveiling the Power of Multiple Gossip Steps: A Stability-Based Generalization Analysis in Decentralized Training
标题:揭示多步Gossip的力量:去中心化训练中基于稳定性的泛化分析
链接:https://arxiv.org/abs/2510.07980
作者:Qinglun Li, Yingqi Liu, Miao Zhang, Xiaochun Cao, Quanjun Yin, Li Shen
备注:This paper has been accepted by NeurIPS 2025 (Spotlight)
摘要:Decentralized training removes the centralized server, making it a communication-efficient approach that can significantly improve training efficiency, but it often suffers from degraded performance compared to centralized training. Multi-Gossip Steps (MGS) serve as a simple yet effective bridge between decentralized and centralized training, significantly reducing experiment performance gaps. However, the theoretical reasons for its effectiveness and whether this gap can be fully eliminated by MGS remain open questions. In this paper, we derive upper bounds on the generalization error and excess error of MGS using stability analysis, systematically answering these two key questions. 1). Optimization Error Reduction: MGS reduces the optimization error bound at an exponential rate, thereby exponentially tightening the generalization error bound and enabling convergence to better solutions. 2). Gap to Centralization: Even as MGS approaches infinity, a non-negligible gap in generalization error remains compared to centralized mini-batch SGD ($\mathcal{O}(T^{\frac{c\beta}{c\beta +1}}/{n m})$ in centralized and $\mathcal{O}(T^{\frac{2c\beta}{2c\beta +2}}/{n m^{\frac{1}{2c\beta +2}}})$ in decentralized). Furthermore, we provide the first unified analysis of how factors like learning rate, data heterogeneity, node count, per-node sample size, and communication topology impact the generalization of MGS under non-convex settings without the bounded gradients assumption, filling a critical theoretical gap in decentralized training. Finally, promising experiments on CIFAR datasets support our theoretical findings.
【3】Parallel Test-Time Scaling for Latent Reasoning Models
标题:潜在推理模型的并行测试时间缩放
链接:https://arxiv.org/abs/2510.07745
作者:Runyang You, Yongqi Li, Meng Liu, Wenjie Wang, Liqiang Nie, Wenjie Li
摘要:Parallel test-time scaling (TTS) is a pivotal approach for enhancing large language models (LLMs), typically by sampling multiple token-based chains-of-thought in parallel and aggregating outcomes through voting or search. Recent advances in latent reasoning, where intermediate reasoning unfolds in continuous vector spaces, offer a more efficient alternative to explicit Chain-of-Thought, yet whether such latent models can similarly benefit from parallel TTS remains open, mainly due to the absence of sampling mechanisms in continuous space and the lack of probabilistic signals for advanced trajectory aggregation. This work enables parallel TTS for latent reasoning models by addressing the above issues. For sampling, we introduce two uncertainty-inspired stochastic strategies: Monte Carlo Dropout and Additive Gaussian Noise. For aggregation, we design a Latent Reward Model (LatentRM) trained with a step-wise contrastive objective to score and guide latent reasoning. Extensive experiments and visualization analyses show that both sampling strategies scale effectively with compute and exhibit distinct exploration dynamics, while LatentRM enables effective trajectory selection. Together, our explorations open a new direction for scalable inference in continuous spaces. Code released at https://github.com/YRYangang/LatentTTS.
【4】Design-Based Bandits Under Network Interference: Trade-Off Between Regret and Statistical Inference
标题:网络干扰下基于设计的多臂老虎机:遗憾与统计推断之间的权衡
链接:https://arxiv.org/abs/2510.07646
作者:Zichen Wang, Haoyang Hong, Chuanhao Li, Haoxuan Li, Zhiheng Zhang, Huazheng Wang
摘要:In multi-armed bandits with network interference (MABNI), the action taken by one node can influence the rewards of others, creating complex interdependence. While existing research on MABNI largely concentrates on minimizing regret, it often overlooks the crucial concern that an excessive emphasis on the optimal arm can undermine the inference accuracy for sub-optimal arms. Although initial efforts have been made to address this trade-off in single-unit scenarios, these challenges have become more pronounced in the context of MABNI. In this paper, we establish, for the first time, a theoretical Pareto frontier characterizing the trade-off between regret minimization and inference accuracy in adversarial (design-based) MABNI. We further introduce an anytime-valid asymptotic confidence sequence along with a corresponding algorithm, $\texttt{EXP3-N-CS}$, specifically designed to balance the trade-off between regret minimization and inference accuracy in this setting.
【5】Test-Time Matching: Unlocking Compositional Reasoning in Multimodal Models
标题:测试时匹配:解锁多模态模型中的组合推理
链接:https://arxiv.org/abs/2510.07632
作者:Yinglun Zhu, Jiancheng Zhang, Fuzhi Tang
摘要:Frontier AI models have achieved remarkable progress, yet recent studies suggest they struggle with compositional reasoning, often performing at or below random chance on established benchmarks. We revisit this problem and show that widely used evaluation metrics systematically underestimate model capability. To address this, we introduce a group matching score that better exploits group structure and reveals substantial hidden capability in both contrastive vision-language models (VLMs) and multimodal large language models (MLLMs). Moreover, simply overfitting to the induced group matchings at test time transfers this hidden capability into higher scores under standard evaluation metrics, closing much of the reported gap. This adjustment enables SigLIP-B16 to surpass all previous results and GPT-4.1 to yield the first result surpassing estimated human performance on Winoground. Building on this insight, we propose Test-Time Matching (TTM), an iterative, self-improving algorithm that further bootstraps model performance without any external supervision. TTM delivers additional, non-trivial improvements: for example, TTM enables SigLIP-B16 to surpass GPT-4.1 on MMVP-VLM, establishing a new state of the art. Importantly, TTM remains broadly effective even on benchmarks without metric-induced effects or group structures, achieving relative gains up to 85.7% on challenging datasets such as WhatsUp. Across 16 dataset variants spanning diverse setups, our experiments demonstrate that TTM consistently improves model performance and advances the frontier of compositional reasoning.
【6】When Thoughts Meet Facts: Reusable Reasoning for Long-Context LMs
标题:当想法与事实相遇时:长上下文LM的可重复使用推理
链接:https://arxiv.org/abs/2510.07499
作者:Soyeong Jeong, Taehee Jung, Sung Ju Hwang, Joo-Kyung Kim, Dongyeop Kang
摘要:Recent Long-Context Language Models (LCLMs) can process hundreds of thousands of tokens in a single prompt, enabling new opportunities for knowledge-intensive multi-hop reasoning by integrating large sets of retrieved documents or, in some cases, directly all necessary information. However, simply feeding more documents into the context window fails to capture how evidence should be connected. We address this gap with thought templates, which recast reasoning as reusable thought caches, derived from prior problem solving traces, structuring how evidence is combined and guiding multi-hop inference with factual documents. To keep these templates effective, we propose an update strategy that iteratively refines templates derived from training data through natural-language feedback. Across diverse benchmarks and LCLM families, our approach delivers consistent gains over strong baselines in both retrieval-based and retrieval-free settings. Furthermore, we show that optimized templates can be distilled into smaller open-source models, demonstrating its broad applicability and transparent reasoning reuse. We refer to our framework as Thought Template Augmented LCLMs (ToTAL).
【7】Comparison of Fully Homomorphic Encryption and Garbled Circuit Techniques in Privacy-Preserving Machine Learning Inference
标题:全同态加密和混淆电路技术在隐私保护机器学习推理中的比较
链接:https://arxiv.org/abs/2510.07457
作者:Kalyan Cheerla, Lotfi Ben Othmane, Kirill Morozov (University of North Texas)
备注:8 pages, 9 figures, 2 tables, 32 references
摘要:Machine Learning (ML) is making its way into fields such as healthcare, finance, and Natural Language Processing (NLP), and concerns over data privacy and model confidentiality continue to grow. Privacy-preserving Machine Learning (PPML) addresses this challenge by enabling inference on private data without revealing sensitive inputs or proprietary models. Leveraging Secure Computation techniques from Cryptography, two widely studied approaches in this domain are Fully Homomorphic Encryption (FHE) and Garbled Circuits (GC). This work presents a comparative evaluation of FHE and GC for secure neural network inference. A two-layer neural network (NN) was implemented using the CKKS scheme from the Microsoft SEAL library (FHE) and the TinyGarble2.0 framework (GC) by IntelLabs. Both implementations are evaluated under the semi-honest threat model, measuring inference output error, round-trip time, peak memory usage, communication overhead, and communication rounds. Results reveal a trade-off: modular GC offers faster execution and lower memory consumption, while FHE supports non-interactive inference.
【8】Encode, Think, Decode: Scaling test-time reasoning with recursive latent thoughts
标题:编码、思考、解码:用递归潜在思维扩展测试时推理
链接:https://arxiv.org/abs/2510.07358
作者:Yeskendir Koishekenov, Aldo Lipani, Nicola Cancedda
摘要:Most efforts to improve the reasoning capabilities of large language models (LLMs) involve either scaling the number of parameters and the size of training data, or scaling inference computation by letting models generate complex chains of thought. Motivated by interpretability studies showing that the crucial computation required for reasoning tasks is concentrated in a limited range of layers, we introduce Encode-Think-Decode (ETD), a method that enhances the reasoning capabilities of a base model by training it to iterate over a small subset of reasoning-relevant layers during the mid-training stage. ETD amplifies latent reasoning while preserving the original architecture, parameter count, hyperparameters, and training data composition. When iterating on the selected layers at inference time, ETD models yield substantial gains on 17 reasoning benchmarks, including +28.4% relative accuracy improvement on GSM8K and +36% on MATH with the OLMo-2 1B Base model. We also explore an adaptive depth strategy that adjusts the computation per input token. Our results show that recursive latent reasoning offers a simple and effective path to stronger LLM reasoning.
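The looping idea in the ETD abstract can be illustrated with a toy forward pass. This is a minimal sketch, not the authors' implementation: the residual `tanh` blocks stand in for transformer layers, and the split indices (`think_start`, `think_end`) and the helper names are hypothetical. It only shows that iterating over a middle block adds compute without adding parameters.

```python
import numpy as np

def make_layer(rng, dim):
    """Toy residual block standing in for one transformer layer (assumption)."""
    w = rng.normal(scale=dim ** -0.5, size=(dim, dim))
    return lambda h: h + np.tanh(h @ w)

def etd_forward(h, layers, think_start, think_end, n_iters):
    """Apply the encoder layers once, the 'think' layers n_iters times, then the decoder layers once."""
    calls = 0
    for layer in layers[:think_start]:
        h = layer(h)
        calls += 1
    for _ in range(n_iters):
        for layer in layers[think_start:think_end]:
            h = layer(h)
            calls += 1
    for layer in layers[think_end:]:
        h = layer(h)
        calls += 1
    return h, calls

rng = np.random.default_rng(0)
layers = [make_layer(rng, 16) for _ in range(8)]
x = rng.normal(size=(4, 16))

# Hypothetical split: layers 0-2 encode, 3-4 "think", 5-7 decode.
out_once, calls_once = etd_forward(x, layers, 3, 5, n_iters=1)
out_loop, calls_loop = etd_forward(x, layers, 3, 5, n_iters=4)
print(calls_once, calls_loop)  # 8 14
```

With `n_iters=1` the function reduces to an ordinary forward pass through all eight layers. Larger values reuse the same two middle layers, so the weight count stays fixed while the number of layer applications grows.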
【9】High-dimensional Analysis of Synthetic Data Selection
标题:合成数据选择的高维分析
链接:https://arxiv.org/abs/2510.08123
作者:Parham Rezaei, Filip Kovacevic, Francesco Locatello, Marco Mondelli
摘要:Despite the progress in the development of generative models, their usefulness in creating synthetic data that improve prediction performance of classifiers has been put into question. Besides heuristic principles such as "synthetic data should be close to the real data distribution", it is actually not clear which specific properties affect the generalization error. Our paper addresses this question through the lens of high-dimensional regression. Theoretically, we show that, for linear models, the covariance shift between the target distribution and the distribution of the synthetic data affects the generalization error but, surprisingly, the mean shift does not. Furthermore we prove that, in some settings, matching the covariance of the target distribution is optimal. Remarkably, the theoretical insights from linear models carry over to deep neural networks and generative models. We empirically demonstrate that the covariance matching procedure (matching the covariance of the synthetic data with that of the data coming from the target distribution) performs well against several recent approaches for synthetic data selection, across training paradigms, architectures, datasets and generative models used for augmentation.
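The covariance matching idea described in this abstract can be sketched as a simple selection rule. The following is an illustrative assumption rather than the paper's procedure: it scores whole candidate pools by the Frobenius distance between sample covariances, and the function names (`covariance_gap`, `select_pool`) are hypothetical.

```python
import numpy as np

def covariance_gap(synthetic, target):
    """Frobenius distance between the sample covariances of two feature matrices."""
    cov_s = np.cov(synthetic, rowvar=False)
    cov_t = np.cov(target, rowvar=False)
    return float(np.linalg.norm(cov_s - cov_t, ord="fro"))

def select_pool(pools, target):
    """Return the index of the synthetic pool with the smallest covariance gap, plus all gaps."""
    gaps = [covariance_gap(p, target) for p in pools]
    return int(np.argmin(gaps)), gaps

rng = np.random.default_rng(0)
scales = np.array([1.0, 2.0, 0.5, 1.5, 1.0])
target = rng.normal(size=(2000, 5)) * scales

# Pool 0 has the target covariance but a shifted mean; pool 1 has the target mean but a different covariance.
pool_shifted_mean = rng.normal(size=(2000, 5)) * scales + 5.0
pool_wrong_cov = rng.normal(size=(2000, 5)) * 3.0

best, gaps = select_pool([pool_shifted_mean, pool_wrong_cov], target)
```

Because `np.cov` centers the data, the mean-shifted pool is not penalized by this score, which is in the spirit of the abstract's claim that covariance shift, not mean shift, drives the generalization error for linear models.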
【10】Beyond independent component analysis: identifiability and algorithms
标题:超越独立成分分析:可识别性和算法
链接:https://arxiv.org/abs/2510.07525
作者:Alvaro Ribot, Anna Seigal, Piotr Zwiernik
备注:30 pages, 8 figures
摘要:Independent Component Analysis (ICA) is a classical method for recovering latent variables with useful identifiability properties. For independent variables, cumulant tensors are diagonal; relaxing independence yields tensors whose zero structure generalizes diagonality. These models have been the subject of recent work in non-independent component analysis. We show that pairwise mean independence answers the question of how much one can relax independence: it is identifiable, any weaker notion is non-identifiable, and it contains the models previously studied as special cases. Our results apply to distributions with the required zero pattern at any cumulant tensor. We propose an algebraic recovery algorithm based on least-squares optimization over the orthogonal group. Simulations highlight robustness: enforcing full independence can harm estimation, while pairwise mean independence enables more stable recovery. These findings extend the classical ICA framework and provide a rigorous basis for blind source separation beyond independence.
检测相关(4篇)
【1】New Machine Learning Approaches for Intrusion Detection in ADS-B
标题:ADS-B中用于入侵检测的新机器学习方法
链接:https://arxiv.org/abs/2510.08333
作者:Mikaëla Ngamboé, Jean-Simon Marrocco, Jean-Yves Ouattara, José M. Fernandez, Gabriela Nicolescu
备注:This is the author's version of the work accepted for publication Digital Avionics Systems Conference (DASC) 2025. The final version will be available via IEEE Xplore
摘要:With the growing reliance on the vulnerable Automatic Dependent Surveillance-Broadcast (ADS-B) protocol in air traffic management (ATM), ensuring security is critical. This study investigates emerging machine learning models and training strategies to improve AI-based intrusion detection systems (IDS) for ADS-B. Focusing on ground-based ATM systems, we evaluate two deep learning IDS implementations: one using a transformer encoder and the other an extended Long Short-Term Memory (xLSTM) network, marking the first xLSTM-based IDS for ADS-B. A transfer learning strategy was employed, involving pre-training on benign ADS-B messages and fine-tuning with labeled data containing instances of tampered messages. Results show this approach outperforms existing methods, particularly in identifying subtle attacks that progressively undermine situational awareness. The xLSTM-based IDS achieves an F1-score of 98.9%, surpassing the transformer-based model at 94.3%. Tests on unseen attacks validated the generalization ability of the xLSTM model. Inference latency analysis shows that the 7.26-second delay introduced by the xLSTM-based IDS fits within the Secondary Surveillance Radar (SSR) refresh interval (5-12 s), although it may be restrictive for time-critical operations. While the transformer-based IDS achieves a 2.1-second latency, it does so at the cost of lower detection performance.
【2】Physics-Driven Spatiotemporal Modeling for AI-Generated Video Detection
标题:用于人工智能生成视频检测的物理驱动时空建模
链接:https://arxiv.org/abs/2510.08073
作者:Shuhai Zhang, ZiHao Lian, Jiahao Yang, Daiyuan Li, Guoxuan Pang, Feng Liu, Bo Han, Shutao Li, Mingkui Tan
备注:Accepted at NeurIPS 2025 spotlight
摘要:AI-generated videos have achieved near-perfect visual realism (e.g., Sora), urgently necessitating reliable detection mechanisms. However, detecting such videos faces significant challenges in modeling high-dimensional spatiotemporal dynamics and identifying subtle anomalies that violate physical laws. In this paper, we propose a physics-driven AI-generated video detection paradigm based on probability flow conservation principles. Specifically, we propose a statistic called Normalized Spatiotemporal Gradient (NSG), which quantifies the ratio of spatial probability gradients to temporal density changes, explicitly capturing deviations from natural video dynamics. Leveraging pre-trained diffusion models, we develop an NSG estimator through spatial gradients approximation and motion-aware temporal modeling without complex motion decomposition while preserving physical constraints. Building on this, we propose an NSG-based video detection method (NSG-VD) that computes the Maximum Mean Discrepancy (MMD) between NSG features of the test and real videos as a detection metric. Last, we derive an upper bound of NSG feature distances between real and generated videos, proving that generated videos exhibit amplified discrepancies due to distributional shifts. Extensive experiments confirm that NSG-VD outperforms state-of-the-art baselines by 16.00% in Recall and 10.75% in F1-Score, validating the superior performance of NSG-VD. The source code is available at https://github.com/ZSHsh98/NSG-VD.
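The detection metric named in this abstract, the Maximum Mean Discrepancy between feature sets, can be sketched as follows. This is a generic MMD estimate under stated assumptions: random Gaussian vectors stand in for NSG features (the diffusion-based NSG estimator is not reproduced), and the RBF kernel and its bandwidth are illustrative choices, not the paper's.

```python
import numpy as np

def rbf_kernel(x, y, bandwidth):
    """Gaussian RBF kernel matrix between the rows of x and the rows of y."""
    sq = np.sum(x**2, axis=1)[:, None] + np.sum(y**2, axis=1)[None, :] - 2.0 * x @ y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def mmd2_biased(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples x and y."""
    return float(rbf_kernel(x, x, bandwidth).mean()
                 + rbf_kernel(y, y, bandwidth).mean()
                 - 2.0 * rbf_kernel(x, y, bandwidth).mean())

rng = np.random.default_rng(0)
dim, n = 8, 200
reference = rng.normal(size=(n, dim))              # stand-in for features of known real videos
test_real = rng.normal(size=(n, dim))              # drawn from the same distribution
test_generated = rng.normal(size=(n, dim)) + 1.0   # drawn from a shifted distribution

score_real = mmd2_biased(reference, test_real, bandwidth=3.0)
score_generated = mmd2_biased(reference, test_generated, bandwidth=3.0)
```

A detector built on this score would flag a test set when its MMD to the real reference exceeds a threshold. How that threshold is chosen is not stated in the abstract and is left open here.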
【3】Causality Guided Representation Learning for Cross-Style Hate Speech Detection
标题:面向跨风格仇恨言论检测的因果引导表示学习
链接:https://arxiv.org/abs/2510.07707
作者:Chengshuai Zhao, Shu Wan, Paras Sheth, Karan Patwa, K. Selçuk Candan, Huan Liu
摘要:The proliferation of online hate speech poses a significant threat to the harmony of the web. While explicit hate is easily recognized through overt slurs, implicit hate speech is often conveyed through sarcasm, irony, stereotypes, or coded language -- making it harder to detect. Existing hate speech detection models, which predominantly rely on surface-level linguistic cues, fail to generalize effectively across diverse stylistic variations. Moreover, hate speech spread on different platforms often targets distinct groups and adopts unique styles, potentially inducing spurious correlations between them and labels, further challenging current detection approaches. Motivated by these observations, we hypothesize that the generation of hate speech can be modeled as a causal graph involving key factors: contextual environment, creator motivation, target, and style. Guided by this graph, we propose CADET, a causal representation learning framework that disentangles hate speech into interpretable latent factors and then controls confounders, thereby isolating genuine hate intent from superficial linguistic cues. Furthermore, CADET allows counterfactual reasoning by intervening on style within the latent space, naturally guiding the model to robustly identify hate speech in varying forms. CADET demonstrates superior performance in comprehensive experiments, highlighting the potential of causal priors in advancing generalizable hate speech detection.
【4】Enhancing Maritime Object Detection in Real-Time with RT-DETR and Data Augmentation
标题:利用RT-DETR和数据增强增强实时海事目标检测
链接:https://arxiv.org/abs/2510.07346
作者:Nader Nemati
备注:13 pages, 10 figures
摘要:Maritime object detection faces essential challenges due to the small target size and limitations of labeled real RGB data. This paper will present a real-time object detection system based on RT-DETR, enhanced by employing augmented synthetic images while strictly evaluating on real data. This study employs RT-DETR for the maritime environment by combining multi-scale feature fusion, uncertainty-minimizing query selection, and smart weight between synthetic and real training samples. The fusion module in DETR enhances the detection of small, low-contrast vessels, query selection focuses on the most reliable proposals, and the weighting strategy helps reduce the visual gap between synthetic and real domains. This design preserves DETR's refined end-to-end set prediction while allowing users to adjust between speed and accuracy at inference time. Data augmentation techniques were also used to balance the different classes of the dataset to improve the robustness and accuracy of the model. Regarding this study, a full Python robust maritime detection pipeline is delivered that maintains real-time performance even under practical limits. It also verifies how each module contributes, and how the system handles failures in extreme lighting or sea conditions. This study also includes a component analysis to quantify the contribution of each architectural module and explore its interactions.
分类|识别(5篇)
【1】Long-tailed Recognition with Model Rebalancing
标题:模型再平衡的长尾识别
链接:https://arxiv.org/abs/2510.08177
作者:Jiaan Luo, Feng Hong, Qiang Hu, Xiaofeng Cao, Feng Liu, Jiangchao Yao
摘要:Long-tailed recognition is ubiquitous and challenging in deep learning and even in the downstream finetuning of foundation models, since the skewed class distribution generally prevents the model from generalizing to the tail classes. Despite the promise of previous methods based on data augmentation, loss rebalancing, and decoupled training, consistent improvement in broad scenarios such as multi-label long-tailed recognition remains difficult. In this study, we examine the impact of model capacity in the long-tailed context and propose a novel framework, Model Rebalancing (MORE), which mitigates imbalance by directly rebalancing the model's parameter space. Specifically, MORE introduces a low-rank parameter component to mediate the parameter space allocation, guided by a tailored loss and a sinusoidal reweighting schedule, without increasing the overall model complexity or inference costs. Extensive experiments on diverse long-tailed benchmarks, spanning multi-class and multi-label tasks, demonstrate that MORE significantly improves generalization, particularly for tail classes, and effectively complements existing imbalance mitigation methods. These results highlight MORE's potential as a robust plug-and-play module in long-tailed settings.
【2】Beyond Sub-6 GHz: Leveraging mmWave Wi-Fi for Gait-Based Person Identification
标题:超越Sub-6 GHz:利用毫米波Wi-Fi进行基于步态的人员识别
链接:https://arxiv.org/abs/2510.08160
作者:Nabeel Nisar Bhat, Maksim Karnaukh, Jakob Struye, Rafael Berkvens, Jeroen Famaey
摘要:Person identification plays a vital role in enabling intelligent, personalized, and secure human-computer interaction. Recent research has demonstrated the feasibility of leveraging Wi-Fi signals for passive person identification using a person's unique gait pattern. Although most existing work focuses on sub-6 GHz frequencies, the emergence of mmWave offers new opportunities through its finer spatial resolution, though its comparative advantages for person identification remain unexplored. This work presents the first comparative study between sub-6 GHz and mmWave Wi-Fi signals for person identification with commercial off-the-shelf (COTS) Wi-Fi, using a novel dataset of synchronized measurements from the two frequency bands in an indoor environment. To ensure a fair comparison, we apply identical training pipelines and model configurations across both frequency bands. Leveraging end-to-end deep learning, we show that even at low sampling rates (10 Hz), mmWave Wi-Fi signals can achieve high identification accuracy (91.2% on 20 individuals) when combined with effective background subtraction.
【3】Label Semantics for Robust Hyperspectral Image Classification
标题:鲁棒高光谱图像分类的标签语义
链接:https://arxiv.org/abs/2510.07556
作者:Rafin Hassan, Zarin Tasnim Roshni, Rafiqul Bari, Alimul Islam, Nabeel Mohammed, Moshiur Farazi, Shafin Rahman
备注:This work has been accepted for publication in the proceedings of IJCNN 2025
摘要:Hyperspectral imaging (HSI) classification is a critical tool with widespread applications across diverse fields such as agriculture, environmental monitoring, medicine, and materials science. Due to the limited availability of high-quality training samples and the high dimensionality of spectral data, HSI classification models are prone to overfitting and often face challenges in balancing accuracy and computational complexity. Furthermore, most HSI classification models are monomodal, relying solely on spectral-spatial data to learn decision boundaries in the high-dimensional embedding space. To address this, we propose a general-purpose Semantic Spectral-Spatial Fusion Network (S3FN) that uses contextual, class-specific textual descriptions to complement the training of an HSI classification model. Specifically, S3FN leverages LLMs to generate comprehensive textual descriptions for each class label that capture its unique characteristics and spectral behaviors. These descriptions are then embedded into a vector space using a pre-trained text encoder such as BERT or RoBERTa to extract meaningful label semantics, which in turn leads to better feature-label alignment and improved classification performance. To demonstrate the effectiveness of our approach, we evaluate our model on three diverse HSI benchmark datasets (Hyperspectral Wood, HyperspectralBlueberries, and DeepHS-Fruit) and report a significant performance boost. Our results highlight the synergy between textual semantics and spectral-spatial data, paving the way for further advancements in semantically augmented HSI classification models. Code is available at: https://github.com/milab-nsu/S3FN
【4】EEG Sleep Stage Classification with Continuous Wavelet Transform and Deep Learning
标题:使用连续小波变换和深度学习的脑电睡眠阶段分类
链接:https://arxiv.org/abs/2510.07524
作者:Mehdi Zekriyapanah Gashti, Ghasem Farjamnia
备注:11 pages, 2 figures
摘要:Accurate classification of sleep stages is crucial for the diagnosis and management of sleep disorders. Conventional approaches for sleep scoring rely on manual annotation or features extracted from EEG signals in the time or frequency domain. This study proposes a novel framework for automated sleep stage scoring using time-frequency analysis based on the wavelet transform. The Sleep-EDF Expanded Database (sleep-cassette recordings) was used for evaluation. The continuous wavelet transform (CWT) generated time-frequency maps that capture both transient and oscillatory patterns across frequency bands relevant to sleep staging. Experimental results demonstrate that the proposed wavelet-based representation, combined with ensemble learning, achieves an overall accuracy of 88.37 percent and a macro-averaged F1 score of 73.15, outperforming conventional machine learning methods and exhibiting comparable or superior performance to recent deep learning approaches. These findings highlight the potential of wavelet analysis for robust, interpretable, and clinically applicable sleep stage classification.
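The time-frequency maps mentioned in this abstract can be approximated with a plain NumPy Morlet transform. This is a minimal sketch under assumptions: the wavelet family, the 1-30 Hz grid, the 100 Hz sampling rate, and the 30-second epoch are illustrative choices, and a synthetic 10 Hz oscillation replaces real EEG. The paper's actual preprocessing is not described in the abstract.

```python
import numpy as np

def morlet_cwt(signal, fs, freqs, n_cycles=6.0):
    """Return |CWT| magnitudes with shape (len(freqs), len(signal)) using complex Morlet wavelets."""
    out = np.empty((len(freqs), len(signal)))
    for i, f in enumerate(freqs):
        sigma_t = n_cycles / (2.0 * np.pi * f)            # width of the Gaussian envelope in seconds
        t = np.arange(-4.0 * sigma_t, 4.0 * sigma_t, 1.0 / fs)
        wavelet = np.exp(2j * np.pi * f * t) * np.exp(-t**2 / (2.0 * sigma_t**2))
        wavelet /= np.sum(np.abs(wavelet))                # L1 normalization keeps amplitudes comparable across scales
        out[i] = np.abs(np.convolve(signal, wavelet, mode="same"))
    return out

fs = 100                                   # assumed sampling rate in Hz
time = np.arange(0, 30, 1.0 / fs)          # one assumed 30-second epoch
rng = np.random.default_rng(0)
epoch = np.sin(2 * np.pi * 10 * time) + 0.1 * rng.normal(size=time.size)  # synthetic alpha-band rhythm

freqs = np.arange(1, 31)                   # 1-30 Hz analysis grid
scalogram = morlet_cwt(epoch, fs, freqs)
peak_freq = freqs[np.argmax(scalogram.mean(axis=1))]
```

The resulting `scalogram` array is the kind of 2D map that could be passed to an image-style classifier. Note that `np.convolve(..., mode="same")` returns an array as long as the signal only when the wavelet is shorter than the signal, which holds for this frequency grid and epoch length.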
【5】Deep Learning Based Approach to Enhanced Recognition of Emotions and Behavioral Patterns of Autistic Children
标题:基于深度学习的方法增强自闭症儿童情绪和行为模式的识别
链接:https://arxiv.org/abs/2510.07320
作者:Nelaka K.A.R, Peiris M.K.V, Liyanage R.P.B
摘要:Autism Spectrum Disorder significantly influences the communication abilities, learning processes, behavior, and social interactions of individuals. Although early intervention and customized educational strategies are critical to improving outcomes, there is a pivotal gap in understanding and addressing nuanced behavioral patterns and emotional identification in autistic children prior to skill development. This extended research delves into the foundational step of recognizing and mapping these patterns as a prerequisite to improving learning and soft skills. Using a longitudinal approach to monitor emotions and behaviors, this study aims to establish a baseline understanding of the unique needs and challenges faced by autistic students, particularly in the Information Technology domain, where opportunities are markedly limited. Through a detailed analysis of behavioral trends over time, we propose a targeted framework for developing applications and technical aids designed to meet these identified needs. Our research underscores the importance of a sequential and evidence-based intervention approach that prioritizes a deep understanding of each child's behavioral and emotional landscape as the basis for effective skill development. By shifting the focus toward early identification of behavioral patterns, we aim to foster a more inclusive and supportive learning environment that can significantly improve the educational and developmental trajectory of children with ASD.
表征(1篇)
【1】On the Relationship Between the Choice of Representation and In-Context Learning
标题:论表征的选择与上下文学习的关系
链接:https://arxiv.org/abs/2510.08372
作者:Ioana Marinescu, Kyunghyun Cho, Eric Karl Oermann
备注:25 pages, 6 figures, 10 tables
摘要:In-context learning (ICL) is the ability of a large language model (LLM) to learn a new task from a few demonstrations presented as part of the context. Past studies have attributed a large portion of the success of ICL to the way these in-context demonstrations are represented, particularly to how labels are represented in classification tasks. On the other hand, observations of the learning capacity of ICL (i.e., the extent to which more in-context demonstrations can lead to higher performance) have been mixed, and ICL is often thought to occur only under specific conditions. The interaction between these two aspects in ICL, representation and learning, has not been studied in depth until now. We hypothesize that they are largely independent of one another, such that the representation of demonstrations determines the baseline accuracy of ICL, while learning from additional demonstrations improves only on top of this baseline. We validate this hypothesis by developing an optimization algorithm that can enumerate a spectrum of possible label sets (representations) varying in semantic relevance. We then perform ICL with varying numbers of in-context demonstrations for each of these label sets. We observed that learning happens regardless of the quality of the label set itself, although its efficiency, measured by the slope of improvement over in-context demonstrations, is conditioned on both the label set quality and the parameter count of the underlying language model. Despite the emergence of learning, the relative quality (accuracy) of the choice of a label set (representation) is largely maintained throughout learning, confirming our hypothesis and implying their orthogonality. Our work reveals a previously underexplored aspect of ICL: the independent effects of learning from demonstrations and their representations on ICL performance.
编码器(2篇)
【1】Bayesian Optimization of Multi-Bit Pulse Encoding in In2O3/Al2O3 Thin-film Transistors for Temporal Data Processing
标题:In2O3/Al2O3薄膜晶体管中多位脉冲编码的贝叶斯优化用于时序数据处理
链接:https://arxiv.org/abs/2510.07421
作者:Javier Meza-Arroyo, Benius Dunn, Weijie Xu, Yu-Chieh Chen, Jen-Sue Chen, Julia W.P. Hsu
摘要:Utilizing the intrinsic history-dependence and nonlinearity of hardware, physical reservoir computing is a promising neuromorphic approach to encode time-series data for in-sensor computing. The accuracy of this encoding critically depends on the distinguishability of multi-state outputs, which is often limited by suboptimal and empirically chosen reservoir operation conditions. In this work, we demonstrate a machine learning approach, Bayesian optimization, to improve the encoding fidelity of solution-processed Al2O3/In2O3 thin-film transistors (TFTs). We show high-fidelity 6-bit temporal encoding by exploring five key pulse parameters and using the normalized degree of separation (nDoS) as the metric of output state separability. Additionally, we show that a model trained on simpler 4-bit data can effectively guide optimization of more complex 6-bit encoding tasks, reducing experimental cost. Specifically, for the encoding and reconstruction of binary-patterned images of a moving car across 6 sequential frames, we demonstrate that the encoding is more accurate when operating the TFT using optimized pulse parameters and the 4-bit optimized operating condition performs almost as well as the 6-bit optimized condition. Finally, interpretability analysis via Shapley Additive Explanations (SHAP) reveals that gate pulse amplitude and drain voltage are the most influential parameters in achieving higher state separation. This work presents the first systematic method to identify optimal operating conditions for reservoir devices, and the approach can be extended to other physical reservoir implementations across different material platforms.
【2】Beyond Grid-Locked Voxels: Neural Response Functions for Continuous Brain Encoding
标题:超越网格锁定体素:连续大脑编码的神经反应函数
链接:https://arxiv.org/abs/2510.07342
作者:Haomiao Chen, Keith W Jamison, Mert R. Sabuncu, Amy Kuceyeski
摘要:Neural encoding models aim to predict fMRI-measured brain responses to natural images. fMRI data is acquired as a 3D volume of voxels, where each voxel has a defined spatial location in the brain. However, conventional encoding models often flatten this volume into a 1D vector and treat voxel responses as independent outputs. This removes spatial context, discards anatomical information, and ties each model to a subject-specific voxel grid. We introduce the Neural Response Function (NRF), a framework that models fMRI activity as a continuous function over anatomical space rather than a flat vector of voxels. NRF represents brain activity as a continuous implicit function: given an image and a spatial coordinate (x, y, z) in standardized MNI space, the model predicts the response at that location. This formulation decouples predictions from the training grid, supports querying at arbitrary spatial resolutions, and enables resolution-agnostic analyses. By grounding the model in anatomical space, NRF exploits two key properties of brain responses: (1) local smoothness -- neighboring voxels exhibit similar response patterns; modeling responses continuously captures these correlations and improves data efficiency, and (2) cross-subject alignment -- MNI coordinates unify data across individuals, allowing a model pretrained on one subject to be fine-tuned on new subjects. In experiments, NRF outperformed baseline models in both intrasubject encoding and cross-subject adaptation, achieving high performance while reducing the data size needed by orders of magnitude. To our knowledge, NRF is the first anatomically aware encoding model to move beyond flattened voxels, learning a continuous mapping from images to brain responses in 3D space.
优化|敛散性(11篇)
【1】On the optimization dynamics of RLVR: Gradient gap and step size thresholds
标题:关于RLVR的优化动态:梯度间隙和步长阈值
链接:https://arxiv.org/abs/2510.08539
作者:Joe Suk, Yaqi Duan
摘要:Reinforcement Learning with Verifiable Rewards (RLVR), which uses simple binary feedback to post-train large language models, has shown significant empirical success. However, a principled understanding of why it works has been lacking. This paper builds a theoretical foundation for RLVR by analyzing its training process at both the full-response (trajectory) and token levels. Central to our analysis is a quantity called the Gradient Gap, which formalizes the direction of improvement from low-reward to high-reward regions of the response space. We prove that convergence critically depends on aligning the update direction with this Gradient Gap. Moreover, we derive a sharp step-size threshold based on the magnitude of the Gradient Gap: below it, learning converges, whereas above it, performance collapses. Our theory further predicts how the critical step size must scale with response length and the success rate, thereby explaining why practical heuristics such as length normalization improve stability and showing that, with a fixed learning rate, the success rate can stagnate strictly below $100\%$. We validate these predictions through controlled bandit simulations and LLM experiments, including training Qwen2.5-7B with GRPO.
【2】AutoMLGen: Navigating Fine-Grained Optimization for Coding Agents
标题:AutoMLGen:为编码代理导航细粒度优化
链接:https://arxiv.org/abs/2510.08511
作者:Shangheng Du, Xiangchao Yan, Dengyang Jiang, Jiakang Yuan, Yusong Hu, Xin Li, Liang He, Bo Zhang, Lei Bai
摘要:Large language models (LLMs) have shown impressive performance in general programming tasks. However, in Machine Learning Engineering (MLE) scenarios such as AutoML and Kaggle competitions, achieving high performance depends heavily on expert intervention and repeated adjustments rather than simply generating correct code. When applied directly to these tasks, LLMs often lack fine-grained domain priors, and existing MLE approaches that use linear or tree-structured searches limit knowledge transfer to adjacent hierarchical links. As a result, they cannot leverage past full trajectories or share information across branches, limiting self-evolving ability and search space diversity. To address these limitations, we introduce AutoMLGen, an LLM-based coding agent that integrates a domain knowledge base for high-quality prior guidance and Monte Carlo Graph Search (MCGS) for efficient exploration. MCGS retains the tree-guided exploration of MCTS while embedding a graph structure into the expansion stage to enable dynamic path reorganization, historical trajectory reuse, and multi-solution fusion to support both self-evolution and collaborative learning. Combined with fine-grained operator sets, this design improves stability and accelerates convergence. Evaluation on the MLE-Bench shows that AutoMLGen achieves state-of-the-art performance in numerous dimensions, such as the average medal rate and the valid submission rate, under a 12-hour budget (half the standard runtime). The code is available at https://github.com/Alpha-Innovator/InternAgent.
【3】Reinforcing Diffusion Models by Direct Group Preference Optimization
标题:通过直接群体偏好优化增强扩散模型
链接:https://arxiv.org/abs/2510.08425
作者:Yihong Luo, Tianyang Hu, Jing Tang
摘要:While reinforcement learning methods such as Group Relative Policy Optimization (GRPO) have significantly enhanced Large Language Models, adapting them to diffusion models remains challenging. In particular, GRPO demands a stochastic policy, yet the most cost-effective diffusion samplers are based on deterministic ODEs. Recent work addresses this issue by using inefficient SDE-based samplers to induce stochasticity, but this reliance on model-agnostic Gaussian noise leads to slow convergence. To resolve this conflict, we propose Direct Group Preference Optimization (DGPO), a new online RL algorithm that dispenses with the policy-gradient framework entirely. DGPO learns directly from group-level preferences, which utilize relative information of samples within groups. This design eliminates the need for inefficient stochastic policies, unlocking the use of efficient deterministic ODE samplers and faster training. Extensive results show that DGPO trains around 20 times faster than existing state-of-the-art methods and achieves superior performance on both in-domain and out-of-domain reward metrics. Code is available at https://github.com/Luo-Yihong/DGPO.
【4】Counterfactual Identifiability via Dynamic Optimal Transport
标题:通过动态最优传输的反事实可识别性
链接:https://arxiv.org/abs/2510.08294
作者:Fabio De Sousa Ribeiro, Ainkaran Santhirasekaram, Ben Glocker
备注:Accepted at NeurIPS 2025
摘要:We address the open question of counterfactual identification for high-dimensional multivariate outcomes from observational data. Pearl (2000) argues that counterfactuals must be identifiable (i.e., recoverable from the observed data distribution) to justify causal claims. A recent line of work on counterfactual inference shows promising results but lacks identification, undermining the causal validity of its estimates. To address this, we establish a foundation for multivariate counterfactual identification using continuous-time flows, including non-Markovian settings under standard criteria. We characterise the conditions under which flow matching yields a unique, monotone and rank-preserving counterfactual transport map with tools from dynamic optimal transport, ensuring consistent inference. Building on this, we validate the theory in controlled scenarios with counterfactual ground-truth and demonstrate improvements in axiomatic counterfactual soundness on real images.
【5】Arbitrary Entropy Policy Optimization: Entropy Is Controllable in Reinforcement Finetuning
标题:任意熵策略优化:强化微调中的熵是可控的
链接:https://arxiv.org/abs/2510.08141
作者:Chen Wang, Zhaochun Li, Jionghao Bai, Yuzhi Zhang, Shisheng Cui, Zhou Zhao, Yue Wang
摘要:Reinforcement finetuning (RFT) is essential for enhancing the reasoning capabilities of large language models (LLM), yet the widely adopted Group Relative Policy Optimization (GRPO) suffers from entropy collapse, where entropy monotonically decreases, exploration vanishes, and policies converge prematurely. Existing entropy-regularized methods only partially alleviate this issue while introducing bias and instability, leaving entropy control unresolved and the connection between entropy, exploration, and performance unclear. We propose Arbitrary Entropy Policy Optimization (AEPO), which eliminates entropy collapse by replacing entropy bonuses with REINFORCE policy gradient on temperature-adjusted distributions and stabilizing entropy through temperature regulation. AEPO integrates three key designs: policy gradient as regularization, distribution as regularization, and REINFORCE as regularization, enabling precise entropy control without distorting optimization. Experiments demonstrate three major contributions: AEPO (1) stabilizes entropy at arbitrary target levels, effectively removing collapse in GRPO; (2) reveals a non-monotonic relation where performance first improves then declines with increasing entropy, clarifying the link between entropy, exploration, and reasoning; and (3) generalizes beyond entropy, providing a broader RFT paradigm where superior target distributions can serve as REINFORCE regularizers.
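The link between temperature and entropy that this abstract relies on can be shown numerically. The sketch below is not AEPO itself: it omits the REINFORCE regularizer entirely and only demonstrates, on a hypothetical five-token logit vector, that a temperature can be found by bisection to hit a chosen target entropy.

```python
import numpy as np

def softmax(logits, temperature):
    """Temperature-scaled softmax with max-subtraction for numerical stability."""
    z = logits / temperature
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability entries."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def temperature_for_entropy(logits, target, lo=0.05, hi=50.0, iters=60):
    """Bisection on temperature; the entropy of softmax(logits / T) is non-decreasing in T."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if entropy(softmax(logits, mid)) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

logits = np.array([4.0, 2.0, 1.0, 0.5, 0.0])   # hypothetical next-token logits
target = 1.0                                   # target entropy in nats; must lie between the values at lo and hi
T = temperature_for_entropy(logits, target)
achieved = entropy(softmax(logits, T))
```

Bisection works here because raising the temperature flattens the distribution monotonically, so every reachable entropy level corresponds to a temperature within the search bracket.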
【6】A Unified Multi-Task Learning Framework for Generative Auto-Bidding with Validation-Aligned Optimization
标题:统一的多任务学习框架,用于具有验证一致优化的生成自动竞价
链接:https://arxiv.org/abs/2510.07760
作者:Yiqin Lv, Zhiyu Mou, Miao Xu, Jinghao Chen, Qi Wang, Yixiu Mao, Yun Qu, Rongquan Bai, Chuan Yu, Jian Xu, Bo Zheng, Xiangyang Ji
摘要:In online advertising, heterogeneous advertiser requirements give rise to numerous customized bidding tasks that are typically optimized independently, resulting in extensive computation and limited data efficiency. Multi-task learning offers a principled framework to train these tasks jointly through shared representations. However, existing multi-task optimization strategies are primarily guided by training dynamics and often generalize poorly in volatile bidding environments. To this end, we present Validation-Aligned Multi-task Optimization (VAMO), which adaptively assigns task weights based on the alignment between per-task training gradients and a held-out validation gradient, thereby steering updates toward validation improvement and better matching deployment objectives. We further equip the framework with a periodicity-aware temporal module and couple it with an advanced generative auto-bidding backbone to enhance cross-task transfer of seasonal structure and strengthen bidding performance. Meanwhile, we provide theoretical insights into the proposed method, e.g., convergence guarantee and alignment analysis. Extensive experiments on both simulated and large-scale real-world advertising systems consistently demonstrate significant improvements over typical baselines, illuminating the effectiveness of the proposed approach.
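The abstract above describes weighting tasks by how well each task's training gradient aligns with a held-out validation gradient. One way such a rule could look is sketched below, under assumptions: cosine similarity as the alignment score and a softmax with a hypothetical `sharpness` parameter to turn scores into weights. The abstract does not give VAMO's actual weighting formula.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu > 0 and nv > 0 else 0.0

def alignment_weights(task_grads, val_grad, sharpness=5.0):
    """Softmax over scaled cosine alignment (an assumed rule, for illustration)."""
    scores = [sharpness * cosine(g, val_grad) for g in task_grads]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def combined_update(task_grads, weights):
    """Weighted sum of per-task gradients."""
    dim = len(task_grads[0])
    return [sum(w * g[i] for w, g in zip(weights, task_grads)) for i in range(dim)]

# Toy 2-D gradients: task 0 agrees with validation, task 2 opposes it.
task_grads = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
val_grad = [1.0, 0.2]
weights = alignment_weights(task_grads, val_grad)
update = combined_update(task_grads, weights)
```

In this toy case the task whose gradient opposes the validation gradient receives almost no weight, so the combined update has a positive inner product with the validation gradient.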
【7】Surrogate Modeling for the Design of Optimal Lattice Structures using Tensor Completion
标题:利用张量补全设计最优晶格结构的代理建模
链接:https://arxiv.org/abs/2510.07474
作者:Shaan Pakala, Aldair E. Gongora, Brian Giera, Evangelos E. Papalexakis
备注:NeurIPS 2025 AI4Mat Workshop
摘要:When designing new materials, it is often necessary to design a material with specific desired properties. Unfortunately, as new design variables are added, the search space grows exponentially, which makes synthesizing and validating the properties of each material very impractical and time-consuming. In this work, we focus on the design of optimal lattice structures with regard to mechanical performance. Computational approaches, including the use of machine learning (ML) methods, have shown improved success in accelerating materials design. However, these ML methods are still lacking in scenarios when training data (i.e. experimentally validated materials) come from a non-uniformly random sampling across the design space. For example, an experimentalist might synthesize and validate certain materials more frequently because of convenience. For this reason, we suggest the use of tensor completion as a surrogate model to accelerate the design of materials in these atypical supervised learning scenarios. In our experiments, we show that tensor completion is superior to classic ML methods such as Gaussian Process and XGBoost with biased sampling of the search space, with around 5\% increased $R^2$. Furthermore, tensor completion still gives comparable performance with a uniformly random sampling of the entire search space.
【8】Accelerated Aggregated D-Optimal Designs for Estimating Main Effects in Black-Box Models
标题:估计黑盒模型主效应的加速聚集D-最优设计
链接:https://arxiv.org/abs/2510.08465
作者:Chih-Yu Chang, Ming-Chung Chang
摘要:Recent advances in supervised learning have driven growing interest in explaining black-box models, particularly by estimating the effects of input variables on model predictions. However, existing approaches often face key limitations, including poor scalability, sensitivity to out-of-distribution sampling, and instability under correlated features. To address these issues, we propose A2D2E, an $\textbf{E}$stimator based on $\textbf{A}$ccelerated $\textbf{A}$ggregated $\textbf{D}$-Optimal $\textbf{D}$esigns. Our method leverages principled experimental design to improve efficiency and robustness in main effect estimation. We establish theoretical guarantees, including convergence and variance reduction, and validate A2D2E through extensive simulations. We further demonstrate the potential of the proposed method with a case study on real data and applications in language models. The code to reproduce the results can be found at https://github.com/cchihyu/A2D2E.
【9】Optimal Stopping in Latent Diffusion Models
标题:潜扩散模型中的最优停止
链接:https://arxiv.org/abs/2510.08409
作者:Yu-Han Wu, Quentin Berthet, Gérard Biau, Claire Boyer, Romuald Elie, Pierre Marion
摘要:We identify and analyze a surprising phenomenon of Latent Diffusion Models (LDMs) where the final steps of the diffusion can degrade sample quality. In contrast to conventional arguments that justify early stopping for numerical stability, this phenomenon is intrinsic to the dimensionality reduction in LDMs. We provide a principled explanation by analyzing the interaction between latent dimension and stopping time. Under a Gaussian framework with linear autoencoders, we characterize the conditions under which early stopping is needed to minimize the distance between generated and target distributions. More precisely, we show that lower-dimensional representations benefit from earlier termination, whereas higher-dimensional latent spaces require later stopping time. We further establish that the latent dimension interplays with other hyperparameters of the problem such as constraints in the parameters of score matching. Experiments on synthetic and real datasets illustrate these properties, underlining that early stopping can improve generative quality. Together, our results offer a theoretical foundation for understanding how the latent dimension influences the sample quality, and highlight stopping time as a key hyperparameter in LDMs.
【10】From Data to Rewards: a Bilevel Optimization Perspective on Maximum Likelihood Estimation
标题:从数据到回报:最大似然估计的二层优化视角
链接:https://arxiv.org/abs/2510.07624
作者:Abdelhakim Benechehab, Gabriel Singer, Corentin Léger, Youssef Attia El Hili, Giuseppe Paolo, Albert Thomas, Maurizio Filippone, Balázs Kégl
摘要:Generative models form the backbone of modern machine learning, underpinning state-of-the-art systems in text, vision, and multimodal applications. While Maximum Likelihood Estimation has traditionally served as the dominant training paradigm, recent work has highlighted its limitations, particularly in generalization and susceptibility to catastrophic forgetting compared to Reinforcement Learning techniques, such as Policy Gradient methods. However, these approaches depend on explicit reward signals, which are often unavailable in practice, leaving open the fundamental problem of how to align generative models when only high-quality datasets are accessible. In this work, we address this challenge via a Bilevel Optimization framework, where the reward function is treated as the optimization variable of an outer-level problem, while a policy gradient objective defines the inner-level. We then conduct a theoretical analysis of this optimization problem in a tractable setting and extract insights that, as we demonstrate, generalize to applications such as tabular classification and model-based reinforcement learning. We release the code at https://github.com/abenechehab/nll_to_po .
【11】Evaluating and Learning Optimal Dynamic Treatment Regimes under Truncation by Death
标题:评估和学习死亡截尾下的最佳动态治疗方案
链接:https://arxiv.org/abs/2510.07501
作者:Sihyung Park (1), Wenbin Lu (1), Shu Yang (1) ((1) North Carolina State University)
备注:30 pages, 5 figures, 6 tables, The Thirty-Ninth Annual Conference on Neural Information Processing Systems
摘要:Truncation by death, a prevalent challenge in critical care, renders traditional dynamic treatment regime (DTR) evaluation inapplicable due to ill-defined potential outcomes. We introduce a principal stratification-based method, focusing on the always-survivor value function. We derive a semiparametrically efficient, multiply robust estimator for multi-stage DTRs, demonstrating its robustness and efficiency. Empirical validation and an application to electronic health records showcase its utility for personalized treatment optimization.
预测|估计(5篇)
【1】DemandCast: Global hourly electricity demand forecasting
标题:DemandCast:全球每小时电力需求预测
链接:https://arxiv.org/abs/2510.08000
作者:Kevin Steijn, Vamsi Priya Goli, Enrico Antonini
备注:7 pages, 4 figures, accepted at the NeurIPS 2025 Workshop: Tackling Climate Change with Machine Learning
摘要:This paper presents a machine learning framework for electricity demand forecasting across diverse geographical regions using the gradient boosting algorithm XGBoost. The model integrates historical electricity demand and comprehensive weather and socioeconomic variables to predict normalized electricity demand profiles. To enable robust training and evaluation, we developed a large-scale dataset spanning multiple years and countries, applying a temporal data-splitting strategy that ensures benchmarking of out-of-sample performance. Our approach delivers accurate and scalable demand forecasts, providing valuable insights for energy system planners and policymakers as they navigate the challenges of the global energy transition.
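The temporal data-splitting strategy mentioned in the abstract above can be sketched as follows. This is a generic illustration, not the authors' pipeline: the record fields, the cutoff year, and the toy values are all hypothetical.

```python
def temporal_split(records, cutoff_year):
    """Split so that every test record is later than every training record."""
    train = [r for r in records if r["year"] < cutoff_year]
    test = [r for r in records if r["year"] >= cutoff_year]
    return train, test

# Hypothetical toy records; values are made up and not from the paper's dataset.
records = [
    {"region": "A", "year": 2019, "demand": 0.82},
    {"region": "A", "year": 2020, "demand": 0.79},
    {"region": "B", "year": 2021, "demand": 0.91},
    {"region": "B", "year": 2022, "demand": 0.88},
    {"region": "A", "year": 2023, "demand": 0.85},
]
train, test = temporal_split(records, cutoff_year=2022)
```

Splitting by time rather than at random keeps future observations out of training, which is what makes the test error an out-of-sample estimate for forecasting.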
【2】PRESCRIBE: Predicting Single-Cell Responses with Bayesian Estimation
标题:PRESCRIBE:用Bayesian估计预测单细胞反应
链接:https://arxiv.org/abs/2510.07964
作者:Jiabei Cheng, Changxi Chi, Jingbo Zhou, Hongyi Xin, Jun Xia
备注:39th Conference on Neural Information Processing Systems (NeurIPS 2025)
摘要:In single-cell perturbation prediction, a central task is to forecast the effects of perturbing a gene unseen in the training data. The efficacy of such predictions depends on two factors: (1) the similarity of the target gene to those covered in the training data, which informs model (epistemic) uncertainty, and (2) the quality of the corresponding training data, which reflects data (aleatoric) uncertainty. Both factors are critical for determining the reliability of a prediction, particularly as gene perturbation is an inherently stochastic biochemical process. In this paper, we propose PRESCRIBE (PREdicting Single-Cell Response wIth Bayesian Estimation), a multivariate deep evidential regression framework designed to measure both sources of uncertainty jointly. Our analysis demonstrates that PRESCRIBE effectively estimates a confidence score for each prediction, which strongly correlates with its empirical accuracy. This capability enables the filtering of untrustworthy results, and in our experiments, it achieves steady accuracy improvements of over 3% compared to comparable baselines.
【3】Out-of-Distribution Generalization in Climate-Aware Yield Prediction with Earth Observation Data
标题:利用地球观测数据进行气候感知产量预测的分布外泛化
链接:https://arxiv.org/abs/2510.07350
作者:Aditya Chakravarty
摘要:Climate change is increasingly disrupting agricultural systems, making accurate crop yield forecasting essential for food security. While deep learning models have shown promise in yield prediction using satellite and weather data, their ability to generalize across geographic regions and years - critical for real-world deployment - remains largely untested. We benchmark two state-of-the-art models, GNN-RNN and MMST-ViT, under realistic out-of-distribution (OOD) conditions using the large-scale CropNet dataset spanning 1,200+ U.S. counties from 2017-2022. Through leave-one-cluster-out cross-validation across seven USDA Farm Resource Regions and year-ahead prediction scenarios, we identify substantial variability in cross-region transferability. GNN-RNN demonstrates superior generalization with positive correlations under geographic shifts, while MMST-ViT performs well in-domain but degrades sharply under OOD conditions. Regions like Heartland and Northern Great Plains show stable transfer dynamics (RMSE less than 10 bu/acre for soybean), whereas Prairie Gateway exhibits persistent underperformance (RMSE greater than 20 bu/acre) across both models and crops, revealing structural dissimilarities likely driven by semi-arid climate, irrigation patterns, and incomplete spectral coverage. Beyond accuracy differences, GNN-RNN achieves 135x faster training than MMST-ViT (14 minutes vs. 31.5 hours), making it more viable for sustainable deployment. Our findings underscore that spatial-temporal alignment - not merely model complexity or data scale - is key to robust generalization, and highlight the need for transparent OOD evaluation protocols to ensure equitable and reliable climate-aware agricultural forecasting.
【4】Computational and statistical lower bounds for low-rank estimation under general inhomogeneous noise
标题:一般非均匀噪声下低秩估计的计算和统计下界
链接:https://arxiv.org/abs/2510.08541
作者:Debsurya De, Dmitriy Kunisky
备注:52 pages, 3 figures
摘要:Recent work has generalized several results concerning the well-understood spiked Wigner matrix model of a low-rank signal matrix corrupted by additive i.i.d. Gaussian noise to the inhomogeneous case, where the noise has a variance profile. In particular, for the special case where the variance profile has a block structure, a series of results identified an effective spectral algorithm for detecting and estimating the signal, identified the threshold signal strength required for that algorithm to succeed, and proved information-theoretic lower bounds that, for some special signal distributions, match the above threshold. We complement these results by studying the computational optimality of this spectral algorithm. Namely, we show that, for a much broader range of signal distributions, whenever the spectral algorithm cannot detect a low-rank signal, then neither can any low-degree polynomial algorithm. This gives the first evidence for a computational hardness conjecture of Guionnet, Ko, Krzakala, and Zdeborová (2023). With similar techniques, we also prove sharp information-theoretic lower bounds for a class of signal distributions not treated by prior work. Unlike all of the above results on inhomogeneous models, our results do not assume that the variance profile has a block structure, and suggest that the same spectral algorithm might remain optimal for quite general profiles. We include a numerical study of this claim for an example of a smoothly-varying rather than piecewise-constant profile. Our proofs involve analyzing the graph sums of a matrix, which also appear in free and traffic probability, but we require new bounds on these quantities that are tighter than existing ones for non-negative matrices, which may be of independent interest.
【5】A Honest Cross-Validation Estimator for Prediction Performance
标题:预测性能的诚实交叉验证估计
链接:https://arxiv.org/abs/2510.07649
作者:Tianyu Pan, Vincent Z. Yu, Viswanath Devanarayan, Lu Tian
摘要:Cross-validation is a standard tool for obtaining an honest assessment of the performance of a prediction model. The commonly used version repeatedly splits data, trains the prediction model on the training set, evaluates the model performance on the test set, and averages the model performance across different data splits. A well-known criticism is that such a cross-validation procedure does not directly estimate the performance of the particular model recommended for future use. In this paper, we propose a new method to estimate the performance of a model trained on a specific (random) training set. A naive estimator can be obtained by applying the model to a disjoint testing set. Surprisingly, cross-validation estimators computed from other random splits can be used to improve this naive estimator within a random-effects model framework. We develop two estimators -- a hierarchical Bayesian estimator and an empirical Bayes estimator -- that perform similarly to or better than both the conventional cross-validation estimator and the naive single-split estimator. Simulations and a real-data example demonstrate the superior performance of the proposed method.
其他神经网络|深度学习|模型|建模(30篇)
【1】Who Said Neural Networks Aren't Linear?
标题:谁说神经网络不是线性的?
链接:https://arxiv.org/abs/2510.08570
作者:Nimrod Berman, Assaf Hallak, Assaf Shocher
摘要:Neural networks are famously nonlinear. However, linearity is defined relative to a pair of vector spaces, $f: X \to Y$. Is it possible to identify a pair of non-standard vector spaces for which a conventionally nonlinear function is, in fact, linear? This paper introduces a method that makes such vector spaces explicit by construction. We find that if we sandwich a linear operator $A$ between two invertible neural networks, $f(x)=g_y^{-1}(A g_x(x))$, then the corresponding vector spaces $X$ and $Y$ are induced by newly defined addition and scaling actions derived from $g_x$ and $g_y$. We term this kind of architecture a Linearizer. This framework makes the entire arsenal of linear algebra, including SVD, pseudo-inverse, orthogonal projection and more, applicable to nonlinear mappings. Furthermore, we show that the composition of two Linearizers that share a neural network is also a Linearizer. We leverage this property and demonstrate that training diffusion models using our architecture makes the hundreds of sampling steps collapse into a single step. We further utilize our framework to enforce idempotency (i.e. $f(f(x))=f(x)$) on networks leading to a globally projective generative model and to demonstrate modular style transfer.
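The construction $f(x)=g_y^{-1}(A g_x(x))$ in the abstract above can be checked numerically on a toy case. In the sketch below, simple invertible elementwise maps (`sinh` and cubing) stand in for the invertible neural networks, and the induced operations are taken to be $u \oplus v = g^{-1}(g(u)+g(v))$ and $c \odot u = g^{-1}(c\,g(u))$. These choices, the matrix `A`, and all names are illustrative assumptions, not the paper's implementation.

```python
import math

# Stand-ins for the invertible networks g_x and g_y (assumed, elementwise).
def g_x(v): return [math.sinh(t) for t in v]
def g_x_inv(v): return [math.asinh(t) for t in v]
def g_y(v): return [t ** 3 for t in v]
def g_y_inv(v): return [math.copysign(abs(t) ** (1.0 / 3.0), t) for t in v]

A = [[2.0, -1.0], [0.5, 3.0]]  # arbitrary linear operator

def matvec(M, v):
    return [sum(m * t for m, t in zip(row, v)) for row in M]

def f(x):
    """Linearizer-style map: f(x) = g_y^{-1}(A g_x(x))."""
    return g_y_inv(matvec(A, g_x(x)))

def add_in(g, g_inv, u, v):
    """Induced addition: g^{-1}(g(u) + g(v))."""
    return g_inv([a + b for a, b in zip(g(u), g(v))])

def scale_in(g, g_inv, c, u):
    """Induced scaling: g^{-1}(c * g(u))."""
    return g_inv([c * a for a in g(u)])

x1, x2, c = [0.3, -1.2], [1.1, 0.4], -2.5

# Linearity with respect to the induced operations.
lhs_add = f(add_in(g_x, g_x_inv, x1, x2))
rhs_add = add_in(g_y, g_y_inv, f(x1), f(x2))
lhs_scale = f(scale_in(g_x, g_x_inv, c, x1))
rhs_scale = scale_in(g_y, g_y_inv, c, f(x1))

# With ordinary vector addition, the same f is not additive.
plain_lhs = f([a + b for a, b in zip(x1, x2)])
plain_rhs = [a + b for a, b in zip(f(x1), f(x2))]
```

Both sides of the induced-addition identity reduce to $g_y^{-1}(A g_x(x_1) + A g_x(x_2))$, which is why the check holds for any invertible `g_x`, `g_y` and any matrix `A`.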
【2】ArenaBencher: Automatic Benchmark Evolution via Multi-Model Competitive Evaluation
标题:ArenaBencher:通过多模型竞争评估自动基准进化
链接:https://arxiv.org/abs/2510.08569
作者:Qin Liu, Jacob Dineen, Yuxi Huang, Sheng Zhang, Hoifung Poon, Ben Zhou, Muhao Chen
备注:Preprint
摘要:Benchmarks are central to measuring the capabilities of large language models and guiding model development, yet widespread data leakage from pretraining corpora undermines their validity. Models can match memorized content rather than demonstrate true generalization, which inflates scores, distorts cross-model comparisons, and misrepresents progress. We introduce ArenaBencher, a model-agnostic framework for automatic benchmark evolution that updates test cases while preserving comparability. Given an existing benchmark and a diverse pool of models to be evaluated, ArenaBencher infers the core ability of each test case, generates candidate question-answer pairs that preserve the original objective, verifies correctness and intent with an LLM as a judge, and aggregates feedback from multiple models to select candidates that expose shared weaknesses. The process runs iteratively with in-context demonstrations that steer generation toward more challenging and diagnostic cases. We apply ArenaBencher to math problem solving, commonsense reasoning, and safety domains and show that it produces verified, diverse, and fair updates that uncover new failure modes, increase difficulty while preserving test objective alignment, and improve model separability. The framework provides a scalable path to continuously evolve benchmarks in step with the rapid progress of foundation models.
【3】How to Teach Large Multimodal Models New Skills
标题:如何教授大型多模态模型新技能
链接:https://arxiv.org/abs/2510.08564
作者:Zhen Zhu, Yiming Gong, Yao Xiao, Yaoyao Liu, Derek Hoiem
备注:In submission. Code is available at https://github.com/jessemelpolio/LMM_CL
摘要:How can we teach large multimodal models (LMMs) new skills without erasing prior abilities? We study sequential fine-tuning on five target skills while monitoring general ability on eight held-out benchmarks across three model families. We observe that apparent "forgetting" on held-out tasks after narrow fine-tuning can partly recover at later stages. We trace this behavior to a measurable shift in the output token distribution, manifested through a simple counting-bias probe that co-varies with forgetting. Guided by this picture, we identify two simple, robust tuning recipes that learn strongly while limiting drift: (i) updating only the self-attention projection layers, and (ii) updating only the MLP Gate&Up while freezing the Down projection. Across models and tasks, these choices deliver strong target gains while largely preserving held-out performance. Code is available at https://github.com/jessemelpolio/LMM_CL
【4】Agent Learning via Early Experience
标题:通过早期经验进行代理学习
链接:https://arxiv.org/abs/2510.08558
作者:Kai Zhang, Xiangchao Chen, Bo Liu, Tianci Xue, Zeyi Liao, Zhihan Liu, Xiyao Wang, Yuting Ning, Zhaorun Chen, Xiaohan Fu, Jian Xie, Yuxuan Sun, Boyu Gou, Qi Qi, Zihang Meng, Jianwei Yang, Ning Zhang, Xian Li, Ashish Shah, Dat Huynh, Hengduo Li, Zi Yang, Sara Cao, Lawrence Jang, Shuyan Zhou, Jiacheng Zhu, Huan Sun, Jason Weston, Yu Su, Yifan Wu
备注:Work in progress
摘要:A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data with reinforcement learning remains difficult in many environments, which either lack verifiable rewards (e.g., websites) or require inefficient long-horizon rollouts (e.g., multi-turn tool use). As a result, most current agents rely on supervised fine-tuning on expert data, which is challenging to scale and generalizes poorly. This limitation stems from the nature of expert demonstrations: they capture only a narrow range of scenarios and expose the agent to limited environment diversity. We address this limitation with a middle-ground paradigm we call early experience: interaction data generated by the agent's own actions, where the resulting future states serve as supervision without reward signals. Within this paradigm we study two strategies of using such data: (1) Implicit world modeling, which uses collected states to ground the policy in environment dynamics; and (2) Self-reflection, where the agent learns from its suboptimal actions to improve reasoning and decision-making. We evaluate across eight diverse environments and multiple model families. Our approaches consistently improve effectiveness and out-of-domain generalization, highlighting the value of early experience. Moreover, in environments with verifiable rewards, our results provide promising signals that early experience offers a strong foundation for subsequent reinforcement learning, positioning it as a practical bridge between imitation learning and fully experience-driven agents.
【5】Better Together: Leveraging Unpaired Multimodal Data for Stronger Unimodal Models
标题:更好地在一起:利用未配对的多模态数据建立更强大的单模态模型
链接:https://arxiv.org/abs/2510.08492
作者:Sharut Gupta, Shobhita Sundaram, Chenyu Wang, Stefanie Jegelka, Phillip Isola
备注:63 pages, 29 tables, and 47 figures
摘要:Traditional multimodal learners find unified representations for tasks like visual question answering, but rely heavily on paired datasets. However, an overlooked yet potentially powerful question is: can one leverage auxiliary unpaired multimodal data to directly enhance representation learning in a target modality? We introduce UML: Unpaired Multimodal Learner, a modality-agnostic training paradigm in which a single model alternately processes inputs from different modalities while sharing parameters across them. This design exploits the assumption that different modalities are projections of a shared underlying reality, allowing the model to benefit from cross-modal structure without requiring explicit pairs. Theoretically, under linear data-generating assumptions, we show that unpaired auxiliary data can yield representations strictly more informative about the data-generating process than unimodal training. Empirically, we show that using unpaired data from auxiliary modalities -- such as text, audio, or images -- consistently improves downstream performance across diverse unimodal targets such as image and audio. Our project page: https://unpaired-multimodal.github.io/
【6】DexMan: Learning Bimanual Dexterous Manipulation from Human and Generated Videos
标题:DexMan:从人类视频和生成视频中学习双手灵巧操作
链接:https://arxiv.org/abs/2510.08475
作者:Jhen Hsieh, Kuan-Hsun Tu, Kuo-Han Hung, Tsung-Wei Ke
备注:Video results are available at: this https URL
摘要:We present DexMan, an automated framework that converts human visual demonstrations into bimanual dexterous manipulation skills for humanoid robots in simulation. Operating directly on third-person videos of humans manipulating rigid objects, DexMan eliminates the need for camera calibration, depth sensors, scanned 3D object assets, or ground-truth hand and object motion annotations. Unlike prior approaches that consider only simplified floating hands, it directly controls a humanoid robot and leverages novel contact-based rewards to improve policy learning from noisy hand-object poses estimated from in-the-wild videos. DexMan achieves state-of-the-art performance in object pose estimation on the TACO benchmark, with absolute gains of 0.08 and 0.12 in ADD-S and VSD. Meanwhile, its reinforcement learning policy surpasses previous methods by 19% in success rate on OakInk-v2. Furthermore, DexMan can generate skills from both real and synthetic videos, without the need for manual data collection and costly motion capture, enabling the creation of large-scale, diverse datasets for training generalist dexterous manipulation.
【7】Looking to Learn: Token-wise Dynamic Gating for Low-Resource Vision-Language Modelling
标题:期待学习:用于低资源视觉语言建模的令牌式动态门控
链接:https://arxiv.org/abs/2510.08470
作者:Bianca-Mihaela Ganescu, Suchir Salhan, Andrew Caines, Paula Buttery
备注:Accepted to the EMNLP 2025 BabyLM Workshop
摘要:Training vision-language models on cognitively-plausible amounts of data requires rethinking how models integrate multimodal information. Within the constraints of the Vision track for the BabyLM Challenge 2025, we propose a lightweight decoder-based architecture with (1) token-wise dynamic gating for adaptive fusion of linguistic and visual cues, (2) feature modulation and channel attention to maximise the utility of limited visual information and (3) auxiliary contrastive objectives for visual grounding. Evaluation on five benchmarks (BLiMP, BLiMP Supplement, EWoK, Winoground and VQA) shows competitive or superior performance to multimodal baselines. More notably, our dynamic gate discovers interpretable patterns without explicit supervision, favouring visual cues for content words and linguistic cues for function words. While we identify limitations in the Challenge constraints, such as the information bottleneck created by global image embeddings and training instability from the dataset split, our findings establish dynamic gating as a powerful tool for efficient multimodal learning, offering both interpretability and performance even under severe constraints.
【8】Don't Run with Scissors: Pruning Breaks VLA Models but They Can Be Recovered
标题:不要拿着剪刀跑步:修剪会破坏VLA模型,但可以修复
链接:https://arxiv.org/abs/2510.08464
作者:Jason Jabbour, Dong-Ki Kim, Max Smith, Jay Patrikar, Radhika Ghosal, Youhui Wang, Ali Agha, Vijay Janapa Reddi, Shayegan Omidshafiei
摘要:Vision-Language-Action (VLA) models have advanced robotic capabilities but remain challenging to deploy on resource-limited hardware. Pruning has enabled efficient compression of large language models (LLMs), yet it is largely understudied in robotics. Surprisingly, we observe that pruning VLA models leads to drastic degradation and increased safety violations. We introduce GLUESTICK, a post-pruning recovery method that restores much of the original model's functionality while retaining sparsity benefits. Our method performs a one-time interpolation between the dense and pruned models in weight-space to compute a corrective term. This correction is used during inference by each pruned layer to recover lost capabilities with minimal overhead. GLUESTICK requires no additional training, is agnostic to the pruning algorithm, and introduces a single hyperparameter that controls the tradeoff between efficiency and accuracy. Across diverse VLA architectures and tasks in manipulation and navigation, GLUESTICK achieves competitive memory efficiency while substantially recovering success rates and reducing safety violations. Additional material can be found at: https://gluestick-vla.github.io/.
【9】SummDiff: Generative Modeling of Video Summarization with Diffusion
标题:SummDiff:具有扩散的视频摘要生成建模
链接:https://arxiv.org/abs/2510.08458
作者:Kwanseok Kim, Jaehoon Hahm, Sumin Kim, Jinhwan Sul, Byunghak Kim, Joonseok Lee
摘要:Video summarization is a task of shortening a video by choosing a subset of frames while preserving its essential moments. Despite the innate subjectivity of the task, previous works have deterministically regressed to an averaged frame score over multiple raters, ignoring the inherent subjectivity of what constitutes a good summary. We propose a novel problem formulation by framing video summarization as a conditional generation task, allowing a model to learn the distribution of good summaries and to generate multiple plausible summaries that better reflect varying human perspectives. Adopting diffusion models for the first time in video summarization, our proposed method, SummDiff, dynamically adapts to visual contexts and generates multiple candidate summaries conditioned on the input video. Extensive experiments demonstrate that SummDiff not only achieves the state-of-the-art performance on various benchmarks but also produces summaries that closely align with individual annotator preferences. Moreover, we provide a deeper insight with novel metrics from an analysis of the knapsack, which is an important last step of generating summaries but has been overlooked in evaluation.
【10】Integral Signatures of Activation Functions: A 9-Dimensional Taxonomy and Stability Theory for Deep Learning
标题:激活函数的积分签名:深度学习的9维分类和稳定性理论
链接:https://arxiv.org/abs/2510.08456
作者:Ankur Mali, Lawrence Hall, Jake Williams, Gordon Richards
备注:25 pages
摘要:Activation functions govern the expressivity and stability of neural networks, yet existing comparisons remain largely heuristic. We propose a rigorous framework for their classification via a nine-dimensional integral signature S_sigma(phi), combining Gaussian propagation statistics (m1, g1, g2, m2, eta), asymptotic slopes (alpha_plus, alpha_minus), and regularity measures (TV(phi'), C(phi)). This taxonomy establishes well-posedness, affine reparameterization laws with bias, and closure under bounded slope variation. Dynamical analysis yields Lyapunov theorems with explicit descent constants and identifies variance stability regions through (m2', g2). From a kernel perspective, we derive dimension-free Hessian bounds and connect smoothness to bounded variation of phi'. Applying the framework, we classify eight standard activations (ReLU, leaky-ReLU, tanh, sigmoid, Swish, GELU, Mish, TeLU), proving sharp distinctions between saturating, linear-growth, and smooth families. Numerical Gauss-Hermite and Monte Carlo validation confirms theoretical predictions. Our framework provides principled design guidance, moving activation choice from trial-and-error to provable stability and kernel conditioning.
【11】Learning What's Missing: Attention Dispersion and EMA Stabilization in Length Generalization
标题:学习缺失之物:长度泛化中的注意力分散和EMA稳定
链接:https://arxiv.org/abs/2510.08341
作者:Pál Zsámboki, Benjamin Levi, David Ansel Josef Smith, Mitansh Kagalwala, Arlington Kell, Samuel Liechty, Cong Wang
备注:10 pages, 5 figures, 2 tables
摘要:We study length generalization in transformers through the set complement task, where a model must predict a uniform distribution over tokens absent from an input sequence -- an ability central to board-game style reasoning. Our main theoretical result establishes two statements. First, we prove tight bounds on embedding and value dimensions for single-layer attention-only transformers. Second, we show that if such a model achieves balanced logit displacement at lengths 1 and 2, then it must generalize to longer sequences, though with reduced precision. A mechanistic reading of the proof explains this limitation: as more tokens are attended to, softmax compresses logit displacements, eroding separation between valid and invalid outputs. Training dynamics also suggest a second obstacle: when many next tokens are possible, updates become noisy. We hypothesize that dropout can counteract the first effect and Exponential Moving Average (EMA) the second. We validate these hypotheses through random hyperparameter search on the set complement task, which confirms both mechanisms. We then test OthelloGPT, a GPT-1 style model trained on random Othello moves, and find that EMA again improves length generalization in this more complex setting.
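The compression effect described in the abstract above can be illustrated with a deliberately simplified toy model. It is an assumption for illustration, not the paper's construction. Suppose attention is uniform over `seq_len` distinct input tokens and each attended token's value vector lowers its own output logit by `displacement`. Averaging then shrinks each present token's logit gap to `displacement / seq_len`, so more probability leaks onto invalid (already present) tokens as sequences grow.

```python
import math

def invalid_mass(vocab_size, seq_len, displacement):
    """Probability placed on tokens already present in the input (toy model).

    Assumes seq_len distinct input tokens and uniform attention, so each
    present token's logit sits displacement / seq_len below the absent ones.
    """
    gap = displacement / seq_len
    present = seq_len * math.exp(-gap)   # unnormalized mass on invalid tokens
    absent = vocab_size - seq_len        # absent tokens keep logit 0
    return present / (present + absent)

# Hypothetical vocabulary size and displacement, chosen only for illustration.
masses = [invalid_mass(vocab_size=64, seq_len=n, displacement=20.0)
          for n in (1, 2, 8, 32)]
```

In this toy setting the invalid mass is negligible at lengths 1 and 2 and grows steadily with length, matching the qualitative claim that generalization to longer sequences comes with reduced precision.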
【12】To Ask or Not to Ask: Learning to Require Human Feedback
标题:问或不问:学会要求人类反馈
链接:https://arxiv.org/abs/2510.08314
作者:Andrea Pugnana, Giovanni De Toni, Cesare Barbera, Roberto Pellungrini, Bruno Lepri, Andrea Passerini
摘要:Developing decision-support systems that complement human performance in classification tasks remains an open challenge. A popular approach, Learning to Defer (LtD), allows a Machine Learning (ML) model to pass difficult cases to a human expert. However, LtD treats humans and ML models as mutually exclusive decision-makers, restricting the expert contribution to mere predictions. To address this limitation, we propose Learning to Ask (LtA), a new framework that handles both when and how to incorporate expert input in an ML model. LtA is based on a two-part architecture: a standard ML model and an enriched model trained with additional expert human feedback, with a formally optimal strategy for selecting when to query the enriched model. We provide two practical implementations of LtA: a sequential approach, which trains the models in stages, and a joint approach, which optimises them simultaneously. For the latter, we design surrogate losses with realisable-consistency guarantees. Our experiments with synthetic and real expert data demonstrate that LtA provides a more flexible and powerful foundation for effective human-AI collaboration.
【13】Robust and Efficient Collaborative Learning
标题:稳健高效的协作学习
链接:https://arxiv.org/abs/2510.08311
作者:Abdellah El Mrini, Sadegh Farhadkhan, Rachid Guerraoui
摘要:Collaborative machine learning is challenged by training-time adversarial behaviors. Existing approaches to tolerate such behaviors either rely on a central server or induce high communication costs. We propose Robust Pull-based Epidemic Learning (RPEL), a novel, scalable collaborative approach to ensure robust learning despite adversaries. RPEL does not rely on any central server and, unlike traditional methods, where communication costs grow in $\mathcal{O}(n^2)$ with the number of nodes $n$, RPEL employs a pull-based epidemic-based communication strategy that scales in $\mathcal{O}(n \log n)$. By pulling model parameters from small random subsets of nodes, RPEL significantly lowers the number of required messages without compromising convergence guarantees, which hold with high probability. Empirical results demonstrate that RPEL maintains robustness in adversarial settings, competes with all-to-all communication accuracy, and scales efficiently across large networks.
【14】Post-hoc Stochastic Concept Bottleneck Models
标题:事后随机概念瓶颈模型
链接:https://arxiv.org/abs/2510.08219
作者:Wiktor Jan Hoffmann, Sonia Laguna, Moritz Vandenhirtz, Emanuele Palumbo, Julia E. Vogt
摘要:Concept Bottleneck Models (CBMs) are interpretable models that predict the target variable through high-level human-understandable concepts, allowing users to intervene on mispredicted concepts to adjust the final output. While recent work has shown that modeling dependencies between concepts can improve CBM performance, especially under interventions, such approaches typically require retraining the entire model, which may be infeasible when access to the original data or compute is limited. In this paper, we introduce Post-hoc Stochastic Concept Bottleneck Models (PSCBMs), a lightweight method that augments any pre-trained CBM with a multivariate normal distribution over concepts by adding only a small covariance-prediction module, without retraining the backbone model. We propose two training strategies and show on real-world data that PSCBMs consistently match or improve both concept and target accuracy over standard CBMs at test time. Furthermore, we show that due to the modeling of concept dependencies, PSCBMs perform much better than CBMs under interventions, while remaining far more efficient than retraining a similar stochastic model from scratch.
【15】FuelCast: Benchmarking Tabular and Temporal Models for Ship Fuel Consumption
标题:FuelCast:船舶燃油消耗的表格和时态模型基准
链接:https://arxiv.org/abs/2510.08217
作者:Justus Viga, Penelope Mueck, Alexander Löser, Torben Weis
备注:This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution will be published in "ECML PKDD Workshop 2025 - Advanced Analytics and Learning on Temporal Data"
摘要:In the shipping industry, fuel consumption and emissions are critical factors due to their significant impact on economic efficiency and environmental sustainability. Accurate prediction of ship fuel consumption is essential for further optimization of maritime operations. However, heterogeneous methodologies and limited high-quality datasets hinder direct comparison of modeling approaches. This paper makes three key contributions: (1) we introduce and release a new dataset (https://huggingface.co/datasets/krohnedigital/FuelCast) comprising operational and environmental data from three ships; (2) we define a standardized benchmark covering tabular regression and time-series regression; and (3) we investigate the application of in-context learning for ship consumption modeling using the TabPFN foundation model, a first in this domain to our knowledge. Our results demonstrate strong performance across all evaluated models, supporting the feasibility of onboard, data-driven fuel prediction. Models incorporating environmental conditions consistently outperform simple polynomial baselines relying solely on vessel speed. TabPFN slightly outperforms other techniques, highlighting the potential of foundation models with in-context learning capabilities for tabular prediction. Furthermore, including temporal context improves accuracy.
【16】DISCO: Diversifying Sample Condensation for Efficient Model Evaluation
标题:DISCO:多样化的样品浓缩,用于有效的模型评估
链接:https://arxiv.org/abs/2510.07959
作者:Alexander Rubinstein, Benjamin Raible, Martin Gubri, Seong Joon Oh
摘要:Evaluating modern machine learning models has become prohibitively expensive. Benchmarks such as LMMs-Eval and HELM demand thousands of GPU hours per model. Costly evaluation reduces inclusivity, slows the cycle of innovation, and worsens environmental impact. The typical approach follows two steps. First, select an anchor subset of data. Second, train a mapping from the accuracy on this subset to the final test result. The drawback is that anchor selection depends on clustering, which can be complex and sensitive to design choices. We argue that promoting diversity among samples is not essential; what matters is to select samples that $\textit{maximise diversity in model responses}$. Our method, $\textbf{Diversifying Sample Condensation (DISCO)}$, selects the top-k samples with the greatest model disagreements. This uses greedy, sample-wise statistics rather than global clustering. The approach is conceptually simpler. From a theoretical view, inter-model disagreement provides an information-theoretically optimal rule for such greedy selection. $\textbf{DISCO}$ shows empirical gains over prior methods, achieving state-of-the-art results in performance prediction across MMLU, Hellaswag, Winogrande, and ARC. Code is available here: https://github.com/arubique/disco-public.
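The selection step described in this abstract (keep the top-k samples on which models disagree most) can be sketched as follows. The disagreement score used here, the variance `p * (1 - p)` of per-model correctness, is an assumption chosen for illustration; the abstract does not specify the statistic, and the function names are hypothetical.

```python
def disagreement(correct_flags):
    """Assumed score: variance p * (1 - p) of 0/1 correctness across models."""
    p = sum(correct_flags) / len(correct_flags)
    return p * (1.0 - p)

def select_top_k(correctness, k):
    """Return indices of the k samples with the highest disagreement score."""
    scores = [disagreement(row) for row in correctness]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

# Rows are samples, columns are models; 1 means the model answered correctly.
correctness = [
    [1, 1, 1, 1],  # every model is right: no disagreement
    [1, 0, 1, 0],  # models split evenly: maximal disagreement
    [0, 0, 0, 0],  # every model is wrong: no disagreement
    [1, 1, 1, 0],  # one dissenting model
]
selected = select_top_k(correctness, k=2)
print(selected)
```

Because each score depends on a single sample only, the selection is a sort rather than a clustering pass, which is the simplification the abstract emphasizes.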
【17】Some theoretical improvements on the tightness of PAC-Bayes risk certificates for neural networks
标题:神经网络PAC-Bayes风险证书严密性的理论改进
链接:https://arxiv.org/abs/2510.07935
作者:Diego García-Pérez, Emilio Parrado-Hernández, John Shawe-Taylor
摘要:This paper presents four theoretical contributions that improve the usability of risk certificates for neural networks based on PAC-Bayes bounds. First, two bounds on the KL divergence between Bernoulli distributions enable the derivation of the tightest explicit bounds on the true risk of classifiers across different ranges of empirical risk. The paper next focuses on the formalization of an efficient methodology based on implicit differentiation that enables the introduction of the optimization of PAC-Bayesian risk certificates inside the loss/objective function used to fit the network/model. The last contribution is a method to optimize bounds on non-differentiable objectives such as the 0-1 loss. These theoretical contributions are complemented with an empirical evaluation on the MNIST and CIFAR-10 datasets. In fact, this paper presents the first non-vacuous generalization bounds on CIFAR-10 for neural networks.
【18】Synergy Between the Strong and the Weak: Spiking Neural Networks are Inherently Self-Distillers
标题:强者与弱者之间的协同作用:尖峰神经网络本质上是自我蒸馏器
链接:https://arxiv.org/abs/2510.07924
作者:Yongqi Ding, Lin Zuo, Mengmeng Jing, Kunshan Yang, Pei He, Tonglan Xie
备注:Accepted by NeurIPS 2025
摘要:Brain-inspired spiking neural networks (SNNs) promise to be a low-power alternative to computationally intensive artificial neural networks (ANNs), although performance gaps persist. Recent studies have improved the performance of SNNs through knowledge distillation, but rely on large teacher models or introduce additional training overhead. In this paper, we show that SNNs can be naturally deconstructed into multiple submodels for efficient self-distillation. We treat each timestep instance of the SNN as a submodel and evaluate its output confidence, thus efficiently identifying the strong and the weak. Based on this strong and weak relationship, we propose two efficient self-distillation schemes: (1) \textbf{Strong2Weak}: During training, the stronger "teacher" guides the weaker "student", effectively improving overall performance. (2) \textbf{Weak2Strong}: The weak serve as the "teacher", distilling the strong in reverse with underlying dark knowledge, again yielding significant performance gains. For both distillation schemes, we offer flexible implementations such as ensemble, simultaneous, and cascade distillation. Experiments show that our method effectively improves the discriminability and overall performance of the SNN, while its adversarial robustness is also enhanced, benefiting from the stability brought by self-distillation. This ingeniously exploits the temporal properties of SNNs and provides insight into how to efficiently train high-performance SNNs.
【19】Weak Form Learning for Mean-Field Partial Differential Equations: an Application to Insect Movement
标题:平均场偏微方程的弱形式学习:在昆虫运动中的应用
链接:https://arxiv.org/abs/2510.07786
作者:Seth Minor, Bret D. Elderd, Benjamin Van Allen, David M. Bortz, Vanja Dukic
备注:39 pages, 16 figures
摘要:Insect species subject to infection, predation, and anisotropic environmental conditions may exhibit preferential movement patterns. Given the innate stochasticity of exogenous factors driving these patterns over short timescales, individual insect trajectories typically obey overdamped stochastic dynamics. In practice, data-driven modeling approaches designed to learn the underlying Fokker-Planck equations from observed insect distributions serve as ideal tools for understanding and predicting such behavior. Understanding dispersal dynamics of crop and silvicultural pests can lead to a better forecasting of outbreak intensity and location, which can result in better pest management. In this work, we extend weak-form equation learning techniques, coupled with kernel density estimation, to learn effective models for lepidopteran larval population movement from highly sparse experimental data. Galerkin methods such as the Weak form Sparse Identification of Nonlinear Dynamics (WSINDy) algorithm have recently proven useful for learning governing equations in several scientific contexts. We demonstrate the utility of the method on a sparse dataset of position measurements of fall armyworms (Spodoptera frugiperda) obtained in simulated agricultural conditions with varied plant resources and infection status.
【20】DEAS: DEtached value learning with Action Sequence for Scalable Offline RL
标题:DEAS:通过可扩展离线RL的动作序列分离价值学习
链接:https://arxiv.org/abs/2510.07730
作者:Changyeon Kim, Haeone Lee, Younggyo Seo, Kimin Lee, Yuke Zhu
备注:Project website: this https URL
摘要:Offline reinforcement learning (RL) presents an attractive paradigm for training intelligent agents without expensive online interactions. However, current approaches still struggle with complex, long-horizon sequential decision making. In this work, we introduce DEtached value learning with Action Sequence (DEAS), a simple yet effective offline RL framework that leverages action sequences for value learning. These temporally extended actions provide richer information than single-step actions and can be interpreted through the options framework via semi-Markov decision process Q-learning, enabling reduction of the effective planning horizon by considering longer sequences at once. However, directly adopting such sequences in actor-critic algorithms introduces excessive value overestimation, which we address through detached value learning that steers value estimates toward in-distribution actions that achieve high return in the offline dataset. We demonstrate that DEAS consistently outperforms baselines on complex, long-horizon tasks from OGBench and can be applied to enhance the performance of large-scale Vision-Language-Action models that predict action sequences, significantly boosting performance in both RoboCasa Kitchen simulation tasks and real-world manipulation tasks.
【21】Accuracy, Memory Efficiency and Generalization: A Comparative Study on Liquid Neural Networks and Recurrent Neural Networks
标题:准确性、记忆效率和概括性:液体神经网络和回归神经网络的比较研究
链接:https://arxiv.org/abs/2510.07578
作者:Shilong Zong, Alex Bierly, Almuatazbellah Boker, Hoda Eldardiry
备注:13 pages, 12 figures. Submitted to IEEE Transactions on Neural Networks and Learning Systems (TNNLS)
摘要:This review aims to conduct a comparative analysis of liquid neural networks (LNNs) and traditional recurrent neural networks (RNNs) and their variants, such as long short-term memory networks (LSTMs) and gated recurrent units (GRUs). The core dimensions of the analysis include model accuracy, memory efficiency, and generalization ability. By systematically reviewing existing research, this paper explores the basic principles, mathematical models, key characteristics, and inherent challenges of these neural network architectures in processing sequential data. Research findings reveal that LNN, as an emerging, biologically inspired, continuous-time dynamic neural network, demonstrates significant potential in handling noisy, non-stationary data, and achieving out-of-distribution (OOD) generalization. Additionally, some LNN variants outperform traditional RNN in terms of parameter efficiency and computational speed. However, RNN remains a cornerstone in sequence modeling due to its mature ecosystem and successful applications across various tasks. This review identifies the commonalities and differences between LNNs and RNNs, summarizes their respective shortcomings and challenges, and points out valuable directions for future research, particularly emphasizing the importance of improving the scalability of LNNs to promote their application in broader and more complex scenarios.
【22】Phase Diagram of Dropout for Two-Layer Neural Networks in the Mean-Field Regime
标题:平均场机制下两层神经网络的Dropout相图
链接:https://arxiv.org/abs/2510.07554
作者:Lénaïc Chizat, Pierre Marion, Yerkin Yesbay
摘要:Dropout is a standard training technique for neural networks that consists of randomly deactivating units at each step of their gradient-based training. It is known to improve performance in many settings, including in the large-scale training of language or vision models. As a first step towards understanding the role of dropout in large neural networks, we study the large-width asymptotics of gradient descent with dropout on two-layer neural networks with the mean-field initialization scale. We obtain a rich asymptotic phase diagram that exhibits five distinct nondegenerate phases depending on the relative magnitudes of the dropout rate, the learning rate, and the width. Notably, we find that the well-studied "penalty" effect of dropout only persists in the limit with impractically small learning rates of order $O(1/\text{width})$. For larger learning rates, this effect disappears and in the limit, dropout is equivalent to a "random geometry" technique, where the gradients are thinned randomly after the forward and backward pass have been computed. In this asymptotic regime, the limit is described by a mean-field jump process where the neurons' update times follow independent Poisson or Bernoulli clocks (depending on whether the learning rate vanishes or not). For some of the phases, we obtain a description of the limit dynamics both in path-space and in distribution-space. The convergence proofs involve a mix of tools from mean-field particle systems and stochastic processes. Together, our results lay the groundwork for a renewed theoretical understanding of dropout in large-scale neural networks.
【23】Targeted Digital Twin via Flow Map Learning and Its Application to Fluid Dynamics
标题:基于流图学习的目标数字孪生及其在流体力学中的应用
链接:https://arxiv.org/abs/2510.07549
作者:Qifan Chen, Zhongshu Xu, Jinjin Zhang, Dongbin Xiu
摘要:We present a numerical framework for constructing a targeted digital twin (tDT) that directly models the dynamics of quantities of interest (QoIs) in a full digital twin (DT). The proposed approach employs memory-based flow map learning (FML) to develop a data-driven model of the QoIs using short bursts of trajectory data generated through repeated executions of the full DT. This renders the construction of the FML-based tDT an entirely offline computational process. During online simulation, the learned tDT can efficiently predict and analyze the long-term dynamics of the QoIs without requiring simulations of the full DT system, thereby achieving substantial computational savings. After introducing the general numerical procedure, we demonstrate the construction and predictive capability of the tDT in a computational fluid dynamics (CFD) example: two-dimensional incompressible flow past a cylinder. The QoIs in this problem are the hydrodynamic forces exerted on the cylinder. The resulting tDTs are compact dynamical systems that evolve these forces without explicit knowledge of the underlying flow field. Numerical results show that the tDTs yield accurate long-term predictions of the forces while entirely bypassing full flow simulations.
【24】Deploying Tiny LVLM Judges for Real-World Evaluation of Chart Models: Lessons Learned and Best Practices
标题:部署微型LVLM判断器用于图表模型的真实评估:经验教训和最佳实践
链接:https://arxiv.org/abs/2510.07545
作者:Md Tahmid Rahman Laskar, Mohammed Saidul Islam, Ridwan Mahbub, Mizanur Rahman, Amran Bhuiyan, Israt Jahan, Mir Tafseer Nayeem, Shafiq Joty, Enamul Hoque, Jimmy Huang
备注:Accepted to the EMNLP 2025 Industry Track
摘要:Large Vision-Language Models (LVLMs) with only 7B parameters have shown promise as automated judges in chart comprehension tasks. However, tiny models (<=2B parameters) still perform poorly as judges, limiting their real-world use in resource-constrained settings. To address this, we propose two approaches to ensure cost-efficient evaluation: (i) multi-criteria prompting, which combines separate evaluation criteria into a single query, and (ii) domain-adaptive transfer learning, in which we fine-tune a 2B-parameter LVLM on synthetic judgments in a chart dataset to create the ChartJudge. Experiments show that multi-criteria prompting exposes robustness gaps, which led to a huge drop in performance for 7B models, including specialized LVLM judges like LLaVA-Critic. In addition, we find that our tiny LVLM (ChartJudge) can effectively transfer knowledge from one dataset to another to make it a more specialized model. Our fine-grained analysis across chart types and query complexities offers actionable insights into trade-offs between model size, prompt design, and transferability, enabling scalable, low-cost evaluation for chart reasoning tasks. Our code and the data will be made publicly available.
【25】metabeta - A fast neural model for Bayesian mixed-effects regression
标题:metabeta -用于Bayesian混合效应回归的快速神经模型
链接:https://arxiv.org/abs/2510.07473
作者:Alex Kipnis, Marcel Binz, Eric Schulz
备注:19 pages, 9 main text, 8 figures
摘要:Hierarchical data with multiple observations per group is ubiquitous in empirical sciences and is often analyzed using mixed-effects regression. In such models, Bayesian inference gives an estimate of uncertainty but is analytically intractable and requires costly approximation using Markov Chain Monte Carlo (MCMC) methods. Neural posterior estimation shifts the bulk of computation from inference time to pre-training time, amortizing over simulated datasets with known ground truth targets. We propose metabeta, a transformer-based neural network model for Bayesian mixed-effects regression. Using simulated and real data, we show that it reaches stable and comparable performance to MCMC-based parameter estimation at a fraction of the usually required time.
【26】Base Models Know How to Reason, Thinking Models Learn When
标题:基础模型知道如何推理,思维模型学习何时推理
链接:https://arxiv.org/abs/2510.07364
作者:Constantin Venhoff, Iván Arcuschin, Philip Torr, Arthur Conmy, Neel Nanda
备注:10 pages
摘要:Why do thinking language models like DeepSeek R1 outperform their base counterparts? Despite consistent performance gains, it remains unclear to what extent thinking models learn entirely new reasoning capabilities or repurpose pre-existing base model ones. In this work, we propose a hybrid model where we activate reasoning mechanisms in base models at the right time to elicit thinking-model-level reasoning chains, implying that thinking models exploit already existing capabilities. To ground our analysis, we introduce an unsupervised, bottom-up approach for uncovering human-interpretable reasoning behaviors in thinking models. This approach provides an unbiased method to discover reasoning behaviors without imposing manual or LLM-derived assumptions. Across three base and four thinking models, using GSM8K and MATH500, our hybrid model recovers up to 91% of the performance gap to thinking models without any weight updates while steering only 12% of tokens. Concretely, our empirical setup provides a simple, causal way to test the effectiveness of existing reasoning mechanisms in base models by invoking them directly and measuring the resulting task performance. More broadly, these results reframe our understanding of how thinking models are trained: pre-training is when models acquire most of their reasoning mechanisms, and post-training teaches efficient deployment of these mechanisms at the right time, enabling efficient use of their inference-time compute.
【27】Permutation-Invariant Spectral Learning via Dyson Diffusion
标题:通过Dyson扩散的排列不变谱学习
链接:https://arxiv.org/abs/2510.08535
作者:Tassilo Schwarz, Cai Dieball, Constantin Kogler, Kevin Lam, Renaud Lambiotte, Arnaud Doucet, Aljaž Godec, George Deligiannidis
摘要:Diffusion models are central to generative modeling and have been adapted to graphs by diffusing adjacency matrix representations. The challenge of having up to $n!$ such representations for graphs with $n$ nodes is only partially mitigated by using permutation-equivariant learning architectures. Despite their computational efficiency, existing graph diffusion models struggle to distinguish certain graph families, unless graph data are augmented with ad hoc features. This shortcoming stems from enforcing the inductive bias within the learning architecture. In this work, we leverage random matrix theory to analytically extract the spectral properties of the diffusion process, allowing us to push the inductive bias from the architecture into the dynamics. Building on this, we introduce the Dyson Diffusion Model, which employs Dyson's Brownian Motion to capture the spectral dynamics of an Ornstein-Uhlenbeck process on the adjacency matrix while retaining all non-spectral information. We demonstrate that the Dyson Diffusion Model learns graph spectra accurately and outperforms existing graph diffusion models.
【28】Wavefunction Flows: Efficient Quantum Simulation of Continuous Flow Models
标题:波函数流:连续流模型的有效量子模拟
链接:https://arxiv.org/abs/2510.08462
作者:David Layden, Ryan Sweke, Vojtěch Havlíček, Anirban Chowdhury, Kirill Neklyudov
摘要:Flow models are a cornerstone of modern machine learning. They are generative models that progressively transform probability distributions according to learned dynamics. Specifically, they learn a continuous-time Markov process that efficiently maps samples from a simple source distribution into samples from a complex target distribution. We show that these models are naturally related to the Schrödinger equation, for an unusual Hamiltonian on continuous variables. Moreover, we prove that the dynamics generated by this Hamiltonian can be efficiently simulated on a quantum computer. Together, these results give a quantum algorithm for preparing coherent encodings (a.k.a., qsamples) for a vast family of probability distributions--namely, those expressible by flow models--by reducing the task to an existing classical learning problem, plus Hamiltonian simulation. For statistical problems defined by flow models, such as mean estimation and property testing, this enables the use of quantum algorithms tailored to qsamples, which may offer advantages over classical algorithms based only on samples from a flow model. More broadly, these results reveal a close connection between state-of-the-art machine learning models, such as flow matching and diffusion models, and one of the main expected capabilities of quantum computers: simulating quantum dynamics.
【29】Decoding the dark proteome: Deep learning-enabled discovery of druggable enzymes in Wuchereria bancrofti
标题:解码黑暗蛋白质组:深度学习使班氏菌中的可药物酶的发现成为可能
链接:https://arxiv.org/abs/2510.07337
作者:Shawnak Shivakumar, Jefferson Hernandez
备注:Accepted for peer-reviewed publication at the STEM Fellowship Journal
摘要:Wuchereria bancrofti, the parasitic roundworm responsible for lymphatic filariasis, permanently disables over 36 million people and places 657 million at risk across 39 countries. A major bottleneck for drug discovery is the lack of functional annotation for more than 90 percent of the W. bancrofti dark proteome, leaving many potential targets unidentified. In this work, we present a novel computational pipeline that converts W. bancrofti's unannotated amino acid sequence data into precise four-level Enzyme Commission (EC) numbers and drug candidates. We utilized a DEtection TRansformer to estimate the probability of enzymatic function, fine-tuned a hierarchical nearest neighbor EC predictor on 4,476 labeled parasite proteins, and applied rejection sampling to retain only four-level EC classifications at 100 percent confidence. This pipeline assigned precise EC numbers to 14,772 previously uncharacterized proteins and discovered 543 EC classes not previously known in W. bancrofti. A qualitative triage emphasizing parasite-specific targets, chemical tractability, biochemical importance, and biological plausibility prioritized six enzymes across five separate strategies: anti-Wolbachia cell-wall inhibition, proteolysis blockade, transmission disruption, purinergic immune interference, and cGMP-signaling destabilization. We curated a 43-compound library from ChEMBL and BindingDB and co-folded across multiple protein conformers with Boltz-2. All six targets exhibited at least moderately strong predicted binding affinities below 1 micromolar, with moenomycin analogs against peptidoglycan glycosyltransferase and NTPase inhibitors showing promising nanomolar hits and well-defined binding pockets. While experimental validation remains essential, our results provide the first large-scale functional map of the W. bancrofti dark proteome and accelerate early-stage drug development for the species.
【30】Geodesics in the Deep Linear Network
标题:深度线性网络中的测地线
链接:https://arxiv.org/abs/2510.07324
作者:Alan Chen
摘要:We derive a general system of ODEs and associated explicit solutions in a special case for geodesics between full rank matrices in the deep linear network geometry. In the process, we characterize all horizontal straight lines in the invariant balanced manifold that remain geodesics under Riemannian submersion.
其他(31篇)
【1】Where Have All the Kaczmarz Iterates Gone?
标题:卡兹马尔兹的所有迭代都去哪儿了?
链接:https://arxiv.org/abs/2510.08563
作者:El Houcine Bergou, Soumia Boucherouite, Aritra Dutta, Xin Li, Anna Ma
摘要:The randomized Kaczmarz (RK) algorithm is one of the most computationally and memory-efficient iterative algorithms for solving large-scale linear systems. However, practical applications often involve noisy and potentially inconsistent systems. While the convergence of RK is well understood for consistent systems, the study of RK on noisy, inconsistent linear systems is limited. This paper investigates the asymptotic behavior of RK iterates in expectation when solving noisy and inconsistent systems, addressing the locations of their limit points. We explore the roles of singular vectors of the (noisy) coefficient matrix and derive bounds on the convergence horizon, which depend on the noise levels and system characteristics. Finally, we provide extensive numerical experiments that validate our theoretical findings, offering practical insights into the algorithm's performance under realistic conditions. These results establish a deeper understanding of the RK algorithm's limitations and robustness in noisy environments, paving the way for optimized applications in real-world scientific and engineering problems.
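For readers unfamiliar with the algorithm discussed in this abstract, here is a minimal sketch of the standard randomized Kaczmarz iteration, with rows sampled in proportion to their squared norms. The small 3x2 system, the noise values, and the function name are illustrative assumptions and are not taken from the paper.

```python
import random

def randomized_kaczmarz(A, b, iters=2000, seed=0):
    """Standard RK: project the iterate onto one randomly chosen row equation."""
    rng = random.Random(seed)
    row_norms_sq = [sum(a * a for a in row) for row in A]
    x = [0.0] * len(A[0])
    for _ in range(iters):
        # Sample row i with probability proportional to ||a_i||^2.
        i = rng.choices(range(len(A)), weights=row_norms_sq)[0]
        residual = b[i] - sum(a * xi for a, xi in zip(A[i], x))
        step = residual / row_norms_sq[i]
        x = [xi + step * a for xi, a in zip(x, A[i])]
    return x

A = [[2.0, 1.0], [1.0, 3.0], [1.0, -1.0]]
x_true = [1.0, 2.0]
b = [4.0, 7.0, -1.0]          # consistent: b = A @ x_true
b_noisy = [4.1, 6.9, -1.2]    # illustrative noise makes the system inconsistent

x_hat = randomized_kaczmarz(A, b)
x_noisy = randomized_kaczmarz(A, b_noisy)
residuals = [bi - sum(a * xi for a, xi in zip(row, x_noisy))
             for row, bi in zip(A, b_noisy)]
print(x_hat, x_noisy, residuals)
```

On the consistent system the iterates converge to `x_true`. On the inconsistent one, no point satisfies all three equations, so the iterates keep moving between the row hyperplanes instead of settling, which is the regime the abstract studies.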
【2】Implementing Semantic Join Operators Efficiently
标题:有效实施语义加入操作符
链接:https://arxiv.org/abs/2510.08489
作者:Immanuel Trummer
摘要:Semantic query processing engines often support semantic joins, enabling users to match rows that satisfy conditions specified in natural language. Such join conditions can be evaluated using large language models (LLMs) that solve novel tasks without task-specific training. Currently, many semantic query processing engines implement semantic joins via nested loops, invoking the LLM to evaluate the join condition on row pairs. Instead, this paper proposes a novel algorithm, inspired by the block nested loops join operator implementation in traditional database systems. The proposed algorithm integrates batches of rows from both input tables into a single prompt. The goal of the LLM invocation is to identify all matching row pairs in the current input. The paper introduces formulas that can be used to optimize the size of the row batches, taking into account constraints on the size of the LLM context window (limiting both input and output size). An adaptive variant of the proposed algorithm refers to cases in which the size of the output is difficult to estimate. A formal analysis of asymptotic processing costs, as well as empirical results, demonstrates that the proposed approach reduces costs significantly and performs well compared to join implementations used by recent semantic query processing engines.
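The contrast this abstract draws between row-pair nested loops and a block-style join can be sketched as follows. This is an illustration only: `fake_llm_match` is a hypothetical stand-in for a single LLM prompt containing both batches, the batch sizes are arbitrary, and the paper's formulas for choosing batch sizes are not reproduced here.

```python
from itertools import product

def batches(rows, size):
    """Split rows into consecutive chunks of at most `size` rows."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def block_semantic_join(left, right, left_batch, right_batch, match_block):
    """Call `match_block` once per pair of batches; return matches and call count."""
    pairs, calls = [], 0
    for lb in batches(left, left_batch):
        for rb in batches(right, right_batch):
            calls += 1  # one (simulated) LLM invocation per block pair
            pairs.extend(match_block(lb, rb))
    return pairs, calls

def fake_llm_match(lb, rb):
    # Hypothetical stand-in for the LLM: case-insensitive equality as the "condition".
    return [(l, r) for l, r in product(lb, rb) if l.lower() == r.lower()]

left = ["Apple", "banana", "Cherry", "date"]
right = ["APPLE", "cherry", "fig"]

block_pairs, block_calls = block_semantic_join(left, right, 2, 2, fake_llm_match)
row_pairs, row_calls = block_semantic_join(left, right, 1, 1, fake_llm_match)
print(block_calls, row_calls, block_pairs)
```

With batch size 1 on both sides the function degenerates to the nested-loops baseline, so the two call counts show how batching reduces the number of invocations while returning the same matches.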
【3】gLSTM: Mitigating Over-Squashing by Increasing Storage Capacity
标题:gLSTM:通过增加存储容量缓解过度挤压
链接:https://arxiv.org/abs/2510.08450
作者:Hugh Blayney, Álvaro Arroyo, Xiaowen Dong, Michael M. Bronstein
备注:22 pages, 22 figures, 7 tables
摘要:Graph Neural Networks (GNNs) leverage the graph structure to transmit information between nodes, typically through the message-passing mechanism. While these models have found a wide variety of applications, they are known to suffer from over-squashing, where information from a large receptive field of node representations is collapsed into a single fixed sized vector, resulting in an information bottleneck. In this paper, we re-examine the over-squashing phenomenon through the lens of model storage and retrieval capacity, which we define as the amount of information that can be stored in a node's representation for later use. We study some of the limitations of existing tasks used to measure over-squashing and introduce a new synthetic task to demonstrate that an information bottleneck can saturate this capacity. Furthermore, we adapt ideas from the sequence modeling literature on associative memories, fast weight programmers, and the xLSTM model to develop a novel GNN architecture with improved capacity. We demonstrate strong performance of this architecture both on our capacity synthetic task, as well as a range of real-world graph benchmarks.
【4】Prompts Generalize with Low Data: Non-vacuous Generalization Bounds for Optimizing Prompts with More Informative Priors
标题:提示在少量数据下即可泛化:利用信息更丰富的先验优化提示的非空泛化界
链接:https://arxiv.org/abs/2510.08413
作者:David Madras, Joshua Safyan, Qiuyi (Richard) Zhang
备注:EXAIT Workshop paper at ICML 2025
摘要:Many prompt engineering techniques have been successful in practice, even when optimizing over a large prompt space with a small amount of task-specific data. Recent work has partially explained this success by showing generalization bounds which apply PAC-Bayes theory to the discrete prompt space, but they are non-vacuous only in data-rich scenarios. We argue that such widespread success can be more fully explained by more carefully considering data- or distribution-dependent perplexity, which acts as an effective prior and steers the optimization towards prompts that are more "natural" for the task at hand. We derive novel generalization bounds that are non-vacuous for data-scarce prompt optimization via more useful priors, formally analyzing how perplexity regularization tightens these bounds by limiting exploration. Empirically, we explore both the bounds' effectiveness and the practical benefits of perplexity regularization in improving prompt generalization.
【5】Single layer tiny Co$^4$ outpaces GPT-2 and GPT-BERT
标题:单层微型Co$^4$性能优于GPT-2和GPT-BERT
链接:https://arxiv.org/abs/2510.08404
作者:Noor Ul Zain, Mohsin Raza, Ahsan Adeel
摘要:We show that a tiny Co$^4$ machine (Adeel, 2025) with a single layer, two heads, and 8M parameters, operating at an approximate cost of $O(N)$ (where $N$ is the number of input tokens), outpaces the BabyLM Challenge baselines GPT-2 (124M, 12 layers, $O(N^2)$) and GPT-BERT (30M, 12 layers, $O(N^2)$) in just two epochs, while both are trained for ten. Co$^4$ achieves orders-of-magnitude greater training efficiency on 10M tokens, demonstrating highly sample-efficient pretraining. Using the BabyLM Challenge evaluation pipeline across complex benchmarks, Co$^4$ exhibits strong zero-shot and fine-tuning performance on SuperGLUE tasks. Specifically, Co$^4$ outperforms GPT-2 on 5 out of 7 zero-shot metrics and 6 out of 7 fine-tuning tasks, and GPT-BERT on 4 out of 7 metrics in both cases. These results suggest the need to rethink prevailing deep learning paradigms and associated scaling laws.
【6】FlyLoRA: Boosting Task Decoupling and Parameter Efficiency via Implicit Rank-Wise Mixture-of-Experts
标题:FlyLoRA:通过隐式按秩专家混合提升任务解耦和参数效率
链接:https://arxiv.org/abs/2510.08396
作者:Heming Zou, Yunliang Zang, Wutong Xu, Yao Zhu, Xiangyang Ji
备注:NeurIPS 2025 accepted paper
摘要:Low-Rank Adaptation (LoRA) is a widely used parameter-efficient fine-tuning method for foundation models, but it suffers from parameter interference, resulting in suboptimal performance. Although Mixture-of-Experts (MoE)-based LoRA variants show promise in mitigating intra-task correlations in single-task instruction tuning, they introduce additional router parameters and remain ineffective in multi-task model merging where inter-task interference arises. Inspired by the fly olfactory circuit, we propose FlyLoRA, an implicit MoE-based LoRA variant that introduces: (1) rank-wise expert activation in the up-projection matrix, and (2) an implicit router that unifies expert routing and down-projection, where a frozen sparse random projection matrix replaces the traditional dense trainable version. This design resolves the trade-off between intra-task decorrelation and computational efficiency by eliminating the need for an explicit router, while inherently mitigating inter-task interference due to the orthogonality property of random matrices. Extensive experiments across four domains -- general knowledge understanding, scientific question answering, mathematical reasoning, and code generation -- demonstrate consistent performance improvements over existing methods. Beyond empirical gains, FlyLoRA highlights how biological structures can inspire innovations in AI technologies. Code is available at https://github.com/gfyddha/FlyLoRA.
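A rough sketch of the two ingredients named in this abstract, a frozen sparse random down-projection and rank-wise activation in the up-projection, might look as follows. The details are assumptions and not taken from the paper: the ±1 sparse entries, the top-k-by-magnitude selection rule, and the names `make_sparse_projection` and `fly_lora_delta` are all hypothetical.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def make_sparse_projection(rank: int, d_in: int, nonzeros_per_row: int, seed: int = 0) -> Matrix:
    """Frozen sparse random matrix of shape (rank, d_in) with +/-1 entries (assumed form)."""
    rng = random.Random(seed)
    rows: Matrix = []
    for _ in range(rank):
        row = [0.0] * d_in
        for idx in rng.sample(range(d_in), nonzeros_per_row):
            row[idx] = rng.choice((-1.0, 1.0))
        rows.append(row)
    return rows


def fly_lora_delta(
    x: List[float], a_frozen: Matrix, b_trainable: Matrix, top_k: int
) -> Tuple[List[float], List[int]]:
    """Low-rank update B[:, active] @ (A x)[active], with `active` chosen per input.

    a_frozen has shape (rank, d_in); b_trainable has shape (d_out, rank).
    """
    # Down-projection through the frozen matrix; it doubles as the router.
    h = [sum(a * xi for a, xi in zip(row, x)) for row in a_frozen]
    # Implicit routing (assumption): keep the top_k ranks with largest |h|.
    active = sorted(range(len(h)), key=lambda r: abs(h[r]), reverse=True)[:top_k]
    # Up-projection restricted to the active ranks, i.e. columns of B.
    out = [sum(b_row[r] * h[r] for r in active) for b_row in b_trainable]
    return out, sorted(active)
```

Because the routing signal is the projection itself, the sketch has no separate router parameters; only `b_trainable` would receive gradient updates in a real training setup.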
【7】Characterizing the Multiclass Learnability of Forgiving 0-1 Loss Functions
标题:原谅0-1损失函数的多类可学习性特征
链接:https://arxiv.org/abs/2510.08382
作者:Jacob Trauger, Tyson Trauger, Ambuj Tewari
备注:9 pages
摘要:In this paper, we give a characterization of the learnability of forgiving 0-1 loss functions in the finite label multiclass setting. To do this, we create a new combinatorial dimension that is based on the Natarajan Dimension (Natarajan, 1989) and we show that a hypothesis class is learnable in our setting if and only if this Generalized Natarajan Dimension is finite. We also show a connection to learning with set-valued feedback. Through our results we show that the learnability of a set learning problem is characterized by the Natarajan Dimension.
【8】Guided Star-Shaped Masked Diffusion
标题:引导星形掩蔽扩散
链接:https://arxiv.org/abs/2510.08369
作者:Viacheslav Meshchaninov, Egor Shibaev, Artem Makoian, Ivan Klimov, Danil Sheshenya, Andrei Malinin, Nikita Balagansky, Daniil Gavrilov, Aibek Alanov, Dmitry Vetrov
摘要:The performance of pre-trained masked diffusion models is often constrained by their sampling procedure, which makes decisions irreversible and struggles in low-step generation regimes. We introduce a novel sampling algorithm that works with pre-trained models and, after a lightweight fine-tuning of a single layer, significantly improves sample quality and efficiency. Our method reformulates the generation process using a star-shaped paradigm, which inherently allows for error correction. To make this process effective, we augment it with a learnable re-masking scheduler that intelligently identifies and revises likely errors. This approach yields a substantial quality boost, particularly when using a small number of sampling steps. We extensively ablate key components of our approach and show its usability in different scenarios. In comprehensive experiments on text and code generation, our sampling algorithm outperforms or matches existing methods.
【9】Bridging the Physics-Data Gap with FNO-Guided Conditional Flow Matching: Designing Inductive Bias through Hierarchical Physical Constraints
标题:利用FNO引导的条件流匹配弥合物理数据差距:通过分层物理约束设计归纳偏差
链接:https://arxiv.org/abs/2510.08295
作者:Tsuyoshi Okita
备注:8 pages, 1 figure
摘要:Conventional time-series generation often ignores domain-specific physical constraints, limiting statistical and physical consistency. We propose a hierarchical framework that embeds the inherent hierarchy of physical laws (conservation, dynamics, boundary, and empirical relations) directly into deep generative models, introducing a new paradigm of physics-informed inductive bias. Our method combines Fourier Neural Operators (FNOs) for learning physical operators with Conditional Flow Matching (CFM) for probabilistic generation, integrated via time-dependent hierarchical constraints and FNO-guided corrections. Experiments on harmonic oscillators, human activity recognition, and lithium-ion battery degradation show 16.3% higher generation quality, 46% fewer physics violations, and 18.5% improved predictive accuracy over baselines.
【10】Leveraging Whisper Embeddings for Audio-based Lyrics Matching
标题:利用Whisper嵌入进行音频歌词匹配
链接:https://arxiv.org/abs/2510.08176
作者:Eleonora Mancini, Joan Serrà, Paolo Torroni, Yuki Mitsufuji
摘要:Audio-based lyrics matching can be an appealing alternative to other content-based retrieval approaches, but existing methods often suffer from limited reproducibility and inconsistent baselines. In this work, we introduce WEALY, a fully reproducible pipeline that leverages Whisper decoder embeddings for lyrics matching tasks. WEALY establishes robust and transparent baselines, while also exploring multimodal extensions that integrate textual and acoustic features. Through extensive experiments on standard datasets, we demonstrate that WEALY achieves a performance comparable to state-of-the-art methods that lack reproducibility. In addition, we provide ablation studies and analyses on language robustness, loss functions, and embedding strategies. This work contributes a reliable benchmark for future research, and underscores the potential of speech technologies for music information retrieval tasks.
【11】AI Knowledge Assist: An Automated Approach for the Creation of Knowledge Bases for Conversational AI Agents
标题:人工智能知识辅助:为对话式人工智能代理创建知识库的自动化方法
链接:https://arxiv.org/abs/2510.08149
作者:Md Tahmid Rahman Laskar, Julien Bouvier Tremblay, Xue-Yong Fu, Cheng Chen, Shashi Bhushan TN
备注:Accepted to the EMNLP 2025 Industry Track
摘要:The utilization of conversational AI systems by leveraging Retrieval Augmented Generation (RAG) techniques to solve customer problems has been on the rise with the rapid progress of Large Language Models (LLMs). However, the absence of a company-specific dedicated knowledge base is a major barrier to the integration of conversational AI systems in contact centers. To this end, we introduce AI Knowledge Assist, a system that extracts knowledge in the form of question-answer (QA) pairs from historical customer-agent conversations to automatically build a knowledge base. Fine-tuning a lightweight LLM on internal data demonstrates state-of-the-art performance, outperforming larger closed-source LLMs. More specifically, empirical evaluation on 20 companies demonstrates that the proposed AI Knowledge Assist system that leverages the LLaMA-3.1-8B model eliminates the cold-start gap in contact centers by achieving above 90% accuracy in answering information-seeking questions. This enables immediate deployment of RAG-powered chatbots.
【12】Bayesian Decision Making around Experts
标题:围绕专家的Bayesian决策
链接:https://arxiv.org/abs/2510.08113
作者:Daniel Jarne Ornia, Joel Dyer, Nicholas Bishop, Anisoara Calinescu, Michael Wooldridge
摘要:Complex learning agents are increasingly deployed alongside existing experts, such as human operators or previously trained agents. However, it remains unclear how learners should optimally incorporate certain forms of expert data, which may differ in structure from the learner's own action-outcome experiences. We study this problem in the context of Bayesian multi-armed bandits, considering: (i) offline settings, where the learner receives a dataset of outcomes from the expert's optimal policy before interaction, and (ii) simultaneous settings, where the learner must choose at each step whether to update its beliefs based on its own experience, or based on the outcome simultaneously achieved by an expert. We formalize how expert data influences the learner's posterior, and prove that pretraining on expert outcomes tightens information-theoretic regret bounds by the mutual information between the expert data and the optimal action. For the simultaneous setting, we propose an information-directed rule where the learner processes the data source that maximizes their one-step information gain about the optimal action. Finally, we propose strategies for how the learner can infer when to trust the expert and when not to, safeguarding the learner for the cases where the expert is ineffective or compromised. By quantifying the value of expert data, our framework provides practical, information-theoretic algorithms for agents to intelligently decide when to learn from others.
【13】Mitigating Subject Dependency in EEG Decoding with Subject-Specific Low-Rank Adapters
标题:使用特定受试者的低秩适配器减轻脑电解码中的受试者依赖性
链接:https://arxiv.org/abs/2510.08059
作者:Timon Klein, Piotr Minakowski, Sebastian Sager
摘要:Subject-specific distribution shifts represent an important obstacle to the development of foundation models for EEG decoding. To address this, we propose the Subject-Conditioned Layer, an adaptive layer designed as a drop-in replacement for standard linear or convolutional layers in any neural network architecture. Our layer captures subject-specific variability by decomposing its weights into a shared, subject-invariant component and a lightweight, low-rank correction unique to each subject. This explicit separation of general knowledge from personalized adaptation allows existing models to become robust to subject shifts. Empirically, models equipped with our layer outperform both a shared-weight-only model (subject-agnostic model) and the average of individually trained subject-specific models. Consequently, the Subject-Conditioned Layer offers a practical and scalable path towards building effective cross-subject foundation models for EEG.
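One plausible way to write the weight decomposition described in this abstract for a linear layer is shown below. The symbols $W_{\text{shared}}$, $U_s$, $V_s$, and the rank $r$ are notation assumed here for illustration, and the paper's exact parameterization may differ.

```latex
% Assumed notation: s indexes the subject, r is the adapter rank.
\[
  W_s \;=\; W_{\text{shared}} + U_s V_s^{\top},
  \qquad
  U_s \in \mathbb{R}^{d_{\text{out}} \times r},\;
  V_s \in \mathbb{R}^{d_{\text{in}} \times r},\;
  r \ll \min(d_{\text{in}}, d_{\text{out}}),
\]
\[
  y \;=\; W_s x + b
    \;=\; W_{\text{shared}}\, x + U_s \bigl(V_s^{\top} x\bigr) + b .
\]
```

Under this form, each additional subject costs only $r\,(d_{\text{in}} + d_{\text{out}})$ extra parameters, while $W_{\text{shared}}$ is trained on data from all subjects.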
【14】Do We Really Need Permutations? Impact of Width Expansion on Linear Mode Connectivity
标题:我们真的需要排列吗?宽度扩展对线性模式连接性的影响
链接:https://arxiv.org/abs/2510.08023
作者:Akira Ito, Masanori Yamada, Daiki Chijiwa, Atsutoshi Kumagai
摘要:Recently, Ainsworth et al. empirically demonstrated that, given two independently trained models, applying a parameter permutation that preserves the input-output behavior allows the two models to be connected by a low-loss linear path. When such a path exists, the models are said to achieve linear mode connectivity (LMC). Prior studies, including Ainsworth et al., have reported that achieving LMC requires not only an appropriate permutation search but also sufficiently wide models (e.g., a 32 $\times$ width multiplier for ResNet-20). This is broadly believed to be because increasing the model width ensures a large enough space of candidate permutations, increasing the chance of finding one that yields LMC. In this work, we empirically demonstrate that, even without any permutations, simply widening the models is sufficient for achieving LMC when using a suitable softmax temperature calibration. We further explain why this phenomenon arises by analyzing intermediate layer outputs. Specifically, we introduce layerwise exponentially weighted connectivity (LEWC), which states that the output of each layer of the merged model can be represented as an exponentially weighted sum of the outputs of the corresponding layers of the original models. Consequently the merged model's output matches that of an ensemble of the original models, which facilitates LMC. To the best of our knowledge, this work is the first to show that widening the model not only facilitates nonlinear mode connectivity, as suggested in prior research, but also significantly increases the possibility of achieving linear mode connectivity.
【15】Accelerated Evolving Set Processes for Local PageRank Computation
标题:本地PageRank计算的加速进化集过程
链接:https://arxiv.org/abs/2510.08010
作者:Binbin Huang, Luo Luo, Yanghua Xiao, Deqing Yang, Baojian Zhou
摘要:This work proposes a novel framework based on nested evolving set processes to accelerate Personalized PageRank (PPR) computation. At each stage of the process, we employ a localized inexact proximal point iteration to solve a simplified linear system. We show that the time complexity of such localized methods is upper bounded by $\min\{\tilde{\mathcal{O}}(R^2/\epsilon^2), \tilde{\mathcal{O}}(m)\}$ to obtain an $\epsilon$-approximation of the PPR vector, where $m$ denotes the number of edges in the graph and $R$ is a constant defined via nested evolving set processes. Furthermore, the algorithms induced by our framework require solving only $\tilde{\mathcal{O}}(1/\sqrt{\alpha})$ such linear systems, where $\alpha$ is the damping factor. When $1/\epsilon^2\ll m$, this implies the existence of an algorithm that computes an $\epsilon$-approximation of the PPR vector with an overall time complexity of $\tilde{\mathcal{O}}\left(R^2 / (\sqrt{\alpha}\epsilon^2)\right)$, independent of the underlying graph size. Our result resolves an open conjecture from existing literature. Experimental results on real-world graphs validate the efficiency of our methods, demonstrating significant convergence in the early stages.
【16】VoiceAgentBench: Are Voice Assistants ready for agentic tasks?
标题:VoiceAgentBench:语音助理准备好执行代理任务了吗?
链接:https://arxiv.org/abs/2510.07978
作者:Dhruv Jain, Harshit Shukla, Gautam Rajeev, Ashish Kulkarni, Chandra Khatri, Shubham Agarwal
摘要:Large-scale Speech Language Models (SpeechLMs) have enabled voice assistants capable of understanding natural spoken queries and performing complex tasks. However, existing speech benchmarks primarily focus on isolated capabilities such as transcription or question answering, and do not systematically evaluate agentic scenarios encompassing multilingual and cultural understanding, as well as adversarial robustness. To address this, we introduce VoiceAgentBench, a comprehensive benchmark designed to evaluate SpeechLMs in realistic spoken agentic settings. It comprises over 5,500 synthetic spoken queries, including dialogues grounded in Indian context, covering single-tool invocations, multi-tool workflows, multi-turn interactions, and safety evaluations. The benchmark supports English, Hindi, and 5 other Indian languages, reflecting real-world linguistic and cultural diversity. We simulate speaker variability using a novel sampling algorithm that selects audios for TTS voice conversion based on their speaker embeddings, maximizing acoustic and speaker diversity. Our evaluation measures tool selection accuracy, structural consistency, and the correctness of tool invocations, including adversarial robustness. Our experiments reveal significant gaps in contextual tool orchestration tasks, Indic generalization, and adversarial robustness, exposing critical limitations of current SpeechLMs.
【17】SIMU: Selective Influence Machine Unlearning
标题:SIMU:选择性影响机器遗忘
链接:https://arxiv.org/abs/2510.07822
作者:Anu Agarwal, Mihir Pamnani, Dilek Hakkani-Tur
备注:Accepted to NeurIPS 2025 Workshop: Constrained Optimization for Machine Learning (COML)
摘要:The undesired memorization of sensitive information by Large Language Models (LLMs) has emphasized the need for safety mechanisms that can regulate model behavior. This has led to the development of machine unlearning techniques that enable models to precisely forget sensitive and unwanted information. For machine unlearning, first-order and second-order optimizer-based methods have shown significant progress in enabling LLMs to forget targeted information. However, in doing so, these approaches often compromise the model's original capabilities, resulting in unlearned models that struggle to retain their prior knowledge and overall utility. To address this, we propose Selective Influence Machine Unlearning (SIMU), a two-step framework that enhances second-order optimizer-based unlearning by selectively updating only the critical neurons responsible for encoding the forget-set. By constraining updates to these targeted neurons, SIMU achieves comparable unlearning efficacy while substantially outperforming current methods in retaining the model's original knowledge.
【18】Rényi Sharpness: A Novel Sharpness that Strongly Correlates with Generalization
标题:Rényi锐度:一种与泛化强相关的新型锐度度量
链接:https://arxiv.org/abs/2510.07758
作者:Qiaozhe Zhang, Jun Sun, Ruijie Zhang, Yingzhuang Liu
摘要:Sharpness (of the loss minima) is a common measure to investigate the generalization of neural networks. Intuitively speaking, the flatter the landscape near the minima is, the better generalization might be. Unfortunately, the correlation between many existing sharpness measures and the generalization is usually not strong, sometimes even weak. To close the gap between the intuition and the reality, we propose a novel sharpness measure, i.e., Rényi sharpness, which is defined as the negative Rényi entropy (a generalization of the classical Shannon entropy) of the loss Hessian. The main ideas are as follows: 1) we realize that uniform (identical) eigenvalues of the loss Hessian are most desirable (while keeping the sum constant) to achieve good generalization; 2) we employ the Rényi entropy to concisely characterize the extent of the spread of the eigenvalues of the loss Hessian. Normally, the larger the spread, the smaller the (Rényi) entropy. To rigorously establish the relationship between generalization and (Rényi) sharpness, we provide several generalization bounds in terms of Rényi sharpness, by taking advantage of the reparametrization invariance property of Rényi sharpness, as well as the trick of translating the data discrepancy to the weight perturbation. Furthermore, extensive experiments are conducted to verify the strong correlation (specifically, Kendall rank correlation) between the Rényi sharpness and generalization. Moreover, we propose to use a variant of Rényi sharpness as a regularizer during training, i.e., Rényi Sharpness Aware Minimization (RSAM), which turns out to outperform all existing sharpness-aware minimization methods. It is worth noting that the test accuracy gain of our proposed RSAM method could be as high as nearly 2.5%, compared against the classical SAM method.
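To make the definition concrete, here is a small sketch that computes a negative Rényi entropy from a list of Hessian eigenvalues. It assumes the eigenvalues are non-negative and are normalized by their sum into a probability vector; that normalization, the default order `alpha=2.0`, and the function names are assumptions for illustration, and the paper's precise definition may differ.

```python
import math
from typing import Sequence


def renyi_entropy(probs: Sequence[float], alpha: float) -> float:
    """Rényi entropy of order alpha for a probability vector (alpha > 0)."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if abs(alpha - 1.0) < 1e-12:
        # The alpha -> 1 limit is the Shannon entropy.
        return -sum(p * math.log(p) for p in probs if p > 0)
    return math.log(sum(p ** alpha for p in probs if p > 0)) / (1.0 - alpha)


def renyi_sharpness(eigenvalues: Sequence[float], alpha: float = 2.0) -> float:
    """Negative Rényi entropy of the sum-normalized eigenvalues (assumed form)."""
    total = sum(eigenvalues)
    if total <= 0 or any(lam < 0 for lam in eigenvalues):
        raise ValueError("expects non-negative eigenvalues with a positive sum")
    probs = [lam / total for lam in eigenvalues]
    return -renyi_entropy(probs, alpha)
```

For a fixed eigenvalue sum, identical eigenvalues maximize the entropy, so they give the smallest value of this quantity, namely `-log(n)` for `n` eigenvalues. A spectrum dominated by one large eigenvalue gives a value closer to zero, which matches the abstract's point that a larger spread means a smaller entropy.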
【19】t-SNE Exaggerates Clusters, Provably
标题:可以证明,t-SNE夸大了集群
链接:https://arxiv.org/abs/2510.07746
作者:Noah Bergam, Szymon Snoeck, Nakul Verma
摘要:Central to the widespread use of t-distributed stochastic neighbor embedding (t-SNE) is the conviction that it produces visualizations whose structure roughly matches that of the input. To the contrary, we prove that (1) the strength of the input clustering, and (2) the extremity of outlier points, cannot be reliably inferred from the t-SNE output. We demonstrate the prevalence of these failure modes in practice as well.
【20】Value Flows
标题:值流
链接:https://arxiv.org/abs/2510.07650
作者:Perry Dong, Chongyi Zheng, Chelsea Finn, Dorsa Sadigh, Benjamin Eysenbach
摘要:While most reinforcement learning methods today flatten the distribution of future returns to a single scalar value, distributional RL methods exploit the return distribution to provide stronger learning signals and to enable applications in exploration and safe RL. While the predominant method for estimating the return distribution is by modeling it as a categorical distribution over discrete bins or estimating a finite number of quantiles, such approaches leave unanswered questions about the fine-grained structure of the return distribution and about how to distinguish states with high return uncertainty for decision-making. The key idea in this paper is to use modern, flexible flow-based models to estimate the full future return distributions and identify those states with high return variance. We do so by formulating a new flow-matching objective that generates probability density paths satisfying the distributional Bellman equation. Building upon the learned flow models, we estimate the return uncertainty of distinct states using a new flow derivative ODE. We additionally use this uncertainty information to prioritize learning a more accurate return estimation on certain transitions. We compare our method (Value Flows) with prior methods in the offline and online-to-online settings. Experiments on $37$ state-based and $25$ image-based benchmark tasks demonstrate that Value Flows achieves a $1.3\times$ improvement on average in success rates. Website: https://pd-perry.github.io/value-flows Code: https://github.com/chongyi-zheng/value-flows
【21】Benchmarking is Broken - Don't Let AI be its Own Judge
标题:基准已被打破--不要让人工智能自己评判
链接:https://arxiv.org/abs/2510.07575
作者:Zerui Cheng, Stella Wohnig, Ruchika Gupta, Samiul Alam, Tassallah Abdullahi, João Alves Ribeiro, Christian Nielsen-Garcia, Saif Mir, Siran Li, Jason Orender, Seyed Ali Bahrainian, Daniel Kirste, Aaron Gokaslan, Mikołaj Glinka, Carsten Eickhoff, Ruben Wolff
备注:12 pages; Accepted to NeurIPS 2025. Link to poster: this https URL
摘要:The meteoric rise of Artificial Intelligence (AI), with its rapidly expanding market capitalization, presents both transformative opportunities and critical challenges. Chief among these is the urgent need for a new, unified paradigm for trustworthy evaluation, as current benchmarks increasingly reveal critical vulnerabilities. Issues like data contamination and selective reporting by model developers fuel hype, while inadequate data quality control can lead to biased evaluations that, even if unintentionally, may favor specific approaches. As a flood of participants enters the AI space, this "Wild West" of assessment makes distinguishing genuine progress from exaggerated claims exceptionally difficult. Such ambiguity blurs scientific signals and erodes public confidence, much as unchecked claims would destabilize financial markets reliant on credible oversight from agencies like Moody's. In high-stakes human examinations (e.g., SAT, GRE), substantial effort is devoted to ensuring fairness and credibility; why settle for less in evaluating AI, especially given its profound societal impact? This position paper argues that the current laissez-faire approach is unsustainable. We contend that true, sustainable AI advancement demands a paradigm shift: a unified, live, and quality-controlled benchmarking framework robust by construction, not by mere courtesy and goodwill. To this end, we dissect the systemic flaws undermining today's AI evaluation, distill the essential requirements for a new generation of assessments, and introduce PeerBench, a community-governed, proctored evaluation blueprint that embodies this paradigm through sealed execution, item banking with rolling renewal, and delayed transparency. Our goal is to pave the way for evaluations that can restore integrity and deliver genuinely trustworthy measures of AI progress.
【22】Efficient Generalization via Multimodal Co-Training under Data Scarcity and Distribution Shift
标题:数据稀缺和分布转移下通过多模式联合训练进行高效概括
链接:https://arxiv.org/abs/2510.07509
作者:Tianyu Bell Pan, Damon L. Woodard
摘要:This paper explores a multimodal co-training framework designed to enhance model generalization in situations where labeled data is limited and distribution shifts occur. We thoroughly examine the theoretical foundations of this framework, deriving conditions under which the use of unlabeled data and the promotion of agreement between classifiers for different modalities lead to significant improvements in generalization. We also present a convergence analysis that confirms the effectiveness of iterative co-training in reducing classification errors. In addition, we establish a novel generalization bound that, for the first time in a multimodal co-training context, decomposes and quantifies the distinct advantages gained from leveraging unlabeled multimodal data, promoting inter-view agreement, and maintaining conditional view independence. Our findings highlight the practical benefits of multimodal co-training as a structured approach to developing data-efficient and robust AI systems that can effectively generalize in dynamic, real-world environments. The theoretical foundations are examined in dialogue with, and in advance of, established co-training principles.
【23】PEAR: Planner-Executor Agent Robustness Benchmark
标题:PEAR:规划者-执行者代理稳健性基准
链接:https://arxiv.org/abs/2510.07505
作者:Shen Dong, Mingxuan Zhang, Pengfei He, Li Ma, Bhavani Thuraisingham, Hui Liu, Yue Xing
摘要:Large Language Model (LLM)-based Multi-Agent Systems (MAS) have emerged as a powerful paradigm for tackling complex, multi-step tasks across diverse domains. However, despite their impressive capabilities, MAS remain susceptible to adversarial manipulation. Existing studies typically examine isolated attack surfaces or specific scenarios, leaving a lack of holistic understanding of MAS vulnerabilities. To bridge this gap, we introduce PEAR, a benchmark for systematically evaluating both the utility and vulnerability of planner-executor MAS. While compatible with various MAS architectures, our benchmark focuses on the planner-executor structure, which is a practical and widely adopted design. Through extensive experiments, we find that (1) a weak planner degrades overall clean task performance more severely than a weak executor; (2) while a memory module is essential for the planner, having a memory module for the executor does not impact the clean task performance; (3) there exists a trade-off between task performance and robustness; and (4) attacks targeting the planner are particularly effective at misleading the system. These findings offer actionable insights for enhancing the robustness of MAS and lay the groundwork for principled defenses in multi-agent settings.
【24】Best-of-Both Worlds for linear contextual bandits with paid observations
标题:带付费观测的线性上下文赌博机的两全其美(Best-of-Both-Worlds)算法
链接:https://arxiv.org/abs/2510.07424
作者:Nathan Boyer, Dorian Baudry, Patrick Rebeschini
摘要:We study the problem of linear contextual bandits with paid observations, where at each round the learner selects an action in order to minimize its loss in a given context, and can then decide to pay a fixed cost to observe the loss of any arm. Building on the Follow-the-Regularized-Leader framework with efficient estimators via Matrix Geometric Resampling, we introduce a computationally efficient Best-of-Both-Worlds (BOBW) algorithm for this problem. We show that it achieves the minimax-optimal regret of $\Theta(T^{2/3})$ in adversarial settings, while guaranteeing poly-logarithmic regret in (corrupted) stochastic regimes. Our approach builds on the framework from \cite{BOBWhardproblems} to design BOBW algorithms for "hard problems", using analysis techniques tailored to the setting that we consider.
【25】Inconsistent Affective Reaction: Sentiment of Perception and Opinion in Urban Environments
标题:不一致的情感反应:城市环境中的感知和观点
链接:https://arxiv.org/abs/2510.07359
作者:Jingfei Huang, Han Tu
备注:10 pages
摘要:The ascension of social media platforms has transformed our understanding of urban environments, giving rise to nuanced variations in sentiment reaction embedded within human perception and opinion, and challenging existing multidimensional sentiment analysis approaches in urban studies. This study presents novel methodologies for identifying and elucidating sentiment inconsistency, constructing a dataset encompassing 140,750 Baidu and Tencent Street view images to measure perceptions, and 984,024 Weibo social media text posts to measure opinions. A reaction index is developed, integrating object detection and natural language processing techniques to classify sentiment in Beijing Second Ring for 2016 and 2022. Classified sentiment reaction is analysed and visualized using regression analysis, image segmentation, and word frequency based on land-use distribution to discern underlying factors. The perception affective reaction trend map reveals a shift toward more evenly distributed positive sentiment, while the opinion affective reaction trend map shows more extreme changes. Our mismatch map indicates significant disparities between the sentiments of human perception and opinion of urban areas over the years. Changes in sentiment reactions have significant relationships with elements such as dense buildings and pedestrian presence. Our inconsistent maps present perception and opinion sentiments before and after the pandemic and offer potential explanations and directions for environmental management, in formulating strategies for urban renewal.
【26】UniFField: A Generalizable Unified Neural Feature Field for Visual, Semantic, and Spatial Uncertainties in Any Scene
标题:UniFField:一个可泛化的统一神经特征场,用于任意场景中的视觉、语义和空间不确定性
链接:https://arxiv.org/abs/2510.06754
作者:Christian Maurer, Snehal Jauhri, Sophie Lueth, Georgia Chalvatzaki
备注:Project website: this https URL
摘要:Comprehensive visual, geometric, and semantic understanding of a 3D scene is crucial for successful execution of robotic tasks, especially in unstructured and complex environments. Additionally, to make robust decisions, it is necessary for the robot to evaluate the reliability of perceived information. While recent advances in 3D neural feature fields have enabled robots to leverage features from pretrained foundation models for tasks such as language-guided manipulation and navigation, existing methods suffer from two critical limitations: (i) they are typically scene-specific, and (ii) they lack the ability to model uncertainty in their predictions. We present UniFField, a unified uncertainty-aware neural feature field that combines visual, semantic, and geometric features in a single generalizable representation while also predicting uncertainty in each modality. Our approach, which can be applied zero-shot to any new environment, incrementally integrates RGB-D images into our voxel-based feature representation as the robot explores the scene, simultaneously updating uncertainty estimation. We evaluate our uncertainty estimations to accurately describe the model prediction errors in scene reconstruction and semantic feature prediction. Furthermore, we successfully leverage our feature predictions and their respective uncertainty for an active object search task using a mobile manipulator robot, demonstrating the capability for robust decision-making.
【27】Navigating Sparsities in High-Dimensional Linear Contextual Bandits
Link: https://arxiv.org/abs/2510.08435
Authors: Rui Zhao, Zihan Chen, Zemin Zheng
Abstract: High-dimensional linear contextual bandit problems remain a significant challenge due to the curse of dimensionality. Existing methods typically consider either the model parameters to be sparse or the eigenvalues of context covariance matrices to be (approximately) sparse, lacking general applicability due to the rigidity of conventional reward estimators. To overcome this limitation, a powerful pointwise estimator is introduced in this work that adaptively navigates both kinds of sparsity. Based on this pointwise estimator, a novel algorithm, termed HOPE, is proposed. Theoretical analyses demonstrate that HOPE not only achieves improved regret bounds in previously discussed homogeneous settings (i.e., considering only one type of sparsity) but also, for the first time, efficiently handles two new challenging heterogeneous settings (i.e., considering a mixture of two types of sparsity), highlighting its flexibility and generality. Experiments corroborate the superiority of HOPE over existing methods across various scenarios.
【28】PAC Learnability in the Presence of Performativity
Link: https://arxiv.org/abs/2510.08335
Authors: Ivan Kirev, Lyuben Baltadzhiev, Nikola Konstantinov
Comments: 21 pages, 3 figures
Abstract: Following the widespread adoption of machine learning models in real-world applications, the phenomenon of performativity, i.e., model-dependent shifts in the test distribution, becomes increasingly prevalent. Unfortunately, since models are usually trained solely based on samples from the original (unshifted) distribution, this performative shift may lead to decreased test-time performance. In this paper, we study the question of whether and when performative binary classification problems are learnable, via the lens of the classic PAC (Probably Approximately Correct) learning framework. We motivate several performative scenarios, accounting in particular for linear shifts in the label distribution, as well as for more general changes in both the labels and the features. We construct a performative empirical risk function, which depends only on data from the original distribution and on the type of performative effect, yet is an unbiased estimate of the true risk of a classifier on the shifted distribution. Minimizing this notion of performative risk allows us to show that any PAC-learnable hypothesis space in the standard binary classification setting remains PAC-learnable for the considered performative scenarios. We also conduct an extensive experimental evaluation of our performative risk minimization method and showcase benefits on synthetic and real data.
【29】Quantum Agents for Algorithmic Discovery
Link: https://arxiv.org/abs/2510.08159
Authors: Iordanis Kerenidis, El-Amine Cherrat
Abstract: We introduce quantum agents trained by episodic, reward-based reinforcement learning to autonomously rediscover several seminal quantum algorithms and protocols. In particular, our agents learn: efficient logarithmic-depth quantum circuits for the Quantum Fourier Transform; Grover's search algorithm; optimal cheating strategies for strong coin flipping; and optimal winning strategies for the CHSH and other nonlocal games. The agents achieve these results directly through interaction, without prior access to known optimal solutions. This demonstrates the potential of quantum intelligence as a tool for algorithmic discovery, opening the way for the automated design of novel quantum algorithms and protocols.
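Illustration (not from the paper): the abstract mentions optimal winning strategies for the CHSH game. The sketch below is not the authors' agent or training loop. It only computes the textbook benchmark values such an agent would be expected to reach, assuming the standard formulation in which the players win when a XOR b equals x AND y, and assuming a shared maximally entangled pair measured in real bases rotated by the given angles (so the outcomes agree with probability cos^2 of the angle difference). The names `classical_best`, `quantum_value`, `ALICE_ANGLES`, and `BOB_ANGLES` are hypothetical.

```python
import math
from itertools import product

# Textbook measurement angles (an assumption, not taken from the paper).
ALICE_ANGLES = (0.0, math.pi / 4)        # Alice's basis angle for x = 0, 1
BOB_ANGLES = (math.pi / 8, -math.pi / 8)  # Bob's basis angle for y = 0, 1


def classical_best():
    """Best winning probability over all deterministic classical strategies."""
    best = 0.0
    for fa in product((0, 1), repeat=2):      # Alice answers a = fa[x]
        for fb in product((0, 1), repeat=2):  # Bob answers b = fb[y]
            wins = sum(
                (fa[x] ^ fb[y]) == (x & y)
                for x, y in product((0, 1), repeat=2)
            )
            best = max(best, wins / 4)
    return best


def quantum_value(alice, bob):
    """Winning probability for the given angles with uniform inputs x, y."""
    total = 0.0
    for x, y in product((0, 1), repeat=2):
        p_agree = math.cos(alice[x] - bob[y]) ** 2
        # The players need different outputs only when x = y = 1.
        total += (1.0 - p_agree) if (x & y) else p_agree
    return total / 4


if __name__ == "__main__":
    print("classical:", classical_best())
    print("quantum:  ", quantum_value(ALICE_ANGLES, BOB_ANGLES))
```

With these angles every input pair is won with probability cos^2(pi/8), about 0.854, compared with the classical bound of 0.75.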
【30】Beyond Real Data: Synthetic Data through the Lens of Regularization
Link: https://arxiv.org/abs/2510.08095
Authors: Amitis Shidani, Tyler Farghly, Yang Sun, Habib Ganjgahi, George Deligiannidis
Abstract: Synthetic data can improve generalization when real data is scarce, but excessive reliance may introduce distributional mismatches that degrade performance. In this paper, we present a learning-theoretic framework to quantify the trade-off between synthetic and real data. Our approach leverages algorithmic stability to derive generalization error bounds, characterizing the optimal synthetic-to-real data ratio that minimizes expected test error as a function of the Wasserstein distance between the real and synthetic distributions. We motivate our framework in the setting of kernel ridge regression with mixed data, offering a detailed analysis that may be of independent interest. Our theory predicts the existence of an optimal ratio, leading to a U-shaped behavior of test error with respect to the proportion of synthetic data. Empirically, we validate this prediction on CIFAR-10 and a clinical brain MRI dataset. Our theory extends to the important scenario of domain adaptation, showing that carefully blending synthetic target data with limited source data can mitigate domain shift and enhance generalization. We conclude with practical guidance for applying our results to both in-domain and out-of-domain scenarios.
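Illustration (not from the paper): the U-shaped trade-off described in the abstract can be seen in a much simpler toy problem than the paper's kernel ridge regression setting. Assume, purely for illustration, that we estimate a mean by pooling n real samples from N(mu, sigma^2) with m synthetic samples from N(mu + delta, sigma^2). For these two Gaussians the Wasserstein distance equals |delta|. The pooled sample mean then has mean squared error sigma^2/(n+m) + (m*delta/(n+m))^2: the first term (variance) shrinks with more synthetic data, the second (squared bias) grows. The function names below are hypothetical.

```python
def pooled_mean_mse(n_real, m_synth, sigma, delta):
    """Exact MSE of the pooled sample mean in the toy Gaussian model."""
    total = n_real + m_synth
    variance = sigma ** 2 / total      # shrinks as data is added
    bias = m_synth * delta / total     # grows with the synthetic share
    return variance + bias ** 2


def optimal_synthetic_count(n_real, sigma, delta):
    """Real-valued m that minimizes pooled_mean_mse for fixed n_real.

    Setting the derivative in m to zero gives
    m* = sigma^2 * n / (2 * delta^2 * n - sigma^2).
    If the denominator is not positive, the MSE decreases for every m,
    so adding synthetic data never hurts and we return infinity.
    """
    denom = 2 * delta ** 2 * n_real - sigma ** 2
    if denom <= 0:
        return float("inf")
    return sigma ** 2 * n_real / denom


if __name__ == "__main__":
    n, sigma, delta = 10, 1.0, 0.5
    m_star = optimal_synthetic_count(n, sigma, delta)
    for m in (0, m_star, 100):
        print(m, pooled_mean_mse(n, m, sigma, delta))
```

In this toy model the optimal amount of synthetic data depends on the distribution shift only through delta, which mirrors (in a far weaker form) the abstract's statement that the optimal ratio is a function of the Wasserstein distance.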
【31】Computations and ML for surjective rational maps
Link: https://arxiv.org/abs/2510.08093
Authors: Ilya Karzhemanov
Comments: 15 pages, 2 figures, a couple of Python codes
Abstract: The present note studies \emph{surjective rational endomorphisms} $f: \mathbb{P}^2 \dashrightarrow \mathbb{P}^2$ with \emph{cubic} terms and the indeterminacy locus $I_f \ne \emptyset$. We develop an experimental approach, based on some Python programming and Machine Learning, towards the classification of such maps; a couple of new explicit maps $f$ are constructed in this way. We also prove (via pure projective geometry) that a general non-regular cubic endomorphism $f$ of $\mathbb{P}^2$ is surjective if and only if the set $I_f$ has cardinality at least $3$.
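Illustration (not from the paper): the indeterminacy locus $I_f$ is the set of points where all three coordinate polynomials of $f$ vanish at once. The sketch below is not the author's code. It takes a hypothetical cubic example, $f = [x^2 y : y^2 z : z^2 x]$, and lists the points of $I_f$ with coordinates in a small finite field $\mathbb{F}_p$ by brute force. Working over $\mathbb{F}_p$ instead of the complex numbers is a simplifying assumption, and no claim is made here that this particular map is surjective or "general" in the sense of the abstract.

```python
def projective_points(p):
    """Yield one normalized representative per point of P^2(F_p)."""
    for y in range(p):
        for z in range(p):
            yield (1, y, z)
    for z in range(p):
        yield (0, 1, z)
    yield (0, 0, 1)


def cubic_map(point, p):
    """Coordinates of the hypothetical map [x^2 y : y^2 z : z^2 x] mod p."""
    x, y, z = point
    return (x * x * y % p, y * y * z % p, z * z * x % p)


def indeterminacy_points(p):
    """Points of P^2(F_p) where all three cubic coordinates vanish."""
    return [pt for pt in projective_points(p) if cubic_map(pt, p) == (0, 0, 0)]


if __name__ == "__main__":
    print(indeterminacy_points(7))
```

For this example the three coordinate points [1:0:0], [0:1:0], and [0:0:1] are the only common zeros, so the locus found over $\mathbb{F}_7$ has exactly three points.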
Machine translation provided by Tencent Interactive Translation (腾讯交互翻译); for reference only.