Click "Read the original" to visit arxivdaily.com, covering CS, Physics, Mathematics, Economics, Statistics, Finance, Biology, and Electrical Engineering, with search, bookmarking, and more!
cs.LG: 116 papers today
Large language models (12 papers)
【1】DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation
链接:https://arxiv.org/abs/2512.05112
作者:Dongzhi Jiang, Renrui Zhang, Haodong Li, Zhuofan Zong, Ziyu Guo, Jun He, Claire Guo, Junyan Ye, Rongyao Fang, Weijia Li, Rui Liu, Hongsheng Li
备注:Project Page: https://github.com/CaraJ7/DraCo
摘要:Recent unified multimodal large language models (MLLMs) have shown impressive capabilities, incorporating chain-of-thought (CoT) reasoning for enhanced text-to-image generation. However, existing approaches remain limited, either treating the model merely as a standalone generator or relying on abstract textual planning. To this end, we propose Draft-as-CoT (DraCo), a novel interleaved reasoning paradigm that fully leverages both textual and visual content in CoT for better planning and verification. Our method first generates a low-resolution draft image as a preview, providing more concrete and structured visual planning and guidance. We then employ the model's inherent understanding capability to verify potential semantic misalignments between the draft and the input prompt, and perform refinement through selective corrections with super-resolution. In this way, our approach addresses two fundamental challenges: the coarse-grained nature of textual planning and the difficulty of generating rare attribute combinations. To support training, we curate DraCo-240K, aiming to enhance three atomic capabilities spanning general correction, instance manipulation, and layout reorganization. Supported by DraCo-CFG, a specialized classifier-free guidance (CFG) strategy for interleaved reasoning, DraCo achieves large gains on GenEval (+8%), Imagine-Bench (+0.91), and GenEval++ (+3%), significantly outperforming direct generation and other CoT-empowered generation methods.
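DraCo-CFG itself is a specialized variant whose details are in the paper; the generic classifier-free guidance formula it builds on can be sketched as follows, where the function name and list-based logits are illustrative only.

```python
# Generic classifier-free guidance (CFG): blend the model's conditional and
# unconditional predictions with a guidance scale w. This is the standard
# formula only; DraCo-CFG adapts it for interleaved reasoning.

def cfg_combine(cond_logits, uncond_logits, w):
    """Return uncond + w * (cond - uncond), element-wise."""
    return [u + w * (c - u) for c, u in zip(cond_logits, uncond_logits)]
```

With w = 1 the result equals the conditional prediction; w > 1 extrapolates away from the unconditional one, strengthening prompt adherence.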
【2】Semantic Soft Bootstrapping: Long Context Reasoning in LLMs without Reinforcement Learning
链接:https://arxiv.org/abs/2512.05105
作者:Purbesh Mitra, Sennur Ulukus
摘要:Long-context reasoning in large language models (LLMs) has been shown to enhance their cognitive capabilities via chain-of-thought (CoT) inference. Training such models is usually done via reinforcement learning with verifiable rewards (RLVR) on reasoning-based problems, like math and programming. However, RLVR is limited by several bottlenecks, such as a lack of dense rewards and inadequate sample efficiency. As a result, it requires significant compute resources in the post-training phase. To overcome these limitations, in this work we propose \textbf{Semantic Soft Bootstrapping (SSB)}, a self-distillation technique in which the same base language model plays the roles of both teacher and student, but receives different semantic contexts about the correctness of its outcome at training time. The model is first prompted with a math problem and several rollouts are generated. From them, the correct and the most common incorrect responses are filtered out and then provided to the model in context to produce a more robust, step-by-step explanation with a verified final answer. This pipeline automatically curates a paired teacher-student training set from raw problem-answer data, without any human intervention. The generation process also produces a sequence of logits, which the student model tries to match in the training phase from the bare question alone. In our experiment, we fine-tune Qwen2.5-3B-Instruct on the GSM8K dataset via parameter-efficient fine-tuning. We then test its accuracy on the MATH500 and AIME2024 benchmarks. Our experiments show improvements in accuracy of 10.6% and 10%, respectively, over group relative policy optimization (GRPO), a commonly used RLVR algorithm. Our code is available at https://github.com/purbeshmitra/semantic-soft-bootstrapping, and the model and curated dataset are available at https://huggingface.co/purbeshmitra/semantic-soft-bootstrapping.
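The logit-matching objective described above (student matches the teacher's per-token logit sequence) is a soft-distillation loss; a minimal sketch, assuming a per-position KL divergence between softmaxed logits, might look like this. All names here are illustrative, not from the SSB codebase.

```python
import math

# Soft distillation: the student (bare question) matches the logit sequence
# produced by the teacher (same model, prompted with correctness context).

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q, eps=1e-12):
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distill_loss(teacher_logits_seq, student_logits_seq):
    """Mean per-position KL(teacher || student) over the token sequence."""
    losses = [kl_divergence(softmax(t), softmax(s))
              for t, s in zip(teacher_logits_seq, student_logits_seq)]
    return sum(losses) / len(losses)
```

Unlike RLVR's sparse scalar reward, each position contributes a dense learning signal, which is the sample-efficiency argument the abstract makes.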
【3】Multi-LLM Collaboration for Medication Recommendation
链接:https://arxiv.org/abs/2512.05066
作者:Huascar Sanchez, Briland Hitaj, Jules Bergmann, Linda Briesemeister
备注:8 pages, 5 figures, 1 table
摘要:As healthcare increasingly turns to AI for scalable and trustworthy clinical decision support, ensuring reliability in model reasoning remains a critical challenge. Individual large language models (LLMs) are susceptible to hallucinations and inconsistency, whereas naive ensembles of models often fail to deliver stable and credible recommendations. Building on our previous work on LLM Chemistry, which quantifies the collaborative compatibility among LLMs, we apply this framework to improve the reliability of medication recommendation from brief clinical vignettes. Our approach leverages multi-LLM collaboration guided by Chemistry-inspired interaction modeling, enabling ensembles that are effective (exploiting complementary strengths), stable (producing consistent quality), and calibrated (minimizing interference and error amplification). We evaluate our Chemistry-based multi-LLM collaboration strategy on real-world clinical scenarios to investigate whether such interaction-aware ensembles can generate credible, patient-specific medication recommendations. Preliminary results are encouraging, suggesting that LLM Chemistry-guided collaboration may offer a promising path toward reliable and trustworthy AI assistants in clinical practice.
【4】STELLA: Guiding Large Language Models for Time Series Forecasting with Semantic Abstractions
链接:https://arxiv.org/abs/2512.04871
作者:Junjie Fan, Hongye Zhao, Linduo Wei, Jiayu Rao, Guijia Li, Jiaxin Yuan, Wenqi Xu, Yong Qi
备注:This work has been submitted to the IEEE for possible publication
摘要:Recent adaptations of Large Language Models (LLMs) for time series forecasting often fail to effectively enhance the information in raw series, leaving LLM reasoning capabilities underutilized. Existing prompting strategies rely on static correlations rather than generative interpretations of dynamic behavior, lacking critical global and instance-specific context. To address this, we propose STELLA (Semantic-Temporal Alignment with Language Abstractions), a framework that systematically mines and injects structured supplementary and complementary information. STELLA employs a dynamic semantic abstraction mechanism that decouples input series into trend, seasonality, and residual components. It then translates the intrinsic behavioral features of these components into Hierarchical Semantic Anchors: a Corpus-level Semantic Prior (CSP) for global context and a Fine-grained Behavioral Prompt (FBP) for instance-level patterns. Using these anchors as prefix prompts, STELLA guides the LLM to model intrinsic dynamics. Experiments on eight benchmark datasets demonstrate that STELLA outperforms state-of-the-art methods in long- and short-term forecasting, showing superior generalization in zero-shot and few-shot settings. Ablation studies further validate the effectiveness of our dynamically generated semantic anchors.
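The trend/seasonality/residual decomposition step can be sketched with a classical additive moving-average decomposition; STELLA's actual mechanism (mapping component behavior to text anchors) is not shown, and the edge handling below is a simplification.

```python
# Classical additive decomposition: series = trend + seasonality + residual.

def decompose(series, period):
    n = len(series)
    half = period // 2
    # trend: centered moving average over one period (truncated at the edges)
    trend = []
    for i in range(n):
        window = series[max(0, i - half): i + half + 1]
        trend.append(sum(window) / len(window))
    detrended = [x - t for x, t in zip(series, trend)]
    # seasonality: mean of the detrended values at each phase of the period
    seasonal_means = [
        sum(detrended[i] for i in range(phase, n, period))
        / len(range(phase, n, period))
        for phase in range(period)
    ]
    seasonality = [seasonal_means[i % period] for i in range(n)]
    residual = [x - t - s for x, t, s in zip(series, trend, seasonality)]
    return trend, seasonality, residual

trend, seasonality, residual = decompose([1.0, 2.0, 1.0, 2.0, 1.0, 2.0], period=2)
```

By construction the three components sum back to the input series, so anything the semantic anchors describe about one component is complementary to the others.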
【5】Sequential Enumeration in Large Language Models
链接:https://arxiv.org/abs/2512.04727
作者:Kuinan Hou, Marco Zorzi, Alberto Testolin
摘要:Reliably counting and generating sequences of items remain a significant challenge for neural networks, including Large Language Models (LLMs). Although this capability is readily handled by rule-based symbolic systems built on serial computation, systematically deploying counting procedures is difficult for neural models, which must acquire these skills through learning. Previous research has demonstrated that recurrent architectures can only approximately track and enumerate sequences of events, and it remains unclear whether modern deep learning systems, including LLMs, can deploy systematic counting procedures over sequences of discrete symbols. This paper aims to fill this gap by investigating the sequential enumeration abilities of five state-of-the-art LLMs, including proprietary, open-source, and reasoning models. We probe the LLMs in sequential naming and production tasks involving lists of letters and words, adopting a variety of prompting instructions to explore the role of chain-of-thought in the spontaneous emergence of counting strategies. We also evaluate open-source models with the same architecture but increasing size to see whether mastery of counting principles follows scaling laws, and we analyze the embedding dynamics during sequential enumeration to investigate the emergent encoding of numerosity. We find that some LLMs are indeed capable of deploying counting procedures when explicitly prompted to do so, but none of them spontaneously engages in counting when simply asked to enumerate the number of items in a sequence. Our results suggest that, despite their impressive emergent abilities, LLMs cannot yet robustly and systematically deploy counting procedures, highlighting a persistent gap between neural and symbolic approaches to compositional generalization.
【6】TRINITY: An Evolved LLM Coordinator
链接:https://arxiv.org/abs/2512.04695
作者:Jinglue Xu, Qi Sun, Peter Schwendeman, Stefan Nielsen, Edoardo Cetin, Yujin Tang
摘要:Combining diverse foundation models is promising, but weight merging is limited by mismatched architectures and closed APIs. Trinity addresses this with a lightweight coordinator that orchestrates collaboration among large language models (LLMs). The coordinator, comprising a compact language model (approximately $0.6$B parameters) and a lightweight head (approximately $10$K parameters), is optimized with an evolutionary strategy for efficient and adaptive delegation. Trinity processes queries over multiple turns, where at each turn the coordinator assigns one of three roles (Thinker, Worker, or Verifier) to a selected LLM, effectively offloading complex skill acquisition from the coordinator itself. Experiments show that Trinity consistently outperforms individual models and existing methods across coding, math, reasoning, and domain-knowledge tasks, and generalizes robustly to out-of-distribution tasks. On standard benchmarks, Trinity achieves state-of-the-art results, including a score of 86.2% on LiveCodeBench. Theoretical and empirical analyses identify two main factors behind this performance: (1) the coordinator's hidden-state representations provide rich contextualization of inputs, and (2) under high dimensionality and strict budget constraints, the separable Covariance Matrix Adaptation Evolution Strategy offers advantages over reinforcement learning, imitation learning, and random search by exploiting potential block-$\epsilon$-separability.
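The key property of a separable evolution strategy is that it keeps only one step size per coordinate (O(n) state) instead of a full covariance matrix (O(n^2)). The toy optimizer below illustrates that idea on a sphere function; it is a simplified sketch, not the sep-CMA-ES used to train Trinity's coordinator, and all names are invented.

```python
import random

# Toy separable (diagonal-covariance) evolution strategy: sample a population,
# keep the elite, move the mean toward it, and adapt each coordinate's sigma.
# Cross-coordinate covariances are never modeled, unlike full CMA-ES.

def diagonal_es(objective, dim, iters=200, pop=16, seed=0):
    rng = random.Random(seed)
    mean = [0.5] * dim
    sigma = [0.3] * dim          # one step size per coordinate: O(dim) state
    for _ in range(iters):
        samples = [[m + s * rng.gauss(0, 1) for m, s in zip(mean, sigma)]
                   for _ in range(pop)]
        samples.sort(key=objective)
        elite = samples[: pop // 4]
        for i in range(dim):
            vals = [x[i] for x in elite]
            mean[i] = sum(vals) / len(vals)
            var = sum((v - mean[i]) ** 2 for v in vals) / len(vals)
            sigma[i] = max(var ** 0.5, 1e-6)
    return mean

sphere = lambda x: sum(v * v for v in x)
solution = diagonal_es(sphere, dim=5)
```

The diagonal restriction is what makes the method viable under the strict evaluation budgets the abstract mentions.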
【7】SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding
链接:https://arxiv.org/abs/2512.04643
作者:Chang-Hsun Wu, Kai-Po Chang, Yu-Yang Sheng, Hung-Kai Chung, Kuei-Chun Wang, Yu-Chiang Frank Wang
摘要:Video Large Language Models (VideoLLMs) have shown remarkable progress in video understanding. However, these models still struggle to effectively perceive and exploit rich temporal information in videos when responding to user queries. As a result, they often generate descriptions of events that are temporally inconsistent or causally implausible, causing severe hallucination issues. While most prior studies have focused on spatial hallucinations (e.g., object mismatches), temporal reasoning in video understanding remains relatively underexplored. To address this issue, we propose Self-Diagnostic Contrastive Decoding (SEASON), a training-free method that adaptively enhances temporal and spatial faithfulness for each output token. It achieves this by dynamically diagnosing each token's hallucination tendency and applying adaptive contrastive decoding against its corresponding temporal and spatial negatives. Extensive experiments demonstrate that SEASON outperforms all existing training-free hallucination mitigation approaches on three hallucination examination benchmarks, while further improving VideoLLMs across four general video understanding benchmarks. The code will be released upon acceptance.
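The contrastive-decoding step can be sketched with the usual formula: push the logits from the full input away from logits computed on a degraded ("negative") input. SEASON's contribution is choosing the strength per token from a self-diagnosed hallucination score; the functions below are an illustrative reading, not the paper's implementation.

```python
# Per-token contrastive decoding against a negative input.

def contrastive_logits(full_logits, negative_logits, alpha):
    """(1 + alpha) * full - alpha * negative, the usual contrastive form."""
    return [(1 + alpha) * f - alpha * n
            for f, n in zip(full_logits, negative_logits)]

def adaptive_alpha(hallucination_score, base=1.0):
    """Illustrative: scale the contrastive strength by the diagnosed risk."""
    return base * hallucination_score  # score assumed in [0, 1]
```

With alpha = 0 decoding is unchanged; a high diagnosed risk increases alpha, suppressing tokens that the negative (temporally or spatially corrupted) input also favors.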
【8】Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment
链接:https://arxiv.org/abs/2512.04356
作者:Kai-Po Chang, Wei-Yuan Cheng, Chi-Pin Huang, Fu-En Yang, Yu-Chiang Frank Wang
备注:IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026. Project page: https://kpc0810.github.io/santa/
摘要:Recent advances in multimodal LLMs (MLLMs) have demonstrated their remarkable capability to generate descriptive captions for input videos. However, these models suffer from factual inaccuracies in the generated descriptions, causing severe hallucination issues. While prior works have explored alleviating hallucinations for static images, jointly mitigating visual object and temporal action hallucinations for dynamic videos remains a challenging and unsolved task. To tackle this challenge, we propose a Self-Augmented Contrastive Alignment (SANTA) framework that promotes object and action faithfulness by removing spurious correlations and enforcing an emphasis on visual facts. SANTA employs a hallucinative self-augmentation scheme to identify the potential hallucinations that lie in the MLLM and transform the original captions into contrasted negatives. Furthermore, we develop a tracklet-phrase contrastive alignment to match regional objects and relation-guided actions with their corresponding visual and temporal phrases. Extensive experiments demonstrate that SANTA outperforms existing methods in alleviating object and action hallucinations, yielding superior performance on hallucination examination benchmarks.
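"Contrastive alignment" schemes of this kind typically optimize an InfoNCE-style loss that pulls a positive pair together and pushes the self-augmented negatives away. The sketch below shows that generic loss over scalar similarity scores; the tracklet-phrase pairing itself is SANTA-specific and not reproduced here.

```python
import math

# Generic InfoNCE objective over one positive similarity and a list of
# negative similarities.

def info_nce(pos_sim, neg_sims, temperature=0.1):
    """-log( exp(pos/t) / (exp(pos/t) + sum_i exp(neg_i/t)) )."""
    logits = [pos_sim / temperature] + [n / temperature for n in neg_sims]
    m = max(logits)  # max-shift for numerical stability
    denom = sum(math.exp(x - m) for x in logits)
    return -(logits[0] - m - math.log(denom))
```

The loss is near zero when the positive similarity dominates and grows as any contrasted negative (here, a hallucinated caption) becomes more similar than the faithful one.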
【9】Distance Is All You Need: Radial Dispersion for Uncertainty Estimation in Large Language Models
链接:https://arxiv.org/abs/2512.04351
作者:Manh Nguyen, Sunil Gupta, Hung Le
摘要:Detecting when large language models (LLMs) are uncertain is critical for building reliable systems, yet existing methods are overly complicated, relying on brittle semantic clustering or internal states. We introduce the \textbf{Radial Dispersion Score (RDS)}, a simple, parameter-free, fully model-agnostic uncertainty metric that measures the radial dispersion of sampled generations in embedding space. A lightweight probability-weighted variant further incorporates the model's own token probabilities when available, outperforming nine strong baselines. Moreover, RDS naturally extends to per-sample scoring, enabling applications such as best-of-$N$ selection and confidence-based filtering. Across four challenging free-form QA datasets and multiple LLMs, our metrics achieve state-of-the-art hallucination detection and answer selection performance, while remaining robust and scalable with respect to sample size and embedding choice.
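A minimal reading of the radial-dispersion idea: embed several sampled answers, then measure how far they spread from their centroid. The exact normalization and the probability-weighted variant follow the paper; this sketch is the unweighted version with plain Euclidean distance.

```python
import math

def radial_dispersion(embeddings):
    """Mean distance of each embedding from the centroid of all embeddings."""
    dim = len(embeddings[0])
    centroid = [sum(e[i] for e in embeddings) / len(embeddings)
                for i in range(dim)]
    dists = [math.sqrt(sum((e[i] - centroid[i]) ** 2 for i in range(dim)))
             for e in embeddings]
    return sum(dists) / len(dists)
```

Identical samples give a score of zero (the model is consistent, hence confident); widely scattered samples give a large score, flagging likely hallucination.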
【10】Evaluating Long-Context Reasoning in LLM-Based WebAgents
链接:https://arxiv.org/abs/2512.04307
作者:Andy Chung, Yichi Zhang, Kaixiang Lin, Aditya Rawal, Qiaozi Gao, Joyce Chai
备注:Accepted to the NeurIPS 2025 LAW Workshop
摘要:As large language model (LLM)-based agents become increasingly integrated into daily digital interactions, their ability to reason across long interaction histories becomes crucial for providing personalized and contextually aware assistance. However, the performance of these agents in long-context scenarios, particularly for action-taking WebAgents operating in realistic web environments, remains largely unexplored. This paper introduces a benchmark for evaluating the long-context reasoning capabilities of WebAgents through sequentially dependent subtasks that require retrieval and application of information from extended interaction histories. We develop a novel evaluation framework that simulates multi-session user interactions by injecting irrelevant task trajectories between dependent subtasks, creating contexts ranging from 25,000 to 150,000 tokens. Through extensive evaluation of four popular models, Claude-3.7, GPT-4.1, Llama 4, and o4-mini, we observe a dramatic performance degradation as context length increases, with success rates dropping from 40-50% in baseline conditions to less than 10% in long-context scenarios. Our detailed error analysis reveals that agents primarily fail by getting stuck in loops and losing track of the original task objectives. We further propose an implicit RAG approach that provides modest improvements by generating task-relevant summaries, though fundamental limitations in long-context reasoning persist. These findings highlight critical challenges for deploying WebAgents in realistic, long-term user interaction scenarios and provide insights for developing more robust agent architectures capable of maintaining coherent task execution across extended contexts.
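The benchmark-construction idea (interleaving distractor trajectories between sequentially dependent subtasks) reduces to a simple list operation; the step records below are placeholder strings, not items from the actual benchmark.

```python
# Interleave irrelevant trajectories between dependent subtasks so the agent
# must retrieve earlier information across a long context.

def build_long_context_episode(dependent_subtasks, distractor_trajectories):
    """Interleave distractors between dependent subtasks, keeping order."""
    episode = []
    for i, subtask in enumerate(dependent_subtasks):
        episode.append(subtask)
        if i < len(distractor_trajectories):
            episode.extend(distractor_trajectories[i])
    return episode

episode = build_long_context_episode(
    ["find hotel price", "book the cheaper hotel"],
    [["search weather", "read news"]],
)
```

Scaling the distractor trajectories is what stretches the context from 25k to 150k tokens while the dependent information stays fixed.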
【11】Decoding Large Language Diffusion Models with Foreseeing Movement
链接:https://arxiv.org/abs/2512.04135
作者:Yichuan Mo, Quan Chen, Mingjie Li, Zeming Wei, Yisen Wang
摘要:Large Language Diffusion Models (LLDMs) benefit from a flexible decoding mechanism that enables parallelized inference and controllable generation relative to autoregressive models. Yet this flexibility introduces a critical challenge: inference performance becomes highly sensitive to the decoding order of tokens. Existing heuristic methods, however, focus mainly on local effects while overlooking long-term impacts. To address this limitation, we propose the Foreseeing Decoding Method (FDM), a novel approach that integrates both local and global considerations to unlock the full potential of LLDMs, employing a search-based strategy to enable effective optimization in discrete spaces. Furthermore, by analyzing the consistency of chosen tokens over the full decoding process, we develop a variant, FDM with Acceleration (FDM-A), which restricts deep exploration to critical steps identified as exploration-and-balance circumstances. Extensive experiments across diverse benchmarks and model architectures validate the scalability of FDM and demonstrate the superior efficiency-performance trade-off achieved by FDM-A. Our work may provide a principled step toward more powerful decoding methods for LLDMs.
【12】ASCIIBench: Evaluating Language-Model-Based Understanding of Visually-Oriented Text
链接:https://arxiv.org/abs/2512.04125
作者:Kerry Luo, Michael Fu, Joshua Peguero, Husnain Malik, Anvay Patil, Joyce Lin, Megan Van Overborg, Ryan Sarmiento, Kevin Zhu
备注:Accepted to the Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025): LLM Evaluation Workshop & Multimodal Algorithmic Reasoning Workshop
摘要:Large language models (LLMs) have demonstrated several emergent behaviors with scale, including reasoning and fluency in long-form text generation. However, they continue to struggle with tasks requiring precise spatial and positional reasoning. ASCII art, a symbolic medium in which characters encode structure and form, provides a unique probe of this limitation. We introduce ASCIIBench, a novel benchmark for evaluating both the generation and classification of ASCII-text images. ASCIIBench consists of a filtered dataset of 5,315 class-labeled ASCII images and is, to our knowledge, the first publicly available benchmark of its kind. Alongside the dataset, we release weights for a fine-tuned CLIP model adapted to capture ASCII structure, enabling the evaluation of LLM-generated ASCII art. Our analysis shows that cosine similarity over CLIP embeddings fails to separate most ASCII categories, yielding chance-level performance even for low-variance classes. In contrast, classes with high internal mean similarity exhibit clear discriminability, revealing that the bottleneck lies in representation rather than generation variance. These findings position ASCII art as a stress test for multimodal representations and motivate the development of new embedding methods and evaluation metrics tailored to symbolic visual modalities. All resources are available at https://github.com/ASCIIBench/ASCIIBench.
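The similarity measure whose failure mode on ASCII classes is analyzed above is plain cosine similarity over embedding vectors; a plain-Python sketch follows (the paper applies it to CLIP embeddings, which are not computed here).

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)
```

Scores near 1 mean near-parallel embeddings; the paper's finding is that most ASCII categories collapse into this regime, so the metric cannot separate them.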
Graph-related (graph learning | graph neural networks | graph optimization, etc.) (5 papers)
【1】Contract-Driven QoE Auditing for Speech and Singing Services: From MOS Regression to Service Graphs
链接:https://arxiv.org/abs/2512.04827
作者:Wenzhang Du
备注:11 pages, 3 figures
摘要:Subjective mean opinion scores (MOS) remain the de facto target for non-intrusive speech and singing quality assessment. However, MOS is a scalar that collapses heterogeneous user expectations, ignores service-level objectives, and is difficult to compare across deployment graphs. We propose a contract-driven QoE auditing framework: each service graph G is evaluated under a set of human-interpretable experience contracts C, yielding a contract-level satisfaction vector Q(G, C). We show that (i) classical MOS regression is a special case with a degenerate contract set, (ii) contract-driven quality is more stable than MOS under graph view transformations (e.g., pooling by system vs. by system type), and (iii) the effective sample complexity of learning contracts is governed by contract semantics rather than merely the dimensionality of C. We instantiate the framework on URGENT2024 MOS (6.9k speech utterances with raw rating vectors) and SingMOS v1 (7,981 singing clips; 80 systems). On URGENT, we train a contract-aware neural auditor on self-supervised WavLM embeddings; on SingMOS, we perform contract-driven graph auditing using released rating vectors and metadata without decoding audio. Empirically, our auditor matches strong MOS predictors in MOS accuracy while providing calibrated contract probabilities; on SingMOS, Q(G, C) exhibits substantially smaller cross-view drift than raw MOS and graph-only baselines; on URGENT, difficulty curves reveal that mis-specified "simple" contracts can be harder to learn than richer but better-aligned contract sets.
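One minimal reading of the satisfaction vector Q(G, C): each contract is a predicate over an item's raw ratings, and its entry in Q is the fraction of items in the service-graph view that satisfy it. The two contracts below are invented for illustration, not the paper's contract set.

```python
# Contract-level satisfaction vector: per-contract pass rates over raw ratings.

def satisfaction_vector(ratings_per_item, contracts):
    """contracts: list of (name, predicate over one item's rating list)."""
    q = {}
    for name, pred in contracts:
        hits = sum(1 for ratings in ratings_per_item if pred(ratings))
        q[name] = hits / len(ratings_per_item)
    return q

contracts = [
    ("mean_mos_at_least_3", lambda r: sum(r) / len(r) >= 3.0),
    ("no_rating_below_2",   lambda r: min(r) >= 2),
]
Q = satisfaction_vector([[4, 3, 5], [1, 2, 2]], contracts)
```

Collapsing Q to a single mean-rating predicate recovers plain MOS regression, which is the "degenerate contract set" special case claimed in (i).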
【2】Contract-Governed Training for Earth Observation: Observed Service Agreement Graphs and Coverage-Accuracy Trade-offs
链接:https://arxiv.org/abs/2512.04644
作者:Wenzhang Du
备注:9 pages, 2 figures
摘要:Earth observation (EO) models are frequently trained under implicit sampling policies that optimize global accuracy but provide no explicit guarantees on who (which regions, classes, or mission-critical strata) is being served throughout training. This paper introduces a contract-governed training paradigm for EO in which training samples are grouped into service contracts -- semantically meaningful units such as (dataset, region, rare-crop indicator) -- and each contract is assigned a target service share. We instantiate this paradigm as an Observed Service Agreement Graph (OSAG), a lightweight governance layer that (i) monitors contract-level exposure (coverage) during optimization, (ii) drives empirical coverage toward target shares via contract-normalized sampling weights, and (iii) exposes explicit accuracy-governance trade-offs through two knobs: a sampling mixture coefficient alpha and a contract-regularization weight lambda_C. We provide a compact theory in a toy setting: OSAG sampling concentrates empirical coverage on the targets; coverage deviations upper-bound service-risk deviations; and contract design (coarse vs. fine) modulates governance cost. Experiments on AVIRIS hyperspectral scenes (Indian Pines plus Salinas) and multispectral Sentinel-2 EuroSAT demonstrate that OSAG can substantially reduce priority coverage error while maintaining global accuracy and improving high-priority accuracy. A EuroSAT coarse-vs-fine contract ablation further evidences how semantically refined contracts can reduce the accuracy cost per unit of governance improvement.
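Contract-normalized sampling weights, as described in (ii), can be sketched as the ratio of a contract's target service share to its current empirical coverage, so under-served contracts are sampled more often. Variable names and the epsilon guard are illustrative, not from the OSAG implementation.

```python
# Weight each training sample by target_share / empirical_share of its contract.

def contract_weights(sample_contracts, target_share, empirical_share, eps=1e-9):
    return [target_share[c] / (empirical_share[c] + eps)
            for c in sample_contracts]

w = contract_weights(
    ["rare_crop", "common_crop"],
    {"rare_crop": 0.5, "common_crop": 0.5},
    {"rare_crop": 0.1, "common_crop": 0.9},
)
# the under-served "rare_crop" contract gets the larger weight
```

At the fixed point, empirical coverage equals the target shares and all weights are 1, which is the concentration behavior the toy theory describes.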
【3】Explainable Graph Representation Learning via Graph Pattern Analysis
链接:https://arxiv.org/abs/2512.04530
作者:Xudong Wang, Ziheng Sun, Chris Ding, Jicong Fan
备注:Full version, with appendix, of the paper published in the Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI-25), Main Track
摘要:Explainable artificial intelligence (XAI) is an important area in the AI community, and interpretability is crucial for building robust and trustworthy AI models. While previous work has explored model-level and instance-level explainable graph learning, there has been limited investigation into explainable graph representation learning. In this paper, we focus on representation-level explainable graph learning and ask a fundamental question: What specific information about a graph is captured in graph representations? Our approach is inspired by graph kernels, which evaluate graph similarities by counting substructures within specific graph patterns. Although the pattern-counting vector can serve as an explainable representation, it has limitations, such as ignoring node features and being high-dimensional. To address these limitations, we introduce a framework (PXGL-GNN) for learning and explaining graph representations through graph pattern analysis. We start by sampling graph substructures of various patterns. Then, we learn the representations of these patterns and combine them using a weighted sum, where the weights indicate the importance of each graph pattern's contribution. We also provide theoretical analyses of our methods, including robustness and generalization. In our experiments, we show how to learn and explain graph representations for real-world data using pattern analysis. Additionally, we compare our method against multiple baselines in both supervised and unsupervised learning tasks to demonstrate its effectiveness.
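The pattern-counting view of graph representations can be made concrete with a toy example: count occurrences of small substructures (here, edges and triangles in an adjacency-set graph) and take a weighted combination, where the weights signal each pattern's importance. PXGL-GNN learns richer per-pattern representations; this only illustrates the counting idea.

```python
# Weighted pattern-count vector over a graph given as {node: set(neighbors)}.

def count_edges(adj):
    return sum(len(nbrs) for nbrs in adj.values()) // 2

def count_triangles(adj):
    nodes = sorted(adj)
    count = 0
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if v not in adj[u]:
                continue
            for w in nodes:
                if w > v and w in adj[u] and w in adj[v]:
                    count += 1
    return count

def pattern_representation(adj, weights):
    """Weighted pattern counts; weights signal each pattern's importance."""
    counts = [count_edges(adj), count_triangles(adj)]
    return [w * c for w, c in zip(weights, counts)]

triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
rep = pattern_representation(triangle, [1.0, 2.0])
```

Each coordinate of the representation is directly interpretable ("how much of pattern k, and how much it matters"), which is exactly the explainability property the abstract claims for the learned variant.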
【4】GraphBench: Next-generation graph learning benchmarking
标题:GraphBench:下一代图形学习基准测试
链接:https://arxiv.org/abs/2512.04475
作者:Timo Stoll,Chendi Qian,Ben Finkelshtein,Ali Parviz,Darius Weber,Fabrizio Frasca,Hadar Shavit,Antoine Siraudin,Arman Mielke,Marie Anastacio,Erik Müller,Maya Bechler-Speicher,Michael Bronstein,Mikhail Galkin,Holger Hoos,Mathias Niepert,Bryan Perozzi,Jan Tönshoff,Christopher Morris
摘要:图上的机器学习最近在各个领域取得了令人印象深刻的进展,包括分子性质预测和芯片设计。然而,基准做法仍然支离破碎,往往依赖于狭隘的、针对具体任务的数据集和不一致的评价协议,这阻碍了可重复性和更广泛的进展。为了解决这个问题,我们引入了GraphBench,这是一个全面的基准测试套件,涵盖了不同的领域和预测任务,包括节点级,边缘级,图形级和生成设置。GraphBench提供了标准化的评估协议--具有一致的数据集分割和性能指标,可以解释分布外的泛化--以及统一的超参数调优框架。此外,我们使用消息传递神经网络和图形Transformer模型对GraphBench进行基准测试,提供原则性基线并建立参考性能。更多详情请访问www.graphbench.io。
摘要:Machine learning on graphs has recently achieved impressive progress in various domains, including molecular property prediction and chip design. However, benchmarking practices remain fragmented, often relying on narrow, task-specific datasets and inconsistent evaluation protocols, which hampers reproducibility and broader progress. To address this, we introduce GraphBench, a comprehensive benchmarking suite that spans diverse domains and prediction tasks, including node-level, edge-level, graph-level, and generative settings. GraphBench provides standardized evaluation protocols -- with consistent dataset splits and performance metrics that account for out-of-distribution generalization -- as well as a unified hyperparameter tuning framework. Additionally, we benchmark GraphBench using message-passing neural networks and graph transformer models, providing principled baselines and establishing a reference performance. See www.graphbench.io for further details.
【5】RGE-GCN: Recursive Gene Elimination with Graph Convolutional Networks for RNA-seq based Early Cancer Detection
标题:RGE-GCN:利用图卷积网络进行递归基因消除,用于基于RNA-seq的早期癌症检测
链接:https://arxiv.org/abs/2512.04333
作者:Shreyas Shende,Varsha Narayanan,Vishal Fenn,Yiran Huang,Dincer Goksuluk,Gaurav Choudhary,Melih Agraz,Mengjia Xu
备注:12 pages, 2 figures
摘要:癌症的早期检测在提高生存率方面发挥着关键作用,但从RNA-seq数据中识别可靠的生物标志物仍然是一个重大挑战。数据是高维的,传统的统计方法往往无法捕捉基因之间的复杂关系。在这项研究中,我们介绍了RGE-GCN(递归基因消除与图卷积网络),一个在单一流程中结合了特征选择和分类的框架。我们的方法从基因表达谱中构建一个图,使用图卷积网络对癌症与正常样本进行分类,并应用积分梯度(Integrated Gradients)来突出显示信息量最大的基因。通过递归地去除相关性较低的基因,该模型收敛到一组紧凑的生物标志物,这些生物标志物既可解释又可预测。我们在合成数据以及肺癌、肾癌和宫颈癌的真实世界RNA-seq队列上评估了RGE-GCN。在所有数据集中,该方法始终比DESeq2、edgeR和limma-voom等标准工具获得更高的准确性和F1分数。重要的是,所选基因与众所周知的癌症通路一致,包括PI3K-AKT、MAPK、SUMO化和免疫调节。这些结果表明,RGE-GCN有望成为基于RNA-seq的早期癌症检测和生物标志物发现的可推广方法(https://rce-gcn.streamlit.app/)。
摘要:Early detection of cancer plays a key role in improving survival rates, but identifying reliable biomarkers from RNA-seq data is still a major challenge. The data are high-dimensional, and conventional statistical methods often fail to capture the complex relationships between genes. In this study, we introduce RGE-GCN (Recursive Gene Elimination with Graph Convolutional Networks), a framework that combines feature selection and classification in a single pipeline. Our approach builds a graph from gene expression profiles, uses a Graph Convolutional Network to classify cancer versus normal samples, and applies Integrated Gradients to highlight the most informative genes. By recursively removing less relevant genes, the model converges to a compact set of biomarkers that are both interpretable and predictive. We evaluated RGE-GCN on synthetic data as well as real-world RNA-seq cohorts of lung, kidney, and cervical cancers. Across all datasets, the method consistently achieved higher accuracy and F1-scores than standard tools such as DESeq2, edgeR, and limma-voom. Importantly, the selected genes aligned with well-known cancer pathways including PI3K-AKT, MAPK, SUMOylation, and immune regulation. These results suggest that RGE-GCN shows promise as a generalizable approach for RNA-seq based early cancer detection and biomarker discovery (https://rce-gcn.streamlit.app/ ).
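To make the recursive-elimination idea concrete, here is a minimal sketch of the outer loop in Python. Everything below is illustrative rather than the authors' code: `toy_score` stands in for training the GCN and attributing gene importance with Integrated Gradients, and the drop fraction and target size are arbitrary choices.

```python
import numpy as np

def recursive_gene_elimination(X, y, score_genes, target_size=8, drop_frac=0.5):
    """Outer loop of recursive gene elimination (sketch): score the current
    gene subset, drop the least informative fraction, and repeat until
    target_size genes remain."""
    idx = np.arange(X.shape[1])
    while idx.size > target_size:
        scores = score_genes(X, y, idx)
        keep = max(target_size, int(np.ceil(idx.size * (1 - drop_frac))))
        idx = idx[np.argsort(scores)[::-1][:keep]]  # keep the highest-scoring genes
    return idx

# Toy stand-in for "train a GCN, attribute with Integrated Gradients":
# absolute difference of class means for each gene in the current subset.
def toy_score(X, y, idx):
    Xs = X[:, idx]
    return np.abs(Xs[y == 1].mean(axis=0) - Xs[y == 0].mean(axis=0))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 64))            # 100 samples, 64 "genes"
y = (rng.random(100) < 0.5).astype(int)
X[y == 1, :4] += 2.0                      # genes 0-3 carry the class signal
selected = recursive_gene_elimination(X, y, toy_score)
```

In the paper's pipeline the scoring step would retrain the GCN on the surviving genes at each round; the loop structure above is the part the abstract describes.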
Transformer(4篇)
【1】Efficient Generative Transformer Operators For Million-Point PDEs
标题:百万点偏微分方程的有效生成Transformer算子
链接:https://arxiv.org/abs/2512.04974
作者:Armand Kassaï Koupaï,Lise Le Boudec,Patrick Gallinari
摘要:我们介绍了ECHO,一个用于生成百万点PDE轨迹的Transformer算子框架。虽然现有的神经算子(NO)已经在求解偏微分方程方面显示出前景,但由于在密集网格上可扩展性差、动态展开过程中的误差积累以及特定于任务的设计,它们在实践中仍然受到限制。ECHO通过三项关键创新应对这些挑战。(i)它采用分层卷积编码-解码架构,实现100$\times$的时空压缩,同时保持网格点上的保真度。(ii)它结合了一种训练和适应策略,能够从稀疏输入网格生成高分辨率PDE解。(iii)它采用生成式建模范式,学习完整的轨迹段,减轻长程误差漂移。该训练策略将表示学习与下游任务监督解耦,使模型能够处理多个任务,如轨迹生成、正问题和逆问题以及插值。该生成模型进一步支持条件生成和无条件生成。我们在具有复杂几何形状、高频动态和长时间范围的各种PDE系统上的百万点模拟中展示了最先进的性能。
摘要:We introduce ECHO, a transformer-operator framework for generating million-point PDE trajectories. While existing neural operators (NOs) have shown promise for solving partial differential equations, they remain limited in practice due to poor scalability on dense grids, error accumulation during dynamic unrolling, and task-specific design. ECHO addresses these challenges through three key innovations. (i) It employs a hierarchical convolutional encode-decode architecture that achieves a 100 $\times$ spatio-temporal compression while preserving fidelity on mesh points. (ii) It incorporates a training and adaptation strategy that enables high-resolution PDE solution generation from sparse input grids. (iii) It adopts a generative modeling paradigm that learns complete trajectory segments, mitigating long-horizon error drift. The training strategy decouples representation learning from downstream task supervision, allowing the model to tackle multiple tasks such as trajectory generation, forward and inverse problems, and interpolation. The generative model further supports both conditional and unconditional generation. We demonstrate state-of-the-art performance on million-point simulations across diverse PDE systems featuring complex geometries, high-frequency dynamics, and long-term horizons.
【2】Rethinking the Use of Vision Transformers for AI-Generated Image Detection
标题:重新思考视觉Transformer在AI生成图像检测中的应用
链接:https://arxiv.org/abs/2512.04969
作者:NaHyeon Park,Kunhee Kim,Junsuk Choe,Hyunjung Shim
备注:Code: https://github.com/nahyeonkaty/mold
摘要:源自CLIP-ViT的丰富特征表示已被广泛用于AI生成图像检测。虽然大多数现有方法主要利用最后一层的特征,我们系统地分析了各层特征对这项任务的贡献。我们的研究表明,较早的层提供更具局部性和泛化性的特征,在检测任务中往往超过最后一层特征的性能。此外,我们发现不同的层捕获数据的不同方面,每个层都对AI生成图像检测做出独特的贡献。受这些发现的启发,我们引入了一种新的自适应方法,称为MoLD,它使用基于门控的机制动态地集成来自多个ViT层的特征。对GAN和扩散生成图像的广泛实验表明,MoLD显著提高了检测性能,增强了跨不同生成模型的泛化能力,并在现实世界场景中表现出鲁棒性。最后,我们通过将其成功应用于其他预训练ViT(如DINOv2)来说明我们方法的可扩展性和多功能性。
摘要:Rich feature representations derived from CLIP-ViT have been widely utilized in AI-generated image detection. While most existing methods primarily leverage features from the final layer, we systematically analyze the contributions of layer-wise features to this task. Our study reveals that earlier layers provide more localized and generalizable features, often surpassing the performance of final-layer features in detection tasks. Moreover, we find that different layers capture distinct aspects of the data, each contributing uniquely to AI-generated image detection. Motivated by these findings, we introduce a novel adaptive method, termed MoLD, which dynamically integrates features from multiple ViT layers using a gating-based mechanism. Extensive experiments on both GAN- and diffusion-generated images demonstrate that MoLD significantly improves detection performance, enhances generalization across diverse generative models, and exhibits robustness in real-world scenarios. Finally, we illustrate the scalability and versatility of our approach by successfully applying it to other pre-trained ViTs, such as DINOv2.
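The gating-based fusion described above can be sketched as a softmax-weighted sum over per-layer pooled features. This is a hypothetical minimal version, not the MoLD implementation: `W_gate` stands in for the learned gating parameters, and the pooling of each ViT layer to one vector is assumed.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def gated_layer_fusion(layer_feats, W_gate):
    """Gate over ViT layers (sketch): score each layer's pooled feature
    vector, turn the scores into a convex combination, and fuse.
    layer_feats: (L, D) array, one pooled feature vector per layer."""
    scores = layer_feats @ W_gate        # (L,) one relevance score per layer
    weights = softmax(scores)            # convex weights over the L layers
    fused = weights @ layer_feats        # (D,) fused representation
    return fused, weights

rng = np.random.default_rng(1)
feats = rng.normal(size=(12, 16))        # e.g. 12 ViT layers, 16-dim pooled features
fused, w = gated_layer_fusion(feats, rng.normal(size=16))
```

A learned gate of this form lets the detector lean on earlier layers when their localized features are more discriminative, which is the behavior the abstract reports.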
【3】Tokenizing Buildings: A Transformer for Layout Synthesis
标题:代币化建筑:布局合成的Transformer
链接:https://arxiv.org/abs/2512.04832
作者:Manuel Ladron de Guevara,Jinmo Rhee,Ardavan Bidgoli,Vaidas Razgaitis,Michael Bergin
备注:8 pages, 1 page References, 4 figures
摘要:我们介绍了小型建筑模型(SBM),一个基于transformer的建筑信息模型(BIM)场景布局综合体系结构。我们解决的问题,如何标记建筑物统一到序列的异构特征集的建筑元素,同时保留组成结构。这些特征集被表示为捕获房间属性的稀疏属性-特征矩阵。然后,我们设计了一个统一的嵌入模块,学习分类和可能相关的连续特征组的联合表示。最后,我们在两种模式下训练单个Transformer骨干:产生高保真房间嵌入的仅编码器路径,以及用于房间实体的自回归预测的编码器-解码器管道,称为数据驱动实体预测(DDEP)。跨检索和生成布局合成的实验表明,SBM学习紧凑的房间嵌入,可靠地按类型和拓扑聚类,实现强大的语义检索。在DDEP模式下,SBM产生功能合理的布局,减少碰撞和边界侵犯,提高导航性。
摘要:We introduce Small Building Model (SBM), a Transformer-based architecture for layout synthesis in Building Information Modeling (BIM) scenes. We address the question of how to tokenize buildings by unifying heterogeneous feature sets of architectural elements into sequences while preserving compositional structure. Such feature sets are represented as a sparse attribute-feature matrix that captures room properties. We then design a unified embedding module that learns joint representations of categorical and possibly correlated continuous feature groups. Lastly, we train a single Transformer backbone in two modes: an encoder-only pathway that yields high-fidelity room embeddings, and an encoder-decoder pipeline for autoregressive prediction of room entities, referred to as Data-Driven Entity Prediction (DDEP). Experiments across retrieval and generative layout synthesis show that SBM learns compact room embeddings that reliably cluster by type and topology, enabling strong semantic retrieval. In DDEP mode, SBM produces functionally sound layouts, with fewer collisions and boundary violations and improved navigability.
【4】GRASP: GRouped Activation Shared Parameterization for Parameter-Efficient Fine-Tuning and Robust Inference of Transformers
标题:GRASP:用于Transformer参数高效微调和鲁棒推理的分组激活共享参数化
链接:https://arxiv.org/abs/2512.04296
作者:Malyaban Bal,Abhronil Sengupta
备注:Under Review
摘要:参数高效微调(PEFT)通过仅更新大型预训练模型中的一小部分参数,为全模型自适应提供了一种可扩展的替代方案。我们引入GRASP(分组激活共享参数化),一个轻量级的PEFT框架,它将选定层的D维令牌表示划分为K << D个组,并为每个组学习共享的缩放和移位向量。这种分组调制显著减少了可训练参数的数量,同时保留了模型学习特定任务特征的能力。在此基础上,我们进一步提出了StochGRASP,它将高斯分布作为对预训练权重的扰动而不是确定性值来学习。这种概率参数化以及噪声感知损失函数公式能够对编程权重中的硬件级可变性进行建模,并显著提高非理想推理条件下的鲁棒性,这是在基于边缘的新兴AI硬件上部署的重要要求。在GLUE(RoBERTa-base和RoBERTa-large)和E2E NLG(GPT-2 Medium)上,GRASP匹配或超过了已有PEFT方法的性能,同时与LoRA和BitFit相比,可训练参数减少了一个数量级。在不同的噪声水平下,StochGRASP始终优于确定性变体,证明了其适用于节能和易受噪声影响的硬件平台。
摘要:Parameter-efficient fine-tuning (PEFT) provides a scalable alternative to full-model adaptation by updating only a small subset of parameters in large pre-trained models. We introduce GRASP - GRouped Activation Shared Parameterization - a lightweight PEFT framework that partitions the D-dimensional token representations of selected layers into K << D groups and learns a shared scaling and shifting vector for each group. This grouped modulation reduces the number of trainable parameters significantly while preserving the ability of the model to learn task-specific features. Building on this formulation, we further propose StochGRASP, which learns Gaussian distributions as perturbations to the pre-trained weights rather than deterministic values. This probabilistic parameterization along with a noise-aware loss function formulation enables modelling hardware-level variability in programmed weights and significantly improves robustness under non-ideal inference conditions-an important requirement for deployment on edge-based emerging AI hardware. Across GLUE (RoBERTa-base & RoBERTa-large) and E2E NLG (GPT-2 Medium), GRASP matches or exceeds the performance of established PEFT methods while achieving an order of magnitude reduction in trainable parameters compared to LoRA and BitFit. Under varying levels of noise, StochGRASP consistently outperforms deterministic variants, demonstrating its suitability for energy-efficient and noise-prone hardware platforms.
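The grouped modulation at the core of GRASP reduces to a reshape plus a broadcast affine transform. A minimal numpy sketch follows; the function name and shapes are illustrative, not the authors' code.

```python
import numpy as np

def grasp_modulate(h, scale, shift, K):
    """GRASP-style grouped modulation (sketch): split the D-dim token
    representation into K contiguous groups and apply one shared
    (scale, shift) pair per group. Only 2K parameters are trained,
    versus 2D for a full per-dimension affine."""
    T, D = h.shape
    assert D % K == 0
    g = h.reshape(T, K, D // K)                       # (tokens, groups, dims/group)
    g = g * scale[None, :, None] + shift[None, :, None]
    return g.reshape(T, D)

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 768))                         # 5 tokens, D = 768
K = 16                                                # K << D groups
out = grasp_modulate(h, np.full(K, 2.0), np.zeros(K), K)
```

With D = 768 and K = 16 this layer trains 32 parameters instead of 1536, which is the order-of-magnitude saving the abstract reports relative to per-dimension methods.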
GAN|对抗|攻击|生成相关(4篇)
【1】NeuralRemaster: Phase-Preserving Diffusion for Structure-Aligned Generation
标题:NeuralRemaster:面向结构对齐生成的保相扩散
链接:https://arxiv.org/abs/2512.05106
作者:Yu Zeng,Charles Ochoa,Mingyuan Zhou,Vishal M. Patel,Vitor Guizilini,Rowan McAllister
摘要:标准扩散使用高斯噪声破坏数据,其傅立叶系数具有随机幅度和随机相位。虽然对无条件或文本到图像生成有效,但破坏相位分量会破坏空间结构,使其不适合需要几何一致性的任务,例如重新渲染、模拟增强和图像到图像转换。我们引入了保相扩散φ-PD,这是一种与模型无关的扩散过程重构,它在保持输入相位的同时随机化幅度,从而无需架构更改或额外参数即可实现结构对齐生成。我们进一步提出了频率选择性结构化(FSS)噪声,它通过单一的频率截止参数提供对结构刚度的连续控制。φ-PD不增加推理时间成本,并且与任何图像或视频扩散模型兼容。在照片级真实感和风格化的重新渲染以及面向驾驶规划器的模拟到真实增强中,φ-PD都产生可控的、空间对齐的结果。当应用于CARLA模拟器时,φ-PD将CARLA到Waymo规划器的性能提高了50%。该方法与现有的条件化方法互补,并广泛适用于图像到图像和视频到视频生成。视频、更多示例和代码见我们的项目主页:https://yuzeng-at-tri.github.io/ppd-page/。
摘要:Standard diffusion corrupts data using Gaussian noise whose Fourier coefficients have random magnitudes and random phases. While effective for unconditional or text-to-image generation, corrupting phase components destroys spatial structure, making it ill-suited for tasks requiring geometric consistency, such as re-rendering, simulation enhancement, and image-to-image translation. We introduce Phase-Preserving Diffusion φ-PD, a model-agnostic reformulation of the diffusion process that preserves input phase while randomizing magnitude, enabling structure-aligned generation without architectural changes or additional parameters. We further propose Frequency-Selective Structured (FSS) noise, which provides continuous control over structural rigidity via a single frequency-cutoff parameter. φ-PD adds no inference-time cost and is compatible with any diffusion model for images or videos. Across photorealistic and stylized re-rendering, as well as sim-to-real enhancement for driving planners, φ-PD produces controllable, spatially aligned results. When applied to the CARLA simulator, φ-PD improves CARLA-to-Waymo planner performance by 50%. The method is complementary to existing conditioning approaches and broadly applicable to image-to-image and video-to-video generation. Videos, additional examples, and code are available on our project page: https://yuzeng-at-tri.github.io/ppd-page/.
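The magnitude-randomizing, phase-preserving corruption can be illustrated directly with FFTs. The sketch below is a plausible reading of the abstract, not the released implementation: it swaps in random Fourier magnitudes while keeping the input image's phases, so the resulting noise retains the input's spatial layout.

```python
import numpy as np

def phase_preserving_noise(x, rng):
    """Build a noise image that keeps the input's Fourier *phase* but
    randomizes the Fourier *magnitude*, so spatial structure (edges,
    layout) survives the corruption (sketch of the φ-PD idea)."""
    F = np.fft.fft2(x)
    phase = np.angle(F)
    # Random magnitudes drawn from the spectrum of white Gaussian noise.
    rand_mag = np.abs(np.fft.fft2(rng.normal(size=x.shape)))
    noise = np.fft.ifft2(rand_mag * np.exp(1j * phase)).real
    return noise

rng = np.random.default_rng(0)
img = np.zeros((32, 32))
img[:, 16:] = 1.0                       # a vertical edge: structure to preserve
n = phase_preserving_noise(img, rng)
```

Because both the phase (from a real image) and the random magnitudes are conjugate-symmetric, the inverse FFT is real up to numerical error, and the edge location remains visible in the noise field.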
【2】TV2TV: A Unified Framework for Interleaved Language and Video Generation
标题:TV2TV:交错语言与视频生成的统一框架
链接:https://arxiv.org/abs/2512.05103
作者:Xiaochuang Han,Youssef Emad,Melissa Hall,John Nguyen,Karthik Padthe,Liam Robbins,Amir Bar,Delong Chen,Michal Drozdzal,Maha Elbayad,Yushi Hu,Shang-Wen Li,Sreya Dutta Roy,Jakob Verbeek,XuDong Wang,Marjan Ghazvininejad,Luke Zettlemoyer,Emily Dinan
摘要:视频生成模型正在快速发展,但在处理需要大量语义分支或反复进行"接下来应该发生什么"的高层推理的复杂视频输出时仍然存在困难。在本文中,我们介绍了一类新的全方位视频文本模型,集成了最近LM推理进展的想法,以解决这一挑战。更具体地说,我们提出了TV2TV,一个统一的生成建模框架,它将视频生成分解成一个交错的文本和视频生成过程。TV2TV使用Mixture-of-Transformers(MoT)架构联合学习语言建模(下一个令牌预测)和视频流匹配(下一帧预测)。在推理时,TV2TV决定何时在生成文本和视频帧之间交替,允许模型在"以像素行动"生成帧之前对后续内容进行"以文字思考"。这种设计将决定接下来应该发生什么的大部分责任转移给语言建模塔,从而提高了生成视频的视觉质量和提示对齐。它还实现了细粒度的可控性,允许用户在过程中的任何时候通过文本干预来修改视频生成轨迹。在视频游戏数据的受控实验中,TV2TV在视觉质量和可控性方面都表现出实质性的改进。TV2TV还可以扩展到自然视频:我们使用视觉语言模型(VLM)为体育视频补充交错的自然语言动作描述,并在这个语料库上训练TV2TV,得到了很强的视觉质量和提示对齐,展示了模型推理和生成复杂现实世界动作序列的能力。总之,这些结果表明TV2TV是迈向具有开放式文本推理与控制能力的视频生成的有前途的一步。
摘要:Video generation models are rapidly advancing, but can still struggle with complex video outputs that require significant semantic branching or repeated high-level reasoning about what should happen next. In this paper, we introduce a new class of omni video-text models that integrate ideas from recent LM reasoning advances to address this challenge. More specifically, we present TV2TV, a unified generative modeling framework which decomposes video generation into an interleaved text and video generation process. TV2TV jointly learns language modeling (next-token prediction) and video flow matching (next-frame prediction) using a Mixture-of-Transformers (MoT) architecture. At inference time, TV2TV decides when to alternate between generating text and video frames, allowing the model to "think in words" about subsequent content before ``acting in pixels'' to produce frames. This design offloads much of the responsibility for deciding what should happen next to the language modeling tower, enabling improved visual quality and prompt alignment of generated videos. It also enables fine-grained controllability, allowing users to modify the video generation trajectory through text interventions at any point in the process. In controlled experiments on video game data, TV2TV demonstrates substantial improvements in both visual quality and controllability. TV2TV also scales to natural videos, as we show by augmenting sports videos with interleaved natural language action descriptions using vision-language models (VLMs). Training TV2TV on this corpus yields strong visual quality and prompt alignment, showcasing the model's ability to reason about and generate complex real-world action sequences. Together, these results highlight TV2TV as a promising step toward video generation with open-ended textual reasoning and control.
【3】TimesNet-Gen: Deep Learning-based Site Specific Strong Motion Generation
标题:TimesNet-Gen:基于深度学习的特定场地强地震动生成
链接:https://arxiv.org/abs/2512.04694
作者:Baris Yilmaz,Bevan Deniz Cilgin,Erdem Akagündüz,Salih Tileylioglu
摘要:有效降低地震风险依赖于准确的特定场地评估。这需要能够表示当地场地条件对地震动特性影响的模型。在这种情况下,从地震动记录中学习场地控制特征的数据驱动方法提供了一个很有前景的方向。我们研究从时域加速度计记录生成强地震动的问题,并介绍了TimesNet-Gen,一种时域条件生成器。该方法使用特定于台站的潜在瓶颈。我们通过比较每个台站真实记录与生成记录之间的HVSR曲线和基本场地频率$f_0$分布来评估生成质量,并基于$f_0$分布混淆矩阵用一个分数来总结台站特异性。TimesNet-Gen实现了很强的逐台站对齐,在特定场地强震动合成方面与基于频谱图的条件VAE基线相比表现更优。我们的代码可通过https://github.com/brsylmz23/TimesNet-Gen获得。
摘要:Effective earthquake risk reduction relies on accurate site-specific evaluations. This requires models that can represent the influence of local site conditions on ground motion characteristics. In this context, data driven approaches that learn site controlled signatures from recorded ground motions offer a promising direction. We address strong ground motion generation from time-domain accelerometer records and introduce the TimesNet-Gen, a time-domain conditional generator. The approach uses a station specific latent bottleneck. We evaluate generation by comparing HVSR curves and fundamental site-frequency $f_0$ distributions between real and generated records per station, and summarize station specificity with a score based on the $f_0$ distribution confusion matrices. TimesNet-Gen achieves strong station-wise alignment and compares favorably with a spectrogram-based conditional VAE baseline for site-specific strong motion synthesis. Our codes are available via https://github.com/brsylmz23/TimesNet-Gen.
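The evaluation step, extracting the fundamental site frequency $f_0$ from an HVSR curve, can be sketched as a simple peak pick. The toy example below (a synthetic Gaussian peak; names and numbers are illustrative) is not the paper's exact procedure, which compares $f_0$ distributions per station.

```python
import numpy as np

def fundamental_frequency(freqs, hvsr):
    """Take the fundamental site frequency f0 as the frequency of the
    HVSR curve's main peak (sketch of the evaluation idea)."""
    return freqs[np.argmax(hvsr)]

freqs = np.linspace(0.1, 20.0, 200)                       # Hz
hvsr = np.exp(-0.5 * ((freqs - 2.5) / 0.4) ** 2) + 0.1    # synthetic peak at 2.5 Hz
f0 = fundamental_frequency(freqs, hvsr)
```

Repeating this per record, for real and generated motions separately, yields the per-station $f_0$ distributions whose confusion matrix the paper summarizes into a station-specificity score.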
【4】QoSDiff: An Implicit Topological Embedding Learning Framework Leveraging Denoising Diffusion and Adversarial Attention for Robust QoS Prediction
标题:QoSDiff:利用去噪扩散和对抗注意力进行鲁棒QoS预测的隐式拓扑嵌入学习框架
链接:https://arxiv.org/abs/2512.04596
作者:Guanchen Du,Jianlong Xu,Wei Wei
备注:Preprint submitted to IEEE Transactions on Services Computing
摘要:准确的服务质量(QoS)预测是服务计算的基础,为服务选择提供必要的数据驱动指导,并确保卓越的用户体验。然而,流行的方法,特别是图神经网络(GNN),严重依赖于构建显式的用户-服务交互图。当显式连接稀疏或被噪声破坏时,这种依赖性会引入严重的可扩展性瓶颈并限制性能。为了解决这些挑战,本文介绍了QoSDiff,一种绕过显式图构造这一先决条件的新型嵌入学习框架。具体而言,它利用去噪扩散概率模型从噪声初始化中恢复固有的潜在结构。为了进一步捕获高阶交互,我们提出了一个集成双向混合注意力机制的对抗交互模块。这种对抗范式动态地区分信息模式和噪声,从而实现对复杂用户-服务关联的双视角建模。在两个大规模真实世界数据集上的广泛实验表明,QoSDiff显著优于最先进的基线。值得注意的是,结果突出了该框架优越的跨数据集泛化能力以及对数据稀疏性和观测噪声的出色鲁棒性。
摘要:Accurate Quality of Service (QoS) prediction is fundamental to service computing, providing essential data-driven guidance for service selection and ensuring superior user experiences. However, prevalent approaches, particularly Graph Neural Networks (GNNs), heavily rely on constructing explicit user--service interaction graphs. This dependency introduces severe scalability bottlenecks and limits performance when explicit connections are sparse or corrupted by noise. To address these challenges, this paper introduces \emph{QoSDiff}, a novel embedding learning framework that bypasses the prerequisite of explicit graph construction. Specifically, it leverages a denoising diffusion probabilistic model to recover intrinsic latent structures from noisy initializations. To further capture high-order interactions, we propose an adversarial interaction module that integrates a bidirectional hybrid attention mechanism. This adversarial paradigm dynamically distinguishes informative patterns from noise, enabling a dual-perspective modeling of intricate user--service associations. Extensive experiments on two large-scale real-world datasets demonstrate that QoSDiff significantly outperforms state-of-the-art baselines. Notably, the results highlight the framework's superior cross-dataset generalization capability and exceptional robustness against data sparsity and observational noise.
半/弱/无/有监督|不确定性|主动学习(5篇)
【1】Hybrid Quantum-Classical Autoencoders for Unsupervised Network Intrusion Detection
标题:用于无监督网络入侵检测的混合量子经典自动编码器
链接:https://arxiv.org/abs/2512.05069
作者:Mohammad Arif Rasyidi,Omar Alhussein,Sami Muhaidat,Ernesto Damiani
摘要:无监督的基于异常的入侵检测需要模型,可以概括为在训练过程中没有观察到的攻击模式。这项工作提出了第一个大规模的评估混合量子经典(HQC)自动编码器的这项任务。我们构建了一个统一的实验框架,迭代关键的量子设计选择,包括量子层的位置,测量方法,变分和非变分配方,和潜在的空间正则化。在三个基准NIDS数据集上的实验表明,HQC自动编码器在其最佳配置下可以匹配或超过经典性能,尽管它们对架构决策具有更高的敏感性。在零日评估下,配置良好的HQC模型比经典和监督基线提供更强大和更稳定的泛化。模拟门噪声实验显示早期的性能下降,表明噪声感知HQC设计的需要。这些结果提供了第一个数据驱动的表征HQC自动编码器的行为,网络入侵检测和概述的关键因素,管理其实际可行性。所有实验代码和配置均可在https://github.com/arasyi/hqcae-network-intrusion-detection上获得。
摘要:Unsupervised anomaly-based intrusion detection requires models that can generalize to attack patterns not observed during training. This work presents the first large-scale evaluation of hybrid quantum-classical (HQC) autoencoders for this task. We construct a unified experimental framework that iterates over key quantum design choices, including quantum-layer placement, measurement approach, variational and non-variational formulations, and latent-space regularization. Experiments across three benchmark NIDS datasets show that HQC autoencoders can match or exceed classical performance in their best configurations, although they exhibit higher sensitivity to architectural decisions. Under zero-day evaluation, well-configured HQC models provide stronger and more stable generalization than classical and supervised baselines. Simulated gate-noise experiments reveal early performance degradation, indicating the need for noise-aware HQC designs. These results provide the first data-driven characterization of HQC autoencoder behavior for network intrusion detection and outline key factors that govern their practical viability. All experiment code and configurations are available at https://github.com/arasyi/hqcae-network-intrusion-detection.
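Independently of whether the autoencoder is classical or hybrid quantum-classical, the unsupervised detection rule is reconstruction-error thresholding. Below is a minimal sketch with a linear (PCA-style) stand-in for the trained model; the quantile threshold and the synthetic traffic are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def anomaly_scores(X, encode, decode):
    """Score each flow by its mean squared reconstruction error; flows
    above a threshold fit on benign traffic are flagged as intrusions."""
    return np.mean((X - decode(encode(X))) ** 2, axis=1)

rng = np.random.default_rng(0)
# Benign traffic lives (mostly) in a 3-dim subspace of 8 features.
benign = rng.normal(size=(500, 8)) * np.array([3, 2, 1, .1, .1, .1, .1, .1])
mu = benign.mean(axis=0)
U = np.linalg.svd(benign - mu, full_matrices=False)[2][:3].T   # top-3 principal axes
enc = lambda X: (X - mu) @ U
dec = lambda Z: Z @ U.T + mu

thresh = np.quantile(anomaly_scores(benign, enc, dec), 0.99)   # fit on benign only
attack = rng.normal(loc=4.0, size=(50, 8))                     # shifted, unseen traffic
flagged = anomaly_scores(attack, enc, dec) > thresh
```

In the paper the encoder/decoder would be the (HQC) autoencoder; the zero-day property comes from the threshold being fit on benign data alone, so unseen attack patterns can still be flagged.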
【2】Multi-Agent Reinforcement Learning for Intraday Operating Rooms Scheduling under Uncertainty
标题:不确定性下手术室日间调度的多智能体强化学习
链接:https://arxiv.org/abs/2512.04918
作者:Kailiang Liu,Ying Chen,Ralf Borndörfer,Thorsten Koch
摘要:日内手术调度是一个不确定性下的多目标决策问题,需要平衡择期手术吞吐量、紧急与急诊需求、延迟、顺序相关的准备时间以及加班。我们将该问题建模为合作马尔可夫博弈,并提出了一个多智能体强化学习(MARL)框架,其中每个手术室(OR)都是一个智能体,采用集中训练、分散执行的方式进行训练。所有智能体共享通过近端策略优化(PPO)训练的策略,该策略将丰富的系统状态映射到动作,而时期内的顺序分配协议在各手术室之间构建无冲突的联合调度。一个混合整数预调度为择期手术提供参考开始时间;我们相对于这些参考施加类型特定的二次延迟惩罚以及终端加班惩罚,产生一个同时捕获吞吐量、及时性和工作人员工作量的单一奖励。在反映现实医院组合(六个手术室、八种手术类型、随机紧急和急诊到达)的模拟中,学习到的策略在七个指标和三个评估子集上优于六个基于规则的启发式策略,并且相对于事后MIP预言机量化了最优性差距。策略分析揭示了可解释的行为:优先处理急诊、将相似病例成批安排以减少准备时间、推迟价值较低的择期手术。在简化假设下,我们还推导出了顺序分解的次优性界。我们讨论了局限性(包括手术室同质性假设和未显式考虑人员配置约束)并概述了扩展方向。总体而言,该方法为实时手术室调度优化提供了一个实用、可解释、可调节的数据驱动补充。
摘要:Intraday surgical scheduling is a multi-objective decision problem under uncertainty-balancing elective throughput, urgent and emergency demand, delays, sequence-dependent setups, and overtime. We formulate the problem as a cooperative Markov game and propose a multi-agent reinforcement learning (MARL) framework in which each operating room (OR) is an agent trained with centralized training and decentralized execution. All agents share a policy trained via Proximal Policy Optimization (PPO), which maps rich system states to actions, while a within-epoch sequential assignment protocol constructs conflict-free joint schedules across ORs. A mixed-integer pre-schedule provides reference starting times for electives; we impose type-specific quadratic delay penalties relative to these references and a terminal overtime penalty, yielding a single reward that captures throughput, timeliness, and staff workload. In simulations reflecting a realistic hospital mix (six ORs, eight surgery types, random urgent and emergency arrivals), the learned policy outperforms six rule-based heuristics across seven metrics and three evaluation subsets, and, relative to an ex post MIP oracle, quantifies optimality gaps. Policy analytics reveal interpretable behavior-prioritizing emergencies, batching similar cases to reduce setups, and deferring lower-value electives. We also derive a suboptimality bound for the sequential decomposition under simplifying assumptions. We discuss limitations-including OR homogeneity and the omission of explicit staffing constraints-and outline extensions. Overall, the approach offers a practical, interpretable, and tunable data-driven complement to optimization for real-time OR scheduling.
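The reward described above, type-specific quadratic delay penalties against the pre-scheduled reference starts plus a terminal overtime penalty, can be sketched in a few lines. The coefficients and case tuples below are hypothetical placeholders, not the paper's calibrated values.

```python
def schedule_reward(cases, overtime_min, a=1.0, b=0.5):
    """Scalar reward (sketch): quadratic penalty on each case's start
    delay past its MIP reference time, weighted by surgery type, plus a
    linear terminal overtime penalty. `cases` holds
    (type_weight, ref_start, actual_start) triples in minutes."""
    delay_pen = sum(w * a * max(0.0, start - ref) ** 2
                    for w, ref, start in cases)
    return -(delay_pen + b * overtime_min)

# Two cases: one started 15 min late, one on time; 30 min of overtime.
r = schedule_reward([(1.0, 480, 495), (2.0, 540, 540)], overtime_min=30)
```

The quadratic form makes a single long delay costlier than several short ones, which pushes the learned policy toward the batching and deferral behaviors the policy analytics report.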
【3】AutoGuard: A Self-Healing Proactive Security Layer for DevSecOps Pipelines Using Reinforcement Learning
标题:AutoGuard:使用强化学习的DevSecOps管道的自我修复主动安全层
链接:https://arxiv.org/abs/2512.04368
作者:Praveen Anugula,Avdhesh Kumar Bhardwaj,Navin Chhibber,Rohit Tewari,Sunil Khemka,Piyush Ranjan
备注:Accepted and Presented at 1st IEEE Uttar Pradesh Section Women in Engineering International Conference on Electrical Electronics and Computer Engineering (UPWIECON 2025) organized by NIELIT Dehradun held during 30th 31st October 2025
摘要:当代DevSecOps管道必须在持续集成和持续部署的环境中应对不断演变的安全问题。现有方法(如基于规则的入侵检测和静态漏洞扫描)不够充分,无法适应系统变化,导致响应时间变长,并使组织暴露于新兴攻击向量。鉴于上述限制,我们向DevSecOps生态系统引入AutoGuard,一个由强化学习(RL)驱动的自我修复安全框架,旨在先发制人地保护DevSecOps环境。AutoGuard持续观察管道活动以发现潜在异常,同时抢先对环境进行修复。该模型基于随时间动态学习的策略进行观察和反应。RL智能体通过基于奖励的学习不断改进每个动作,旨在提高其实时预防、检测和响应安全事件的能力。使用模拟的持续集成/持续部署(CI/CD)环境进行的测试表明,与传统方法相比,AutoGuard成功地将威胁检测准确率提高了22%,将事件平均恢复时间(MTTR)减少了38%,并提高了对事件的整体恢复能力。
关键词:DevSecOps、强化学习、自我修复安全、持续集成、自动化威胁缓解
摘要:Contemporary DevSecOps pipelines have to deal with the evolution of security in an ever-continuously integrated and deployed environment. Existing methods, such as rule-based intrusion detection and static vulnerability scanning, are inadequate and unreceptive to changes in the system, causing longer response times and exposing organizations to emerging attack vectors. In light of the previous constraints, we introduce AutoGuard to the DevSecOps ecosystem, a reinforcement learning (RL)-powered self-healing security framework built to pre-emptively protect DevSecOps environments. AutoGuard is a self-securing security environment that continuously observes pipeline activities for potential anomalies while preemptively remediating the environment. The model observes and reacts based on a policy that is continually learned dynamically over time. The RL agent improves each action over time through reward-based learning aimed at improving the agent's ability to prevent, detect and respond to a security incident in real-time. Testing using simulated Continuous Integration / Continuous Deployment (CI/CD) environments showed AutoGuard to successfully improve threat detection accuracy by 22%, reduce mean time to recovery (MTTR) for incidents by 38%, and increase overall resilience to incidents as compared to traditional methods.
Keywords: DevSecOps, Reinforcement Learning, Self-Healing Security, Continuous Integration, Automated Threat Mitigation
【4】Bayes-DIC Net: Estimating Digital Image Correlation Uncertainty with Bayesian Neural Networks
标题:Bayes-DIC Net:使用贝叶斯神经网络估计数字图像相关不确定性
链接:https://arxiv.org/abs/2512.04323
作者:Biao Chen,Zhenhua Lei,Yahui Zhang,Tongzhi Niu
备注:17 pages, 8 figures
摘要:提出了一种基于非均匀B样条曲面的高质量数字图像相关(DIC)数据集生成方法。通过随机生成控制点坐标,我们构建了包含各种现实位移场景的位移场,这些场景随后用于生成散斑图案数据集。这种方法能够生成捕获真实世界位移场情况的大规模数据集,从而增强基于深度学习的DIC算法的训练和泛化能力。此外,我们提出了一种新的网络架构,称为贝叶斯-DIC网,它在下采样阶段提取多个级别的信息,并通过在上采样阶段的一个单一的跳过连接,促进各个级别的信息聚合。Bayes-DIC Net集成了一系列轻量级卷积块,旨在扩展感受野并捕获丰富的上下文信息,同时最大限度地降低计算成本。此外,通过将适当的dropout模块集成到Bayes-DIC网络中,并在网络推理阶段激活它们,将Bayes-DIC网络转换为贝叶斯神经网络。这种转换允许网络在处理真实的未标记数据集时不仅提供预测结果,而且还提供这些预测的置信水平。这一特性显著增强了我们的网络在实际位移场预测任务中的实用性和可靠性。通过这些创新,本文提供了新的视角和方法,数据集生成和算法性能的提高在DIC领域。
摘要:This paper introduces a novel method for generating high-quality Digital Image Correlation (DIC) datasets based on non-uniform B-spline surfaces. By randomly generating control point coordinates, we construct displacement fields that encompass a variety of realistic displacement scenarios, which are subsequently used to generate speckle pattern datasets. This approach enables the generation of a large-scale dataset that captures real-world displacement field situations, thereby enhancing the training and generalization capabilities of deep learning-based DIC algorithms. Additionally, we propose a novel network architecture, termed Bayes-DIC Net, which extracts information at multiple levels during the down-sampling phase and facilitates the aggregation of information across various levels through a single skip connection during the up-sampling phase. Bayes-DIC Net incorporates a series of lightweight convolutional blocks designed to expand the receptive field and capture rich contextual information while minimizing computational costs. Furthermore, by integrating appropriate dropout modules into Bayes-DIC Net and activating them during the network inference stage, Bayes-DIC Net is transformed into a Bayesian neural network. This transformation allows the network to provide not only predictive results but also confidence levels in these predictions when processing real unlabeled datasets. This feature significantly enhances the practicality and reliability of our network in real-world displacement field prediction tasks. Through these innovations, this paper offers new perspectives and methods for dataset generation and algorithm performance enhancement in the field of DIC.
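The Monte Carlo dropout inference that turns the network Bayesian can be sketched generically: keep dropout active at test time and summarize repeated stochastic forward passes. The toy one-layer `forward` below is a stand-in for Bayes-DIC Net, not the actual architecture; the dropout rate and pass count are illustrative.

```python
import numpy as np

def mc_dropout_predict(x, forward, T=100, p=0.2, rng=None):
    """MC dropout inference (sketch): run T stochastic forward passes
    with dropout left ON, and report the predictive mean and std.
    The std serves as the confidence estimate for the prediction."""
    rng = rng or np.random.default_rng(0)
    preds = np.array([forward(x, rng.random(x.shape) > p) for _ in range(T)])
    return preds.mean(axis=0), preds.std(axis=0)

# Toy one-layer "network": a random dropout mask applied to the input
# features, with inverted-dropout rescaling by 1/(1-p).
w = np.linspace(0.1, 1.0, 10)
forward = lambda x, mask: (x * mask) @ w / (1 - 0.2)
mean, std = mc_dropout_predict(np.ones(10), forward, T=500)
```

In the displacement-field setting, the per-pixel std map is what flags regions where the predicted displacement should not be trusted.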
【5】Informative missingness and its implications in semi-supervised learning
标题:半监督学习中的信息缺失及其影响
链接:https://arxiv.org/abs/2512.04392
作者:Jinran Wu,You-Gan Wang,Geoffrey J. McLachlan
备注:1
摘要:半监督学习(SSL)使用标记和未标记数据构建分类器。它利用来自标记样本(其获取通常昂贵或劳动密集)的信息以及未标记数据来提高预测性能。这定义了一个不完全数据问题,在统计上可以在有限混合模型的似然框架内表述,并可使用期望最大化(EM)算法进行拟合。理想情况下,人们更偏好完全标记的样本,因为标记的观测预期比未标记的观测提供更多信息。然而,当控制标签缺失的机制取决于观测到的特征或类别标签或两者时,缺失指示变量本身就包含有用信息。在某些情况下,从建模缺失标签机制中获得的信息甚至可以超过由于标签缺失而造成的损失,从而产生比基于完全标记样本分析的分类器具有更小预期误差的分类器。这种改进尤其出现在类别重叠适中、标记数据稀疏且缺失具有信息性的情形下。因此,对这种信息性缺失进行建模提供了一个连贯的统计框架,将基于似然的推断与经验SSL方法的行为统一起来。
摘要:Semi-supervised learning (SSL) constructs classifiers using both labelled and unlabelled data. It leverages information from labelled samples, whose acquisition is often costly or labour-intensive, together with unlabelled data to enhance prediction performance. This defines an incomplete-data problem, which statistically can be formulated within the likelihood framework for finite mixture models that can be fitted using the expectation-maximisation (EM) algorithm. Ideally, one would prefer a completely labelled sample, as one would anticipate that a labelled observation provides more information than an unlabelled one. However, when the mechanism governing label absence depends on the observed features or the class labels or both, the missingness indicators themselves contain useful information. In certain situations, the information gained from modelling the missing-label mechanism can even outweigh the loss due to missing labels, yielding a classifier with a smaller expected error than one based on a completely labelled sample analysed. This improvement arises particularly when class overlap is moderate, labelled data are sparse, and the missingness is informative. Modelling such informative missingness thus offers a coherent statistical framework that unifies likelihood-based inference with the behaviour of empirical SSL methods.
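The E-step underlying this likelihood framework assigns hard responsibilities to labelled points and posterior class probabilities to unlabelled ones. A minimal two-component Gaussian sketch follows; an informative-missingness model would additionally reweight these terms by P(missing | x, class). All parameters and data here are illustrative.

```python
import numpy as np

def responsibilities(x, labels, pi, mu, sigma):
    """E-step for a two-class univariate Gaussian mixture in SSL (sketch):
    labelled points get one-hot responsibilities, unlabelled points
    (label == -1) get posterior class probabilities."""
    def pdf(v, m, s):
        return np.exp(-0.5 * ((v - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
    R = np.zeros((x.size, 2))
    for k in (0, 1):
        R[:, k] = pi[k] * pdf(x, mu[k], sigma[k])
    R = R / R.sum(axis=1, keepdims=True)      # posteriors for all points
    obs = labels >= 0                         # -1 marks a missing label
    R[obs] = np.eye(2)[labels[obs]]           # override with hard labels
    return R

x = np.array([-2.0, -1.5, 1.5, 2.0, 0.1])
labels = np.array([0, -1, -1, 1, -1])         # two labelled, three unlabelled
R = responsibilities(x, labels, [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0])
```

The M-step would then re-estimate pi, mu, sigma from these weighted assignments; the point x = 0.1 in the class-overlap region is exactly where modelling the missingness mechanism adds the most information.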
迁移|Zero/Few/One-Shot|自适应(3篇)
【1】RLHFSpec: Breaking the Efficiency Bottleneck in RLHF Training via Adaptive Drafting
标题:RLHFSpec:通过自适应起草打破RLHF训练的效率瓶颈
链接:https://arxiv.org/abs/2512.04752
作者:Siqi Wang,Hailong Yang,Junjie Zhu,Xuezhu Wang,Yufan Xu,Depei Qian
摘要:基于人类反馈的强化学习(RLHF)是大型语言模型(LLM)的重要微调技术,包括三个阶段:生成,推理和训练。生成阶段生成样本,然后用于推断可学习的训练经验。我们观察到生成阶段是整个执行过程的瓶颈,并认为它是优化的关键点。具体来说,我们实现了第一次尝试将推测性解码集成到RLHF生成阶段,并提出RLHFSpec,RLHF系统,加速生成执行自适应推测性解码和样本重新分配。为了充分利用推测解码提供的性能潜力,特别是处理生成阶段的动态工作负载,RLHFSpec提出了一种工作负载感知的起草策略选择机制,该机制通过联合考虑验证成本和接受令牌的数量来选择接近最优的策略。此外,RLHFSpec还提出了样本重新分配以充分利用GPU资源,并通过高效的样本迁移机制对其进行了优化。实验结果表明,RLHFSpec可以实现更高的吞吐量在生成阶段相比,国家的最先进的作品。此外,由于生成瓶颈的有效缓解,RLHFSpec也显示出显着的性能加速在整个RLHF执行。
摘要:Reinforcement Learning from Human Feedback (RLHF) is an important fine-tuning technique for large language models (LLMs) and comprises three stages: generation, inference, and training. The generation stage generates samples that are then used to infer learnable experiences for training. We observe that the generation stage is the bottleneck of the entire execution process and consider it a key point for optimization. Specifically, we realize the first attempt to integrate speculative decoding into the RLHF generation stage and propose RLHFSpec, an RLHF system that accelerates generation execution with adaptive speculative decoding and sample reallocation. To fully exploit the performance potential provided by speculative decoding, especially dealing with the dynamic workload of the generation stage, RLHFSpec proposes a workload-aware drafting strategy selection mechanism, which selects the near-optimal strategy by jointly considering the verification cost and the number of accepted tokens. Moreover, RLHFSpec also proposes sample reallocation to fully utilize the GPU resources, and optimizes it with an efficient sample migration mechanism. The experimental results show that the RLHFSpec can achieve higher throughput in the generation stage compared to state-of-the-art works. Moreover, due to the effective alleviation of the generation bottleneck, RLHFSpec also shows significant performance speedup in the entire RLHF execution.
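The workload-aware drafting selection can be caricatured as maximizing expected accepted tokens per unit of draft-plus-verification cost. The scoring rule, strategy names, and numbers below are assumptions for illustration, not RLHFSpec's actual mechanism.

```python
def pick_strategy(strategies):
    """Pick the drafting strategy with the best expected throughput
    (sketch): expected accepted tokens divided by the joint cost of
    drafting and verification. Each entry is
    (name, expected_accepted_tokens, draft_cost, verify_cost)."""
    return max(strategies, key=lambda s: s[1] / (s[2] + s[3]))[0]

best = pick_strategy([
    ("draft-2", 1.6, 0.10, 0.30),   # short drafts: cheap, few accepted tokens
    ("draft-5", 3.0, 0.25, 0.45),   # medium drafts
    ("draft-8", 3.4, 0.45, 0.60),   # long drafts: costly verification
])
```

Because the generation workload shifts over an RLHF run (batch sizes shrink as samples finish), the expected-acceptance and cost terms would be re-estimated online, which is what makes the selection "workload-aware".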
【2】Prototype-Based Semantic Consistency Alignment for Domain Adaptive Retrieval
标题:基于原型的领域自适应检索语义一致性对齐
链接:https://arxiv.org/abs/2512.04524
作者:Tianle Hu,Weijun Lv,Na Han,Xiaozhao Fang,Jie Wen,Jiaxing Li,Guoxu Zhou
备注:This paper was accepted by AAAI2026 main tech track not long ago. This is an expanded version with an appendix
摘要:领域自适应检索旨在将知识从标记的源领域转移到未标记的目标领域,从而在减少领域差异的同时实现有效的检索。然而,现有的方法遇到了几个基本的限制:1)忽略类级别的语义对齐和过度追求成对的样本对齐; 2)缺乏伪标签的可靠性考虑或评估标签正确性的几何指导; 3)直接量化受域移位影响的原始特征,破坏了学习的哈希码的质量。鉴于这些局限性,我们提出了基于原型的语义一致性对齐(PSCA),一个有效的域自适应检索的两阶段框架。在第一阶段,一组正交原型直接建立类级语义连接,在收集类内样本的同时最大化类间的可分性。在原型学习过程中,几何邻近度通过自适应加权伪标签置信度为语义一致性对齐提供了可靠性指标。由此产生的隶属度矩阵和原型有助于特征重构,确保对重构特征而不是原始特征进行量化,从而提高后续散列编码质量并无缝连接两个阶段。在第二阶段中,特定于域的量化函数在相互近似约束下处理重构特征,生成跨域的统一二进制哈希码。大量的实验验证了PSCA在多个数据集上的卓越性能。
摘要:Domain adaptive retrieval aims to transfer knowledge from a labeled source domain to an unlabeled target domain, enabling effective retrieval while mitigating domain discrepancies. However, existing methods encounter several fundamental limitations: 1) neglecting class-level semantic alignment and excessively pursuing pair-wise sample alignment; 2) lacking either pseudo-label reliability consideration or geometric guidance for assessing label correctness; 3) directly quantizing original features affected by domain shift, undermining the quality of learned hash codes. In view of these limitations, we propose Prototype-Based Semantic Consistency Alignment (PSCA), a two-stage framework for effective domain adaptive retrieval. In the first stage, a set of orthogonal prototypes directly establishes class-level semantic connections, maximizing inter-class separability while gathering intra-class samples. During the prototype learning, geometric proximity provides a reliability indicator for semantic consistency alignment through adaptive weighting of pseudo-label confidences. The resulting membership matrix and prototypes facilitate feature reconstruction, ensuring quantization on reconstructed rather than original features, thereby improving subsequent hash coding quality and seamlessly connecting both stages. In the second stage, domain-specific quantization functions process the reconstructed features under mutual approximation constraints, generating unified binary hash codes across domains. Extensive experiments validate PSCA's superior performance across multiple datasets.
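The first stage's interplay of prototypes, geometric proximity, and pseudo-label confidence weighting can be illustrated with a deliberately simplified sketch. The toy 2-D features, Euclidean distance, and margin-based weight are our assumptions; the paper's actual formulation differs:

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pseudo_label(feature, prototypes):
    """Assign the nearest prototype's class and a reliability weight in [0, 1]
    derived from the margin between the two closest prototypes."""
    d = sorted((dist(feature, p), i) for i, p in enumerate(prototypes))
    (d1, label), (d2, _) = d[0], d[1]
    weight = 1.0 - d1 / (d2 + 1e-12)   # large margin: weight near 1
    return label, weight

prototypes = [(1.0, 0.0), (0.0, 1.0)]        # orthogonal class prototypes
confident = pseudo_label((0.9, 0.1), prototypes)   # close to prototype 0
ambiguous = pseudo_label((0.5, 0.5), prototypes)   # equidistant, unreliable
```

Samples near a prototype receive high weight; geometrically ambiguous samples are down-weighted, mirroring the reliability indicator described in the abstract.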
【3】One Detector Fits All: Robust and Adaptive Detection of Malicious Packages from PyPI to Enterprises
标题:一个检测器适合所有人:对从PyPI到企业的恶意包进行稳健且自适应的检测
链接:https://arxiv.org/abs/2512.04338
作者:Biagio Montaruli,Luca Compagna,Serena Elisa Ponta,Davide Balzarotti
备注:Proceedings of the 2025 Annual Computer Security Applications Conference (ACSAC' 25), December 8-12, 2025, Honolulu, Hawaii, USA
摘要:通过恶意Python包进行的供应链攻击的增加需要强大的检测解决方案。然而,目前的方法忽略了两个关键挑战:对敌对源代码转换的鲁棒性和对不同参与者的不同误报率(FPR)要求的适应性,从存储库维护人员(要求低FPR)到企业安全团队(更高FPR容忍度)。 我们引入了一个强大的检测器,能够无缝集成到PyPI等公共存储库和企业生态系统中。为了确保鲁棒性,我们提出了一种使用细粒度代码混淆生成对抗包的新方法。将这些与对抗训练(AT)相结合,可将检测器的鲁棒性提高2.5倍。我们通过对PyPI在80天内每天收集的122,398个包测试我们的检测器来全面评估AT的有效性,表明AT需要谨慎应用:它使检测器对混淆更加鲁棒,并能多发现10%的混淆包,但对非混淆包的性能略有下降。 我们通过两个案例研究证明了我们的检测器的生产适应性:(i)一个用于PyPI维护人员(调整为0.1%FPR)和(ii)一个用于企业团队(调整为10%FPR)。在前者中,我们分析了在37天内从PyPI收集的91,949个包,实现了每天2.48个恶意包的检测率,只有2.18个误报。在后者中,我们分析了一家跨国软件公司采用的1,596个软件包,每天只有1.24个误报。这些结果表明,我们的检测器可以无缝集成到PyPI等公共存储库和企业生态系统中,确保只需几分钟的时间预算来审查误报。 总体而言,我们发现了346个恶意软件包,现已向社区报告。
摘要:The rise of supply chain attacks via malicious Python packages demands robust detection solutions. Current approaches, however, overlook two critical challenges: robustness against adversarial source code transformations and adaptability to the varying false positive rate (FPR) requirements of different actors, from repository maintainers (requiring low FPR) to enterprise security teams (higher FPR tolerance). We introduce a robust detector capable of seamless integration into both public repositories like PyPI and enterprise ecosystems. To ensure robustness, we propose a novel methodology for generating adversarial packages using fine-grained code obfuscation. Combining these with adversarial training (AT) enhances detector robustness by 2.5x. We comprehensively evaluate AT effectiveness by testing our detector against 122,398 packages collected daily from PyPI over 80 days, showing that AT needs careful application: it makes the detector more robust to obfuscations and allows finding 10% more obfuscated packages, but slightly decreases performance on non-obfuscated packages. We demonstrate production adaptability of our detector via two case studies: (i) one for PyPI maintainers (tuned at 0.1% FPR) and (ii) one for enterprise teams (tuned at 10% FPR). In the former, we analyze 91,949 packages collected from PyPI over 37 days, achieving a daily detection rate of 2.48 malicious packages with only 2.18 false positives. In the latter, we analyze 1,596 packages adopted by a multinational software company, obtaining only 1.24 false positives daily. These results show that our detector can be seamlessly integrated into both public repositories like PyPI and enterprise ecosystems, ensuring a very low time budget of a few minutes to review the false positives. Overall, we uncovered 346 malicious packages, now reported to the community.
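The adaptability to different false positive budgets boils down to threshold selection on one shared scoring model. A minimal sketch with synthetic benign scores (the score distribution and the hypothetical `threshold_for_fpr` helper are illustrative, not the paper's calibration procedure):

```python
def threshold_for_fpr(benign_scores, target_fpr):
    """Pick t so that flagging score > t gives empirical FPR <= target_fpr."""
    s = sorted(benign_scores, reverse=True)
    k = int(target_fpr * len(s))           # benign samples allowed above t
    return s[min(k, len(s) - 1)]

benign = [i / 1000 for i in range(1000)]   # synthetic benign score distribution
t_repo = threshold_for_fpr(benign, 0.001)  # repository maintainers: 0.1% FPR
t_corp = threshold_for_fpr(benign, 0.10)   # enterprise teams: 10% FPR

def empirical_fpr(t):
    return sum(x > t for x in benign) / len(benign)
```

The same detector thus serves both deployment settings: the maintainer threshold is strictly higher (fewer alerts), while the enterprise threshold trades more review effort for higher recall.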
强化学习(7篇)
【1】Structured Document Translation via Format Reinforcement Learning
标题:通过格式强化学习进行结构化文档翻译
链接:https://arxiv.org/abs/2512.05100
作者:Haiyue Song,Johannes Eschbach-Dymanus,Hour Kaing,Sumire Honda,Hideki Tanaka,Bianka Buschbeck,Masao Utiyama
备注:IJCNLP-AACL 2025 Main (Oral)
摘要:最近的结构化文本翻译工作仍然局限于句子级别,因为它们很难有效地处理复杂的文档级XML或HTML结构。为了解决这个问题,我们提出了\textbf{格式强化学习(FormatRL)},它在监督微调模型之上采用组相对策略优化来直接优化新型结构感知奖励:1)TreeSim,它测量预测和参考之间的结构相似性和2)Node-chrF,它在XML节点级别测量翻译质量。此外,我们应用StrucAUC,一个细粒度的度量区分小错误和主要的结构故障。SAP软件文档基准测试的实验证明了六个指标的改进,分析进一步显示了不同的奖励函数如何有助于提高结构和翻译质量。
摘要:Recent works on structured text translation remain limited to the sentence level, as they struggle to effectively handle the complex document-level XML or HTML structures. To address this, we propose \textbf{Format Reinforcement Learning (FormatRL)}, which employs Group Relative Policy Optimization on top of a supervised fine-tuning model to directly optimize novel structure-aware rewards: 1) TreeSim, which measures structural similarity between predicted and reference XML trees and 2) Node-chrF, which measures translation quality at the level of XML nodes. Additionally, we apply StrucAUC, a fine-grained metric distinguishing between minor errors and major structural failures. Experiments on the SAP software-documentation benchmark demonstrate improvements across six metrics and an analysis further shows how different reward functions contribute to improvements in both structural and translation quality.
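A toy stand-in for the structure-aware reward idea: score a predicted XML translation by the overlap of its tag paths with the reference. The real TreeSim is defined in the paper; this sketch only conveys the intuition that structure, not surface text, drives the reward:

```python
import xml.etree.ElementTree as ET
from collections import Counter

def tag_paths(xml_text):
    """Multiset of root-to-node tag paths in an XML document."""
    paths = Counter()
    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths[path] += 1
        for child in node:
            walk(child, path)
    walk(ET.fromstring(xml_text), "")
    return paths

def tree_sim(pred, ref):
    """Dice-style overlap of tag-path multisets (illustrative TreeSim stand-in)."""
    p, r = tag_paths(pred), tag_paths(ref)
    overlap = sum((p & r).values())
    return 2 * overlap / (sum(p.values()) + sum(r.values()))

ref  = "<doc><p>Hallo</p><p><b>Welt</b></p></doc>"
good = "<doc><p>Hello</p><p><b>World</b></p></doc>"   # structure preserved
bad  = "<doc><p>Hello World</p></doc>"                # markup collapsed
```

A structure-preserving translation scores a full reward regardless of its token choices, while dropping or merging elements is penalized.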
【2】Realizable Abstractions: Near-Optimal Hierarchical Reinforcement Learning
标题:可实现的抽象:接近最优的分层强化学习
链接:https://arxiv.org/abs/2512.04958
作者:Roberto Cipollone,Luca Iocchi,Matteo Leonetti
摘要:分层强化学习(HRL)的主要重点是研究如何通过组合为较小子任务计算的部分解,以模块化方式更有效地解决大型马尔可夫决策过程(MDP)。尽管它们对学习的作用非常直观,但HRL文献中提出的大多数MDP抽象概念要么表达能力有限,要么不具备正式的效率保证。这项工作通过定义可实现的抽象(Realizable Abstractions),即通用低级MDP与其关联的高级决策过程之间的一种新关系,解决了这些基本问题。我们提出的概念避免了非马尔可夫性问题,并具有理想的近最优性保证。事实上,我们表明,可实现抽象的任何抽象策略都可以通过适当的选项(option)组合转化为低级MDP的接近最优策略。如本文所示,这些选项可以表示为特定约束MDP的解。基于这些发现,我们提出了RARL,一种新的HRL算法,它利用输入中给定的可实现抽象,返回组合式且接近最优的低级策略。我们证明了RARL是概率近似正确(PAC)的,它在多项式数量的样本内收敛,并且对抽象中的不准确性具有鲁棒性。
摘要:The main focus of Hierarchical Reinforcement Learning (HRL) is studying how large Markov Decision Processes (MDPs) can be more efficiently solved when addressed in a modular way, by combining partial solutions computed for smaller subtasks. Despite their very intuitive role for learning, most notions of MDP abstractions proposed in the HRL literature have limited expressive power or do not possess formal efficiency guarantees. This work addresses these fundamental issues by defining Realizable Abstractions, a new relation between generic low-level MDPs and their associated high-level decision processes. The notion we propose avoids non-Markovianity issues and has desirable near-optimality guarantees. Indeed, we show that any abstract policy for Realizable Abstractions can be translated into near-optimal policies for the low-level MDP, through a suitable composition of options. As demonstrated in the paper, these options can be expressed as solutions of specific constrained MDPs. Based on these findings, we propose RARL, a new HRL algorithm that returns compositional and near-optimal low-level policies, taking advantage of the Realizable Abstraction given in the input. We show that RARL is Probably Approximately Correct, it converges in a polynomial number of samples, and it is robust to inaccuracies in the abstraction.
【3】CARL: Critical Action Focused Reinforcement Learning for Multi-Step Agent
标题:CARL:针对多步骤智能体的关键动作聚焦强化学习
链接:https://arxiv.org/abs/2512.04949
作者:Leyang Shen,Yang Zhang,Chun Kai Ling,Xiaoyan Zhao,Tat-Seng Chua
备注:10 pages, 4 figures
摘要:能够通过与环境的多次交互来完成复杂任务的智能体已经成为一个热门的研究方向。然而,在这样的多步骤设置中,由于其每个动作贡献相等的基本假设与现实严重偏离,传统的组级策略优化算法变得次优。我们的分析表明,只有一小部分动作对决定最终结果至关重要。基于这一认识,我们提出了CARL,一种为多步智能体量身定制的以关键动作为中心的强化学习算法。CARL通过为高关键性动作提供动作级优化信号,同时将低关键性动作排除在模型更新之外,实现了聚焦训练。大量实验表明,在不同的评估设置下,CARL在训练和推理中均实现了更强的性能和更高的效率。
摘要:Agents capable of accomplishing complex tasks through multiple interactions with the environment have emerged as a popular research direction. However, in such multi-step settings, the conventional group-level policy optimization algorithm becomes suboptimal because of its underlying assumption that each action holds equal contribution, which deviates significantly from reality. Our analysis reveals that only a small fraction of actions are critical in determining the final outcome. Building on this insight, we propose CARL, a critical-action-focused reinforcement learning algorithm tailored for multi-step agents. CARL achieves focused training through providing action-level optimization signals for high-criticality actions while excluding low-criticality actions from model update. Extensive experiments demonstrate that CARL achieves both stronger performance and higher efficiency during training and inference across diverse evaluation settings.
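The focused-training idea can be sketched in a few lines; the criticality scores, the threshold, and the masking rule are our illustrative assumptions, not CARL's exact formulation:

```python
def focused_signals(advantages, criticality, threshold=0.5):
    """Keep the optimization signal only for high-criticality actions;
    low-criticality actions are excluded from the model update."""
    return [a if c >= threshold else 0.0
            for a, c in zip(advantages, criticality)]

advantages  = [0.8, -0.2, 0.5, -0.9]
criticality = [0.9, 0.1, 0.2, 0.7]     # only a small fraction is critical
signals = focused_signals(advantages, criticality)
```

Only the first and last actions retain a learning signal, concentrating the update on the steps that actually decide the outcome.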
【4】Semi Centralized Training Decentralized Execution Architecture for Multi Agent Deep Reinforcement Learning in Traffic Signal Control
标题:交通信号控制中多智能体深度强化学习的半集中训练去中心执行架构
链接:https://arxiv.org/abs/2512.04653
作者:Pouria Yazdani,Arash Rezaali,Monireh Abdoos
备注:Co-first authors: Pouria Yazdani and Arash Rezaali
摘要:多智能体强化学习(MARL)已成为多交叉口自适应交通信号控制(ATSC)的一种很有前途的范式。现有的方法通常遵循完全集中或完全分散的设计。完全集中的方法受到维数灾难的影响,并且依赖于单个学习服务器;而纯粹分散的方法在严重的部分可观测性下运行,并且缺乏明确的协调,从而导致次优性能。这些限制激发了基于区域的MARL,其中网络被划分为由紧密耦合的交叉口组成的较小区域,并围绕这些区域组织训练。本文介绍了一种用于多交叉口ATSC的半集中式训练、分散式执行(SEMI-CTDE)架构。在每个区域内,SEMI-CTDE进行带有区域参数共享的集中训练,并采用联合编码本地和区域信息的复合状态和奖励公式。该架构在不同的策略骨干和状态-奖励实例之间具有高度的可移植性。在此架构的基础上,我们实现了两个具有不同设计目标的模型。对两个基于SEMI-CTDE的模型进行的多角度实验分析(涵盖对架构核心元素的消融,并包含基于规则和完全分散的基线)表明,它们实现了一贯的卓越性能,并在广泛的流量密度和分布下保持有效。
摘要:Multi-agent reinforcement learning (MARL) has emerged as a promising paradigm for adaptive traffic signal control (ATSC) of multiple intersections. Existing approaches typically follow either a fully centralized or a fully decentralized design. Fully centralized approaches suffer from the curse of dimensionality, and reliance on a single learning server, whereas purely decentralized approaches operate under severe partial observability and lack explicit coordination resulting in suboptimal performance. These limitations motivate region-based MARL, where the network is partitioned into smaller, tightly coupled intersections that form regions, and training is organized around these regions. This paper introduces a Semi-Centralized Training, Decentralized Execution (SEMI-CTDE) architecture for multi intersection ATSC. Within each region, SEMI-CTDE performs centralized training with regional parameter sharing and employs composite state and reward formulations that jointly encode local and regional information. The architecture is highly transferable across different policy backbones and state-reward instantiations. Building on this architecture, we implement two models with distinct design objectives. A multi-perspective experimental analysis of the two implemented SEMI-CTDE-based models covering ablations of the architecture's core elements including rule based and fully decentralized baselines shows that they achieve consistently superior performance and remain effective across a wide range of traffic densities and distributions.
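A minimal sketch of a composite reward that jointly encodes local and regional information, as the abstract describes; the linear mixing weight `alpha` and the use of a regional mean are assumptions for illustration:

```python
def composite_rewards(local_rewards, alpha=0.7):
    """Mix each intersection's local reward with its region's aggregate,
    so agents are partly credited for regional traffic conditions."""
    regional = sum(local_rewards) / len(local_rewards)
    return [alpha * r + (1 - alpha) * regional for r in local_rewards]

region = [-2.0, -1.0, -3.0]   # e.g. negative queue lengths per intersection
mixed = composite_rewards(region)
```

Each agent's signal is pulled toward the regional mean, which encourages coordination inside a region while keeping a dominant local term.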
【5】Long-Horizon Model-Based Offline Reinforcement Learning Without Conservatism
标题:不带保守主义的基于长视野模型的离线强化学习
链接:https://arxiv.org/abs/2512.04341
作者:Tianwei Ni,Esther Derman,Vineet Jain,Vincent Taboga,Siamak Ravanbakhsh,Pierre-Luc Bacon
备注:Preprint (52 pages, 15 figures)
摘要:流行的离线强化学习(RL)方法依赖于保守性,要么惩罚数据集外的动作,要么限制规划视野。在这项工作中,我们质疑这一原则的普遍性,转而重新审视一个互补原则:贝叶斯视角。贝叶斯方法不是强制执行保守性,而是通过对合理世界模型的后验分布进行建模,并训练依赖历史的智能体以最大化预期奖励,来处理离线数据中的认知不确定性,从而实现测试时的泛化。我们首先在一个多臂强盗设置中说明,在保守主义失效的低质量数据集上,贝叶斯方法表现出色。然后,我们将该原则扩展到现实任务中,确定了关键的设计选择,例如世界模型中的层归一化和自适应长视野规划,以减轻复合误差和价值高估。这些产生了我们基于中性贝叶斯原则的实用算法Neubay。在D4RL和NeoRL基准测试中,Neubay通常匹配或超越领先的保守算法,在7个数据集上达到新的最先进水平。值得注意的是,它在长达数百步的规划视野下取得成功,挑战了普遍的看法。最后,我们刻画了Neubay何时优于保守主义,为离线和基于模型的RL的新方向奠定了基础。
摘要:Popular offline reinforcement learning (RL) methods rely on conservatism, either by penalizing out-of-dataset actions or by restricting planning horizons. In this work, we question the universality of this principle and instead revisit a complementary one: a Bayesian perspective. Rather than enforcing conservatism, the Bayesian approach tackles epistemic uncertainty in offline data by modeling a posterior distribution over plausible world models and training a history-dependent agent to maximize expected rewards, enabling test-time generalization. We first illustrate, in a bandit setting, that Bayesianism excels on low-quality datasets where conservatism fails. We then scale the principle to realistic tasks, identifying key design choices, such as layer normalization in the world model and adaptive long-horizon planning, that mitigate compounding error and value overestimation. These yield our practical algorithm, Neubay, grounded in the neutral Bayesian principle. On D4RL and NeoRL benchmarks, Neubay generally matches or surpasses leading conservative algorithms, achieving new state-of-the-art on 7 datasets. Notably, it succeeds with planning horizons of several hundred steps, challenging common belief. Finally, we characterize when Neubay is preferable to conservatism, laying the foundation for a new direction in offline and model-based RL.
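The bandit illustration from the abstract can be miniaturized: with a Beta posterior over each arm's success rate, the Bayesian choice follows the posterior mean, while a count-based pessimism penalty (the `1/sqrt(n)` form is our assumption) can discard a barely observed but better arm:

```python
import math

def posterior_mean(successes, failures):
    """Posterior mean success rate under a uniform Beta(1, 1) prior."""
    return (successes + 1) / (successes + failures + 2)

# Arm A is well covered but mediocre; arm B is barely observed but promising.
mean_a = posterior_mean(40, 60)                  # ~0.40
mean_b = posterior_mean(3, 1)                    # ~0.67
pessimistic_b = mean_b - 1 / math.sqrt(3 + 1)    # count-based penalty (assumed)
```

The Bayesian agent prefers arm B on its posterior mean, whereas the pessimism penalty pushes B below A on this scarce, low-quality dataset, matching the failure mode the abstract attributes to conservatism.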
【6】Data-regularized Reinforcement Learning for Diffusion Models at Scale
标题:大规模扩散模型的数据正规化强化学习
链接:https://arxiv.org/abs/2512.04332
作者:Haotian Ye,Kaiwen Zheng,Jiashu Xu,Puheng Li,Huayu Chen,Jiaqi Han,Sheng Liu,Qinsheng Zhang,Hanzi Mao,Zekun Hao,Prithvijit Chattopadhyay,Dinghao Yang,Liang Feng,Maosheng Liao,Junjie Bai,Ming-Yu Liu,James Zou,Stefano Ermon
摘要:通过强化学习(RL)将生成扩散模型与人类偏好对齐是至关重要但具有挑战性的。大多数现有的算法往往容易受到奖励黑客攻击,如质量下降,过度风格化,或减少多样性。我们的分析表明,这可以归因于其正则化的固有局限性,这提供了不可靠的惩罚。我们引入了数据正则化扩散强化学习(DDRL),这是一种新的框架,它使用前向KL发散将策略锚定到非策略数据分布。从理论上讲,DDRL可以将RL与标准扩散训练进行鲁棒的,无偏的整合。从经验上讲,这转化为一个简单而有效的算法,结合了奖励最大化和扩散损失最小化。通过超过100万GPU小时的实验和1万次双盲人工评估,我们在高分辨率视频生成任务中证明了DDRL显著提高了奖励,同时减轻了基线中的奖励黑客行为,实现了最高的人类偏好,并建立了一个强大且可扩展的扩散后训练范例。
摘要:Aligning generative diffusion models with human preferences via reinforcement learning (RL) is critical yet challenging. Most existing algorithms are often vulnerable to reward hacking, such as quality degradation, over-stylization, or reduced diversity. Our analysis demonstrates that this can be attributed to the inherent limitations of their regularization, which provides unreliable penalties. We introduce Data-regularized Diffusion Reinforcement Learning (DDRL), a novel framework that uses the forward KL divergence to anchor the policy to an off-policy data distribution. Theoretically, DDRL enables robust, unbiased integration of RL with standard diffusion training. Empirically, this translates into a simple yet effective algorithm that combines reward maximization with diffusion loss minimization. With over a million GPU hours of experiments and ten thousand double-blind human evaluations, we demonstrate on high-resolution video generation tasks that DDRL significantly improves rewards while alleviating the reward hacking seen in baselines, achieving the highest human preference and establishing a robust and scalable paradigm for diffusion post-training.
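The abstract's "reward maximization plus diffusion loss minimization" objective can be sketched with scalar stand-ins; the weighting `lam` and the toy noise-prediction "model" are illustrative assumptions:

```python
def diffusion_loss(pred_noise, true_noise):
    """Standard denoising MSE on off-policy data (the forward-KL anchor)."""
    return sum((p - t) ** 2 for p, t in zip(pred_noise, true_noise)) / len(true_noise)

def ddrl_objective(reward, pred_noise, true_noise, lam=0.5):
    """Minimize negative reward plus the data-anchored diffusion loss."""
    return -reward + lam * diffusion_loss(pred_noise, true_noise)

# A policy that stays close to the data distribution scores better than one
# that drifts, even at identical reward: the anchor penalizes reward hacking.
anchored = ddrl_objective(reward=0.8, pred_noise=[0.1, -0.2], true_noise=[0.0, 0.0])
drifted  = ddrl_objective(reward=0.8, pred_noise=[1.0, -1.0], true_noise=[0.0, 0.0])
```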
【7】Continuous-time reinforcement learning for optimal switching over multiple regimes
标题:连续时间强化学习在多个机制上的最优切换
链接:https://arxiv.org/abs/2512.04697
作者:Yijie Huang,Mengge Li,Xiang Yu,Zhou Zhou
备注:Keywords: Optimal regime switching, multiple regimes, continuous-time reinforcement learning, system of HJB equations, policy improvement, policy iteration convergence
摘要:本文研究了多状态(regime)间最优切换问题的连续时间强化学习(RL)。我们考虑一种熵正则化下的探索性形式,其中智能体通过相关连续时间有限状态马尔可夫链的生成元矩阵,对切换时机和状态选择进行随机化。我们建立了相应的Hamilton-Jacobi-Bellman(HJB)方程组的适定性,并给出了最优策略的一个刻画。通过对方程组的分析,严格地建立了策略改进和策略迭代的收敛性。我们还证明,当温度参数趋于零时,探索性形式中的值函数收敛到经典形式中的值函数。最后,通过调用基于鞅刻画的策略评估,设计并实现了一个强化学习算法。在神经网络的帮助下,我们的数值例子说明了所提出的RL算法的有效性。
摘要:This paper studies the continuous-time reinforcement learning (RL) for optimal switching problems across multiple regimes. We consider a type of exploratory formulation under entropy regularization where the agent randomizes both the timing of switches and the selection of regimes through the generator matrix of an associated continuous-time finite-state Markov chain. We establish the well-posedness of the associated system of Hamilton-Jacobi-Bellman (HJB) equations and provide a characterization of the optimal policy. The policy improvement and the convergence of the policy iterations are rigorously established by analyzing the system of equations. We also show the convergence of the value function in the exploratory formulation towards the value function in the classical formulation as the temperature parameter vanishes. Finally, a reinforcement learning algorithm is devised and implemented by invoking the policy evaluation based on the martingale characterization. Our numerical examples with the aid of neural networks illustrate the effectiveness of the proposed RL algorithm.
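The vanishing-temperature limit mentioned above can be illustrated with a softmax randomization over regime values; the softmax form is a generic stand-in for the entropy-regularized policy, and the values are toy numbers:

```python
import math

def soft_policy(values, temperature):
    """Softmax randomization over regimes; higher temperature, more exploration."""
    m = max(values)
    weights = [math.exp((v - m) / temperature) for v in values]
    z = sum(weights)
    return [w / z for w in weights]

values = [1.0, 1.5, 0.5]              # toy value of switching into each regime
hot  = soft_policy(values, 1.0)       # exploratory: spreads probability
cold = soft_policy(values, 0.01)      # near the classical argmax policy
```

As the temperature parameter shrinks, the randomized regime choice concentrates on the classically optimal regime, mirroring the convergence result in the abstract.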
元学习(1篇)
【1】Meta-Learning for Quantum Optimization via Quantum Sequence Model
标题:通过量子序列模型实现量子优化的元学习
链接:https://arxiv.org/abs/2512.05058
作者:Yu-Cheng Lin,Yu-Chao Hsu,Samuel Yen-Chi Chen
摘要:量子近似优化算法(QAOA)是在近期量子处理器上解决组合优化问题的领先方法。然而,由于非凸的能量景观,找到好的变分参数仍然是一个重大的挑战,往往导致收敛速度慢,解决方案质量差。在这项工作中,我们提出了一个量子元学习框架,训练先进的量子序列模型,以生成有效的参数初始化策略。我们研究了四个经典或量子序列模型,包括基于量子内核的长短期记忆(QK-LSTM),作为“学习学习”范式中的学习优化器。我们在Max-Cut问题上的数值实验表明,QK-LSTM优化器实现了卓越的性能,获得了最高的近似比,并在所有测试的问题大小(n=10到13)中表现出最快的收敛速度。至关重要的是,QK-LSTM模型通过合成一组固定的接近最优的参数来实现完美的参数可移植性,即使在推广到更大的问题时,也能显著持续加速收敛。这种能力,由量子内核架构的紧凑和表达能力所支持,强调了它的有效性。QK-LSTM只有43个可训练参数,大大优于经典LSTM(56个参数)和其他量子序列模型,为NISQ时代的变分量子算法建立了一条高效参数初始化的强大途径。
摘要:The Quantum Approximate Optimization Algorithm (QAOA) is a leading approach for solving combinatorial optimization problems on near-term quantum processors. However, finding good variational parameters remains a significant challenge due to the non-convex energy landscape, often resulting in slow convergence and poor solution quality. In this work, we propose a quantum meta-learning framework that trains advanced quantum sequence models to generate effective parameter initialization policies. We investigate four classical or quantum sequence models, including the Quantum Kernel-based Long Short-Term Memory (QK-LSTM), as learned optimizers in a "learning to learn" paradigm. Our numerical experiments on the Max-Cut problem demonstrate that the QK-LSTM optimizer achieves superior performance, obtaining the highest approximation ratios and exhibiting the fastest convergence rate across all tested problem sizes (n=10 to 13). Crucially, the QK-LSTM model achieves perfect parameter transferability by synthesizing a single, fixed set of near-optimal parameters, leading to a remarkable sustained acceleration of convergence even when generalizing to larger problems. This capability, enabled by the compact and expressive power of the quantum kernel architecture, underscores its effectiveness. The QK-LSTM, with only 43 trainable parameters, substantially outperforms the classical LSTM (56 parameters) and other quantum sequence models, establishing a robust pathway toward highly efficient parameter initialization for variational quantum algorithms in the NISQ era.
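The reported parameter transferability can be mimicked with a toy optimization picture: if near-optimal parameters vary little across problem sizes, warm-starting from a smaller instance's solution converges in far fewer steps than a cold start. Everything below (quadratic landscape, learning rate, nearby optima) is an illustrative assumption with no quantum circuits involved:

```python
def steps_to_converge(x0, optimum, lr=0.3, tol=1e-3):
    """Gradient descent on (x - optimum)**2; count steps until |error| <= tol."""
    x, steps = x0, 0
    while abs(x - optimum) > tol:
        x -= lr * 2 * (x - optimum)    # gradient step
        steps += 1
    return steps

optimum_small, optimum_large = 0.78, 0.80   # nearby optima across sizes (assumed)
cold = steps_to_converge(0.0, optimum_large)            # cold initialization
warm = steps_to_converge(optimum_small, optimum_large)  # transferred parameters
```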
医学相关(3篇)
【1】Deep infant brain segmentation from multi-contrast MRI
标题:多对比度MRI对婴儿大脑进行深度分割
链接:https://arxiv.org/abs/2512.05114
作者:Malte Hoffmann,Lilla Zöllei,Adrian V. Dalca
备注:8 pages, 8 figures, 1 table, website at https://w3id.org/babyseg, presented at the 2025 IEEE Asilomar Conference on Signals, Systems, and Computers
摘要:磁共振图像(MRI)的分割通过描绘解剖结构来帮助分析人脑的发育。然而,在婴儿和幼儿中,由于发育和成像方面的限制,准确分割具有挑战性。众所周知,儿科脑部MRI难以采集:成像模式的可用性不一致,视野中存在大量非头部解剖结构,并且运动伪影频繁。这导致了专门的分割模型,这些模型通常限于特定的图像类型或狭窄的年龄组,或者对于变化更大的图像(例如临床采集的图像)较为脆弱。我们使用BabySeg解决了这种方法碎片化问题。BabySeg是一种面向婴幼儿的深度学习脑分割框架,支持多种MRI协议,包括重复扫描和训练期间不可用的图像类型。我们的方法建立在最近的域随机化技术之上,该技术合成远超现实界限的训练图像,以促进对数据集偏移的不变性。我们还描述了一种机制,使模型能够灵活地汇聚并交互来自任意数量输入扫描的特征。我们展示了最先进的性能:使用单一模型即可在各种年龄组和输入配置下匹配或超过几种现有方法的准确性,而所需运行时间仅为许多现有工具的一小部分。
摘要:Segmentation of magnetic resonance images (MRI) facilitates analysis of human brain development by delineating anatomical structures. However, in infants and young children, accurate segmentation is challenging due to development and imaging constraints. Pediatric brain MRI is notoriously difficult to acquire, with inconsistent availability of imaging modalities, substantial non-head anatomy in the field of view, and frequent motion artifacts. This has led to specialized segmentation models that are often limited to specific image types or narrow age groups, or that are fragile for more variable images such as those acquired clinically. We address this method fragmentation with BabySeg, a deep learning brain segmentation framework for infants and young children that supports diverse MRI protocols, including repeat scans and image types unavailable during training. Our approach builds on recent domain randomization techniques, which synthesize training images far beyond realistic bounds to promote dataset shift invariance. We also describe a mechanism that enables models to flexibly pool and interact features from any number of input scans. We demonstrate state-of-the-art performance that matches or exceeds the accuracy of several existing methods for various age cohorts and input configurations using a single model, in a fraction of the runtime required by many existing tools.
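The domain-randomization idea can be sketched as label-conditioned image synthesis: each tissue label receives a random intensity per synthesized image, so no fixed contrast survives training. The uniform intensity range and toy label map are illustrative assumptions, far simpler than actual MRI synthesis:

```python
import random

def synthesize(label_map, rng):
    """Render a label map with one freshly sampled intensity per label."""
    intensity = {}
    image = []
    for row in label_map:
        out = []
        for lab in row:
            if lab not in intensity:
                intensity[lab] = rng.uniform(0.0, 1.0)
            out.append(intensity[lab])
        image.append(out)
    return image

rng = random.Random(0)
labels = [[0, 0, 1],
          [0, 1, 1],
          [2, 2, 1]]                    # toy "anatomy": three tissue labels
image_a = synthesize(labels, rng)       # same anatomy...
image_b = synthesize(labels, rng)       # ...different random contrast
```

Training on many such renderings pushes the segmenter to rely on shape and spatial context rather than any particular modality's contrast.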
【2】OMTRA: A Multi-Task Generative Model for Structure-Based Drug Design
标题:OMTRA:基于结构的药物设计的多任务生成模型
链接:https://arxiv.org/abs/2512.05080
作者:Ian Dunn,Liv Toft,Tyler Katz,Juhi Gupta,Riya Shah,Ramith Hettiarachchi,David R. Koes
备注:Presented at the Machine Learning for Structural Biology Workshop, 2025
摘要:基于结构的药物设计(SBDD)专注于设计与特定蛋白质口袋结合的小分子配体。计算方法是现代SBDD工作流程中不可或缺的一部分,通常通过对接或药效团搜索使用虚拟筛选方法。现代生成建模方法的重点是通过从头设计来改进新的配体发现。在这项工作中,我们认识到,这些任务共享一个共同的结构,因此可以表示为一个一致的生成建模框架的不同实例。我们提出了一个统一的方法,在OMTRA,多模式的流匹配模型,灵活地执行许多相关的SBDD任务,包括一些没有模拟在传统的工作流程。此外,我们策划了一个500 M 3D分子构象的数据集,补充了蛋白质-配体数据,并扩大了可用于训练的化学多样性。OMTRA在口袋条件从头设计和对接方面获得了最先进的性能;然而,大规模预训练和多任务训练的效果是适度的。用于复制这项工作的所有代码、训练模型和数据集均可在https://github.com/gnina/OMTRA上获得
摘要:Structure-based drug design (SBDD) focuses on designing small-molecule ligands that bind to specific protein pockets. Computational methods are integral in modern SBDD workflows and often make use of virtual screening methods via docking or pharmacophore search. Modern generative modeling approaches have focused on improving novel ligand discovery by enabling de novo design. In this work, we recognize that these tasks share a common structure and can therefore be represented as different instantiations of a consistent generative modeling framework. We propose a unified approach in OMTRA, a multi-modal flow matching model that flexibly performs many tasks relevant to SBDD, including some with no analogue in conventional workflows. Additionally, we curate a dataset of 500M 3D molecular conformers, complementing protein-ligand data and expanding the chemical diversity available for training. OMTRA obtains state of the art performance on pocket-conditioned de novo design and docking; however, the effects of large-scale pretraining and multi-task training are modest. All code, trained models, and dataset for reproducing this work are available at https://github.com/gnina/OMTRA
【3】SmartAlert: Implementing Machine Learning-Driven Clinical Decision Support for Inpatient Lab Utilization Reduction
标题:SmartAlert:实施机器学习驱动的临床决策支持,以减少住院实验室利用率
链接:https://arxiv.org/abs/2512.04354
作者:April S. Liang,Fatemeh Amrollahi,Yixing Jiang,Conor K. Corbin,Grace Y. E. Kim,David Mui,Trevor Crowell,Aakash Acharya,Sreedevi Mony,Soumya Punnathanam,Jack McKeown,Margaret Smith,Steven Lin,Arnold Milstein,Kevin Schulman,Jason Hom,Michael A. Pfeffer,Tho D. Pham,David Svec,Weihan Chu,Lisa Shieh,Christopher Sharp,Stephen P. Ma,Jonathan H. Chen
备注:22 pages, 5 figures
摘要:重复进行不太可能产生临床有用信息的实验室检测是一种常见的做法,它给患者带来负担并增加医疗费用。教育和反馈干预措施的成功有限,而一般性的检测医嘱限制和电子警报会妨碍适当的临床护理。我们介绍并评估了SmartAlert,这是一种集成到电子健康记录中的机器学习(ML)驱动的临床决策支持(CDS)系统,可预测稳定的实验室结果,以减少不必要的重复检测。本案例研究描述了在2024年8月15日至2025年3月15日期间,在两家医院8个急性护理病房的9270次住院中开展的随机对照试点中,部署针对全血细胞计数(CBC)利用率的SmartAlert的实施过程、挑战和经验教训。结果显示,SmartAlert显示后52小时内CBC结果数量显著减少(1.54 vs 1.82,p<0.01),对次要安全性结局无不良影响,相当于重复检测相对减少15%。实施经验教训包括:在临床环境中解释概率模型预测,通过利益相关者参与来定义可接受的模型行为,在临床环境中部署复杂模型的治理流程,用户界面设计考虑,与临床运营优先级的一致性,以及来自最终用户的定性反馈的价值。总之,由深思熟虑的实施和治理过程支持的机器学习驱动的CDS系统,可以为住院实验室检测提供精确指导,从而安全地减少不必要的重复检测。
摘要:Repetitive laboratory testing unlikely to yield clinically useful information is a common practice that burdens patients and increases healthcare costs. Education and feedback interventions have limited success, while general test ordering restrictions and electronic alerts impede appropriate clinical care. We introduce and evaluate SmartAlert, a machine learning (ML)-driven clinical decision support (CDS) system integrated into the electronic health record that predicts stable laboratory results to reduce unnecessary repeat testing. This case study describes the implementation process, challenges, and lessons learned from deploying SmartAlert targeting complete blood count (CBC) utilization in a randomized controlled pilot across 9270 admissions in eight acute care units across two hospitals between August 15, 2024, and March 15, 2025. Results show significant decrease in number of CBC results within 52 hours of SmartAlert display (1.54 vs 1.82, p <0.01) without adverse effect on secondary safety outcomes, representing a 15% relative reduction in repetitive testing. Implementation lessons learned include interpretation of probabilistic model predictions in clinical contexts, stakeholder engagement to define acceptable model behavior, governance processes for deploying a complex model in a clinical environment, user interface design considerations, alignment with clinical operational priorities, and the value of qualitative feedback from end users. In conclusion, a machine learning-driven CDS system backed by a deliberate implementation and governance process can provide precision guidance on inpatient laboratory testing to safely reduce unnecessary repetitive testing.
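The headline effect can be checked by simple arithmetic: a drop from 1.82 to 1.54 CBC results within 52 hours is roughly a 15% relative reduction, matching the reported figure.

```python
# Reported: 1.54 CBC results (intervention) vs 1.82 (control) within 52 hours.
control, intervention = 1.82, 1.54
relative_reduction = (control - intervention) / control   # ~0.154
```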
蒸馏|知识提取(2篇)
【1】MemLoRA: Distilling Expert Adapters for On-Device Memory Systems
标题:MemLoRA:为设备上存储系统提取专家适配器
链接:https://arxiv.org/abs/2512.04763
作者:Massimo Bini,Ondrej Bohdal,Umberto Michieli,Zeynep Akata,Mete Ozay,Taha Ceritli
摘要:记忆增强的大型语言模型(LLM)通过存储相关记忆并将其作为上下文,在长时间对话中表现出显著的一致性。这种基于记忆的个性化对于设备端设置也很关键,它允许用户保持对话和数据的私密性。然而,记忆增强系统通常依赖于成本过高而无法在本地设备上部署的LLM。尽管小型语言模型(SLM)比LLM更适合设备端推理,但它们无法达到足够的性能。此外,这些基于LLM的系统缺乏原生的视觉能力,限制了它们在多模态环境中的适用性。在本文中,我们介绍了(i)MemLoRA,一种通过为SLM配备专用记忆适配器来实现本地部署的新型记忆系统,以及(ii)其视觉扩展MemLoRA-V,它将小型视觉语言模型(SVLM)集成到记忆系统中,实现原生的视觉理解。遵循知识蒸馏原则,每个适配器都针对特定的记忆操作(知识提取、记忆更新和记忆增强生成)单独训练。配备记忆适配器后,小型模型无需依赖云端即可实现准确的设备端记忆操作。在纯文本操作上,MemLoRA在LoCoMo基准上优于大10倍的基线模型(例如Gemma2-27B),并实现了与大60倍的模型(例如GPT-OSS-120B)相当的性能。为了评估视觉理解操作,我们为LoCoMo扩展了需要直接视觉推理的具有挑战性的视觉问答任务。在此任务上,我们集成VLM的MemLoRA-V相比基于字幕的方法显示出巨大的改进(准确率81.3 vs. 23.7),同时在基于文本的任务中保持强劲性能,证明了我们方法在多模态环境中的有效性。
摘要:Memory-augmented Large Language Models (LLMs) have demonstrated remarkable consistency during prolonged dialogues by storing relevant memories and incorporating them as context. Such memory-based personalization is also key in on-device settings that allow users to keep their conversations and data private. However, memory-augmented systems typically rely on LLMs that are too costly for local on-device deployment. Even though Small Language Models (SLMs) are more suitable for on-device inference than LLMs, they cannot achieve sufficient performance. Additionally, these LLM-based systems lack native visual capabilities, limiting their applicability in multimodal contexts. In this paper, we introduce (i) MemLoRA, a novel memory system that enables local deployment by equipping SLMs with specialized memory adapters, and (ii) its vision extension MemLoRA-V, which integrates small Vision-Language Models (SVLMs) to memory systems, enabling native visual understanding. Following knowledge distillation principles, each adapter is trained separately for specific memory operations$\unicode{x2013}$knowledge extraction, memory update, and memory-augmented generation. Equipped with memory adapters, small models enable accurate on-device memory operations without cloud dependency. On text-only operations, MemLoRA outperforms 10$\times$ larger baseline models (e.g., Gemma2-27B) and achieves performance comparable to 60$\times$ larger models (e.g., GPT-OSS-120B) on the LoCoMo benchmark. To evaluate visual understanding operations instead, we extend LoCoMo with challenging Visual Question Answering tasks that require direct visual reasoning. On this, our VLM-integrated MemLoRA-V shows massive improvements over caption-based approaches (81.3 vs. 23.7 accuracy) while keeping strong performance in text-based tasks, demonstrating the efficacy of our method in multimodal contexts.
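The per-operation adapter design can be sketched as dispatching each memory operation to its own specialized function; the toy string-matching bodies below stand in for the trained adapters and are entirely our assumption:

```python
def extract(text):
    """Stand-in 'knowledge extraction' adapter: keep preference-like facts."""
    return [s for s in text.split(". ") if "likes" in s]

def update(memory, facts):
    """Stand-in 'memory update' adapter: merge new facts into the store."""
    return memory | set(facts)

def generate(memory, question):
    """Stand-in 'memory-augmented generation' adapter: answer from memory."""
    relevant = [m for m in memory if any(w in m for w in question.split())]
    return " ".join(sorted(relevant)) or "unknown"

# One small base model, one specialized adapter per memory operation.
adapters = {"extract": extract, "update": update, "generate": generate}

memory = set()
facts = adapters["extract"]("I moved to Oslo. My cat likes tuna")
memory = adapters["update"](memory, facts)
answer = adapters["generate"](memory, "what does the cat likes")
```

The dispatch table mirrors the idea that a single small base model switches between narrowly trained adapters, keeping all three operations on-device.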
【2】Rethinking Decoupled Knowledge Distillation: A Predictive Distribution Perspective
标题:从预测分布的角度重新思考解耦知识蒸馏
链接:https://arxiv.org/abs/2512.04625
作者:Bowen Zheng,Ran Cheng
备注:Accepted to IEEE TNNLS
摘要:在知识蒸馏的历史中,研究焦点曾随着时间的推移从基于logit的方法转移到基于特征的方法。然而,随着解耦知识蒸馏(DKD)的出现,这一转变被重新审视:DKD通过先进的解耦和加权策略重新强调了logit知识的重要性。虽然DKD标志着一个重大进步,但其潜在机制值得更深入的探索。为此,我们从预测分布的角度重新思考DKD。首先,我们介绍了一个增强版本,即广义解耦知识蒸馏(GDKD)损失,它提供了一种更通用的logit解耦方法。然后,我们特别关注教师模型的预测分布及其对GDKD损失梯度的影响,揭示了两个经常被忽视的关键见解:(1)按顶部logit进行的划分大大改善了非顶部logits之间的相互关系;(2)放大对非顶部logits蒸馏损失的关注,增强了它们之间的知识提取。利用这些见解,我们进一步提出了一个精简的GDKD算法,它带有一个高效的划分策略,以处理教师模型预测分布的多峰性。我们在CIFAR-100、ImageNet、Tiny-ImageNet、CUB-200-2011和Cityscapes等各种基准测试上进行了全面的实验,证明GDKD的性能优于原始DKD和其他领先的知识蒸馏方法。该代码可从https://github.com/ZaberKo/GDKD获得。
摘要:In the history of knowledge distillation, the focus has once shifted over time from logit-based to feature-based approaches. However, this transition has been revisited with the advent of Decoupled Knowledge Distillation (DKD), which re-emphasizes the importance of logit knowledge through advanced decoupling and weighting strategies. While DKD marks a significant advancement, its underlying mechanisms merit deeper exploration. As a response, we rethink DKD from a predictive distribution perspective. First, we introduce an enhanced version, the Generalized Decoupled Knowledge Distillation (GDKD) loss, which offers a more versatile method for decoupling logits. Then we pay particular attention to the teacher model's predictive distribution and its impact on the gradients of GDKD loss, uncovering two critical insights often overlooked: (1) the partitioning by the top logit considerably improves the interrelationship of non-top logits, and (2) amplifying the focus on the distillation loss of non-top logits enhances the knowledge extraction among them. Utilizing these insights, we further propose a streamlined GDKD algorithm with an efficient partition strategy to handle the multimodality of teacher models' predictive distribution. Our comprehensive experiments conducted on a variety of benchmarks, including CIFAR-100, ImageNet, Tiny-ImageNet, CUB-200-2011, and Cityscapes, demonstrate GDKD's superior performance over both the original DKD and other leading knowledge distillation methods. The code is available at https://github.com/ZaberKo/GDKD.
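The top/non-top logit decoupling at the heart of DKD-style losses can be sketched as follows; the binary-plus-renormalized split and the fixed weight `beta` follow the general DKD recipe, while the paper's generalized partition and weighting are more flexible:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p, q):
    """KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def decoupled_kd(student_logits, teacher_logits, beta=2.0):
    t, s = softmax(teacher_logits), softmax(student_logits)
    top = max(range(len(t)), key=t.__getitem__)       # teacher's top class
    # binary part: top class vs everything else
    binary = kl([t[top], 1 - t[top]], [s[top], 1 - s[top]])
    # non-top part: renormalized over the remaining classes,
    # amplified by beta to emphasize knowledge among non-top logits
    tn = [p / (1 - t[top]) for i, p in enumerate(t) if i != top]
    sn = [p / (1 - s[top]) for i, p in enumerate(s) if i != top]
    return binary + beta * kl(tn, sn)

loss_same = decoupled_kd([2.0, 1.0, 0.1], [2.0, 1.0, 0.1])   # identical logits
loss_diff = decoupled_kd([0.1, 1.0, 2.0], [2.0, 1.0, 0.1])   # reversed preference
```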
联邦学习|隐私保护|加密(1篇)
【1】Federated Learning for Anomaly Detection in Maritime Movement Data
标题:用于海洋运动数据异常检测的联邦学习
链接:https://arxiv.org/abs/2512.04635
作者:Anita Graser,Axel Weißenfeld,Clemens Heistracher,Melitta Dragaschnig,Peter Widhalm
备注:Accepted at MDM2024
摘要:本文介绍了M3fed,一种用于运动异常检测模型的联邦学习的新解决方案。这项创新有可能提高数据隐私,降低用于运动异常检测的机器学习的通信成本。我们介绍了用于训练M3fed的新型联邦学习(FL)策略,使用海事AIS数据进行示例实验,并通过比较经典的集中式M3和新的联邦式M3fed,从通信成本和FL模型质量两方面评估结果。
摘要:This paper introduces M3fed, a novel solution for federated learning of movement anomaly detection models. This innovation has the potential to improve data privacy and reduce communication costs in machine learning for movement anomaly detection. We present the novel federated learning (FL) strategies employed to train M3fed, perform an example experiment with maritime AIS data, and evaluate the results with respect to communication costs and FL model quality by comparing classic centralized M3 and the new federated M3fed.
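The abstract does not detail the paper's specific FL strategies; as background, the standard FedAvg aggregation that federated schemes like M3fed build on can be sketched as:

```python
def fedavg(client_params, client_sizes):
    """Minimal FedAvg aggregation sketch (not the paper's exact M3fed
    strategy): average each parameter across clients, weighted by the
    number of local samples each client trained on."""
    total = sum(client_sizes)
    n_params = len(client_params[0])
    return [
        sum(w[i] * n for w, n in zip(client_params, client_sizes)) / total
        for i in range(n_params)
    ]
```

Only these aggregated parameters (never the raw AIS trajectories) would leave each client, which is the source of the privacy and communication benefits the abstract mentions.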
推理|分析|理解|解释(8篇)
【1】Arbitrage: Efficient Reasoning via Advantage-Aware Speculation
标题:套利:通过优势感知推测实现高效推理
链接:https://arxiv.org/abs/2512.05033
作者:Monishwaran Maheswaran,Rishabh Tiwari,Yuezhou Hu,Kerem Dilmen,Coleman Hooper,Haocheng Xi,Nicholas Lee,Mehrdad Farajtabar,Michael W. Mahoney,Kurt Keutzer,Amir Gholami
备注:22 pages
摘要:现代大型语言模型凭借长思维链实现了令人印象深刻的推理能力,但它们在推理过程中会产生大量的计算成本,这促使人们研究提高性能成本比的技术。在这些技术中,推测解码通过采用快速但不准确的草稿模型来自回归地提出令牌,然后由更有能力的目标模型并行验证,从而加速推理。然而,由于语义等价的步骤中令牌不匹配导致不必要的拒绝,传统的令牌级推测解码在推理任务中表现不佳。虽然最近的工作已经转向步骤级语义验证,通过接受或拒绝整个推理步骤来提高效率,但现有的步骤级方法仍然会重新生成许多被拒绝的步骤且几乎没有改进,浪费了宝贵的目标模型计算。为了解决这一挑战,我们提出了套利(Arbitrage),一个新的步骤级推测生成框架,基于草稿模型和目标模型之间的相对优势动态路由生成。Arbitrage没有使用固定的接受阈值,而是使用经过训练的轻量级路由器来预测目标模型何时可能产生明显更好的步骤。这种路由近似于总是选择更高质量步骤的理想套利Oracle,实现接近最优的效率-准确性权衡。在多个数学推理基准测试中,Arbitrage始终超过先前的步骤级推测解码基线,在匹配的准确率下将推理延迟降低高达$\sim2\times$。
摘要:Modern Large Language Models achieve impressive reasoning capabilities with long Chain of Thoughts, but they incur substantial computational cost during inference, and this motivates techniques to improve the performance-cost ratio. Among these techniques, Speculative Decoding accelerates inference by employing a fast but inaccurate draft model to autoregressively propose tokens, which are then verified in parallel by a more capable target model. However, due to unnecessary rejections caused by token mismatches in semantically equivalent steps, traditional token-level Speculative Decoding struggles in reasoning tasks. Although recent works have shifted to step-level semantic verification, which improve efficiency by accepting or rejecting entire reasoning steps, existing step-level methods still regenerate many rejected steps with little improvement, wasting valuable target compute. To address this challenge, we propose Arbitrage, a novel step-level speculative generation framework that routes generation dynamically based on the relative advantage between draft and target models. Instead of applying a fixed acceptance threshold, Arbitrage uses a lightweight router trained to predict when the target model is likely to produce a meaningfully better step. This routing approximates an ideal Arbitrage Oracle that always chooses the higher-quality step, achieving near-optimal efficiency-accuracy trade-offs. Across multiple mathematical reasoning benchmarks, Arbitrage consistently surpasses prior step-level Speculative Decoding baselines, reducing inference latency by up to $\sim2\times$ at matched accuracy.
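A minimal sketch of the advantage-aware routing loop described above; `draft_step`, `target_step` and `router` are hypothetical callables standing in for the draft model, the target model, and the trained lightweight router:

```python
def arbitrage_decode(prompt, draft_step, target_step, router, max_steps=8):
    """Sketch of advantage-aware step routing (names are illustrative):
    a cheap draft model proposes each reasoning step, and a lightweight
    router decides whether the target model would produce a meaningfully
    better step; only then is the expensive target model invoked."""
    steps = []
    for _ in range(max_steps):
        proposal = draft_step(prompt, steps)
        if router(prompt, steps, proposal):   # predicted target advantage
            proposal = target_step(prompt, steps)
        steps.append(proposal)
        if proposal.endswith("<eos>"):
            break
    return steps
```

Unlike threshold-based acceptance, rejected drafts are never regenerated by the draft model; the router directly decides which model produces each step.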
【2】Environment-Aware Channel Inference via Cross-Modal Flow: From Multimodal Sensing to Wireless Channels
标题:通过跨模态流的环境感知信道推理:从多模态感知到无线信道
链接:https://arxiv.org/abs/2512.04966
作者:Guangming Liang,Mingjie Yang,Dongzhu Liu,Paul Henderson,Lajos Hanzo
备注:13 pages, 13 figures, 40 references, submitted to IEEE for possible publication
摘要:准确的信道状态信息(CSI)是可靠和高效的无线通信的基础。然而,经由导频估计获取CSI会引起大量开销,特别是在高多普勒环境中运行的大规模多输入多输出(MIMO)系统中。通过利用日益丰富的环境传感数据,本文探讨了无导频信道推断,即直接从包括相机图像、激光雷达点云和GPS坐标在内的多模态观测中估计完整的CSI。与依赖于预定义信道模型的先前研究不同,我们开发了一个数据驱动的框架,将传感到信道的映射表述为跨模态流匹配问题。该框架将多模态特征融合到信道域内的潜在分布中,并学习一个速度场,将潜在分布连续地变换为信道分布。为了使这一表述易于处理且高效,我们将问题重新表述为等价的条件流匹配目标,并纳入模态对齐损失,同时采用低延迟推理机制以实现实时CSI估计。在实验中,我们构建了一个基于Sionna和Blender的程序化数据生成器,以支持传感场景和无线传播的真实感建模。系统级评估表明,在下行波束成形任务的信道估计精度和频谱效率方面,相比基于导频和基于感知的基准均有显著改进。
摘要:Accurate channel state information (CSI) underpins reliable and efficient wireless communication. However, acquiring CSI via pilot estimation incurs substantial overhead, especially in massive multiple-input multiple-output (MIMO) systems operating in high-Doppler environments. By leveraging the growing availability of environmental sensing data, this treatise investigates pilot-free channel inference that estimates complete CSI directly from multimodal observations, including camera images, LiDAR point clouds, and GPS coordinates. In contrast to prior studies that rely on predefined channel models, we develop a data-driven framework that formulates the sensing-to-channel mapping as a cross-modal flow matching problem. The framework fuses multimodal features into a latent distribution within the channel domain, and learns a velocity field that continuously transforms the latent distribution toward the channel distribution. To make this formulation tractable and efficient, we reformulate the problem as an equivalent conditional flow matching objective and incorporate a modality alignment loss, while adopting low-latency inference mechanisms to enable real-time CSI estimation. In experiments, we build a procedural data generator based on Sionna and Blender to support realistic modeling of sensing scenes and wireless propagation. System-level evaluations demonstrate significant improvements over pilot- and sensing-based benchmarks in both channel estimation accuracy and spectral efficiency for the downstream beamforming task.
【3】Amortized Inference of Multi-Modal Posteriors using Likelihood-Weighted Normalizing Flows
标题:使用似然加权归一化流的多模态后验摊销推断
链接:https://arxiv.org/abs/2512.04954
作者:Rajneil Baruah
备注:14 pages, 8 figures
摘要:我们提出了一种使用以似然加权重要性采样训练的归一化流进行摊销后验估计的新技术。这种方法允许对高维逆问题中的理论参数进行有效推断,而不需要后验训练样本。我们在2D和3D的多模态基准任务上实现了该方法,以检验其有效性。我们研究的一个重要观察是基础分布的拓扑结构对所建模后验的影响。我们发现,标准的单峰基础分布无法捕获不连通的支撑集,导致模式之间出现虚假的概率桥。我们证明,使用与目标模式基数匹配的高斯混合模型初始化流,可以显著提高重建保真度(通过若干距离和散度度量衡量)。
摘要:We present a novel technique for amortized posterior estimation using Normalizing Flows trained with likelihood-weighted importance sampling. This approach allows for the efficient inference of theoretical parameters in high-dimensional inverse problems without the need for posterior training samples. We implement the method on multi-modal benchmark tasks in 2D and 3D to check for the efficacy. A critical observation of our study is the impact of the topology of the base distributions on the modelled posteriors. We find that standard unimodal base distributions fail to capture disconnected support, resulting in spurious probability bridges between modes. We demonstrate that initializing the flow with a Gaussian Mixture Model that matches the cardinality of the target modes significantly improves reconstruction fidelity, as measured by some distance and divergence metrics.
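The training signal named in the abstract (likelihood-weighted importance sampling) can be sketched as a self-normalized weighted negative log-likelihood. This is a sketch of the general technique, not the paper's exact objective; `log_lik`, `log_q` and `log_p_flow` are hypothetical callables for the likelihood, the proposal density, and the flow's model density:

```python
import math

def weighted_flow_loss(samples, log_lik, log_q, log_p_flow):
    # Self-normalized importance weights: w_i proportional to
    # exp(log L(x_i) - log q(x_i)), computed stably via a max shift.
    logw = [log_lik(x) - log_q(x) for x in samples]
    m = max(logw)
    w = [math.exp(l - m) for l in logw]
    s = sum(w)
    # Weighted negative log-density of the flow under those weights.
    return -sum(wi * log_p_flow(x) for wi, x in zip(w, samples)) / s
```

When proposal and likelihood coincide, the weights are uniform and this reduces to the ordinary negative log-likelihood over the samples.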
【4】A Tutorial on Regression Analysis: From Linear Models to Deep Learning -- Lecture Notes on Artificial Intelligence
标题:回归分析教程:从线性模型到深度学习--人工智能课堂讲义
链接:https://arxiv.org/abs/2512.04747
作者:Jingyuan Wang,Jiahao Ji
摘要:本文是智能计算课程群(包括人工智能、数据挖掘、机器学习和模式识别课程)中的回归分析课堂讲义。其目的是让仅具备基础大学数学(即修过微积分、线性代数和概率论先修课程)的学生,在不需要任何额外参考资料的情况下,对回归分析获得全面而自成体系的理解。课堂讲义系统地介绍了回归分析的基本概念、建模组件和理论基础,包括线性回归、逻辑回归、多项逻辑回归、多项式回归、基函数模型、基于核的方法和基于神经网络的非线性回归。核心方法论主题包括损失函数设计、参数估计原理、普通最小二乘法、基于梯度的优化算法及其变体,以及岭回归和LASSO回归等正则化技术。通过详细的数学推导、说明性的例子和直观的可视化解释,这些材料不仅帮助学生理解回归模型是如何构建和优化的,还揭示了特征与响应变量之间的潜在关系。通过连接经典统计建模和现代机器学习实践,这些课堂讲义旨在为学生进一步学习先进的人工智能模型打下坚实的概念和技术基础。
摘要:This article serves as the regression analysis lecture notes in the Intelligent Computing course cluster (including the courses of Artificial Intelligence, Data Mining, Machine Learning, and Pattern Recognition). It aims to provide students -- who are assumed to possess only basic university-level mathematics (i.e., with prerequisite courses in calculus, linear algebra, and probability theory) -- with a comprehensive and self-contained understanding of regression analysis without requiring any additional references. The lecture notes systematically introduce the fundamental concepts, modeling components, and theoretical foundations of regression analysis, covering linear regression, logistic regression, multinomial logistic regression, polynomial regression, basis-function models, kernel-based methods, and neural-network-based nonlinear regression. Core methodological topics include loss-function design, parameter-estimation principles, ordinary least squares, gradient-based optimization algorithms and their variants, as well as regularization techniques such as Ridge and LASSO regression. Through detailed mathematical derivations, illustrative examples, and intuitive visual explanations, the materials help students understand not only how regression models are constructed and optimized, but also how they reveal the underlying relationships between features and response variables. By bridging classical statistical modeling and modern machine-learning practice, these lecture notes aim to equip students with a solid conceptual and technical foundation for further study in advanced artificial intelligence models.
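As a concrete instance of the lecture-note topics (squared loss, L2 regularization, gradient-based optimization), here is a toy 1-D ridge regression fitted by plain gradient descent; the learning rate and epoch count are illustrative:

```python
def ridge_gd(xs, ys, lam=0.1, lr=0.05, epochs=500):
    """Toy 1-D ridge regression fit by gradient descent: squared loss
    plus an L2 penalty lam * w^2 / 2, optimized with plain gradient
    steps on the slope w and intercept b."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = sum((w * x + b - y) * x for x, y in zip(xs, ys)) / n + lam * w
        gb = sum((w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * gw
        b -= lr * gb
    return w, b
```

Setting `lam=0` recovers ordinary least squares; replacing the `lam * w` gradient term with `lam * (1 if w > 0 else -1)` would give a LASSO-style subgradient step instead.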
【5】On the Limits of Test-Time Compute: Sequential Reward Filtering for Better Inference
标题:论测试时计算的极限:通过顺序奖励过滤实现更好的推理
链接:https://arxiv.org/abs/2512.04558
作者:Yue Yu,Qiwei Di,Quanquan Gu,Dongruo Zhou
备注:45 pages, 6 figures, 3 tables
摘要:测试时计算(TTC)已经成为增强大型语言模型(LLM)的一个日益突出的范式。尽管best-of-$n$(BoN)采样和顺序修订等方法在经验上取得了成功,但其基本极限仍不清楚。我们通过分析参考策略混合模型并证明标准BoN本质上是次优的来填补这一空白。为了更接近最优边界,我们研究了奖励过滤的顺序推理,这是一个简单的过程,选择性地只将高奖励的生成结果纳入上下文。该机制将计算集中在较优的策略候选上,并抑制较差的候选。在理论方面,我们表明奖励过滤的顺序推理比标准TTC范式产生严格更强的保证。在实证方面,我们在不同的基准上评估了这种推理策略,并观察到相对于广泛使用的方法的一致改进,证明了我们框架的实际有效性。
摘要:Test-time compute (TTC) has become an increasingly prominent paradigm for enhancing large language models (LLMs). Despite the empirical success of methods such as best-of-$n$ (BoN) sampling and sequential revision, their fundamental limits remain unclear. We address this gap by analyzing a mixture-of-reference policy model and proving that standard BoN is inherently suboptimal. To move closer to the optimal frontier, we study reward-filtered sequential inference, a simple procedure that selectively incorporates only high-reward generations into the context. This mechanism concentrates computation on superior policy candidates and suppresses inferior ones. On the theoretical side, we show that reward-filtered sequential inference yields strictly stronger guarantees than standard TTC paradigms. On the empirical side, we evaluate such an inference strategy across diverse benchmarks and observe consistent improvements over widely used approaches, demonstrating the practical effectiveness of our framework.
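The reward-filtered procedure described above reduces to a short loop; `generate` and `reward` are hypothetical stand-ins for the LLM sampler and the reward model, and `tau` is an illustrative threshold:

```python
def reward_filtered_inference(generate, reward, n_rounds=4, tau=0.5):
    """Sketch of reward-filtered sequential inference: each round
    conditions on the running context, and only generations whose
    reward clears the threshold tau are appended to that context;
    the best generation seen overall is returned."""
    context, best, best_r = [], None, float("-inf")
    for _ in range(n_rounds):
        g = generate(context)
        r = reward(g)
        if r > best_r:
            best, best_r = g, r
        if r >= tau:              # keep only high-reward generations
            context.append(g)
    return best
```

The contrast with BoN is that here later rounds see the filtered context, whereas BoN samples all $n$ candidates independently and only selects at the end.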
【6】Context-Aware Mixture-of-Experts Inference on CXL-Enabled GPU-NDP Systems
标题:支持CXL的GPU-NDP系统的上下文感知专家混合推理
链接:https://arxiv.org/abs/2512.04476
作者:Zehao Fan,Zhenyu Liu,Yunzhen Liu,Yayue Hou,Hadjer Benmeziane,Kaoutar El Maghraoui,Liu Liu
摘要:混合专家(MoE)模型通过条件计算扩展大型语言模型,但一旦专家权重超过GPU内存容量,推理就会受到内存限制。在这种情况下,必须将权重卸载到外部存储器,而获取它们会导致昂贵的重复传输。我们通过采用CXL附加近数据处理(CXL-NDP)作为卸载层来就地执行冷专家,将昂贵的参数移动转换为更便宜的激活移动,从而解决这个问题。与先前在很大程度上与上下文无关且被动响应的GPU-NDP系统不同,我们开发了一个上下文感知的MoE系统,该系统使用预填充阶段的激活统计来指导解码阶段的专家放置,动态地将热专家固定在GPU侧HBM中,并将其余部分映射到CXL-NDP。为了适应NDP有限的计算吞吐量,我们引入了上下文感知的混合精度量化,基于预填充阶段为每个专家分配位宽(1-4位)。由此产生的MoE推理系统使GPU和NDP的执行相互重叠,同时最大限度地减少跨设备移动。在GPU-NDP系统上的评估表明,与最先进的方法相比,我们的方法实现了高达8.7倍的解码吞吐量提升,而平均精度仅下降0.13%。
摘要:Mixture-of-Experts (MoE) models scale large language models through conditional computation, but inference becomes memory-bound once expert weights exceed the capacity of GPU memory. In this case, weights must be offloaded to external memory, and fetching them incurs costly and repeated transfers. We address this by adopting CXL-attached near-data processing (CXL-NDP) as the offloading tier to execute cold experts in place, converting expensive parameter movement into cheaper activation movement. Unlike prior GPU-NDP systems that are largely context-agnostic and reactive, we develop a context-aware MoE system that uses prefill-stage activation statistics to guide decoding-stage expert placement, dynamically pins hot experts in GPU-side HBM, and maps the remainder to CXL-NDP. To meet NDP's limited compute throughput, we introduce context-aware mixed-precision quantization that allocates per-expert bitwidths (1-4 bit) based on prefill stage. The resulting MoE inference system overlaps GPU and NDP execution while minimizing cross-device movement. The evaluation on the GPU-NDP system shows that our approach achieves up to an 8.7-fold decoding throughput improvement over the state-of-the-art method, while incurring only a 0.13% average accuracy drop.
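The prefill-guided expert placement can be sketched as a simple counting policy; the function and tier labels below are illustrative, not the paper's implementation:

```python
from collections import Counter

def place_experts(prefill_expert_ids, n_experts, gpu_slots):
    """Sketch of context-aware expert placement: count how often each
    expert fired during prefill, pin the hottest ones in GPU-side HBM,
    and map the rest to the CXL-NDP tier."""
    counts = Counter(prefill_expert_ids)
    ranked = sorted(range(n_experts), key=lambda e: -counts[e])
    gpu = set(ranked[:gpu_slots])
    return {e: ("gpu" if e in gpu else "cxl-ndp") for e in range(n_experts)}
```

The placement is recomputed per request, which is what makes the system context-aware rather than reactive.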
【7】Inference-time Stochastic Refinement of GRU-Normalizing Flow for Real-time Video Motion Transfer
标题:用于实时视频运动传递的GRU归一化流的推理时随机细化
链接:https://arxiv.org/abs/2512.04282
作者:Tasmiah Haque,Srinjoy Das
摘要:沉浸式游戏和基于视觉的异常检测等实时视频运动传递应用需要准确而多样的未来预测,以支持逼真的合成和不确定性下稳健的下游决策。为了提高这种序列预测的多样性,我们提出了一种新的推理时细化技术,将门控循环单元归一化流(GRU-NF)与随机采样方法相结合。虽然GRU-NF可以通过在时间预测框架内集成归一化流来捕获多模态分布,但其确定性的变换结构可能会限制表达能力。为了解决这个问题,受随机归一化流(SNF)的启发,我们在GRU-NF推理过程中引入了马尔可夫链蒙特卡罗(MCMC)步骤,使模型能够探索更丰富的输出空间,更好地近似真实的数据分布,而无需重新训练。我们在基于关键点的视频运动传递管道中验证了我们的方法,在该场景中,捕捉时间上连贯且感知上多样的未来轨迹对于逼真的样本和低带宽通信至关重要。实验表明,我们的推理框架,门控循环单元随机归一化流(GRU-SNF),在不牺牲准确性的情况下生成多样化输出方面优于GRU-NF,即使在较长的预测范围内也是如此。通过在推理过程中注入随机性,我们的方法更有效地捕捉多模态行为。这些结果突出了将随机动力学与基于流的序列模型相结合用于生成式时间序列预测的潜力。
摘要:Real-time video motion transfer applications such as immersive gaming and vision-based anomaly detection require accurate yet diverse future predictions to support realistic synthesis and robust downstream decision making under uncertainty. To improve the diversity of such sequential forecasts we propose a novel inference-time refinement technique that combines Gated Recurrent Unit-Normalizing Flows (GRU-NF) with stochastic sampling methods. While GRU-NF can capture multimodal distributions through its integration of normalizing flows within a temporal forecasting framework, its deterministic transformation structure can limit expressivity. To address this, inspired by Stochastic Normalizing Flows (SNF), we introduce Markov Chain Monte Carlo (MCMC) steps during GRU-NF inference, enabling the model to explore a richer output space and better approximate the true data distribution without retraining. We validate our approach in a keypoint-based video motion transfer pipeline, where capturing temporally coherent and perceptually diverse future trajectories is essential for realistic samples and low bandwidth communication. Experiments show that our inference framework, Gated Recurrent Unit- Stochastic Normalizing Flows (GRU-SNF) outperforms GRU-NF in generating diverse outputs without sacrificing accuracy, even under longer prediction horizons. By injecting stochasticity during inference, our approach captures multimodal behavior more effectively. These results highlight the potential of integrating stochastic dynamics with flow-based sequence models for generative time series forecasting.
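The inference-time MCMC refinement amounts to running a few Metropolis steps started from a flow sample. This 1-D sketch assumes access to a `log_prob` density and is not the paper's exact sampler:

```python
import math
import random

def mcmc_refine(x0, log_prob, n_steps=50, step=0.5, rng=random):
    """Inference-time refinement sketch: starting from a flow sample x0,
    run random-walk Metropolis steps against the model's log density to
    explore a richer output space without retraining."""
    x, lp = x0, log_prob(x0)
    for _ in range(n_steps):
        prop = x + rng.gauss(0.0, step)
        lp_prop = log_prob(prop)
        # Accept with probability min(1, p(proposal) / p(current)).
        if rng.random() < math.exp(min(0.0, lp_prop - lp)):
            x, lp = prop, lp_prop
    return x
```

Because the chain leaves the target density invariant, the refined samples remain consistent with the learned distribution while covering more of its modes.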
【8】Constructive Approximation under Carleman's Condition, with Applications to Smoothed Analysis
标题:Carleman条件下的构造逼近及其在光滑分析中的应用
链接:https://arxiv.org/abs/2512.04371
作者:Frederic Koehler,Beining Wu
摘要:Carleman的一个经典结果基于拟解析函数理论,表明对于任何使得矩$\int x^k dμ$在$k \to \infty$时增长不太快的$μ$,多项式在$L^2(μ)$中稠密。在这项工作中,我们通过复分析建立了基础Denjoy-Carleman定理的一个相当紧的定量类比,并表明这允许对任意在无穷远处多项式增长的光滑函数的多项式逼近率进行非渐近控制。在许多情况下,这使我们能够为一般分布类(例如多变量亚高斯或亚指数分布)上的函数建立$L^2$逼近论结果,而这些结果此前仅在特殊情况下已知。作为一个应用,我们证明了带限于$[-Ω,Ω]$的Paley-Wiener函数类在所有严格亚指数分布上都具有超指数逼近率,从而给出该类的一个新刻画。作为另一个应用,我们解决了最近由Chandrasekaran、Klivans、Kontonis、Meka和Stavropoulos提出的关于学习的平滑分析的一个开放问题,并对他们的主要结果和应用获得了定量改进。
摘要:A classical result of Carleman, based on the theory of quasianalytic functions, shows that polynomials are dense in $L^2(μ)$ for any $μ$ such that the moments $\int x^k dμ$ do not grow too rapidly as $k \to \infty$. In this work, we develop a fairly tight quantitative analogue of the underlying Denjoy-Carleman theorem via complex analysis, and show that this allows for nonasymptotic control of the rate of approximation by polynomials for any smooth function with polynomial growth at infinity. In many cases, this allows us to establish $L^2$ approximation-theoretic results for functions over general classes of distributions (e.g., multivariate sub-Gaussian or sub-exponential distributions) which were previously known only in special cases. As one application, we show that the Paley--Wiener class of functions bandlimited to $[-Ω,Ω]$ admits superexponential rates of approximation over all strictly sub-exponential distributions, which leads to a new characterization of the class. As another application, we solve an open problem recently posed by Chandrasekaran, Klivans, Kontonis, Meka and Stavropoulos on the smoothed analysis of learning, and also obtain quantitative improvements to their main results and applications.
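For reference, the moment-growth condition of Carleman alluded to in the abstract is usually stated (in the Hamburger case on the real line) as:

```latex
% Carleman's condition: polynomials are dense in L^2(mu) whenever the
% even moments grow slowly enough that the series below diverges.
\sum_{k=1}^{\infty} m_{2k}^{-1/(2k)} = \infty,
\qquad m_k := \int_{\mathbb{R}} x^k \, d\mu(x).
```

Sub-Gaussian and sub-exponential distributions satisfy this condition, which is why the paper's quantitative analogue applies to those classes.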
检测相关(4篇)
【1】Exploiting ftrace's function_graph Tracer Features for Machine Learning: A Case Study on Encryption Detection
标题:利用ftrace的function_graph跟踪器特性进行机器学习:加密检测案例研究
链接:https://arxiv.org/abs/2512.04590
作者:Kenan Begovic,Abdulaziz Al-Ali,Qutaibah Malluhi
备注:Conference paper presented at AICCSA 2025
摘要:本文提出使用Linux内核ftrace框架,特别是函数图跟踪器,为机器学习(ML)应用程序生成信息丰富的系统级数据。一个真实世界的加密检测任务的实验证明了所提出的功能在几个学习算法的有效性。学习者面临的问题是使用函数调用跟踪和基于图形的特征来检测大型文件数据集上的加密活动。实证结果突出了手头任务的99.28的出色准确性,强调了功能图跟踪器的功能的有效性。在针对多标签分类问题的额外实验中进一步验证了结果,其中从跟踪数据中识别出正在运行的程序。这项工作为预处理原始跟踪数据和提取基于图形的特征提供了全面的方法,在将ML应用于系统行为分析、程序识别和异常检测方面取得了重大进展。通过弥合系统跟踪和ML之间的差距,本文为性能监控和安全分析方面的创新解决方案铺平了道路。
摘要:This paper proposes using the Linux kernel ftrace framework, particularly the function graph tracer, to generate informative system level data for machine learning (ML) applications. Experiments on a real world encryption detection task demonstrate the efficacy of the proposed features across several learning algorithms. The learner faces the problem of detecting encryption activities across a large dataset of files, using function call traces and graph based features. Empirical results highlight an outstanding accuracy of 99.28 on the task at hand, underscoring the efficacy of features derived from the function graph tracer. The results were further validated in an additional experiment targeting a multilabel classification problem, in which running programs were identified from trace data. This work provides comprehensive methodologies for preprocessing raw trace data and extracting graph based features, offering significant advancements in applying ML to system behavior analysis, program identification, and anomaly detection. By bridging the gap between system tracing and ML, this paper paves the way for innovative solutions in performance monitoring and security analytics.
【2】Temp-SCONE: A Novel Out-of-Distribution Detection and Domain Generalization Framework for Wild Data with Temporal Shift
标题:Temp-SCONE:一种针对具有时间偏移的野外数据的新型分布外检测与领域泛化框架
链接:https://arxiv.org/abs/2512.04571
作者:Aditi Naiknaware,Sanchit Singh,Hajar Homayouni,Salimeh Sekeh
备注:22 pages, 12 figures, 72 subfigures, 6 tables
摘要:开放世界学习(OWL)需要能够适应不断变化的环境、同时可靠地检测分布外(OOD)输入的模型。现有的方法,如SCONE,实现了对协变量和语义偏移的鲁棒性,但假设环境是静态的,导致在动态域中性能下降。在本文中,我们提出了Temp-SCONE,它是SCONE在时间上一致的扩展,旨在处理动态环境中的时间偏移。Temp-SCONE引入了一个基于平均阈值置信度(ATC)的置信度驱动正则化损失,在保持SCONE的能量裕度分离的同时,惩罚跨时间步预测的不稳定性。动态数据集上的实验表明,Temp-SCONE显著提高了时间漂移下的鲁棒性,与SCONE相比,产生更高的损坏数据准确率和更可靠的OOD检测。在没有时间连续性的不同数据集上,Temp-SCONE保持了相当的性能,突出了时间正则化的重要性和局限性。我们对时间稳定性和泛化误差的理论见解进一步确立了Temp-SCONE是在不断变化的动态环境中实现可靠OWL的一步。
摘要:Open-world learning (OWL) requires models that can adapt to evolving environments while reliably detecting out-of-distribution (OOD) inputs. Existing approaches, such as SCONE, achieve robustness to covariate and semantic shifts but assume static environments, leading to degraded performance in dynamic domains. In this paper, we propose Temp-SCONE, a temporally consistent extension of SCONE designed to handle temporal shifts in dynamic environments. Temp-SCONE introduces a confidence-driven regularization loss based on Average Thresholded Confidence (ATC), penalizing instability in predictions across time steps while preserving SCONE's energy-margin separation. Experiments on dynamic datasets demonstrate that Temp-SCONE significantly improves robustness under temporal drift, yielding higher corrupted-data accuracy and more reliable OOD detection compared to SCONE. On distinct datasets without temporal continuity, Temp-SCONE maintains comparable performance, highlighting the importance and limitations of temporal regularization. Our theoretical insights on temporal stability and generalization error further establish Temp-SCONE as a step toward reliable OWL in evolving dynamic environments.
【3】Sarcasm Detection on Reddit Using Classical Machine Learning and Feature Engineering
标题:使用经典机器学习和特征工程在Reddit上进行讽刺检测
链接:https://arxiv.org/abs/2512.04396
作者:Subrata Karmaker
备注:11 pages, 2 figures, includes full Python code. Classical machine learning baseline for sarcasm detection on the SARC 2.0 dataset
摘要:讽刺在在线讨论中很常见,但机器很难识别,因为其意图往往与字面意思相矛盾。在这项工作中,我只使用经典的机器学习方法和显式特征工程来研究讽刺检测,而不依赖于神经网络或来自父评论的上下文。使用自注释Reddit语料库(SARC 2.0)的100,000条评论子样本,我将词级和字符级的TF-IDF特征与简单的风格指示符相结合。评估了四种模型:逻辑回归、线性SVM、多项朴素贝叶斯和随机森林。朴素贝叶斯和逻辑回归表现最强,讽刺评论的F1得分约为0.57。虽然缺乏会话上下文限制了性能,但结果为使用轻量级且可解释的方法进行讽刺检测提供了一个清晰且可复现的基线。
摘要:Sarcasm is common in online discussions, yet difficult for machines to identify because the intended meaning often contradicts the literal wording. In this work, I study sarcasm detection using only classical machine learning methods and explicit feature engineering, without relying on neural networks or context from parent comments. Using a 100,000-comment subsample of the Self-Annotated Reddit Corpus (SARC 2.0), I combine word-level and character-level TF-IDF features with simple stylistic indicators. Four models are evaluated: logistic regression, a linear SVM, multinomial Naive Bayes, and a random forest. Naive Bayes and logistic regression perform the strongest, achieving F1-scores around 0.57 for sarcastic comments. Although the lack of conversational context limits performance, the results offer a clear and reproducible baseline for sarcasm detection using lightweight and interpretable methods.
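A tiny, dependency-free illustration of the character n-gram plus multinomial Naive Bayes recipe; the study itself uses TF-IDF features and scikit-learn models, whereas this sketch uses raw counts with Laplace smoothing:

```python
import math
from collections import Counter

def ngrams(text, n):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train_nb(docs, labels, n=3):
    """Multinomial Naive Bayes over character n-grams: per-class n-gram
    counts, per-class totals, class priors, and a shared vocabulary."""
    counts, totals, priors, vocab = {}, Counter(), Counter(labels), set()
    for doc, y in zip(docs, labels):
        c = counts.setdefault(y, Counter())
        for g in ngrams(doc.lower(), n):
            c[g] += 1
            totals[y] += 1
            vocab.add(g)
    return counts, totals, priors, vocab, n

def predict_nb(model, doc):
    counts, totals, priors, vocab, n = model
    V, n_docs = len(vocab), sum(priors.values())
    best, best_lp = None, float("-inf")
    for y in priors:
        lp = math.log(priors[y] / n_docs)
        for g in ngrams(doc.lower(), n):
            lp += math.log((counts[y][g] + 1) / (totals[y] + V))  # Laplace
        if lp > best_lp:
            best, best_lp = y, lp
    return best
```

Character n-grams are what let such models pick up on stylistic cues (elongation, punctuation, interjections) that word features miss.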
【4】MechDetect: Detecting Data-Dependent Errors
标题:MechDetect:检测数据相关错误
链接:https://arxiv.org/abs/2512.04138
作者:Philipp Jung,Nicholas Chandler,Sebastian Jäger,Felix Biessmann
备注:International Conference on Data Science and Intelligent Systems (DSIS 2025)
摘要:数据质量监控是现代信息处理系统的核心挑战。虽然已经提出了许多方法来检测数据错误或偏移,但很少有研究调查错误产生的机制。我们认为,知道错误是如何产生的,可以跟踪和修复它们的关键。在本研究中,我们以统计文献中关于缺失值的现有工作为基础,提出了MechDetect,这是一种用于研究错误生成机制的简单算法。给定表格数据集和相应的错误掩码,该算法使用机器学习模型来估计错误是否取决于数据。我们的工作扩展了现有的方法来检测机制的缺失值,并可以很容易地应用到其他错误类型,提供了一个错误掩码。我们在已建立的基准数据集上的实验中证明了MechDetect的有效性。
摘要:Data quality monitoring is a core challenge in modern information processing systems. While many approaches to detect data errors or shifts have been proposed, few studies investigate the mechanisms governing error generation. We argue that knowing how errors were generated can be key to tracing and fixing them. In this study, we build on existing work in the statistics literature on missing values and propose MechDetect, a simple algorithm to investigate error generation mechanisms. Given a tabular data set and a corresponding error mask, the algorithm estimates whether or not the errors depend on the data using machine learning models. Our work extends established approaches to detect mechanisms underlying missing values and can be readily applied to other error types, provided that an error mask is available. We demonstrate the effectiveness of MechDetect in experiments on established benchmark datasets.
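The MechDetect idea (decide whether an error mask is predictable from the data) can be illustrated with a deliberately tiny stand-in classifier; the real method uses full ML models on tabular features, and the threshold-stump and margin here are illustrative:

```python
def mask_depends_on_data(xs, mask, margin=0.1):
    """Sketch of the MechDetect idea on a 1-D feature: fit a trivial
    threshold-stump classifier to predict the error mask from the data;
    if it beats the majority-class baseline by a margin, the errors
    plausibly depend on the data (analogous to MCAR vs. MAR testing)."""
    n = len(xs)
    baseline = max(sum(mask), n - sum(mask)) / n
    best = baseline
    for t in sorted(set(xs)):
        for pos_is_high in (True, False):
            pred = [(x > t) == pos_is_high for x in xs]
            acc = sum(p == bool(m) for p, m in zip(pred, mask)) / n
            best = max(best, acc)
    return best - baseline > margin
```

With real data one would use held-out accuracy (or AUC) of a proper model instead of training accuracy, to avoid the overfitting this toy version is prone to.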
分类|识别(2篇)
【1】HTR-ConvText: Leveraging Convolution and Textual Information for Handwritten Text Recognition
标题:HTR-ConvText:利用卷积和文本信息进行手写文本识别
链接:https://arxiv.org/abs/2512.05021
作者:Pham Thach Thanh Truc,Dang Hoai Nam,Huynh Tong Dang Khoa,Vo Nguyen Le Duy
摘要:由于数据有限、书写风格差异大以及带有复杂变音符号的文字,手写文本识别仍然具有挑战性。现有的方法虽然部分解决了这些问题,但在没有大量合成数据的情况下往往难以泛化。为了应对这些挑战,我们提出了HTR-ConvText,一种旨在捕获细粒度、笔画级局部特征,同时保留全局上下文依赖关系的模型。在特征提取阶段,我们将残差卷积神经网络骨干与带位置编码块的MobileViT集成。这使得模型既能捕获结构模式,又能学习细微的书写细节。然后,我们介绍了ConvText编码器,这是一种在层次结构中结合全局上下文和局部特征的混合架构,通过减少序列长度来提高效率。此外,一个辅助模块注入文本上下文,以减轻连接主义时间分类的弱点。在IAM、READ2016、LAM和HANDS-VNOnDB上的评估表明,与现有方法相比,我们的方法实现了更好的性能和更好的泛化能力,特别是在训练样本有限和手写体多样性高的情况下。
摘要:Handwritten Text Recognition remains challenging due to the limited data, high writing style variance, and scripts with complex diacritics. Existing approaches, though partially address these issues, often struggle to generalize without massive synthetic data. To address these challenges, we propose HTR-ConvText, a model designed to capture fine-grained, stroke-level local features while preserving global contextual dependencies. In the feature extraction stage, we integrate a residual Convolutional Neural Network backbone with a MobileViT with Positional Encoding block. This enables the model to both capture structural patterns and learn subtle writing details. We then introduce the ConvText encoder, a hybrid architecture combining global context and local features within a hierarchical structure that reduces sequence length for improved efficiency. Additionally, an auxiliary module injects textual context to mitigate the weakness of Connectionist Temporal Classification. Evaluations on IAM, READ2016, LAM and HANDS-VNOnDB demonstrate that our approach achieves improved performance and better generalization compared to existing methods, especially in scenarios with limited training samples and high handwriting diversity.
【2】Computational Linguistics Meets Libyan Dialect: A Study on Dialect Identification
标题:计算语言学遇上利比亚方言:方言识别研究
链接:https://arxiv.org/abs/2512.04257
作者:Mansour Essgaer,Khamis Massud,Rabia Al Mamlook,Najah Ghmaid
备注:13 pages, 8 figures
摘要:本研究探讨了逻辑回归、线性支持向量机、多项朴素贝叶斯和伯努利朴素贝叶斯对从Twitter收集的利比亚方言话语进行分类。使用的数据集是QADI语料库,其中包括18种阿拉伯语方言的540,000个句子。预处理的挑战包括处理利比亚方言典型的不一致正字法变体和非标准拼写。卡方分析显示,某些特征(如电子邮件提及和情感指标)与方言分类没有显著关联,因此被排除在进一步分析之外。进行了两个主要实验:(1)使用卡方检验评估从语料库中提取的元特征的重要性,以及(2)使用不同的词和字符n-gram表示评估分类器性能。分类实验表明,多项朴素贝叶斯(MNB)在使用(1,2)词n-gram和(1,5)字符n-gram表示时达到了85.89%的最高准确率和0.85741的F1分数。相比之下,逻辑回归和线性SVM表现出略低的性能,最高准确率分别为84.41%和84.73%。其他评价指标,包括对数损失、Cohen kappa和Matthews相关系数,进一步支持了MNB在这项任务中的有效性。结果表明,精心选择的n-gram表示和分类模型在提高利比亚方言识别的准确率方面起着至关重要的作用。本研究为阿拉伯语方言NLP应用的未来研究提供了经验基准和见解。
摘要:This study investigates logistic regression, linear support vector machine, multinomial Naive Bayes, and Bernoulli Naive Bayes for classifying Libyan dialect utterances gathered from Twitter. The dataset used is the QADI corpus, which consists of 540,000 sentences across 18 Arabic dialects. Preprocessing challenges include handling inconsistent orthographic variations and non-standard spellings typical of the Libyan dialect. The chi-square analysis revealed that certain features, such as email mentions and emotion indicators, were not significantly associated with dialect classification and were thus excluded from further analysis. Two main experiments were conducted: (1) evaluating the significance of meta-features extracted from the corpus using the chi-square test and (2) assessing classifier performance using different word and character n-gram representations. The classification experiments showed that Multinomial Naive Bayes (MNB) achieved the highest accuracy of 85.89% and an F1-score of 0.85741 when using a (1,2) word n-gram and (1,5) character n-gram representation. In contrast, Logistic Regression and Linear SVM exhibited slightly lower performance, with maximum accuracies of 84.41% and 84.73%, respectively. Additional evaluation metrics, including log loss, Cohen kappa, and Matthew correlation coefficient, further supported the effectiveness of MNB in this task. The results indicate that carefully selected n-gram representations and classification models play a crucial role in improving the accuracy of Libyan dialect identification. This study provides empirical benchmarks and insights for future research in Arabic dialect NLP applications.
表征(1篇)
【1】RNNs perform task computations by dynamically warping neural representations
标题:RNN通过动态扭曲神经表示来执行任务计算
链接:https://arxiv.org/abs/2512.04310
作者:Arthur Pellegrino,Angus Chadwick
备注:NeurIPS 2025
摘要:分析神经网络如何在其激活中表示数据特征有助于解释它们如何执行任务。因此,一长串工作集中于在数学上刻画这种"神经表征"的几何。与此同时,机器学习界对理解动力系统如何对时变输入数据执行计算也产生了极大兴趣。然而,通过动力学进行计算与表征几何之间的联系仍然知之甚少。在这里,我们假设循环神经网络(RNN)通过动态扭曲其任务变量的表征来执行计算。为了验证这一假设,我们开发了一个黎曼几何框架,能够从输入流形推导出动力系统的流形拓扑与几何。通过刻画RNN随时间变化的几何结构,我们表明动态扭曲是其计算的基本特征。
摘要:Analysing how neural networks represent data features in their activations can help interpret how they perform tasks. Hence, a long line of work has focused on mathematically characterising the geometry of such "neural representations." In parallel, machine learning has seen a surge of interest in understanding how dynamical systems perform computations on time-varying input data. Yet, the link between computation-through-dynamics and representational geometry remains poorly understood. Here, we hypothesise that recurrent neural networks (RNNs) perform computations by dynamically warping their representations of task variables. To test this hypothesis, we develop a Riemannian geometric framework that enables the derivation of the manifold topology and geometry of a dynamical system from the manifold of its inputs. By characterising the time-varying geometry of RNNs, we show that dynamic warping is a fundamental feature of their computations.
预测|估计(5篇)
【1】STeP-Diff: Spatio-Temporal Physics-Informed Diffusion Models for Mobile Fine-Grained Pollution Forecasting
标题:STeP-Diff:用于移动细粒度污染预测的时空物理信息扩散模型
链接:https://arxiv.org/abs/2512.04385
作者:Nan Zhou,Weijie Hong,Huandong Wang,Jianfeng Zheng,Qiuhua Wang,Yali Song,Xiao-Ping Zhang,Yong Li,Xinlei Chen
摘要:精细化的空气污染预测对于城市管理和健康建筑的发展至关重要。在汽车和公共汽车等移动平台上部署便携式传感器提供了一种低成本、易于维护和覆盖范围广的数据收集解决方案。然而,由于这些非专用移动平台的随机和不可控的运动模式,所得到的传感器数据往往是不完整的和时间上不一致的。通过探索扩散模型逆向过程中潜在的训练模式,我们提出了时空物理信息扩散模型(STeP-Diff)。STeP-Diff利用DeepONet对测量的空间序列进行建模,并使用PDE通知的扩散模型从不完整和随时间变化的数据中预测时空场。通过偏微分方程约束的正则化框架,去噪过程渐近收敛到对流扩散动力学,确保预测既基于现实世界的测量,又与控制污染扩散的基本物理学保持一致。为了评估该系统的性能,我们在两个城市部署了59个自行设计的便携式传感器,运行了14天,以收集空气污染数据。与性能第二好的算法相比,我们的模型在MAE,RMSE和MAPE方面分别提高了89.12%,82.30%和25.00%,广泛的评估表明STeP-Diff有效地捕获了空气污染领域的时空依赖性。
摘要:Fine-grained air pollution forecasting is crucial for urban management and the development of healthy buildings. Deploying portable sensors on mobile platforms such as cars and buses offers a low-cost, easy-to-maintain, and wide-coverage data collection solution. However, due to the random and uncontrollable movement patterns of these non-dedicated mobile platforms, the resulting sensor data are often incomplete and temporally inconsistent. By exploring potential training patterns in the reverse process of diffusion models, we propose Spatio-Temporal Physics-Informed Diffusion Models (STeP-Diff). STeP-Diff leverages DeepONet to model the spatial sequence of measurements along with a PDE-informed diffusion model to forecast the spatio-temporal field from incomplete and time-varying data. Through a PDE-constrained regularization framework, the denoising process asymptotically converges to the convection-diffusion dynamics, ensuring that predictions are both grounded in real-world measurements and aligned with the fundamental physics governing pollution dispersion. To assess the performance of the system, we deployed 59 self-designed portable sensing devices in two cities, operating for 14 days to collect air pollution data. Compared to the second-best performing algorithm, our model achieved improvements of up to 89.12% in MAE, 82.30% in RMSE, and 25.00% in MAPE, with extensive evaluations demonstrating that STeP-Diff effectively captures the spatio-temporal dependencies in air pollution fields.
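The physics STeP-Diff's denoising is regularized toward is the convection-diffusion PDE, $u_t = D\,u_{xx} - v\,u_x$. A single explicit finite-difference step in 1-D (the paper works on 2-D spatio-temporal fields) looks like:

```python
def convection_diffusion_step(u, dx, dt, velocity, diffusivity):
    """One explicit finite-difference step of the 1-D convection-diffusion
    equation u_t = D * u_xx - v * u_x, using central differences in space.
    Boundary values are held fixed for simplicity."""
    n = len(u)
    out = list(u)
    for i in range(1, n - 1):
        diff = diffusivity * (u[i + 1] - 2 * u[i] + u[i - 1]) / dx**2
        conv = velocity * (u[i + 1] - u[i - 1]) / (2 * dx)
        out[i] = u[i] + dt * (diff - conv)
    return out
```

Penalizing the residual between a denoised field and such a PDE update is one standard way to impose the kind of physics constraint the abstract describes.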
【2】Fine-Tuning ChemBERTa for Predicting Inhibitory Activity Against TDP1 Using Deep Learning
标题:微调ChemBERTa以使用深度学习预测对TDP1的抑制活性
链接:https://arxiv.org/abs/2512.04252
作者:Baichuan Zeng
摘要:预测小分子对酪氨酰-DNA磷酸二酯酶1(TDP1)的抑制效力(克服癌症化疗耐药性的关键靶点)仍然是早期药物发现中的关键挑战。我们提出了一个深度学习框架,使用预训练化学语言模型ChemBERTa的微调变体,从分子的简化分子输入行输入系统(SMILES)字符串对pIC50值进行定量回归。利用包含177,092种化合物的大规模共识数据集,我们在分层数据划分和样本加权下系统地评估了两种预训练策略(掩码语言建模(MLM)和掩码令牌回归(MTR)),以解决严重的活性不平衡问题(其中仅2.1%为活性化合物)。我们的方法在回归精度和虚拟筛选效用方面都优于经典基线随机预测器,并且与随机森林相比具有竞争力,在排名靠前的预测中实现了高富集因子(EF@1%为17.4)和精度(Precision@1%为37.4)。由此产生的模型经过严格的消融和超参数研究验证,提供了一个强大的、可随时部署的工具,用于优先选择TDP1抑制剂进行实验测试。通过直接从SMILES实现准确的、无需3D结构的pIC50预测,这项工作展示了化学Transformer在加速靶点特异性药物发现方面的变革潜力。
摘要:Predicting the inhibitory potency of small molecules against Tyrosyl-DNA Phosphodiesterase 1 (TDP1) - a key target in overcoming cancer chemoresistance - remains a critical challenge in early drug discovery. We present a deep learning framework for the quantitative regression of pIC50 values from molecular Simplified Molecular Input Line Entry System (SMILES) strings using fine-tuned variants of ChemBERTa, a pre-trained chemical language model. Leveraging a large-scale consensus dataset of 177,092 compounds, we systematically evaluate two pre-training strategies - Masked Language Modeling (MLM) and Masked Token Regression (MTR) - under stratified data splits and sample weighting to address severe activity imbalance, in which only 2.1% of compounds are active. Our approach outperforms a classical random-predictor baseline in both regression accuracy and virtual screening utility, and performs competitively with a Random Forest, achieving a high enrichment factor (EF@1% = 17.4) and precision (Precision@1% = 37.4) among top-ranked predictions. The resulting model, validated through rigorous ablation and hyperparameter studies, provides a robust, ready-to-deploy tool for prioritizing TDP1 inhibitors for experimental testing. By enabling accurate, 3D-structure-free pIC50 prediction directly from SMILES, this work demonstrates the transformative potential of chemical transformers in accelerating target-specific drug discovery.
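摘要中用于评估虚拟筛选效用的富集因子(EF@1%)和Precision@1%是标准的排名指标。下面是一个纯Python的示意性实现(并非原文代码,数据为虚构的玩具示例):

```python
def enrichment_and_precision(scores, labels, top_frac=0.01):
    """Virtual-screening metrics: rank compounds by predicted score and
    compare the hit rate among the top fraction to the overall hit rate.
    EF@k = Precision@k / base rate of actives."""
    n = len(scores)
    k = max(1, int(n * top_frac))
    order = sorted(range(n), key=lambda i: -scores[i])  # descending by score
    top_hits = sum(labels[i] for i in order[:k])
    precision = top_hits / k
    base_rate = sum(labels) / n
    ef = precision / base_rate if base_rate > 0 else float("inf")
    return ef, precision

# Toy set: 1,000 compounds, 2% active, and a model that ranks all actives first.
scores = [1.0 - i / 1000 for i in range(1000)]
labels = [1 if i < 20 else 0 for i in range(1000)]
ef, prec = enrichment_and_precision(scores, labels)
```

在这个理想化的例子中,排名前1%的10个化合物全部为活性,因此Precision@1%为1.0;活性基线率为2%,故EF@1%为50。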
【3】Recurrent Neural Networks with Linear Structures for Electricity Price Forecasting
标题:用于电价预测的线性结构回归神经网络
链接:https://arxiv.org/abs/2512.04690
作者:Souhir Ben Amor,Florian Ziel
摘要:我们提出了一种新的递归神经网络架构,专门为日前电价预测而设计,旨在改善能源系统的短期决策和运营管理。我们的组合预测模型将专家模型和卡尔曼滤波器等线性结构嵌入到递归网络中,从而实现高效计算和增强的可解释性。该设计利用了线性和非线性模型结构的优势,使其能够捕捉电力市场中所有相关的程式化价格特征,包括日历和自回归效应,以及负荷、可再生能源和相关燃料与碳市场的影响。为进行实证检验,我们使用2018年至2025年欧洲最大电力市场的每小时数据开展了综合预测研究,将我们的模型与最先进的方法进行比较,特别是高维线性模型和神经网络模型。所提出的模型比领先的基准提高了约12%的准确率。我们评估了可解释模型组件的贡献,并就结合线性与非线性结构的影响得出结论。
摘要:We present a novel recurrent neural network architecture designed explicitly for day-ahead electricity price forecasting, aimed at improving short-term decision-making and operational management in energy systems. Our combined forecasting model embeds linear structures, such as expert models and Kalman filters, into recurrent networks, enabling efficient computation and enhanced interpretability. The design leverages the strengths of both linear and non-linear model structures, allowing it to capture all relevant stylised price characteristics in power markets, including calendar and autoregressive effects, as well as influences from load, renewable energy, and related fuel and carbon markets. For empirical testing, we use hourly data from the largest European electricity market spanning 2018 to 2025 in a comprehensive forecasting study, comparing our model against state-of-the-art approaches, particularly high-dimensional linear and neural network models. The proposed model achieves approximately 12% higher accuracy than leading benchmarks. We evaluate the contributions of the interpretable model components and conclude on the impact of combining linear and non-linear structures.
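摘要提到将卡尔曼滤波器等线性结构嵌入递归网络。下面是一个最小的一维(随机游走状态)卡尔曼滤波器示意,仅用于说明这种线性结构本身,并非原文的网络实现;噪声方差等参数均为假设值:

```python
def kalman_1d(observations, q=1e-4, r=0.25, x0=0.0, p0=1.0):
    """Scalar Kalman filter for a random-walk state: x_t = x_{t-1} + w_t,
    y_t = x_t + v_t, with process/observation noise variances q and r.
    Returns the filtered state estimates."""
    x, p = x0, p0
    estimates = []
    for y in observations:
        p = p + q              # predict: uncertainty grows by process noise
        k = p / (p + r)        # Kalman gain
        x = x + k * (y - x)    # update toward the new observation
        p = (1 - k) * p        # posterior variance shrinks
        estimates.append(x)
    return estimates

# Noisy observations of a constant level 10; a diffuse prior (large p0)
# lets the filter lock on quickly.
ys = [10.3, 9.8, 10.1, 9.9, 10.2, 10.0, 9.95, 10.05]
est = kalman_1d(ys, p0=100.0)
```

滤波估计会迅速收敛到真实水平10附近;在论文的设定中,这类线性更新被作为可解释的构件嵌入到递归网络之中。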
【4】Predicting Time-Dependent Flow Over Complex Geometries Using Operator Networks
标题:用算子网络预测复杂几何形状上的时变流
链接:https://arxiv.org/abs/2512.04434
作者:Ali Rabeh,Suresh Murugaiyan,Adarsh Krishnamurthy,Baskar Ganapathysubramanian
摘要:非定常流的快速、可泛化到不同几何形状的替代模型仍然具有挑战性。我们提出了一个时间相关、几何感知的深度算子网络,用于预测参数化和非参数化形状周围中等雷诺数流动的速度场。该模型通过带符号距离场(SDF)主干网络编码几何形状,并通过CNN分支编码流动历史,在841个高保真模拟上进行训练。在保留测试形状上,它达到约5%的相对L2单步误差,并相比CFD实现高达1000倍的加速。我们提供以物理为中心的展开(rollout)诊断,包括探测点处的相位误差和散度范数,以量化长时程保真度。这些诊断揭示了准确的近期瞬态行为,但在细尺度尾流中存在误差积累,在尖角几何形状中最为明显。我们分析了故障模式,并概述了实际的缓解措施。代码、数据划分和脚本在https://github.com/baskargroup/TimeDependent-DeepONet上公开发布,以支持可重复性和基准测试。
摘要:Fast, geometry-generalizing surrogates for unsteady flow remain challenging. We present a time-dependent, geometry-aware Deep Operator Network that predicts velocity fields for moderate-Re flows around parametric and non-parametric shapes. The model encodes geometry via a signed distance field (SDF) trunk and flow history via a CNN branch, trained on 841 high-fidelity simulations. On held-out shapes, it attains $\sim 5\%$ relative L2 single-step error and up to 1000X speedups over CFD. We provide physics-centric rollout diagnostics, including phase error at probes and divergence norms, to quantify long-horizon fidelity. These reveal accurate near-term transients but error accumulation in fine-scale wakes, most pronounced for sharp-cornered geometries. We analyze failure modes and outline practical mitigations. Code, splits, and scripts are openly released at: https://github.com/baskargroup/TimeDependent-DeepONet to support reproducibility and benchmarking.
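该模型通过带符号距离场(SDF)编码几何形状:SDF在物体内部为负、边界上为零、外部为正。下面是两个常见解析SDF的示意实现(与原文代码无关):

```python
import math

def sdf_circle(x, y, cx=0.0, cy=0.0, r=1.0):
    """Signed distance to a circle: negative inside, zero on the boundary,
    positive outside."""
    return math.hypot(x - cx, y - cy) - r

def sdf_box(x, y, hx=1.0, hy=0.5):
    """Signed distance to an axis-aligned box with half-extents (hx, hy)."""
    dx, dy = abs(x) - hx, abs(y) - hy
    outside = math.hypot(max(dx, 0.0), max(dy, 0.0))  # distance when outside
    inside = min(max(dx, dy), 0.0)                    # negative depth when inside
    return outside + inside
```

像这样的场在规则网格上采样后,即可作为几何感知神经算子(如这里的DeepONet主干)的输入表示。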
【5】Enhancing next token prediction based pre-training for jet foundation models
标题:增强喷注基础模型中基于下一个标记预测的预训练
链接:https://arxiv.org/abs/2512.04149
作者:Joschka Birk,Anna Hallin,Gregor Kasieczka,Nikol Madzharova,Ian Pang,David Shih
摘要:下一个标记预测是喷注(jet)基础模型的一个有吸引力的预训练任务,因为它无需模拟,并具备可在数据集之间迁移的出色生成能力。在这里,我们在OmniJet-$α$的初始工作基础上,研究了对下一个标记预测的多项改进。我们没有像以往那样将粒子标记化后仅使用标记ID作为生成和分类任务的模型输入,而是采用了一种混合设置,允许我们使用连续的特征向量作为模型输入,同时仅在下一个标记预测目标中使用标记ID。其次,我们探索了一种将掩蔽粒子建模和生成学习目标相结合的组合预训练策略。总之,这些改动大大提高了下游分类任务的性能,而没有任何生成性能的损失。
摘要:Next token prediction is an attractive pre-training task for jet foundation models, in that it is simulation free and enables excellent generative capabilities that can transfer across datasets. Here we study multiple improvements to next token prediction, building on the initial work of OmniJet-$α$. Instead of tokenizing particles and subsequently only using the token-ID as the model input for both the generative and the classification task, we adopt a hybrid setup, which allows us to use continuous feature vectors as model input while only using token-IDs in the next token prediction target. Secondly, we explore a combined pre-training strategy that combines masked particle modeling and generative learning objectives. Taken together, these changes greatly improve the performance in downstream classification tasks without any loss in generative performance.
其他神经网络|深度学习|模型|建模(23篇)
【1】Gradient Descent with Provably Tuned Learning-rate Schedules
标题:具有可证明调整学习率的梯度下降算法
链接:https://arxiv.org/abs/2512.05084
作者:Dravyansh Sharma
摘要:基于梯度的迭代优化方法是现代机器学习的主力。它们在很大程度上依赖于对学习率和动量等参数的仔细调整。然而,人们通常使用启发式方法来设置它们,而没有正式的近似最优性保证。Gupta和Roughgarden最近的工作研究了如何在梯度下降中学习一个好的步长。然而,与大多数为基于梯度的优化提供理论保证的文献一样,他们的结果依赖于对函数类的强假设(包括凸性和光滑性),而这些假设在典型应用中并不成立。在这项工作中,我们开发了新的分析工具,用于可证明地调整基于梯度算法的超参数,适用于非凸和非光滑函数。我们获得了与先前工作中针对光滑凸函数学习梯度下降步长相匹配的样本复杂度界(相差对数因子),但适用于广泛得多的函数类。我们的分析适用于具有常用激活函数(包括ReLU、sigmoid和tanh)的神经网络上的梯度下降。我们将框架扩展到调整多个超参数,包括调整学习率计划、同时调整动量和步长,以及预训练初始化向量。我们的方法可用于约束最小化验证损失以及梯度下降迭代次数的样本复杂度。
摘要:Gradient-based iterative optimization methods are the workhorse of modern machine learning. They crucially rely on careful tuning of parameters like learning rate and momentum. However, one typically sets them using heuristic approaches without formal near-optimality guarantees. Recent work by Gupta and Roughgarden studies how to learn a good step-size in gradient descent. However, like most of the literature with theoretical guarantees for gradient-based optimization, their results rely on strong assumptions on the function class including convexity and smoothness which do not hold in typical applications. In this work, we develop novel analytical tools for provably tuning hyperparameters in gradient-based algorithms that apply to non-convex and non-smooth functions. We obtain matching sample complexity bounds for learning the step-size in gradient descent shown for smooth, convex functions in prior work (up to logarithmic factors) but for a much broader class of functions. Our analysis applies to gradient descent on neural networks with commonly used activation functions (including ReLU, sigmoid and tanh). We extend our framework to tuning multiple hyperparameters, including tuning the learning rate schedule, simultaneously tuning momentum and step-size, and pre-training the initialization vector. Our approach can be used to bound the sample complexity for minimizing both the validation loss as well as the number of gradient descent iterations.
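摘要所研究的"数据驱动步长调整"问题可以用一个极简草图说明:在一组采样的问题实例上逐一试验候选学习率,选取平均末端损失最小者。这只是对问题设定的示意(候选集合与玩具函数均为虚构),并非论文的算法或其样本复杂度分析:

```python
def gd(f_grad, f, x0, eta, steps=50):
    """Run plain gradient descent and return the final loss."""
    x = x0
    for _ in range(steps):
        x = x - eta * f_grad(x)
    return f(x)

def tune_step_size(problems, candidates, steps=50):
    """Data-driven tuning: evaluate each candidate learning rate on a
    sample of problem instances and return the empirically best one."""
    best_eta, best_loss = None, float("inf")
    for eta in candidates:
        avg = sum(gd(g, f, x0, eta, steps) for f, g, x0 in problems) / len(problems)
        if avg < best_loss:
            best_eta, best_loss = eta, avg
    return best_eta

# Sample of 1-D quadratics f(x) = a*x^2 with varying curvature a.
problems = [(lambda x, a=a: a * x * x, lambda x, a=a: 2 * a * x, 1.0)
            for a in (0.5, 1.0, 2.0)]
eta_star = tune_step_size(problems, [0.9, 0.4, 0.1, 0.01])
```

在这个玩具例子中,过大的步长(0.9)会在曲率最大的实例上发散,过小的步长(0.01)收敛太慢,因此经验最优为0.4。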
【2】David vs. Goliath: Can Small Models Win Big with Agentic AI in Hardware Design?
标题:David vs. Goliath:小模型能在硬件设计中赢得巨大的成功吗?
链接:https://arxiv.org/abs/2512.05073
作者:Shashwat Shankar,Subhranshu Pandey,Innocent Dengkhw Mochahari,Bhabesh Mali,Animesh Basak Chowdhury,Sukanta Bhattacharjee,Chandan Karfa
摘要:大型语言模型(LLM)推理需要大量的计算和能量,使得特定领域的任务昂贵且不可持续。随着基础模型不断扩展,我们要问:对硬件设计而言,越大总是越好吗?我们的工作通过在NVIDIA的综合Verilog设计问题(CVDP)基准上评估与精心设计的代理AI框架相结合的小语言模型来检验这一点。结果表明,代理工作流通过任务分解、迭代反馈和校正,不仅以一小部分成本解锁了接近LLM的性能,还为代理创造了学习机会,为复杂设计任务中高效、自适应的解决方案铺平了道路。
摘要:Large Language Model (LLM) inference demands massive compute and energy, making domain-specific tasks expensive and unsustainable. As foundation models keep scaling, we ask: Is bigger always better for hardware design? Our work tests this by evaluating Small Language Models coupled with a curated agentic AI framework on NVIDIA's Comprehensive Verilog Design Problems (CVDP) benchmark. Results show that agentic workflows, through task decomposition, iterative feedback, and correction, not only unlock near-LLM performance at a fraction of the cost but also create learning opportunities for agents, paving the way for efficient, adaptive solutions in complex design tasks.
【3】Dual-Path Region-Guided Attention Network for Ground Reaction Force and Moment Regression
标题:基于双路径区域引导的地面反作用力和力矩回归注意网络
链接:https://arxiv.org/abs/2512.05030
作者:Xuan Li,Samuel Bello
摘要:三维地面反力和力矩(GRF/GRM)的准确估计对于生物力学研究和临床康复评估至关重要。在这项研究中,我们专注于基于鞋垫的GRF/GRM估计,并进一步验证我们的方法在公共步行数据集。我们提出了一个双路径区域引导注意力网络,它将解剖学启发的空间先验和时间先验集成到区域级注意力机制中,而互补路径则从整个传感器场捕获上下文。这两条路径被联合训练,它们的输出被组合以产生最终的GRF/GRM预测。结论:我们的模型在两个数据集上的表现优于强基线模型,包括CNN和CNN-LSTM架构,在鞋垫数据集上实现了最低的六分量平均NRMSE,为5.78%,在公共数据集上实现了1.42%的垂直地面反作用力。这证明了地面反作用力和力矩估计的鲁棒性能。
摘要:Accurate estimation of three-dimensional ground reaction forces and moments (GRFs/GRMs) is crucial for both biomechanics research and clinical rehabilitation evaluation. In this study, we focus on insole-based GRF/GRM estimation and further validate our approach on a public walking dataset. We propose a Dual-Path Region-Guided Attention Network that integrates anatomy-inspired spatial priors and temporal priors into a region-level attention mechanism, while a complementary path captures context from the full sensor field. The two paths are trained jointly and their outputs are combined to produce the final GRF/GRM predictions. Conclusions: Our model outperforms strong baseline models, including CNN and CNN-LSTM architectures on two datasets, achieving the lowest six-component average NRMSE of 5.78% on the insole dataset and 1.42% for the vertical ground reaction force on the public dataset. This demonstrates robust performance for ground reaction force and moment estimation.
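摘要以NRMSE(归一化均方根误差)百分比报告结果。归一化约定各异;下面假设按真实值的范围归一化(示意实现,未必与原文约定一致,数据为虚构):

```python
import math

def nrmse(y_true, y_pred):
    """RMSE normalized by the range of the ground truth, in percent.
    (Normalization conventions vary; range-normalization is assumed here.)"""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    span = max(y_true) - min(y_true)
    return 100.0 * math.sqrt(mse) / span

# Toy force signal: small errors at the endpoints, exact in the middle.
y_true = [0.0, 5.0, 10.0]
y_pred = [0.5, 5.0, 9.5]
err = nrmse(y_true, y_pred)
```

这种按量程归一化的写法解释了为何取值范围大的分量(如垂直地面反作用力)可以得到很低的NRMSE。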
【4】Aligned but Stereotypical? The Hidden Influence of System Prompts on Social Bias in LVLM-Based Text-to-Image Models
标题:一致但刻板印象?系统预设对基于LVLM的文本到图像模型中社会偏见的隐性影响
链接:https://arxiv.org/abs/2512.04981
作者:NaHyeon Park,Namin An,Kunhee Kim,Soyeon Yoon,Jiahao Huo,Hyunjung Shim
备注:Project page: https://fairpro-t2i.github.io
摘要:基于大视觉语言模型(LVLM)的文本到图像(T2I)系统已经成为图像生成的主导范式,但它们是否放大了社会偏见仍然没有得到充分的理解。在本文中,我们表明,基于LVLM的模型产生的图像比非LVLM模型带有明显更多的社会偏见。我们引入了一个包含1,024个提示、横跨四个语言复杂度层次的基准,并以系统的方式评估多个属性上的人口统计偏见。我们的分析确定系统提示(引导LVLM的预定义指令)是偏见行为的主要驱动因素。通过解码中间表示、标记概率诊断和嵌入关联分析,我们揭示了系统提示如何编码传播到图像合成中的人口统计先验。为此,我们提出了FairPro,一个无需训练的元提示框架,使LVLM能够在测试时进行自我审计并构建公平感知的系统提示。在两个基于LVLM的T2I模型SANA和Qwen-Image上的实验表明,FairPro在保持文本-图像对齐的同时大幅减少了人口统计偏见。我们相信,我们的研究结果为系统提示在偏见传播中的核心作用提供了更深入的见解,并为构建更具社会责任感的T2I系统提供了一种实用、可部署的方法。
摘要:Large vision-language model (LVLM) based text-to-image (T2I) systems have become the dominant paradigm in image generation, yet whether they amplify social biases remains insufficiently understood. In this paper, we show that LVLM-based models produce markedly more socially biased images than non-LVLM-based models. We introduce a 1,024 prompt benchmark spanning four levels of linguistic complexity and evaluate demographic bias across multiple attributes in a systematic manner. Our analysis identifies system prompts, the predefined instructions guiding LVLMs, as a primary driver of biased behavior. Through decoded intermediate representations, token-probability diagnostics, and embedding-association analyses, we reveal how system prompts encode demographic priors that propagate into image synthesis. To this end, we propose FairPro, a training-free meta-prompting framework that enables LVLMs to self-audit and construct fairness-aware system prompts at test time. Experiments on two LVLM-based T2I models, SANA and Qwen-Image, show that FairPro substantially reduces demographic bias while preserving text-image alignment. We believe our findings provide deeper insight into the central role of system prompts in bias propagation and offer a practical, deployable approach for building more socially responsible T2I systems.
【5】A result relating convex n-widths to covering numbers with some applications to neural networks
标题:凸n宽与覆盖数相关的结果及其对神经网络的一些应用
链接:https://arxiv.org/abs/2512.04912
作者:Jonathan Baxter,Peter Bartlett
摘要:一般来说,通过一组固定的基函数或"特征"的线性组合来近似定义在高维输入空间上的函数类是很困难的。通常情况下,最佳基集的最坏情况误差衰减速度仅为$Θ(n^{-1/d})$,其中$n$是基函数的数量,$d$是输入维度。然而,有许多高维模式识别问题(如人脸识别)的例子,其中小特征集的线性组合确实很好地解决了问题。因此,这些函数类不会遭受与更一般的类相关的"维数灾难"。因此,很自然地要寻找这样一类高维函数的刻画:它们尽管维度很高,仍能被少量特征的线性组合很好地逼近。本文给出了一个一般性结果,将函数类的逼近误差与其"凸核"的覆盖数联系起来。对于单隐层神经网络,由单个隐节点计算的函数类的覆盖数给出了凸核覆盖数的上界。因此,使用标准结果,我们得到了神经网络类逼近速度的上界。
摘要:In general, approximating classes of functions defined over high-dimensional input spaces by linear combinations of a fixed set of basis functions or ``features'' is known to be hard. Typically, the worst-case error of the best basis set decays only as fast as $Θ(n^{-1/d})$, where $n$ is the number of basis functions and $d$ is the input dimension. However, there are many examples of high-dimensional pattern recognition problems (such as face recognition) where linear combinations of small sets of features do solve the problem well. Hence these function classes do not suffer from the ``curse of dimensionality'' associated with more general classes. It is natural then, to look for characterizations of high-dimensional function classes that nevertheless are approximated well by linear combinations of small sets of features. In this paper we give a general result relating the error of approximation of a function class to the covering number of its ``convex core''. For one-hidden-layer neural networks, covering numbers of the class of functions computed by a single hidden node upper bound the covering numbers of the convex core. Hence, using standard results we obtain upper bounds on the approximation rate of neural network classes.
【6】Pick-to-Learn for Systems and Control: Data-driven Synthesis with State-of-the-art Safety Guarantees
标题:系统和控制的Pick-to-Learn:具有最先进安全保证的数据驱动合成
链接:https://arxiv.org/abs/2512.04781
作者:Dario Paccagnan,Daniel Marks,Marco C. Campi,Simone Garatti
备注:27 double-column pages, 18 figures
摘要:数据驱动的方法在以复杂性不断增加为特征的现代系统与控制问题中已变得至关重要。在安全关键的环境中,部署这些方法需要严格的保证,这一需求激发了统计学习与控制交叉领域的许多近期工作。然而,许多现有方法实现这一目标的代价是牺牲宝贵的数据用于测试和校准,或限制学习算法的选择,从而导致次优性能。在本文中,我们描述了面向系统与控制的Pick-to-Learn(P2L)框架,它允许任何数据驱动的控制方法配备最先进的安全和性能保证。P2L允许使用所有可用数据来联合综合和验证设计,无需为校准或验证目的留出数据。在提出面向系统与控制的完整版P2L的同时,本文证明了其在一系列核心问题上的有效性,包括最优控制、可达性分析、安全综合和鲁棒控制。在许多这些应用中,P2L提供的设计和证书优于常用方法,并显示出在不同实际环境中广泛适用的强大潜力。
摘要:Data-driven methods have become paramount in modern systems and control problems characterized by growing levels of complexity. In safety-critical environments, deploying these methods requires rigorous guarantees, a need that has motivated much recent work at the interface of statistical learning and control. However, many existing approaches achieve this goal at the cost of sacrificing valuable data for testing and calibration, or by constraining the choice of learning algorithm, thus leading to suboptimal performances. In this paper, we describe Pick-to-Learn (P2L) for Systems and Control, a framework that allows any data-driven control method to be equipped with state-of-the-art safety and performance guarantees. P2L enables the use of all available data to jointly synthesize and certify the design, eliminating the need to set aside data for calibration or validation purposes. In presenting a comprehensive version of P2L for systems and control, this paper demonstrates its effectiveness across a range of core problems, including optimal control, reachability analysis, safe synthesis, and robust control. In many of these applications, P2L delivers designs and certificates that outperform commonly employed methods, and shows strong potential for broad applicability in diverse practical settings.
【7】Complementary Characterization of Agent-Based Models via Computational Mechanics and Diffusion Models
标题:通过计算力学和扩散模型对基于主体的模型进行补充描述
链接:https://arxiv.org/abs/2512.04771
作者:Roberto Garrone
备注:11 pages. Methods paper introducing a dual-domain framework for analyzing ABM dynamics. Companion temporal-analysis preprint: arXiv:2510.12729
摘要:本文通过引入扩散模型作为表征基于代理的模型(ABM)输出的正交和互补工具,扩展了预印本"Characterizing Agent-Based Model Dynamics via $ε$-Machines and Kolmogorov-Style Complexity"。其中,$ε$-机器捕获ABM生成的时间序列的预测时间结构和内在计算,扩散模型则表征高维横截面分布,学习底层数据流形,并能够合成生成合理的群体水平结果。我们提供了一个形式化分析,证明这两种方法作用于不同的数学域(过程与分布),并表明它们的组合产生了一个基于时间组织和分布几何的ABM行为双轴表示。据我们所知,这是第一个将计算力学与基于分数的生成建模相结合、用于ABM输出结构分析的框架,从而将ABM表征置于用于密度估计和内在计算的现代机器学习方法的更广阔图景之中。该框架使用配套论文中介绍的同一老年人-照护者ABM数据集进行验证,我们提供了形式化$ε$-机器与扩散模型之间数学互补性的精确定义和命题。这为联合分析复杂仿真模型的时间可预测性和高维分布结构建立了一种有原则的方法论。
摘要:This article extends the preprint "Characterizing Agent-Based Model Dynamics via $ε$-Machines and Kolmogorov-Style Complexity" by introducing diffusion models as orthogonal and complementary tools for characterizing the output of agent-based models (ABMs). Where $ε$-machines capture the predictive temporal structure and intrinsic computation of ABM-generated time series, diffusion models characterize high-dimensional cross-sectional distributions, learn underlying data manifolds, and enable synthetic generation of plausible population-level outcomes. We provide a formal analysis demonstrating that the two approaches operate on distinct mathematical domains - processes vs. distributions - and show that their combination yields a two-axis representation of ABM behavior based on temporal organization and distributional geometry. To our knowledge, this is the first framework to integrate computational mechanics with score-based generative modeling for the structural analysis of ABM outputs, thereby situating ABM characterization within the broader landscape of modern machine-learning methods for density estimation and intrinsic computation. The framework is validated using the same elder-caregiver ABM dataset introduced in the companion paper, and we provide precise definitions and propositions formalizing the mathematical complementarity between $ε$-machines and diffusion models. This establishes a principled methodology for jointly analyzing temporal predictability and high-dimensional distributional structure in complex simulation models.
【8】Natural Language Actor-Critic: Scalable Off-Policy Learning in Language Space
标题:自然语言演员-评论家:语言空间中可扩展的政策外学习
链接:https://arxiv.org/abs/2512.04601
作者:Joey Hong,Kang Liu,Zhan Ling,Jiecao Chen,Sergey Levine
备注:22 pages, 4 figures
摘要:大型语言模型(LLM)代理,即在长时间范围内与环境动态交互的LLM,已经成为一个越来越重要的研究领域,使涉及工具使用、网页浏览和与人对话的复杂任务实现自动化。在没有专家演示的情况下,训练LLM代理依赖于策略梯度方法,该方法针对(通常是稀疏的)奖励函数优化LLM策略。然而,在具有稀疏奖励的长时程任务中,从轨迹级奖励中学习可能带有噪声,导致训练不稳定且样本复杂度高。此外,策略改进取决于通过探索发现更好的动作,而当动作位于自然语言空间时,这可能很困难。在本文中,我们提出了自然语言演员-评论家(NLAC),一种新的演员-评论家算法,它使用生成自然语言而非标量值的生成式LLM评论家来训练LLM策略。这种方法利用LLM的固有优势提供更丰富、更可操作的训练信号;特别是在具有大型开放式动作空间的任务中,对动作为何次优的自然语言解释对于LLM策略推理如何改进其动作非常有用,而无需依赖随机探索。此外,我们的方法可以在没有策略梯度的情况下进行离策略训练,为现有的在策略方法提供了一种数据效率更高且更稳定的替代方案。我们在推理、网页浏览以及带对话的工具使用等混合任务上给出了结果,表明NLAC有望超越现有的训练方法,并为LLM代理提供一个更具可扩展性和稳定性的训练范式。
摘要:Large language model (LLM) agents -- LLMs that dynamically interact with an environment over long horizons -- have become an increasingly important area of research, enabling automation in complex tasks involving tool-use, web browsing, and dialogue with people. In the absence of expert demonstrations, training LLM agents has relied on policy gradient methods that optimize LLM policies with respect to an (often sparse) reward function. However, in long-horizon tasks with sparse rewards, learning from trajectory-level rewards can be noisy, leading to training that is unstable and has high sample complexity. Furthermore, policy improvement hinges on discovering better actions through exploration, which can be difficult when actions lie in natural language space. In this paper, we propose Natural Language Actor-Critic (NLAC), a novel actor-critic algorithm that trains LLM policies using a generative LLM critic that produces natural language rather than scalar values. This approach leverages the inherent strengths of LLMs to provide a richer and more actionable training signal; particularly, in tasks with large, open-ended action spaces, natural language explanations for why an action is suboptimal can be immensely useful for LLM policies to reason how to improve their actions, without relying on random exploration. Furthermore, our approach can be trained off-policy without policy gradients, offering a more data-efficient and stable alternative to existing on-policy methods. We present results on a mixture of reasoning, web browsing, and tool-use with dialogue tasks, demonstrating that NLAC shows promise in outperforming existing training approaches and offers a more scalable and stable training paradigm for LLM agents.
【9】LeMat-GenBench: A Unified Evaluation Framework for Crystal Generative Models
标题:LeMat-GenBench:晶体生成模型的统一评估框架
链接:https://arxiv.org/abs/2512.04562
作者:Siddharth Betala,Samuel P. Gleason,Ali Ramlaoui,Andy Xu,Georgia Channing,Daniel Levy,Clémentine Fourrier,Nikita Kazeev,Chaitanya K. Joshi,Sékou-Oumar Kaba,Félix Therrien,Alex Hernandez-Garcia,Rocío Mercado,N. M. Anoop Krishnan,Alexandre Duval
备注:46 pages, 17 figures, 16 tables
摘要:生成式机器学习(ML)模型通过无机晶体的逆向设计加速材料发现,从而实现对化学空间的前所未有的探索。然而,缺乏标准化的评估框架使得评估、比较和进一步有意义地开发这些ML模型变得非常困难。在这项工作中,我们介绍了LeMat-GenBench,这是一个用于晶体材料生成模型的统一基准,由一组旨在更好地为模型开发和下游应用提供信息的评估指标支持。我们在Hugging Face上发布了一个开源评估套件和一个公共排行榜,并对12个最近的生成模型进行了基准测试。结果表明,稳定性的增加导致平均新颖性和多样性的减少,没有模型在所有维度上都表现出色。总而言之,LeMat-GenBench为公平的模型比较建立了可重复和可扩展的基础,旨在指导开发更可靠的、以发现为导向的晶体材料生成模型。
摘要:Generative machine learning (ML) models hold great promise for accelerating materials discovery through the inverse design of inorganic crystals, enabling an unprecedented exploration of chemical space. Yet, the lack of standardized evaluation frameworks makes it challenging to evaluate, compare, and further develop these ML models meaningfully. In this work, we introduce LeMat-GenBench, a unified benchmark for generative models of crystalline materials, supported by a set of evaluation metrics designed to better inform model development and downstream applications. We release both an open-source evaluation suite and a public leaderboard on Hugging Face, and benchmark 12 recent generative models. Results reveal that an increase in stability leads to a decrease in novelty and diversity on average, with no model excelling across all dimensions. Altogether, LeMat-GenBench establishes a reproducible and extensible foundation for fair model comparison and aims to guide the development of more reliable, discovery-oriented generative models for crystalline materials.
【10】Feature Engineering vs. Deep Learning for Automated Coin Grading: A Comparative Study on Saint-Gaudens Double Eagles
标题:自动硬币分级的特征工程与深度学习:圣高登斯双鹰的比较研究
链接:https://arxiv.org/abs/2512.04464
作者:Tanmay Dogra,Eric Ngo,Mohammad Alam,Jean-Paul Talavera,Asim Dahal
摘要:我们以自动分级Saint-Gaudens双鹰金币为例,挑战深度学习总是胜过旧技术的普遍信念。在我们的工作中,我们将一个基于192个自定义特征(来自Sobel边缘检测和HSV颜色分析)构建的特征型人工神经网络,与一个融合EfficientNetV2的混合卷积神经网络以及作为对照的简单支持向量机进行对比。在对1,785枚由专家分级的硬币进行测试时,人工神经网络达到了86%的精确匹配率,并在允许3级回旋余地时达到98%。另一方面,CNN和SVM大多只是猜测最常见的等级,分别仅有31%和30%的精确命中率。当然,CNN在更宽容忍度的指标上看起来不错,但那是因为回归中的平均效应掩盖了它在判定具体等级时的彻底失败。总而言之,当你只有不到2,000个样本且类别不平衡时,通过特征设计融入真正的硬币专家知识,胜过那些难以解读的一体化深度学习方案。这一结论同样适用于其他数据稀缺、专业知识比原始算力更重要的利基质量检测任务。
摘要:We challenge the common belief that deep learning always trumps older techniques, using the example of grading Saint-Gaudens Double Eagle gold coins automatically. In our work, we put a feature-based Artificial Neural Network built around 192 custom features pulled from Sobel edge detection and HSV color analysis up against a hybrid Convolutional Neural Network that blends in EfficientNetV2, plus a straightforward Support Vector Machine as the control. Testing 1,785 coins graded by experts, the ANN nailed 86% exact matches and hit 98% when allowing a 3-grade leeway. On the flip side, CNN and SVM mostly just guessed the most common grade, scraping by with 31% and 30% exact hits. Sure, the CNN looked good on broader tolerance metrics, but that is because of some averaging trick in regression that hides how it totally flops at picking out specific grades. All told, when you are stuck with under 2,000 examples and lopsided classes, baking in real coin-expert knowledge through feature design beats out those inscrutable, all-in-one deep learning setups. This rings true for other niche quality checks where data's thin and know-how matters more than raw compute.
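摘要指出,宽容忍度指标会美化一个只会猜测最常见等级的模型。下面的小示例(虚构的等级数据)演示了精确匹配率与"±3级"准确率之间的差距:

```python
from collections import Counter

def accuracy_within(y_true, y_pred, k=0):
    """Fraction of predictions within k grade steps of the truth;
    k=0 is exact match, k=3 is the leeway quoted in the abstract."""
    return sum(abs(t - p) <= k for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy grade labels clustered around MS-62..MS-66.
y_true = [63, 64, 65, 63, 64, 62, 66, 64, 63, 64]
modal = Counter(y_true).most_common(1)[0][0]   # "always guess the mode" baseline
baseline = [modal] * len(y_true)
```

这个"永远猜众数"的基线在±3级容忍度下得到满分,却只有40%的精确匹配率,正对应摘要中CNN/SVM在回归平均下显得不错、实际却无法判定具体等级的现象。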
【11】Learning to Orchestrate Agents in Natural Language with the Conductor
标题:学习与指挥者用自然语言描述代理人
链接:https://arxiv.org/abs/2512.04388
作者:Stefan Nielsen,Edoardo Cetin,Peter Schwendeman,Qi Sun,Jinglue Xu,Yujin Tang
摘要:来自不同提供商的强大大型语言模型(LLM)已经过昂贵的训练和微调,以专精于不同的领域。在这项工作中,我们引入了一种通过强化学习训练的新型指挥者(Conductor)模型,自动发现LLM之间强大的协调策略。我们的指挥者不仅学习设计有针对性的通信拓扑结构,以实现有效的代理间协作,还学习为LLM进行提示工程,给出有针对性的指令,以最大限度地发挥它们各自的能力。我们表明,通过在强大的工作者LLM池上学习最佳协调策略,一个7B的指挥者实现了超越任何单个工作者的显著性能提升,在LiveCodeBench和GPQA等具有挑战性的推理基准上取得了最先进的结果。通过使用随机化的代理池进行训练,我们的指挥者可以有效地适应任意的开源和闭源代理集合,满足任何用户需求。此外,允许指挥者选择自己作为工作者会产生递归拓扑,通过在线迭代自适应,以一种新形式的动态测试时扩展来提升性能。更广泛地说,我们是最早证明语言模型协调可以通过强化学习解锁的工作之一,其中强大的协调策略通过纯粹的端到端奖励最大化在LLM中自然涌现。
摘要:Powerful large language models (LLMs) from different providers have been expensively trained and finetuned to specialize across varying domains. In this work, we introduce a new kind of Conductor model trained with reinforcement learning to automatically discover powerful coordination strategies among LLMs. Our Conductor learns not only to design targeted communication topologies for effective agent-to-agent collaboration, but also to prompt engineer focused instructions to the LLMs to maximally leverage their individual capabilities. We show that, by learning optimal coordination strategies over pools of powerful worker LLMs, a 7B Conductor achieves significant performance gains beyond any individual worker, attaining state-of-the-art results in challenging reasoning benchmarks, such as LiveCodeBench and GPQA. By training with randomized agent pools, our conductor effectively adapts to arbitrary sets of open- and closed-source agents, meeting any user requirements. Furthermore, allowing the Conductor to select itself as a worker gives rise to recursive topologies, elevating performance with a new form of dynamic test-time scaling through online iterative adaptation. More broadly, ours is among the early work demonstrating language model coordination can be unlocked through RL, where powerful coordination strategies emerge naturally in LLMs through pure end-to-end reward maximization.
【12】When do spectral gradient updates help in deep learning?
标题:光谱梯度更新何时对深度学习有帮助?
链接:https://arxiv.org/abs/2512.04299
作者:Damek Davis,Dmitriy Drusvyatskiy
摘要:谱梯度方法,例如最近流行的Muon优化器,是训练深度神经网络和Transformer的标准欧几里得梯度下降的一种有前途的替代方法,但目前还不清楚它们在哪些机制下会表现得更好。我们提出了一个简单的逐层条件,预测谱更新何时比欧几里得梯度步产生更大的损失下降。对于每个参数块,该条件将梯度的核范数与Frobenius范数之比的平方与传入激活的稳定秩进行比较。为了理解何时可以满足这个条件,我们首先证明后激活矩阵在随机特征回归、前馈网络和Transformer块中在高斯初始化时具有低稳定秩。在尖峰随机特征模型中,我们表明,经过短暂的预热期,欧几里得梯度的核-Frobenius比随着数据维度的增长而增长,而激活的稳定秩仍然有界,因此谱更新的预测优势随维度增长。我们在合成回归实验和NanoGPT规模的语言模型训练中验证了这些预测,发现中间激活在整个训练过程中保持低稳定秩,且相应的梯度保持较大的核-Frobenius比。总之,这些结果确定了谱梯度方法(如Muon)在训练深度网络和Transformer时有效的条件。
摘要:Spectral gradient methods, such as the recently popularized Muon optimizer, are a promising alternative to standard Euclidean gradient descent for training deep neural networks and transformers, but it is still unclear in which regimes they are expected to perform better. We propose a simple layerwise condition that predicts when a spectral update yields a larger decrease in the loss than a Euclidean gradient step. This condition compares, for each parameter block, the squared nuclear-to-Frobenius ratio of the gradient to the stable rank of the incoming activations. To understand when this condition may be satisfied, we first prove that post-activation matrices have low stable rank at Gaussian initialization in random feature regression, feedforward networks, and transformer blocks. In spiked random feature models we then show that, after a short burn-in, the Euclidean gradient's nuclear-to-Frobenius ratio grows with the data dimension while the stable rank of the activations remains bounded, so the predicted advantage of spectral updates scales with dimension. We validate these predictions in synthetic regression experiments and in NanoGPT-scale language model training, where we find that intermediate activations have low-stable-rank throughout training and the corresponding gradients maintain large nuclear-to-Frobenius ratios. Together, these results identify conditions for spectral gradient methods, such as Muon, to be effective in training deep networks and transformers.
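摘要中的逐层条件比较梯度的(核范数/Frobenius范数)²与激活的稳定秩。下面用奇异值直接给定(例如来自对角矩阵)的纯Python示意来计算这两个量(数值为虚构示例,并非论文实验):

```python
import math

def spectral_condition(grad_svals, act_svals):
    """Layerwise check from the abstract: a spectral update is predicted to
    beat a Euclidean step when (||G||_* / ||G||_F)^2 > stable_rank(A),
    where stable_rank(A) = ||A||_F^2 / ||A||_2^2.
    Singular values are passed directly (e.g. from diagonal matrices)."""
    nuc = sum(grad_svals)                              # nuclear norm
    fro = math.sqrt(sum(s * s for s in grad_svals))    # Frobenius norm
    ratio_sq = (nuc / fro) ** 2
    a_fro_sq = sum(s * s for s in act_svals)
    stable_rank = a_fro_sq / max(act_svals) ** 2
    return ratio_sq, stable_rank, ratio_sq > stable_rank

# Flat-spectrum gradient (high effective rank) vs. spiky activations.
r2, sr, spectral_wins = spectral_condition([1.0] * 16, [10.0, 1.0, 1.0, 1.0])
```

谱均匀的梯度(16个相等奇异值)给出比值平方16,远大于尖峰激活的稳定秩1.03,此时该条件预测谱更新优于欧几里得梯度步。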
【13】The Initialization Determines Whether In-Context Learning Is Gradient Descent
标题:初始化决定上下文学习是否是梯度下降
链接:https://arxiv.org/abs/2512.04268
作者:Shifeng Xie,Rui Yuan,Simone Rossi,Thomas Hannagan
摘要:大型语言模型(LLM)中的上下文学习(ICL)是一个引人注目的现象,但其潜在机制仍然只有部分被理解。以前的工作将线性自注意(LSA)与梯度下降(GD)联系起来;这种联系主要是在简化条件下建立的,即零均值高斯先验和GD的零初始化。然而,随后的研究通过强调其过于限制性的假设来挑战这种简化的观点,转而证明在多层或非线性注意等条件下,自注意执行类似优化的推理,类似于但不同于GD。我们研究多头LSA在更现实的条件下,特别是在ICL的线性回归公式中引入非零高斯先验均值时,如何近似GD。我们首先通过引入查询的初始估计(称为初始猜测)来扩展多头LSA的嵌入矩阵。我们证明了ICL线性回归设置所需头数的上界。我们的实验证实了这一结果,并进一步观察到一步GD和多头LSA之间仍然存在性能差距。为了解决这个问题,我们引入了yq-LSA,这是单头LSA的一个简单推广,带有可训练的初始猜测yq。我们从理论上建立了yq-LSA的能力,并在线性回归任务上提供了实验验证,从而扩展了连接ICL和GD的理论。最后,受线性回归情形下研究结果的启发,我们考虑为广泛使用的LLM增加初始猜测能力,并表明它们在语义相似性任务上的表现得到了提升。
摘要:In-context learning (ICL) in large language models (LLMs) is a striking phenomenon, yet its underlying mechanisms remain only partially understood. Previous work connects linear self-attention (LSA) to gradient descent (GD); this connection has primarily been established under simplified conditions with zero-mean Gaussian priors and zero initialization for GD. However, subsequent studies have challenged this simplified view by highlighting its overly restrictive assumptions, demonstrating instead that under conditions such as multi-layer or nonlinear attention, self-attention performs optimization-like inference, akin to but distinct from GD. We investigate how multi-head LSA approximates GD under more realistic conditions, specifically when incorporating non-zero Gaussian prior means in linear regression formulations of ICL. We first extend the multi-head LSA embedding matrix by introducing an initial estimation of the query, referred to as the initial guess. We prove an upper bound on the number of heads needed for the ICL linear regression setup. Our experiments confirm this result and further observe that a performance gap between one-step GD and multi-head LSA persists. To address this gap, we introduce yq-LSA, a simple generalization of single-head LSA with a trainable initial guess yq. We theoretically establish the capabilities of yq-LSA and provide experimental validation on linear regression tasks, thereby extending the theory that bridges ICL and GD. Finally, inspired by our findings in the case of linear regression, we consider widespread LLMs augmented with initial guess capabilities, and show that their performance is improved on a semantic similarity task.
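LSA与GD之间的经典联系可以用一个极小的例子说明:对上下文样本的最小二乘损失从初始权重w0走一步梯度下降,得到的预测器恰好具有(线性)注意力的形式,即用残差对键-查询内积加权。以下是该标准构造的纯Python示意(并非论文的yq-LSA实现,数据为虚构):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def one_step_gd_predict(xs, ys, x_q, w0, eta):
    """One GD step on the in-context least-squares loss, started at w0.
    The resulting predictor has the form of linear attention:
    residuals weighted by key-query inner products."""
    # Gradient of 0.5 * sum_i (<w, x_i> - y_i)^2 evaluated at w0:
    grad = [sum((dot(w0, x) - y) * x[j] for x, y in zip(xs, ys))
            for j in range(len(w0))]
    w1 = [w - eta * g for w, g in zip(w0, grad)]
    return dot(w1, x_q)

# Context generated by true weights w* = (2, -1).
xs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
ys = [2.0, -1.0, 1.0]
# With w0 = w*, residuals vanish and one step already predicts exactly.
pred = one_step_gd_predict(xs, ys, (3.0, 1.0), w0=(2.0, -1.0), eta=0.1)
# Zero initialization corresponds to the most commonly analyzed setting.
pred0 = one_step_gd_predict(xs, ys, (3.0, 1.0), w0=(0.0, 0.0), eta=0.1)
```

初始权重(对应论文中的"初始猜测")的选取直接改变一步GD的预测,这正是标题所说"初始化决定上下文学习是否是梯度下降"所针对的自由度。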
【14】Studying Various Activation Functions and Non-IID Data for Machine Learning Model Robustness
标题:研究各种激活函数和非IID数据以实现机器学习模型的鲁棒性
链接:https://arxiv.org/abs/2512.04264
作者:Long Dang,Thushari Hapuarachchi,Kaiqi Xiong,Jing Lin
摘要:对抗性训练是提高机器学习模型鲁棒性的有效方法。大多数现有的研究通常考虑整流线性单元(ReLU)激活函数和集中式训练环境。在本文中,我们通过集中式环境中的对抗训练,使用十种不同的激活函数来研究ML模型的鲁棒性,并探索ML模型在联邦学习环境中的鲁棒性。在集中式环境中,我们首先提出了一种先进的对抗性训练方法,通过结合模型架构变化、软标签、简化的数据增强和不同的学习率来提高ML模型的鲁棒性。然后,除了ReLU之外,我们还对十个著名的激活函数进行了广泛的实验,以更好地了解它们如何影响ML模型的鲁棒性。此外,我们将所提出的对抗训练方法扩展到联邦学习环境,其中考虑了独立同分布(IID)和非IID数据设置。我们提出的集中式对抗训练方法在CIFAR-10上分别实现了77.08%和67.96%的自然和鲁棒准确率,以对抗快速梯度符号攻击。对十个激活函数的实验表明,ReLU通常表现最好。然而,在联邦学习环境中,鲁棒准确性显着下降,尤其是在非IID数据上。为了解决非IID数据情况下的显着性能下降,我们引入了数据共享,并分别实现了70.09%和54.79%的自然和鲁棒准确率,超过了CalFAT算法,当使用40%的数据共享时。也就是说,适当比例的数据共享可以显着提高ML模型的鲁棒性,这对一些现实世界的应用程序很有用。
摘要:Adversarial training is an effective method to improve the machine learning (ML) model robustness. Most existing studies typically consider the Rectified linear unit (ReLU) activation function and centralized training environments. In this paper, we study the ML model robustness using ten different activation functions through adversarial training in centralized environments and explore the ML model robustness in federated learning environments. In the centralized environment, we first propose an advanced adversarial training approach to improve the ML model robustness by incorporating model architecture change, soft labeling, simplified data augmentation, and varying learning rates. Then, we conduct extensive experiments on ten well-known activation functions in addition to ReLU to better understand how they impact the ML model robustness. Furthermore, we extend the proposed adversarial training approach to the federated learning environment, where both independent and identically distributed (IID) and non-IID data settings are considered. Our proposed centralized adversarial training approach achieves natural and robust accuracies of 77.08% and 67.96%, respectively, on CIFAR-10 against fast gradient sign attacks. Experiments on ten activation functions reveal that ReLU usually performs best. In the federated learning environment, however, the robust accuracy decreases significantly, especially on non-IID data. To address the significant performance drop in the non-IID data case, we introduce data sharing and achieve natural and robust accuracies of 70.09% and 54.79%, respectively, surpassing the CalFAT algorithm when 40% data sharing is used. That is, a proper percentage of data sharing can significantly improve the ML model robustness, which is useful to some real-world applications.
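The fast gradient sign attack the paper evaluates against has a one-line closed form; here is a minimal NumPy sketch on logistic regression (illustrative only, not the paper's CIFAR-10 setup; the weights and `eps` are arbitrary):

```python
import numpy as np

def fgsm(x, y, w, b, eps):
    """Fast gradient sign attack on a logistic-regression model:
    perturb x by eps in the sign of the input-gradient of the loss."""
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))   # predicted P(y = 1)
    grad_x = (p - y) * w                     # d(cross-entropy) / dx
    return x + eps * np.sign(grad_x)

rng = np.random.default_rng(1)
w, b = rng.standard_normal(10), 0.0
x, y = rng.standard_normal(10), 1.0

def xent(x):
    """Cross-entropy loss of the model on (x, y)."""
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    return -np.log(p) if y == 1.0 else -np.log(1.0 - p)

x_adv = fgsm(x, y, w, b, eps=0.1)
```

The perturbation is bounded coordinate-wise by `eps` yet strictly increases the loss, which is why adversarial training folds such examples back into the training set.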
【15】Educational Cone Model in Embedding Vector Spaces
标题:嵌入载体空间中的教育锥模型
链接:https://arxiv.org/abs/2512.04227
作者:Yo Ehara
备注:Accepted to the 33rd International Conference on Computers in Education (ICCE 2025)
摘要:具有明确难度等级的人工标注数据集在智能教育系统中至关重要。虽然嵌入向量空间被广泛用于表示语义接近度,并有希望用于分析文本难度,但丰富的嵌入方法在选择最合适的方法方面带来了挑战。本研究提出了教育锥模型,这是一个几何框架的基础上,假设较容易的文本是不太多样化(侧重于基本概念),而较难的文本是更加多样化。无论使用何种嵌入方法,该假设都会导致嵌入空间中的锥形分布。该模型将嵌入的评估框架为优化问题,其目的是检测结构化的基于困难的模式。通过设计特定的损失函数,高效的封闭形式的解决方案,避免昂贵的计算。在真实世界数据集上的实证测试验证了该模型在识别与难以注释的教育文本最佳对齐的嵌入空间方面的有效性和速度。
摘要:Human-annotated datasets with explicit difficulty ratings are essential in intelligent educational systems. Although embedding vector spaces are widely used to represent semantic closeness and are promising for analyzing text difficulty, the abundance of embedding methods creates a challenge in selecting the most suitable method. This study proposes the Educational Cone Model, which is a geometric framework based on the assumption that easier texts are less diverse (focusing on fundamental concepts), whereas harder texts are more diverse. This assumption leads to a cone-shaped distribution in the embedding space regardless of the embedding method used. The model frames the evaluation of embeddings as an optimization problem with the aim of detecting structured difficulty-based patterns. By designing specific loss functions, efficient closed-form solutions are derived that avoid costly computation. Empirical tests on real-world datasets validated the model's effectiveness and speed in identifying the embedding spaces that are best aligned with difficulty-annotated educational texts.
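The cone assumption (easier texts are less spread out in embedding space) can be probed directly; below is a synthetic sketch, where mean pairwise distance serves as the diversity measure (an assumption of this illustration, not necessarily the paper's loss):

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_pairwise_dist(E):
    """Average Euclidean distance over all ordered pairs of embeddings."""
    diffs = E[:, None, :] - E[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))
    n = len(E)
    return dists.sum() / (n * (n - 1))

# synthetic embeddings: spread (diversity) grows with difficulty level,
# producing the cone-shaped distribution the model assumes
difficulty_scales = [0.2, 0.6, 1.0, 1.4]
spreads = [mean_pairwise_dist(s * rng.standard_normal((50, 16)))
           for s in difficulty_scales]
```

An embedding space that aligns well with difficulty annotations should show this monotone growth of spread with difficulty level.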
【16】Network of Theseus (like the ship)
标题:忒修斯网络(就像船一样)
链接:https://arxiv.org/abs/2512.04198
作者:Vighnesh Subramaniam,Colin Conwell,Boris Katz,Andrei Barbu,Brian Cheung
备注:Preprint. 24 pages, 9 figures, 8 tables
摘要:深度学习中的一个标准假设是,神经网络架构引入的归纳偏差必须从训练到推理一直存在。您训练使用的体系结构就是您部署的体系结构。由于优化的困难,这种假设限制了社区选择可能具有期望的效率或设计属性的架构。我们用忒修斯网络(Network of Theseus,NoT)来挑战这一假设,NoT是一种逐步将训练过的,甚至是未经训练的引导网络架构逐部分转换为完全不同的目标网络架构,同时保持引导网络的性能的方法。在每个阶段,引导网络架构中的组件逐渐被目标架构模块替换,并通过代表性相似性指标进行对齐。这个过程在很大程度上保留了指导网络的功能,即使在重大的架构变化,例如,转换卷积网络到多层感知器,或GPT-2到递归神经网络。通过将优化与部署解耦,NoT扩展了可行的推理时间架构的空间,为更好的准确性-效率权衡提供了机会,并实现了对架构设计空间的更直接的探索。
摘要:A standard assumption in deep learning is that the inductive bias introduced by a neural network architecture must persist from training through inference. The architecture you train with is the architecture you deploy. This assumption constrains the community from selecting architectures that may have desirable efficiency or design properties due to difficulties with optimization. We challenge this assumption with Network of Theseus (NoT), a method for progressively converting a trained, or even untrained, guide network architecture part-by-part into an entirely different target network architecture while preserving the performance of the guide network. At each stage, components in the guide network architecture are incrementally replaced with target architecture modules and aligned via representational similarity metrics. This procedure largely preserves the functionality of the guide network even under substantial architectural changes, for example converting a convolutional network into a multilayer perceptron, or GPT-2 into a recurrent neural network. By decoupling optimization from deployment, NoT expands the space of viable inference-time architectures, opening opportunities for better accuracy-efficiency tradeoffs and enabling more directed exploration of the architectural design space.
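The part-by-part replacement idea can be sketched in a toy setting: swap one block of a small "guide" network for a module of a different type, aligned to the guide's intermediate representation (here by least squares, standing in for the paper's representational-similarity alignment), and check the end-to-end function is roughly preserved. Purely illustrative; all shapes and modules are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W1 = rng.standard_normal((d, d)) / np.sqrt(d)
W2 = rng.standard_normal((d, d)) / np.sqrt(d)

def guide(x):                         # small two-block guide network
    return np.tanh(x @ W1) @ W2

X = rng.standard_normal((500, d))
H = np.tanh(X @ W1)                   # guide's intermediate representation

# swap in a "target" module of a different type (a plain linear map),
# fit to reproduce the guide's representation
W_target, *_ = np.linalg.lstsq(X, H, rcond=None)

def hybrid(x):                        # first block replaced, rest kept
    return (x @ W_target) @ W2

drift_aligned = np.abs(guide(X) - hybrid(X)).mean()
W_rand = rng.standard_normal((d, d)) / np.sqrt(d)   # unaligned control
drift_random = np.abs(guide(X) - (X @ W_rand) @ W2).mean()
```

Alignment to the guide's representation is what keeps the hybrid close to the original function; an unaligned replacement of the same shape drifts far more.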
【17】BEP: A Binary Error Propagation Algorithm for Binary Neural Networks Training
标题:BEP:一种用于二进制神经网络训练的二进制错误传播算法
链接:https://arxiv.org/abs/2512.04189
作者:Luca Colombo,Fabrizio Pittorino,Daniele Zambon,Carlo Baldassi,Manuel Roveri,Cesare Alippi
摘要:二进制神经网络(BNN)将权重和激活都限制为二进制值,大大降低了计算复杂性,内存占用和能耗。这些优点使它们特别适合在资源受限的设备上部署。然而,由于其变量的离散性,通过基于梯度的优化训练BNN仍然具有挑战性。主要的方法,量化感知训练,通过使用替代梯度来规避这个问题。然而,这种方法需要维护潜在的全精度参数并使用浮点运算执行向后传递,从而在训练期间丧失了二进制运算的效率。虽然存在基于局部学习规则的替代方法,但它们不适合全局信用分配和多层架构中的反向传播错误。本文介绍了二进制误差传播(BEP),第一个学习算法,建立一个原则性的,离散模拟的反向传播链规则。这种机制使表示为二进制向量的错误信号能够通过神经网络的多层向后传播。BEP完全对二进制变量进行操作,所有向前和向后计算都只使用按位操作。至关重要的是,这使得BEP成为第一个为递归神经网络架构实现端到端二进制训练的解决方案。我们在多层感知器和递归神经网络上验证了BEP的有效性,测试准确率分别提高了+6.89%和+10.57%。所提出的算法作为开源存储库发布。
摘要:Binary Neural Networks (BNNs), which constrain both weights and activations to binary values, offer substantial reductions in computational complexity, memory footprint, and energy consumption. These advantages make them particularly well suited for deployment on resource-constrained devices. However, training BNNs via gradient-based optimization remains challenging due to the discrete nature of their variables. The dominant approach, quantization-aware training, circumvents this issue by employing surrogate gradients. Yet, this method requires maintaining latent full-precision parameters and performing the backward pass with floating-point arithmetic, thereby forfeiting the efficiency of binary operations during training. While alternative approaches based on local learning rules exist, they are unsuitable for global credit assignment and for back-propagating errors in multi-layer architectures. This paper introduces Binary Error Propagation (BEP), the first learning algorithm to establish a principled, discrete analog of the backpropagation chain rule. This mechanism enables error signals, represented as binary vectors, to be propagated backward through multiple layers of a neural network. BEP operates entirely on binary variables, with all forward and backward computations performed using only bitwise operations. Crucially, this makes BEP the first solution to enable end-to-end binary training for recurrent neural network architectures. We validate the effectiveness of BEP on both multi-layer perceptrons and recurrent neural networks, demonstrating gains of up to +6.89% and +10.57% in test accuracy, respectively. The proposed algorithm is released as an open-source repository.
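The efficiency claim rests on the standard identity that a {-1, +1} dot product reduces to XNOR plus popcount; a sketch of that identity (this is the generic BNN arithmetic trick, not the BEP learning rule itself):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
a = rng.choice([-1, 1], size=n).astype(np.int64)   # binary activations
w = rng.choice([-1, 1], size=n).astype(np.int64)   # binary weights

# pack {-1, +1} as bits: +1 -> 1, -1 -> 0
a_bits = (a > 0).astype(np.uint8)
w_bits = (w > 0).astype(np.uint8)

agree = 1 - (a_bits ^ w_bits)            # XNOR: 1 where the signs agree
dot_bitwise = 2 * int(agree.sum()) - n   # dot = 2 * (#agreements) - n
dot_arith = int(a @ w)
```

On real hardware `agree.sum()` becomes a popcount over packed machine words, which is what makes an all-binary forward and backward pass attractive.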
【18】Mitigating the Curse of Detail: Scaling Arguments for Feature Learning and Sample Complexity
标题:缓解细节诅咒:特征学习和样本复杂性的缩放参数
链接:https://arxiv.org/abs/2512.04165
作者:Noa Rubin,Orit Davidovich,Zohar Ringel
摘要:深度学习理论中的两个紧迫主题是特征学习机制的解释和丰富区域中网络的内隐偏差的确定。目前关于丰富特征学习效果的理论围绕着具有一个或两个可训练层的网络或深度线性网络。此外,即使在这样的限制设置下,预测通常以高维非线性方程的形式出现,这需要计算密集型的数值解。考虑到定义深度学习问题的许多细节,这种分析复杂性是一个重大且往往不可避免的挑战。在这里,我们提出了一个强大的启发式路线预测的数据和宽度尺度的各种模式的特征学习出现。这种形式的标度分析比这种精确理论简单得多,并再现了各种已知结果的标度指数。此外,我们对复杂的玩具架构做出了新的预测,例如三层非线性网络和注意力头,从而扩展了深度学习的第一性原理理论的范围。
摘要:Two pressing topics in the theory of deep learning are the interpretation of feature learning mechanisms and the determination of implicit bias of networks in the rich regime. Current theories of rich feature learning effects revolve around networks with one or two trainable layers or deep linear networks. Furthermore, even under such limiting settings, predictions often appear in the form of high-dimensional non-linear equations, which require computationally intensive numerical solutions. Given the many details that go into defining a deep learning problem, this analytical complexity is a significant and often unavoidable challenge. Here, we propose a powerful heuristic route for predicting the data and width scales at which various patterns of feature learning emerge. This form of scale analysis is considerably simpler than such exact theories and reproduces the scaling exponents of various known results. In addition, we make novel predictions on complex toy architectures, such as three-layer non-linear networks and attention heads, thus extending the scope of first-principle theories of deep learning.
【19】Foundations of Diffusion Models in General State Spaces: A Self-Contained Introduction
标题:一般状态空间中扩散模型的基础:自成一体的介绍
链接:https://arxiv.org/abs/2512.05092
作者:Vincent Pauline,Tobias Höppe,Kirill Neklyudov,Alexander Tong,Stefan Bauer,Andrea Dittadi
摘要:虽然扩散模型现在在生成建模中占据中心地位,但介绍性的处理通常假设欧几里得数据,很少澄清它们与离散状态类似物的联系。这篇文章是关于一般状态空间上的扩散的一个独立的入门,在一个镜头下统一连续域和离散/分类结构。我们开发的离散时间视图(通过马尔可夫内核和学习的反向动态噪声)旁边的连续时间限制-随机微分方程(SDEs)在$\mathbb {R}^d $和连续时间马尔可夫链(CTMCs)有限的字母表-并推导出相关的福克-普朗克和主方程。一个常见的变分处理产生ELBO,支持标准的训练损失。我们明确了如何向前腐败的选择-高斯过程在连续空间和结构化的分类过渡内核(均匀,掩蔽/吸收和更多)在离散空间-形状反向动力学和ELBO。该演示文稿分为三个层次:寻求独立直观介绍的新人;希望获得全球理论综合的扩散从业者;以及寻找模拟优先路径进入离散扩散的连续扩散专家。其结果是一个统一的路线图,现代扩散方法跨越连续域和离散序列,突出了一组紧凑的可重用的证明,身份和核心理论原则。
摘要:Although diffusion models now occupy a central place in generative modeling, introductory treatments commonly assume Euclidean data and seldom clarify their connection to discrete-state analogues. This article is a self-contained primer on diffusion over general state spaces, unifying continuous domains and discrete/categorical structures under one lens. We develop the discrete-time view (forward noising via Markov kernels and learned reverse dynamics) alongside its continuous-time limits -- stochastic differential equations (SDEs) in $\mathbb{R}^d$ and continuous-time Markov chains (CTMCs) on finite alphabets -- and derive the associated Fokker--Planck and master equations. A common variational treatment yields the ELBO that underpins standard training losses. We make explicit how forward corruption choices -- Gaussian processes in continuous spaces and structured categorical transition kernels (uniform, masking/absorbing and more) in discrete spaces -- shape reverse dynamics and the ELBO. The presentation is layered for three audiences: newcomers seeking a self-contained intuitive introduction; diffusion practitioners wanting a global theoretical synthesis; and continuous-diffusion experts looking for an analogy-first path into discrete diffusion. The result is a unified roadmap to modern diffusion methodology across continuous domains and discrete sequences, highlighting a compact set of reusable proofs, identities, and core theoretical principles.
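The discrete-time forward noising the primer describes has, in the variance-preserving Gaussian case, the Markov kernel $x_t = \sqrt{1-\beta}\, x_{t-1} + \sqrt{\beta}\, \epsilon$; a Monte Carlo sketch (with an illustrative constant `beta`) checking that unit variance is preserved under repeated application:

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.05                          # per-step noise rate (illustrative)
x = rng.standard_normal(200_000)     # x_0 ~ N(0, 1)

for _ in range(50):                  # apply the forward Markov kernel 50 times
    eps = rng.standard_normal(x.shape)
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * eps

final_mean = x.mean()
final_var = x.var()
```

Because $(1-\beta)\cdot 1 + \beta = 1$, the marginal stays close to a standard Gaussian at every step, which is exactly why this kernel converges to a tractable N(0, 1) prior.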
【20】Model-Free Assessment of Simulator Fidelity via Quantile Curves
标题:通过分位数曲线对模拟器保真度进行无模型评估
链接:https://arxiv.org/abs/2512.05024
作者:Garud Iyengar,Yu-Shiou Willy Lin,Kaizheng Wang
备注:33 pages, 11 figures
摘要:复杂系统的模拟起源于制造和排队应用。它现在被广泛用于研究、教育和消费者调查中的大规模、基于ML的系统。然而,对于日益复杂的基于机器学习的系统来说,表征模拟器和地面实况之间的差异仍然具有挑战性。我们提出了一种计算上易于处理的方法来估计模拟结果分布和地面真实结果分布之间的差异的分位数函数。我们的方法侧重于输出的不确定性,并把模拟器作为一个黑盒子,不施加任何建模假设的内部,因此广泛适用于许多参数的家庭,从伯努利和多项式模型连续,向量值设置。所得到的分位数曲线支持针对不可见场景的置信区间构造、模拟与真实差异的风险感知总结(例如,VaR/CVaR),以及模拟器的性能比较。我们在一个应用程序中展示了我们的方法,该应用程序在跨越四个LLM的WorldValueBench数据集上评估LLM模拟保真度。
摘要:Simulation of complex systems originated in manufacturing and queuing applications. It is now widely used for large-scale, ML-based systems in research, education, and consumer surveys. However, characterizing the discrepancy between simulators and ground truth remains challenging for increasingly complex, machine-learning-based systems. We propose a computationally tractable method to estimate the quantile function of the discrepancy between the simulated and ground-truth outcome distributions. Our approach focuses on output uncertainty and treats the simulator as a black box, imposing no modeling assumptions on its internals, and hence applies broadly across many parameter families, from Bernoulli and multinomial models to continuous, vector-valued settings. The resulting quantile curve supports confidence interval construction for unseen scenarios, risk-aware summaries of sim-to-real discrepancy (e.g., VaR/CVaR), and comparison of simulators' performance. We demonstrate our methodology in an application assessing LLM simulation fidelity on the WorldValueBench dataset spanning four LLMs.
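A rough sketch of the black-box recipe (a toy Bernoulli instance of my own construction, not the paper's estimator): simulate many scenarios, measure a per-scenario discrepancy between simulated and ground-truth outcome distributions from samples alone, and read off empirical quantiles:

```python
import numpy as np

rng = np.random.default_rng(0)
n_scenarios, n_draws = 300, 2000

# ground-truth vs. slightly miscalibrated simulator success probabilities
p_true = rng.uniform(0.2, 0.8, n_scenarios)
p_sim = np.clip(p_true + rng.normal(0.0, 0.05, n_scenarios), 0.0, 1.0)

# black-box access: estimate each outcome distribution from samples only
p_true_hat = rng.binomial(n_draws, p_true) / n_draws
p_sim_hat = rng.binomial(n_draws, p_sim) / n_draws
tv = np.abs(p_sim_hat - p_true_hat)      # total-variation distance (Bernoulli)

quantile_curve = np.quantile(tv, [0.5, 0.9, 0.95])
```

The upper quantiles of this curve play the role of risk-aware summaries (VaR-style) of the sim-to-real gap, without any model of the simulator's internals.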
【21】Towards a unified framework for guided diffusion models
标题:建立引导扩散模型的统一框架
链接:https://arxiv.org/abs/2512.04985
作者:Yuchen Jiao,Yuxin Chen,Gen Li
摘要:使用扩散模型的引导或控制数据生成\blfootnote{这项工作的部分初步结果出现在2025年国际机器学习会议上\citep{li 2025 provable}。}已经成为现代生成建模的基石。尽管扩散模型理论取得了重大进展,但对引导扩散采样器的理论理解仍然非常有限。我们通过开发一个统一的算法和理论框架,同时容纳扩散指导和奖励引导的扩散取得进展。为了微调扩散模型,以提高某些奖励,我们建议注入一个奖励指导项-从原始和奖励重新加权分数之间的差异构建-到反向扩散过程中,并严格量化由此产生的奖励改进。作为一个关键的应用程序,我们的框架表明,无分类器指导(CFG)降低了分类器概率的预期倒数,提供了第一个理论表征的特定性能指标,CFG提高一般目标分布。当应用于奖励引导扩散,我们的框架产生了一个新的采样器,易于训练,在训练过程中不需要完整的扩散轨迹。数值实验进一步证实了我们的理论研究结果。
摘要:Guided or controlled data generation with diffusion models\blfootnote{Partial preliminary results of this work appeared in International Conference on Machine Learning 2025 \citep{li2025provable}.} has become a cornerstone of modern generative modeling. Despite substantial advances in diffusion model theory, the theoretical understanding of guided diffusion samplers remains severely limited. We make progress by developing a unified algorithmic and theoretical framework that accommodates both diffusion guidance and reward-guided diffusion. Aimed at fine-tuning diffusion models to improve certain rewards, we propose injecting a reward guidance term -- constructed from the difference between the original and reward-reweighted scores -- into the backward diffusion process, and rigorously quantify the resulting reward improvement over the unguided counterpart. As a key application, our framework shows that classifier-free guidance (CFG) decreases the expected reciprocal of the classifier probability, providing the first theoretical characterization of the specific performance metric that CFG improves for general target distributions. When applied to reward-guided diffusion, our framework yields a new sampler that is easy-to-train and requires no full diffusion trajectories during training. Numerical experiments further corroborate our theoretical findings.
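Classifier-free guidance combines scores as $s_w = s_u + w\,(s_c - s_u)$; for two equal-variance Gaussians this is again a Gaussian score with the mean extrapolated past the conditional one, a closed form worth checking directly (a textbook 1-D illustration, not the paper's analysis):

```python
import numpy as np

def gauss_score(x, mu, sigma2):
    """Score (gradient of log density) of N(mu, sigma2)."""
    return -(x - mu) / sigma2

mu_u, mu_c, sigma2, w = 0.0, 2.0, 1.0, 3.0   # illustrative parameters
xs = np.linspace(-5.0, 5.0, 101)

# CFG-combined score: unconditional plus w times the conditional gap
s_cfg = gauss_score(xs, mu_u, sigma2) + w * (
    gauss_score(xs, mu_c, sigma2) - gauss_score(xs, mu_u, sigma2))

# identical to the score of a Gaussian with an interpolated/extrapolated mean
mu_w = mu_u + w * (mu_c - mu_u)
s_ref = gauss_score(xs, mu_w, sigma2)
```

With `w > 1` the guided mean overshoots the conditional mean, a simple picture of how CFG sharpens samples toward the condition.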
【22】Learning Causality for Longitudinal Data
标题:纵向数据的学习因果关系
链接:https://arxiv.org/abs/2512.04980
作者:Mouad EL Bouchattaoui
备注:PhD thesis manuscript
摘要:本论文研究高维时变数据中的因果推理与因果表征学习(CRL)方法。 第一个贡献提出了因果动态变分自动编码器(CDVAE),该模型通过捕获由仅影响结果的潜在风险因素驱动的治疗反应中未观察到的异质性来估计个体治疗效应(ITE)。CDVAE在有效潜在调整和ITE误差的泛化界方面具有理论保证。在合成数据集和真实数据集上的实验表明,CDVAE优于基线方法,并且当使用其潜在替代变量进行增强时,最先进的模型会大幅提升,在不访问真实调整变量的情况下接近oracle性能。 第二个贡献提出了一个基于RNN的高效长期反事实回归框架,RNN通过对比预测编码(CPC)和InfoMax增强。它在时变混杂下捕获长程依赖,同时避免了Transformer的计算成本,取得了最先进的结果,并将CPC引入因果推理。 第三个贡献通过解决潜在原因如何在观测变量中显现来推进CRL。我们引入了一个基于解码器雅可比矩阵几何的模型无关可解释性层。稀疏的自表达先验会诱导出与共享潜在影响对齐的、模块化且可能重叠的观测特征组。我们在不相交和重叠两种设置下提供恢复保证,并表明无需锚特征或单亲假设即可恢复有意义的潜在-观测结构。我们还开发了可扩展的基于雅可比矩阵的正则化技术。
摘要:This thesis develops methods for causal inference and causal representation learning (CRL) in high-dimensional, time-varying data. The first contribution introduces the Causal Dynamic Variational Autoencoder (CDVAE), a model for estimating Individual Treatment Effects (ITEs) by capturing unobserved heterogeneity in treatment response driven by latent risk factors that affect only outcomes. CDVAE comes with theoretical guarantees on valid latent adjustment and generalization bounds for ITE error. Experiments on synthetic and real datasets show that CDVAE outperforms baselines, and that state-of-the-art models greatly improve when augmented with its latent substitutes, approaching oracle performance without access to true adjustment variables. The second contribution proposes an efficient framework for long-term counterfactual regression based on RNNs enhanced with Contrastive Predictive Coding (CPC) and InfoMax. It captures long-range dependencies under time-varying confounding while avoiding the computational cost of transformers, achieving state-of-the-art results and introducing CPC into causal inference. The third contribution advances CRL by addressing how latent causes manifest in observed variables. We introduce a model-agnostic interpretability layer based on the geometry of the decoder Jacobian. A sparse self-expression prior induces modular, possibly overlapping groups of observed features aligned with shared latent influences. We provide recovery guarantees in both disjoint and overlapping settings and show that meaningful latent-to-observed structure can be recovered without anchor features or single-parent assumptions. Scalable Jacobian-based regularization techniques are also developed.
【23】Series of quasi-uniform scatterings with fast search, root systems and neural network classifications
标题:具有快速搜索、根系和神经网络分类的一系列准均匀散布
链接:https://arxiv.org/abs/2512.04865
作者:Igor V. Netay
摘要:在本文中,我们描述了一种在给定维度的预定义空间中构建大型可扩展向量集合的方法。这些集合对于神经网络潜在空间的配置和训练是有用的。对于类别数量很大或未知的分类问题,这允许在没有分类层的情况下构造分类器,并在无需从头重新训练网络的情况下扩展类别数量。该构造允许在最小可能维度的空间中创建大规模、间隔良好的向量集合。如果类别数量已知或大致可预测,则可以选择足够大的向量集合规模。如果需要显著扩展类别数量,可以在相同的潜在空间中扩展集合,或将集合并入具有相同向量间距的更高维度集合中。此外,所构造向量集合的规则对称结构可以显著简化在潜在空间中搜索最近聚类中心或嵌入的问题。向量集合的构造基于半单李群最高权重不可约表示的组合学与几何学。
摘要:In this paper we describe an approach to constructing large extendable collections of vectors in predefined spaces of given dimensions. These collections are useful for neural network latent space configuration and training. For classification problems with a large or unknown number of classes, this allows constructing classifiers without a classification layer and extending the number of classes without retraining the network from the very beginning. The construction allows creating large well-spaced vector collections in spaces of minimal possible dimension. If the number of classes is known or approximately predictable, one can choose a sufficiently large vector collection size. If one needs to significantly extend the number of classes, one can extend the collection in the same latent space, or incorporate the collection into a collection of higher dimension with the same spacing between vectors. Also, the regular symmetric structure of the constructed vector collections can significantly simplify the problems of searching for the nearest cluster centers or embeddings in the latent space. The construction of the vector collections is based on the combinatorics and geometry of irreducible highest-weight representations of semi-simple Lie groups.
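The simplest instance of a maximally spaced collection is the regular simplex: n+1 unit vectors in an n-dimensional space with all pairwise inner products equal to -1/n, which is the A_n root-system picture the abstract alludes to. A sketch of the standard construction (centering and normalizing the standard basis; this is the textbook recipe, not necessarily the paper's):

```python
import numpy as np

def simplex_vectors(n):
    """n+1 unit vectors spanning an n-dim subspace, pairwise dot -1/n."""
    E = np.eye(n + 1)
    C = E - E.mean(axis=0)                # center the standard basis
    C /= np.linalg.norm(C, axis=1, keepdims=True)
    return C

V = simplex_vectors(10)                   # 11 equally spaced class anchors
G = V @ V.T                               # Gram matrix of the collection
```

Such anchors can serve directly as fixed class targets in a latent space, with no trainable classification layer on top.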
其他(26篇)
【1】The Universal Weight Subspace Hypothesis
标题:泛权子空间假设
链接:https://arxiv.org/abs/2512.05117
作者:Prakhar Kaushik,Shravan Chaudhari,Ankit Vaidya,Rama Chellappa,Alan Yuille
备注:37 pages
摘要:我们发现,在不同任务中训练的深度神经网络表现出非常相似的低维参数子空间。我们提供了第一个大规模的经验证据,证明神经网络系统地收敛到共享谱子空间,无论初始化,任务或域。通过对1100多个模型(包括500个Mistral-7 B LoRA,500个Vision Transformers和50个LLaMA-8B模型)的模式频谱分析,我们确定了在几个主要方向上捕获大多数方差的通用子空间。通过将谱分解技术应用于在各种任务和数据集上训练的各种架构的权重矩阵,我们在不同任务和数据集的共享架构中识别出被一致利用的稀疏联合子空间。我们的发现为深度网络中信息的内在组织提供了新的见解,并提出了关于在不需要大量数据和计算资源的情况下发现这些通用子空间的可能性的重要问题。此外,这种固有结构对模型可重用性、多任务学习、模型合并以及训练和推理高效算法的开发具有重要意义,可能会减少大规模神经模型的碳足迹。
摘要:We show that deep neural networks trained across diverse tasks exhibit remarkably similar low-dimensional parametric subspaces. We provide the first large-scale empirical evidence that demonstrates that neural networks systematically converge to shared spectral subspaces regardless of initialization, task, or domain. Through mode-wise spectral analysis of over 1100 models - including 500 Mistral-7B LoRAs, 500 Vision Transformers, and 50 LLaMA-8B models - we identify universal subspaces capturing majority variance in just a few principal directions. By applying spectral decomposition techniques to the weight matrices of various architectures trained on a wide range of tasks and datasets, we identify sparse, joint subspaces that are consistently exploited, within shared architectures across diverse tasks and datasets. Our findings offer new insights into the intrinsic organization of information within deep networks and raise important questions about the possibility of discovering these universal subspaces without the need for extensive data and computational resources. Furthermore, this inherent structure has significant implications for model reusability, multi-task learning, model merging, and the development of training and inference-efficient algorithms, potentially reducing the carbon footprint of large-scale neural models.
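The kind of analysis described, stacking weight vectors and asking how much variance a few principal directions capture, can be sketched on synthetic "models" that share a low-rank subspace (sizes and noise level are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n_models = 64, 4, 30

# a shared k-dimensional subspace, common to all "models"
U = np.linalg.qr(rng.standard_normal((d, k)))[0]
# each model = shared-subspace component + small idiosyncratic noise
W = np.stack([U @ rng.standard_normal(k) + 0.05 * rng.standard_normal(d)
              for _ in range(n_models)])

_, s, _ = np.linalg.svd(W - W.mean(axis=0), full_matrices=False)
explained = (s[:k] ** 2).sum() / (s ** 2).sum()   # variance in top-k directions
```

When models really do concentrate on a shared subspace, a handful of singular directions captures almost all of the variance, which is the signature the paper reports at scale.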
【2】Value Gradient Guidance for Flow Matching Alignment
标题:流量匹配对齐的价值梯度指导
链接:https://arxiv.org/abs/2512.05116
作者:Zhen Liu,Tim Z. Xiao,Carles Domingo-Enrich,Weiyang Liu,Dinghuai Zhang
备注:Accepted at NeurIPS 2025; 26 pages, 20 figures
摘要:虽然存在用于将流匹配模型(一种流行且有效的生成模型)与人类偏好对齐的方法,但现有方法无法实现自适应效率和概率上合理的先验保存。在这项工作中,我们利用最优控制理论,提出了VGG-Flow,这是一种基于梯度匹配的方法,用于微调预训练的流匹配模型。该算法的核心思想是,微调后的速度场和预训练的速度场之间的最佳差异应该与值函数的梯度场相匹配。该方法不仅结合了来自奖励模型的一阶信息,而且还受益于值函数的启发式初始化,以实现快速自适应。从经验上讲,我们展示了一个流行的文本到图像的流匹配模型,稳定扩散3,我们的方法可以在有限的计算预算下微调流匹配模型,同时实现有效的和先验保持对齐。
摘要:While methods exist for aligning flow matching models--a popular and effective class of generative models--with human preferences, existing approaches fail to achieve both adaptation efficiency and probabilistically sound prior preservation. In this work, we leverage the theory of optimal control and propose VGG-Flow, a gradient-matching-based method for finetuning pretrained flow matching models. The key idea behind this algorithm is that the optimal difference between the finetuned velocity field and the pretrained one should be matched with the gradient field of a value function. This method not only incorporates first-order information from the reward model but also benefits from heuristic initialization of the value function to enable fast adaptation. Empirically, we show on a popular text-to-image flow matching model, Stable Diffusion 3, that our method can finetune flow matching models under limited computational budgets while achieving effective and prior-preserving alignment.
【3】The Geometry of Intelligence: Deterministic Functional Topology as a Foundation for Real-World Perception
标题:智力的几何学:确定性功能布局作为现实世界感知的基础
链接:https://arxiv.org/abs/2512.05089
作者:Eduardo Di Santi
备注:35 pages, 6 figures. This preprint develops a deterministic functional-topological framework showing that physical systems generate compact perceptual manifolds with finite radius. We provide theory, Monte-Carlo estimators, and validation across PM, battery, and ECG domains, unifying biological perception and self-supervised AI
摘要:真实世界的物理过程不会产生任意的可变性:它们的信号集中在函数空间的紧凑和低可变性的子集上。这种几何结构使得能够从生物和人工系统中的几个例子快速概括。 这项工作开发了一个确定性的功能拓扑框架,其中一个物理现象的有效实现的集合形成了一个紧凑的感知流形与稳定的不变量和有限的Hausdorff半径。我们表明,这个流形的边界可以发现在一个完全自我监督的方式,通过蒙特卡罗采样,即使当系统的控制方程是未知的。 我们提供了理论保证,知识边界的实际估计,并在三个领域的经验验证:机电铁路点机,电化学电池放电曲线,和生理ECG信号。 我们的研究结果表明,确定性功能拓扑为感知、表征和世界模型构建提供了统一的数学基础,这解释了为什么生物学习者和自我监督的AI模型可以从有限的观察中概括。
摘要:Real-world physical processes do not generate arbitrary variability: their signals concentrate on compact and low-variability subsets of functional space. This geometric structure enables rapid generalization from a few examples in both biological and artificial systems. This work develops a deterministic functional-topological framework in which the set of valid realizations of a physical phenomenon forms a compact perceptual manifold with stable invariants and a finite Hausdorff radius. We show that the boundaries of this manifold can be discovered in a fully self-supervised manner through Monte Carlo sampling, even when the governing equations of the system are unknown. We provide theoretical guarantees, practical estimators of knowledge boundaries, and empirical validations across three domains: electromechanical railway point machines, electrochemical battery discharge curves, and physiological ECG signals. Our results demonstrate that deterministic functional topology offers a unified mathematical foundation for perception, representation, and world-model construction, explaining why biological learners and self-supervised AI models can generalize from limited observations.
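The Monte Carlo boundary-discovery idea can be sketched on a toy signal family (the sine-wave "physics", parameter ranges, and the centroid-radius summary are all assumptions of this illustration, not the paper's estimator):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 128)

def realization(rng):
    """A valid signal: bounded amplitude and phase jitter (assumed physics)."""
    a = rng.uniform(0.8, 1.2)
    phi = rng.uniform(-0.2, 0.2)
    return a * np.sin(2.0 * np.pi * (t + phi))

S = np.stack([realization(rng) for _ in range(500)])   # Monte Carlo sampling
center = S.mean(axis=0)
radius = np.linalg.norm(S - center, axis=1).max()      # empirical radius

new_valid = realization(rng)                    # another in-manifold signal
anomaly = 3.0 * np.sin(2.0 * np.pi * t) + 1.0   # outside the learned boundary
```

Signals from the same process fall inside the empirically estimated radius, while a physically invalid signal lands well outside it, without any governing equations being known.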
【4】SuperActivators: Only the Tail of the Distribution Contains Reliable Concept Signals
标题:SuperActivators:只有分布的尾部包含可靠的概念信号
链接:https://arxiv.org/abs/2512.05038
作者:Cassandra Goldberg,Chaehyeon Kim,Adam Stein,Eric Wong
摘要:概念向量旨在通过将内部表示与人类可理解的语义联系起来来增强模型的可解释性,但其实用性往往受到噪声和不一致激活的限制。在这项工作中,我们发现了噪音中的一个清晰的模式,我们称之为超级激活机制:虽然概念内和概念外的激活有相当大的重叠,但概念内分布的极高尾部的令牌激活提供了概念存在的可靠信号。我们通过展示SuperActivator令牌始终优于标准的基于向量的提示概念检测方法来证明这种机制的通用性,在图像和文本模态,模型架构,模型层和概念提取技术中实现了高达14%的F1分数。最后,我们利用SuperActivator令牌来改进概念的特征属性。
摘要:Concept vectors aim to enhance model interpretability by linking internal representations with human-understandable semantics, but their utility is often limited by noisy and inconsistent activations. In this work, we uncover a clear pattern within the noise, which we term the SuperActivator Mechanism: while in-concept and out-of-concept activations overlap considerably, the token activations in the extreme high tail of the in-concept distribution provide a reliable signal of concept presence. We demonstrate the generality of this mechanism by showing that SuperActivator tokens consistently outperform standard vector-based and prompting concept detection approaches, achieving up to a 14% higher F1 score across image and text modalities, model architectures, model layers, and concept extraction techniques. Finally, we leverage SuperActivator tokens to improve feature attributions for concepts.
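A sketch of the tail-thresholding idea on synthetic activations (the Gaussian distributions, boost size, and quantile are my assumptions): per-token activations overlap heavily across classes, yet a token exceeding an extreme in-concept quantile is a reliable presence signal:

```python
import numpy as np

rng = np.random.default_rng(0)
n_seq, n_tok = 2000, 64

# heavily overlapping per-token activations; each in-concept sequence
# carries one extreme "SuperActivator" token
out_act = rng.normal(0.0, 1.0, (n_seq, n_tok))
in_act = rng.normal(0.3, 1.0, (n_seq, n_tok))
in_act[np.arange(n_seq), rng.integers(0, n_tok, n_seq)] += 6.0

tau = np.quantile(in_act, 0.98)       # extreme-tail threshold
pred_in = in_act.max(axis=1) > tau    # concept present iff any token beats tau
pred_out = out_act.max(axis=1) > tau
```

Despite the bulk of the two distributions being nearly indistinguishable, the tail test separates in-concept from out-of-concept sequences cleanly.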
【5】Evolutionary Architecture Search through Grammar-Based Sequence Alignment
标题:通过基于文法的序列比对进行进化架构搜索
链接:https://arxiv.org/abs/2512.04992
作者:Adri Gómez Martín,Felix Möller,Steven McDonagh,Monica Abella,Manuel Desco,Elliot J. Crowley,Aaron Klein,Linus Ericsson
摘要:表达搜索空间中的神经架构搜索(NAS)是一个计算困难的问题,但它也有可能自动发现完全新颖和高性能的架构。为了实现这一点,我们需要有效的搜索算法,可以识别功能强大的组件,并在新的候选架构中重用它们。在本文中,我们介绍了两个适应的变体的史密斯-沃特曼算法的局部序列比对,并使用它们来计算编辑距离在基于语法的进化架构搜索。这些算法使我们能够有效地计算神经架构的距离度量,并从两个父模型生成一组混合后代。这有助于部署基于交叉的搜索策略,使我们能够对架构损失情况进行全面分析,并在搜索过程中跟踪种群多样性。我们强调了我们的方法如何大大提高了计算复杂性比以前的工作,使我们能够有效地计算架构之间的最短路径。当在进化搜索中实例化交叉时,我们获得了有竞争力的结果,优于竞争方法。未来的工作可以建立在这个新工具的基础上,发现可以在神经架构设计中更广泛使用的新组件,并将其应用扩展到NAS之外。
摘要:Neural architecture search (NAS) in expressive search spaces is a computationally hard problem, but it also holds the potential to automatically discover completely novel and performant architectures. To achieve this we need effective search algorithms that can identify powerful components and reuse them in new candidate architectures. In this paper, we introduce two adapted variants of the Smith-Waterman algorithm for local sequence alignment and use them to compute the edit distance in a grammar-based evolutionary architecture search. These algorithms enable us to efficiently calculate a distance metric for neural architectures and to generate a set of hybrid offspring from two parent models. This facilitates the deployment of crossover-based search heuristics, allows us to perform a thorough analysis on the architectural loss landscape, and track population diversity during search. We highlight how our method vastly improves computational complexity over previous work and enables us to efficiently compute shortest paths between architectures. When instantiating the crossover in evolutionary searches, we achieve competitive results, outperforming competing methods. Future work can build upon this new tool, discovering novel components that can be used more broadly across neural architecture design, and broadening its applications beyond NAS.
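The algorithm the paper adapts is classical; below is a minimal textbook Smith-Waterman local-alignment scorer over architecture token sequences (the match/mismatch/gap scores and the toy architectures are illustrative, not the paper's adapted variants):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local-alignment score between two token sequences."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # match / mismatch
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
            best = max(best, H[i][j])
    return best

arch_a = ["conv", "bn", "relu", "conv", "bn", "relu", "pool"]
arch_b = ["conv", "relu", "conv", "bn", "relu", "fc"]
score = smith_waterman(arch_a, arch_b)
```

High local-alignment scores mark shared functional blocks between two parents, exactly the components a crossover operator would want to preserve in offspring.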
【6】Towards Continuous-Time Approximations for Stochastic Gradient Descent without Replacement
标题:无替换随机梯度下降的连续时间逼近
链接:https://arxiv.org/abs/2512.04703
作者:Stefan Perko
摘要:使用历元(epoch)的梯度优化算法,即基于无替换随机梯度下降(SGDo)的算法,是实践中训练机器学习模型的主要方法。然而,与"有替换"和"单遍"对应方法相比,SGDo及相关算法的数学理论仍未得到充分探索。在本文中,我们基于由我们称为"历元布朗运动"的随机过程驱动的Young微分方程,提出了具有加性噪声的SGDo的随机连续时间逼近。对于强凸目标和形如$u_t = \frac{1}{(1+t)^\beta}, \beta\in(0,1)$的学习率调度,我们证明了该连续时间逼近的几乎必然收敛性,以说明其有用性。此外,我们计算了几乎必然收敛的渐近速度的上界,该上界不差于甚至优于以往关于SGDo的结果。
摘要:Gradient optimization algorithms using epochs, that is those based on stochastic gradient descent without replacement (SGDo), are predominantly used to train machine learning models in practice. However, the mathematical theory of SGDo and related algorithms remains underexplored compared to their "with replacement" and "one-pass" counterparts. In this article, we propose a stochastic, continuous-time approximation to SGDo with additive noise based on a Young differential equation driven by a stochastic process we call an "epoched Brownian motion". We show its usefulness by proving the almost sure convergence of the continuous-time approximation for strongly convex objectives and learning rate schedules of the form $u_t = \frac{1}{(1+t)^\beta}, \beta \in (0,1)$. Moreover, we compute an upper bound on the asymptotic rate of almost sure convergence, which is as good as or better than previous results for SGDo.
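The without-replacement sampling that the analysis targets is easy to state in code: each epoch visits a fresh permutation of the data exactly once. A toy run on a strongly convex noiseless least-squares objective with a schedule of the abstract's form (the constant 0.1 and all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, beta = 100, 5, 0.7
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star                          # strongly convex least-squares problem

w = np.zeros(d)
t = 0
for epoch in range(200):
    for i in rng.permutation(n):        # each epoch: one pass, no replacement
        lr = 0.1 / (1.0 + t) ** beta    # u_t = c / (1 + t)^beta schedule
        w -= lr * (X[i] @ w - y[i]) * X[i]
        t += 1

final_err = np.linalg.norm(w - w_star)
```

The per-epoch permutation is the only difference from with-replacement SGD, yet it changes the correlation structure of the noise, which is what the "epoched Brownian motion" construction is designed to capture.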
【7】Score Matching for Estimating Finite Point Processes
标题:估计有限点过程的分数匹配
链接:https://arxiv.org/abs/2512.04617
作者:Haoqun Cao,Yixuan Zhang,Feng Zhou
摘要:近年来,分数匹配估计量引起了人们的极大关注,因为它们无需计算归一化常数,从而缓解了与最大似然估计(MLE)相关的计算挑战。虽然已有多项研究提出了点过程的分数匹配估计量,但这项工作指出了这些现有方法的局限性,其主要根源在于缺乏对分数匹配在有限点过程上如何表现的数学严格分析:有限点过程是有界空间上的特殊随机配置,分数匹配的许多常见假设和性质在其上不再成立。为此,我们通过Janossy测度建立了有限点过程上分数匹配的形式化框架,并在该框架内引入了一种(自回归)加权分数匹配估计量,分析了其在经典参数设置下的统计性质。对于一般的非参数(例如深度)点过程模型,我们表明,由于微妙的归一化问题,仅靠分数匹配并不能唯一地识别真实分布;为此我们提出了一个简单的生存-分类增强,为时空情形下任何基于强度的点过程模型给出一个完整的、无需积分的训练目标。在合成与真实世界的时间及时空数据集上的实验表明,我们的方法可以准确地恢复强度,并以更高的效率取得与MLE相当的性能。
摘要:Score matching estimators have garnered significant attention in recent years because they eliminate the need to compute normalizing constants, thereby mitigating the computational challenges associated with maximum likelihood estimation (MLE). While several studies have proposed score matching estimators for point processes, this work highlights the limitations of these existing methods, which stem primarily from the lack of a mathematically rigorous analysis of how score matching behaves on finite point processes -- special random configurations on bounded spaces where many of the usual assumptions and properties of score matching no longer hold. To this end, we develop a formal framework for score matching on finite point processes via Janossy measures and, within this framework, introduce an (autoregressive) weighted score-matching estimator, whose statistical properties we analyze in classical parametric settings. For general nonparametric (e.g., deep) point process models, we show that score matching alone does not uniquely identify the ground-truth distribution due to subtle normalization issues, and we propose a simple survival-classification augmentation that yields a complete, integration-free training objective for any intensity-based point process model in the spatio-temporal case. Experiments on synthetic and real-world temporal and spatio-temporal datasets demonstrate that our method accurately recovers intensities and achieves performance comparable to MLE with better efficiency.
【8】When Robots Should Say "I Don't Know": Benchmarking Abstention in Embodied Question Answering
标题:机器人何时应该说"我不知道":具身问答中弃答的基准测试
链接:https://arxiv.org/abs/2512.04597
作者:Tao Wu,Chuhao Zhou,Guangyu Zhao,Haozhi Cao,Yewen Pu,Jianfei Yang
摘要:具身问答(EQA)要求智能体理解语言、感知环境,并在3D场景中导航以给出回答。现有的EQA基准假设每个问题都必须回答,但具身智能体应当知道自己何时没有足够的信息来作答。在这项工作中,我们关注EQA智能体的一项最低要求,即弃答:知道何时不作回答。在对500个人类查询的初步研究中,我们发现32.4%包含缺失或未明确说明的上下文。基于这一初步研究和关于人类沟通错误的认知理论,我们归纳出五个需要弃答的代表性类别:可操作性限制、指称欠明确、偏好依赖、信息不可得和错误预设。我们通过让标注者将适定问题改写为上述类别所描述的模糊变体来扩充OpenEQA。由此得到的数据集AbstainEQA包含1,636个带标注的弃答案例,并与1,636个原始OpenEQA实例配对,以进行均衡评估。在AbstainEQA上的评估显示,即使最好的前沿模型也只能达到42.79%的弃答召回率,而人类可以达到91.17%。我们还发现,扩大规模、提示和推理只带来边际收益,而微调模型会过拟合文本线索。总之,这些结果将弃答定位为具身场景中可靠交互的基本先决条件,以及有效澄清的必要基础。
摘要:Embodied Question Answering (EQA) requires an agent to interpret language, perceive its environment, and navigate within 3D scenes to produce responses. Existing EQA benchmarks assume that every question must be answered, but embodied agents should know when they do not have sufficient information to answer. In this work, we focus on a minimal requirement for EQA agents, abstention: knowing when to withhold an answer. From an initial study of 500 human queries, we find that 32.4% contain missing or underspecified context. Drawing on this initial study and cognitive theories of human communication errors, we derive five representative categories requiring abstention: actionability limitation, referential underspecification, preference dependence, information unavailability, and false presupposition. We augment OpenEQA by having annotators transform well-posed questions into ambiguous variants outlined by these categories. The resulting dataset, AbstainEQA, comprises 1,636 annotated abstention cases paired with 1,636 original OpenEQA instances for balanced evaluation. Evaluating on AbstainEQA, we find that even the best frontier model only attains 42.79% abstention recall, while humans achieve 91.17%. We also find that scaling, prompting, and reasoning only yield marginal gains, and that fine-tuned models overfit to textual cues. Together, these results position abstention as a fundamental prerequisite for reliable interaction in embodied settings and as a necessary basis for effective clarification.
【9】Reliable Statistical Guarantees for Conformal Predictors with Small Datasets
标题:小数据集下保形预测器的可靠统计保证
链接:https://arxiv.org/abs/2512.04566
作者:Miguel Sánchez-Domínguez,Lucas Lacasa,Javier de Vicente,Gonzalo Rubio,Eusebio Valero
摘要:代理模型(包括深度神经网络和监督学习中的其他机器学习算法)能够近似科学和工程中任意复杂的高维输入输出问题,但在部署到任何安全关键型应用之前,需要进行彻底的、与数据无关的不确定性量化分析。与数据无关的不确定性量化的标准方法是保形预测(CP),这是一个成熟的框架,用于构建具有已证明统计保证的不确定性模型,且不对代理模型的误差分布形状作任何假设。然而,由于CP提供的经典统计保证是以边缘覆盖率的界的形式给出的,对于小的校准集(这在旨在量化不同区域误差的现实代理建模中很常见),覆盖率分布围绕其均值的潜在强烈离散会损害不确定性模型的可靠性,常常得到低于期望值的覆盖率,使框架的适用性下降。在为机器学习从业者提供代理模型不确定性量化的入门介绍之后,本文提出一种新的统计保证来弥合这一差距,它为单个保形预测器的覆盖率提供概率信息。我们表明,所提出的框架在校准集较大时收敛于CP给出的标准解;与经典保证不同,它在数据量较小时仍能提供关于保形预测器覆盖率的可靠信息。我们在一组示例中说明并验证了该方法,并实现了一个开放获取的软件方案,可与常用的保形预测库配合使用,以获得满足新保证的不确定性模型。
摘要:Surrogate models (including deep neural networks and other machine learning algorithms in supervised learning) are capable of approximating arbitrarily complex, high-dimensional input-output problems in science and engineering, but require a thorough data-agnostic uncertainty quantification analysis before these can be deployed for any safety-critical application. The standard approach for data-agnostic uncertainty quantification is to use conformal prediction (CP), a well-established framework to build uncertainty models with proven statistical guarantees that do not assume any shape for the error distribution of the surrogate model. However, since the classic statistical guarantee offered by CP is given in terms of bounds for the marginal coverage, for small calibration set sizes (which are frequent in realistic surrogate modelling that aims to quantify error at different regions), the potentially strong dispersion of the coverage distribution around its average negatively impacts the reliability of the uncertainty model, often obtaining coverages below the expected value, resulting in a less applicable framework. After providing a gentle presentation of uncertainty quantification for surrogate models for machine learning practitioners, in this paper we bridge the gap by proposing a new statistical guarantee that offers probabilistic information for the coverage of a single conformal predictor. We show that the proposed framework converges to the standard solution offered by CP for large calibration set sizes and, unlike the classic guarantee, still offers reliable information about the coverage of a conformal predictor for small data sizes. We illustrate and validate the methodology in a suite of examples, and implement an open access software solution that can be used alongside common conformal prediction libraries to obtain uncertainty models that fulfil the new guarantee.
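下面用一个极简模拟示意摘要所述的小样本现象:分割保形预测的边缘覆盖率在期望上不低于 $1-\alpha$,但小校准集下覆盖率围绕均值的离散度远大于大校准集。示例中的"模型"恒预测0、数据为标准正态,均为演示假设。

```python
import math
import random

random.seed(1)

def coverage_once(n_cal, alpha=0.1, n_test=2000):
    # 分割保形预测,绝对误差得分:模型恒预测0,数据为 N(0,1),
    # 因此非一致性得分就是 |N(0,1)| 的抽样。
    scores = sorted(abs(random.gauss(0.0, 1.0)) for _ in range(n_cal))
    k = math.ceil((n_cal + 1) * (1 - alpha))    # 保形分位数的秩
    q = scores[k - 1] if k <= n_cal else float("inf")
    hits = sum(abs(random.gauss(0.0, 1.0)) <= q for _ in range(n_test))
    return hits / n_test

small = [coverage_once(20) for _ in range(300)]     # 小校准集:20个点
large = [coverage_once(2000) for _ in range(300)]   # 大校准集:2000个点
mean_small = sum(small) / len(small)
spread_small = max(small) - min(small)
spread_large = max(large) - min(large)
```

均值接近 $0.9$,但小校准集下单次覆盖率的波动范围明显更宽,这正是摘要中"覆盖率常低于期望值"现象的来源。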
【10】Diffusion Fine-Tuning via Reparameterized Policy Gradient of the Soft Q-Function
标题:通过软Q函数的重新参数化政策梯度进行扩散微调
链接:https://arxiv.org/abs/2512.04559
作者:Hyeongyu Kang,Jaewoo Lee,Woocheol Shin,Kiyoung Om,Jinkyoo Park
备注:36 pages, 21 figures, 4 tables
摘要:扩散模型擅长生成高似然样本,但通常需要与下游目标对齐。现有的扩散模型微调方法严重受困于奖励过度优化,导致样本奖励虽高却不自然,且多样性退化。为缓解过度优化,我们提出\textbf{Soft Q-based Diffusion Finetuning(SQDF)},这是一种用于扩散对齐的新型KL正则化RL方法,它对软Q函数的一个免训练、可微的估计应用重参数化策略梯度。SQDF还通过三项创新得到增强:在去噪过程中进行合理信用分配的折扣因子、用于改进Q函数估计的一致性模型集成,以及用于提高模式覆盖率并权衡奖励与多样性的离策略回放缓冲区。实验表明,SQDF在文本到图像对齐中保持多样性的同时实现了更高的目标奖励。此外,在在线黑盒优化中,SQDF在保持自然性和多样性的同时获得了高样本效率。
摘要:Diffusion models excel at generating high-likelihood samples but often require alignment with downstream objectives. Existing fine-tuning methods for diffusion models significantly suffer from reward over-optimization, resulting in high-reward but unnatural samples and degraded diversity. To mitigate over-optimization, we propose \textbf{Soft Q-based Diffusion Finetuning (SQDF)}, a novel KL-regularized RL method for diffusion alignment that applies a reparameterized policy gradient of a training-free, differentiable estimation of the soft Q-function. SQDF is further enhanced with three innovations: a discount factor for proper credit assignment in the denoising process, the integration of consistency models to refine Q-function estimates, and the use of an off-policy replay buffer to improve mode coverage and manage the reward-diversity trade-off. Our experiments demonstrate that SQDF achieves superior target rewards while preserving diversity in text-to-image alignment. Furthermore, in online black-box optimization, SQDF attains high sample efficiency while maintaining naturalness and diversity.
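作为背景示意,KL正则化RL在离散动作上的软Q解有封闭形式:$\pi^*(a)\propto \pi_{\mathrm{ref}}(a)\exp(Q(a)/\beta)$。下面的玩具例子(动作数、Q值均为演示假设)展示了温度 $\beta$ 如何在"追逐奖励"与"贴近参考策略"之间权衡;这并非论文在扩散模型上的SQDF实现。

```python
import math

def kl_regularized_policy(q_values, p_ref, beta=1.0):
    # max_pi E_pi[Q] - beta * KL(pi || p_ref) 的封闭解:
    # pi*(a) 正比于 p_ref(a) * exp(Q(a) / beta)。
    w = [p * math.exp(q / beta) for p, q in zip(p_ref, q_values)]
    z = sum(w)
    return [x / z for x in w]

p_ref = [0.25, 0.25, 0.25, 0.25]          # 参考(预训练)策略
q_vals = [1.0, 0.0, 0.0, -1.0]            # 四个动作的软Q值(假设数据)
pi_sharp = kl_regularized_policy(q_vals, p_ref, beta=0.5)    # 低温:偏向高Q动作
pi_loose = kl_regularized_policy(q_vals, p_ref, beta=100.0)  # 高温:几乎退回参考策略
```

$\beta$ 越大,KL项越占主导,策略越接近参考分布,这对应摘要中"缓解奖励过度优化"的正则化直觉。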
【11】Mathematical Framing for Different Agent Strategies
标题:不同代理策略的数学框架
链接:https://arxiv.org/abs/2512.04469
作者:Philip Stephens,Emmanuel Salawu
摘要:我们引入了一个统一的数学和概率框架来理解和比较不同的AI代理策略。我们弥合了高级代理设计概念(如ReAct,多代理系统和控制流)与严格的数学公式之间的差距。我们的方法框架代理过程作为一个链的概率,使不同的策略如何操纵这些概率,以实现预期的结果进行了详细的分析。我们的框架提供了一个共同的语言,讨论各种代理架构中固有的权衡。我们的主要贡献之一是引入了“自由度”概念,直观地区分了每种方法的可优化杠杆,从而指导为特定任务选择适当的策略。这项工作旨在提高设计和评估人工智能代理的清晰度和精确度,为最大限度地提高复杂代理系统中成功行动的概率提供见解。
摘要:We introduce a unified mathematical and probabilistic framework for understanding and comparing diverse AI agent strategies. We bridge the gap between high-level agent design concepts, such as ReAct, multi-agent systems, and control flows, and a rigorous mathematical formulation. Our approach frames agentic processes as a chain of probabilities, enabling a detailed analysis of how different strategies manipulate these probabilities to achieve desired outcomes. Our framework provides a common language for discussing the trade-offs inherent in various agent architectures. One of our many key contributions is the introduction of the "Degrees of Freedom" concept, which intuitively differentiates the optimizable levers available for each approach, thereby guiding the selection of appropriate strategies for specific tasks. This work aims to enhance the clarity and precision in designing and evaluating AI agents, offering insights into maximizing the probability of successful actions within complex agentic systems.
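摘要中"将智能体过程视为概率链"的想法可以用一个极简算例说明:若各步独立,整链成功率为各步成功率之积;而一个"验证-重试"策略把单步成功率提升为 $1-(1-p)^k$。以下数值纯属演示假设。

```python
def chain_success(step_probs):
    # 各步独立时,整条链成功的概率是逐步成功率的乘积。
    out = 1.0
    for p in step_probs:
        out *= p
    return out

def with_retries(p, k):
    # 带验证的重试循环:只有 k 次尝试全部失败才算失败。
    return 1.0 - (1.0 - p) ** k

plain = chain_success([0.9] * 5)                       # 单次执行的五步智能体
retry = chain_success([with_retries(0.9, 3)] * 5)      # 每步最多验证重试3次
```

单步0.9的五步链成功率仅约0.59,而每步加入3次验证重试后整链成功率升至约0.995,直观展示了不同策略如何"操纵"这条概率链。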
【12】Plug-and-Play Image Restoration with Flow Matching: A Continuous Viewpoint
标题:使用流量匹配的即插即用图像恢复:连续的观点
链接:https://arxiv.org/abs/2512.04283
作者:Fan Jia,Yuhao Huang,Shih-Hsin Wang,Cristina Garcia-Cardona,Andrea L. Bertozzi,Bao Wang
摘要:基于流匹配的生成模型已被集成到即插即用图像恢复框架中,并且由此产生的即插即用流匹配(PnP-Flow)模型在图像恢复方面取得了一些显着的经验成功。然而,PnP-Flow的理论理解滞后于其实证成功。在本文中,我们得到了一个连续的极限PnP-Flow,导致一个随机微分方程代理模型的PnP-Flow。该模型提供了两个特殊的见解,以改善PnP-Flow图像恢复:(1)它使我们能够量化图像恢复的误差,通知我们改进步骤调度和正则化神经网络参数化向量场的Lipschitz常数以减少误差。(2)它告诉我们,通过外推加速现成的PnP-Flow模型,从而产生所提出的PnP-Flow模型的重新缩放版本。我们使用几个基准任务验证了SDE-informed改进的PnP-Flow的有效性,包括图像去噪,去模糊,超分辨率和修复。数值结果表明,我们的方法显着优于基准PnP-Flow和其他最先进的方法,实现了跨评估指标的卓越性能。
摘要:Flow matching-based generative models have been integrated into the plug-and-play image restoration framework, and the resulting plug-and-play flow matching (PnP-Flow) model has achieved some remarkable empirical success for image restoration. However, the theoretical understanding of PnP-Flow lags its empirical success. In this paper, we derive a continuous limit for PnP-Flow, resulting in a stochastic differential equation (SDE) surrogate model of PnP-Flow. The SDE model provides two particular insights to improve PnP-Flow for image restoration: (1) It enables us to quantify the error for image restoration, informing us to improve step scheduling and regularize the Lipschitz constant of the neural network-parameterized vector field for error reduction. (2) It informs us to accelerate off-the-shelf PnP-Flow models via extrapolation, resulting in a rescaled version of the proposed SDE model. We validate the efficacy of the SDE-informed improved PnP-Flow using several benchmark tasks, including image denoising, deblurring, super-resolution, and inpainting. Numerical results show that our method significantly outperforms the baseline PnP-Flow and other state-of-the-art approaches, achieving superior performance across evaluation metrics.
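作为背景示意,流匹配的直线概率路径 $x_t=(1-t)x_0+tx_1$ 对应常值速度场 $v=x_1-x_0$,对该ODE做欧拉积分可精确把 $x_0$ 输运到 $x_1$。下面是一维玩具例子,仅用于说明流匹配的ODE视角,与论文的SDE极限分析没有直接对应。

```python
# 玩具条件流匹配:直线路径 x_t = (1 - t) * x0 + t * x1 的
# 目标速度场为 v(x_t, t) = x1 - x0(沿路径恒定)。
# 对 dx/dt = v 做欧拉积分可把 x0 精确输运到 x1。
def euler_flow(x0, x1, n_steps):
    x, dt = x0, 1.0 / n_steps
    for _ in range(n_steps):
        v = x1 - x0          # 直线路径的真实速度场
        x = x + dt * v
    return x

x_final = euler_flow(-2.0, 3.0, 50)
```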
【13】Bootstrapped Mixed Rewards for RL Post-Training: Injecting Canonical Action Order
标题:用于RL后训练的自举混合奖励:注入规范动作顺序
链接:https://arxiv.org/abs/2512.04277
作者:Prakhar Gupta,Vaibhav Gupta
摘要:强化学习(RL)后训练通常只优化单个标量目标,而忽略解的产生过程中的结构。我们探讨:仅在RL后训练阶段引入指向规范求解器顺序的标量提示,能否提高性能,即使模型是在随机化求解序列上微调的。在数独任务上,我们先以标准方式在随机求解顺序上微调一个Transformer,再用组相对策略优化(GRPO)对其进行后训练,采用两种奖励:单元格准确度,以及当模型的输出顺序与求解器顺序一致时增大的排序奖励。为了干净地比较两种信号,我们以固定比例混合它们,并使用简单的自举缩放在初始化时均衡各分量的幅度。混合奖励通常优于仅用单元格准确度的优化:最佳混合所得的测试准确度远高于仅在随机顺序上微调的模型,并在准确度上接近仅在求解器顺序序列上微调的模型。这些结果表明,粗粒度的排序信号可以在不修改监督数据或架构的情况下,将RL后训练引向求解器顺序的轨迹。
摘要:Post-training with reinforcement learning (RL) typically optimizes a single scalar objective and ignores structure in how solutions are produced. We ask whether a scalar hint toward a canonical solver ordering, used only during RL post-training, improves performance even when fine-tuned on randomized solution sequences. On Sudoku, we train a Transformer with standard fine-tuning on randomized solving orders, then post-train it with Group Relative Policy Optimization (GRPO) with two rewards: cell accuracy and an ordering reward that increases when the model's emission order aligns with the solver order. To compare signals cleanly, we combine them via fixed mixtures and use a simple bootstrapped scaling to equalize component magnitudes at initialization. Mixed rewards generally outperform cell-only optimization--the best mixture yields substantially higher test accuracy than the fine-tuned-only model trained on random-order and approaches the fine-tuned-only model trained on solver-order sequences in accuracy. These results suggest that coarse ordering signals can steer RL post-training toward solver-order trajectories without modifying supervised data or architecture.
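摘要中"在初始化时用自举缩放均衡奖励分量幅度"的做法可以用如下草图示意:从初始化策略采样两种奖励,令缩放因子等于二者平均幅度之比,再按固定权重混合。数值与奖励范围均为演示假设。

```python
import random

random.seed(2)

def bootstrap_scale(samples_a, samples_b):
    # 缩放奖励B,使其初始平均幅度与奖励A相同。
    mean_a = sum(map(abs, samples_a)) / len(samples_a)
    mean_b = sum(map(abs, samples_b)) / len(samples_b)
    return mean_a / mean_b

# 假设的初始化阶段rollout:单元格准确度落在[0,1],排序奖励的量级小得多。
cell = [random.uniform(0.3, 0.7) for _ in range(100)]
order = [random.uniform(0.0, 0.05) for _ in range(100)]
s = bootstrap_scale(cell, order)

def mixed_reward(r_cell, r_order, w=0.5):
    # 均衡幅度后再做固定比例的混合。
    return w * r_cell + (1 - w) * s * r_order
```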
【14】The Geometry of Benchmarks: A New Path Toward AGI
标题:基准几何:AGI的新道路
链接:https://arxiv.org/abs/2512.04276
作者:Przemyslaw Chojecki
摘要:基准测试是评估人工智能(AI)进展的主要工具,但目前的做法是在孤立的测试套件上评估模型,对关于一般性或自主自我改进的推理几乎没有指导。在这里,我们引入一个几何框架:所有面向AI代理的心理测量测验组(battery)被视为一个结构化模空间中的点,代理的性能由该空间上的能力泛函描述。首先,我们定义了自主人工智能(AAI)量表,这是一个Kardashev式的自主性层级,基于在跨任务家族(例如推理、规划、工具使用和长时程控制)的测验组上的可测量性能。其次,我们构建了测验组的模空间,识别出在代理排序和能力推断层面无法区分的基准等价类。这种几何结构产生确定性结果:稠密的测验组族足以认证任务空间整个区域上的性能。第三,我们引入了一个一般的生成器-验证器-更新器(GVU)算子,它将强化学习、自博弈、辩论和基于验证器的微调作为特例涵盖,并将自我改进系数$\kappa$定义为能力泛函沿诱导流的李导数。关于生成与验证组合噪声的一个方差不等式给出了$\kappa > 0$的充分条件。我们的结果表明,迈向通用人工智能(AGI)的进展最好被理解为基准模空间上的流,由GVU动力学驱动,而非由个别排行榜上的分数驱动。
摘要:Benchmarks are the primary tool for assessing progress in artificial intelligence (AI), yet current practice evaluates models on isolated test suites and provides little guidance for reasoning about generality or autonomous self-improvement. Here we introduce a geometric framework in which all psychometric batteries for AI agents are treated as points in a structured moduli space, and agent performance is described by capability functionals over this space. First, we define an Autonomous AI (AAI) Scale, a Kardashev-style hierarchy of autonomy grounded in measurable performance on batteries spanning families of tasks (for example reasoning, planning, tool use and long-horizon control). Second, we construct a moduli space of batteries, identifying equivalence classes of benchmarks that are indistinguishable at the level of agent orderings and capability inferences. This geometry yields determinacy results: dense families of batteries suffice to certify performance on entire regions of task space. Third, we introduce a general Generator-Verifier-Updater (GVU) operator that subsumes reinforcement learning, self-play, debate and verifier-based fine-tuning as special cases, and we define a self-improvement coefficient $\kappa$ as the Lie derivative of a capability functional along the induced flow. A variance inequality on the combined noise of generation and verification provides sufficient conditions for $\kappa > 0$. Our results suggest that progress toward artificial general intelligence (AGI) is best understood as a flow on moduli of benchmarks, driven by GVU dynamics rather than by scores on individual leaderboards.
【15】Polynomiogram: An Integrated Framework for Root Visualization and Generative Art
标题:多项式图(Polynomiogram):根可视化和生成艺术的集成框架
链接:https://arxiv.org/abs/2512.04263
作者:Hoang Duc Nguyen,Anh Van Pham,Hien D. Nguyen
摘要:这项工作提出了Polynomiogram(多项式图)框架,一个用于从多项式根系统进行探索、可视化和艺术生成的集成计算平台。主要创新点是一个灵活的采样方案:两个独立参数从用户定义的域中抽取,并通过生成函数映射为多项式系数。该框架集成了两个互补的数值引擎:用于快速大规模计算的NumPy伴随矩阵求解器,以及用于高精度、科学严谨验证的MPSolve。这种双重架构既能为创作用途提供高效的可视化,又能为研究和教育提供精确的计算。数值精度使用经典系综(包括Kac和Lucas多项式)进行了验证。该方法被应用于三次多项式系统以分析其分岔结构,展示了它既是探索根现象的科学工具,也是可视化代数与动力系统基本概念的教学辅助。除分析之外,Polynomiogram还展示了其作为个性化生成艺术工具的潜力。例如,使用该平台生成类似木槿花的自然形态,并通过一件致敬作品创作个性化艺术品,表达对人工智能和大型语言模型进步的感激之情。
摘要:This work presents the Polynomiogram framework, an integrated computational platform for exploring, visualizing, and generating art from polynomial root systems. The main innovation is a flexible sampling scheme in which two independent parameters are drawn from user defined domains and mapped to the polynomial coefficients through a generating function. This design allows the same mathematical foundation to support both scientific investigation and generative algorithmic art. The framework integrates two complementary numerical engines: NumPy companion matrix solver for fast, large scale computation and MPSolve for high precision, scientifically rigorous validation. This dual architecture enables efficient visualization for creative use and accurate computation for research and education. Numerical accuracy was verified using classical ensembles, including the Kac and Lucas polynomials. The method was applied to the cubic polynomial system to analyze its bifurcation structure, demonstrating its value as both a scientific tool for exploring root phenomena and an educational aid for visualizing fundamental concepts in algebra and dynamical systems. Beyond analysis, the Polynomiogram also demonstrated its potential as a tool for personalized generative art. Examples include the use of the platform to generate a natural form resembling a hibiscus flower and to create personalized artwork expressing gratitude toward advances in artificial intelligence and large language models through a tribute composition.
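摘要描述的采样方案可以用几行NumPy示意:从用户定义的域中抽取两个参数 $(a,b)$,经生成函数映射为多项式系数,再用伴随矩阵求根。此处的生成函数 $c_k=a\cos(kb)$ 是为演示随意假设的,并非论文采用的形式。

```python
import numpy as np

def polynomiogram_roots(n_samples=200, degree=8, seed=0):
    # 两个自由参数 (a, b) 从用户定义的域中抽取,并经一个假设的生成函数
    # c_k = a * cos(k * b) 映射为系数;求根使用NumPy的伴随矩阵求解器。
    rng = np.random.default_rng(seed)
    pts = []
    for _ in range(n_samples):
        a = rng.uniform(0.5, 2.0)
        b = rng.uniform(0.0, np.pi)
        coeffs = [a * np.cos(k * b) for k in range(degree + 1)]
        coeffs[0] = 1.0                      # 保持首项系数非零(次数固定为degree)
        pts.extend(np.roots(coeffs))
    return np.array(pts)

roots = polynomiogram_roots()
```

把 `roots` 的实部与虚部画成散点图,就得到摘要所述的"多项式图"可视化。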
【16】ActVAE: Modelling human activity schedules with a deep conditional generative approach
标题:ActVAE:用深度条件生成方法建模人类活动时间表
链接:https://arxiv.org/abs/2512.04223
作者:Fred Shone,Tim Hillel
摘要:模拟人类活动调度行为的复杂性和多样性本质上是具有挑战性的。我们展示了一种深度条件生成机器学习方法,用于根据输入标签(如个人的年龄、就业状况或与其日程安排相关的其他信息)对现实活动日程进行建模。我们结合(i)一个结构化的潜在生成方法,(ii)一个条件的方法,通过一个新的条件VAE架构。这允许为不同的输入标签快速生成精确和现实的时间表。我们广泛评估模型的能力,使用联合密度估计框架和几个案例研究。我们还表明,我们的方法具有实际的数据和计算要求,可以部署在新的和现有的需求建模框架。我们更普遍地评估生成能力的重要性,通过比较我们的组合方法(i)没有条件的纯生成模型,以及(ii)在给定输入标签的情况下输出最有可能的时间表的纯条件模型。这种比较突出了使用深度生成方法明确建模复杂多样的人类行为的随机性的有用性。
摘要:Modelling the complexity and diversity of human activity scheduling behaviour is inherently challenging. We demonstrate a deep conditional-generative machine learning approach for the modelling of realistic activity schedules depending on input labels such as an individual's age, employment status, or other information relevant to their scheduling. We combine (i) a structured latent generative approach, with (ii) a conditional approach, through a novel Conditional VAE architecture. This allows for the rapid generation of precise and realistic schedules for different input labels. We extensively evaluate model capabilities using a joint density estimation framework and several case studies. We additionally show that our approach has practical data and computational requirements, and can be deployed within new and existing demand modelling frameworks. We evaluate the importance of generative capability more generally, by comparing our combined approach to (i) a purely generative model without conditionality, and (ii) a purely conditional model which outputs the most likely schedule given the input labels. This comparison highlights the usefulness of explicitly modelling the randomness of complex and diverse human behaviours using deep generative approaches.
【17】Measuring Agents in Production
标题:衡量生产环境中的智能体
链接:https://arxiv.org/abs/2512.04123
作者:Melissa Z. Pan,Negar Arabzadeh,Riccardo Cogo,Yuxuan Zhu,Alexander Xiong,Lakshya A Agrawal,Huanzhi Mao,Emma Shen,Sid Pallerla,Liana Patel,Shu Liu,Tianneng Shi,Xiaoyuan Liu,Jared Quincy Davis,Emmanuele Lacavalla,Alessandro Basile,Shuyi Yang,Paul Castro,Daniel Kang,Joseph E. Gonzalez,Koushik Sen,Dawn Song,Ion Stoica,Matei Zaharia,Marquita Ellis
摘要:人工智能代理正在不同行业的生产中积极运行,但公众对哪些技术方法能够成功实现现实世界的部署知之甚少。我们首次对生产中的人工智能代理进行了大规模的系统研究,调查了306名从业者,并通过26个领域的访谈进行了20个深入的案例研究。我们调查为什么组织建立代理,他们如何建立他们,他们如何评估他们,以及最大的发展挑战是什么。我们发现,生产代理通常使用简单,可控的方法构建:68%在需要人工干预之前执行最多10个步骤,70%依赖于提示现成的模型而不是权重调整,74%主要依赖于人工评估。可靠性仍然是最大的开发挑战,在确保和评估代理的正确性的困难。尽管存在这些挑战,但简单而有效的方法已经使代理商能够在不同行业产生影响。我们的研究记录了当前的实践状态,并通过为研究人员提供生产挑战的可见性,同时为实践者提供成功部署的经过验证的模式,弥合了研究和部署之间的差距。
摘要:AI agents are actively running in production across diverse industries, yet little is publicly known about which technical approaches enable successful real-world deployments. We present the first large-scale systematic study of AI agents in production, surveying 306 practitioners and conducting 20 in-depth case studies via interviews across 26 domains. We investigate why organizations build agents, how they build them, how they evaluate them, and what the top development challenges are. We find that production agents are typically built using simple, controllable approaches: 68% execute at most 10 steps before requiring human intervention, 70% rely on prompting off-the-shelf models instead of weight tuning, and 74% depend primarily on human evaluation. Reliability remains the top development challenge, driven by difficulties in ensuring and evaluating agent correctness. Despite these challenges, simple yet effective methods already enable agents to deliver impact across diverse industries. Our study documents the current state of practice and bridges the gap between research and deployment by providing researchers visibility into production challenges while offering practitioners proven patterns from successful deployments.
【18】Patient Safety Risks from AI Scribes: Signals from End-User Feedback
标题:AI Scribes的患者安全风险:来自最终用户反馈的信号
链接:https://arxiv.org/abs/2512.04118
作者:Jessica Dai,Anwen Huang,Catherine Nasrallah,Rhiannon Croci,Hossein Soleimani,Sarah J. Pollet,Julia Adler-Milstein,Sara G. Murray,Jinoos Yazdany,Irene Y. Chen
备注:ML4H Findings 2025
摘要:AI抄写员正在大规模改变临床文档。然而,它们在现实世界中的表现仍然没有得到充分研究,特别是在它们对患者安全的影响方面。为此,我们启动了一项混合方法研究,研究美国一家大型医院系统中AI抄写员用户(医疗服务提供者)提交的反馈中提出的患者安全问题。定量和定性分析都表明,由于转录错误,AI抄写员可能会引起各种患者安全风险,最重要的是关于药物和治疗;然而,需要进一步研究以确定风险的绝对程度。
摘要:AI scribes are transforming clinical documentation at scale. However, their real-world performance remains understudied, especially regarding their impacts on patient safety. To this end, we initiate a mixed-methods study of patient safety issues raised in feedback submitted by AI scribe users (healthcare providers) in a large U.S. hospital system. Both quantitative and qualitative analysis suggest that AI scribes may induce various patient safety risks due to errors in transcription, most significantly regarding medication and treatment; however, further study is needed to contextualize the absolute degree of risk.
【19】AI-Enabled grading with near-domain data for scaling feedback with human-level accuracy
标题:使用近域数据的人工智能分级,以人类水平的准确度扩展反馈
链接:https://arxiv.org/abs/2512.04113
作者:Shyam Agarwal,Ali Moghimi,Kevin C. Haudek
摘要:建构反应题对于鼓励生成性加工、检验学习者对核心概念的理解至关重要。然而,教师时间有限、班级规模大以及其他资源约束,给提供及时而详细的评估带来了重大挑战,而这种评估对完整的教育体验至关重要。此外,由于人工评分劳动强度大、自动评分又难以泛化到每一种可能的作答情形,提供及时且频繁的测评颇具挑战。本文提出了一种新颖而实用的简答型建构反应题评分方法。我们讨论了这一问题为何具有挑战性,界定了本方法适用的题目性质,最后提出了一个教师可用来评估学生开放式作答的框架,其利用近域数据,例如往年施测的类似题目的数据。所提出的方法优于最先进的机器学习模型以及未经微调的大型语言模型(如GPT-3.5、GPT-4和GPT-4o),在某些情况下,即使向LLM提供了参考/标准答案,优势幅度仍可观地超过10-20%。我们的框架不需要预先编写的评分量规,并且在设计上明确面向实际课堂场景。我们的结果还揭示了关于从近域数据中学习的有趣见解,包括我们所称的、利用人类标注数据带来的"准确性优势"与"数据优势";我们相信这是第一个将基于近域数据的自动简答评分问题形式化的工作。
摘要:Constructed-response questions are crucial to encourage generative processing and test a learner's understanding of core concepts. However, the limited availability of instructor time, large class sizes, and other resource constraints pose significant challenges in providing timely and detailed evaluation, which is crucial for a holistic educational experience. In addition, providing timely and frequent assessments is challenging since manual grading is labor intensive, and automated grading is complex to generalize to every possible response scenario. This paper proposes a novel and practical approach to grade short-answer constructed-response questions. We discuss why this problem is challenging, define the nature of questions on which our method works, and finally propose a framework that instructors can use to evaluate their students' open-responses, utilizing near-domain data like data from similar questions administered in previous years. The proposed method outperforms the state of the art machine learning models as well as non-fine-tuned large language models like GPT 3.5, GPT 4, and GPT 4o by a considerable margin of over 10-20% in some cases, even after providing the LLMs with reference/model answers. Our framework does not require pre-written grading rubrics and is designed explicitly with practical classroom settings in mind. Our results also reveal exciting insights about learning from near-domain data, including what we term as accuracy and data advantages using human-labeled data, and we believe this is the first work to formalize the problem of automated short answer grading based on the near-domain data.
【20】Rethinking AI Evaluation in Education: The TEACH-AI Framework and Benchmark for Generative AI Assistants
标题:重新思考教育中的AI评估:TEACH-AI框架和生成AI助手的基准
链接:https://arxiv.org/abs/2512.04107
作者:Shi Ding,Brian Magerko
备注:6 pages, NeurIPS 2025 Responsible Foundation Models Workshop
摘要:随着生成式人工智能(AI)持续改变教育,大多数现有的AI评估主要依赖技术性能指标,如准确率或任务效率,而忽视了人类身份、学习者能动性、情境化的学习过程和伦理考量。在本文中,我们介绍TEACH-AI(值得信赖和有效的AI课堂启发式),这是一个领域无关、以教学法为基础、与利益相关者对齐的框架,具有可测量的指标和实用的工具包,用于指导教育环境中生成式AI系统的设计、开发和评估。基于广泛的文献综述与综合,这一包含十个组成部分的评估框架和工具包清单为教育中可扩展、价值对齐的AI评估提供了基础。TEACH-AI通过社会技术、教育、理论和应用的视角重新思考"评估",吸引AI与教育领域的设计者、开发者、研究者和政策制定者共同参与。我们的工作邀请社区重新审视什么才是教育中"有效"的AI,并设计能促进共同创造、包容性以及对人类、社会和教育的长期影响的模型评估方法。
摘要:As generative artificial intelligence (AI) continues to transform education, most existing AI evaluations rely primarily on technical performance metrics such as accuracy or task efficiency while overlooking human identity, learner agency, contextual learning processes, and ethical considerations. In this paper, we present TEACH-AI (Trustworthy and Effective AI Classroom Heuristics), a domain-independent, pedagogically grounded, and stakeholder-aligned framework with measurable indicators and a practical toolkit for guiding the design, development, and evaluation of generative AI systems in educational contexts. Built on an extensive literature review and synthesis, the ten-component assessment framework and toolkit checklist provide a foundation for scalable, value-aligned AI evaluation in education. TEACH-AI rethinks "evaluation" through sociotechnical, educational, theoretical, and applied lenses, engaging designers, developers, researchers, and policymakers across AI and education. Our work invites the community to reconsider what constructs "effective" AI in education and to design model evaluation approaches that promote co-creation, inclusivity, and long-term human, social, and educational impact.
【21】Control Consistency Losses for Diffusion Bridges
标题:用于扩散桥的控制一致性损失
链接:https://arxiv.org/abs/2512.05070
作者:Samuel Howard,Nikolas Nüsken,Jakiw Pidstrigach
备注:Frontiers in Probabilistic Inference: Sampling Meets Learning Workshop at NeurIPS 2025 (Oral)
摘要:给定扩散过程的初始和终态,模拟扩散过程的条件动力学是一个重要但具有挑战性的科学问题。困难是特别明显的罕见事件,其中无条件的动态很少达到终端状态。在这项工作中,我们利用条件动力学的自洽性,以迭代的在线方式学习扩散桥,并在一系列设置中展示了有前途的经验结果。
摘要:Simulating the conditioned dynamics of diffusion processes, given their initial and terminal states, is an important but challenging problem in the sciences. The difficulty is particularly pronounced for rare events, for which the unconditioned dynamics rarely reach the terminal state. In this work, we leverage a self-consistency property of the conditioned dynamics to learn the diffusion bridge in an iterative online manner, and demonstrate promising empirical results in a range of settings.
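条件化扩散动力学的最简单解析例子是布朗桥:Doob h-变换给出漂移 $(y-X_t)/(T-t)$,使过程在时刻 $T$ 命中 $y$。下面的欧拉-丸山模拟仅为背景示意,并非论文的迭代自洽学习方法。

```python
import random

random.seed(3)

def brownian_bridge(x0, y, T=1.0, n=1000):
    # Doob h-变换的漂移 (y - x) / (T - t) 把布朗运动"钉"在终点 y 上;
    # 用欧拉-丸山格式模拟,提前一步停止以避免 t == T 处除零。
    dt = T / n
    x, t = x0, 0.0
    for _ in range(n - 1):
        drift = (y - x) / (T - t)
        x += drift * dt + random.gauss(0.0, dt ** 0.5)
        t += dt
    return x

ends = [brownian_bridge(0.0, 2.0) for _ in range(50)]
```

尽管每条路径都是随机的,所有终点都集中在目标值2附近,这正是"条件动力学"的含义。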
【22】QKAN-LSTM: Quantum-inspired Kolmogorov-Arnold Long Short-term Memory
标题:QKAN-LSTM:量子启发的Kolmogorov-Arnold长短期记忆
链接:https://arxiv.org/abs/2512.05049
作者:Yu-Chao Hsu,Jiun-Cheng Jiang,Chun-Hua Lin,Kuo-Chung Peng,Nan-Yow Chen,Samuel Yen-Chi Chen,En-Jui Kuo,Hsi-Sheng Goan
摘要:长短期记忆(LSTM)模型是一种特殊类型的递归神经网络(RNN),它是城市电信预测等领域中顺序建模任务的核心,其中时间相关性和非线性依赖性占主导地位。然而,传统的LSTM遭受高参数冗余和有限的非线性表达。在这项工作中,我们提出了量子启发的Kolmogorov-Arnold长短期记忆(QKAN-LSTM),它将数据重传激活(DARUAN)模块集成到LSTM的门控结构中。每个DARUAN充当量子变分激活函数(QVAF),增强频率适应性,并在没有多量子比特纠缠的情况下实现指数丰富的光谱表示。由此产生的架构保留量子级的表现力,同时保持完全可执行的经典硬件。对阻尼简谐运动、贝塞尔函数和城市电信三个数据集的实证评估表明,与经典LSTM相比,QKAN-LSTM实现了卓越的预测准确性和泛化能力,可训练参数减少了79%。我们将该框架扩展到Jiang-Huang-Chen-Goan网络(JHCG Net),将KAN推广到编码器-解码器结构,然后进一步使用QKAN来实现潜在的KAN,从而创建用于分层表示学习的混合QKAN(HQKAN)。因此,所提出的HQKAN-LSTM为现实世界数据环境中的量子启发顺序建模提供了一条可扩展和可解释的途径。
摘要:Long short-term memory (LSTM) models are a particular type of recurrent neural networks (RNNs) that are central to sequential modeling tasks in domains such as urban telecommunication forecasting, where temporal correlations and nonlinear dependencies dominate. However, conventional LSTMs suffer from high parameter redundancy and limited nonlinear expressivity. In this work, we propose the Quantum-inspired Kolmogorov-Arnold Long Short-Term Memory (QKAN-LSTM), which integrates Data Re-Uploading Activation (DARUAN) modules into the gating structure of LSTMs. Each DARUAN acts as a quantum variational activation function (QVAF), enhancing frequency adaptability and enabling an exponentially enriched spectral representation without multi-qubit entanglement. The resulting architecture preserves quantum-level expressivity while remaining fully executable on classical hardware. Empirical evaluations on three datasets, Damped Simple Harmonic Motion, Bessel Function, and Urban Telecommunication, demonstrate that QKAN-LSTM achieves superior predictive accuracy and generalization with a 79% reduction in trainable parameters compared to classical LSTMs. We extend the framework to the Jiang-Huang-Chen-Goan Network (JHCG Net), which generalizes KAN to encoder-decoder structures, and then further use QKAN to realize the latent KAN, thereby creating a Hybrid QKAN (HQKAN) for hierarchical representation learning. The proposed HQKAN-LSTM thus provides a scalable and interpretable pathway toward quantum-inspired sequential modeling in real-world data environments.
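数据重上传(data re-uploading)激活的核心思想可以用单量子比特电路在经典硬件上直接模拟:每层通过 $R_z(wx+b)$ 重新注入输入,夹在固定的 $R_y$ 混合旋转之间,读出 $\langle Z\rangle$ 得到一个富含多个频率的有界激活函数。以下电路结构与参数均为演示假设,并非论文中DARUAN模块的具体实现。

```python
import cmath
import math

def apply(M, v):
    # 2x2 矩阵作用在单量子比特态(2维复向量)上。
    return [M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1]]

def ry(t):
    c, s = math.cos(t / 2), math.sin(t / 2)
    return [[c, -s], [s, c]]

def rz(t):
    return [[cmath.exp(-1j * t / 2), 0.0], [0.0, cmath.exp(1j * t / 2)]]

def daruan(x, weights, biases, mix=0.7):
    # 单量子比特数据重上传:每层经 R_z(w*x + b) 重新注入 x,
    # 夹在固定的 R_y 混合旋转之间;最后读出 <Z> = |a0|^2 - |a1|^2。
    state = [1.0, 0.0]
    for w, b in zip(weights, biases):
        state = apply(ry(mix), state)
        state = apply(rz(w * x + b), state)
    state = apply(ry(mix), state)
    return abs(state[0]) ** 2 - abs(state[1]) ** 2

vals = [daruan(0.1 * k, [1.0, 2.0, 3.0], [0.1, 0.2, 0.3]) for k in range(64)]
```

输出始终落在 $[-1,1]$ 内且随 $x$ 非平凡变化:重上传层数越多,可表示的傅里叶频率越丰富,这正是摘要中"频率适应性"的来源。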
【23】Shorting Dynamics and Structured Kernel Regularization
标题:短路算子动力学与结构化核正则化
链接:https://arxiv.org/abs/2512.04874
作者:James Tian
摘要:本文提出了一种非线性算子动力学,它逐步消除给定特征子空间的影响,同时在其他方向上保留最大的结构。所诱导的正算子序列是单调的,允许精确的残差分解,并收敛于经典的短路算子(shorted operator)。将这种动力学转移到再生核希尔伯特空间,得到相应的核族,它收敛于被原核支配、且零化给定子空间的最大核。在有限样本设置中,相关的Gram算子继承了结构化的残差分解,由此得到核岭回归的一种规范形式,以及施加滋扰不变性的原则性方法。这为数据分析中的不变核构造与结构化正则化给出了统一的算子分析方法。
摘要:This paper develops a nonlinear operator dynamic that progressively removes the influence of a prescribed feature subspace while retaining maximal structure elsewhere. The induced sequence of positive operators is monotone, admits an exact residual decomposition, and converges to the classical shorted operator. Transporting this dynamic to reproducing kernel Hilbert spaces yields a corresponding family of kernels that converges to the largest kernel dominated by the original one and annihilating the given subspace. In the finite-sample setting, the associated Gram operators inherit a structured residual decomposition that leads to a canonical form of kernel ridge regression and a principled way to enforce nuisance invariance. This gives a unified operator-analytic approach to invariant kernel construction and structured regularization in data analysis.
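在有限维PSD矩阵上,经典短路算子可由(广义)Schur补直接计算:它是值域落在给定子空间内、且不超过 $A$ 的最大半正定矩阵。下面的NumPy草图只演示这一极限对象本身,不涉及论文中的单调算子动力学与RKHS转移。

```python
import numpy as np

def shorted_operator(A, p):
    # 把PSD矩阵A短路到前p个坐标张成的子空间:
    # 结果是值域含于该子空间、且满足 0 <= S <= A 的最大半正定矩阵。
    A11, A12 = A[:p, :p], A[:p, p:]
    A21, A22 = A[p:, :p], A[p:, p:]
    S_block = A11 - A12 @ np.linalg.pinv(A22) @ A21   # 广义Schur补
    out = np.zeros_like(A)
    out[:p, :p] = S_block
    return out

rng = np.random.default_rng(4)
M = rng.standard_normal((5, 5))
A = M @ M.T                       # 一个一般的PSD矩阵
S = shorted_operator(A, 3)        # 消除最后两个坐标方向的影响
```

可以验证 $S$ 与 $A-S$ 都半正定,且 $S$ 零化了最后两个坐标,对应摘要中"消除给定子空间影响、其余保留最大结构"的描述。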
【24】Provable FDR Control for Deep Feature Selection: Deep MLPs and Beyond
标题:深度特征选择的可证明FDR控制:深度MLP及其他
链接:https://arxiv.org/abs/2512.04696
作者:Kazuma Sawaya
摘要:我们开发了一个基于深度神经网络的灵活特征选择框架,可近似控制错误发现率(FDR)这一I型错误度量。该方法适用于第一层为全连接的架构;从第二层起,它可以容纳任意宽度和深度的多层感知器(MLP)、卷积与递归网络、注意力机制、残差连接和dropout。该过程还兼容采用与数据无关的初始化和学习率的随机梯度下降。据我们所知,这是第一个在如此一般的深度学习设置中为特征选择提供FDR控制理论保证的工作。 我们的分析建立在一个多指标数据生成模型和如下渐近机制之上:特征维度$n$比潜在维度$q^{*}$更快地发散,而样本量、训练迭代次数、网络深度和隐藏层宽度不受限制。在这种设置下,我们证明基于梯度的特征重要性向量的每个坐标都满足边际正态近似,从而支持渐近FDR控制的有效性。作为理论上的限制,我们假设设计矩阵具有$\mathbf{B}$-右正交不变性,并讨论了更广泛的推广。我们还给出了数值实验来印证理论结果。
摘要:We develop a flexible feature selection framework based on deep neural networks that approximately controls the false discovery rate (FDR), a measure of Type-I error. The method applies to architectures whose first layer is fully connected. From the second layer onward, it accommodates multilayer perceptrons (MLPs) of arbitrary width and depth, convolutional and recurrent networks, attention mechanisms, residual connections, and dropout. The procedure also accommodates stochastic gradient descent with data-independent initializations and learning rates. To the best of our knowledge, this is the first work to provide a theoretical guarantee of FDR control for feature selection within such a general deep learning setting. Our analysis is built upon a multi-index data-generating model and an asymptotic regime in which the feature dimension $n$ diverges faster than the latent dimension $q^{*}$, while the sample size, the number of training iterations, the network depth, and hidden layer widths are left unrestricted. Under this setting, we show that each coordinate of the gradient-based feature-importance vector admits a marginal normal approximation, thereby supporting the validity of asymptotic FDR control. As a theoretical limitation, we assume $\mathbf{B}$-right orthogonal invariance of the design matrix, and we discuss broader generalizations. We also present numerical experiments that underscore the theoretical findings.
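作为背景,FDR控制最常用的程序是Benjamini-Hochberg步进法:找到最大的 $k$ 使 $p_{(k)}\le qk/m$,并拒绝 $p$ 值最小的 $k$ 个假设。下面是一个独立的小示例(p值为假设数据);论文中的选择流程基于梯度重要性的边际正态近似,此处不作重现。

```python
def benjamini_hochberg(pvals, q=0.1):
    # BH步进法:取最大的 k 使第 k 小的 p 值满足 p_(k) <= q * k / m,
    # 拒绝 p 值最小的 k 个假设,返回被选中特征的下标。
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            k = rank
    return sorted(order[:k])

# 假设的p值:十个特征中前三个是强信号。
p = [0.001, 0.002, 0.003, 0.2, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
selected = benjamini_hochberg(p, q=0.1)
```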
【25】Fermionic neural Gibbs states
标题:费米子神经吉布斯态
链接:https://arxiv.org/abs/2512.04663
作者:Jannes Nys,Juan Carrasquilla
摘要:我们介绍了费米子神经吉布斯态(fNGS),一个用于建模强相互作用费米子有限温度性质的变分框架。fNGS从一个参考的平均场热场双态出发,使用神经网络变换与虚时演化来系统地建立强关联。将其应用于掺杂费米-哈伯德模型(一个刻画强电子关联基本特征的最小晶格模型),fNGS在很宽的温度范围和相互作用强度下,甚至在大掺杂时,都能准确重现热能,所处理的系统尺寸超出了精确方法的能力范围。这些结果展示了一条可扩展的途径,可借助量子态的神经网络表示研究一维以外强关联费米子系统的有限温度性质。
Abstract: We introduce fermionic neural Gibbs states (fNGS), a variational framework for modeling finite-temperature properties of strongly interacting fermions. fNGS starts from a reference mean-field thermofield-double state and uses neural-network transformations together with imaginary-time evolution to systematically build strong correlations. Applied to the doped Fermi-Hubbard model, a minimal lattice model capturing essential features of strong electronic correlations, fNGS accurately reproduces thermal energies over a broad range of temperatures and interaction strengths, even at large dopings, for system sizes beyond the reach of exact methods. These results demonstrate a scalable route to studying finite-temperature properties of strongly correlated fermionic systems beyond one dimension with neural-network representations of quantum states.
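The thermofield-double construction that fNGS takes as its reference state can be checked exactly for a toy two-level Hamiltonian: expectation values on the physical copy of the purified state reproduce Gibbs-state averages. The matrix `H` and inverse temperature below are arbitrary illustrative choices, and exact diagonalization stands in for the paper's neural-network ansatz:

```python
import numpy as np

# Toy Hamiltonian: a single two-level system (not a fermionic lattice).
H = np.array([[0.0, 0.5],
              [0.5, 1.0]])
beta = 2.0  # inverse temperature

# Exact Gibbs-state energy: Tr(H e^{-beta H}) / Tr(e^{-beta H}).
evals, evecs = np.linalg.eigh(H)
w = np.exp(-beta * evals)
energy_gibbs = float(np.sum(evals * w) / np.sum(w))

# Thermofield-double (purified) state on the doubled Hilbert space:
# |TFD> = sum_n e^{-beta E_n / 2} |n> (x) |n> / sqrt(Z).
amps = np.exp(-0.5 * beta * evals)
tfd = np.zeros(4)
for n in range(2):
    tfd += amps[n] * np.kron(evecs[:, n], evecs[:, n])
tfd /= np.linalg.norm(tfd)

# H acting on the physical copy only reproduces the thermal energy.
H_phys = np.kron(H, np.eye(2))
energy_tfd = float(tfd @ H_phys @ tfd)
```

fNGS applies neural-network transformations and imaginary-time evolution to such a reference state instead of diagonalizing, which is what makes system sizes beyond exact methods reachable.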
【26】NORi: An ML-Augmented Ocean Boundary Layer Parameterization
Link: https://arxiv.org/abs/2512.04452
Authors: Xin Kai Lee, Ali Ramadhan, Andre Souza, Gregory LeClaire Wagner, Simone Silvestri, John Marshall, Raffaele Ferrari
Note: 48 pages, 16 figures, submitted to Journal of Advances in Modeling Earth Systems (JAMES)
Abstract: NORi is a machine-learned (ML) parameterization of ocean boundary layer turbulence that is physics-based and augmented with neural networks. NORi stands for neural ordinary differential equations (NODEs) Richardson number (Ri) closure. The physical parameterization is controlled by a Richardson number-dependent diffusivity and viscosity. The NODEs are trained to capture the entrainment through the base of the boundary layer, which cannot be represented with a local diffusive closure. The parameterization is trained using large-eddy simulations in an "a posteriori" fashion, where parameters are calibrated with a loss function that explicitly depends on the actual time-integrated variables of interest rather than the instantaneous subgrid fluxes, which are inherently noisy. NORi is designed for the realistic nonlinear equation of state of seawater and demonstrates excellent prediction and generalization capabilities in capturing entrainment dynamics under different convective strengths, oceanic background stratifications, rotation strengths, and surface wind forcings. NORi is numerically stable for at least 100 years of integration time in large-scale simulations, despite only being trained on 2-day horizons, and can be run with time steps as long as one hour. The highly expressive neural networks, combined with a physically rigorous base closure, prove to be a robust paradigm for designing parameterizations for climate models where data requirements are drastically reduced, inference performance can be directly targeted and optimized, and numerical stability is implicitly encouraged during training.
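A Richardson-number-dependent diffusivity of the kind the base closure uses can be sketched minimally. The functional form and all coefficients below (`nu_conv`, `nu_max`, `nu_bg`, `ri_c`) are hypothetical placeholders for illustration, not NORi's calibrated closure:

```python
import numpy as np

def ri_diffusivity(ri, nu_conv=1e-2, nu_max=1e-3, nu_bg=1e-5, ri_c=0.25):
    """Illustrative Richardson-number-dependent diffusivity (m^2/s).

    A hypothetical smooth closure:
      - Ri <= 0 (statically unstable): large convective diffusivity.
      - 0 < Ri < ri_c: shear-driven mixing tapering off with Ri.
      - Ri >= ri_c: small background diffusivity only.
    """
    ri = np.asarray(ri, dtype=float)
    taper = (1.0 - np.clip(ri / ri_c, 0.0, 1.0)) ** 3
    shear = nu_bg + (nu_max - nu_bg) * taper
    return np.where(ri <= 0.0, nu_conv, shear)

# Unstable, weakly stratified, and strongly stratified conditions.
ri = np.array([-1.0, 0.1, 1.0])
kappa = ri_diffusivity(ri)
```

In NORi this kind of local closure handles shear- and convection-driven mixing, while the trained NODEs supply the nonlocal entrainment flux at the boundary layer base that no local diffusivity can represent.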
Machine translations are provided by Tencent TranSmart and are for reference only.