cs.LG 方向,今日共计105篇
大模型相关(21篇)
【1】AnatomiX, an Anatomy-Aware Grounded Multimodal Large Language Model for Chest X-Ray Interpretation
标题:AnatomiX,一种用于胸部X射线解释的解剖感知接地多模式大型语言模型
链接:https://arxiv.org/abs/2601.03191
作者:Anees Ur Rehman Hashmi,Numan Saeed,Christoph Lippert
摘要:多模态医学大型语言模型在胸部X射线解释方面取得了令人印象深刻的进展,但在空间推理和解剖理解方面仍面临挑战。虽然现有的定位(grounding)技术提高了整体性能,但它们通常无法建立真正的解剖对应关系,从而导致医学领域中不正确的解剖理解。为了解决这一差距,我们引入了AnatomiX,这是一种专为基于解剖定位的胸部X射线解释而设计的多任务多模态大型语言模型。受放射学工作流程的启发,AnatomiX采用两阶段方法:首先识别解剖结构并提取其特征,然后利用大型语言模型执行短语定位、报告生成、视觉问答和图像理解等多种下游任务。在多个基准上的大量实验表明,与现有方法相比,AnatomiX实现了更优的解剖推理,并在解剖定位、短语定位、定位诊断和定位描述等任务上将性能提高了25%以上。代码和预训练模型可在https://github.com/aneesurhashmi/anatomix获得。
摘要:Multimodal medical large language models have shown impressive progress in chest X-ray interpretation but continue to face challenges in spatial reasoning and anatomical understanding. Although existing grounding techniques improve overall performance, they often fail to establish a true anatomical correspondence, resulting in incorrect anatomical understanding in the medical domain. To address this gap, we introduce AnatomiX, a multitask multimodal large language model explicitly designed for anatomically grounded chest X-ray interpretation. Inspired by the radiological workflow, AnatomiX adopts a two stage approach: first, it identifies anatomical structures and extracts their features, and then leverages a large language model to perform diverse downstream tasks such as phrase grounding, report generation, visual question answering, and image understanding. Extensive experiments across multiple benchmarks demonstrate that AnatomiX achieves superior anatomical reasoning and delivers over 25% improvement in performance on anatomy grounding, phrase grounding, grounded diagnosis and grounded captioning tasks compared to existing approaches. Code and pretrained model are available at https://github.com/aneesurhashmi/anatomix
【2】PersonaLedger: Generating Realistic Financial Transactions with Persona Conditioned LLMs and Rule Grounded Feedback
标题:PersonaLedger:使用Persona条件LLM和规则接地反馈生成真实的财务交易
链接:https://arxiv.org/abs/2601.03149
作者:Dehao Yuan,Tyler Farnan,Stefan Tesliuc,Doron L Bergman,Yulun Wu,Xiaoyu Liu,Minghui Liu,James Montgomery,Nam H Nguyen,C. Bayan Bruss,Furong Huang
摘要:严格的隐私法规限制了对真实交易数据的访问,减缓了金融AI的开放研究。合成数据可以弥合这一差距,但现有的生成器并不能共同实现行为多样性和逻辑兼容性。规则驱动的模拟器依赖于手工制作的工作流程和肤浅的随机性,这错过了人类行为的丰富性。基于学习的生成器,如GANs,捕获相关性,但往往违反严格的财务限制,仍然需要对私有数据进行训练。我们引入了PersonaLedger,这是一个生成引擎,它使用一个基于丰富用户角色的大型语言模型来生成不同的交易流,再加上一个专家可配置的编程引擎来维护正确性。LLM和引擎在闭环中交互:在每个事件之后,引擎更新用户状态,强制执行财务规则,并返回上下文感知的“nextprompt”,该上下文感知的“nextprompt”引导LLM进行可行的下一步操作。有了这个引擎,我们创建了一个来自23,000名用户的3000万笔交易的公共数据集,以及一个具有两个任务的基准套件,即非流动性分类和身份盗窃分割。PersonaLedger提供了一个现实的,隐私保护的资源,支持预测和异常检测模型的严格评估。PersonaLedger为社区提供了一个丰富、现实和隐私保护的资源--包括代码、规则和生成日志--以加速金融AI的创新,并实现严格、可重复的评估。
摘要:Strict privacy regulations limit access to real transaction data, slowing open research in financial AI. Synthetic data can bridge this gap, but existing generators do not jointly achieve behavioral diversity and logical groundedness. Rule-driven simulators rely on hand-crafted workflows and shallow stochasticity, which miss the richness of human behavior. Learning-based generators such as GANs capture correlations yet often violate hard financial constraints and still require training on private data. We introduce PersonaLedger, a generation engine that uses a large language model conditioned on rich user personas to produce diverse transaction streams, coupled with an expert configurable programmatic engine that maintains correctness. The LLM and engine interact in a closed loop: after each event, the engine updates the user state, enforces financial rules, and returns a context aware "nextprompt" that guides the LLM toward feasible next actions. With this engine, we create a public dataset of 30 million transactions from 23,000 users and a benchmark suite with two tasks, illiquidity classification and identity theft segmentation. PersonaLedger offers a realistic, privacy preserving resource that supports rigorous evaluation of forecasting and anomaly detection models. PersonaLedger offers the community a rich, realistic, and privacy preserving resource -- complete with code, rules, and generation logs -- to accelerate innovation in financial AI and enable rigorous, reproducible evaluation.
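A minimal sketch of the closed generation loop described above, purely for illustration: `sample_llm` is a hypothetical stand-in for the persona-conditioned LLM call, and the rule engine here enforces only a single no-overdraft rule before building the next "nextprompt".

```python
# Sketch of the LLM <-> rule-engine closed loop (not the authors' code).
from dataclasses import dataclass, field

@dataclass
class UserState:
    balance: float = 1000.0
    history: list = field(default_factory=list)

def sample_llm(prompt: str) -> dict:
    """Hypothetical persona-conditioned LLM call returning one proposed transaction."""
    return {"type": "purchase", "amount": 42.0, "merchant": "grocery"}

def rule_engine(state: UserState, event: dict) -> tuple[bool, str]:
    """Enforce hard financial rules and return the context-aware 'nextprompt'."""
    ok = event["amount"] <= state.balance          # e.g. no overdraft allowed
    if ok:
        state.balance -= event["amount"]
        state.history.append(event)
    nextprompt = (
        f"Balance is now {state.balance:.2f}. "
        f"Last event: {event if ok else 'rejected (insufficient funds)'}. "
        "Propose the next plausible transaction for this persona."
    )
    return ok, nextprompt

state, prompt = UserState(), "Persona: frugal student. Propose a transaction."
for _ in range(5):                                 # closed loop: LLM -> engine -> nextprompt -> LLM
    event = sample_llm(prompt)
    accepted, prompt = rule_engine(state, event)
```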
【3】ToxiGAN: Toxic Data Augmentation via LLM-Guided Directional Adversarial Generation
标题:ToxiGAN:通过LLM引导的定向对抗生成进行有毒数据增强
链接:https://arxiv.org/abs/2601.03121
作者:Peiran Li,Jan Fillies,Adrian Paschke
备注:This paper has been accepted to the main conference of EACL 2026
摘要:以可控且特定于类别的方式增强有毒语言数据对于提高毒性分类的鲁棒性至关重要,但由于监督有限和分布偏差,仍然具有挑战性。我们提出了ToxiGAN,这是一个类感知的文本增强框架,它将对抗生成与大型语言模型(LLM)的语义指导相结合。为了解决基于GAN的增强中的常见问题,例如模式崩溃和语义漂移,ToxiGAN引入了两步定向训练策略,并利用LLM生成的中性文本作为语义压舱物。与以前的工作,把LLM作为静态发电机,我们的方法动态地选择中立的范例,以提供平衡的指导。有毒样本被明确优化,以偏离这些范例,加强类特异性对比信号。在四个仇恨语音基准测试上的实验表明,ToxiGAN在macro-F1和hate-F1中都实现了最强的平均性能,始终优于传统的和基于LLM的增强方法。消融和敏感性分析进一步证实了语义镇流器和定向训练在增强分类器鲁棒性方面的好处。
摘要:Augmenting toxic language data in a controllable and class-specific manner is crucial for improving robustness in toxicity classification, yet remains challenging due to limited supervision and distributional skew. We propose ToxiGAN, a class-aware text augmentation framework that combines adversarial generation with semantic guidance from large language models (LLMs). To address common issues in GAN-based augmentation such as mode collapse and semantic drift, ToxiGAN introduces a two-step directional training strategy and leverages LLM-generated neutral texts as semantic ballast. Unlike prior work that treats LLMs as static generators, our approach dynamically selects neutral exemplars to provide balanced guidance. Toxic samples are explicitly optimized to diverge from these exemplars, reinforcing class-specific contrastive signals. Experiments on four hate speech benchmarks show that ToxiGAN achieves the strongest average performance in both macro-F1 and hate-F1, consistently outperforming traditional and LLM-based augmentation methods. Ablation and sensitivity analyses further confirm the benefits of semantic ballast and directional training in enhancing classifier robustness.
【4】ATLAS: Adaptive Test-Time Latent Steering with External Verifiers for Enhancing LLMs Reasoning
标题:ATLAS:使用外部验证器的自适应测试时潜在引导,以增强LLM推理
链接:https://arxiv.org/abs/2601.03093
作者:Tuc Nguyen,Thai Le
备注:12 pages, 3 figures
摘要:最近关于激活和潜在转向的研究表明,修改内部表示可以有效地指导大型语言模型(LLM)提高推理和效率,而无需额外的训练。然而,大多数现有的方法依赖于固定的转向策略和静态干预强度,这限制了它们在问题实例中的鲁棒性,并且经常导致转向过度或转向不足。我们提出了自适应测试时潜在转向(ATLAS),这是一个特定于任务的框架,使用外部的轻量级潜在验证器在推理时动态控制转向决策。给定中间隐藏状态,验证器预测正在进行的推理的质量,并自适应地选择是否以及以多大强度应用转向,从而以最小的开销实现每个示例和每个步骤的调整。据我们所知,ATLAS是第一种将学习的潜在验证集成到测试时转向中以增强LLM推理的方法。多个数学推理基准测试的实验表明,ATLAS始终优于香草解码和固定转向基线,实现了更高的准确性,同时大大减少了测试时令牌的使用。这些结果表明,验证器引导的潜在自适应为在不牺牲解质量的情况下控制推理效率提供了一种有效且可扩展的机制。所有源代码都将公开。
摘要:Recent work on activation and latent steering has demonstrated that modifying internal representations can effectively guide large language models (LLMs) toward improved reasoning and efficiency without additional training. However, most existing approaches rely on fixed steering policies and static intervention strengths, which limit their robustness across problem instances and often result in over- or under-steering. We propose Adaptive Test-time Latent Steering, called (ATLAS), a task-specific framework that dynamically controls steering decisions at inference time using an external, lightweight latent verifier. Given intermediate hidden states, the verifier predicts the quality of ongoing reasoning and adaptively selects whether and how strongly to apply steering, enabling per-example and per-step adjustment with minimal overhead. To our knowledge, ATLAS is the first method to integrate learned latent verification into test-time steering for enhancing LLMs reasoning. Experiments on multiple mathematical reasoning benchmarks show that ATLAS consistently outperforms both vanilla decoding and fixed steering baselines, achieving higher accuracy while substantially reducing test-time token usage. These results demonstrate that verifier-guided latent adaptation provides an effective and scalable mechanism for controlling reasoning efficiency without sacrificing solution quality. All source code will be publicly available.
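A minimal sketch of verifier-gated latent steering at inference time. The trained `verifier` and the precomputed `steer_vec` are assumptions (placeholder weights and a random direction here), not the paper's released components.

```python
# Sketch: steer a hidden state more strongly when an external verifier judges reasoning as poor.
import torch

hidden_dim = 16
steer_vec = torch.randn(hidden_dim)          # direction assumed to improve reasoning
verifier = torch.nn.Sequential(              # lightweight external verifier (placeholder weights)
    torch.nn.Linear(hidden_dim, 1), torch.nn.Sigmoid()
)

def adaptive_steer(h: torch.Tensor, max_strength: float = 2.0) -> torch.Tensor:
    """Per-step, per-example steering strength chosen from the verifier's quality estimate."""
    with torch.no_grad():
        quality = verifier(h).item()          # 1.0 = good ongoing reasoning, 0.0 = poor
    strength = max_strength * (1.0 - quality)
    return h + strength * steer_vec

h = torch.randn(hidden_dim)                   # intermediate hidden state from the LLM
h = adaptive_steer(h)
```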
【5】Grad-ELLM: Gradient-based Explanations for Decoder-only LLMs
标题:Grad-ELLM:面向仅解码器LLM的基于梯度的解释方法
链接:https://arxiv.org/abs/2601.03089
作者:Xin Huang,Antoni B. Chan
摘要:大型语言模型(LLM)在不同的任务中表现出了卓越的能力,但其黑箱性质引起了人们对透明度和忠实性的担忧。输入归因方法旨在突出每个输入令牌对模型输出的贡献,但现有方法通常与模型无关,并且不针对Transformer特有的架构,导致忠实性有限。为了解决这个问题,我们提出了Grad-ELLM,一种面向仅解码器、基于Transformer的LLM的基于梯度的归因方法。通过聚合输出logit相对于注意力层的梯度所给出的通道重要性,以及来自注意力图的空间重要性,Grad-ELLM在每个生成步骤中生成热图,而不需要修改架构。此外,我们引入了两个忠实性度量$\pi$-Soft-NC和$\pi$-Soft-NS,它们是对Soft-NC/NS的修改,通过控制扰动文本时保留的信息量来提供更公平的比较。我们使用不同的模型评估了Grad-ELLM在情感分类、问答和开放生成任务上的性能。实验结果表明,Grad-ELLM一致地实现了优于其他归因方法的忠实性。
摘要:Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse tasks, yet their black-box nature raises concerns about transparency and faithfulness. Input attribution methods aim to highlight each input token's contributions to the model's output, but existing approaches are typically model-agnostic, and do not focus on transformer-specific architectures, leading to limited faithfulness. To address this, we propose Grad-ELLM, a gradient-based attribution method for decoder-only transformer-based LLMs. By aggregating channel importance from gradients of the output logit with respect to attention layers and spatial importance from attention maps, Grad-ELLM generates heatmaps at each generation step without requiring architectural modifications. Additionally, we introduce two faithfulness metrics $\pi$-Soft-NC and $\pi$-Soft-NS, which are modifications of Soft-NC/NS that provide fairer comparisons by controlling the amount of information kept when perturbing the text. We evaluate Grad-ELLM on sentiment classification, question answering, and open-generation tasks using different models. Experiment results show that Grad-ELLM consistently achieves superior faithfulness compared to other attribution methods.
【6】Audit Me If You Can: Query-Efficient Active Fairness Auditing of Black-Box LLMs
标题:如果你可以的话审计我:黑匣子LLM的查询高效主动公平审计
链接:https://arxiv.org/abs/2601.03087
作者:David Hartmann,Lena Pohlmann,Lelia Hanslik,Noah Gießing,Bettina Berendt,Pieter Delobelle
备注:Submitted to ACL ARR 2026
摘要:大型语言模型(LLM)在人口统计学群体中表现出系统性偏差。审计被提议作为黑盒LLM应用程序的问责工具,但其查询访问的资源消耗很大。我们将审计概念化为对目标公平性度量的不确定性估计,并引入BAFA,一个用于黑盒LLM查询高效审计的有界主动公平审计器。BAFA维护与已查询分数一致的代理模型版本空间,并通过约束经验风险最小化计算公平性度量(例如$\Delta$AUC)的不确定性区间。主动查询选择缩小这些区间,以减少估计误差。我们在两个标准公平性数据集案例研究(\textsc{CivilComments}和\textsc{Bias-in-Bios})上评估了BAFA,并与分层抽样、幂抽样和消融进行了比较。BAFA以比分层抽样少至40倍的查询次数达到目标误差阈值(例如,在\textsc{CivilComments}上,$\varepsilon=0.02$时为144次查询对5,956次查询),随时间推移表现出显著更好的性能,并且在多次运行中显示出更低的方差。这些结果表明,主动采样可以减少使用LLM进行独立公平审计所需的资源,支持持续的模型评估。
摘要:Large Language Models (LLMs) exhibit systematic biases across demographic groups. Auditing is proposed as an accountability tool for black-box LLM applications, but suffers from resource-intensive query access. We conceptualise auditing as uncertainty estimation over a target fairness metric and introduce BAFA, the Bounded Active Fairness Auditor for query-efficient auditing of black-box LLMs. BAFA maintains a version space of surrogate models consistent with queried scores and computes uncertainty intervals for fairness metrics (e.g., $\Delta$AUC) via constrained empirical risk minimisation. Active query selection narrows these intervals to reduce estimation error. We evaluate BAFA on two standard fairness dataset case studies: \textsc{CivilComments} and \textsc{Bias-in-Bios}, comparing against stratified sampling, power sampling, and ablations. BAFA achieves target error thresholds with up to 40$\times$ fewer queries than stratified sampling (e.g., 144 vs 5,956 queries at $\varepsilon=0.02$ for \textsc{CivilComments}) for tight thresholds, demonstrates substantially better performance over time, and shows lower variance across runs. These results suggest that active sampling can reduce resources needed for independent fairness auditing with LLMs, supporting continuous model evaluations.
【7】Joint Encoding of KV-Cache Blocks for Scalable LLM Serving
标题:用于可扩展LLM服务的KV-缓存块联合编码
链接:https://arxiv.org/abs/2601.03067
作者:Joseph Kampeas,Emir Haleva
备注:12 pages, 16 figures, 2 tables
摘要:现代大型语言模型(LLM)驱动交互式AI系统,但受制于键值(KV)缓存不断增长的内存开销,这限制了并发负载下的实时吞吐量。现有的KV缓存压缩方法要么依赖僵化的启发式规则,要么破坏张量布局,要么需要专门的计算,阻碍了可扩展性和部署。我们提出了KV缓存块的联合编码,它将跨请求和输入块的相似块融合为共享表示,同时保留标准缓存结构。这缓解了KV缓存内存瓶颈,无需专门硬件即可支持高并发服务。在理论上,我们在泊松过程模型下分析了融合缓存块的率失真权衡。在实验上,我们的方法实现了高达4.38$\times$的KV缓存压缩,在不同的LLM和基准测试中的准确性损失可以忽略不计,优于最近的结构化和自适应压缩基线。在实际的LLM服务中,联合编码在单机vLLM基准测试中将令牌吞吐量提高了约40%,证明了推理吞吐量的大幅提升。代码可在https://github.com/sef1/kv_fast_fusion上找到。
摘要:Modern large language models (LLMs) drive interactive AI systems but are bottlenecked by the memory-heavy growth of key-value (KV) caches, which limits real-time throughput under concurrent loads. Existing KV-cache compression methods rely on rigid heuristics, disrupt tensor layouts, or require specialized compute, hindering scalability and deployment. We propose joint encoding of KV-cache blocks, which fuses similar blocks across requests and input chunks into shared representations while preserving standard cache structure. This alleviates the KV-cache memory bottleneck, supporting high-concurrency serving without specialized hardware. Theoretically, we analyze the rate-distortion tradeoff of fused cache blocks under a Poisson process model. Empirically, our method achieves up to 4.38 $\times$ KV-cache compression with negligible accuracy loss across diverse LLMs and benchmarks, outperforming recent structured and adaptive compression baselines. In real LLM serving, joint encoding improves the token throughput by $\sim$40\% on a single-machine vLLM benchmark, demonstrating substantial gains in inference throughput. Code is available at https://github.com/sef1/kv_fast_fusion.
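An illustrative sketch of fusing similar KV-cache blocks into shared representations. The cosine-similarity threshold and plain block reuse are assumptions for illustration; they are not the paper's exact encoder.

```python
# Sketch: deduplicate similar KV-cache blocks while keeping the block layout intact.
import torch

def fuse_kv_blocks(blocks: torch.Tensor, threshold: float = 0.95):
    """blocks: (num_blocks, block_len, head_dim). Returns fused blocks and an index map."""
    flat = torch.nn.functional.normalize(blocks.flatten(1), dim=1)
    fused, index_map = [], []
    for b, vec in zip(blocks, flat):
        for i, kept in enumerate(fused):
            kept_vec = torch.nn.functional.normalize(kept.flatten(), dim=0)
            if torch.dot(vec, kept_vec) > threshold:
                index_map.append(i)            # reuse an existing shared block
                break
        else:
            fused.append(b.clone())            # no similar block yet: keep this one
            index_map.append(len(fused) - 1)
    return torch.stack(fused), index_map       # standard per-block cache structure preserved

blocks = torch.randn(8, 16, 64)
fused, index_map = fuse_kv_blocks(blocks)
print(f"{blocks.shape[0]} blocks -> {fused.shape[0]} shared blocks")
```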
【8】Do LLMs Encode Functional Importance of Reasoning Tokens?
标题:LLM是否编码推理令牌的功能重要性?
链接:https://arxiv.org/abs/2601.03066
作者:Janvijay Singh,Dilek Hakkani-Tür
备注:20 pages, 8 figures, 2 tables
摘要:大型语言模型通过生成长推理链来解决复杂任务,以增加计算成本和降低分离功能相关推理的能力为代价来实现更高的准确性。先前关于紧凑推理的工作通过概率抽样、启发式方法或来自前沿模型的监督来缩短这种链,但对模型是否在内部编码了用于答案生成的令牌级功能重要性的了解有限。我们以诊断的方式解决这一差距,并提出贪婪修剪,这是一种保持似然的删除过程:在指定目标下,迭代删除那些删除后对模型似然影响最小的推理令牌,从而产生长度可控的推理链。我们在蒸馏框架中评估修剪后的推理,并表明在修剪链上训练的学生模型在匹配的推理长度上优于前沿模型监督的压缩基线。最后,我们的分析揭示了系统性的修剪模式,并表明注意力分数可以预测贪婪修剪的排序,进一步表明模型对推理令牌编码了非平凡的功能重要性结构。
摘要:Large language models solve complex tasks by generating long reasoning chains, achieving higher accuracy at the cost of increased computational cost and reduced ability to isolate functionally relevant reasoning. Prior work on compact reasoning shortens such chains through probabilistic sampling, heuristics, or supervision from frontier models, but offers limited insight into whether models internally encode token-level functional importance for answer generation. We address this gap diagnostically and propose greedy pruning, a likelihood-preserving deletion procedure that iteratively removes reasoning tokens whose removal minimally degrades model likelihood under a specified objective, yielding length-controlled reasoning chains. We evaluate pruned reasoning in a distillation framework and show that students trained on pruned chains outperform a frontier-model-supervised compression baseline at matched reasoning lengths. Finally, our analysis reveals systematic pruning patterns and shows that attention scores can predict greedy pruning ranks, further suggesting that models encode a nontrivial functional importance structure over reasoning tokens.
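A simplified sketch of the greedy, likelihood-preserving pruning loop described above. The `chain_log_likelihood` function is a toy placeholder; in the paper the score would come from the model's likelihood of the answer given the partially pruned chain.

```python
# Sketch: iteratively delete the reasoning token whose removal hurts the (toy) likelihood least.
def chain_log_likelihood(tokens: list[str]) -> float:
    return -0.1 * sum(len(t) for t in tokens)     # toy stand-in for an LLM likelihood

def greedy_prune(tokens: list[str], target_len: int) -> list[str]:
    tokens = list(tokens)
    while len(tokens) > target_len:               # length-controlled pruning
        best_i = max(range(len(tokens)),
                     key=lambda i: chain_log_likelihood(tokens[:i] + tokens[i + 1:]))
        tokens.pop(best_i)                        # remove the least-degrading token
    return tokens

chain = "first add the numbers then divide by two so the answer is 7".split()
print(greedy_prune(chain, target_len=5))
```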
【9】From Memorization to Creativity: LLM as a Designer of Novel Neural-Architectures
标题:从记忆到创造力:LLM作为新颖神经架构的设计者
链接:https://arxiv.org/abs/2601.02997
作者:Waleed Khalid,Dmitry Ignatov,Radu Timofte
摘要:大型语言模型(LLM)在程序合成方面表现出色,但它们自主驾驭神经架构设计的能力(即在语法可靠性、性能和结构新颖性之间取得平衡)仍有待探索。我们通过将面向代码的LLM置于闭环合成框架内,分析其在22个监督微调周期内的演变来解决这个问题。该模型合成PyTorch卷积网络,这些网络经过验证、通过低保真度性能信号(单历元精度)进行评估,并使用MinHash-Jaccard标准进行过滤以防止结构冗余。高性能且新颖的架构被转换为提示-代码对,用于通过从LEMUR数据集初始化的参数高效LoRA自适应进行迭代微调。在整个周期中,LLM内化了经验性的架构先验,成为一个强大的生成器。有效生成率稳定在50.6%(峰值为74.5%),平均第一历元准确率从28.06%上升到50.99%,超过40%准确率的候选架构比例从2.04%上升到96.81%。分析证实,该模型超越了对既有结构模式的复制,合成了原始语料库中没有的455个高性能架构。通过将代码合成建立在执行反馈之上,这项工作提供了一个可扩展的蓝图,用于将随机生成器转换为自主的、性能驱动的神经设计器,表明LLM可以内化经验性的、非文本的奖励,以超越其训练数据。
摘要:Large language models (LLMs) excel in program synthesis, yet their ability to autonomously navigate neural architecture design--balancing syntactic reliability, performance, and structural novelty--remains underexplored. We address this by placing a code-oriented LLM within a closed-loop synthesis framework, analyzing its evolution over 22 supervised fine-tuning cycles. The model synthesizes PyTorch convolutional networks which are validated, evaluated via low-fidelity performance signals (single-epoch accuracy), and filtered using a MinHash-Jaccard criterion to prevent structural redundancy. High-performing, novel architectures are converted into prompt-code pairs for iterative fine-tuning via parameter-efficient LoRA adaptation, initialized from the LEMUR dataset. Across cycles, the LLM internalizes empirical architectural priors, becoming a robust generator. The valid generation rate stabilizes at 50.6 percent (peaking at 74.5 percent), while mean first-epoch accuracy rises from 28.06 percent to 50.99 percent, and the fraction of candidates exceeding 40 percent accuracy grows from 2.04 percent to 96.81 percent. Analyses confirm the model moves beyond replicating existing motifs, synthesizing 455 high-performing architectures absent from the original corpus. By grounding code synthesis in execution feedback, this work provides a scalable blueprint for transforming stochastic generators into autonomous, performance-driven neural designers, establishing that LLMs can internalize empirical, non-textual rewards to transcend their training data.
【10】Reliability-Aware Adaptive Self-Consistency for Efficient Sampling in LLM Reasoning
标题:可靠性感知自适应自一致性,用于LLM推理中的高效采样
链接:https://arxiv.org/abs/2601.02970
作者:Junseok Kim,Nakyeong Yang,Kyungmin Min,Kyomin Jung
备注:15 pages, 8 figures
摘要:自一致性通过多样本聚合提高了推理的可靠性,但会产生大量的推理成本。自适应自一致性方法通过调整采样预算来缓解这个问题;然而,它们依赖于基于计数的停止规则,该规则平等对待所有响应,通常导致不必要的采样。我们提出了可靠性感知自适应自一致性(ReASC),它通过将自适应采样从响应计数重构为证据充分性判断来解决这一限制,利用响应级别的置信度进行有原则的信息聚合。ReASC分两个阶段运作:一个单样本决策阶段,用于解决从单个响应即可有把握回答的实例;以及一个可靠性感知的累积阶段,联合利用各响应的频率和置信度进行聚合。在五个模型和四个数据集上,与现有基线相比,ReASC始终实现了最佳的准确性-成本权衡,在从3B到27B参数的模型规模上提高了推理效率。作为一个具体的例子,ReASC相对于自一致性降低了高达70%的推理成本,同时使用Gemma-3-4B-it在GSM8K上保持准确性。
摘要:Self-Consistency improves reasoning reliability through multi-sample aggregation, but incurs substantial inference cost. Adaptive self-consistency methods mitigate this issue by adjusting the sampling budget; however, they rely on count-based stopping rules that treat all responses equally, often leading to unnecessary sampling. We propose Reliability-Aware Adaptive Self-Consistency (ReASC), which addresses this limitation by reframing adaptive sampling from response counting to evidence sufficiency, leveraging response-level confidence for principled information aggregation. ReASC operates in two stages: a single-sample decision stage that resolves instances confidently answerable from a single response, and a reliability-aware accumulation stage that aggregates responses by jointly leveraging their frequency and confidence. Across five models and four datasets, ReASC consistently achieves the best accuracy-cost trade-off compared to existing baselines, yielding improved inference efficiency across model scales from 3B to 27B parameters. As a concrete example, ReASC reduces inference cost by up to 70\% relative to self-consistency while preserving accuracy on GSM8K using Gemma-3-4B-it.
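A minimal sketch of reliability-aware aggregation with an evidence-sufficiency stopping rule. The confidence source and both thresholds are assumptions for illustration, not the paper's exact values.

```python
# Sketch: stage 1 accepts a single high-confidence answer; stage 2 accumulates
# frequency-times-confidence evidence and stops early once it is sufficient.
import random
from collections import defaultdict

def reasc(sample_fn, max_samples: int = 16, single_conf: float = 0.9, evidence_target: float = 2.0):
    """sample_fn() -> (answer, confidence in [0, 1])."""
    answer, conf = sample_fn()
    if conf >= single_conf:                 # stage 1: confidently answerable from one response
        return answer
    evidence = defaultdict(float)
    evidence[answer] += conf
    for _ in range(max_samples - 1):        # stage 2: accumulate frequency x confidence
        answer, conf = sample_fn()
        evidence[answer] += conf
        if max(evidence.values()) >= evidence_target:
            break                           # enough evidence; stop sampling early
    return max(evidence, key=evidence.get)

print(reasc(lambda: (random.choice(["42", "41"]), random.uniform(0.4, 0.8))))
```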
【11】Image, Word and Thought: A More Challenging Language Task for the Iterated Learning Model
标题:图像、文字和思想:迭代学习模型的更具挑战性的语言任务
链接:https://arxiv.org/abs/2601.02911
作者:Hyoyeon Lee,Seth Bullock,Conor Houghton
备注:This is an extended version of a paper accepted for EvoLang2026, it includes additional details of the numerical experiments
摘要:迭代学习模型模拟了语言的代代相传,以探索语言传播所施加的约束如何促进语言结构的出现。尽管每个模型化的语言学习者都是从一张白纸开始的,但限制学习者所接触的话语数量的瓶颈的存在可能导致语言的出现,这些语言缺乏歧义,受语法规则支配,并且在连续几代人中是一致的,也就是说,一种表达性,成分和稳定的语言。最近引入了一种计算上更易处理且生态上有效的半监督迭代学习模型,将监督和无监督学习结合在自动编码器架构中,从而能够探索更大意义信号空间的语言传输动态。在这里,第一次,该模型已成功地应用于语言学习任务,涉及更复杂的意义的沟通:七段显示图像。在这个模型中,智能体能够学习和传递一种表达性的语言:所有128个字形都采用不同的代码;组成性:信号成分一致地映射到意义成分;稳定性:语言不会从一代到一代发生变化。
摘要:The iterated learning model simulates the transmission of language from generation to generation in order to explore how the constraints imposed by language transmission facilitate the emergence of language structure. Despite each modelled language learner starting from a blank slate, the presence of a bottleneck limiting the number of utterances to which the learner is exposed can lead to the emergence of language that lacks ambiguity, is governed by grammatical rules, and is consistent over successive generations, that is, one that is expressive, compositional and stable. The recent introduction of a more computationally tractable and ecologically valid semi supervised iterated learning model, combining supervised and unsupervised learning within an autoencoder architecture, has enabled exploration of language transmission dynamics for much larger meaning-signal spaces. Here, for the first time, the model has been successfully applied to a language learning task involving the communication of much more complex meanings: seven-segment display images. Agents in this model are able to learn and transmit a language that is expressive: distinct codes are employed for all 128 glyphs; compositional: signal components consistently map to meaning components, and stable: the language does not change from generation to generation.
【12】TA-Prompting: Enhancing Video Large Language Models for Dense Video Captioning via Temporal Anchors
标题:TA-Prompting:通过时间锚点增强用于密集视频字幕的视频大型语言模型
链接:https://arxiv.org/abs/2601.02908
作者:Wei-Yuan Cheng,Kai-Po Chang,Chi-Pin Huang,Fu-En Yang,Yu-Chiang Frank Wang
备注:8 pages for main paper (exclude citation pages), 6 pages for appendix, totally 10 figures 7 tables and 2 algorithms. The paper is accepted by WACV 2026
摘要:密集视频字幕旨在解释和描述整个输入视频中所有时间上局部化的事件。最近的最先进方法利用大语言模型(LLM)为视频数据提供详细的时刻描述。然而,现有的VideoLLM在识别未修剪视频中的精确事件边界方面仍然存在困难,导致生成的字幕没有被正确定位。在本文中,我们提出了TA-Prompting,它通过时间锚点(Temporal Anchors)来增强VideoLLM,学习精确定位事件,并提示VideoLLM执行时间感知的视频事件理解。在推理过程中,为了从视频中呈现的任意数量的事件中正确确定输出字幕序列,我们引入了事件连贯采样策略,以选择在时间事件之间具有足够连贯性、且与给定视频具有跨模态相似性的事件字幕。通过在基准数据集上的广泛实验,我们表明TA-Prompting优于最先进的VideoLLM,在密集视频字幕和时间理解任务(包括时刻检索和时序问答)上取得了更优的性能。
摘要:Dense video captioning aims to interpret and describe all temporally localized events throughout an input video. Recent state-of-the-art methods leverage large language models (LLMs) to provide detailed moment descriptions for video data. However, existing VideoLLMs remain challenging in identifying precise event boundaries in untrimmed videos, causing the generated captions to be not properly grounded. In this paper, we propose TA-Prompting, which enhances VideoLLMs via Temporal Anchors that learn to precisely localize events and prompt the VideoLLMs to perform temporal-aware video event understanding. During inference, in order to properly determine the output caption sequence from an arbitrary number of events presented within a video, we introduce an event coherent sampling strategy to select event captions with sufficient coherence across temporal events and cross-modal similarity with the given video. Through extensive experiments on benchmark datasets, we show that our TA-Prompting is favorable against state-of-the-art VideoLLMs, yielding superior performance on dense video captioning and temporal understanding tasks including moment retrieval and temporalQA.
【13】HAL: Inducing Human-likeness in LLMs with Alignment
标题:HAL:通过对齐在LLM中诱导人类相似性
链接:https://arxiv.org/abs/2601.02813
作者:Masum Hasan,Junjie Zhao,Ehsan Hoque
摘要:对话类人在人类与人工智能的交互中起着核心作用,但它仍然难以定义,测量和优化。因此,类人行为的改善主要是由规模或广泛的监督训练驱动的,而不是有针对性的对齐。我们介绍了Human Aligning LLM(HAL),这是一个使用可解释的数据驱动奖励将语言模型与会话人类相似性对齐的框架。HAL从对比对话数据中导出明确的会话特征,将它们组合成紧凑的标量分数,并将此分数用作与标准偏好优化方法对齐的透明奖励信号。使用这种方法,我们可以对齐不同大小的模型,而不会影响它们的整体性能。在大规模的人类评估中,与HAL一致的模型在对话中更经常被认为是人类。由于HAL操作于明确的、可解释的特征,因此它能够检查对齐行为并诊断意外影响。更广泛地说,HAL展示了语言的软质属性-以前不在对齐范围内-可以以可解释和可解释的方式进行测量和对齐。
摘要:Conversational human-likeness plays a central role in human-AI interaction, yet it has remained difficult to define, measure, and optimize. As a result, improvements in human-like behavior are largely driven by scale or broad supervised training, rather than targeted alignment. We introduce Human Aligning LLMs (HAL), a framework for aligning language models to conversational human-likeness using an interpretable, data-driven reward. HAL derives explicit conversational traits from contrastive dialogue data, combines them into a compact scalar score, and uses this score as a transparent reward signal for alignment with standard preference optimization methods. Using this approach, we align models of varying sizes without affecting their overall performance. In large-scale human evaluations, models aligned with HAL are more frequently perceived as human-like in conversation. Because HAL operates over explicit, interpretable traits, it enables inspection of alignment behavior and diagnosis of unintended effects. More broadly, HAL demonstrates how soft, qualitative properties of language--previously outside the scope for alignment--can be made measurable and aligned in an interpretable and explainable way.
【14】Adversarial Contrastive Learning for LLM Quantization Attacks
标题:LLM量化攻击的对抗对比学习
链接:https://arxiv.org/abs/2601.02680
作者:Dinghong Song,Zhiwei Xu,Hai Wan,Xibin Zhao,Pengfei Su,Dong Li
备注:14 pages, 5 figures
摘要:模型量化对于在资源受限的硬件上部署大型语言模型(LLM)至关重要,但最近的工作揭示了严重的安全风险,即全精度下良性的LLM在量化后可能会表现出恶意行为。在本文中,我们提出了对抗性对比学习(ACL),这是一种新的基于梯度的量化攻击,通过显式地最大化良性和有害响应概率之间的差距来实现更强的攻击效果。ACL将攻击目标定义为基于三元组的对比损失,并将其与投影梯度下降的两阶段分布式微调策略相结合,以确保稳定和高效的优化。大量的实验表明了ACL的显著效果,在过度拒绝、越狱和广告注入上分别实现了86.00%、97.69%和92.40%的攻击成功率,分别比最先进的方法最多高出44.67%、18.84%和50.80%。
摘要:Model quantization is critical for deploying large language models (LLMs) on resource-constrained hardware, yet recent work has revealed severe security risks that benign LLMs in full precision may exhibit malicious behaviors after quantization. In this paper, we propose Adversarial Contrastive Learning (ACL), a novel gradient-based quantization attack that achieves superior attack effectiveness by explicitly maximizing the gap between benign and harmful responses probabilities. ACL formulates the attack objective as a triplet-based contrastive loss, and integrates it with a projected gradient descent two-stage distributed fine-tuning strategy to ensure stable and efficient optimization. Extensive experiments demonstrate ACL's remarkable effectiveness, achieving attack success rates of 86.00% for over-refusal, 97.69% for jailbreak, and 92.40% for advertisement injection, substantially outperforming state-of-the-art methods by up to 44.67%, 18.84%, and 50.80%, respectively.
【15】Uni-FinLLM: A Unified Multimodal Large Language Model with Modular Task Heads for Micro-Level Stock Prediction and Macro-Level Systemic Risk Assessment
标题:Uni-FinLLM:具有模块化任务头的统一多模式大语言模型,用于微观层面股票预测和宏观层面系统风险评估
链接:https://arxiv.org/abs/2601.02677
作者:Gongao Zhang,Haijiang Zeng,Lu Jiang
摘要:金融机构和监管机构需要集成异构数据的系统,以评估从股票波动到系统脆弱性的风险。现有的方法通常孤立地处理这些任务,无法捕获跨尺度的依赖关系。我们提出了Uni-FinLLM,一个统一的多模态大型语言模型,它使用共享的Transformer骨干和模块化任务头来共同处理金融文本、数值时间序列、基本面和视觉数据。通过跨模态注意力和多任务优化,它学习了微观,中观和宏观预测的连贯表示。在股票预测、信用风险评估和系统风险检测方面,Uni-FinLLM的表现明显优于基线。股票方向准确率从61.7%提高到67.4%,信用风险准确率从79.6%提高到84.1%,宏观预警准确率提高到82.3%。结果验证了统一的多模态LLM可以联合建模资产行为和系统漏洞,为金融提供可扩展的决策支持引擎。
摘要:Financial institutions and regulators require systems that integrate heterogeneous data to assess risks from stock fluctuations to systemic vulnerabilities. Existing approaches often treat these tasks in isolation, failing to capture cross-scale dependencies. We propose Uni-FinLLM, a unified multimodal large language model that uses a shared Transformer backbone and modular task heads to jointly process financial text, numerical time series, fundamentals, and visual data. Through cross-modal attention and multi-task optimization, it learns a coherent representation for micro-, meso-, and macro-level predictions. Evaluated on stock forecasting, credit-risk assessment, and systemic-risk detection, Uni-FinLLM significantly outperforms baselines. It raises stock directional accuracy to 67.4% (from 61.7%), credit-risk accuracy to 84.1% (from 79.6%), and macro early-warning accuracy to 82.3%. Results validate that a unified multimodal LLM can jointly model asset behavior and systemic vulnerabilities, offering a scalable decision-support engine for finance.
【16】Extracting books from production language models
标题:从生产语言模型中提取书籍
链接:https://arxiv.org/abs/2601.02671
作者:Ahmed Ahmed,A. Feder Cooper,Sanmi Koyejo,Percy Liang
备注:We ran experiments from mid-August to mid-September 2025, notified affected providers shortly after, and now make our findings public after a 90-day disclosure window
摘要:关于LLM和版权的许多未解决的法律问题都集中在记忆上:在训练过程中,特定的训练数据是否已编码到模型的权重中,以及这些记忆的数据是否可以在模型的输出中提取。虽然许多人认为LLM不会记住很多训练数据,但最近的工作表明,可以从开放权重模型中提取大量受版权保护的文本。然而,考虑到这些系统实施的安全措施,对于生产LLM,类似的提取是否可行仍然是一个悬而未决的问题。我们使用两个阶段的过程来研究这个问题:(1)初始探测以测试提取可行性,有时使用最佳N(Best-of-N,BoN)越狱,然后是(2)迭代继续提示以尝试提取整本书。我们在四个生产LLM(Claude 3.7 Sonnet、GPT-4.1、Gemini 2.5 Pro和Grok 3)上评估了我们的流程,并用基于块的最长公共子串近似值(nv-recall)计算的分数来衡量提取成功率。通过不同的每LLM实验配置,我们能够提取不同数量的文本。对于第一阶段的探测,无需越狱即可从Gemini 2.5 Pro和Grok 3中提取文本(例如,对于《哈利·波特与魔法石》,nv-recall分别为76.8%和70.3%),而对于Claude 3.7 Sonnet和GPT-4.1则需要越狱。在某些情况下,越狱后的Claude 3.7 Sonnet几乎逐字输出整本书(例如,nv-recall=95.8%)。GPT-4.1需要显著更多的BoN尝试(例如,20倍),并最终拒绝继续(例如,nv-recall=4.0%)。总之,我们的工作强调,即使有模型和系统级的保护措施,提取(受版权保护的)训练数据仍然是生产LLM的风险。
摘要:Many unresolved legal questions over LLMs and copyright center on memorization: whether specific training data have been encoded in the model's weights during training, and whether those memorized data can be extracted in the model's outputs. While many believe that LLMs do not memorize much of their training data, recent work shows that substantial amounts of copyrighted text can be extracted from open-weight models. However, it remains an open question if similar extraction is feasible for production LLMs, given the safety measures these systems implement. We investigate this question using a two-phase procedure: (1) an initial probe to test for extraction feasibility, which sometimes uses a Best-of-N (BoN) jailbreak, followed by (2) iterative continuation prompts to attempt to extract the book. We evaluate our procedure on four production LLMs -- Claude 3.7 Sonnet, GPT-4.1, Gemini 2.5 Pro, and Grok 3 -- and we measure extraction success with a score computed from a block-based approximation of longest common substring (nv-recall). With different per-LLM experimental configurations, we were able to extract varying amounts of text. For the Phase 1 probe, it was unnecessary to jailbreak Gemini 2.5 Pro and Grok 3 to extract text (e.g, nv-recall of 76.8% and 70.3%, respectively, for Harry Potter and the Sorcerer's Stone), while it was necessary for Claude 3.7 Sonnet and GPT-4.1. In some cases, jailbroken Claude 3.7 Sonnet outputs entire books near-verbatim (e.g., nv-recall=95.8%). GPT-4.1 requires significantly more BoN attempts (e.g., 20X), and eventually refuses to continue (e.g., nv-recall=4.0%). Taken together, our work highlights that, even with model- and system-level safeguards, extraction of (in-copyright) training data remains a risk for production LLMs.
【17】Empirical Comparison of Encoder-Based Language Models and Feature-Based Supervised Machine Learning Approaches to Automated Scoring of Long Essays
标题:基于编码器的语言模型和基于特征的监督机器学习方法对长文章自动评分的实证比较
链接:https://arxiv.org/abs/2601.02659
作者:Kuo Wang,Haowei Hua,Pengfei Yan,Hong Jiao,Dan Song
备注:22 pages, 5 figures, 3 tables, presented at National Council on Measurement in Education 2025
摘要:长上下文可能会给文本处理中的仅编码器语言模型带来挑战,特别是在文章自动评分方面。该研究训练了几种常用的基于编码器的语言模型用于长文章的自动评分,并将这些训练模型的性能与建立在令牌上限为512的基础语言模型之上的集成模型进行了评估和比较。实验模型包括基于BERT的模型(BERT、RoBERTa、DistilBERT和DeBERTa)、集成来自多个编码器模型嵌入的集成模型,以及基于特征的监督机器学习模型的集成模型,包括梯度提升决策树、极端梯度提升和轻量梯度提升机。我们在17,307篇文章的数据集上按80%/10%/10%的划分训练、验证和测试了每个模型,并使用二次加权Kappa评估模型性能。这项研究表明,将多个预训练语言模型表示与梯度提升分类器相结合的嵌入集成模型在长文章评分方面显著优于单个语言模型。
摘要:Long context may impose challenges for encoder-only language models in text processing, specifically for automated scoring of essays. This study trained several commonly used encoder-based language models for automated scoring of long essays. The performance of these trained models was evaluated and compared with the ensemble models built upon the base language models with a token limit of 512. The experimented models include BERT-based models (BERT, RoBERTa, DistilBERT, and DeBERTa), ensemble models integrating embeddings from multiple encoder models, and ensemble models of feature-based supervised machine learning models, including Gradient-Boosted Decision Trees, eXtreme Gradient Boosting, and Light Gradient Boosting Machine. We trained, validated, and tested each model on a dataset of 17,307 essays, with an 80%/10%/10% split, and evaluated model performance using Quadratic Weighted Kappa. This study revealed that an ensemble-of-embeddings model that combines multiple pre-trained language model representations with gradient-boosting classifier as the ensemble model significantly outperforms individual language models at scoring long essays.
【18】Chronicals: A High-Performance Framework for LLM Fine-Tuning with 3.51x Speedup over Unsloth
标题:Chronicals:用于LLM微调的高性能框架,比Unsloth提高3.51倍
链接:https://arxiv.org/abs/2601.02609
作者:Arjun S. Nair
备注:61 pages, 25 figures, open-source framework available at https://github.com/Ajwebdevs/Chronicals and pip install chronicals
摘要:大型语言模型的微调受制于内存:一个7B参数模型需要84GB(权重14GB、梯度14GB、FP32优化器状态56GB),甚至超过了A100-40GB的容量。我们提出了Chronicals,一个开源训练框架,通过四项协同优化实现了相对Unsloth的3.51倍加速:(1)融合Triton内核,通过RMSNorm(7倍)、SwiGLU(5倍)和QK-RoPE(2.3倍)融合消除了75%的内存流量;(2)Cut Cross-Entropy,通过在线softmax计算将logit内存从5GB减少到135MB;(3)LoRA+,在适配器矩阵之间采用理论推导得到的16倍差分学习率;以及(4)最佳拟合递减序列打包,回收因填充而浪费的60-75%计算量。在Qwen2.5-0.5B和A100-40GB上,Chronicals在完全微调下实现了41,184个令牌/秒,而Unsloth为11,736个令牌/秒(3.51倍)。对于秩为32的LoRA,我们达到了11,699个令牌/秒,而Unsloth MAX为2,857个令牌/秒(4.10倍)。关键的是,我们发现Unsloth报告的每秒46,000个令牌的基准测试显示出零梯度范数,即模型并没有在训练。我们提供了完整的数学基础:在线softmax正确性证明、FlashAttention IO复杂度界限O(N^2 d^2 M^{-1})、基于梯度幅度分析的LoRA+学习率推导,以及装箱近似保证。所有实现、基准测试和证明都可以在https://github.com/Ajwebdevs/Chronicals获得,并可通过https://pypi.org/project/chronicals/进行pip安装。
摘要:Large language model fine-tuning is bottlenecked by memory: a 7B parameter model requires 84GB--14GB for weights, 14GB for gradients, and 56GB for FP32 optimizer states--exceeding even A100-40GB capacity. We present Chronicals, an open-source training framework achieving 3.51x speedup over Unsloth through four synergistic optimizations: (1) fused Triton kernels eliminating 75% of memory traffic via RMSNorm (7x), SwiGLU (5x), and QK-RoPE (2.3x) fusion; (2) Cut Cross-Entropy reducing logit memory from 5GB to 135MB through online softmax computation; (3) LoRA+ with theoretically-derived 16x differential learning rates between adapter matrices; and (4) Best-Fit Decreasing sequence packing recovering 60-75% of compute wasted on padding. On Qwen2.5-0.5B with A100-40GB, Chronicals achieves 41,184 tokens/second for full fine-tuning versus Unsloth's 11,736 tokens/second (3.51x). For LoRA at rank 32, we reach 11,699 tokens/second versus Unsloth MAX's 2,857 tokens/second (4.10x). Critically, we discovered that Unsloth's reported 46,000 tokens/second benchmark exhibited zero gradient norms--the model was not training. We provide complete mathematical foundations: online softmax correctness proofs, FlashAttention IO complexity bounds O(N^2 d^2 M^{-1}), LoRA+ learning rate derivations from gradient magnitude analysis, and bin-packing approximation guarantees. All implementations, benchmarks, and proofs are available at https://github.com/Ajwebdevs/Chronicals with pip installation via https://pypi.org/project/chronicals/.
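A sketch of Best-Fit Decreasing sequence packing, the padding-recovery idea listed as optimization (4) above; here the bin capacity plays the role of the maximum sequence length, and the greedy policy is the textbook BFD heuristic rather than the framework's exact packer.

```python
# Sketch: pack sequence lengths into as few fixed-capacity rows as possible (Best-Fit Decreasing).
def best_fit_decreasing(lengths, capacity):
    bins = []                                   # each bin: [free_space, [sequence lengths]]
    for length in sorted(lengths, reverse=True):
        best = None
        for i, (free, _) in enumerate(bins):
            if length <= free and (best is None or free < bins[best][0]):
                best = i                        # tightest bin that still fits
        if best is None:
            bins.append([capacity - length, [length]])
        else:
            bins[best][0] -= length
            bins[best][1].append(length)
    return [seqs for _, seqs in bins]

packed = best_fit_decreasing([900, 700, 600, 300, 200, 120, 60], capacity=1024)
print(packed)   # far fewer pad tokens than one sequence per 1024-token row
```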
【19】LendNova: Towards Automated Credit Risk Assessment with Language Models
标题:LendNova:使用语言模型实现自动化信用风险评估
链接:https://arxiv.org/abs/2601.02573
作者:Kiarash Shamsi,Danijel Novokmet,Joshua Peters,Mao Lin Liu,Paul K Edwards,Vahab Khoshdel
摘要:信用风险评估在金融部门至关重要,但传统上依赖于昂贵的基于特征的模型,这些模型往往无法利用原始信用记录中的所有可用信息。本文介绍了LendNova,这是第一个用于信用风险评估的实用自动化端到端管道,旨在通过利用先进的NLP技术和语言模型来利用原始信用记录中的所有可用信息。LendNova通过使用一种语言模型直接对原始的、专业术语密集的信用局文本进行操作来转换风险建模,该语言模型可以学习任务相关的表示,而无需手动进行特征工程。通过自动捕获文本中嵌入的模式和风险信号,它取代了手动预处理步骤,降低了成本并提高了可扩展性。对真实世界数据的评价进一步表明了其在准确和有效的风险评估方面的强大潜力。LendNova为智能信用风险代理建立了一个基线,证明了语言模型在该领域的可行性。它为未来的基础系统研究奠定了基础,这些系统能够实现更准确,适应性更强和自动化的财务决策。
摘要:Credit risk assessment is essential in the financial sector, but has traditionally depended on costly feature-based models that often fail to utilize all available information in raw credit records. This paper introduces LendNova, the first practical automated end-to-end pipeline for credit risk assessment, designed to utilize all available information in raw credit records by leveraging advanced NLP techniques and language models. LendNova transforms risk modeling by operating directly on raw, jargon-heavy credit bureau text using a language model that learns task-relevant representations without manual feature engineering. By automatically capturing patterns and risk signals embedded in the text, it replaces manual preprocessing steps, reducing costs and improving scalability. Evaluation on real-world data further demonstrates its strong potential in accurate and efficient risk assessment. LendNova establishes a baseline for intelligent credit risk agents, demonstrating the feasibility of language models in this domain. It lays the groundwork for future research toward foundation systems that enable more accurate, adaptable, and automated financial decision-making.
【20】LLM-Enhanced Reinforcement Learning for Time Series Anomaly Detection
标题:用于时间序列异常检测的LLM增强强化学习
链接:https://arxiv.org/abs/2601.02511
作者:Bahareh Golchin,Banafsheh Rekabdar,Danielle Justo
摘要:检测时间序列数据中的异常对于金融、医疗保健、传感器网络和工业监控应用至关重要。然而,时间序列异常检测往往受到标签稀疏、时间模式复杂和专家注释昂贵的困扰。我们提出了一个统一的框架,该框架将基于大语言模型(LLM)的势函数用于强化学习(RL)中的奖励塑造,并结合变分自编码器(VAE)增强的动态奖励缩放,以及带有标签传播的主动学习。基于LSTM的RL代理利用LLM派生的语义奖励来指导探索,而VAE重建误差则提供了无监督的异常信号。主动学习选择最不确定的样本,标签传播有效地扩展标记数据。对Yahoo-A1和SMD基准测试的评估表明,我们的方法在有限的标签预算下实现了最先进的检测精度,并在数据受限的情况下有效地运行。这项研究强调了将LLM与RL和先进的无监督技术相结合,在现实世界的应用中进行鲁棒的、可扩展的异常检测的前景。
摘要:Detecting anomalies in time series data is crucial for finance, healthcare, sensor networks, and industrial monitoring applications. However, time series anomaly detection often suffers from sparse labels, complex temporal patterns, and costly expert annotation. We propose a unified framework that integrates Large Language Model (LLM)-based potential functions for reward shaping with Reinforcement Learning (RL), Variational Autoencoder (VAE)-enhanced dynamic reward scaling, and active learning with label propagation. An LSTM-based RL agent leverages LLM-derived semantic rewards to guide exploration, while VAE reconstruction errors add unsupervised anomaly signals. Active learning selects the most uncertain samples, and label propagation efficiently expands labeled data. Evaluations on Yahoo-A1 and SMD benchmarks demonstrate that our method achieves state-of-the-art detection accuracy under limited labeling budgets and operates effectively in data-constrained settings. This study highlights the promise of combining LLMs with RL and advanced unsupervised techniques for robust, scalable anomaly detection in real-world applications.
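A toy sketch of the combined reward described above: standard potential-based shaping, r' = r + gamma * phi(s') - phi(s), plus an unsupervised VAE reconstruction-error term. Both `llm_potential` and `vae_recon_error` are placeholders for the LLM- and VAE-derived signals, and the weighting is an assumption.

```python
# Sketch: shaped reward = environment reward + potential-based shaping + VAE anomaly signal.
def llm_potential(state) -> float:
    return float(state["anomaly_hint"])          # e.g. LLM-judged suspiciousness in [0, 1]

def vae_recon_error(state) -> float:
    return float(state["recon_err"])             # reconstruction error of the VAE

def shaped_reward(r_env, state, next_state, gamma=0.99, beta=0.5):
    shaping = gamma * llm_potential(next_state) - llm_potential(state)
    return r_env + shaping + beta * vae_recon_error(next_state)

s  = {"anomaly_hint": 0.1, "recon_err": 0.02}
s2 = {"anomaly_hint": 0.8, "recon_err": 0.30}
print(shaped_reward(r_env=1.0, state=s, next_state=s2))
```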
【21】How to Discover Knowledge for FutureG: Contextual RAG and LLM Prompting for O-RAN
标题:如何发现FutureG的知识:针对O-RAN的上下文RAG和LLM提示
链接:https://arxiv.org/abs/2601.02382
作者:Nathan Conger,Nathan Scollar,Kemal Davaslioglu,Yalin E. Sagduyu,Sastry Kompella
摘要:我们提出了一个用于5G/6G网络的检索增强问答框架,其中开放无线电接入网络(O-RAN)已成为解耦、虚拟化和AI驱动的无线系统的核心。虽然O-RAN支持多供应商互操作性和云原生部署,但其快速变化的规范和接口对研究人员和从业人员构成了重大挑战。手动查阅这些复杂文档是劳动密集型且容易出错的,会减慢系统设计、集成和部署。为了应对这一挑战,我们采用了上下文检索增强生成(Contextual RAG),这是一种利用候选答案选项来引导文档检索并提供块级上下文、从而提高大型语言模型(LLM)性能的策略。这种对传统RAG的改进实现了更有针对性和上下文感知的检索,提高了传递给LLM的文档的相关性,特别是当查询本身缺乏足够的上下文来进行准确定位时。我们的框架是为动态领域而设计的,在这些领域中,数据快速演化,模型必须不断更新或重新部署,而所有这些都不需要对LLM进行微调。我们使用ORANBenchmark-13K数据集评估了这个框架,并在直接问答(Direct Q&A)和思维链(CoT)两种提示策略下比较了三个LLM,即Llama3.2、Qwen2.5-7B和Qwen3.0-4B。我们表明,Contextual RAG相对于标准RAG和基础提示始终提高了准确性,同时保持有竞争力的运行时间和二氧化碳排放量。这些结果凸显了Contextual RAG作为O-RAN和更广泛的5G/6G环境中特定领域问答的可扩展且有效解决方案的潜力,从而能够更准确地解释不断演进的标准,同时保持效率和可持续性。
摘要:We present a retrieval-augmented question answering framework for 5G/6G networks, where the Open Radio Access Network (O-RAN) has become central to disaggregated, virtualized, and AI-driven wireless systems. While O-RAN enables multi-vendor interoperability and cloud-native deployments, its fast-changing specifications and interfaces pose major challenges for researchers and practitioners. Manual navigation of these complex documents is labor-intensive and error-prone, slowing system design, integration, and deployment. To address this challenge, we adopt Contextual Retrieval-Augmented Generation (Contextual RAG), a strategy in which candidate answer choices guide document retrieval and chunk-specific context to improve large language model (LLM) performance. This improvement over traditional RAG achieves more targeted and context-aware retrieval, which improves the relevance of documents passed to the LLM, particularly when the query alone lacks sufficient context for accurate grounding. Our framework is designed for dynamic domains where data evolves rapidly and models must be continuously updated or redeployed, all without requiring LLM fine-tuning. We evaluate this framework using the ORANBenchmark-13K dataset, and compare three LLMs, namely, Llama3.2, Qwen2.5-7B, and Qwen3.0-4B, across both Direct Question Answering (Direct Q&A) and Chain-of-Thought (CoT) prompting strategies. We show that Contextual RAG consistently improves accuracy over standard RAG and base prompting, while maintaining competitive runtime and CO2 emissions. These results highlight the potential of Contextual RAG to serve as a scalable and effective solution for domain-specific Q&A in ORAN and broader 5G/6G environments, enabling more accurate interpretation of evolving standards while preserving efficiency and sustainability.
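A minimal sketch of the Contextual RAG idea: the candidate answer choices are folded into the retrieval query so that retrieved chunks are grounded in the options, not just the question. The keyword-overlap retriever and the example corpus are purely illustrative, not the paper's retriever or data.

```python
# Sketch: build a choice-aware retrieval query, then score corpus chunks by keyword overlap.
def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    terms = set(query.lower().split())
    scored = sorted(corpus, key=lambda doc: -len(terms & set(doc.lower().split())))
    return scored[:k]

question = "Which O-RAN interface connects the near-RT RIC to E2 nodes?"
choices = ["A1", "E2", "O1", "Open Fronthaul"]
contextual_query = question + " Candidate answers: " + ", ".join(choices)

corpus = [
    "The E2 interface connects the near-RT RIC to CU and DU functions.",
    "The A1 interface carries policies from the non-RT RIC to the near-RT RIC.",
    "O1 is used for management-plane operations of O-RAN components.",
]
print(retrieve(contextual_query, corpus))   # choice-aware query favors the E2/A1 chunks
```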
Graph相关(图学习|图神经网络|图优化等)(6篇)
【1】Counterfactual Fairness with Graph Uncertainty
标题:图不确定性下的反事实公平
链接:https://arxiv.org/abs/2601.03203
作者:Davi Valério,Chrysoula Zerva,Mariana Pinto,Ricardo Santos,André Carreiro
备注:Peer reviewed pre-print. Presented at the BIAS 2025 Workshop at ECML PKDD
摘要:评估机器学习(ML)模型的偏差是构建可信且鲁棒的ML系统的关键。反事实公平(CF)审计允许用因果框架来测量ML模型的偏差,但其结论依赖于单一的因果图,而这个因果图在现实世界中很少能被确定地知晓。我们提出带图不确定性的CF(CF-GU),这是一种将指定因果图所带来的不确定性纳入CF的偏差评估流程。CF-GU(i)在领域知识约束下对因果发现算法进行自助采样(bootstrap),以产生一组看似合理的有向无环图(DAG),(ii)用归一化香农熵量化图的不确定性,以及(iii)提供CF度量的置信界限。合成数据上的实验显示了相互对立的领域知识假设如何支持或反驳CF审计,而真实世界数据(COMPAS和Adult数据集)上的实验即使在仅提供最少领域知识约束时,也能以高置信度定位众所周知的偏差。
摘要:Evaluating machine learning (ML) model bias is key to building trustworthy and robust ML systems. Counterfactual Fairness (CF) audits allow the measurement of bias of ML models with a causal framework, yet their conclusions rely on a single causal graph that is rarely known with certainty in real-world scenarios. We propose CF with Graph Uncertainty (CF-GU), a bias evaluation procedure that incorporates the uncertainty of specifying a causal graph into CF. CF-GU (i) bootstraps a Causal Discovery algorithm under domain knowledge constraints to produce a bag of plausible Directed Acyclic Graphs (DAGs), (ii) quantifies graph uncertainty with the normalized Shannon entropy, and (iii) provides confidence bounds on CF metrics. Experiments on synthetic data show how contrasting domain knowledge assumptions support or refute audits of CF, while experiments on real-world data (COMPAS and Adult datasets) pinpoint well-known biases with high confidence, even when supplied with minimal domain knowledge constraints.
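A sketch of step (ii) above: quantify graph uncertainty as the normalized Shannon entropy of the bag of DAGs returned by bootstrapped causal discovery. Keying DAGs by their edge sets and normalizing by the log of the number of distinct graphs are one plausible choice, not necessarily the paper's exact definition.

```python
# Sketch: normalized Shannon entropy over a bag of bootstrapped DAGs (0 = full agreement).
import math
from collections import Counter

def normalized_entropy(dags: list[frozenset]) -> float:
    counts = Counter(dags)
    n = len(dags)
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    h_max = math.log2(len(counts)) if len(counts) > 1 else 1.0
    return h / h_max

# three bootstrap runs of a (hypothetical) causal-discovery algorithm, as edge sets
dag_a = frozenset({("race", "score"), ("age", "score")})
dag_b = frozenset({("age", "score")})
print(normalized_entropy([dag_a, dag_a, dag_b]))
```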
【2】Temporal Graph Network: Hallucination Detection in Multi-Turn Conversation
标题:时间图网络:多轮对话中的幻觉检测
链接:https://arxiv.org/abs/2601.03051
作者:Vidhi Rathore,Sambu Aneesh,Himanshu Singh
摘要:对话式人工智能系统可以产生幻觉,特别是在多轮对话中,上下文变化和矛盾最终可能会浮出水面。通过将整个对话表示为时间图,我们提出了一种新的基于图的方法来检测对话级幻觉。我们的框架将每个对话建模为一个节点,使用句子Transformer对其进行编码。我们探索了两种不同的连接方式:i)共享实体边缘,连接引用相同实体的回合; ii)时间边缘,连接会话中的连续回合。消息传递用于更新节点嵌入,允许相关节点之间的信息流。然后使用注意力池将上下文感知节点嵌入组合成单个向量,然后将其传递到分类器以确定幻觉的存在和类型。我们证明,我们的方法提供了比现有方法略有改善的性能。此外,我们的注意力机制可以用来证明决策过程。代码和模型重量可在https://github.com/sambuaneesh/anlp-project上获得。
摘要:Hallucinations can be produced by conversational AI systems, particularly in multi-turn conversations where context changes and contradictions may eventually surface. By representing the entire conversation as a temporal graph, we present a novel graph-based method for detecting dialogue-level hallucinations. Our framework models each dialogue as a node, encoding it using a sentence transformer. We explore two different ways of connectivity: i) shared-entity edges, which connect turns that refer to the same entities; ii) temporal edges, which connect contiguous turns in the conversation. Message-passing is used to update the node embeddings, allowing flow of information between related nodes. The context-aware node embeddings are then combined using attention pooling into a single vector, which is then passed on to a classifier to determine the presence and type of hallucinations. We demonstrate that our method offers slightly improved performance over existing methods. Further, we show the attention mechanism can be used to justify the decision making process. The code and model weights are made available at: https://github.com/sambuaneesh/anlp-project.
【3】When Prompting Meets Spiking: Graph Sparse Prompting via Spiking Graph Prompt Learning
标题:当提示遇到尖峰:通过尖峰图提示学习实现图稀疏提示
链接:https://arxiv.org/abs/2601.02662
作者:Bo Jiang,Weijun Zhao,Beibei Wang,Jin Tang
摘要:图提示特征(GPF)学习已广泛用于在下游任务上调整预训练的GNN模型。GPFs首先引入一些提示原子,然后使用提示原子的线性组合来学习每个图节点的最佳提示向量。然而,现有的GPF一般对节点的所有特征维度进行提示,这是明显冗余的,并且对节点特征噪声敏感。为了克服这个问题,本文首次提出了利用尖峰神经元机制来学习稀疏图提示,称为尖峰图提示特征(SpikingGPF)。我们的方法的动机是观察到尖峰神经元可以执行廉价的信息处理,并产生稀疏的输出,这自然适合我们的图稀疏提示的任务。具体来说,SpikingGPF有两个主要方面。首先,它通过利用尖峰神经元架构来学习每个节点的稀疏提示向量,从而实现对选择性节点特征的提示。这产生了更紧凑和轻量的提示设计,同时还提高了对节点噪声的鲁棒性。其次,SpikingGPF引入了一种新的基于稀疏表示理论的提示表示学习模型,即,它将每个节点提示表示为提示原子的稀疏组合。这鼓励了更紧凑的表示,并且还促进了有效的计算。在多个基准测试上的大量实验证明了SpikingGPF的有效性和鲁棒性。
摘要:Graph Prompt Feature (GPF) learning has been widely used in adapting pre-trained GNN model on the downstream task. GPFs first introduce some prompt atoms and then learns the optimal prompt vector for each graph node using the linear combination of prompt atoms. However, existing GPFs generally conduct prompting over node's all feature dimensions which is obviously redundant and also be sensitive to node feature noise. To overcome this issue, for the first time, this paper proposes learning sparse graph prompts by leveraging the spiking neuron mechanism, termed Spiking Graph Prompt Feature (SpikingGPF). Our approach is motivated by the observation that spiking neuron can perform inexpensive information processing and produce sparse outputs which naturally fits the task of our graph sparse prompting. Specifically, SpikingGPF has two main aspects. First, it learns a sparse prompt vector for each node by exploiting a spiking neuron architecture, enabling prompting on selective node features. This yields a more compact and lightweight prompting design while also improving robustness against node noise. Second, SpikingGPF introduces a novel prompt representation learning model based on sparse representation theory, i.e., it represents each node prompt as a sparse combination of prompt atoms. This encourages a more compact representation and also facilitates efficient computation. Extensive experiments on several benchmarks demonstrate the effectiveness and robustness of SpikingGPF.
【4】Multi-scale Graph Autoregressive Modeling: Molecular Property Prediction via Next Token Prediction
标题:多尺度图自回归建模:通过下一个代币预测进行分子性质预测
链接:https://arxiv.org/abs/2601.02530
作者:Zhuoyang Jiang,Yaosen Min,Peiran Jin,Lei Chen
摘要:我们提出了连接感知基序序列化(CamS),这是一种图到序列的表示,使仅解码器Transformer能够通过标准的下一个令牌预测(NTP)学习分子图。对于分子性质预测,基于SMILES的NTP扩展性好,但缺乏明确的拓扑结构,而图原生的掩码建模能捕获连接性,但有可能破坏关键的化学细节(例如,活性悬崖)。CamS通过将分子图序列化为结构丰富的因果序列来弥合这一差距。CamS首先挖掘数据驱动的连接感知基序,然后通过以骨架为根的广度优先搜索(BFS)序列化基序,以建立稳定的从核心到外围的顺序。至关重要的是,CamS通过拼接从精细到粗糙的基序尺度的序列来实现分层建模,使模型能够以密集、未损坏的局部结构证据为条件建模全局骨架。我们通过在CamS序列上预训练香草LLaMA骨干来实例化CamS-LLaMA。它在MoleculeNet和活性悬崖基准MoleculeACE上实现了最先进的性能,优于基于SMILES的语言模型和强大的图基线。可解释性分析证实,我们的多尺度因果序列化有效地将注意力引向决定悬崖的差异。
摘要:We present Connection-Aware Motif Sequencing (CamS), a graph-to-sequence representation that enables decoder-only Transformers to learn molecular graphs via standard next-token prediction (NTP). For molecular property prediction, SMILES-based NTP scales well but lacks explicit topology, whereas graph-native masked modeling captures connectivity but risks disrupting the pivotal chemical details (e.g., activity cliffs). CamS bridges this gap by serializing molecular graphs into structure-rich causal sequences. CamS first mines data-driven connection-aware motifs. It then serializes motifs via scaffold-rooted breadth-first search (BFS) to establish a stable core-to-periphery order. Crucially, CamS enables hierarchical modeling by concatenating sequences from fine to coarse motif scales, allowing the model to condition global scaffolds on dense, uncorrupted local structural evidence. We instantiate CamS-LLaMA by pre-training a vanilla LLaMA backbone on CamS sequences. It achieves state-of-the-art performance on MoleculeNet and the activity-cliff benchmark MoleculeACE, outperforming both SMILES-based language models and strong graph baselines. Interpretability analysis confirms that our multi-scale causal serialization effectively drives attention toward cliff-determining differences.
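A toy sketch of scaffold-rooted BFS serialization: starting from a scaffold motif, motif tokens are emitted in breadth-first order so the sequence has a stable core-to-periphery layout. The motif graph below is invented purely for illustration.

```python
# Sketch: serialize a motif graph into a token sequence by BFS rooted at the scaffold.
from collections import deque

motif_graph = {                      # adjacency between (hypothetical) motif tokens
    "benzene": ["carboxyl", "amine"],
    "carboxyl": ["benzene"],
    "amine": ["benzene", "methyl"],
    "methyl": ["amine"],
}

def scaffold_bfs(graph: dict, scaffold: str) -> list[str]:
    order, seen, queue = [], {scaffold}, deque([scaffold])
    while queue:
        motif = queue.popleft()
        order.append(motif)
        for nxt in graph[motif]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

print(scaffold_bfs(motif_graph, scaffold="benzene"))   # core first, periphery last
```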
【5】mHC-GNN: Manifold-Constrained Hyper-Connections for Graph Neural Networks
标题:mHC-GNN:图神经网络的流形约束超连接
链接:https://arxiv.org/abs/2601.02451
作者:Subhankar Mishra
摘要:图神经网络(GNN)在深度架构中存在过平滑问题,并且表达能力受到1-Weisfeiler-Leman(1-WL)测试的限制。我们将最近为Transformer提出的流形约束超连接(mHC)(Xie et al., 2025)适配到图神经网络。我们的方法mHC-GNN将节点表示扩展到$n$个并行流中,并通过Sinkhorn-Knopp规范化将流混合矩阵约束在Birkhoff多面体上。我们证明,mHC-GNN表现出指数级更慢的过平滑(速率为$(1-\gamma)^{L/n}$而非$(1-\gamma)^L$),并能区分超出1-WL的图。在4种GNN架构、10个数据集上的实验表明了一致的改进。从2到128层的深度实验表明,标准GNN在超过16层时会崩溃到接近随机的性能,而mHC-GNN即使在128层时也能保持超过74%的准确率,在极端深度下提升超过50个百分点。消融实验证实,流形约束是必不可少的:去除它会导致高达82%的性能下降。代码可在https://github.com/smlab-niser/mhc-gnn获得。
摘要:Graph Neural Networks (GNNs) suffer from over-smoothing in deep architectures and expressiveness bounded by the 1-Weisfeiler-Leman (1-WL) test. We adapt Manifold-Constrained Hyper-Connections (mHC) (Xie et al., 2025), recently proposed for Transformers, to graph neural networks. Our method, mHC-GNN, expands node representations across $n$ parallel streams and constrains stream-mixing matrices to the Birkhoff polytope via Sinkhorn-Knopp normalization. We prove that mHC-GNN exhibits exponentially slower over-smoothing (rate $(1-\gamma)^{L/n}$ vs. $(1-\gamma)^L$) and can distinguish graphs beyond 1-WL. Experiments on 10 datasets with 4 GNN architectures show consistent improvements. Depth experiments from 2 to 128 layers reveal that standard GNNs collapse to near-random performance beyond 16 layers, while mHC-GNN maintains over 74\% accuracy even at 128 layers, with improvements exceeding 50 percentage points at extreme depths. Ablations confirm that the manifold constraint is essential: removing it causes up to 82\% performance degradation. Code is available at https://github.com/smlab-niser/mhc-gnn
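A sketch of the Sinkhorn-Knopp projection used to keep the stream-mixing matrices approximately doubly stochastic, i.e. inside the Birkhoff polytope; the fixed iteration count is an assumption, not the paper's setting.

```python
# Sketch: alternating row/column normalization drives a positive matrix toward double stochasticity.
import torch

def sinkhorn_knopp(logits: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    m = torch.exp(logits)                    # positive matrix
    for _ in range(n_iters):
        m = m / m.sum(dim=1, keepdim=True)   # normalize rows
        m = m / m.sum(dim=0, keepdim=True)   # normalize columns
    return m

mix = sinkhorn_knopp(torch.randn(4, 4))      # candidate stream-mixing matrix
print(mix.sum(dim=0), mix.sum(dim=1))        # both close to all-ones vectors
```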
【6】Spiking Heterogeneous Graph Attention Networks
标题:尖峰异构图注意力网络
链接:https://arxiv.org/abs/2601.02401
作者:Buqing Cao,Qian Peng,Xiang Xie,Liang Chen,Min Shi,Jianxun Liu
备注:This paper has been accepted by AAAI 2026
摘要:现实世界的图或网络通常是异构的,涉及多种类型的节点和关系。异构图神经网络(HGNNs)可以有效地处理这些不同的节点和边,捕获图中的异构信息,从而表现出出色的性能。然而,HGNN的大多数方法通常涉及复杂的结构设计,导致诸如高内存使用、长推理时间和大量消耗计算资源等问题。这些限制对HGNN的实际应用提出了某些挑战,特别是对于资源受限的设备。为了缓解这个问题,我们提出了尖峰异构图注意力网络(SpikingHAN),它将尖峰神经网络(SNN)的大脑启发和节能特性融入异构图学习中,以降低计算成本而不影响性能。具体来说,SpikingHAN使用具有共享参数的单层图卷积来聚合基于元路径的邻居信息。然后,它采用了语义级的注意力机制,以捕捉不同的元路径的重要性,并执行语义聚合。最后,通过SNN将异构信息编码成尖峰序列,模拟生物信息学处理,得到异构图的二值化1位表示。从三个真实世界的异构图数据集的综合实验结果表明,SpikingHAN提供有竞争力的节点分类性能。它通过更少的参数、更快的推理、更少的内存使用和更低的能耗来实现这一目标。代码可在https://github.com/QianPeng369/SpikingHAN上获得。
摘要:Real-world graphs or networks are usually heterogeneous, involving multiple types of nodes and relationships. Heterogeneous graph neural networks (HGNNs) can effectively handle these diverse nodes and edges, capturing heterogeneous information within the graph, thus exhibiting outstanding performance. However, most methods of HGNNs usually involve complex structural designs, leading to problems such as high memory usage, long inference time, and extensive consumption of computing resources. These limitations pose certain challenges for the practical application of HGNNs, especially for resource-constrained devices. To mitigate this issue, we propose the Spiking Heterogeneous Graph Attention Networks (SpikingHAN), which incorporates the brain-inspired and energy-saving properties of Spiking Neural Networks (SNNs) into heterogeneous graph learning to reduce the computing cost without compromising the performance. Specifically, SpikingHAN aggregates metapath-based neighbor information using a single-layer graph convolution with shared parameters. It then employs a semantic-level attention mechanism to capture the importance of different meta-paths and performs semantic aggregation. Finally, it encodes the heterogeneous information into a spike sequence through SNNs, simulating bioinformatic processing to derive a binarized 1-bit representation of the heterogeneous graph. Comprehensive experimental results from three real-world heterogeneous graph datasets show that SpikingHAN delivers competitive node classification performance. It achieves this with fewer parameters, quicker inference, reduced memory usage, and lower energy consumption. Code is available at https://github.com/QianPeng369/SpikingHAN.
Transformer(3篇)
【1】From Muscle to Text with MyoText: sEMG to Text via Finger Classification and Transformer-Based Decoding
标题:使用MyoText从肌肉到文本:通过手指分类和基于Transformer的解码从sEMG到文本
链接:https://arxiv.org/abs/2601.03098
作者:Meghna Roy Chowdhury,Shreyas Sen,Yi Ding
备注:25 pages, 11 tables, 11 figures
摘要:表面肌电图(sEMG)为解码肌肉活动提供了直接的神经接口,并为可穿戴和混合现实系统中的无键盘文本输入提供了有前途的基础。以前的sEMG到文本的研究主要集中在直接从sEMG信号识别字母,形成了将肌肉活动转化为文本的重要的第一步。在此基础上,我们提出了MyoText,一个分层的框架,通过有生理学依据的中间阶段将sEMG信号解码为文本。MyoText首先使用CNN-BiLSTM-Attention模型对来自多通道sEMG的手指激活进行分类,应用人体工程学打字先验来推断字母,并使用微调的T5 Transformer重建完整的句子。这种模块化设计反映了打字的自然层次,将肌肉意图与语言输出联系起来,并减少了解码的搜索空间。在emg2qwerty数据集的30名用户上进行评估后,MyoText的表现优于基线,手指分类准确率为85.4%,字符错误率(CER)为5.4%,单词错误率(WER)为6.5%。除了提高准确性之外,这种方法还建立了从神经肌肉信号到文本的原则性路径,为完全无需物理键盘即可操作的虚拟和增强现实打字界面提供了蓝图。通过将人体工程学结构与基于Transformer的语言推理相结合,MyoText推进了未来普适计算环境中无缝、可穿戴神经输入的可行性。
摘要:Surface electromyography (sEMG) provides a direct neural interface for decoding muscle activity and offers a promising foundation for keyboard-free text input in wearable and mixed-reality systems. Previous sEMG-to-text studies mainly focused on recognizing letters directly from sEMG signals, forming an important first step toward translating muscle activity into text. Building on this foundation, we present MyoText, a hierarchical framework that decodes sEMG signals to text through physiologically grounded intermediate stages. MyoText first classifies finger activations from multichannel sEMG using a CNN-BiLSTM-Attention model, applies ergonomic typing priors to infer letters, and reconstructs full sentences with a fine-tuned T5 transformer. This modular design mirrors the natural hierarchy of typing, linking muscle intent to language output and reducing the search space for decoding. Evaluated on 30 users from the emg2qwerty dataset, MyoText outperforms baselines by achieving 85.4% finger-classification accuracy, 5.4% character error rate (CER), and 6.5% word error rate (WER). Beyond accuracy gains, this methodology establishes a principled pathway from neuromuscular signals to text, providing a blueprint for virtual and augmented-reality typing interfaces that operate entirely without physical keyboards. By integrating ergonomic structure with transformer-based linguistic reasoning, MyoText advances the feasibility of seamless, wearable neural input for future ubiquitous computing environments.
【2】TAP-ViTs: Task-Adaptive Pruning for On-Device Deployment of Vision Transformers
标题:TAP-ViTs:用于Vision Transformer设备部署的任务自适应修剪
链接:https://arxiv.org/abs/2601.02437
作者:Zhibo Wang,Zuoyuan Zhang,Xiaoyi Pang,Qile Zhang,Xuanyi Hao,Shuguo Zhuo,Peng Sun
摘要:Vision Transformers(ViT)在各种视觉任务中表现出了强大的性能,但其大量的计算和内存需求阻碍了在资源受限的移动和边缘设备上的高效部署。修剪已经成为降低ViT复杂性的一个有前途的方向。然而,现有的方法要么(i)产生跨所有设备共享的单个修剪模型,忽略设备异构性,要么(ii)依赖于对设备本地数据的微调,这通常由于有限的设备上资源和严格的隐私约束而不可行。因此,当前方法达不到在隐私保护移动计算设置中实现任务定制的ViT修剪。本文介绍了TAP-ViTs,一种新的任务自适应修剪框架,生成特定于设备的修剪ViT模型,而无需访问任何原始本地数据。具体而言,在隐私约束下推断设备级任务特征,我们提出了一个基于高斯混合模型(GMM)的度量数据集构建机制。每个设备都适合一个轻量级GMM来近似其私有数据分布,并仅上传GMM参数。使用这些参数,云从公共数据中选择分布一致的样本,为每个设备构建任务代表性指标数据集。基于此代理数据集,我们进一步开发了一种基于双粒度重要性评估的修剪策略,该策略联合测量复合神经元重要性和自适应层重要性,从而实现针对每个设备的计算预算量身定制的细粒度任务感知修剪。在多个ViT主干和数据集上进行的大量实验表明,在可比的压缩比下,TAP-ViTs始终优于最先进的修剪方法。
摘要:Vision Transformers (ViTs) have demonstrated strong performance across a wide range of vision tasks, yet their substantial computational and memory demands hinder efficient deployment on resource-constrained mobile and edge devices. Pruning has emerged as a promising direction for reducing ViT complexity. However, existing approaches either (i) produce a single pruned model shared across all devices, ignoring device heterogeneity, or (ii) rely on fine-tuning with device-local data, which is often infeasible due to limited on-device resources and strict privacy constraints. As a result, current methods fall short of enabling task-customized ViT pruning in privacy-preserving mobile computing settings. This paper introduces TAP-ViTs, a novel task-adaptive pruning framework that generates device-specific pruned ViT models without requiring access to any raw local data. Specifically, to infer device-level task characteristics under privacy constraints, we propose a Gaussian Mixture Model (GMM)-based metric dataset construction mechanism. Each device fits a lightweight GMM to approximate its private data distribution and uploads only the GMM parameters. Using these parameters, the cloud selects distribution-consistent samples from public data to construct a task-representative metric dataset for each device. Based on this proxy dataset, we further develop a dual-granularity importance evaluation-based pruning strategy that jointly measures composite neuron importance and adaptive layer importance, enabling fine-grained, task-aware pruning tailored to each device's computational budget. Extensive experiments across multiple ViT backbones and datasets demonstrate that TAP-ViTs consistently outperforms state-of-the-art pruning methods under comparable compression ratios.
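A minimal sketch of the GMM-based metric-dataset construction described above, assuming scikit-learn and synthetic feature vectors; the abstract does not specify TAP-ViTs' feature space, GMM configuration, or selection rule, so those choices below are illustrative assumptions.
```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-ins for a device's private features and the cloud's public pool
# (in practice these would be embeddings of the device's task data).
X_private = rng.normal(loc=2.0, scale=0.5, size=(500, 16))
X_public = rng.normal(loc=0.0, scale=2.0, size=(10_000, 16))

# Device side: fit a lightweight GMM and share only its parameters.
gmm = GaussianMixture(n_components=3, covariance_type="diag", random_state=0)
gmm.fit(X_private)

# Cloud side (the fitted object is reused here for brevity; in a real deployment
# only the GMM parameters would be uploaded and the model rebuilt from them).
# Keep the public samples whose likelihood under the device's distribution is highest.
log_lik = gmm.score_samples(X_public)      # per-sample log-likelihood
metric_size = 1_000
idx = np.argsort(log_lik)[-metric_size:]   # distribution-consistent subset
metric_dataset = X_public[idx]
print(metric_dataset.shape)                # (1000, 16)
```
The selected subset then serves as the task-representative proxy on which importance scores for pruning would be evaluated.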
【3】Physical Transformer
标题:物理Transformer
链接:https://arxiv.org/abs/2601.02433
作者:Tao Xu,Zhixin Hu,Li Luo,Momiao Xiong
备注:38 pages, 2 figures
摘要:数字AI系统跨越大型语言模型、视觉模型和生成架构,主要在符号、语言或像素领域运行。他们已经取得了惊人的进步,但几乎所有这些进步都存在于虚拟空间中。这些系统转换嵌入和令牌,但本身不接触世界,很少承认物理解释。在这项工作中,我们提出了一个物理Transformer耦合现代Transformer风格的计算与几何表示和物理动力学。在微观层面上,注意头,和前馈块被建模为相互作用的自旋有效哈密顿加上非哈密顿浴项。在中观层次上,它们的聚合状态在Hamilton流和Hamilton,Jacobi,Bellman(HJB)最优控制下的学习神经微分流形(NDM)上演化,通过辛层离散化,近似保持几何和能量不变量。在宏观层面上,该模型保持了一个生成的语义工作空间和一个二维的信息相画像,跟踪不确定性和信息增益的推理轨迹。在这个层次结构中,推理任务被制定为控制信息流的流形上,与解决方案对应的低成本轨迹,满足几何,精力充沛,和工作空间的一致性约束。在涉及数值积分和动力系统的简单玩具问题上,物理Transformer在稳定性和长期精度方面优于朴素基线,突出了尊重底层几何和哈密顿结构的好处。更广泛地说,该框架提出了一条通往物理人工智能的道路,将数字推理与物理基础流形统一起来,开辟了一条通往更可解释和潜在的统一推理,控制和与现实世界互动模型的道路。
摘要:Digital AI systems, spanning large language models, vision models, and generative architectures, operate primarily in symbolic, linguistic, or pixel domains. They have achieved striking progress, but almost all of this progress lives in virtual spaces. These systems transform embeddings and tokens, yet do not themselves touch the world and rarely admit a physical interpretation. In this work we propose a physical transformer that couples modern transformer-style computation with geometric representation and physical dynamics. At the micro level, attention heads and feed-forward blocks are modeled as interacting spins governed by effective Hamiltonians plus non-Hamiltonian bath terms. At the meso level, their aggregated state evolves on a learned Neural Differential Manifold (NDM) under Hamiltonian flows and Hamilton-Jacobi-Bellman (HJB) optimal control, discretized by symplectic layers that approximately preserve geometric and energetic invariants. At the macro level, the model maintains a generative semantic workspace and a two-dimensional information-phase portrait that tracks uncertainty and information gain over a reasoning trajectory. Within this hierarchy, reasoning tasks are formulated as controlled information flows on the manifold, with solutions corresponding to low-cost trajectories that satisfy geometric, energetic, and workspace-consistency constraints. On simple toy problems involving numerical integration and dynamical systems, the physical transformer outperforms naive baselines in stability and long-horizon accuracy, highlighting the benefits of respecting underlying geometric and Hamiltonian structure. More broadly, the framework suggests a path toward physical AI that unifies digital reasoning with physically grounded manifolds, opening a route to more interpretable and potentially unified models of reasoning, control, and interaction with the real world.
GAN|对抗|攻击|生成相关(3篇)
【1】Decentralized Autoregressive Generation
标题:去中心化的自回归一代
链接:https://arxiv.org/abs/2601.03184
作者:Stepan Maschan,Haoxuan Qu,Jun Liu
备注:Work in progress
摘要:本文对自回归生成的分散性进行了理论分析。我们定义的分散离散流匹配的目标,通过表达概率生成速度作为一个线性组合的专家流。我们还进行了实验,证明了分散式和集中式训练设置之间的等效性,多模态语言模型跨越不同的基准集。具体来说,我们比较了两种不同的范例:LLaVA和InternVL 2.5-1B,后者使用固定的CLIP视觉编码器,并在指令调整阶段进行全参数微调(ViT+MLP+LLM)。
摘要:We present a theoretical analysis of decentralization of autoregressive generation. We define the Decentralized Discrete Flow Matching objective by expressing the probability-generating velocity as a linear combination of expert flows. We also conduct experiments demonstrating the equivalence between decentralized and centralized training settings for multimodal language models across a diverse set of benchmarks. Specifically, we compare two distinct paradigms: LLaVA and InternVL 2.5-1B, which uses a fixed CLIP vision encoder and performs full-parameter fine-tuning (ViT+MLP+LLM) during the instruction tuning stage.
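A schematic LaTeX form of the "linear combination of expert flows" mentioned above, assuming each expert k supplies its own generating velocity field mixed with convex weights; the paper's exact objective is not reproduced here.
```latex
% Schematic only: K expert velocity fields mixed with convex weights.
u_t(x) \;=\; \sum_{k=1}^{K} w_k \, u_t^{(k)}(x),
\qquad w_k \ge 0, \quad \sum_{k=1}^{K} w_k = 1 .
```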
【2】Flow Matching and Diffusion Models via PointNet for Generating Fluid Fields on Irregular Geometries
标题:基于PointNet的流匹配和扩散模型用于在不规则几何体上生成流场
链接:https://arxiv.org/abs/2601.03030
作者:Ali Kashefi
摘要:我们提出了两个新的生成式几何深度学习框架,称为Flow Matching PointNet和Diffusion PointNet,用于通过将PointNet分别纳入流动匹配和扩散模型来预测不规则几何形状上的流体流动变量。在这些框架中,反向生成过程从以看不见的几何形状为条件的标准高斯噪声中重建物理场。所提出的方法直接对计算域的点云表示进行操作(例如,有限体积网格的网格顶点),因此避免了用于将几何形状投影到均匀网格上的像素的限制。与基于图形神经网络的扩散模型相比,Flow Matching PointNet和Diffusion PointNet在预测字段中不会显示高频噪声伪影。此外,与这些方法不同,这些方法需要辅助中间网络来调节几何形状,所提出的框架仅依赖于PointNet,从而形成简单而统一的架构。所提出的框架的性能进行评估稳定的不可压缩流通过一个圆柱体,使用几何数据集构造不同的圆柱体的横截面形状和方向的样品。结果表明,流量匹配PointNet和扩散PointNet实现了速度和压力场,以及升力和阻力的更准确的预测,并表现出更大的鲁棒性不完整的几何形状相比,香草PointNet具有相同数量的可训练参数。
摘要:We present two novel generative geometric deep learning frameworks, termed Flow Matching PointNet and Diffusion PointNet, for predicting fluid flow variables on irregular geometries by incorporating PointNet into flow matching and diffusion models, respectively. In these frameworks, a reverse generative process reconstructs physical fields from standard Gaussian noise conditioned on unseen geometries. The proposed approaches operate directly on point-cloud representations of computational domains (e.g., grid vertices of finite-volume meshes) and therefore avoid the limitations of pixelation used to project geometries onto uniform lattices. In contrast to graph neural network-based diffusion models, Flow Matching PointNet and Diffusion PointNet do not exhibit high-frequency noise artifacts in the predicted fields. Moreover, unlike such approaches, which require auxiliary intermediate networks to condition geometry, the proposed frameworks rely solely on PointNet, resulting in a simple and unified architecture. The performance of the proposed frameworks is evaluated on steady incompressible flow past a cylinder, using a geometric dataset constructed by varying the cylinder's cross-sectional shape and orientation across samples. The results demonstrate that Flow Matching PointNet and Diffusion PointNet achieve more accurate predictions of velocity and pressure fields, as well as lift and drag forces, and exhibit greater robustness to incomplete geometries compared to a vanilla PointNet with the same number of trainable parameters.
【3】Topology-Independent Robustness of the Weighted Mean under Label Poisoning Attacks in Heterogeneous Decentralized Learning
标题:异构分散学习中加权均值对标签中毒攻击的拓扑无关鲁棒性
链接:https://arxiv.org/abs/2601.02682
作者:Jie Peng,Weiyu Li,Stefan Vlaski,Qing Ling
摘要:对恶意攻击的鲁棒性对于实际的分散信号处理和机器学习系统至关重要。这种攻击的一个典型例子是标签中毒,这意味着一些代理拥有损坏的本地标签,并共享在这些中毒数据上训练的模型。为了抵御恶意攻击,现有的作品往往集中在设计强大的聚合器,同时,加权平均聚合器通常被认为是一个简单的,脆弱的基线。本文分析了分散梯度下降算法在标签中毒攻击下的鲁棒性,同时考虑了鲁棒性和加权均值聚集器。理论结果表明,鲁棒聚合器的学习误差依赖于网络拓扑结构,而加权平均聚合器的性能是拓扑无关的。值得注意的是,加权平均聚合器,虽然通常被认为是脆弱的,但在足够的异质性下,可以胜过鲁棒聚合器,特别是当:(i)全局污染率(即,整个网络的中毒试剂的分数)小于局部污染率(即,规则代理的中毒邻居的最大分数);(ii)规则代理的网络是断开的;或者(iii)规则代理的网络是稀疏的并且局部污染率高。实证结果支持我们的理论研究结果,突出了网络拓扑结构的鲁棒性标签中毒攻击的重要作用。
摘要:Robustness to malicious attacks is crucial for practical decentralized signal processing and machine learning systems. A typical example of such attacks is label poisoning, meaning that some agents possess corrupted local labels and share models trained on these poisoned data. To defend against malicious attacks, existing works often focus on designing robust aggregators; meanwhile, the weighted mean aggregator is typically considered a simple, vulnerable baseline. This paper analyzes the robustness of decentralized gradient descent under label poisoning attacks, considering both robust and weighted mean aggregators. Theoretical results reveal that the learning errors of robust aggregators depend on the network topology, whereas the performance of weighted mean aggregator is topology-independent. Remarkably, the weighted mean aggregator, although often considered vulnerable, can outperform robust aggregators under sufficient heterogeneity, particularly when: (i) the global contamination rate (i.e., the fraction of poisoned agents for the entire network) is smaller than the local contamination rate (i.e., the maximal fraction of poisoned neighbors for the regular agents); (ii) the network of regular agents is disconnected; or (iii) the network of regular agents is sparse and the local contamination rate is high. Empirical results support our theoretical findings, highlighting the important role of network topology in the robustness to label poisoning attacks.
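A minimal numpy sketch of decentralized gradient descent with the weighted mean aggregator analyzed above, assuming a row-stochastic mixing matrix; the ring-like weights and toy data are illustrative rather than the paper's experimental setup.
```python
import numpy as np

def decentralized_gd_step(models, grads, W, lr=0.1):
    """One step of decentralized gradient descent with weighted mean aggregation.

    models : (n_agents, dim) array of local model parameters
    grads  : (n_agents, dim) array of local gradients (possibly poisoned)
    W      : (n_agents, n_agents) row-stochastic mixing matrix
    """
    local = models - lr * grads   # each agent takes a local gradient step
    return W @ local              # then averages over its neighbors with weights W

# Toy example: 5 agents on a ring, 3-dimensional model.
n, d = 5, 3
rng = np.random.default_rng(1)
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

models = rng.normal(size=(n, d))
grads = rng.normal(size=(n, d))
models = decentralized_gd_step(models, grads, W)
print(models.shape)
```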
半/弱/无/有监督|不确定性|主动学习(3篇)
【1】PET-TURTLE: Deep Unsupervised Support Vector Machines for Imbalanced Data Clusters
标题:PET-TUTLE:用于不平衡数据集群的深度无监督支持载体机
链接:https://arxiv.org/abs/2601.03237
作者:Javier Salazar Cavazos
摘要:基础视觉、音频和语言模型通过其潜在表示实现下游任务的zero-shot性能。最近,使用深度学习方法对数据组结构进行无监督学习已经越来越流行。TURTLE是一种最先进的深度聚类算法,它通过交替的标签和超平面更新来发现数据标签,从而最大化超平面边缘,而无需监督,其方式与支持向量机(SVM)类似。然而,TURTLE假设聚类是平衡的;当数据不平衡时,它会产生非理想的超平面,导致更高的聚类错误。我们提出了PET-TURTLE,它概括了成本函数,以处理不平衡的数据分布的幂律先验。此外,通过在标记过程中引入稀疏对数,PET-TURTLE优化了一个更简单的搜索空间,从而提高了平衡数据集的准确性。在合成数据和真实数据上的实验表明,PET-TURTLE算法提高了对不平衡源的准确性,防止了对少数聚类的过度预测,并增强了整体聚类。
摘要:Foundation vision, audio, and language models enable zero-shot performance on downstream tasks via their latent representations. Recently, unsupervised learning of data group structure with deep learning methods has gained popularity. TURTLE, a state of the art deep clustering algorithm, uncovers data labeling without supervision by alternating label and hyperplane updates, maximizing the hyperplane margin, in a similar fashion to support vector machines (SVMs). However, TURTLE assumes clusters are balanced; when data is imbalanced, it yields non-ideal hyperplanes that cause higher clustering error. We propose PET-TURTLE, which generalizes the cost function to handle imbalanced data distributions by a power law prior. Additionally, by introducing sparse logits in the labeling process, PET-TURTLE optimizes a simpler search space that in turn improves accuracy for balanced datasets. Experiments on synthetic and real data show that PET-TURTLE improves accuracy for imbalanced sources, prevents over-prediction of minority clusters, and enhances overall clustering.
【2】Self-Supervised Learning from Noisy and Incomplete Data
标题:从噪音和不完整数据中进行自我监督学习
链接:https://arxiv.org/abs/2601.03244
作者:Julián Tachella,Mike Davies
摘要:科学和工程中的许多重要问题涉及从噪声和/或不完整的观测中推断信号,其中观测过程是已知的。从历史上看,这个问题已经使用手工制作的正则化(例如,稀疏性、总变差)以获得有意义的估计。最近的数据驱动方法通常通过直接从地面实况信号和相关观测的示例中学习求解器来提供更好的解决方案。然而,在许多现实世界的应用中,获得用于训练的地面实况参考是昂贵的或不可能的。自监督学习方法提供了一种很有前途的替代方案,它只从测量数据中学习求解器,而不需要地面实况参考。这份手稿提供了一个全面的总结不同的自监督方法的反问题,特别强调其理论基础,并提出了实际应用中的成像反问题。
摘要:Many important problems in science and engineering involve inferring a signal from noisy and/or incomplete observations, where the observation process is known. Historically, this problem has been tackled using hand-crafted regularization (e.g., sparsity, total-variation) to obtain meaningful estimates. Recent data-driven methods often offer better solutions by directly learning a solver from examples of ground-truth signals and associated observations. However, in many real-world applications, obtaining ground-truth references for training is expensive or impossible. Self-supervised learning methods offer a promising alternative by learning a solver from measurement data alone, bypassing the need for ground-truth references. This manuscript provides a comprehensive summary of different self-supervised methods for inverse problems, with a special emphasis on their theoretical underpinnings, and presents practical applications in imaging inverse problems.
【3】Shallow-circuit Supervised Learning on a Quantum Processor
标题:量子处理器上的浅层电路监督学习
链接:https://arxiv.org/abs/2601.03235
作者:Luca Candelori,Swarnadeep Majumder,Antonio Mezzacapo,Javier Robledo Moreno,Kharen Musaelian,Santhanam Nagarajan,Sunil Pinnamaneni,Kunal Sharma,Dario Villani
摘要:量子计算长期以来一直承诺在数据分析方面取得变革性进展,但实际的量子机器学习仍然难以捉摸,这是由于一些基本障碍,例如经典数据加载的量子成本过高,以及许多为近期量子硬件设计的量子机器学习算法的可训练性差。在这项工作中,我们表明,人们可以克服这些障碍,通过使用线性哈密顿为基础的机器学习方法,它提供了一个紧凑的量子表示的经典数据通过基态问题的k-本地哈密顿。我们使用最近的基于样本的Krylov量子对角化方法来计算数据哈密顿量的低能态,其参数通过局部梯度来训练以表达经典数据集。我们通过使用高达50个量子比特的IBM Heron量子处理器对基准数据集进行实验,证明了该方法的有效性和可扩展性。
摘要:Quantum computing has long promised transformative advances in data analysis, yet practical quantum machine learning has remained elusive due to fundamental obstacles such as a steep quantum cost for the loading of classical data and poor trainability of many quantum machine learning algorithms designed for near-term quantum hardware. In this work, we show that one can overcome these obstacles by using a linear Hamiltonian-based machine learning method which provides a compact quantum representation of classical data via ground state problems for k-local Hamiltonians. We use the recent sample-based Krylov quantum diagonalization method to compute low-energy states of the data Hamiltonians, whose parameters are trained to express classical datasets through local gradients. We demonstrate the efficacy and scalability of the methods by performing experiments on benchmark datasets using up to 50 qubits of an IBM Heron quantum processor.
迁移|Zero/Few/One-Shot|自适应(3篇)
【1】Real-Time Adaptive Anomaly Detection in Industrial IoT Environments
标题:工业物联网环境中的实时自适应异常检测
链接:https://arxiv.org/abs/2601.03085
作者:Mahsa Raeiszadeh,Amin Ebrahimzadeh,Roch H. Glitho,Johan Eker,Raquel A. F. Mini
摘要:为了确保可靠性和服务可用性,下一代网络预计将依赖于由先进的机器学习方法提供支持的自动异常检测系统,该系统具有处理多维数据的能力。这种多维、异构的数据主要出现在当今的工业物联网(IIoT)中,实时检测异常对于防止即将发生的故障并及时解决这些故障至关重要。然而,现有的异常检测方法往往不能有效地应对工业物联网中多维数据流的复杂性和动态性。在本文中,我们提出了一种利用多源预测模型和概念漂移自适应来检测IIoT流数据中异常的自适应方法。所提出的异常检测算法合并到一个新的漂移自适应方法的预测模型,导致准确和高效的异常检测,表现出更好的可扩展性。我们的跟踪驱动的评估表明,所提出的方法优于国家的最先进的异常检测方法,实现高达89.71%的准确性(在曲线下面积(AUC)),同时满足给定的效率和可扩展性的要求。
摘要:To ensure reliability and service availability, next-generation networks are expected to rely on automated anomaly detection systems powered by advanced machine learning methods with the capability of handling multi-dimensional data. Such multi-dimensional, heterogeneous data occurs mostly in today's industrial Internet of Things (IIoT), where real-time detection of anomalies is critical to prevent impending failures and resolve them in a timely manner. However, existing anomaly detection methods often fall short of effectively coping with the complexity and dynamism of multi-dimensional data streams in IIoT. In this paper, we propose an adaptive method for detecting anomalies in IIoT streaming data utilizing a multi-source prediction model and concept drift adaptation. The proposed anomaly detection algorithm merges a prediction model into a novel drift adaptation method resulting in accurate and efficient anomaly detection that exhibits improved scalability. Our trace-driven evaluations indicate that the proposed method outperforms the state-of-the-art anomaly detection methods by achieving up to an 89.71% accuracy (in terms of Area under the Curve (AUC)) while meeting the given efficiency and scalability requirements.
【2】MixTTE: Multi-Level Mixture-of-Experts for Scalable and Adaptive Travel Time Estimation
标题:MixTTE:用于可扩展和自适应旅行时间估计的多级别专家混合
链接:https://arxiv.org/abs/2601.02943
作者:Wenzhao Jiang,Jindong Han,Ruiqian Han,Hao Liu
备注:Accepted to KDD 2026
摘要:准确的行程时间估计(TTE)对于网约车平台至关重要,其中错误直接影响用户体验和运营效率。虽然现有的生产系统擅长于整体的路线级依赖建模,但它们很难捕捉城市规模的交通动态和长尾场景,导致大型城市网络的预测不可靠。在本文中,我们提出了\模型,一个可扩展的和自适应的框架,协同集成链路级建模与工业路由级TTE系统。具体来说,我们提出了一个时空外部注意力模块,以捕捉全球交通动态依赖性跨百万级的道路网络有效。此外,我们构建了一个稳定的图混合专家网络来处理异构的流量模式,同时保持推理效率。此外,异步增量学习策略是定制的,以实现实时和稳定的适应动态流量分布的变化。在真实世界数据集上的实验验证了MixTTE与七个基线相比显着降低了预测误差。MixTTE已部署在DiDi中,大大提高了TTE服务的准确性和稳定性。
摘要:Accurate Travel Time Estimation (TTE) is critical for ride-hailing platforms, where errors directly impact user experience and operational efficiency. While existing production systems excel at holistic route-level dependency modeling, they struggle to capture city-scale traffic dynamics and long-tail scenarios, leading to unreliable predictions in large urban networks. In this paper, we propose MixTTE, a scalable and adaptive framework that synergistically integrates link-level modeling with industrial route-level TTE systems. Specifically, we propose a spatio-temporal external attention module to capture global traffic dynamic dependencies across million-scale road networks efficiently. Moreover, we construct a stabilized graph mixture-of-experts network to handle heterogeneous traffic patterns while maintaining inference efficiency. Furthermore, an asynchronous incremental learning strategy is tailored to enable real-time and stable adaptation to dynamic traffic distribution shifts. Experiments on real-world datasets validate that MixTTE significantly reduces prediction errors compared to seven baselines. MixTTE has been deployed in DiDi, substantially improving the accuracy and stability of the TTE service.
【3】Hierarchical temporal receptive windows and zero-shot timescale generalization in biologically constrained scale-invariant deep networks
标题:生物约束规模不变深度网络中的分层时间接受窗口和Zero-Shot时间尺度概括
链接:https://arxiv.org/abs/2601.02618
作者:Aakash Sarkar,Marc W. Howard
摘要:人类认知在嵌套的时间尺度上整合信息。虽然皮层表现出分层的时间接受窗口(TRW),局部电路往往显示异构的时间常数。为了调和这一点,我们基于尺度不变的海马时间细胞,在模仿语言层次结构的语言分类任务上训练了生物约束的深度网络(例如,“字母”形成“单词”)。首先,使用前馈模型(SITHCon),我们发现TRW的层次结构在各层中自然出现,尽管网络在各层中具有相同的时间常数谱。然后,我们将这些归纳先验提取到一个生物学上合理的递归架构SITH-RNN中。训练从通用RNN到这个受限子集的一系列架构表明,尺度不变的SITH-RNN学习速度更快,参数数量级更少,并且将zero-shot推广到分布时间尺度之外。这些结果表明,大脑使用尺度不变的顺序先验-编码“什么时候”发生了什么-使具有这种先验的递归网络特别适合描述人类认知。
摘要:Human cognition integrates information across nested timescales. While the cortex exhibits hierarchical Temporal Receptive Windows (TRWs), local circuits often display heterogeneous time constants. To reconcile this, we trained biologically constrained deep networks, based on scale-invariant hippocampal time cells, on a language classification task mimicking the hierarchical structure of language (e.g., 'letters' forming 'words'). First, using a feedforward model (SITHCon), we found that a hierarchy of TRWs emerged naturally across layers, despite the network having an identical spectrum of time constants within layers. We then distilled these inductive priors into a biologically plausible recurrent architecture, SITH-RNN. Training a sequence of architectures ranging from generic RNNs to this restricted subset showed that the scale-invariant SITH-RNN learned faster with orders-of-magnitude fewer parameters, and generalized zero-shot to out-of-distribution timescales. These results suggest the brain employs scale-invariant, sequential priors - coding "what" happened "when" - making recurrent networks with such priors particularly well-suited to describe human cognition.
强化学习(2篇)
【1】In-Context Reinforcement Learning through Bayesian Fusion of Context and Value Prior
标题:基于上下文和值先验贝叶斯融合的上下文内强化学习
链接:https://arxiv.org/abs/2601.03015
作者:Anaïs Berkes,Vincent Taboga,Donna Vakalis,David Rolnick,Yoshua Bengio
摘要:情境强化学习(ICRL)承诺在没有参数更新的情况下快速适应未知环境,但目前的方法要么无法超越训练分布,要么需要接近最佳的数据,限制了实际应用。我们介绍了SPICE,一种贝叶斯ICRL方法,它通过深度集成学习Q值的先验,并通过贝叶斯更新使用上下文信息在测试时更新此先验。为了从次优数据训练产生的不良先验中恢复,我们的在线推理遵循有利于探索和适应的置信上限规则。我们证明了SPICE在随机强盗和有限视野MDP中实现了遗憾最优行为,即使只在次优轨迹上进行预训练。我们验证了这些发现经验上的强盗和控制基准。SPICE在看不见的任务上实现了接近最优的决策,与之前的ICRL和元RL方法相比,大大减少了遗憾,同时快速适应看不见的任务,并在分布变化下保持稳健。
摘要:In-context reinforcement learning (ICRL) promises fast adaptation to unseen environments without parameter updates, but current methods either cannot improve beyond the training distribution or require near-optimal data, limiting practical adoption. We introduce SPICE, a Bayesian ICRL method that learns a prior over Q-values via deep ensemble and updates this prior at test-time using in-context information through Bayesian updates. To recover from poor priors resulting from training on sub-optimal data, our online inference follows an Upper-Confidence Bound rule that favours exploration and adaptation. We prove that SPICE achieves regret-optimal behaviour in both stochastic bandits and finite-horizon MDPs, even when pretrained only on suboptimal trajectories. We validate these findings empirically across bandit and control benchmarks. SPICE achieves near-optimal decisions on unseen tasks, substantially reduces regret compared to prior ICRL and meta-RL approaches while rapidly adapting to unseen tasks and remaining robust under distribution shift.
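A toy numpy illustration of the Upper-Confidence-Bound action rule over an ensemble-based Q-value prior; SPICE's Bayesian in-context posterior update is not reproduced, and the ensemble values and exploration coefficient below are placeholders.
```python
import numpy as np

def ucb_action(q_ensemble, beta=1.0):
    """Select an action from an ensemble of Q-value estimates.

    q_ensemble : (n_members, n_actions) array, one row per ensemble member.
    Uses an optimism-in-the-face-of-uncertainty rule: argmax of mean + beta * std.
    """
    mean = q_ensemble.mean(axis=0)
    std = q_ensemble.std(axis=0)
    return int(np.argmax(mean + beta * std))

# Toy example: 10 ensemble members, 4 actions.
rng = np.random.default_rng(0)
q_ensemble = rng.normal(size=(10, 4))
print(ucb_action(q_ensemble, beta=2.0))
```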
【2】SWaRL: Safeguard Code Watermarking via Reinforcement Learning
标题:SWaRL:通过强化学习保护代码水印
链接:https://arxiv.org/abs/2601.02602
作者:Neusha Javidnia,Ruisi Zhang,Ashish Kundu,Farinaz Koushanfar
备注:Under review
摘要:我们提出了SWaRL,一个强大的和保留完整性的水印框架,旨在通过在生成的输出中嵌入唯一的和可验证的签名来保护代码LLM所有者的知识产权。现有的方法依赖于手工制作的转换规则来保留带水印的代码功能或在推理时操纵令牌生成概率,这很容易出现编译错误。为了解决这些挑战,SWaRL采用了一个基于强化学习的协同训练框架,该框架使用编译器反馈来确定功能正确性,并使用联合训练的机密验证器作为奖励信号来保持水印的可检测性。此外,SWaRL在微调期间采用低秩自适应(LoRA),允许学习的水印信息在模型更新之间转移。大量的实验表明,SWaRL实现了更高的水印检测精度相比,以前的方法,同时完全保持水印代码的功能。基于LoRA的签名嵌入引导基础模型以水印特定的方式生成和求解代码,而无需显著的计算开销。此外,SWaRL对重构和对抗性转换攻击具有很强的弹性。
摘要:We present SWaRL, a robust and fidelity-preserving watermarking framework designed to protect the intellectual property of code LLM owners by embedding unique and verifiable signatures in the generated output. Existing approaches rely on manually crafted transformation rules to preserve watermarked code functionality or manipulate token-generation probabilities at inference time, which are prone to compilation errors. To address these challenges, SWaRL employs a reinforcement learning-based co-training framework that uses compiler feedback for functional correctness and a jointly trained confidential verifier as a reward signal to maintain watermark detectability. Furthermore, SWaRL employs low-rank adaptation (LoRA) during fine-tuning, allowing the learned watermark information to be transferable across model updates. Extensive experiments show that SWaRL achieves higher watermark detection accuracy compared to prior methods while fully maintaining watermarked code functionality. The LoRA-based signature embedding steers the base model to generate and solve code in a watermark-specific manner without significant computational overhead. Moreover, SWaRL exhibits strong resilience against refactoring and adversarial transformation attacks.
符号|符号学习(1篇)
【1】hdlib 2.0: Extending Machine Learning Capabilities of Vector-Symbolic Architectures
标题:hdlib 2.0:扩展Vector-Symbolic架构的机器学习能力
链接:https://arxiv.org/abs/2601.02509
作者:Fabio Cumbo,Kabir Dhillon,Daniel Blankenberg
备注:7 pages, 1 figure
摘要:在最初发布hdlib(一个用于设计矢量符号架构(VSA)的Python库)之后,我们引入了一个主要的扩展,可以显着增强其机器学习功能。VSA,也称为超维计算,是一种使用高维向量表示和处理信息的计算范式。虽然hdlib的第一个版本为创建和操作这些向量奠定了坚实的基础,但此更新解决了在VSA框架内对更高级的数据驱动建模日益增长的需求。在这里,我们提出了四个扩展:对现有监督分类模型的显着增强,还可以进行特征选择,以及用于预测连续变量的新回归模型,用于无监督学习的聚类模型和基于图的学习模型。此外,我们提出了第一个实现量子超维计算与量子动力算术运算和一个新的量子机器学习模型的监督学习。hdlib仍然是开源的,可以在GitHub的https://github.com/cumbof/hdlib上获得,并通过Python包索引(pip install hdlib)和Conda(conda install -c conda-forge hdlib)分发。这些新功能的文档和示例可在官方Wiki上获得,网址为https://github.com/cumbof/hdlib/wiki。
摘要:Following the initial publication of hdlib, a Python library for designing Vector-Symbolic Architectures (VSA), we introduce a major extension that significantly enhances its machine learning capabilities. VSA, also known as Hyperdimensional Computing, is a computing paradigm that represents and processes information using high-dimensional vectors. While the first version of hdlib established a robust foundation for creating and manipulating these vectors, this update addresses the growing need for more advanced, data-driven modeling within the VSA framework. Here, we present four extensions: significant enhancements to the existing supervised classification model also enabling feature selection, and a new regression model for predicting continuous variables, a clustering model for unsupervised learning, and a graph-based learning model. Furthermore, we propose the first implementation ever of Quantum Hyperdimensional Computing with quantum-powered arithmetic operations and a new Quantum Machine Learning model for supervised learning. hdlib remains open-source and available on GitHub at https://github.com/cumbof/hdlib under the MIT license, and distributed through the Python Package Index (pip install hdlib) and Conda (conda install -c conda-forge hdlib). Documentation and examples of these new features are available on the official Wiki at https://github.com/cumbof/hdlib/wiki.
医学相关(4篇)
【1】LeafLife: An Explainable Deep Learning Framework with Robustness for Grape Leaf Disease Recognition
标题:LeafLife:一个可解释的深度学习框架,用于葡萄叶病识别
链接:https://arxiv.org/abs/2601.03124
作者:B. M. Shahria Alam,Md. Nasim Ahmed
备注:4 pages, 8 figures, 2025 IEEE International Conference on Signal Processing, Information, Communication and Systems (SPICSCON)
摘要:植物病害诊断对农民的管理选择至关重要,因为植物病害经常降低作物产量和产品质量。为了丰收和农业生产力的提高,葡萄叶病检测非常重要。植物病害数据集包含葡萄叶片病害共4类9,032幅图像,其中3类为叶片病害,另一类为健康叶片。经过严格的预处理后,数据集被分割(70%训练,20%验证,10%测试),并部署了两个预训练模型:InceptionV 3和Xception。Xception显示了96.23%的准确率,这比InceptionV 3更显着。对抗性训练用于增强鲁棒性,以及更高的透明度。整合Grad-CAM以确认叶病。最后,使用Streamlit部署了一个Web应用程序,该应用程序具有热图可视化和具有置信度的预测,用于稳健的葡萄叶病分类。
摘要:Plant disease diagnosis is essential to farmers' management choices because plant diseases frequently lower crop yield and product quality. For harvests to flourish and agricultural productivity to improve, grape leaf disease detection is important. The plant disease dataset contains a total of 9,032 grape leaf images across four classes: three leaf-disease classes and one healthy-leaf class. After rigorous pre-processing, the dataset was split (70% training, 20% validation, 10% testing), and two pre-trained models were deployed: InceptionV3 and Xception. Xception achieves a promising 96.23% accuracy, notably higher than InceptionV3. Adversarial training is used to improve robustness, along with greater transparency. Grad-CAM is integrated to confirm the leaf disease. Finally, a web application was deployed using Streamlit with heatmap visualization and confidence-scored predictions for robust grape leaf disease classification.
【2】Dementia-R1: Reinforced Pretraining and Reasoning from Unstructured Clinical Notes for Real-World Dementia Prognosis
标题:痴呆症-R1:现实世界痴呆症预后的非结构化临床笔记的强化预训练和推理
链接:https://arxiv.org/abs/2601.03018
作者:Choonghan Kim,Hyunmin Hwang,Hangeol Chang,Jaemin Kim,Jinse Park,Jae-Sung Lim,Jong Chul Ye
摘要:虽然大型语言模型(LLM)在临床文本理解方面表现出了很强的性能,但它们在诸如痴呆症预后等纵向预测任务中表现不佳,这些任务需要在多次访问中对复杂的非单调症状轨迹进行推理。标准监督训练缺乏对症状演变的明确注释,而直接强化学习(RL)受到稀疏二进制奖励的阻碍。为了应对这一挑战,我们引入了Dementia-R1,这是一个基于RL的框架,用于根据非结构化临床记录进行纵向痴呆预后。我们的方法采用冷启动RL策略,该策略预训练模型以预测从患者病史中提取的可验证的临床指标,从而增强在确定最终临床状态之前推理疾病进展的能力。大量的实验表明,Dementia-R1在真实世界的非结构化临床数据集上获得了77.03%的F1评分。值得注意的是,在ADNI基准测试中,我们的7 B模型与GPT-4 o相媲美,有效地捕捉了波动的认知轨迹。代码可从https://anonymous.4open.science/r/dementiar1-CDB5获得
摘要:While Large Language Models (LLMs) have shown strong performance on clinical text understanding, they struggle with longitudinal prediction tasks such as dementia prognosis, which require reasoning over complex, non-monotonic symptom trajectories across multiple visits. Standard supervised training lacks explicit annotations for symptom evolution, while direct Reinforcement Learning (RL) is hindered by sparse binary rewards. To address this challenge, we introduce Dementia-R1, an RL-based framework for longitudinal dementia prognosis from unstructured clinical notes. Our approach adopts a Cold-Start RL strategy that pre-trains the model to predict verifiable clinical indices extracted from patient histories, enhancing the capability to reason about disease progression before determining the final clinical status. Extensive experiments demonstrate that Dementia-R1 achieves an F1 score of 77.03% on real-world unstructured clinical datasets. Notably, on the ADNI benchmark, our 7B model rivals GPT-4o, effectively capturing fluctuating cognitive trajectories. Code is available at https://anonymous.4open.science/r/dementiar1-CDB5
【3】CutisAI: Deep Learning Framework for Automated Dermatology and Cancer Screening
标题:CutisAI:自动皮肤病学和癌症筛查的深度学习框架
链接:https://arxiv.org/abs/2601.02562
作者:Rohit Kaushik,Eva Kaushik
备注:10 pages, 3 figures
摘要:皮肤病学成像和移动诊断工具的快速发展要求系统不仅要展示经验性能,还要提供强有力的理论保证。深度学习模型已经显示出很高的预测准确性;然而,它们经常被批评缺乏良好的校准不确定性估计,没有这些模型就很难在临床环境中部署。为此,我们提出了共形贝叶斯皮肤病分类器(CBDC),这是一个结合统计学习理论,拓扑数据分析(TDA)和贝叶斯共形推理的良好框架。CBDC提供了反映皮肤病变异性的分布相关泛化边界,证明了一个拓扑稳定性定理,该定理保证了卷积神经网络嵌入在光度和形态扰动下的不变性,并为可信的不确定性量化提供了有限的保形覆盖保证。 通过对HAM 10000,PH 2和ISIC 2020数据集的详尽实验,我们表明CBDC不仅达到了分类准确性,而且还生成了从临床角度可解释的校准预测。该研究为深度皮肤病诊断提供了理论和实践上的飞跃,从而打开了机器学习理论临床应用的接口。
摘要:The rapid growth of dermatological imaging and mobile diagnostic tools calls for systems that not only demonstrate empirical performance but also provide strong theoretical guarantees. Deep learning models have shown high predictive accuracy; however, they are often criticized for lacking well-calibrated uncertainty estimates, without which these models are hardly deployable in a clinical setting. To this end, we present the Conformal Bayesian Dermatological Classifier (CBDC), a well-founded framework that combines Statistical Learning Theory, Topological Data Analysis (TDA), and Bayesian Conformal Inference. CBDC offers distribution-dependent generalization bounds that reflect dermatological variability, proves a topological stability theorem that guarantees the invariance of convolutional neural network embeddings under photometric and morphological perturbations, and provides finite conformal coverage guarantees for trustworthy uncertainty quantification. Through exhaustive experiments on the HAM10000, PH2, and ISIC 2020 datasets, we show that CBDC not only attains classification accuracy but also generates calibrated predictions that are interpretable from a clinical perspective. This research constitutes a theoretical and practical leap for deep dermatological diagnostics, thereby opening an interface between machine learning theory and clinical applicability.
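For the conformal-coverage ingredient mentioned above, here is a standard split-conformal classification sketch in numpy; CBDC's particular nonconformity score, Bayesian machinery, and topological analysis are not shown, and the probabilities below are random placeholders for a real classifier's outputs.
```python
import numpy as np

def conformal_prediction_sets(probs_cal, y_cal, probs_test, alpha=0.1):
    """Split conformal prediction sets for a generic classifier.

    probs_cal  : (n_cal, n_classes) predicted probabilities on calibration data
    y_cal      : (n_cal,) true calibration labels
    probs_test : (n_test, n_classes) predicted probabilities on new images
    Returns a boolean (n_test, n_classes) mask: classes included in each set.
    """
    n = len(y_cal)
    # Nonconformity score: 1 - probability assigned to the true class.
    scores = 1.0 - probs_cal[np.arange(n), y_cal]
    # Conformal quantile with the usual finite-sample correction.
    q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")
    return probs_test >= 1.0 - q

# Toy example with random "probabilities" standing in for classifier outputs.
rng = np.random.default_rng(0)
probs_cal = rng.dirichlet(np.ones(3), size=200)
y_cal = rng.integers(0, 3, size=200)
probs_test = rng.dirichlet(np.ones(3), size=5)
print(conformal_prediction_sets(probs_cal, y_cal, probs_test, alpha=0.1))
```
Under exchangeability, the returned sets contain the true class with probability at least 1 - alpha.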
【4】Deep Learning Superresolution for 7T Knee MR Imaging: Impact on Image Quality and Diagnostic Performance
标题:7 T膝关节MR成像的深度学习超分辨率:对图像质量和诊断性能的影响
链接:https://arxiv.org/abs/2601.02436
作者:Pinzhen Chen,Libo Xu,Boyang Pan,Jing Li,Yuting Wang,Ran Xiong,Xiaoli Gou,Long Qing,Wenjing Hou,Nan-jie Gong,Wei Chen
摘要:背景资料:深度学习超分辨率(SR)可以提高肌肉骨骼MR图像质量,但其在7T膝关节成像中的诊断价值尚不清楚。目的:比较SR、低分辨率(LR)和高分辨率(HR)7T膝关节MRI的图像质量和诊断性能。方法:在这项前瞻性研究中,42名参与者接受了LR(0.8*0.8*2 mm3)和HR(0.4*0.4*2 mm3)序列的7T膝关节MRI。使用混合注意力Transformer模型从LR数据生成SR图像。三名放射科医生评估了图像质量、解剖结构的显著性和膝关节病变的检测。10例以关节镜为参考。结果:SR图像的总体质量高于LR(中值评分5比4,P<.001),且噪声低于HR;软骨、半月板和韧带的显示在SR中优于LR;关节内病变的检出率和诊断性能(特异性、AUC)在各图像类型间相似(P>=0.095)。结论:深度学习超分辨率改善了7T膝关节MRI的主观图像质量,但与标准LR成像相比,并未提高诊断准确性。
摘要:Background: Deep learning superresolution (SR) may enhance musculoskeletal MR image quality, but its diagnostic value in knee imaging at 7T is unclear. Objectives: To compare image quality and diagnostic performance of SR, low-resolution (LR), and high-resolution (HR) 7T knee MRI. Methods: In this prospective study, 42 participants underwent 7T knee MRI with LR (0.8*0.8*2 mm3) and HR (0.4*0.4*2 mm3) sequences. SR images were generated from LR data using a Hybrid Attention Transformer model. Three radiologists assessed image quality, anatomic conspicuity, and detection of knee pathologies. Arthroscopy served as reference in 10 cases. Results: SR images showed higher overall quality than LR (median score 5 vs 4, P<.001) and lower noise than HR. Visibility of cartilage, menisci, and ligaments was superior in SR compared to LR, while detection rates and diagnostic performance (specificity, AUC) for intra-articular pathology were similar across image types (P>=.095). Conclusions: Deep learning superresolution improved subjective image quality in 7T knee MRI but did not increase diagnostic accuracy compared with standard LR imaging.
蒸馏|知识提取(3篇)
【1】Sparse Knowledge Distillation: A Mathematical Framework for Probability-Domain Temperature Scaling and Multi-Stage Compression
标题:稀疏知识蒸馏:概率域温度缩放和多阶段压缩的数学框架
链接:https://arxiv.org/abs/2601.03195
作者:Aaron R. Flouro,Shawn P. Chadwick
备注:Machine learning theory. Develops an axiomatic, operator-agnostic framework for probability-domain knowledge distillation, including bias--variance analysis of sparse students, homotopy-based multi-stage pruning, $O(1/n)$ convergence guarantees, and equivalence classes of probability-domain softening operators. Theoretical analysis only
摘要:我们开发了一个统一的理论框架,稀疏知识蒸馏概率域软化算子的基础上。虽然等价$p^{1/T} \propto \mathrm{softmax}(z/T)$是众所周知的,但我们的贡献是建立在此基础上的操作员级别的分析框架,而不是等价本身。 该框架包括四个核心组成部分:(i)算子不可知的偏差-方差分解,其特征在于稀疏学生何时优于密集教师,(ii)函数空间中多级修剪的同伦路径形式化,其解释了为什么迭代压缩成功而一次性修剪失败,(iii)收敛保证建立具有显式参数依赖性的n级蒸馏的O(1/n)速率,以及(iv)等价类表征,其识别在容量约束下产生相同学生模型的不同概率域算子。 我们引入了一个公理化的定义的概率域软化算子的基础上排名保持,连续性,熵单调性,身份和边界行为,并表明,多个非等价的运营商家庭满足这些公理。所有的学习理论的保证,以保持一致,在这个运营商类,独立的实施细节。这些结果为黑盒教师蒸馏,部分访问设置,如top $k$截断和纯文本输出,以及隐私保护模型压缩提供了理论基础。
摘要:We develop a unified theoretical framework for sparse knowledge distillation based on probability-domain softening operators. While the equivalence $p^{1/T} \propto \mathrm{softmax}(z/T)$ is well known, our contribution is an operator-level analytical framework built on this foundation rather than the equivalence itself. The framework comprises four core components: (i) operator-agnostic bias--variance decompositions that characterize when sparse students outperform dense teachers, (ii) a homotopy path formalization of multi-stage pruning in function space explaining why iterative compression succeeds where one-shot pruning fails, (iii) convergence guarantees establishing $O(1/n)$ rates for $n$-stage distillation with explicit parameter dependence, and (iv) equivalence class characterizations identifying distinct probability-domain operators that yield identical student models under capacity constraints. We introduce an axiomatic definition of probability-domain softening operators based on ranking preservation, continuity, entropy monotonicity, identity, and boundary behavior, and show that multiple non-equivalent operator families satisfy these axioms. All learning-theoretic guarantees are shown to hold uniformly across this operator class, independent of implementation details. These results provide theoretical grounding for black-box teacher distillation, partial-access settings such as top-$k$ truncation and text-only outputs, and privacy-preserving model compression.
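A quick numerical check of the equivalence quoted above, using numpy with arbitrary illustrative logits: renormalizing the teacher probabilities raised to the power 1/T reproduces softmax of the logits divided by T.
```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

z = np.array([2.0, 0.5, -1.0, 0.1])   # illustrative teacher logits
T = 4.0                                # softening temperature

# Logit-domain softening: softmax(z / T).
p_logit = softmax(z / T)

# Probability-domain softening: renormalize p^(1/T), with p = softmax(z).
p = softmax(z)
p_prob = p ** (1.0 / T)
p_prob = p_prob / p_prob.sum()

assert np.allclose(p_logit, p_prob)    # the two softened distributions coincide
print(p_logit)
```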
【2】When the Coffee Feature Activates on Coffins: An Analysis of Feature Extraction and Steering for Mechanistic Interpretability
标题:当咖啡特征在棺材上激活时:特征提取和机械解释性引导的分析
链接:https://arxiv.org/abs/2601.03047
作者:Raphael Ronge,Markus Maier,Frederick Eberhardt
备注:33 pages (65 with appendix), 1 figure
摘要:Anthropic最近关于机械可解释性的工作声称通过使用稀疏自动编码器(SAE)从其神经激活模式中提取人类可解释的特征来理解和控制大型语言模型。如果成功,这种方法将为人工智能安全的人类监督提供最有希望的途径之一。我们通过使用Llama 3.1的开源SAE复制其主要结果,对这些声明进行了初步压力测试。虽然我们成功地再现了基本的特征提取和转向能力,但我们的调查表明,对于这些主张的普遍性,需要谨慎行事。我们发现,功能转向表现出相当大的脆弱性,层的选择,转向幅度和上下文的敏感性。我们观察到非标准的激活行为,并证明了区分主题相似的功能彼此之间的困难。虽然基于SAE的可解释性在选定的情况下产生令人信服的演示,目前的方法往往达不到安全关键应用所需的系统可靠性。这表明,重点从优先考虑内部表示的可解释性转向可靠的预测和控制模型输出是必要的。我们的工作有助于更细致地理解机械可解释性所取得的成就,并突出了人工智能安全面临的尚未解决的根本挑战。
摘要:Recent work by Anthropic on Mechanistic interpretability claims to understand and control Large Language Models by extracting human-interpretable features from their neural activation patterns using sparse autoencoders (SAEs). If successful, this approach offers one of the most promising routes for human oversight in AI safety. We conduct an initial stress-test of these claims by replicating their main results with open-source SAEs for Llama 3.1. While we successfully reproduce basic feature extraction and steering capabilities, our investigation suggests that major caution is warranted regarding the generalizability of these claims. We find that feature steering exhibits substantial fragility, with sensitivity to layer selection, steering magnitude, and context. We observe non-standard activation behavior and demonstrate the difficulty to distinguish thematically similar features from one another. While SAE-based interpretability produces compelling demonstrations in selected cases, current methods often fall short of the systematic reliability required for safety-critical applications. This suggests a necessary shift in focus from prioritizing interpretability of internal representations toward reliable prediction and control of model output. Our work contributes to a more nuanced understanding of what mechanistic interpretability has achieved and highlights fundamental challenges for AI safety that remain unresolved.
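A minimal sketch of the feature-steering operation that the paper stress-tests, assuming access to a trained SAE's decoder matrix; the tensor sizes, normalization, and hook point below are illustrative assumptions, not the exact open-source SAE setup for Llama 3.1.
```python
import torch

def steer_with_sae_feature(residual, decoder_weight, feature_idx, alpha):
    """Add a scaled SAE feature direction to a residual-stream activation.

    residual       : (d_model,) activation at the chosen layer and token position
    decoder_weight : (n_features, d_model) SAE decoder matrix
    feature_idx    : index of the feature to amplify (alpha > 0) or suppress (alpha < 0)
    """
    direction = decoder_weight[feature_idx]
    direction = direction / direction.norm()
    return residual + alpha * direction

# Small stand-in tensors for a real model's activations and a trained SAE.
torch.manual_seed(0)
residual = torch.randn(256)
decoder_weight = torch.randn(1024, 256)
steered = steer_with_sae_feature(residual, decoder_weight, feature_idx=123, alpha=5.0)
print(steered.shape)
```
The fragility reported above corresponds to how sensitive downstream generations are to the choice of layer, feature_idx, and alpha in exactly this kind of intervention.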
【3】Compressed code: the hidden effects of quantization and distillation on programming tokens
标题:压缩代码:量化和蒸馏对编程标记的隐藏影响
链接:https://arxiv.org/abs/2601.02563
作者:Viacheslav Siniaev,Iaroslav Chelombitko,Aleksey Komissarov
备注:18 pages, 1 figure and 6 tables
摘要:大型语言模型(LLM)已经展示了出色的代码生成能力,但它们的标记级机制仍然没有得到充分的研究,特别是在压缩模型中。通过对编程语言标记表示的系统分析,我们通过分析其词汇分布和关键字覆盖模式来描述编程语言如何在LLM标记器中编码。我们介绍了一种新的冷启动概率分析方法,提供了洞察模型的行为,而不需要明确的提示。此外,我们还对不同的模型优化技术(包括量化、蒸馏、模型缩放和特定于任务的微调)如何影响令牌级表示和代码生成质量进行了全面评估。我们的实验,全面的概率分布分析和评估指标的支持下,揭示了关键的洞察令牌级的行为,并提供了验证性的指导方针,在各种优化约束下保持代码生成质量。这些发现推进了对LLM代码生成的理论理解和在生产环境中优化模型的实际实现。
摘要:Large Language Models (LLMs) have demonstrated exceptional code generation capabilities, yet their token-level mechanisms remain underexplored, particularly in compressed models. Through systematic analysis of programming language token representations, we characterize how programming languages are encoded in LLM tokenizers by analyzing their vocabulary distribution and keyword coverage patterns. We introduce a novel cold-start probability analysis method that provides insights into model behavior without requiring explicit prompts. Additionally, we present a comprehensive evaluation of how different model optimization techniques - including quantization, distillation, model scaling, and task-specific fine-tuning - affect token-level representations and code generation quality. Our experiments, supported by comprehensive probability distribution analysis and evaluation metrics, reveal critical insights into token-level behavior and provide empirically-validated guidelines for maintaining code generation quality under various optimization constraints. These findings advance both theoretical understanding of LLM code generation and practical implementation of optimized models in production environments.
推荐(1篇)
【1】FUSE : Failure-aware Usage of Subagent Evidence for MultiModal Search and Recommendation
标题:FUSE:多模式搜索和推荐的子代理证据的故障意识使用
链接:https://arxiv.org/abs/2601.02365
作者:Tushar Vatsa,Vibha Belavadi,Priya Shanmugasundaram,Suhas Suresha,Dewang Sultania
备注:ICDM MMSR 2025: Workshop on Multimodal Search and Recommendations
摘要:多模态创意助手分解用户目标并将任务路由到子代理以进行布局、样式化、检索和生成。检索质量是关键,但失败可能出现在几个阶段:理解用户意图,选择内容类型,寻找候选人(召回)或对结果进行排名。同时,发送和处理图像的成本很高,使得天真的多模式方法不切实际。FUSE:Failure-aware Usage of Subagent Evidence for MultiModal Search and Recommendation。FUSE用紧凑的接地设计表示(GDR)取代了大多数原始图像提示:由Planner团队提供的画布元素(图像、文本、形状、图标、视频、徽标)、结构、样式、突出颜色和用户选择的选择感知JSON。FUSE实现了七种上下文预算策略:全面基线提示、上下文压缩、思维链推理、小镜头优化、检索增强上下文、两阶段处理和zero-shot极简主义。最后,管道属性层通过将子代理信号转换为简单的检查来监控系统性能:意图对齐、内容类型/路由健全性、召回健康(例如,零命中和顶级匹配强度),以及排名位移分析。我们评估了来自不同用户和设计模板的788个评估查询中的七个上下文预算变体(参见图3)。我们的系统评估显示,上下文压缩在所有管道阶段都实现了最佳性能,意图准确率为93.3%,路由成功率为86.8%(带回退),召回率为99.4%,NDCG@5为88.5%。这种方法表明,战略背景概括优于全面和最小的语境化策略。
摘要:Multimodal creative assistants decompose user goals and route tasks to subagents for layout, styling, retrieval, and generation. Retrieval quality is pivotal, yet failures can arise at several stages: understanding user intent, choosing content types, finding candidates (recall), or ranking results. Meanwhile, sending and processing images is costly, making naive multimodal approaches impractical. We present FUSE: Failure-aware Usage of Subagent Evidence for MultiModal Search and Recommendation. FUSE replaces most raw-image prompting with a compact Grounded Design Representation (GDR): a selection aware JSON of canvas elements (image, text, shape, icon, video, logo), structure, styles, salient colors, and user selection provided by the Planner team. FUSE implements seven context budgeting strategies: comprehensive baseline prompting, context compression, chain-of-thought reasoning, mini-shot optimization, retrieval-augmented context, two-stage processing, and zero-shot minimalism. Finally, a pipeline attribution layer monitors system performance by converting subagent signals into simple checks: intent alignment, content-type/routing sanity, recall health (e.g., zero-hit and top-match strength), and ranking displacement analysis. We evaluate the seven context budgeting variants across 788 evaluation queries from diverse users and design templates (refer Figure 3). Our systematic evaluation reveals that Context Compression achieves optimal performance across all pipeline stages, with 93.3% intent accuracy, 86.8% routing success(with fallbacks), 99.4% recall, and 88.5% NDCG@5. This approach demonstrates that strategic context summarization outperforms both comprehensive and minimal contextualization strategies.
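An illustrative, hypothetical payload in the spirit of the Grounded Design Representation described above; the abstract does not publish FUSE's schema, so every field name below is an assumption, meant only to show how a compact JSON canvas summary can stand in for raw-image prompting.
```python
import json

# Hypothetical GDR-style summary of a canvas (schema invented for illustration).
gdr = {
    "canvas": {"width": 1080, "height": 1080},
    "selection": ["img_01"],
    "elements": [
        {"id": "img_01", "type": "image", "bbox": [0, 0, 1080, 720],
         "salient_colors": ["#1f3a5f", "#f2e9dc"]},
        {"id": "txt_01", "type": "text", "bbox": [60, 760, 960, 120],
         "style": {"font": "serif", "size": 48, "color": "#1f3a5f"},
         "content": "Summer Sale"},
        {"id": "logo_01", "type": "logo", "bbox": [920, 920, 120, 120]},
    ],
}

prompt_context = json.dumps(gdr)   # compact textual context instead of raw pixels
print(len(prompt_context), "characters of context")
```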
聚类(1篇)
【1】Statistical Inference for Fuzzy Clustering
标题:模糊聚集的统计推断
链接:https://arxiv.org/abs/2601.02656
作者:Qiuyi Wu,Zihan Zhu,Anru R. Zhang
摘要:聚类是生物医学研究中发现异质患者亚群的核心工具,其中组边界通常是分散的,而不是急剧分离的。传统方法产生硬分区,而模糊$c$-均值(FCM)等软聚类方法允许混合成员资格并更好地捕捉不确定性和逐渐过渡。尽管FCM的广泛使用,模糊聚类的原则性统计推断仍然有限。 我们开发了一个新的框架,加权模糊$c$-手段(WFCM)的设置与潜在的集群大小不平衡。特定的权重重新平衡经典的FCM标准,使较小的集群不被压倒的优势群体,和加权的目标诱导一个归一化的密度模型与尺度参数$σ$和模糊参数$m$。估计是通过一个块优化-最小化(MM)程序,交替封闭形式的成员和质心更新与基于似然更新的$(σ,\bw)$。的棘手的归一化常数近似的重要性采样使用的数据自适应高斯混合提案。我们还提供了似然比测试比较聚类中心和基于自举的置信区间。 我们建立了最大似然估计的一致性和渐近正态性,通过模拟验证了该方法,并使用单细胞RNA-seq和阿尔茨海默病神经成像倡议(ADNI)数据进行了说明。这些应用证明了稳定的不确定性量化和生物学上有意义的软成员,范围从不平衡下分离良好的细胞群到与疾病进展一致的分级AD与非AD连续体。
摘要:Clustering is a central tool in biomedical research for discovering heterogeneous patient subpopulations, where group boundaries are often diffuse rather than sharply separated. Traditional methods produce hard partitions, whereas soft clustering methods such as fuzzy $c$-means (FCM) allow mixed memberships and better capture uncertainty and gradual transitions. Despite the widespread use of FCM, principled statistical inference for fuzzy clustering remains limited. We develop a new framework for weighted fuzzy $c$-means (WFCM) for settings with potential cluster size imbalance. Cluster-specific weights rebalance the classical FCM criterion so that smaller clusters are not overwhelmed by dominant groups, and the weighted objective induces a normalized density model with scale parameter $σ$ and fuzziness parameter $m$. Estimation is performed via a blockwise majorize--minimize (MM) procedure that alternates closed-form membership and centroid updates with likelihood-based updates of $(σ,\bw)$. The intractable normalizing constant is approximated by importance sampling using a data-adaptive Gaussian mixture proposal. We further provide likelihood ratio tests for comparing cluster centers and bootstrap-based confidence intervals. We establish consistency and asymptotic normality of the maximum likelihood estimator, validate the method through simulations, and illustrate it using single-cell RNA-seq and Alzheimer disease Neuroimaging Initiative (ADNI) data. These applications demonstrate stable uncertainty quantification and biologically meaningful soft memberships, ranging from well-separated cell populations under imbalance to a graded AD versus non-AD continuum consistent with disease progression.
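A toy numpy fuzzy c-means sketch, with an optional per-cluster weight on the distances standing in for the rebalancing idea; the paper's actual WFCM criterion, its blockwise MM updates for (sigma, w), and the importance-sampling step are not reproduced here.
```python
import numpy as np

def weighted_fcm(X, c=2, m=2.0, w=None, n_iter=50, seed=0):
    """Toy fuzzy c-means with optional per-cluster distance weights.

    The membership and centroid updates are the classical FCM ones; the
    weights `w` rescaling the distances are only a placeholder for the
    cluster-rebalancing idea, not the exact WFCM criterion.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    w = np.ones(c) if w is None else np.asarray(w, dtype=float)
    centers = X[rng.choice(n, size=c, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        d = d * w[None, :]                           # weighted distances
        ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
        u = 1.0 / ratio.sum(axis=2)                  # soft memberships (rows sum to 1)
        um = u ** m
        centers = (um.T @ X) / um.sum(axis=0)[:, None]
    return u, centers

# Imbalanced toy data: 200 points near the origin, 20 points near (4, 4).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(4.0, 1.0, (20, 2))])
u, centers = weighted_fcm(X, c=2, w=[1.0, 0.5])
print(centers)
```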
超分辨率|去噪|去模糊|去雾(1篇)
【1】RadioDiff-Flux: Efficient Radio Map Construction via Generative Denoise Diffusion Model Trajectory Midpoint Reuse
标题:RadioDiff-Flux:基于生成式去噪扩散模型轨迹中点重用的高效无线电地图构建
链接:https://arxiv.org/abs/2601.02790
作者:Xiucheng Wang,Peilin Zheng,Honggang Jia,Nan Cheng,Ruijin Sun,Conghao Zhou,Xuemin Shen
摘要:精确的无线电地图(RM)的建设是必不可少的,使环境感知和自适应无线通信。然而,在以高速网络实体和快速变化的环境为特征的未来6 G场景中,满足实时要求非常具有挑战性。虽然生成扩散模型(DM)可以实现国家的最先进的精度与第二级延迟,其迭代性质导致禁止延迟敏感的情况下的推理延迟。在本文中,通过揭示扩散过程的一个关键结构特性:潜在的中点在语义相似的场景中保持高度一致,我们提出了RadioDiff-Flux,一种新的两阶段潜在扩散框架,它将静态环境建模与动态细化相结合,使预先计算的中点能够重复使用,以绕过冗余的去噪。特别地,第一阶段仅使用静态场景特征生成粗略的潜在表示,其可以在类似场景中被缓存和共享。第二阶段使用预先训练的模型使该表示适应动态条件和发射机位置,从而避免重复的早期计算。建议RadioDiff-Flux显着减少推理时间,同时保持保真度。实验结果表明,RadioDiff-Flux可以实现高达50的加速,精度损失小于0.15%,证明了其在未来6 G网络中快速,可扩展的RM生成的实用性。
摘要:Accurate radio map (RM) construction is essential to enabling environment-aware and adaptive wireless communication. However, in future 6G scenarios characterized by high-speed network entities and fast-changing environments, it is very challenging to meet real-time requirements. Although generative diffusion models (DMs) can achieve state-of-the-art accuracy with second-level delay, their iterative nature leads to prohibitive inference latency in delay-sensitive scenarios. In this paper, by uncovering a key structural property of diffusion processes: the latent midpoints remain highly consistent across semantically similar scenes, we propose RadioDiff-Flux, a novel two-stage latent diffusion framework that decouples static environmental modeling from dynamic refinement, enabling the reuse of precomputed midpoints to bypass redundant denoising. In particular, the first stage generates a coarse latent representation using only static scene features, which can be cached and shared across similar scenarios. The second stage adapts this representation to dynamic conditions and transmitter locations using a pre-trained model, thereby avoiding repeated early-stage computation. The proposed RadioDiff-Flux significantly reduces inference time while preserving fidelity. Experiment results show that RadioDiff-Flux can achieve up to 50x acceleration with less than 0.15% accuracy loss, demonstrating its practical utility for fast, scalable RM generation in future 6G networks.
自动驾驶|车辆|车道检测等(1篇)
【1】Which Deep Learner? A Systematic Evaluation of Advanced Deep Forecasting Models Accuracy and Efficiency for Network Traffic Prediction
标题:哪个深度学习者?高级深度预测模型用于网络流量预测的准确性和效率的系统评估
链接:https://arxiv.org/abs/2601.02694
作者:Eilaf MA Babai,Aalaa MA Babai,Koji Okamura
备注:19 pages, 13 figures
摘要:网络流量预测是现代网络管理自动化的基础。这是一个困难的时间序列预测(TSF)问题,深度学习(DL)模型已经解决了这个问题,因为它们能够捕获复杂的模式。从复杂的Transformer架构到简单的线性模型,预测的进步提高了各种预测任务的性能。然而,鉴于网络环境和流量序列时间尺度的网络流量的可变性,确定有效的部署选择和建模方向对网络流量预测至关重要。该研究系统地识别和评估了12种先进的TSF模型-包括基于transformer的和传统的DL方法,每种方法在网络流量预测方面都具有独特的优势-针对四个真实流量数据集的三个统计基线,跨多个时间尺度和范围,评估性能,对异常的鲁棒性,数据缺口,外部因素,数据效率和资源效率。结果突出了性能机制、效率阈值以及平衡准确性和效率的有前途的架构,展示了对流量挑战的鲁棒性,并提出了超越传统RNN的新方向。
摘要:Network traffic prediction is essential for automating modern network management. It is a difficult time series forecasting (TSF) problem that has been addressed by Deep Learning (DL) models due to their ability to capture complex patterns. Advances in forecasting, from sophisticated transformer architectures to simple linear models, have improved performance across diverse prediction tasks. However, given the variability of network traffic across network environments and traffic series timescales, it is essential to identify effective deployment choices and modeling directions for network traffic prediction. This study systematically identifies and evaluates twelve advanced TSF models (including transformer-based and traditional DL approaches, each with unique advantages for network traffic prediction) against three statistical baselines on four real traffic datasets, across multiple time scales and horizons, assessing performance, robustness to anomalies, data gaps, external factors, data efficiency, and resource efficiency in terms of time, memory, and energy. Results highlight performance regimes, efficiency thresholds, and promising architectures that balance accuracy and efficiency, demonstrating robustness to traffic challenges and suggesting new directions beyond traditional RNNs.
点云|SLAM|雷达|激光|深度RGBD相关(1篇)
【1】Gradient descent reliably finds depth- and gate-optimal circuits for generic unitaries
标题:梯度下降可靠地找到通用正元的深度和门最优电路
链接:https://arxiv.org/abs/2601.03123
作者:Janani Gomathi,Alex Meiburg
备注:14 pages, 17 figures
摘要:当门集合具有连续参数时,使用精确方法将酉算子合成为量子电路总是可能的,但是有效地找到最小电路仍然是一个具有挑战性的问题。对于编译的酉单位,与使用所有参数并且通常需要最大大小的电路的通用酉单位相比,情况非常不同,编译的酉单位来自编程并且通常具有短路。我们表明,简单的梯度下降可靠地找到深度和门的通用酉最优电路,包括在存在限制芯片连接。这违背了早期的证据,最佳合成需要组合搜索,我们表明,这种差异可以解释为避免随机选择某些参数不足的电路骨架。
摘要:When the gate set has continuous parameters, synthesizing a unitary operator as a quantum circuit is always possible using exact methods, but finding minimal circuits efficiently remains a challenging problem. The landscape is very different for compiled unitaries, which arise from programming and typically have short circuits, as compared with generic unitaries, which use all parameters and typically require circuits of maximal size. We show that simple gradient descent reliably finds depth- and gate-optimal circuits for generic unitaries, including in the presence of restricted chip connectivity. This runs counter to earlier evidence that optimal synthesis required combinatorial search, and we show that this discrepancy can be explained by avoiding the random selection of certain parameter-deficient circuit skeletons.
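A single-qubit toy of the basic idea, fitting a ZYZ-parametrized circuit to a random target unitary by finite-difference gradient descent on a phase-invariant infidelity; the paper concerns multi-qubit circuits, restricted connectivity, and circuit skeletons, none of which are modeled in this sketch.
```python
import numpy as np

def rz(t):
    return np.array([[np.exp(-1j * t / 2), 0], [0, np.exp(1j * t / 2)]])

def ry(t):
    c, s = np.cos(t / 2), np.sin(t / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def circuit(theta):
    a, b, c = theta
    return rz(a) @ ry(b) @ rz(c)          # ZYZ template covers any single-qubit unitary

def infidelity(theta, target):
    # 1 - |tr(V^dag U)| / dim, invariant to global phase
    return 1.0 - abs(np.trace(target.conj().T @ circuit(theta))) / 2.0

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
target, _ = np.linalg.qr(A)               # random target unitary

theta = rng.uniform(0, 2 * np.pi, size=3)
lr, eps = 0.1, 1e-6
for _ in range(5000):
    grad = np.zeros(3)
    for i in range(3):                    # finite-difference gradient
        e = np.zeros(3)
        e[i] = eps
        grad[i] = (infidelity(theta + e, target) - infidelity(theta - e, target)) / (2 * eps)
    theta -= lr * grad

print("final infidelity:", infidelity(theta, target))
```
From a generic random start this typically converges to near-zero infidelity, which is the single-qubit analogue of the reliability claim above.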
推理|分析|理解|解释(5篇)
【1】Prompt-Counterfactual Explanations for Generative AI System Behavior
标题:生成式人工智能系统行为的反事实解释
链接:https://arxiv.org/abs/2601.03156
作者:Sofie Goethals,Foster Provost,João Sedoc
摘要:随着生成式人工智能系统集成到现实世界的应用程序中,组织越来越需要能够理解和解释他们的行为。特别是,决策者需要了解是什么原因导致生成式AI系统表现出特定的输出特征。在这个一般性的主题中,本文探讨了一个关键问题:输入-提示-是什么导致基于LLM的生成式AI系统产生具有特定特征的输出,例如毒性,负面情绪或政治偏见。为了研究这个问题,我们从可解释的人工智能文献中采用了一种常见的技术:反事实解释。我们解释了为什么传统的反事实解释不能直接应用于生成式AI系统,因为生成式AI系统的功能有几个不同之处。然后,我们提出了一个灵活的框架,在下游分类器可以揭示其输出的关键特征的情况下,将反事实解释适应于非确定性的生成AI系统。基于这个框架,我们介绍了一个算法生成的非真实反事实的解释(PCE)。最后,我们用三个案例研究来展示生成式人工智能系统的反事实解释的产生,研究不同的输出特征(即,政治倾向、毒性和情绪)。案例研究进一步表明,PCE可以简化提示工程,以抑制不良的输出特性,并可以提高红队的努力,发现额外的提示,引起不良的输出。最终,这项工作为生成式人工智能中的可解释性奠定了基础:这种能力将变得不可或缺,因为这些模型被赋予了更高风险的任务,并受到新的透明度和问责制监管要求的约束。
摘要:As generative AI systems become integrated into real-world applications, organizations increasingly need to be able to understand and interpret their behavior. In particular, decision-makers need to understand what causes generative AI systems to exhibit specific output characteristics. Within this general topic, this paper examines a key question: what is it about the input -the prompt- that causes an LLM-based generative AI system to produce output that exhibits specific characteristics, such as toxicity, negative sentiment, or political bias. To examine this question, we adapt a common technique from the Explainable AI literature: counterfactual explanations. We explain why traditional counterfactual explanations cannot be applied directly to generative AI systems, due to several differences in how generative AI systems function. We then propose a flexible framework that adapts counterfactual explanations to non-deterministic, generative AI systems in scenarios where downstream classifiers can reveal key characteristics of their outputs. Based on this framework, we introduce an algorithm for generating prompt-counterfactual explanations (PCEs). Finally, we demonstrate the production of counterfactual explanations for generative AI systems with three case studies, examining different output characteristics (viz., political leaning, toxicity, and sentiment). The case studies further show that PCEs can streamline prompt engineering to suppress undesirable output characteristics and can enhance red-teaming efforts to uncover additional prompts that elicit undesirable outputs. Ultimately, this work lays a foundation for prompt-focused interpretability in generative AI: a capability that will become indispensable as these models are entrusted with higher-stakes tasks and subject to emerging regulatory requirements for transparency and accountability.
【2】Explainable Fuzzy GNNs for Leak Detection in Water Distribution Networks
标题:用于供水管网泄漏检测的可解释模糊GNN
链接:https://arxiv.org/abs/2601.03062
作者:Qusai Khaled,Pasquale De Marinis,Moez Louati,David Ferras,Laura Genga,Uzay Kaymak
备注:Accepted at IFSA-NAFIPS 2025
摘要:供水管网的及时泄漏检测对于节约资源和保持运营效率至关重要。虽然图神经网络(GNN)擅长捕捉传感器数据中的时空依赖性,但其黑盒性质和基于图的可解释水网络模型的有限工作阻碍了实际应用。我们提出了一个可解释的GNN框架,它集成了互信息来识别关键网络区域和模糊逻辑,为节点分类任务提供清晰的、基于规则的解释。在对几种GNN架构进行基准测试后,我们选择了性能优越的广义图卷积网络(GENConv),并开发了一种模糊增强型变体,为分类泄漏位置提供直观的解释。我们的模糊图神经网络(FGENConv)在检测和定位方面的Graph F1得分分别为0.889和0.814,分别略低于清晰的GENConv 0.938和0.858。然而,它通过提供空间局部化,模糊的基于规则的解释来补偿。通过在精度和可解释性之间取得适当的平衡,所提出的模糊网络可以使水利工程师验证预测的泄漏位置,节省人力资源,并优化维护策略。该代码可在github.com/pasqualedem/GNNLeakDetection上获得。
摘要:Timely leak detection in water distribution networks is critical for conserving resources and maintaining operational efficiency. Although Graph Neural Networks (GNNs) excel at capturing spatial-temporal dependencies in sensor data, their black-box nature and the limited work on graph-based explainable models for water networks hinder practical adoption. We propose an explainable GNN framework that integrates mutual information to identify critical network regions and fuzzy logic to provide clear, rule-based explanations for node classification tasks. After benchmarking several GNN architectures, we selected the generalized graph convolution network (GENConv) for its superior performance and developed a fuzzy-enhanced variant that offers intuitive explanations for classified leak locations. Our fuzzy graph neural network (FGENConv) achieved Graph F1 scores of 0.889 for detection and 0.814 for localization, slightly below the crisp GENConv 0.938 and 0.858, respectively. Yet it compensates by providing spatially localized, fuzzy rule-based explanations. By striking the right balance between precision and explainability, the proposed fuzzy network could enable hydraulic engineers to validate predicted leak locations, conserve human resources, and optimize maintenance strategies. The code is available at github.com/pasqualedem/GNNLeakDetection.
【3】ChemBART: A Pre-trained BART Model Assisting Organic Chemistry Analysis
标题:ChemBART:一个预训练的BART模型,辅助有机化学分析
链接:https://arxiv.org/abs/2601.02915
作者:Kenan Li,Yijian Zhang,Jin Wang,Haipeng Gan,Zeying Sun,Xiaoguang Lei,Hao Dong
摘要:大型语言模型(LLM)的最新进展已经在不同领域展示了变革潜力。虽然LLM已应用于计算机辅助合成计划(CASP)中的分子简化分子输入行输入系统(SMILES),但现有方法通常解决单一任务,例如前体预测。我们介绍了ChemBART,一个基于SMILES的LLM,对化学反应进行了预训练,它为多个下游化学任务提供了一个统一的模型-实现了“一个模型,一个预训练,多个任务”的范式。通过利用反应表达式的掩模填充预训练任务的输出,ChemBART有效地解决了各种化学问题,包括前体/试剂生成,温度-产率回归,分子性质分类,以及在强化学习框架内优化策略和值函数,并与Monte Carlo树搜索集成用于多步合成路线设计。与单分子预训练LLM受限于特定应用不同,ChemBART解决了更广泛的化学挑战,并将其整合用于综合合成规划。至关重要的是,ChemBART设计的多步合成路线和反应条件直接激发了湿实验室验证,证实了较短的途径,比文献基准提高了约30%的产率。我们的工作验证了以反应为中心的预训练的力量,并展示了ChemBART在推进完整合成规划周期方面的广泛实用性。
摘要:Recent advances in large language models (LLMs) have demonstrated transformative potential across diverse fields. While LLMs have been applied to molecular simplified molecular input line entry system (SMILES) in computer-aided synthesis planning (CASP), existing methodologies typically address single tasks, such as precursor prediction. We introduce ChemBART, a SMILES-based LLM pre-trained on chemical reactions, which enables a unified model for multiple downstream chemical tasks--achieving the paradigm of "one model, one pre-training, multiple tasks." By leveraging outputs from a mask-filling pre-training task on reaction expressions, ChemBART effectively solves a variety of chemical problems, including precursor/reagent generation, temperature-yield regression, molecular property classification, and optimizing the policy and value functions within a reinforcement learning framework, integrated with Monte Carlo tree search for multi-step synthesis route design. Unlike single-molecule pre-trained LLMs constrained to specific applications, ChemBART addresses broader chemical challenges and integrates them for comprehensive synthesis planning. Crucially, ChemBART-designed multi-step synthesis routes and reaction conditions directly inspired wet-lab validation, which confirmed shorter pathways with ~30% yield improvement over literature benchmarks. Our work validates the power of reaction-focused pre-training and showcases the broad utility of ChemBART in advancing the complete synthesis planning cycle.
【4】Threat Detection in Social Media Networks Using Machine Learning Based Network Analysis
标题:使用基于机器学习的网络分析在社交媒体网络中进行威胁检测
链接:https://arxiv.org/abs/2601.02581
作者:Aditi Sanjay Agrawal
备注:11 Pages, 6 figures
摘要:社交媒体网站的加速发展给网络空间带来了复杂的安全问题,这些网站越来越多地成为犯罪活动的受害者,包括试图入侵它们、异常流量模式和有组织的攻击。传统的基于规则的安全系统在应对此类威胁时往往缺乏可扩展性和动态性。本文介绍了一种基于机器学习的威胁检测框架,可用于根据网络流量的性质对社交媒体网络环境中的恶意行为进行分类。利用丰富的网络流量数据集,进行大量的预处理和探索性数据分析,以克服数据不平衡、特征不一致和噪声的问题。然后创建一个人工神经网络(ANN)模型来捕获恶意行为复杂的非线性规律。所提出的模型在准确率、精确率、召回率、F1分数和ROC-AUC等传统性能指标上进行了测试,并显示出良好的检测效果和较高的鲁棒性。研究结果表明,基于神经网络的解决方案有可能被有效地用于识别大规模社交媒体网络背景下的潜在威胁动态,它们可以用来补充现有的入侵检测系统,并更好地支撑主动的网络安全运营。
摘要:The accelerated development of social media websites has posed intricate security issues in cyberspace, where these sites have increasingly become victims of criminal activities including attempts to intrude into them, abnormal traffic patterns, and organized attacks. The conventional rule-based security systems are not always scalable and dynamic enough to meet such threats. This paper introduces a threat detection framework based on machine learning that can be used to classify malicious behavior in the social media network environment based on the nature of network traffic. Exploiting a rich network traffic dataset, the massive preprocessing and exploratory data analysis is conducted to overcome the problem of data imbalance, feature inconsistency, and noise. A model of artificial neural network (ANN) is then created to acquire intricate, non-linear tendencies of malicious actions. The proposed model is tested on conventional performance metrics, such as accuracy, precision, recall, F1-score, and ROC-AUC, and shows strong detection performance and robustness. The findings suggest that neural network-based solutions have the potential to be used effectively to identify the latent threat dynamics within the context of a large-scale social media network and that they can be employed to complement the existing intrusion detection system and better support proactive cybersecurity operations.
【5】Cross-Platform Digital Discourse Analysis of the Israel-Hamas Conflict: Sentiment, Topics, and Event Dynamics
标题:以色列与哈马斯冲突的跨平台数字话语分析:情绪、话题和事件动态
链接:https://arxiv.org/abs/2601.02367
作者:Despoina Antonakaki,Sotiris Ioannidis
摘要:以色列-巴勒斯坦冲突仍然是最两极分化的地缘政治问题之一,2023年10月的升级加剧了在线辩论。社交媒体平台,特别是Telegram,已经成为实时新闻分享、宣传和宣传的核心。在这项研究中,我们分析了Telegram,Twitter/X和Reddit,以研究冲突叙事如何在不同的数字领域产生,放大和竞争。基于我们之前在2023年升级期间对Telegram话语的研究,我们使用2023年10月至2025年年中的更新数据集纵向和跨平台扩展了分析。该语料库包括超过187,000条Telegram消息,210万条Reddit评论和策划的Twitter/X帖子。我们将潜在狄利克雷分配(LDA)、BERTopic和基于transformer的情感和情绪模型相结合,以识别主导主题、情感动态和宣传策略。Telegram频道提供未经过滤的高强度事件文档; Twitter/X向全球受众放大帧; Reddit举办更多反思和审议讨论。我们的研究结果揭示了持续的负面情绪,人道主义框架和团结表达之间的强烈耦合,以及亲巴勒斯坦和亲以色列叙事的传播平台特定途径。本文提供了三个贡献:(1)一个多平台的,符合FAIR的以色列-哈马斯战争数据集,(2)一个集成的管道,结合主题建模,情绪和情感分析,以及大规模冲突话语的垃圾邮件过滤,以及(3)对平台启示和情感公众如何塑造数字冲突传播的演变的实证见解。
摘要:The Israeli-Palestinian conflict remains one of the most polarizing geopolitical issues, with the October 2023 escalation intensifying online debate. Social media platforms, particularly Telegram, have become central to real-time news sharing, advocacy, and propaganda. In this study, we analyze Telegram, Twitter/X, and Reddit to examine how conflict narratives are produced, amplified, and contested across different digital spheres. Building on our previous work on Telegram discourse during the 2023 escalation, we extend the analysis longitudinally and cross-platform using an updated dataset spanning October 2023 to mid-2025. The corpus includes more than 187,000 Telegram messages, 2.1 million Reddit comments, and curated Twitter/X posts. We combine Latent Dirichlet Allocation (LDA), BERTopic, and transformer-based sentiment and emotion models to identify dominant themes, emotional dynamics, and propaganda strategies. Telegram channels provide unfiltered, high-intensity documentation of events; Twitter/X amplifies frames to global audiences; and Reddit hosts more reflective and deliberative discussions. Our findings reveal persistent negative sentiment, strong coupling between humanitarian framing and solidarity expressions, and platform-specific pathways for the diffusion of pro-Palestinian and pro-Israeli narratives. This paper offers three contributions: (1) a multi-platform, FAIR-compliant dataset on the Israel-Hamas war, (2) an integrated pipeline combining topic modeling, sentiment and emotion analysis, and spam filtering for large-scale conflict discourse, and (3) empirical insights into how platform affordances and affective publics shape the evolution of digital conflict communication.
分类|识别(2篇)
【1】Low-Resource Heuristics for Bahnaric Optical Character Recognition Improvement
标题:Bahnaric光学字符识别改进的低资源启发式方法
链接:https://arxiv.org/abs/2601.02965
作者:Phat Tran,Phuoc Pham,Hung Trinh,Tho Quan
摘要:Bahnar是越南、柬埔寨和老挝的一种少数民族语言,由于研究和数据有限,它面临着重大的保护挑战。这项研究解决了通过光学字符识别(OCR)技术准确数字化Bahnar语言文件的关键需求。数字化扫描的纸质文档带来了巨大的挑战,因为破碎或模糊区域的图像质量下降会引入相当大的OCR错误,从而危及信息检索系统。我们提出了一种综合的方法,结合先进的表和非表检测技术与基于概率的后处理算法,以提高识别精度。我们的方法首先应用检测算法来提高输入数据的质量,然后采用概率错误校正OCR输出。实验结果表明,识别准确率从72.86%提高到79.26%。这项工作为Bahnar语言的保护提供了宝贵的资源,并为其他少数民族语言的数字化工作提供了适用的框架。
摘要:Bahnar, a minority language spoken across Vietnam, Cambodia, and Laos, faces significant preservation challenges due to limited research and data availability. This study addresses the critical need for accurate digitization of Bahnar language documents through optical character recognition (OCR) technology. Digitizing scanned paper documents poses significant challenges, as degraded image quality from broken or blurred areas introduces considerable OCR errors that compromise information retrieval systems. We propose a comprehensive approach combining advanced table and non-table detection techniques with probability-based post-processing heuristics to enhance recognition accuracy. Our method first applies detection algorithms to improve input data quality, then employs probabilistic error correction on OCR output. Experimental results indicate a substantial improvement, with recognition accuracy increasing from 72.86% to 79.26%. This work contributes valuable resources for Bahnar language preservation and provides a framework applicable to other minority language digitization efforts.
【2】Normalized Conditional Mutual Information Surrogate Loss for Deep Neural Classifiers
标题:深度神经分类器的规范化条件互信息替代损失
链接:https://arxiv.org/abs/2601.02543
作者:Linfeng Ye,Zhixiang Chi,Konstantinos N. Plataniotis,En-hui Yang
备注:8 pages, 4 figures
摘要:在本文中,我们提出了一种新的信息理论替代损失;归一化条件互信息(NCMI);作为事实上的交叉熵(CE)的替代品,用于训练基于深度神经网络(DNN)的分类器。我们首先观察到模型的NCMI与其精度成反比。基于这一认识,我们引入了一种交替算法,以有效地最小化NCMI。在图像识别和全载玻片成像(WSI)子分型基准中,NCMI训练的模型以与CE相当的计算成本超过了最先进的损失。值得注意的是,在ImageNet上,与CE相比,使用ResNet-50的NCMI产生了2.77%的top-1准确率提高;在CAMELYON-17上,使用NCMI取代CE将macro-F1提高了8.6%。在各种架构和批量大小中,收益是一致的,这表明NCMI是CE的一种实用且有竞争力的替代方案。
摘要:In this paper, we propose a novel information-theoretic surrogate loss, normalized conditional mutual information (NCMI), as a drop-in alternative to the de facto cross-entropy (CE) for training deep neural network (DNN) based classifiers. We first observe that the model's NCMI is inversely proportional to its accuracy. Building on this insight, we introduce an alternating algorithm to efficiently minimize the NCMI. Across image recognition and whole-slide imaging (WSI) subtyping benchmarks, NCMI-trained models surpass state-of-the-art losses by substantial margins at a computational cost comparable to that of CE. Notably, on ImageNet, NCMI yields a 2.77% top-1 accuracy improvement with ResNet-50 compared to CE; on CAMELYON-17, replacing CE with NCMI improves the macro-F1 by 8.6% over the strongest baseline. Gains are consistent across various architectures and batch sizes, suggesting that NCMI is a practical and competitive alternative to CE.
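下面给出一个示意性的小批量估计代码,仅为帮助理解"归一化条件互信息"这一类量如何从softmax输出直接计算:这里假设NCMI可写为条件互信息 I(X; Ŷ | Y) 与互信息 I(Y; Ŷ) 之比。论文的具体定义与交替最小化算法以原文和官方实现为准,以下公式与变量名均为摘要之外的假设。

```python
import numpy as np

def ncmi_estimate(probs, labels, num_classes, eps=1e-12):
    """Illustrative estimate of I(X; Yhat | Y) / I(Y; Yhat) from softmax outputs.

    This is only one plausible reading of a "normalized conditional mutual
    information"; it is NOT claimed to be the paper's exact definition.
    """
    probs = np.clip(probs, eps, 1.0)
    # Class-conditional mean prediction P(Yhat | Y = c)
    class_means = np.stack([
        probs[labels == c].mean(axis=0) if np.any(labels == c)
        else np.full(num_classes, 1.0 / num_classes)
        for c in range(num_classes)
    ])
    # Conditional MI: average KL( P(Yhat | x) || P(Yhat | Y = y_x) )
    cmi = np.mean([np.sum(p * np.log(p / np.clip(class_means[y], eps, 1.0)))
                   for p, y in zip(probs, labels)])
    # MI(Y; Yhat) from the empirical joint of the true class and mean prediction
    p_y = np.bincount(labels, minlength=num_classes) / len(labels)
    joint = class_means * p_y[:, None]
    p_yhat = joint.sum(axis=0)
    mi = np.sum(joint * np.log(np.clip(joint, eps, None)
                               / np.clip(np.outer(p_y, p_yhat), eps, None)))
    return cmi / max(mi, eps)

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=256)
logits = 2.0 * np.eye(10)[labels] + rng.normal(size=(256, 10))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print("NCMI estimate:", ncmi_estimate(probs, labels, 10))
```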
表征(1篇)
【1】Causal Manifold Fairness: Enforcing Geometric Invariance in Representation Learning
标题:因果流形公平性:在表示学习中强制几何不变性
链接:https://arxiv.org/abs/2601.03032
作者:Vidhi Rathore
摘要:机器学习中的公平性越来越重要,但标准方法通常将数据视为高维空间中的静态点,忽略了底层的生成结构。我们认为敏感属性(例如,种族、性别)不仅改变了数据分布,而且还因果地扭曲了数据流形本身的几何形状。为了解决这个问题,我们引入了因果流形公平(CMF),这是一个连接因果推理和几何深度学习的新框架。CMF学习一个潜在的表示,其中由度量张量和曲率定义的局部黎曼几何在敏感属性上的反事实干预下保持不变。通过对解码器的雅可比矩阵和海森矩阵施加约束,CMF确保了潜在空间的规则(距离和形状)在人口统计组中得到保留。我们在合成结构因果模型(SCM)上验证了CMF,证明它有效地解开了敏感的几何扭曲,同时保留了任务效用,通过几何度量提供了公平效用权衡的严格量化。
摘要:Fairness in machine learning is increasingly critical, yet standard approaches often treat data as static points in a high-dimensional space, ignoring the underlying generative structure. We posit that sensitive attributes (e.g., race, gender) do not merely shift data distributions but causally warp the geometry of the data manifold itself. To address this, we introduce Causal Manifold Fairness (CMF), a novel framework that bridges causal inference and geometric deep learning. CMF learns a latent representation where the local Riemannian geometry, defined by the metric tensor and curvature, remains invariant under counterfactual interventions on sensitive attributes. By enforcing constraints on the Jacobian and Hessian of the decoder, CMF ensures that the rules of the latent space (distances and shapes) are preserved across demographic groups. We validate CMF on synthetic Structural Causal Models (SCMs), demonstrating that it effectively disentangles sensitive geometric warping while preserving task utility, offering a rigorous quantification of the fairness-utility trade-off via geometric metrics.
3D|3D重建等相关(1篇)
【1】Enhanced 3D Gravity Inversion Using ResU-Net with Density Logging Constraints: A Dual-Phase Training Approach
标题:使用具有密度测井约束的ResU-Net增强三维重力反演:双阶段训练方法
链接:https://arxiv.org/abs/2601.02890
作者:Siyuan Dong,Jinghuai Gao,Shuai Zhou,Baohai Wu,Hongfa Jia
摘要:重力勘探以其低成本、高效率的特点成为一种重要的地球物理勘探方法。随着人工智能的兴起,基于深度学习(DL)的数据驱动重力反演方法具有常规正则化方法所缺乏的物理属性恢复能力。然而,现有的DL方法受到先验信息约束不足的影响,这导致反演模型具有较大的数据拟合误差和不可靠的结果。此外,反演结果缺乏其他勘探方法的约束和匹配,导致可能与已知地质条件相矛盾的结果。在这项研究中,我们提出了一种新的方法,集成了以前的密度测井信息,以解决上述问题。首先,我们引入一个深度加权函数的神经网络(NN)和训练它在加权密度参数域。在加权前向算子的约束下,神经网络表现出更好的反演性能,所得到的反演模型表现出更小的数据拟合误差。接下来,我们将整个网络训练分为两个阶段:首先训练一个大型的预训练网络Net-I,然后使用密度日志信息作为约束,得到优化的微调网络Net-II。通过在合成模型和Bishop模型上的测试和比较,与无约束数据驱动的DL反演方法相比,该方法的反演质量有了明显的提高。此外,我们还进行了比较和讨论,我们的方法与传统的聚焦反演(FI)方法及其测井约束的变体。最后,我们将该方法应用于墨西哥San Nicolas矿区的实测数据,并与最近的两种基于DL的重力反演方法进行了比较和分析。
摘要:Gravity exploration has become an important geophysical method due to its low cost and high efficiency. With the rise of artificial intelligence, data-driven gravity inversion methods based on deep learning (DL) possess physical property recovery capabilities that conventional regularization methods lack. However, existing DL methods suffer from insufficient prior information constraints, which leads to inversion models with large data fitting errors and unreliable results. Moreover, the inversion results lack constraints and matching from other exploration methods, leading to results that may contradict known geological conditions. In this study, we propose a novel approach that integrates prior density well logging information to address the above issues. First, we introduce a depth weighting function to the neural network (NN) and train it in the weighted density parameter domain. The NN, under the constraint of the weighted forward operator, demonstrates improved inversion performance, with the resulting inversion model exhibiting smaller data fitting errors. Next, we divide the entire network training into two phases: first training a large pre-trained network Net-I, and then using the density logging information as the constraint to get the optimized fine-tuning network Net-II. Through testing and comparison in synthetic models and Bishop Model, the inversion quality of our method has significantly improved compared to the unconstrained data-driven DL inversion method. Additionally, we also conduct a comparison and discussion between our method and both the conventional focusing inversion (FI) method and its well logging constrained variant. Finally, we apply this method to the measured data from the San Nicolas mining area in Mexico, comparing and analyzing it with two recent gravity inversion methods based on DL.
编码器(1篇)
【1】Lil: Less is Less When Applying Post-Training Sparse-Attention Algorithms in Long-Decode Stage
标题:Lil:在长解码阶段应用训练后的稀疏注意算法时,少就是少
链接:https://arxiv.org/abs/2601.03043
作者:Junhao Hu,Fangze Li,Mingtao Xu,Feifan Meng,Shiju Zhao,Tiancheng Hu,Ting Peng,Anmin Liu,Wenrui Huang,Chenxu Liu,Ziyue Hua,Tao Xie
摘要:大型语言模型(LLM)在广泛的复杂任务中表现出强大的能力,并且越来越多地大规模部署,对推理效率提出了很高的要求。先前的工作通常将推理分解为预填充和解码阶段,其中解码阶段主导总延迟。为了减少解码阶段的时间和内存复杂度,一系列工作引入了稀疏注意算法。在本文中,我们表明,无论是经验和理论,稀疏的注意力可以矛盾地增加端到端的复杂性:信息丢失往往会导致显着更长的序列,我们称之为"少即是少"(Lil)的现象。为了缓解Lil问题,我们提出了一种早期停止算法,该算法检测在稀疏解码过程中信息损失超过信息增益的阈值。我们的早期停止算法将令牌消耗减少了高达90%,在推理密集型基准测试中,边际准确性下降不到2%。
摘要:Large language models (LLMs) demonstrate strong capabilities across a wide range of complex tasks and are increasingly deployed at scale, placing significant demands on inference efficiency. Prior work typically decomposes inference into prefill and decode stages, with the decode stage dominating total latency. To reduce time and memory complexity in the decode stage, a line of work introduces sparse-attention algorithms. In this paper, we show, both empirically and theoretically, that sparse attention can paradoxically increase end-to-end complexity: information loss often induces significantly longer sequences, a phenomenon we term ``Less is Less'' (Lil). To mitigate the Lil problem, we propose an early-stopping algorithm that detects the threshold where information loss exceeds information gain during sparse decoding. Our early-stopping algorithm reduces token consumption by up to 90% with a marginal accuracy degradation of less than 2% across reasoning-intensive benchmarks.
优化|敛散性(7篇)
【1】Dynamic Hyperparameter Importance for Efficient Multi-Objective Optimization
标题:高效多目标优化的动态超参数重要性
链接:https://arxiv.org/abs/2601.03166
作者:Daphne Theodorakopoulos,Marcel Wever,Marius Lindauer
备注:Submitted to IJCAI 2026
摘要:选择合适的ML模型是一项复杂的任务,可能取决于几个目标,例如,准确性、模型大小、公平性、推理时间或能耗。在实践中,这需要通过多目标优化(MOO)来权衡多个通常相互竞争的目标。然而,现有的MOO方法通常将所有超参数视为同等重要,忽略了超参数重要性(HPI)可以根据目标之间的权衡而显着变化。我们提出了一种新的动态优化方法,在搜索过程中根据不同的目标权衡优先考虑最有影响力的超参数,这加速了经验收敛,并导致更好的解决方案。基于先前针对MOO后分析的HPI工作,我们现在将使用HyperSHAP计算的HPI集成到优化中。为此,我们利用MOO算法ParEGO自然产生的客观权重,并通过固定不重要的超参数来调整配置空间,使搜索专注于重要的超参数。最后,我们用PyMOO和YAHPO-Gym的不同任务验证了我们的方法。实证结果表明,与基线相比,收敛速度和帕累托前沿质量有所改善。
摘要:Choosing a suitable ML model is a complex task that can depend on several objectives, e.g., accuracy, model size, fairness, inference time, or energy consumption. In practice, this requires trading off multiple, often competing, objectives through multi-objective optimization (MOO). However, existing MOO methods typically treat all hyperparameters as equally important, overlooking that hyperparameter importance (HPI) can vary significantly depending on the trade-off between objectives. We propose a novel dynamic optimization approach that prioritizes the most influential hyperparameters based on varying objective trade-offs during the search process, which accelerates empirical convergence and leads to better solutions. Building on prior work on HPI for MOO post-analysis, we now integrate HPI, calculated with HyperSHAP, into the optimization. For this, we leverage the objective weightings naturally produced by the MOO algorithm ParEGO and adapt the configuration space by fixing the unimportant hyperparameters, allowing the search to focus on the important ones. Eventually, we validate our method with diverse tasks from PyMOO and YAHPO-Gym. Empirical results demonstrate improvements in convergence speed and Pareto front quality compared to baselines.
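下面是一个极简的示意代码,展示"按超参数重要性收缩搜索空间"这一思路:给定某个目标权重下(例如来自HyperSHAP式分析)的重要性得分,把低于阈值的超参数固定为默认值,只对重要超参数采样。其中的超参数名称、取值范围和得分均为虚构示例,并非论文的实际配置。

```python
import random

# Hypothetical importance scores for one objective weighting (made up for illustration).
importance = {"lr": 0.62, "weight_decay": 0.21, "dropout": 0.11,
              "batch_size": 0.04, "optimizer": 0.02}
defaults   = {"lr": 1e-3, "weight_decay": 1e-4, "dropout": 0.1,
              "batch_size": 64, "optimizer": "adam"}
ranges     = {"lr": (1e-5, 1e-1), "weight_decay": (1e-6, 1e-2),
              "dropout": (0.0, 0.5), "batch_size": (16, 256)}

def sample_config(importance, threshold=0.1):
    """Fix unimportant hyperparameters to defaults; sample only the important ones."""
    cfg = dict(defaults)
    for name, score in importance.items():
        if score >= threshold and name in ranges:
            lo, hi = ranges[name]
            if isinstance(lo, int):
                cfg[name] = random.randint(lo, hi)
            else:
                cfg[name] = random.uniform(lo, hi)
    return cfg

print(sample_config(importance))
```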
【2】On the Convergence Behavior of Preconditioned Gradient Descent Toward the Rich Learning Regime
标题:预条件梯度下降向丰富学习机制的收敛行为
链接:https://arxiv.org/abs/2601.03162
作者:Shuai Jiang,Alexey Voronin,Eric Cyr,Ben Southworth
备注:21 pages, 13 figures,
摘要:频谱偏差,即神经网络倾向于先学习低频成分,既可能是优势,也可能是劣势。它通过抑制高频噪声增强了泛化能力,但在需要捕获精细尺度结构的科学任务中可能成为限制。被称为grokking的延迟泛化现象是神经网络快速训练的另一个障碍。Grokking被假设源于学习从NTK机制向特征丰富机制的过渡。本文探讨了预条件梯度下降(PGD,如高斯-牛顿法)对谱偏差和grokking现象的影响。我们通过理论和实证结果说明PGD如何缓解与谱偏差相关的问题。此外,基于"丰富学习机制"的grokking假设,我们研究了如何利用PGD减少与grokking相关的延迟。我们的猜想是:PGD在不受谱偏差阻碍的情况下,能够在NTK机制中对参数空间进行均匀探索。实验结果证实了这一预测,为grokking代表以NTK为特征的惰性机制与丰富机制之间的过渡行为提供了有力证据。这些发现加深了我们对优化动力学、谱偏差与神经网络学习阶段之间相互作用的理解。
摘要:Spectral bias, the tendency of neural networks to learn low frequencies first, can be both a blessing and a curse. While it enhances the generalization capabilities by suppressing high-frequency noise, it can be a limitation in scientific tasks that require capturing fine-scale structures. The delayed generalization phenomenon known as grokking is another barrier to rapid training of neural networks. Grokking has been hypothesized to arise as learning transitions from the NTK to the feature-rich regime. This paper explores the impact of preconditioned gradient descent (PGD), such as Gauss-Newton, on spectral bias and grokking phenomena. We demonstrate through theoretical and empirical results how PGD can mitigate issues associated with spectral bias. Additionally, building on the rich learning regime grokking hypothesis, we study how PGD can be used to reduce delays associated with grokking. Our conjecture is that PGD, without the impediment of spectral bias, enables uniform exploration of the parameter space in the NTK regime. Our experimental results confirm this prediction, providing strong evidence that grokking represents a transitional behavior between the lazy regime characterized by the NTK and the rich regime. These findings deepen our understanding of the interplay between optimization dynamics, spectral bias, and the phases of neural network learning.
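作为说明,下面用一个小型非线性最小二乘问题演示高斯-牛顿型预条件梯度步(带阻尼,即Levenberg-Marquardt风格):用 (JᵀJ + λI)⁻¹ 对梯度 Jᵀr 做预条件。这只是对"预条件梯度下降"的通用示意,与论文中针对神经网络的具体实现无关。

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
theta_true = 0.5 * rng.normal(size=5)
y = np.tanh(X @ theta_true) + 0.01 * rng.normal(size=200)

def residual(theta):                 # r(theta) = model(theta) - y
    return np.tanh(X @ theta) - y

def jacobian(theta):                 # dr/dtheta for the tanh model
    s = 1.0 - np.tanh(X @ theta) ** 2
    return s[:, None] * X

theta = np.zeros(5)
for step in range(20):
    r, J = residual(theta), jacobian(theta)
    # Damped Gauss-Newton (Levenberg-Marquardt style) preconditioned step
    delta = np.linalg.solve(J.T @ J + 1e-3 * np.eye(5), J.T @ r)
    theta -= delta
print("final loss:", 0.5 * np.sum(residual(theta) ** 2))
```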
【3】Finite Memory Belief Approximation for Optimal Control in Partially Observable Markov Decision Processes
标题:部分可观察马尔可夫决策过程最优控制的有限记忆信念逼近
链接:https://arxiv.org/abs/2601.03132
作者:Mintae Kim
备注:6 pages, 3 figures
摘要:本文研究部分可观测(PO)随机最优控制(SOC)问题的有限记忆信念逼近。虽然在部分可观测马尔可夫决策过程(POMDP)中,信念状态对SOC来说是充分的,但它们通常是无限维的,不具实用性。我们将截断的输入输出(IO)历史解释为诱导一种信念近似,并建立了一套基于度量的理论,将信息损失与控制性能直接联系起来。利用Wasserstein度量,我们推导出以策略为条件的性能界,量化沿典型闭环轨迹由有限记忆引起的价值退化。我们的分析通过固定策略比较展开:在相同的闭环执行下评估两个成本泛函,从而在信念层面的成本中分离出用有限记忆近似替代真实信念所产生的影响。对于线性二次高斯(LQG)系统,我们给出了信念失配的闭式计算,并通过实验验证了所预测的机制,表明信念失配随记忆长度近似呈指数衰减,所诱导的性能失配也相应缩放。综上,这些结果从度量的角度刻画了有限记忆信念近似在部分可观测设定下能够实现和不能实现的目标。
摘要:We study finite memory belief approximation for partially observable (PO) stochastic optimal control (SOC) problems. While belief states are sufficient for SOC in partially observable Markov decision processes (POMDPs), they are generally infinite-dimensional and impractical. We interpret truncated input-output (IO) histories as inducing a belief approximation and develop a metric-based theory that directly relates information loss to control performance. Using the Wasserstein metric, we derive policy-conditional performance bounds that quantify value degradation induced by finite memory along typical closed-loop trajectories. Our analysis proceeds via a fixed-policy comparison: we evaluate two cost functionals under the same closed-loop execution and isolate the effect of replacing the true belief by its finite memory approximation inside the belief-level cost. For linear quadratic Gaussian (LQG) systems, we provide closed-form belief mismatch evaluation and empirically validate the predicted mechanism, demonstrating that belief mismatch decays approximately exponentially with memory length and that the induced performance mismatch scales accordingly. Together, these results provide a metric-aware characterization of what finite memory belief approximation can and cannot achieve in PO settings.
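下面用一个标量LQG/卡尔曼滤波的小例子示意"有限记忆信念失配"的度量方式:完整滤波得到的高斯信念,与只用最近m个观测(从平稳先验重启)得到的信念之间,可用一维高斯的2-Wasserstein距离闭式计算。有限记忆信念的这种构造方式是此处的假设,仅用于演示失配随记忆长度衰减的现象。

```python
import numpy as np

# Scalar linear-Gaussian system: x' = a x + w,  y = x + v
a, q, r = 0.9, 0.1, 0.2
rng = np.random.default_rng(1)

def kalman(ys, mu0, p0):
    mu, p = mu0, p0
    for y in ys:
        mu, p = a * mu, a * a * p + q            # predict
        k = p / (p + r)                          # Kalman gain
        mu, p = mu + k * (y - mu), (1 - k) * p   # update
    return mu, p

def w2_gauss(m1, s1, m2, s2):
    """2-Wasserstein distance between two 1-D Gaussians (variances s1, s2)."""
    return np.sqrt((m1 - m2) ** 2 + (np.sqrt(s1) - np.sqrt(s2)) ** 2)

# Simulate a trajectory of observations
T, x, ys = 60, 0.0, []
for _ in range(T):
    x = a * x + np.sqrt(q) * rng.normal()
    ys.append(x + np.sqrt(r) * rng.normal())

p_prior = q / (1 - a * a)                        # stationary prior variance
mu_full, p_full = kalman(ys, 0.0, p_prior)
for m in (2, 5, 10, 20, 40):
    # Finite-memory belief: restart from the stationary prior, use only the last m obs.
    mu_m, p_m = kalman(ys[-m:], 0.0, p_prior)
    print(f"memory {m:2d}: W2(full belief, finite-memory belief) = "
          f"{w2_gauss(mu_full, p_full, mu_m, p_m):.4f}")
```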
【4】Q-Regularized Generative Auto-Bidding: From Suboptimal Trajectories to Optimal Policies
标题:Q值正则化生成式自动出价:从次优轨迹到最优策略
链接:https://arxiv.org/abs/2601.02754
作者:Mingming Zhang,Na Li,Zhuang Feiqing,Hongyang Zheng,Jiangbing Zhou,Wang Wuyin,Sheng-jie Sun,XiaoWei Chen,Junxiong Zhu,Lixin Zou,Chenliang Li
备注:11pages, 5figures, In Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining
摘要:随着电子商务的快速发展,自动竞价已成为在不同广告客户环境下优化广告效果的关键资产。目前的方法集中在强化学习(RL)和生成模型。这些努力通过利用具有昂贵的超参数调整的复杂结构来模仿离线历史行为。次优轨迹进一步加剧了政策学习的难度。 为了解决这些挑战,我们提出了QGA,一种新的Q值正则化生成式自动投标方法。在QGA中,我们建议将具有双Q学习策略的Q值正则化插入到决策Transformer骨干中。这种设计能够实现策略模仿和行动价值最大化的联合优化,允许学习的投标策略既利用数据集的经验,又减轻次优轨迹的不利影响。此外,为了安全地探索数据分布之外的策略空间,我们提出了一种Q值引导的双重探索机制,其中DT模型以多个返回目标和局部扰动动作为条件。整个探索过程由上述Q值模块动态引导,该模块为每个候选动作提供原则性评估。公共基准测试和仿真环境的实验表明,QGA一贯实现优越的或具有竞争力的结果相比,现有的替代品。值得注意的是,在大规模的真实A/B测试中,QGA实现了广告GMV增加3.27%,广告ROI提高2.49%。
摘要:With the rapid development of e-commerce, auto-bidding has become a key asset in optimizing advertising performance under diverse advertiser environments. The current approaches focus on reinforcement learning (RL) and generative models. These efforts imitate offline historical behaviors by utilizing a complex structure with expensive hyperparameter tuning. The suboptimal trajectories further exacerbate the difficulty of policy learning. To address these challenges, we propose QGA, a novel Q-value regularized Generative Auto-bidding method. In QGA, we propose to plug a Q-value regularization with double Q-learning strategy into the Decision Transformer backbone. This design enables joint optimization of policy imitation and action-value maximization, allowing the learned bidding policy to both leverage experience from the dataset and alleviate the adverse impact of the suboptimal trajectories. Furthermore, to safely explore the policy space beyond the data distribution, we propose a Q-value guided dual-exploration mechanism, in which the DT model is conditioned on multiple return-to-go targets and locally perturbed actions. This entire exploration process is dynamically guided by the aforementioned Q-value module, which provides principled evaluation for each candidate action. Experiments on public benchmarks and simulation environments demonstrate that QGA consistently achieves superior or highly competitive results compared to existing alternatives. Notably, in large-scale real-world A/B testing, QGA achieves a 3.27% increase in Ad GMV and a 2.49% improvement in Ad ROI.
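下面给出一个示意性的损失构造,说明"行为克隆 + Q值正则化(双Q取最小)"的联合优化思路:策略在拟合数据集动作的同时,被鼓励输出高Q值动作。这里用简单MLP和随机张量代替论文中的Decision Transformer骨干、回报条件与真实数据,网络结构、维度与系数均为假设。

```python
import torch
import torch.nn as nn

# Hypothetical dimensions, for illustration only.
STATE_DIM, ACT_DIM, BATCH = 16, 1, 32

policy = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, ACT_DIM))
q1 = nn.Sequential(nn.Linear(STATE_DIM + ACT_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
q2 = nn.Sequential(nn.Linear(STATE_DIM + ACT_DIM, 64), nn.ReLU(), nn.Linear(64, 1))

def q_min(s, a):
    """Pessimistic double-Q style estimate; in practice both Q nets are trained by TD."""
    sa = torch.cat([s, a], dim=-1)
    return torch.min(q1(sa), q2(sa))

def qga_style_loss(s, a_data, lam=0.5):
    a_pred = policy(s)
    imitation = ((a_pred - a_data) ** 2).mean()    # behaviour-cloning term
    value_term = -q_min(s, a_pred).mean()          # push the policy toward high-Q actions
    return imitation + lam * value_term

s = torch.randn(BATCH, STATE_DIM)
a = torch.randn(BATCH, ACT_DIM)
loss = qga_style_loss(s, a)
loss.backward()
print(float(loss))
```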
【5】Scaling Laws of Machine Learning for Optimal Power Flow
标题:最优潮流机器学习的标度定律
链接:https://arxiv.org/abs/2601.02706
作者:Xinyi Liu,Xuan He,Yize Chen
备注:5 pages
摘要:最优潮流是电力系统运行的基本任务之一。虽然深度神经网络(DNN)等机器学习(ML)方法已被广泛研究以提高OPF解决方案的速度和性能,但其实际部署面临两个关键的扩展问题:可靠结果所需的最小训练数据量是多少?ML模型的复杂性应该如何平衡准确性和实时计算限制?现有的研究评估了离散的场景,而没有量化这些缩放关系,导致在现实世界的应用程序中基于试错的ML开发。这项工作提出了基于ML的OPF在两个维度上的第一个系统缩放研究:数据规模(0.1K-40 K训练样本)和计算规模(具有不同FLOP的多个NN架构)。我们的研究结果揭示了DNN和物理信息NN(PINN)在每个资源维度和三个核心性能指标(预测误差(MAE),约束违反和速度)之间的一致幂律关系。我们发现,对于ACOPF,准确性度量尺度与数据集大小和训练计算有关。这些缩放定律使OPF的可预测和原则性ML流水线设计成为可能。我们进一步确定预测精度和约束可行性之间的分歧,并表征计算最优前沿。该工作为ML-OPF的设计和部署提供了定量指导。
摘要:Optimal power flow (OPF) is one of the fundamental tasks for power system operations. While machine learning (ML) approaches such as deep neural networks (DNNs) have been widely studied to enhance OPF solution speed and performance, their practical deployment faces two critical scaling questions: What is the minimum training data volume required for reliable results? How should ML models' complexity balance accuracy with real-time computational limits? Existing studies evaluate discrete scenarios without quantifying these scaling relationships, leading to trial-and-error-based ML development in real-world applications. This work presents the first systematic scaling study for ML-based OPF across two dimensions: data scale (0.1K-40K training samples) and compute scale (multiple NN architectures with varying FLOPs). Our results reveal consistent power-law relationships on both DNNs and physics-informed NNs (PINNs) between each resource dimension and three core performance metrics: prediction error (MAE), constraint violations and speed. We find that for ACOPF, the accuracy metric scales with dataset size and training compute. These scaling laws enable predictable and principled ML pipeline design for OPF. We further identify the divergence between prediction accuracy and constraint feasibility and characterize the compute-optimal frontier. This work provides quantitative guidance for ML-OPF design and deployments.
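幂律规模定律的拟合本身很简单:在对数-对数空间做线性回归即可。下面的数值纯属虚构,仅演示如何从(训练样本量, MAE)数据点拟合 MAE ≈ a·N^(-b) 并外推。

```python
import numpy as np

# Hypothetical (dataset size, MAE) pairs; the values are illustrative only.
n   = np.array([100, 500, 1_000, 5_000, 10_000, 40_000])
mae = np.array([0.31, 0.17, 0.13, 0.071, 0.055, 0.033])

# Fit MAE ~ a * N^(-b) via linear regression in log-log space.
slope, log_a = np.polyfit(np.log(n), np.log(mae), 1)
a, b = np.exp(log_a), -slope
print(f"fitted scaling law: MAE ~ {a:.2f} * N^(-{b:.2f})")
print("predicted MAE at N=80k:", a * 80_000 ** (-b))
```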
【6】Polynomial Convergence of Riemannian Diffusion Models
标题:黎曼扩散模型的多项式收敛性
链接:https://arxiv.org/abs/2601.02499
作者:Xingyu Xu,Ziyi Zhang,Yorie Nakahira,Guannan Qu,Yuejie Chi
摘要:近年来,扩散模型在经验上取得了显著的成功,被认为是现代AI中最先进的生成模型之一。这些模型包括一个前向过程,它逐渐将数据分布扩散到跨越整个空间的噪声分布,以及一个后向过程,它将这种变换反转以从噪声中恢复数据分布。大多数现有的文献假设底层空间是欧几里得空间。然而,在许多实际应用中,数据被限制在欧氏空间的子流形上。为了解决这个问题,De Bortoli等人(2022)引入了黎曼扩散模型,并证明了使用指数小步长会在Wasserstein距离中产生小的抽样误差,前提是数据分布是平滑的,严格为正的,并且得分估计是$L_\infty$-准确的。在本文中,我们大大加强了这一理论的建立,下$L_2$-准确的得分估计,{\em多项式小步长}足以保证小的抽样误差的总变异距离,而不需要光滑或积极的数据分布。我们的分析只需要温和的和标准的曲率假设的基础流形。我们分析的主要内容是热核对数梯度的Li-Yau估计和扰动热方程的Minakshisundaram-Pleijel参数展开。我们的方法打开了大门,一个更清晰的分析非欧几里德空间的扩散模型。
摘要:Diffusion models have demonstrated remarkable empirical success in the recent years and are considered one of the state-of-the-art generative models in modern AI. These models consist of a forward process, which gradually diffuses the data distribution to a noise distribution spanning the whole space, and a backward process, which inverts this transformation to recover the data distribution from noise. Most of the existing literature assumes that the underlying space is Euclidean. However, in many practical applications, the data are constrained to lie on a submanifold of Euclidean space. Addressing this setting, De Bortoli et al. (2022) introduced Riemannian diffusion models and proved that using an exponentially small step size yields a small sampling error in the Wasserstein distance, provided the data distribution is smooth and strictly positive, and the score estimate is $L_\infty$-accurate. In this paper, we greatly strengthen this theory by establishing that, under $L_2$-accurate score estimate, a {\em polynomially small stepsize} suffices to guarantee small sampling error in the total variation distance, without requiring smoothness or positivity of the data distribution. Our analysis only requires mild and standard curvature assumptions on the underlying manifold. The main ingredients in our analysis are Li-Yau estimate for the log-gradient of heat kernel, and Minakshisundaram-Pleijel parametrix expansion of the perturbed heat equation. Our approach opens the door to a sharper analysis of diffusion models on non-Euclidean spaces.
【7】First Provably Optimal Asynchronous SGD for Homogeneous and Heterogeneous Data
标题:首个可证明最优的同构与异构数据异步SGD
链接:https://arxiv.org/abs/2601.02523
作者:Artavazd Maranjyan
备注:PhD thesis
摘要:人工智能通过使用数千个GPU或TPU在大规模数据集上训练的大型神经网络快速发展。这样的训练可能会占用整个数据中心数周,并需要大量的计算和能源资源。然而,这些运行背后的优化算法并没有跟上步伐。大多数大规模训练仍然依赖于同步方法,工作节点必须等待最慢的设备,浪费计算并放大硬件和网络可变性的影响。取消同步似乎是一个简单的解决办法,但异步会引入陈旧性,即更新是在过时的模型上计算的。这使得分析变得困难,特别是当延迟来自系统级随机性而不是算法选择时。因此,异步方法的时间复杂度仍然知之甚少。本文针对异步一阶随机优化问题,提出了一个严格的框架,重点研究了工作节点速度异构这一核心挑战。在这个框架内,我们表明,通过适当的设计,异步SGD可以实现最佳的时间复杂度,达到此前仅在同步方法中才有的保证。我们的第一个贡献,Ringmaster ASGD,通过选择性地丢弃陈旧的更新,在同构数据设置中达到最佳的时间复杂度。第二,Ringleader ASGD,使用结构化梯度表机制将最优性扩展到异构数据,这在联邦学习中很常见。最后,ATA通过学习工作节点的计算时间分布和自适应分配任务来提高资源效率,以更少的计算实现接近最佳的挂钟时间。总之,这些结果建立了异步优化作为分布式学习的理论上合理和实际有效的基础,表明没有同步的协调可以是可行的和最佳的。
摘要:Artificial intelligence has advanced rapidly through large neural networks trained on massive datasets using thousands of GPUs or TPUs. Such training can occupy entire data centers for weeks and requires enormous computational and energy resources. Yet the optimization algorithms behind these runs have not kept pace. Most large scale training still relies on synchronous methods, where workers must wait for the slowest device, wasting compute and amplifying the effects of hardware and network variability. Removing synchronization seems like a simple fix, but asynchrony introduces staleness, meaning updates computed on outdated models. This makes analysis difficult, especially when delays arise from system level randomness rather than algorithmic choices. As a result, the time complexity of asynchronous methods remains poorly understood. This dissertation develops a rigorous framework for asynchronous first order stochastic optimization, focusing on the core challenge of heterogeneous worker speeds. Within this framework, we show that with proper design, asynchronous SGD can achieve optimal time complexity, matching guarantees previously known only for synchronous methods. Our first contribution, Ringmaster ASGD, attains optimal time complexity in the homogeneous data setting by selectively discarding stale updates. The second, Ringleader ASGD, extends optimality to heterogeneous data, common in federated learning, using a structured gradient table mechanism. Finally, ATA improves resource efficiency by learning worker compute time distributions and allocating tasks adaptively, achieving near optimal wall clock time with less computation. Together, these results establish asynchronous optimization as a theoretically sound and practically efficient foundation for distributed learning, showing that coordination without synchronization can be both feasible and optimal.
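下面是一个玩具级的异步SGD模拟,示意"按陈旧度丢弃更新"这一思想(与Ringmaster ASGD的精神类似,但并非其算法的忠实实现):各工作节点速度异构,服务器只应用陈旧度不超过阈值的梯度。问题设置、学习率与阈值均为演示用的假设。

```python
import heapq
import itertools
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 10))
b = rng.normal(size=200)

def grad(x):                              # gradient of 0.5 * ||Ax - b||^2
    return A.T @ (A @ x - b)

NUM_WORKERS, STEPS, LR, MAX_STALENESS = 8, 400, 1e-3, 4
speeds = rng.uniform(0.5, 3.0, NUM_WORKERS)       # heterogeneous mean compute times
tiebreak = itertools.count()

x, version = np.zeros(10), 0
# Event queue: (finish_time, tiebreak, worker, model_version_used, gradient)
events = [(rng.exponential(speeds[w]), next(tiebreak), w, version, grad(x))
          for w in range(NUM_WORKERS)]
heapq.heapify(events)

applied = discarded = 0
while applied < STEPS:
    t, _, w, v_used, g = heapq.heappop(events)
    if version - v_used <= MAX_STALENESS:         # accept only sufficiently fresh updates
        x -= LR * g
        version += 1
        applied += 1
    else:
        discarded += 1                            # too stale: drop the update
    # The worker immediately starts computing a new gradient on the current model.
    heapq.heappush(events, (t + rng.exponential(speeds[w]), next(tiebreak),
                            w, version, grad(x)))

print(f"applied={applied}  discarded={discarded}  "
      f"loss={0.5 * np.sum((A @ x - b) ** 2):.3f}")
```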
预测|估计(7篇)
【1】Predicting Time Pressure of Powered Two-Wheeler Riders for Proactive Safety Interventions
标题:预测机动两轮车骑手的时间压力以进行主动安全干预
链接:https://arxiv.org/abs/2601.03173
作者:Sumit S. Shevtekar,Chandresh K. Maurya,Gourab Sil,Subasish Das
备注:13 pages, 8 figures
摘要:时间压力严重影响动力两轮车驾驶员的危险机动和碰撞倾向,但其预测在智能交通系统中仍有待研究。我们提出了一个大规模的数据集,129,000+标记的多变量时间序列序列,来自51名参与者在无,低和高时间压力条件下的153次骑行。每个序列捕获63个特征,涵盖车辆运动学,控制输入,行为违规和环境背景。我们的实证分析表明,与没有时间压力相比,高时间压力会导致48%的速度提高,36.4%的速度变化,58%的交叉路口危险转弯,36%的突然制动和50%的后制动力。为了对该数据集进行基准测试,我们提出了MotoTimePressure,这是一种结合了卷积预处理、双阶段时间注意力和挤压和激励特征重新校准的深度学习模型,实现了91.53%的准确率和98.93%的ROC AUC,优于8个基线。由于时间压力不能直接实时测量,我们证明了它的实用性在碰撞预测和阈值确定。使用MTPS预测的时间压力作为特征,将基于Informer的冲突风险准确率从91.25%提高到93.51%,接近Oracle性能(93.72%)。实时压力状态捕捉骑手的认知压力,并实现主动的ITS干预,包括自适应警报、触觉反馈、V2I信号和速度指导,在安全系统方法下支持更安全的两轮车移动性。
摘要:Time pressure critically influences risky maneuvers and crash proneness among powered two-wheeler riders, yet its prediction remains underexplored in intelligent transportation systems. We present a large-scale dataset of 129,000+ labeled multivariate time-series sequences from 153 rides by 51 participants under No, Low, and High Time Pressure conditions. Each sequence captures 63 features spanning vehicle kinematics, control inputs, behavioral violations, and environmental context. Our empirical analysis shows High Time Pressure induces 48% higher speeds, 36.4% greater speed variability, 58% more risky turns at intersections, 36% more sudden braking, and 50% higher rear brake forces versus No Time Pressure. To benchmark this dataset, we propose MotoTimePressure, a deep learning model combining convolutional preprocessing, dual-stage temporal attention, and Squeeze-and-Excitation feature recalibration, achieving 91.53% accuracy and 98.93% ROC AUC, outperforming eight baselines. Since time pressure cannot be directly measured in real time, we demonstrate its utility in collision prediction and threshold determination. Using MTPS-predicted time pressure as features, improves Informer-based collision risk accuracy from 91.25% to 93.51%, approaching oracle performance (93.72%). Thresholded time pressure states capture rider cognitive stress and enable proactive ITS interventions, including adaptive alerts, haptic feedback, V2I signaling, and speed guidance, supporting safer two-wheeler mobility under the Safe System Approach.
【2】Multi-Distribution Robust Conformal Prediction
标题:多分布稳健保形预测
链接:https://arxiv.org/abs/2601.02998
作者:Yuqi Yang,Ying Jin
摘要:在许多公平性和分布鲁棒性问题中,人们可以访问来自多个源分布的标记数据,但测试数据可能来自任意成员或它们的混合。我们研究的问题,构建一个共形的预测集,是均匀有效的多个,异构分布,在这个意义上说,无论哪个分布的测试点是从,预测集的覆盖率保证超过预先指定的水平。我们首先提出了一个最大p聚合方案,提供有限样本,多分布覆盖与每个分布相关联的任何一致性分数。在研究了几个效率优化方案的均匀覆盖,我们证明了我们的聚合方案的最优性和紧密性,并提出了一个通用的算法来学习一致性得分,导致有效的预测集后,在标准条件下的聚合。我们讨论了我们的框架如何与分组分布鲁棒优化,子种群转移,公平性和多源学习相关。在合成和真实数据的实验中,我们的方法提供了有效的最坏情况下的覆盖多个分布,同时大大减少了集的大小相比,天真地应用max-p聚合单源一致性得分,并可以在大小与流行的,标准的一致性得分的单源预测集。
摘要:In many fairness and distribution robustness problems, one has access to labeled data from multiple source distributions yet the test data may come from an arbitrary member or a mixture of them. We study the problem of constructing a conformal prediction set that is uniformly valid across multiple, heterogeneous distributions, in the sense that no matter which distribution the test point is from, the coverage of the prediction set is guaranteed to exceed a pre-specified level. We first propose a max-p aggregation scheme that delivers finite-sample, multi-distribution coverage given any conformity scores associated with each distribution. Upon studying several efficiency optimization programs subject to uniform coverage, we prove the optimality and tightness of our aggregation scheme, and propose a general algorithm to learn conformity scores that lead to efficient prediction sets after the aggregation under standard conditions. We discuss how our framework relates to group-wise distributionally robust optimization, sub-population shift, fairness, and multi-source learning. In synthetic and real-data experiments, our method delivers valid worst-case coverage across multiple distributions while greatly reducing the set size compared with naively applying max-p aggregation to single-source conformity scores, and can be comparable in size to single-source prediction sets with popular, standard conformity scores.
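下面的示意代码展示"max-p聚合"的基本流程:对每个候选输出值,在每个源分布的校准分数下计算共形p值,取最大值,只要超过显著性水平α就将该候选保留在预测集中。其中的数据生成方式、黑盒预测器与分数函数(绝对残差)都是为演示而做的假设。

```python
import numpy as np

rng = np.random.default_rng(0)

def make_source(n, slope, sigma):
    x = rng.normal(size=n)
    return x, slope * x + sigma * rng.normal(size=n)

# Two heterogeneous labeled source distributions (calibration data).
cal_sources = [make_source(300, 1.0, 0.5), make_source(300, 2.0, 1.0)]

def predict(x):                   # some fixed black-box point predictor
    return 1.5 * x

alpha = 0.1
# Per-source calibration scores: absolute residuals.
cal_scores = [np.abs(y - predict(x)) for x, y in cal_sources]

def max_p_prediction_set(x_test, y_grid):
    kept = []
    for y in y_grid:
        s = abs(y - predict(x_test))
        # Conformal p-value under each source, then aggregate with the maximum.
        p_vals = [(np.sum(sc >= s) + 1) / (len(sc) + 1) for sc in cal_scores]
        if max(p_vals) > alpha:
            kept.append(y)
    return (min(kept), max(kept)) if kept else None

print("prediction interval at x=1.0:",
      max_p_prediction_set(1.0, np.linspace(-6.0, 8.0, 400)))
```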
【3】Domain Generalization for Time Series: Enhancing Drilling Regression Models for Stick-Slip Index Prediction
标题:时间序列的域泛化:增强用于粘滑指数预测的钻井回归模型
链接:https://arxiv.org/abs/2601.02884
作者:Hana Yahia,Bruno Figliuzzi,Florent Di Meglio,Laurent Gerbaud,Stephane Menand,Mohamed Mahjoub
摘要:本文对应用于钻井场景时间序列数据的域泛化技术进行了全面比较,重点是预测连续的粘滑指数(SSI),这是评估钻头处井下扭转振动的关键指标。该研究旨在开发一种可跨域泛化的鲁棒回归模型,通过在60秒、1 Hz的带标注地面钻井数据序列上训练来预测SSI。该模型在与训练所用井不同的井上进行测试。为了微调模型架构,采用网格搜索方法优化关键超参数。本文对对抗域泛化(ADG)、不变风险最小化(IRM)和基线模型进行了比较分析,并评估了迁移学习(TL)在提升模型性能方面的有效性。ADG和IRM模型的性能分别比基线模型提高了10%和8%。最重要的是,严重事件的检测率达到60%,而基线模型仅为20%。总体而言,结果表明ADG和IRM模型均优于基线,且ADG模型略优于IRM模型。此外,将TL应用于预训练模型可进一步提升性能。我们的研究结果展示了域泛化方法在钻井应用中的潜力,其中ADG是最有效的方法。
摘要:This paper provides a comprehensive comparison of domain generalization techniques applied to time series data within a drilling context, focusing on the prediction of a continuous Stick-Slip Index (SSI), a critical metric for assessing torsional downhole vibrations at the drill bit. The study aims to develop a robust regression model that can generalize across domains by training on 60 second labeled sequences of 1 Hz surface drilling data to predict the SSI. The model is tested in wells that are different from those used during training. To fine-tune the model architecture, a grid search approach is employed to optimize key hyperparameters. A comparative analysis of the Adversarial Domain Generalization (ADG), Invariant Risk Minimization (IRM) and baseline models is presented, along with an evaluation of the effectiveness of transfer learning (TL) in improving model performance. The ADG and IRM models achieve performance improvements of 10% and 8%, respectively, over the baseline model. Most importantly, severe events are detected 60% of the time, against 20% for the baseline model. Overall, the results indicate that both ADG and IRM models surpass the baseline, with the ADG model exhibiting a slight advantage over the IRM model. Additionally, applying TL to a pre-trained model further improves performance. Our findings demonstrate the potential of domain generalization approaches in drilling applications, with ADG emerging as the most effective approach.
【4】Electricity Price Forecasting: Bridging Linear Models, Neural Networks and Online Learning
标题:电价预测:桥接线性模型、神经网络和在线学习
链接:https://arxiv.org/abs/2601.02856
作者:Btissame El Mahtout,Florian Ziel
摘要:准确的日前电价预测对于确保有效的投资组合管理、支持电厂运营的战略决策、实现高效的电池存储优化以及促进需求响应规划至关重要。然而,在不确定和波动的市场环境中,开发准确的预测模型具有很大的挑战性。例如,虽然线性模型通常在预测电价方面表现出具有竞争力的性能,但它们无法捕捉相关的非线性关系。另一方面,非线性模型可以提高预测精度,但计算成本会激增。我们提出了一种新的多元神经网络方法,结合线性和非线性前馈神经结构。与以前的混合模型不同,我们的方法集成了在线学习和预测组合,以提高训练效率和准确性。它还包含所有相关特征,特别是风能和太阳能发电、电力需求模式、相关能源燃料和碳市场产生的基本关系,以及自回归动态和日历效应。与当前最先进的基准模型相比,所提出的预测方法显着降低了计算成本,同时提供了卓越的预测精度(12-13%的RMSE和15-18%的MAE减少)。我们的研究结果来自对欧洲主要电力市场进行的六年预测研究。
摘要:Precise day-ahead forecasts for electricity prices are crucial to ensure efficient portfolio management, support strategic decision-making for power plant operations, enable efficient battery storage optimization, and facilitate demand response planning. However, developing an accurate prediction model is highly challenging in an uncertain and volatile market environment. For instance, although linear models generally exhibit competitive performance in predicting electricity prices with minimal computational requirements, they fail to capture relevant nonlinear relationships. Nonlinear models, on the other hand, can improve forecasting accuracy with a surge in computational costs. We propose a novel multivariate neural network approach that combines linear and nonlinear feed-forward neural structures. Unlike previous hybrid models, our approach integrates online learning and forecast combination for efficient training and accuracy improvement. It also incorporates all relevant characteristics, particularly the fundamental relationships arising from wind and solar generation, electricity demand patterns, related energy fuel and carbon markets, in addition to autoregressive dynamics and calendar effects. Compared to the current state-of-the-art benchmark models, the proposed forecasting method significantly reduces computational cost while delivering superior forecasting accuracy (12-13% RMSE and 15-18% MAE reductions). Our results are derived from a six-year forecasting study conducted on major European electricity markets.
【5】A Spatio-Temporal Deep Learning Approach For High-Resolution Gridded Monsoon Prediction
标题:用于高分辨率网格季风预测的时空深度学习方法
链接:https://arxiv.org/abs/2601.02445
作者:Parashjyoti Borah,Sanghamitra Sarkar,Ranjan Phukan
备注:8 pages, 3 figures, 2 Tables, to be submitted to "IEEE Transactions on Geoscience and Remote Sensing"
摘要:印度夏季季风(ISM)是一种严重的气候现象,从根本上影响着超过10亿人的农业、经济和水安全。传统的长期预测,无论是统计的还是动态的,主要集中在预测一个单一的,空间平均的季节值,缺乏区域一级资源管理所必需的空间细节。为了解决这一差距,我们引入了一个新的深度学习框架,将网格化季风预测重新定义为时空计算机视觉任务。我们把多变量,前季风大气和海洋场作为一个序列的多通道图像,有效地创建一个视频一样的输入张量。使用85年的ERA5再分析数据的预测和IMD降雨数据的目标,我们采用卷积神经网络(CNN)为基础的架构,学习复杂的映射从五个月前季风期(1月至5月)的高分辨率网格降雨模式,随后的季风季节。我们的框架成功地为四个季风月(6月至9月)中的每一个月以及总的季节平均值产生了不同的预测,证明了其对季节内和季节前景的实用性。
摘要:The Indian Summer Monsoon (ISM) is a critical climate phenomenon, fundamentally impacting the agriculture, economy, and water security of over a billion people. Traditional long-range forecasting, whether statistical or dynamical, has predominantly focused on predicting a single, spatially-averaged seasonal value, lacking the spatial detail essential for regional-level resource management. To address this gap, we introduce a novel deep learning framework that reframes gridded monsoon prediction as a spatio-temporal computer vision task. We treat multi-variable, pre-monsoon atmospheric and oceanic fields as a sequence of multi-channel images, effectively creating a video-like input tensor. Using 85 years of ERA5 reanalysis data for predictors and IMD rainfall data for targets, we employ a Convolutional Neural Network (CNN)-based architecture to learn the complex mapping from the five-month pre-monsoon period (January-May) to a high-resolution gridded rainfall pattern for the subsequent monsoon season. Our framework successfully produces distinct forecasts for each of the four monsoon months (June-September) as well as the total seasonal average, demonstrating its utility for both intra-seasonal and seasonal outlooks.
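下面是一个最小化的PyTorch示意:把多变量、多月份的前季风场堆叠成多通道"图像",用一个小型CNN映射为网格化降雨图(这里输出5个通道,对应6-9月各月及季节平均)。通道数、网格尺寸与网络结构均为假设,仅用来说明输入输出的组织方式,并非论文的实际架构。

```python
import torch
import torch.nn as nn

IN_CHANNELS, H, W = 5 * 4, 64, 96   # e.g. 5 variables x 4 pre-monsoon months (assumed)

class GriddedRainfallCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(IN_CHANNELS, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 5, kernel_size=1),   # Jun, Jul, Aug, Sep + seasonal mean
        )

    def forward(self, x):                      # x: (batch, IN_CHANNELS, H, W)
        return self.net(x)                     # (batch, 5, H, W) gridded rainfall

model = GriddedRainfallCNN()
pre_monsoon_fields = torch.randn(2, IN_CHANNELS, H, W)
print(model(pre_monsoon_fields).shape)         # torch.Size([2, 5, 64, 96])
```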
【6】SpikySpace: A Spiking State Space Model for Energy-Efficient Time Series Forecasting
标题:SpikySpace:用于节能时间序列预测的尖峰状态空间模型
链接:https://arxiv.org/abs/2601.02411
作者:Kaiwen Tang,Jiaqi Zheng,Yuze Jin,Yupeng Qiu,Guangda Sun,Zhanglu Yan,Weng-Fai Wong
备注:13 pages, 4 figures
摘要:时间序列预测通常在诸如流量管理、工业状态监控和设备上感测等领域中在紧张的功率和延迟预算下运行。这些应用通常需要边缘设备的近实时响应和低能耗。脉冲神经网络(SNN)通过利用时间稀疏性和无乘法计算提供事件驱动的计算和超低功耗。然而,现有的SNN为基础的时间序列预测器往往继承复杂的Transformer块,从而失去了很多的效率优势。为了解决这个问题,我们提出了SpikySpace,一个尖峰状态空间模型(SSM),通过选择性扫描将注意力块中的二次成本降低到线性时间。此外,我们用稀疏尖峰序列替换密集SSM更新,并仅对尖峰事件执行选择性扫描,从而避免密集乘法,同时保留SSM的结构化存储器。由于复杂的操作,如指数和除法在神经形态芯片上是昂贵的,我们引入了SiLU和Softplus的简化近似,以实现神经形态友好的模型架构。在匹配的设置中,SpikySpace与两种最先进的基于Transformer的方法(即iTransformer和iSpikformer)相比,分别将估计能耗降低了98.73%和96.24%。在标准时间序列预测数据集中,SpikySpace提供具有竞争力的准确性,同时大幅降低能源成本和内存流量。作为第一个完整的尖峰状态空间模型,SpikySpace将神经形态效率与现代序列建模联系起来,标志着通往高效时间序列预测系统的实用和可扩展的道路。
摘要:Time-series forecasting often operates under tight power and latency budgets in fields like traffic management, industrial condition monitoring, and on-device sensing. These applications frequently require near real-time responses and low energy consumption on edge devices. Spiking neural networks (SNNs) offer event-driven computation and ultra-low power by exploiting temporal sparsity and multiplication-free computation. Yet existing SNN-based time-series forecasters often inherit complex transformer blocks, thereby losing much of the efficiency benefit. To solve the problem, we propose SpikySpace, a spiking state-space model (SSM) that reduces the quadratic cost in the attention block to linear time via selective scanning. Further, we replace dense SSM updates with sparse spike trains and execute selective scans only on spike events, thereby avoiding dense multiplications while preserving the SSM's structured memory. Because complex operations such as exponentials and divisions are costly on neuromorphic chips, we introduce simplified approximations of SiLU and Softplus to enable a neuromorphic-friendly model architecture. In matched settings, SpikySpace reduces estimated energy consumption by 98.73% and 96.24% compared to two state-of-the-art transformer based approaches, namely iTransformer and iSpikformer, respectively. In standard time series forecasting datasets, SpikySpace delivers competitive accuracy while substantially reducing energy cost and memory traffic. As the first full spiking state-space model, SpikySpace bridges neuromorphic efficiency with modern sequence modeling, marking a practical and scalable path toward efficient time series forecasting systems.
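摘要提到用SiLU与Softplus的简化近似来避免神经形态芯片上昂贵的指数与除法运算。下面给出两个常见的、便于硬件实现的近似作为示意(hard-swish近似SiLU,三段线性函数近似Softplus),并非论文采用的具体近似形式。

```python
import numpy as np

def silu_hardswish(x):
    """Hard-swish, a common piecewise approximation of SiLU that avoids exp/div."""
    return x * np.clip(x + 3.0, 0.0, 6.0) / 6.0

def softplus_piecewise(x):
    """A crude three-piece linear softplus approximation (illustrative only)."""
    return np.maximum.reduce([np.zeros_like(x), 0.17 * x + 0.69, x])

x = np.linspace(-6, 6, 1001)
silu     = x / (1.0 + np.exp(-x))
softplus = np.log1p(np.exp(x))
print("max |SiLU - hard-swish|    :", np.max(np.abs(silu - silu_hardswish(x))))
print("max |softplus - piecewise| :", np.max(np.abs(softplus - softplus_piecewise(x))))
```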
【7】Fast Conformal Prediction using Conditional Interquantile Intervals
标题:基于条件分位数间区间的快速共形预测
链接:https://arxiv.org/abs/2601.02769
作者:Naixin Guo,Rui Luo,Zhixin Zhou
摘要:我们介绍了保形分位数间回归(CIR),保形回归方法,有效地构建保证覆盖率的近最小预测区间。CIR利用黑盒机器学习模型通过分位数间范围来估计结果分布,将这些估计值转换为紧凑的预测区间,同时实现近似的条件覆盖。我们进一步提出了CIR+(有更多比较的条件分位数间回归),它通过引入基于宽度的分位数间隔选择规则来增强CIR。这种改进产生更窄的预测区间,同时保持相当的覆盖率,但在计算时间略有增加的成本。这两种方法都解决了现有分布共形预测方法的关键限制:它们比共形分位数回归更有效地处理偏斜分布,并且通过消除直方图构建的需要,它们实现了比共形直方图回归更高的计算效率。在合成数据集和真实数据集上进行的大量实验表明,与现有方法相比,我们的方法在预测准确性和计算效率之间取得了最佳平衡。
摘要:We introduce Conformal Interquantile Regression (CIR), a conformal regression method that efficiently constructs near-minimal prediction intervals with guaranteed coverage. CIR leverages black-box machine learning models to estimate outcome distributions through interquantile ranges, transforming these estimates into compact prediction intervals while achieving approximate conditional coverage. We further propose CIR+ (Conditional Interquantile Regression with More Comparison), which enhances CIR by incorporating a width-based selection rule for interquantile intervals. This refinement yields narrower prediction intervals while maintaining comparable coverage, though at the cost of slightly increased computational time. Both methods address key limitations of existing distributional conformal prediction approaches: they handle skewed distributions more effectively than Conformalized Quantile Regression, and they achieve substantially higher computational efficiency than Conformal Histogram Regression by eliminating the need for histogram construction. Extensive experiments on synthetic and real-world datasets demonstrate that our methods optimally balance predictive accuracy and computational efficiency compared to existing approaches.
其他神经网络|深度学习|模型|建模(5篇)
【1】Learning to Act Robustly with View-Invariant Latent Actions
标题:利用视角不变的潜在动作学习鲁棒行为
链接:https://arxiv.org/abs/2601.02994
作者:Youngjoon Jeong,Junha Chun,Taesup Kim
备注:Website: https://joon-stack.github.io/VILA/
摘要:基于视觉的机器人策略通常会遇到即使是微小的视点变化,强调需要视图不变的视觉表示。这一挑战在现实环境中变得更加突出,在现实环境中,观点的变化是不可避免的,并且可能会严重破坏政策性能。现有的方法通常从场景级别的多视图观测中学习不变性,但这种方法依赖于视觉外观,并且未能结合鲁棒泛化所必需的物理动力学。我们提出了视图不变的潜在动作(VILA),它模拟了一个潜在的动作捕捉过渡模式的轨迹,以学习基于物理动力学的视图不变的表示。VILA使用基于地面事实动作序列的动作指导目标,在不同的视角下对齐这些潜在动作。模拟和现实世界中的实验表明,基于VILA的策略有效地推广到看不见的观点,并很好地转移到新的任务,建立VILA作为一个强大的预训练框架,提高鲁棒性和下游学习性能。
摘要:Vision-based robotic policies often struggle with even minor viewpoint changes, underscoring the need for view-invariant visual representations. This challenge becomes more pronounced in real-world settings, where viewpoint variability is unavoidable and can significantly disrupt policy performance. Existing methods typically learn invariance from multi-view observations at the scene level, but such approaches rely on visual appearance and fail to incorporate the physical dynamics essential for robust generalization. We propose View-Invariant Latent Action (VILA), which models a latent action capturing transition patterns across trajectories to learn view-invariant representations grounded in physical dynamics. VILA aligns these latent actions across viewpoints using an action-guided objective based on ground-truth action sequences. Experiments in both simulation and the real world show that VILA-based policies generalize effectively to unseen viewpoints and transfer well to new tasks, establishing VILA as a strong pretraining framework that improves robustness and downstream learning performance.
【2】Stratified Hazard Sampling: Minimal-Variance Event Scheduling for CTMC/DTMC Discrete Diffusion and Flow Models
标题:分层危险抽样:CTMC/DTMC离散扩散和流模型的最小方差事件调度
链接:https://arxiv.org/abs/2601.02799
作者:Seunghwan Jang,SooJean Han
备注:Work in progress. Feedback welcome
摘要:基于CTMC/DTMC的离散生成模型,包括均匀噪声离散扩散(例如D3PM/CTDD)和离散流匹配,通过时间非齐次马尔可夫过程反复替换令牌来实现非自回归序列生成。推理通常通过基于步骤的模拟来实现:在每个离散化步骤中,每个令牌通过独立的伯努利(或分类)抽样决定是否跳跃。在均匀噪声初始化下,自校正需要对每个位置进行多次编辑,这些独立决策会在编辑的数量和时间上引入显著方差,导致典型的失败模式,例如编辑不足(残留噪声)或过度编辑(级联式的不必要替换),降低了可复现性。我们提出了分层危险抽样(SHS),这是一种即插即用、无超参数的推理原则,适用于任何允许"保留-替换"分解的采样器。SHS将每个令牌的编辑建模为由累积危险(CTMC)或累积跳跃质量(DTMC)驱动的事件,并通过对该累积量进行分层来放置事件:每个位置具有单个随机相位,每当其累积危险越过单位间隔阈值时,令牌就会跳跃。这保留了期望跳跃次数,同时在无偏整数估计量中达到最小可能方差(以1/4为界),且不改变每次跳跃的目标采样,从而保留了多模态性。我们还针对黑名单式词汇约束引入了一种相位分配变体,优先在高风险位置进行早期编辑,以缓解后期掩码带来的伪影。
摘要:CTMC/DTMC-based discrete generative models, including uniform-noise discrete diffusion (e.g., D3PM/CTDD) and discrete flow matching, enable non-autoregressive sequence generation by repeatedly replacing tokens through a time-inhomogeneous Markov process. Inference is typically implemented with step-based simulation: each token decides to jump via independent Bernoulli (or categorical) draws at every discretization step. Under uniform-noise initialization, where self-correction requires multiple edits per position, these independent decisions induce substantial variance in both the number and timing of edits, leading to characteristic failure modes such as under-editing (residual noise) or over-editing (cascading unnecessary substitutions), decreasing reproducibility. We propose Stratified Hazard Sampling (SHS), a drop-in and hyperparameter-free inference principle for any sampler that admits a stay-vs.-replace decomposition. SHS models per-token edits as events driven by cumulative hazard (CTMC) or cumulative jump mass (DTMC) and places events by stratifying this cumulative quantity: with a single random phase per position, a token jumps whenever its accumulated hazard crosses unit-spaced thresholds. This preserves the expected number of jumps while achieving the minimum possible variance among unbiased integer estimators (bounded by 1/4), without altering per-jump destination sampling and thus retaining multimodality. We also introduce a phase-allocation variant for blacklist-style lexical constraints that prioritizes early edits at high-risk positions to mitigate late-masking artifacts.
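下面的数值小实验按摘要的描述示意"分层危险抽样"的核心机制:每个位置只抽一个随机相位,当累积危险越过等间距阈值时触发一次跳跃;与逐步独立伯努利抽样相比,每条序列的编辑次数方差大幅降低。其中的危险率日程仅为演示用的假设。

```python
import numpy as np

rng = np.random.default_rng(0)
L, T = 8, 50                                           # sequence length, decode steps
hazard = np.tile(np.linspace(0.16, 0.01, T), (L, 1))   # assumed per-step hazard schedule
p_jump = 1.0 - np.exp(-hazard)                         # exact per-step jump probability

def independent_jumps():
    return rng.random((L, T)) < p_jump                 # baseline: i.i.d. Bernoulli decisions

def stratified_jumps():
    phase = rng.random((L, 1))                         # one random phase per position
    H = np.cumsum(hazard, axis=1)                      # cumulative hazard
    counts = np.floor(H + phase)                       # crossings of unit-spaced thresholds
    prev = np.concatenate([np.zeros((L, 1)), counts[:, :-1]], axis=1)
    return counts > prev                               # jump whenever a threshold is crossed

n_ind = np.array([independent_jumps().sum() for _ in range(2000)])
n_str = np.array([stratified_jumps().sum() for _ in range(2000)])
print(f"independent: mean {n_ind.mean():.2f}  var {n_ind.var():.2f}")
print(f"stratified : mean {n_str.mean():.2f}  var {n_str.var():.2f}")
```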
【3】Variational (Energy-Based) Spectral Learning: A Machine Learning Framework for Solving Partial Differential Equations
标题:变分(基于能量的)谱学习:一种用于求解偏微分方程的机器学习框架
链接:https://arxiv.org/abs/2601.02492
作者:M. M. Hammad
摘要:我们介绍变分谱学习(VSL),这是一种用于求解偏微分方程(PDE)的机器学习框架,直接在谱展开的系数空间中操作。VSL提供了变分偏微分方程理论,谱离散化和当代机器学习实践之间的原则性桥梁。其核心思想是将给定的PDE \[ \mathcal{L}u = f \quad \text{in} \quad Q=Ω\times(0,T)\]连同边界和初始条件一起重铸为由强形式最小二乘残差和弱(Galerkin)公式构建的可微时空能量。解表示为有限谱展开\[ u_N(x,t)=\sum_{n=1}^{N} c_n\,φ_n(x,t),\]其中$φ_n$是空间和时间上的张量积Chebyshev基,满足Dirichlet空间模的齐次边界条件解析地强制执行.这在系数向量$\mathbf{c}$中产生紧凑的线性参数化,而所有PDE复杂性被吸收到变分能量中。我们展示了如何构造强形式和弱形式的时空泛函,用初始条件和吉洪诺夫正则化项来增强它们,并用基于梯度的优化来最小化所得到的目标。在实践中,VSL在TensorFlow中使用自动微分和Keras余弦衰减与重新启动学习率计划来实现,从而实现中等大小的系数向量的鲁棒优化。基准椭圆和抛物问题,包括一维和二维泊松,扩散和Burgers型方程的数值实验表明,VSL达到的精度与经典的谱配置与Crank-Nicolson时间步进,同时提供了一个可微的目标,适合现代优化工具。
摘要:We introduce variational spectral learning (VSL), a machine learning framework for solving partial differential equations (PDEs) that operates directly in the coefficient space of spectral expansions. VSL offers a principled bridge between variational PDE theory, spectral discretization, and contemporary machine learning practice. The core idea is to recast a given PDE \[ \mathcal{L}u = f \quad \text{in} \quad Q=Ω\times(0,T), \] together with boundary and initial conditions, into differentiable space--time energies built from strong-form least-squares residuals and weak (Galerkin) formulations. The solution is represented as a finite spectral expansion \[ u_N(x,t)=\sum_{n=1}^{N} c_n\,φ_n(x,t), \] where $φ_n$ are tensor-product Chebyshev bases in space and time, with Dirichlet-satisfying spatial modes enforcing homogeneous boundary conditions analytically. This yields a compact linear parameterization in the coefficient vector $\mathbf{c}$, while all PDE complexity is absorbed into the variational energy. We show how to construct strong-form and weak-form space-time functionals, augment them with initial-condition and Tikhonov regularization terms, and minimize the resulting objective with gradient-based optimization. In practice, VSL is implemented in TensorFlow using automatic differentiation and Keras cosine-decay-with-restarts learning-rate schedules, enabling robust optimization of moderately sized coefficient vectors. Numerical experiments on benchmark elliptic and parabolic problems, including one- and two-dimensional Poisson, diffusion, and Burgers-type equations, demonstrate that VSL attains accuracy comparable to classical spectral collocation with Crank-Nicolson time stepping, while providing a differentiable objective suitable for modern optimization tooling.
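下面用一个一维泊松方程的小例子示意VSL的基本流程:取满足齐次Dirichlet边界条件的切比雪夫基 φ_n = T_{n+2} - T_n,把强形式最小二乘能量写成系数向量c的函数并最小化(该能量对c是二次的,这里直接求最小二乘解;换成梯度下降循环同样可行)。问题设置与基的具体选取是示意性的,细节以论文为准。

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb

# Solve -u'' = f on [-1, 1], u(+-1) = 0, with f = pi^2 sin(pi x)  (exact u = sin(pi x)).
N = 12                                            # number of spectral modes
xs = np.cos(np.pi * (np.arange(64) + 0.5) / 64)   # Chebyshev collocation points
f = np.pi ** 2 * np.sin(np.pi * xs)

def basis(n):
    """Dirichlet-satisfying basis phi_n = T_{n+2} - T_n (vanishes at x = +-1)."""
    c = np.zeros(n + 3)
    c[n + 2], c[n] = 1.0, -1.0
    return cheb.Chebyshev(c)

Phi   = np.stack([basis(n)(xs) for n in range(N)], axis=1)           # u_N values
D2Phi = np.stack([basis(n).deriv(2)(xs) for n in range(N)], axis=1)  # second derivatives

# Strong-form least-squares energy E(c) = ||-D2Phi c - f||^2 is quadratic in c,
# so its minimizer can be read off directly (a gradient loop would also work).
c, *_ = np.linalg.lstsq(-D2Phi, f, rcond=None)
u_hat = Phi @ c
print("max error vs sin(pi x):", np.max(np.abs(u_hat - np.sin(np.pi * xs))))
```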
【4】Quantifying Quanvolutional Neural Networks Robustness for Speech in Healthcare Applications
标题:量化量子卷积神经网络在医疗保健应用中的语音鲁棒性
链接:https://arxiv.org/abs/2601.02432
作者:Ha Tran,Bipasha Kashyap,Pubudu N. Pathirana
摘要:基于语音的机器学习系统对噪声敏感,使情感识别和语音病理检测的可靠部署复杂化。我们评估了混合量子机器学习模型的鲁棒性,quanvolutional神经网络(QNN)对经典卷积神经网络(CNN)在四种声学腐败(高斯噪声,音调偏移,时间偏移和速度变化)下的干净训练/腐败测试制度。使用AVFAD(语音病理学)和TESS(语音情感),我们将三种QNN模型(随机,基本,强)与简单的CNN基线(CNN-Base),ResNet-18和VGG-16进行了比较,使用准确性和腐败指标(CE,mCE,RCE,RmCE),并分析了架构因素(电路复杂性或深度,收敛性)以及每个情感的鲁棒性。QNN通常在音调偏移、时间偏移和速度变化下优于CNN-Base(在严重的时间偏移下,CE/RCE降低高达22%),而CNN-Base对高斯噪声保持更强的弹性。在量子电路中,QNN-Basic在AVFAD上实现了最好的整体鲁棒性,QNN-Random在TESS上表现最强。恐惧是最强大的(在严重腐败下的准确率为80-90%),中性可以在强高斯噪声下崩溃(5.5%的准确率),而快乐最容易受到音调,时间和速度失真的影响。QNN的收敛速度也比CNN-Base快6倍。据我们所知,这是对常见非对抗性声学破坏下的QNN语音鲁棒性的系统研究,表明浅纠缠量子前端可以提高噪声弹性,而对加性噪声的敏感性仍然是一个挑战。
摘要:Speech-based machine learning systems are sensitive to noise, complicating reliable deployment in emotion recognition and voice pathology detection. We evaluate the robustness of a hybrid quantum machine learning model, quanvolutional neural networks (QNNs) against classical convolutional neural networks (CNNs) under four acoustic corruptions (Gaussian noise, pitch shift, temporal shift, and speed variation) in a clean-train/corrupted-test regime. Using AVFAD (voice pathology) and TESS (speech emotion), we compare three QNN models (Random, Basic, Strongly) to a simple CNN baseline (CNN-Base), ResNet-18 and VGG-16 using accuracy and corruption metrics (CE, mCE, RCE, RmCE), and analyze architectural factors (circuit complexity or depth, convergence) alongside per-emotion robustness. QNNs generally outperform the CNN-Base under pitch shift, temporal shift, and speed variation (up to 22% lower CE/RCE at severe temporal shift), while the CNN-Base remains more resilient to Gaussian noise. Among quantum circuits, QNN-Basic achieves the best overall robustness on AVFAD, and QNN-Random performs strongest on TESS. Emotion-wise, fear is most robust (80-90% accuracy under severe corruptions), neutral can collapse under strong Gaussian noise (5.5% accuracy), and happy is most vulnerable to pitch, temporal, and speed distortions. QNNs also converge up to six times faster than the CNN-Base. To our knowledge, this is a systematic study of QNN robustness for speech under common non-adversarial acoustic corruptions, indicating that shallow entangling quantum front-ends can improve noise resilience while sensitivity to additive noise remains a challenge.
【5】NitroGen: An Open Foundation Model for Generalist Gaming Agents
标题:NitroGen:多面手游戏代理的开放基金会模型
链接:https://arxiv.org/abs/2601.02427
作者:Loïc Magne,Anas Awadalla,Guanzhi Wang,Yinzhen Xu,Joshua Belofsky,Fengyuan Hu,Joohwan Kim,Ludwig Schmidt,Georgia Gkioxari,Jan Kautz,Yisong Yue,Yejin Choi,Yuke Zhu,Linxi "Jim" Fan
备注:16 pages, 7 figures
摘要:我们介绍了NitroGen,这是一个面向通用游戏智能体的视觉-动作基础模型,该模型在1,000多个游戏、共40,000小时的游戏视频上训练。我们结合了三个关键成分:1)通过从公开的游戏视频中自动提取玩家动作构建的互联网规模视频-动作数据集,2)可以衡量跨游戏泛化能力的多游戏基准环境,以及3)通过大规模行为克隆训练的统一视觉-动作模型。NitroGen在不同领域表现出强大的能力,包括3D动作游戏中的战斗遭遇、2D平台游戏中的高精度控制,以及程序化生成世界中的探索。它可以有效迁移到未见过的游戏,与从头训练的模型相比,任务成功率相对提高了52%。我们发布了数据集、评估套件和模型权重,以推进通用具身智能体的研究。
摘要:We introduce NitroGen, a vision-action foundation model for generalist gaming agents that is trained on 40,000 hours of gameplay videos across more than 1,000 games. We incorporate three key ingredients: 1) an internet-scale video-action dataset constructed by automatically extracting player actions from publicly available gameplay videos, 2) a multi-game benchmark environment that can measure cross-game generalization, and 3) a unified vision-action model trained with large-scale behavior cloning. NitroGen exhibits strong competence across diverse domains, including combat encounters in 3D action games, high-precision control in 2D platformers, and exploration in procedurally generated worlds. It transfers effectively to unseen games, achieving up to 52% relative improvement in task success rates over models trained from scratch. We release the dataset, evaluation suite, and model weights to advance research on generalist embodied agents.
其他(22篇)
【1】From Entropy to Epiplexity: Rethinking Information for Computationally Bounded Intelligence
标题:从熵到泛性:重新思考计算有界智能的信息
链接:https://arxiv.org/abs/2601.03220
作者:Marc Finzi,Shikai Qiu,Yiding Jiang,Pavel Izmailov,J. Zico Kolter,Andrew Gordon Wilson
摘要:我们能从数据中学到比生成过程本身更多的东西吗?仅仅通过对现有数据进行确定性变换,就能构造出新的有用信息吗?数据中的可学习内容是否可以在不考虑下游任务的情况下进行评估?在这些问题上,香农信息和柯尔莫哥洛夫复杂性几乎无能为力,部分原因是它们假设观察者具有无限的计算能力,并且未能刻画有用的信息内容。在这项工作中,我们指出并澄清了信息论中三个看似矛盾的结论:(1)确定性变换不能增加信息;(2)信息与数据的顺序无关;(3)似然建模仅仅是分布匹配。为了阐明这些结论与现代实践之间的张力并量化数据的价值,我们引入epiplexity,它形式化地刻画了计算受限的观察者能够从数据中学到的信息。Epiplexity捕获数据中的结构性内容,同时排除时间有界熵,即以伪随机数生成器和混沌动力系统为代表的随机不可预测内容。借助这些概念,我们展示了信息如何通过计算被创造出来、它如何依赖于数据的排序,以及似然建模如何产生比数据生成过程本身更复杂的程序。我们还提出了估计epiplexity的实用方法,并表明它能刻画不同数据源之间的差异、与下游性能保持一致,并能凸显提升分布外泛化的数据集干预。与模型选择的原则不同,epiplexity为数据选择提供了理论基础,指导如何为学习系统选择、生成或变换数据。
摘要:Can we learn more from data than existed in the generating process itself? Can new and useful information be constructed from merely applying deterministic transformations to existing data? Can the learnable content in data be evaluated without considering a downstream task? On these questions, Shannon information and Kolmogorov complexity come up nearly empty-handed, in part because they assume observers with unlimited computational capacity and fail to target the useful information content. In this work, we identify and exemplify three seeming paradoxes in information theory: (1) information cannot be increased by deterministic transformations; (2) information is independent of the order of data; (3) likelihood modeling is merely distribution matching. To shed light on the tension between these results and modern practice, and to quantify the value of data, we introduce epiplexity, a formalization of information capturing what computationally bounded observers can learn from data. Epiplexity captures the structural content in data while excluding time-bounded entropy, the random unpredictable content exemplified by pseudorandom number generators and chaotic dynamical systems. With these concepts, we demonstrate how information can be created with computation, how it depends on the ordering of the data, and how likelihood modeling can produce more complex programs than present in the data generating process itself. We also present practical procedures to estimate epiplexity which we show capture differences across data sources, track with downstream performance, and highlight dataset interventions that improve out-of-distribution generalization. In contrast to principles of model selection, epiplexity provides a theoretical foundation for data selection, guiding how to select, generate, or transform data for learning systems.
【2】Critic-Guided Reinforcement Unlearning in Text-to-Image Diffusion
标题:文本到图像扩散中的评论家引导强化去学习
链接:https://arxiv.org/abs/2601.03213
作者:Mykola Vysotskyi,Zahar Kohut,Mariia Shpir,Taras Rumezhak,Volodymyr Karpiv
备注:Preprint. Under review at ICLR 2026
摘要:文本到图像扩散模型中的机器去学习旨在删除目标概念,同时保留整体效用。先前的扩散去学习方法通常依赖于有监督的权重编辑或全局惩罚;强化学习(RL)方法虽然灵活,但通常优化稀疏的轨迹末端奖励,产生高方差更新和较弱的信用分配。我们提出了一个用于扩散去学习的通用RL框架,该框架将去噪视为顺序决策过程,并引入了具有带噪步骤奖励的时间步感知评论家(critic)。具体地说,我们在带噪潜变量上训练一个基于CLIP的奖励预测器,并利用其逐步信号为反向扩散核的策略梯度更新计算优势估计。我们的算法实现简单,支持离策略重用,并可直接接入标准的文本到图像骨干网络。在多个概念上,该方法实现了优于或相当于强基线的遗忘效果,同时保持图像质量和良性提示的保真度;消融实验表明(i)逐步评论家和(ii)噪声条件奖励是稳定性和有效性的关键。我们发布代码和评估脚本,以促进基于RL的扩散去学习的可复现性和未来研究。
摘要:Machine unlearning in text-to-image diffusion models aims to remove targeted concepts while preserving overall utility. Prior diffusion unlearning methods typically rely on supervised weight edits or global penalties; reinforcement-learning (RL) approaches, while flexible, often optimize sparse end-of-trajectory rewards, yielding high-variance updates and weak credit assignment. We present a general RL framework for diffusion unlearning that treats denoising as a sequential decision process and introduces a timestep-aware critic with noisy-step rewards. Concretely, we train a CLIP-based reward predictor on noisy latents and use its per-step signal to compute advantage estimates for policy-gradient updates of the reverse diffusion kernel. Our algorithm is simple to implement, supports off-policy reuse, and plugs into standard text-to-image backbones. Across multiple concepts, the method achieves better or comparable forgetting to strong baselines while maintaining image quality and benign prompt fidelity; ablations show that (i) per-step critics and (ii) noisy-conditioned rewards are key to stability and effectiveness. We release code and evaluation scripts to facilitate reproducibility and future research on RL-based diffusion unlearning.
【3】Empowering Reliable Visual-Centric Instruction Following in MLLMs
标题:在MLLM中实现可靠的以视觉为中心的指令跟随
链接:https://arxiv.org/abs/2601.03198
作者:Weilei He,Feng Ju,Zhiyuan Fan,Rui Min,Minhao Cheng,Yi R. Fung
备注:Submitted to ARR Jan
摘要:评估多模态大型语言模型(MLLM)的指令跟随(IF)能力，对于严格衡量模型输出是否忠实于用户指定的意图至关重要。然而，现有评估MLLM指令跟随能力的基准主要集中在文本模态中的文字指令上。这些局限阻碍了对指令跟随能力的全面分析，因为它们忽略了嵌入在语义丰富的视觉模态中的隐含约束。为了弥补这一空白，我们引入了VC-IFEval，这是一个新的基准，并配套了系统构建的数据集，用于评估MLLM在多模态设置下的指令跟随能力。我们的基准系统地将依赖视觉的约束纳入指令设计，从而能够更严格、更细粒度地评估MLLM如何使其输出与视觉输入和文本指令保持一致。此外，通过在我们的数据集上微调MLLM，我们在视觉指令跟随的准确性和依从性上取得了实质性提升。通过对具有代表性的MLLM进行广泛评估，我们为当前模型的优势与局限提供了新的见解。
摘要:Evaluating the instruction-following (IF) capabilities of Multimodal Large Language Models (MLLMs) is essential for rigorously assessing how faithfully model outputs adhere to user-specified intentions. Nevertheless, existing benchmarks for evaluating MLLMs' instruction-following capability primarily focus on verbal instructions in the textual modality. These limitations hinder a thorough analysis of instruction-following capabilities, as they overlook the implicit constraints embedded in the semantically rich visual modality. To address this gap, we introduce VC-IFEval, a new benchmark accompanied by a systematically constructed dataset that evaluates MLLMs' instruction-following ability under multimodal settings. Our benchmark systematically incorporates vision-dependent constraints into instruction design, enabling a more rigorous and fine-grained assessment of how well MLLMs align their outputs with both visual input and textual instructions. Furthermore, by fine-tuning MLLMs on our dataset, we achieve substantial gains in visual instruction-following accuracy and adherence. Through extensive evaluation across representative MLLMs, we provide new insights into the strengths and limitations of current models.
【4】Can Embedding Similarity Predict Cross-Lingual Transfer? A Systematic Study on African Languages
标题:嵌入相似性可以预测跨语言迁移吗?非洲语言的系统研究
链接:https://arxiv.org/abs/2601.03168
作者:Tewodros Kederalah Idris,Prasenjit Mitra,Roald Eiselen
备注:13 pages, 1 figure, 19 tables
摘要:跨语言迁移对于为低资源非洲语言构建NLP系统至关重要，但从业者缺乏选择源语言的可靠方法。我们在816个迁移实验中系统地评估了5个嵌入相似性指标，这些实验涵盖3个NLP任务、3个以非洲语言为中心的多语言模型以及来自4个语系的12种语言。我们发现，余弦间隔和基于检索的指标(P@1、CSLS)能够可靠地预测迁移成功($ρ = 0.4$-$0.6$)，而CKA的预测能力可以忽略不计($ρ \approx 0.1$)。至关重要的是，当跨模型汇总时相关性符号会反转(辛普森悖论，Simpson's Paradox)，因此从业者必须对每个模型分别验证。嵌入指标达到了与URIEL语言类型学相当的预测能力。我们的结果为源语言选择提供了具体指导，并突出了针对特定模型进行分析的重要性。
摘要:Cross-lingual transfer is essential for building NLP systems for low-resource African languages, but practitioners lack reliable methods for selecting source languages. We systematically evaluate five embedding similarity metrics across 816 transfer experiments spanning three NLP tasks, three African-centric multilingual models, and 12 languages from four language families. We find that cosine gap and retrieval-based metrics (P@1, CSLS) reliably predict transfer success ($ρ= 0.4-0.6$), while CKA shows negligible predictive power ($ρ\approx 0.1$). Critically, correlation signs reverse when pooling across models (Simpson's Paradox), so practitioners must validate per-model. Embedding metrics achieve comparable predictive power to URIEL linguistic typology. Our results provide concrete guidance for source language selection and highlight the importance of model-specific analysis.
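As a rough illustration of the kind of similarity signals this abstract refers to, the sketch below computes a cosine gap and retrieval P@1 between two sets of sentence embeddings with NumPy. The exact metric definitions used in the paper are not given in the abstract, so the formulas here (aligned-pair minus random-pair cosine; nearest-neighbour retrieval accuracy) are assumptions for illustration only.
```python
# Illustrative sketch (not the paper's exact implementation): given sentence
# embeddings of parallel text in a source and a target language from one
# multilingual encoder, compute an assumed "cosine gap" and retrieval P@1.
import numpy as np

def _normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def cosine_gap(src, tgt, rng=None):
    """Assumed definition: mean aligned-pair cosine minus mean random-pair cosine."""
    rng = rng or np.random.default_rng(0)
    src, tgt = _normalize(src), _normalize(tgt)
    aligned = np.mean(np.sum(src * tgt, axis=1))           # cosine of matched rows
    perm = rng.permutation(len(tgt))
    random_pairs = np.mean(np.sum(src * tgt[perm], axis=1))
    return aligned - random_pairs

def precision_at_1(src, tgt):
    """P@1: fraction of source sentences whose nearest target is the aligned one."""
    sims = _normalize(src) @ _normalize(tgt).T              # (n, n) cosine matrix
    return float(np.mean(np.argmax(sims, axis=1) == np.arange(len(src))))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    shared = rng.normal(size=(200, 64))
    src = shared + 0.1 * rng.normal(size=shared.shape)      # toy "source language"
    tgt = shared + 0.1 * rng.normal(size=shared.shape)      # toy "target language"
    print(cosine_gap(src, tgt), precision_at_1(src, tgt))
```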
【5】Rapid Augmentations for Time Series (RATS): A High-Performance Library for Time Series Augmentation
标题:时间序列快速增强(RATS):用于时间序列增强的高性能库
链接:https://arxiv.org/abs/2601.03159
作者:Wadie Skaf,Felix Kern,Aryamaan Basu Roy,Tejas Pradhan,Roman Kalkreuth,Holger Hoos
摘要:时间序列增强对于训练强大的深度学习模型至关重要，特别是在标注数据稀缺且获取成本高昂的领域。然而，现有的时间序列增强库(主要用Python编写)存在性能瓶颈，运行时间随数据集规模的增加呈指数级增长，这限制了它们在大规模生产级系统中的适用性。我们介绍RATS(Rapid Augmentations for Time Series)，这是一个用Rust编写、带有Python绑定(RATSpy)的高性能时间序列增强库。RATS实现了多种增强方法，涵盖基本变换、频域操作和时间规整技术，所有方法都可以通过带有内置并行化的统一管道接口访问。在143个数据集上将RATSpy与常用库(tsaug)进行的全面基准测试表明，RATSpy相比tsaug平均加速74.5%(在大型数据集上高达94.8%)，峰值内存使用量最多减少47.9%。
摘要:Time series augmentation is critical for training robust deep learning models, particularly in domains where labelled data is scarce and expensive to obtain. However, existing augmentation libraries for time series, mainly written in Python, suffer from performance bottlenecks, where running time grows exponentially as dataset sizes increase -- an aspect limiting their applicability in large-scale, production-grade systems. We introduce RATS (Rapid Augmentations for Time Series), a high-performance library for time series augmentation written in Rust with Python bindings (RATSpy). RATS implements multiple augmentation methods spanning basic transformations, frequency-domain operations and time warping techniques, all accessible through a unified pipeline interface with built-in parallelisation. Comprehensive benchmarking of RATSpy versus a commonly used library (tsaug) on 143 datasets demonstrates that RATSpy achieves an average speedup of 74.5% over tsaug (up to 94.8% on large datasets), with up to 47.9% less peak memory usage.
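The RATSpy interface itself is not described in this abstract, so the following is only a generic NumPy sketch of one augmentation family it mentions (time warping), to illustrate the kind of per-series transform such a library parallelises; the function names and parameters here are illustrative and are not the RATSpy API.
```python
# Minimal sketch of a time-warping augmentation: warp the time axis with a
# smooth random monotone mapping, then resample the series at the warped times.
import numpy as np

def time_warp(series, n_knots=4, strength=0.2, rng=None):
    rng = rng or np.random.default_rng()
    n = len(series)
    knots = np.linspace(0, n - 1, n_knots + 2)
    offsets = rng.normal(scale=strength * n / (n_knots + 1), size=knots.shape)
    offsets[0] = offsets[-1] = 0.0                   # keep the endpoints fixed
    warped_knots = np.sort(knots + offsets)          # enforce monotonicity
    warped_time = np.interp(np.arange(n), knots, warped_knots)
    return np.interp(warped_time, np.arange(n), series)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.sin(np.linspace(0, 6 * np.pi, 500)) + 0.05 * rng.normal(size=500)
    x_aug = time_warp(x, rng=rng)
    print(x.shape, x_aug.shape)
```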
【6】One Sample to Rule Them All: Extreme Data Efficiency in RL Scaling
标题:统治它们的一个示例:RL扩展中的极高数据效率
链接:https://arxiv.org/abs/2601.03111
作者:Yiyuan Li,Zhen Huang,Yanan Wu,Weixun Wang,Xuefeng Li,Yijia Luo,Wenbo Su,Bo Zheng,Pengfei Liu
摘要:大型语言模型(LLM)的推理能力可以通过强化学习(RL)来释放(OpenAI, 2024; DeepSeek-AI等人, 2025a; Zeng等人, 2025)。LLM中现有RL尝试的成功通常依赖于数千个乃至更多的高质量样本。在本文中，我们通过展示单样本学习的显著有效性，挑战了关于LLM强化学习数据需求的基本假设。具体来说，我们提出了博学者学习(polymath learning)，这是一个设计单个训练样本以激发多学科影响的框架。我们给出三个关键发现:(1)一个经过策略性选择的数学推理样本，可以在RL下于物理、化学和生物等多个领域带来显著的性能提升;(2)与推理最相关的数学技能揭示了最优博学样本应具备的特征;(3)整合多学科元素的工程化合成样本，优于使用自然出现的单个样本进行训练。我们的方法在各种推理基准上取得了优于使用更大数据集训练的性能，表明样本的质量与设计而非数量，可能才是解锁语言模型推理能力的关键。我们的结果表明应当转向一种被称为样本工程(sample engineering)的范式，即对训练样本进行精确设计，而不是简单地增加数据量。
摘要:The reasoning ability of large language models (LLMs) can be unleashed with reinforcement learning (RL) (OpenAI, 2024; DeepSeek-AI et al., 2025a; Zeng et al., 2025). The success of existing RL attempts in LLMs usually relies on high-quality samples of thousands or beyond. In this paper, we challenge fundamental assumptions about data requirements in RL for LLMs by demonstrating the remarkable effectiveness of one-shot learning. Specifically, we introduce polymath learning, a framework for designing one training sample that elicits multidisciplinary impact. We present three key findings: (1) A single, strategically selected math reasoning sample can produce significant performance improvements across multiple domains, including physics, chemistry, and biology with RL; (2) The math skills salient to reasoning suggest the characteristics of the optimal polymath sample; and (3) An engineered synthetic sample that integrates multidiscipline elements outperforms training with individual samples that naturally occur. Our approach achieves superior performance to training with larger datasets across various reasoning benchmarks, demonstrating that sample quality and design, rather than quantity, may be the key to unlock enhanced reasoning capabilities in language models. Our results suggest a shift, dubbed as sample engineering, toward precision engineering of training samples rather than simply increasing data volume.
【7】Time-Aware Synthetic Control
标题:时间感知综合控制
链接:https://arxiv.org/abs/2601.03099
作者:Saeyoung Rho,Cyrus Illick,Samhitha Narasipura,Alberto Abadie,Daniel Hsu,Vishal Misra
摘要:合成控制(SC)框架被广泛用于时间序列面板数据的观察性因果推断。SC在多种应用中取得了成功，但现有方法通常将干预前时间索引的顺序视为可互换。这种不变性意味着，当存在强烈趋势时，它们可能无法充分利用时间结构。我们提出了时间感知合成控制(TASC)，它采用带有恒定趋势的状态空间模型，同时保留信号的低秩结构。TASC使用卡尔曼滤波器和Rauch-Tung-Striebel平滑器:它首先用期望最大化拟合一个生成式时间序列模型，然后执行反事实推断。我们在模拟和真实世界数据集上评估了TASC，包括政策评估和体育预测。我们的结果表明，TASC在具有强时间趋势和高观测噪声水平的环境中具有优势。
摘要:The synthetic control (SC) framework is widely used for observational causal inference with time-series panel data. SC has been successful in diverse applications, but existing methods typically treat the ordering of pre-intervention time indices as interchangeable. This invariance means they may not fully take advantage of temporal structure when strong trends are present. We propose Time-Aware Synthetic Control (TASC), which employs a state-space model with a constant trend while preserving a low-rank structure of the signal. TASC uses the Kalman filter and Rauch-Tung-Striebel smoother: it first fits a generative time-series model with expectation-maximization and then performs counterfactual inference. We evaluate TASC on both simulated and real-world datasets, including policy evaluation and sports prediction. Our results suggest that TASC offers advantages in settings with strong temporal trends and high levels of observation noise.
【8】PiDR: Physics-Informed Inertial Dead Reckoning for Autonomous Platforms
标题:PiDR:自主平台的物理知情惯性航位推算
链接:https://arxiv.org/abs/2601.03040
作者:Arup Kumar Sahoo,Itzik Klein
备注:11 pages and 7 figures
摘要:完全自主的一个基本要求是在没有外部数据(如全球导航卫星系统信号或视觉信息)的情况下维持准确导航的能力。在这些具有挑战性的环境中,平台必须完全依赖惯性传感器,从而实现纯惯性导航。然而,惯性传感器的固有噪声和其他误差项在这样的真实世界场景中将导致导航解决方案随时间漂移。尽管传统的深度学习模型已经成为惯性导航的一种可能方法,但它们本质上是黑箱。此外,它们很难在有限的监督传感器数据下有效地学习,并且往往无法保持物理原理。为了解决这些局限性,我们提出了PiDR,一个物理信息的惯性航位推算框架,用于纯惯性导航情况下的自主平台。PiDR通过将惯性导航原理明确集成到网络训练过程中,通过物理信息残差组件提供透明度。PiDR在减轻突然的轨迹偏差方面发挥着至关重要的作用,即使在有限或稀疏的监督下。我们评估了PiDR的移动机器人和自主水下航行器收集的真实世界的数据集。我们在两个数据集中获得了超过29%的定位改进,证明了PiDR在各种环境和动态下推广不同平台的能力。因此,PiDR提供了一个强大的,轻量级的,但有效的架构,并可以部署在资源有限的平台,使实时纯惯性导航在不利的情况下。
摘要:A fundamental requirement for full autonomy is the ability to sustain accurate navigation in the absence of external data, such as GNSS signals or visual information. In these challenging environments, the platform must rely exclusively on inertial sensors, leading to pure inertial navigation. However, the inherent noise and other error terms of the inertial sensors in such real-world scenarios will cause the navigation solution to drift over time. Although conventional deep-learning models have emerged as a possible approach to inertial navigation, they are inherently black-box in nature. Furthermore, they struggle to learn effectively with limited supervised sensor data and often fail to preserve physical principles. To address these limitations, we propose PiDR, a physics-informed inertial dead-reckoning framework for autonomous platforms in situations of pure inertial navigation. PiDR offers transparency by explicitly integrating inertial navigation principles into the network training process through the physics-informed residual component. PiDR plays a crucial role in mitigating abrupt trajectory deviations even under limited or sparse supervision. We evaluated PiDR on real-world datasets collected by a mobile robot and an autonomous underwater vehicle. We obtained more than 29% positioning improvement in both datasets, demonstrating the ability of PiDR to generalize different platforms operating in various environments and dynamics. Thus, PiDR offers a robust, lightweight, yet effective architecture and can be deployed on resource-constrained platforms, enabling real-time pure inertial navigation in adverse scenarios.
【9】Bridging Mechanistic Interpretability and Prompt Engineering with Gradient Ascent for Interpretable Persona Control
标题:通过梯度上升连接机械可解释性与提示工程以实现可解释的角色控制
链接:https://arxiv.org/abs/2601.02896
作者:Harshvardhan Saini,Yiming Tang,Dianbo Liu
摘要:控制紧急行为角色(例如,大语言模型(LLM)中的奉承、幻觉(如奉承、幻觉)对AI安全至关重要,但仍然是一个持续的挑战。现有的解决方案面临着一个困境:手动提示工程是直观的,但不可扩展和不精确,而自动优化方法是有效的,但作为“黑匣子”操作,没有可解释的连接到模型内部。我们提出了一个新的框架,适应梯度上升到LLM,使有针对性的及时发现。具体而言,我们提出了两种方法,RESGA和SAEGA,都优化随机初始化提示,以实现更好的对齐表示与确定的人物角色方向。我们引入流畅的梯度上升来控制发现的人物角色转向提示的流畅性。我们证明了RESGA和SAEGA在Llama 3.1,Qwen 2.5和Gemma 3中的有效性,用于指导三种不同的角色,奉承,幻觉和近视奖励。最重要的是,在阿谀奉承方面,我们自动发现的提示获得了显著的改善(49.90%比79.24%)。通过在机械上有意义的功能接地及时发现,我们的方法提供了一个新的范式可控和可解释的行为修改。
摘要:Controlling emergent behavioral personas (e.g., sycophancy, hallucination) in Large Language Models (LLMs) is critical for AI safety, yet remains a persistent challenge. Existing solutions face a dilemma: manual prompt engineering is intuitive but unscalable and imprecise, while automatic optimization methods are effective but operate as "black boxes" with no interpretable connection to model internals. We propose a novel framework that adapts gradient ascent to LLMs, enabling targeted prompt discovery. Specifically, we propose two methods, RESGA and SAEGA, that both optimize randomly initialized prompts to achieve representations better aligned with an identified persona direction. We introduce fluent gradient ascent to control the fluency of discovered persona steering prompts. We demonstrate RESGA and SAEGA's effectiveness across Llama 3.1, Qwen 2.5, and Gemma 3 for steering three different personas: sycophancy, hallucination, and myopic reward. Crucially, on sycophancy, our automatically discovered prompts achieve significant improvement (49.90% compared with 79.24%). By grounding prompt discovery in mechanistically meaningful features, our method offers a new paradigm for controllable and interpretable behavior modification.
【10】RPIQ: Residual-Projected Multi-Collaboration Closed-Loop and Single Instance Quantization for Visually Impaired Assistance
标题:RPIQ:用于视觉障碍辅助的残差投影多协作闭环和单实例量化
链接:https://arxiv.org/abs/2601.02888
作者:Xuanyu Wang,Haisen Su,Jingtao Zhang,Xiangxiang Wang,Yongbin Yu,Manping Fan,Bo Gong,Siqi Chen,Mingsheng Cao,Liyong Ren
摘要:视障用户在日常信息获取和实时环境感知方面面临重大挑战,迫切需要具有准确识别能力的智能辅助系统。虽然大规模的模型提供了有效的解决方案的感知和推理,其实际部署的辅助设备严重限制了过多的内存消耗和高推理成本。此外,现有的量化策略往往忽略了块间误差积累,导致模型稳定性下降。为了解决这些问题,提出了一种新的量化框架--残差投影多协作闭环单实例量化(RPIQ),其量化过程采用基于单实例校正和高斯-赛德尔迭代量化的多协作闭环补偿方案。在各种类型的大规模模型上进行的实验,包括OPT、Qwen和LLaMA等语言模型,以及CogVLM 2等视觉语言模型,表明RPIQ可以将模型压缩到4位表示,同时显著降低峰值内存消耗(与原始全精度模型相比,减少约60%-75%)。该方法在多种语言和视觉任务中保持高度接近全精度模型的性能,并在复杂场景中的文本理解和视觉问答等关键应用中表现出出色的识别和推理能力。在验证RPIQ在实际辅助系统中部署的有效性的同时,本研究还提高了大型模型的计算效率和可靠性,使其能够准确快速地为视障用户提供所需的信息。
摘要:Visually impaired users face significant challenges in daily information access and real-time environmental perception, and there is an urgent need for intelligent assistive systems with accurate recognition capabilities. Although large-scale models provide effective solutions for perception and reasoning, their practical deployment on assistive devices is severely constrained by excessive memory consumption and high inference costs. Moreover, existing quantization strategies often ignore inter-block error accumulation, leading to degraded model stability. To address these challenges, this study proposes a novel quantization framework -- Residual-Projected Multi-Collaboration Closed-Loop and Single Instance Quantization(RPIQ), whose quantization process adopts a multi-collaborative closed-loop compensation scheme based on Single Instance Calibration and Gauss-Seidel Iterative Quantization. Experiments on various types of large-scale models, including language models such as OPT, Qwen, and LLaMA, as well as vision-language models such as CogVLM2, demonstrate that RPIQ can compress models to 4-bit representation while significantly reducing peak memory consumption (approximately 60%-75% reduction compared to original full-precision models). The method maintains performance highly close to full-precision models across multiple language and visual tasks, and exhibits excellent recognition and reasoning capabilities in key applications such as text understanding and visual question answering in complex scenarios. While verifying the effectiveness of RPIQ for deployment in real assistive systems, this study also advances the computational efficiency and reliability of large models, enabling them to provide visually impaired users with the required information accurately and rapidly.
【11】Quantum-Enhanced Neural Contextual Bandit Algorithms
标题:量子增强神经上下文盗贼算法
链接:https://arxiv.org/abs/2601.02870
作者:Yuqi Huang,Vincent Y. F Tan,Sharu Theresa Jose
备注:30 pages, under review
摘要:随机上下文老虎机是顺序决策的基础，但对现有的基于神经网络的算法构成了重大挑战，尤其是在扩展到量子神经网络(QNN)时，会遇到大规模过参数化、计算不稳定性和贫瘠高原现象等问题。本文介绍了量子神经正切核-上置信界(QNTK-UCB)算法，这是一种利用量子神经正切核(QNTK)来解决这些限制的新算法。通过在随机初始化处冻结QNN并将其静态QNTK用作岭回归的核，QNTK-UCB绕过了显式参数化量子电路训练中固有的不稳定训练动态，同时充分利用了独特的量子归纳偏置。对于时间范围$T$和$K$个动作，我们的理论分析表明，QNTK-UCB的参数规模要求显著改善为$Ω((TK)^3)$，与经典NeuralUCB算法在类似遗憾保证下所需的$Ω((TK)^8)$相比大幅降低。在非线性合成基准和量子原生变分量子本征求解器任务上的经验评估表明，QNTK-UCB在低数据环境下具有优越的样本效率。这项工作突出了QNTK的固有属性如何提供隐式正则化和更尖锐的谱衰减，为在线学习中实现"量子优势"铺平了道路。
摘要:Stochastic contextual bandits are fundamental for sequential decision-making but pose significant challenges for existing neural network-based algorithms, particularly when scaling to quantum neural networks (QNNs) due to issues such as massive over-parameterization, computational instability, and the barren plateau phenomenon. This paper introduces the Quantum Neural Tangent Kernel-Upper Confidence Bound (QNTK-UCB) algorithm, a novel algorithm that leverages the Quantum Neural Tangent Kernel (QNTK) to address these limitations. By freezing the QNN at a random initialization and utilizing its static QNTK as a kernel for ridge regression, QNTK-UCB bypasses the unstable training dynamics inherent in explicit parameterized quantum circuit training while fully exploiting the unique quantum inductive bias. For a time horizon $T$ and $K$ actions, our theoretical analysis reveals a significantly improved parameter scaling of $Ω((TK)^3)$ for QNTK-UCB, a substantial reduction compared to $Ω((TK)^8)$ required by classical NeuralUCB algorithms for similar regret guarantees. Empirical evaluations on non-linear synthetic benchmarks and quantum-native variational quantum eigensolver tasks demonstrate QNTK-UCB's superior sample efficiency in low-data regimes. This work highlights how the inherent properties of QNTK provide implicit regularization and a sharper spectral decay, paving the way for achieving ``quantum advantage'' in online learning.
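The core loop this abstract describes, ridge regression with a fixed kernel plus a UCB exploration bonus, can be sketched classically as below. An RBF kernel stands in for the static QNTK (whose evaluation would require simulating the quantum circuit), and the constants and reward model are illustrative rather than the paper's.
```python
# Sketch of a kernel ridge-regression + UCB bandit round with a frozen kernel.
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_ucb_round(contexts, X_hist, y_hist, lam=1.0, beta=1.0):
    """Score each candidate context with posterior mean + beta * uncertainty."""
    if len(X_hist) == 0:
        return np.random.randint(len(contexts))
    K = rbf_kernel(X_hist, X_hist) + lam * np.eye(len(X_hist))
    K_inv = np.linalg.inv(K)
    k_star = rbf_kernel(contexts, X_hist)                    # (n_arms, t)
    mean = k_star @ K_inv @ y_hist
    var = rbf_kernel(contexts, contexts).diagonal() - np.einsum(
        "ij,jk,ik->i", k_star, K_inv, k_star)
    ucb = mean + beta * np.sqrt(np.maximum(var, 0.0))
    return int(np.argmax(ucb))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_hist, y_hist = np.empty((0, 4)), np.empty(0)
    for t in range(50):
        arms = rng.normal(size=(5, 4))                       # K=5 candidate contexts
        a = kernel_ucb_round(arms, X_hist, y_hist)
        reward = np.tanh(arms[a, 0]) + 0.1 * rng.normal()    # toy reward function
        X_hist = np.vstack([X_hist, arms[a]])
        y_hist = np.append(y_hist, reward)
    print("rounds played:", len(y_hist))
```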
【12】COFFEE: COdesign Framework for Feature Enriched Embeddings in Ads-Ranking Systems
标题:COFFEE:广告排名系统中功能丰富嵌入的协同设计框架
链接:https://arxiv.org/abs/2601.02807
作者:Sohini Roychowdhury,Doris Wang,Qian Ge,Joy Mu,Srihari Reddy
备注:4 pages, 5 figures, 1 table
摘要:多样化和丰富的数据源对于商业广告推荐模型来说是必不可少的,可以在用户参与内容之前和之后准确评估用户兴趣。虽然扩展的用户参与历史可以改善对用户兴趣的预测,但同样重要的是,根据缩放定律原理,嵌入来自多个源的活动序列以确保用户和广告表示的新鲜度。在本文中,我们提出了一种新的三维框架,用于增强用户广告表示,而不增加模型推理或服务的复杂性。第一个维度考察了整合不同事件源的影响,第二个维度考虑了较长用户历史的好处,第三个维度侧重于使用额外的事件属性和多模态嵌入来丰富数据。我们通过比较自然的用户参与来源(如内容观看)与广告印象来源来评估我们的来源丰富框架的投资回报率(ROI)。与有机使用源相比,所提出的方法可以将广告印象源的曲线下面积(AUC)和缩放曲线的斜率提高1.56至2倍,即使对于100至10K的短在线序列长度。此外,当使用丰富的广告印象事件源时,点击率(CTR)预测比基线生产广告推荐系统提高了0.56% AUC,从而改善了更长和离线用户广告表示的序列缩放分辨率。
摘要:Diverse and enriched data sources are essential for commercial ads-recommendation models to accurately assess user interest both before and after engagement with content. While extended user-engagement histories can improve the prediction of user interests, it is equally important to embed activity sequences from multiple sources to ensure freshness of user and ad-representations, following scaling law principles. In this paper, we present a novel three-dimensional framework for enhancing user-ad representations without increasing model inference or serving complexity. The first dimension examines the impact of incorporating diverse event sources, the second considers the benefits of longer user histories, and the third focuses on enriching data with additional event attributes and multi-modal embeddings. We assess the return on investment (ROI) of our source enrichment framework by comparing organic user engagement sources, such as content viewing, with ad-impression sources. The proposed method can boost the area under curve (AUC) and the slope of scaling curves for ad-impression sources by 1.56 to 2 times compared to organic usage sources even for short online-sequence lengths of 100 to 10K. Additionally, click-through rate (CTR) prediction improves by 0.56% AUC over the baseline production ad-recommendation system when using enriched ad-impression event sources, leading to improved sequence scaling resolutions for longer and offline user-ad representations.
【13】Scalable Tree Ensemble Proximities in Python
标题:Python中可扩展的树集成邻近度
链接:https://arxiv.org/abs/2601.02735
作者:Adrien Aumon,Guy Wolf,Kevin R. Moon,Jake S. Rhodes
摘要:树集成方法,如随机森林自然地通过其决策树结构诱导监督相似性度量,但现有的实现从树集成获得的接近度通常遭受二次时间或内存复杂性,限制了它们的可扩展性。在这项工作中,我们引入了一个通用的框架,通过定义一个家庭的可分离加权叶碰撞邻近计算效率。我们表明,在这个家庭的任何接近措施承认一个确切的稀疏矩阵分解,限制计算叶级碰撞,避免明确的成对比较。这种公式可以在Python中使用稀疏线性代数实现低内存、可扩展的邻近计算。经验基准测试表明,与传统方法相比,运行时间和内存都有了很大的改进,允许树集成接近度在标准CPU硬件上有效地扩展到具有数十万样本的数据集。
摘要:Tree ensemble methods such as Random Forests naturally induce supervised similarity measures through their decision tree structure, but existing implementations of proximities derived from tree ensembles typically suffer from quadratic time or memory complexity, limiting their scalability. In this work, we introduce a general framework for efficient proximity computation by defining a family of Separable Weighted Leaf-Collision Proximities. We show that any proximity measure in this family admits an exact sparse matrix factorization, restricting computation to leaf-level collisions and avoiding explicit pairwise comparisons. This formulation enables low-memory, scalable proximity computation using sparse linear algebra in Python. Empirical benchmarks demonstrate substantial runtime and memory improvements over traditional approaches, allowing tree ensemble proximities to scale efficiently to datasets with hundreds of thousands of samples on standard CPU hardware.
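A minimal sketch of the leaf-collision factorization idea using scikit-learn and SciPy sparse matrices: encode each sample's leaf membership per tree as a sparse indicator matrix B, so the plain (unweighted) forest proximity is B B^T / T with no explicit pairwise loop. The paper's weighted proximity family and its library are more general; this only illustrates the unweighted case.
```python
# Build a sparse (n_samples x total_leaves) leaf-indicator matrix B from a
# fitted random forest; proximity(i, j) = fraction of trees where i and j
# share a leaf = (B @ B.T / n_trees)[i, j].
import numpy as np
from scipy import sparse
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

def leaf_indicator(forest, X):
    leaves = forest.apply(X)                     # (n_samples, n_trees) leaf ids
    n, T = leaves.shape
    col_index = np.empty_like(leaves)
    offset = 0
    for t in range(T):
        uniq, inv = np.unique(leaves[:, t], return_inverse=True)
        col_index[:, t] = inv + offset           # give each tree its own column block
        offset += len(uniq)
    rows = np.repeat(np.arange(n), T)
    data = np.ones(n * T)
    return sparse.csr_matrix((data, (rows, col_index.ravel())), shape=(n, offset))

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
B = leaf_indicator(rf, X)
prox = (B @ B.T) / rf.n_estimators               # sparse proximity matrix
print(prox.shape, prox.diagonal()[:3])           # diagonal entries are 1.0
```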
【14】CRoPE: Efficient Parametrization of Rotary Positional Embedding
标题:CRoPE:旋转位置嵌入的高效参数化
链接:https://arxiv.org/abs/2601.02728
作者:Beicheng Lou,Zifei Xu
摘要:旋转位置嵌入已经成为基于Transformer的模型中编码位置信息的最先进方法。虽然它通常用复线性代数简洁地表达，但我们注意到，$Q/K/V$投影的实际实现并不等价于复线性变换。我们认为复线性变换是更自然的参数化，并可在注意力块内节省近50%的参数。实验表明，去除这种冗余对模型在样本内和样本外的性能影响可以忽略不计。我们的修改实现了更高效的参数使用，以及对表示空间更清晰的解释。
摘要:Rotary positional embedding has become the state-of-the-art approach to encode position information in transformer-based models. While it is often succinctly expressed in complex linear algebra, we note that the actual implementation of $Q/K/V$-projections is not equivalent to a complex linear transformation. We argue that complex linear transformation is a more natural parametrization and saves nearly 50% of the parameters within the attention block. We show empirically that removing such redundancy has negligible impact on the model performance both in sample and out of sample. Our modification achieves more efficient parameter usage, as well as a cleaner interpretation of the representation space.
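A small NumPy sketch of the parameter-counting argument: pairing the d real feature dimensions into d/2 complex ones lets a complex (d/2 x d/2) projection use d^2/2 real parameters instead of d^2, and the rotary phase becomes a complex multiplication. Shapes, frequencies, and the projection itself are illustrative, not the paper's implementation.
```python
# Complex-valued projection followed by a RoPE-style rotation, in NumPy.
import numpy as np

d, seq_len = 8, 5                                # d must be even
rng = np.random.default_rng(0)

# Complex projection weight: 2 * (d/2)^2 = d^2 / 2 real parameters.
W = rng.normal(size=(d // 2, d // 2)) + 1j * rng.normal(size=(d // 2, d // 2))

x = rng.normal(size=(seq_len, d))
x_c = x[:, 0::2] + 1j * x[:, 1::2]               # pair dimensions -> complex features

freqs = 1.0 / (10000 ** (np.arange(d // 2) / (d // 2)))
phase = np.exp(1j * np.outer(np.arange(seq_len), freqs))   # rotary phase e^{i m theta}

q_c = (x_c @ W.T) * phase                        # complex projection, then rotate
q_real = np.stack([q_c.real, q_c.imag], axis=-1).reshape(seq_len, d)
print(q_real.shape)                              # (5, 8), ready for attention
```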
【15】MAFS: Multi-head Attention Feature Selection for High-Dimensional Data via Deep Fusion of Filter Methods
标题:MAFS:通过滤波方法的深度融合进行高维数据的多头注意力特征选择
链接:https://arxiv.org/abs/2601.02668
作者:Xiaoyan Sun,Qingyu Meng,Yalu Wen
摘要:特征选择对于高维生物医学数据至关重要,可以实现更强的预测性能,降低计算成本,并提高精准医学应用的可解释性。现有的办法面临着显著的挑战。过滤器方法具有高度可伸缩性,但无法捕获复杂的关系或消除冗余。基于深度学习的方法可以对非线性模式进行建模,但通常缺乏稳定性、可解释性和大规模效率。单头注意力提高了可解释性,但在捕获多级依赖关系方面受到限制,并且对初始化保持敏感,从而降低了可重复性。大多数现有的方法很少将统计可解释性与深度学习的代表性能力结合起来,特别是在超高维环境中。在这里,我们介绍了MAFS(Multi-head Attention-based Feature Selection),这是一个将统计先验与深度学习功能集成在一起的混合框架。MAFS从基于滤波器的先验开始,用于稳定的初始化和指导学习。然后,它使用多头注意力从多个角度并行检查特征,捕捉复杂的非线性关系和相互作用。最后,一个重新排序模块整合了注意力头的输出,解决冲突并最大限度地减少信息损失,以生成鲁棒和一致的特征排名。这种设计将统计指导与深度建模能力相结合,产生可解释的重要性分数,同时最大限度地保留信息信号。在包括癌症基因表达和阿尔茨海默病数据在内的模拟和真实数据集上,与现有的基于过滤器和基于深度学习的替代方案相比,MAFS始终实现了卓越的覆盖率和稳定性,为高维生物医学数据中的特征选择提供了可扩展,可解释和强大的解决方案。
摘要:Feature selection is essential for high-dimensional biomedical data, enabling stronger predictive performance, reduced computational cost, and improved interpretability in precision medicine applications. Existing approaches face notable challenges. Filter methods are highly scalable but cannot capture complex relationships or eliminate redundancy. Deep learning-based approaches can model nonlinear patterns but often lack stability, interpretability, and efficiency at scale. Single-head attention improves interpretability but is limited in capturing multi-level dependencies and remains sensitive to initialization, reducing reproducibility. Most existing methods rarely combine statistical interpretability with the representational power of deep learning, particularly in ultra-high-dimensional settings. Here, we introduce MAFS (Multi-head Attention-based Feature Selection), a hybrid framework that integrates statistical priors with deep learning capabilities. MAFS begins with filter-based priors for stable initialization and guide learning. It then uses multi-head attention to examine features from multiple perspectives in parallel, capturing complex nonlinear relationships and interactions. Finally, a reordering module consolidates outputs across attention heads, resolving conflicts and minimizing information loss to generate robust and consistent feature rankings. This design combines statistical guidance with deep modeling capacity, yielding interpretable importance scores while maximizing retention of informative signals. Across simulated and real-world datasets, including cancer gene expression and Alzheimer's disease data, MAFS consistently achieves superior coverage and stability compared with existing filter-based and deep learning-based alternatives, offering a scalable, interpretable, and robust solution for feature selection in high-dimensional biomedical data.
【16】Prioritized Replay for RL Post-training
标题:RL训练后优先重播
链接:https://arxiv.org/abs/2601.02648
作者:Mehdi Fatemi
摘要:我们为大型语言模型的RL后训练引入了一个问题级优先级框架。基于深度强化学习中优先回放的见解，以及此前的观察——即在GRPO等方法下成功率居中的推出往往产生更强的学习信号——我们的方法根据由经验成功率统计得出的简单、模型驱动的优先级分数来选择问题。与在训练早期强调较容易任务的传统课程策略不同，由此产生的调度自然地将训练集中在那些既未被持续解决、也未持续失败的问题上，同时降低那些几乎不提供梯度信息的问题的优先级。该方法形成一个持续自适应的自动优先级过程，不需要预定义的难度层级、辅助预测器或外部标签。我们进一步引入了便于实际部署的轻量级机制，包括基于堆的优先采样，以及对已解决和未解决问题的定期重测，以缓解饥饿和遗忘。总体而言，该方法为人工设计的课程提供了一种有原则且可扩展的替代方案，同时使数据选择直接与基于GRPO的后训练动态保持一致。
摘要:We introduce a problem-level prioritization framework for RL post-training of large language models. Building on insights from prioritized replay in deep RL, as well as prior observations that rollouts with intermediate success rates tend to produce stronger learning signals under methods such as GRPO, our approach selects problems according to a simple, model-driven priority score derived from empirical success statistics. In contrast to conventional curriculum strategies that emphasize easier tasks early in training, the resulting schedule naturally focuses training on problems that are neither consistently solved nor consistently failed, while deprioritizing those that contribute little gradient information. The method yields a continuously adapting and automatic prioritization process that requires no predefined difficulty tiers, auxiliary predictors, or external labels. We further introduce lightweight mechanisms for practical deployment, including heap-based prioritized sampling and periodic retesting of solved and unsolved problems to mitigate starvation and forgetting. Overall, the approach offers a principled and scalable alternative to manually designed curricula while aligning data selection directly with the dynamics of GRPO-based post-training.
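A sketch of heap-based prioritized problem selection under an assumed priority score p(1-p) computed from empirical success rates, which matches this abstract's emphasis on intermediate-success problems; the paper's exact score and its retesting schedule may differ.
```python
# Max-heap over problems keyed by an assumed priority p * (1 - p), where p is
# the empirical success rate; problems near p = 0.5 are drawn most often.
import heapq
import random

class ProblemPool:
    def __init__(self, problem_ids):
        # Negate priorities because heapq is a min-heap; start everyone at p = 0.5.
        self.stats = {pid: [1, 2] for pid in problem_ids}   # [successes, attempts]
        self.heap = [(-0.25, pid) for pid in problem_ids]
        heapq.heapify(self.heap)

    def pop_problem(self):
        _, pid = heapq.heappop(self.heap)
        return pid

    def report(self, pid, solved):
        s, n = self.stats[pid]
        s, n = s + int(solved), n + 1
        self.stats[pid] = [s, n]
        p = s / n
        heapq.heappush(self.heap, (-(p * (1.0 - p)), pid))  # re-insert with new priority

pool = ProblemPool(range(100))
for step in range(300):
    pid = pool.pop_problem()
    solved = random.random() < 0.3 + 0.004 * pid            # toy rollout outcome
    pool.report(pid, solved)
print("most attempted problem:", max(pool.stats.items(), key=lambda kv: kv[1][1])[0])
```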
【17】Credit Assignment via Neural Manifold Noise Correlation
标题:通过神经流形噪声相关进行信用分配
链接:https://arxiv.org/abs/2601.02636
作者:Byungwoo Kang,Maceo Richards,Bernardo Sabatini
摘要:信用分配——即单个神经元和突触的变化如何影响网络的输出——是大脑和机器学习的核心。噪声相关方法通过将活动扰动与输出变化相关联来估计梯度，为信用分配提供了一种生物学上合理的解决方案，但由于准确估计雅可比矩阵所需的扰动数量随网络规模增长，其可扩展性很差。此外，各向同性噪声与神经活动位于低维流形上的神经生物学观察相冲突。为了解决这些缺点，我们提出了神经流形噪声相关(NMNC)，它使用仅限于神经流形的扰动来执行信用分配。我们从理论和实验上证明，雅可比行空间与训练后网络中的神经流形对齐，且流形维度随网络规模缓慢增长。在基于CIFAR-10训练的卷积网络、ImageNet规模的模型以及循环网络中，NMNC相比普通噪声相关方法大幅提高了性能和样本效率。与普通噪声相关相比，NMNC产生的表示也更接近灵长类视觉系统。这些发现为生物回路如何支持信用分配提供了一个机制性假说，并表明受生物启发的约束可能促成而非限制大规模的有效学习。
摘要:Credit assignment--how changes in individual neurons and synapses affect a network's output--is central to learning in brains and machines. Noise correlation, which estimates gradients by correlating perturbations of activity with changes in output, provides a biologically plausible solution to credit assignment but scales poorly as accurately estimating the Jacobian requires that the number of perturbations scale with network size. Moreover, isotropic noise conflicts with neurobiological observations that neural activity lies on a low-dimensional manifold. To address these drawbacks, we propose neural manifold noise correlation (NMNC), which performs credit assignment using perturbations restricted to the neural manifold. We show theoretically and empirically that the Jacobian row space aligns with the neural manifold in trained networks, and that manifold dimensionality scales slowly with network size. NMNC substantially improves performance and sample efficiency over vanilla noise correlation in convolutional networks trained on CIFAR-10, ImageNet-scale models, and recurrent networks. NMNC also yields representations more similar to the primate visual system than vanilla noise correlation. These findings offer a mechanistic hypothesis for how biological circuits could support credit assignment, and suggest that biologically inspired constraints may enable, rather than limit, effective learning at scale.
【18】GEM-Style Constraints for PEFT with Dual Gradient Projection in LoRA
标题:LoRA中具有双梯度投影的PEFT的GEM式约束
链接:https://arxiv.org/abs/2601.02500
作者:Brian Tekmen,Jason Yin,Qianqian Tong
备注:Work accepted to the NSF REU Symposium at the 2025 IEEE International Conference on Data Mining (ICDM). Correspondence to: betekmen@uncg.edu
摘要:大型语言模型(LLM)的全量微调计算代价高昂，这推动了利用参数高效适配器的持续学习(CL)方法。我们在低秩适配器(LoRA)子空间中重新审视梯度情景记忆(GEM)，并提出I-GEM:一种固定预算、驻留在GPU上的双重投影梯度方法，用以近似GEM的二次规划投影。通过仅在适配器参数内施加非干扰约束，I-GEM以低若干数量级的平均投影开销保留了类GEM的稳定性。在一个带有诱导域漂移的3任务AG News划分上，使用GPT-2(355M)和LoRA($r=8$)，I-GEM的平均准确率与GEM相当(差距约0.04个百分点)，并比A-GEM高约1.4个百分点。关键的是，与GEM相比，它将投影时间减少了约$10^3$倍。这些结果表明，在LoRA子空间中应用GEM约束是实现LLM规模持续学习的一条实用途径。
摘要:Full fine-tuning of Large Language Models (LLMs) is computationally costly, motivating Continual Learning (CL) approaches that utilize parameter-efficient adapters. We revisit Gradient Episodic Memory (GEM) within the Low-Rank Adapter (LoRA) subspace and introduce I-GEM: a fixed-budget, GPU-resident dual projected-gradient approximation to GEM's quadratic projection. By constraining non-interference solely within the adapter parameters, I-GEM preserves GEM-like stability with orders-of-magnitude lower mean projection overhead. On a 3-task AG News split with induced domain drift, using GPT-2 (355M) and LoRA ($r=8$), I-GEM matches GEM's average accuracy (within $\sim\!0.04$ pts) and outperforms A-GEM by $\sim\!1.4$ pts. Crucially, it reduces projection time vs.\ GEM by a factor of $\sim\!10^3$. These results suggest that applying GEM constraints in the LoRA subspace is a practical pathway for continual learning at the LLM scale.
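The constraint being enforced can be illustrated with an A-GEM-style projection restricted to adapter parameters: if the current adapter gradient conflicts with a stored reference gradient, project it onto the non-interfering half-space. I-GEM's fixed-budget dual projection over GEM's quadratic program is more involved; the PyTorch sketch below only shows the LoRA-only constraint idea.
```python
# GEM-style (single-constraint, A-GEM flavoured) gradient projection applied
# only to adapter (LoRA) parameters, as a stand-in for I-GEM's dual solver.
import torch

def project_adapter_grads(adapter_params, ref_grads):
    """In-place projection of adapter gradients onto the non-interfering half-space."""
    g = torch.cat([p.grad.flatten() for p in adapter_params])
    g_ref = torch.cat([r.flatten() for r in ref_grads])
    dot = torch.dot(g, g_ref)
    if dot < 0:                                   # interference with the memory task
        g = g - (dot / torch.dot(g_ref, g_ref)) * g_ref
        offset = 0
        for p in adapter_params:
            n = p.grad.numel()
            p.grad.copy_(g[offset:offset + n].view_as(p.grad))
            offset += n

# Toy usage with two small matrices standing in for LoRA adapter weights.
A = torch.nn.Parameter(torch.randn(4, 2))
B = torch.nn.Parameter(torch.randn(2, 4))
loss = ((A @ B) ** 2).sum()
loss.backward()
ref = [torch.randn_like(A), torch.randn_like(B)]  # gradient stored from a past task
project_adapter_grads([A, B], ref)
print(A.grad.shape, B.grad.shape)
```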
【19】VocalBridge: Latent Diffusion-Bridge Purification for Defeating Perturbation-Based Voiceprint Defenses
标题:VocalBridge:潜在的扩散桥净化,击败基于微扰的声纹防御
链接:https://arxiv.org/abs/2601.02444
作者:Maryam Abbasihafshejani,AHM Nazmus Sakib,Murtuza Jadliwala
摘要:语音合成技术(包括文本到语音(TTS)和语音转换(VC))的快速发展，加剧了与语音克隆相关的安全和隐私问题。最近的防御方法试图通过在语音中嵌入保护性扰动来防止未经授权的克隆，在保持可理解性的同时模糊说话人身份。然而，攻击者可以应用先进的净化技术来去除这些扰动、恢复真实的声学特征，并重新生成可克隆的声音。尽管此类攻击越来越现实，但现有防御在自适应净化下的鲁棒性仍然缺乏充分研究。现有的大多数净化方法是为对抗自动语音识别(ASR)系统中的对抗噪声而设计的，而非针对说话人验证或语音克隆流程。因此，它们无法抑制定义说话人身份的细粒度声学线索，并且通常对说话人验证攻击(SVA)无效。为了解决这些局限，我们提出了VocalBridge(扩散桥净化框架)，它在EnCodec潜空间中学习从扰动语音到干净语音的潜在映射。该模型使用带有余弦噪声调度的时间条件一维U-Net，在保留说话人判别结构的同时实现高效、无需转录文本的净化。我们进一步引入了Whisper引导的音素变体，在不需要真实转录文本的情况下加入轻量级的时间引导。实验结果表明，在从受保护语音中恢复可克隆声音方面，我们的方法始终优于现有的净化方法。我们的发现揭示了当前基于扰动的防御的脆弱性，并强调需要更强大的保护机制来应对不断演变的语音克隆与说话人验证威胁。
摘要:The rapid advancement of speech synthesis technologies, including text-to-speech (TTS) and voice conversion (VC), has intensified security and privacy concerns related to voice cloning. Recent defenses attempt to prevent unauthorized cloning by embedding protective perturbations into speech to obscure speaker identity while maintaining intelligibility. However, adversaries can apply advanced purification techniques to remove these perturbations, recover authentic acoustic characteristics, and regenerate cloneable voices. Despite the growing realism of such attacks, the robustness of existing defenses under adaptive purification remains insufficiently studied. Most existing purification methods are designed to counter adversarial noise in automatic speech recognition (ASR) systems rather than speaker verification or voice cloning pipelines. As a result, they fail to suppress the fine-grained acoustic cues that define speaker identity and are often ineffective against speaker verification attacks (SVA). To address these limitations, we propose Diffusion-Bridge (VocalBridge), a purification framework that learns a latent mapping from perturbed to clean speech in the EnCodec latent space. Using a time-conditioned 1D U-Net with a cosine noise schedule, the model enables efficient, transcript-free purification while preserving speaker-discriminative structure. We further introduce a Whisper-guided phoneme variant that incorporates lightweight temporal guidance without requiring ground-truth transcripts. Experimental results show that our approach consistently outperforms existing purification methods in recovering cloneable voices from protected speech. Our findings demonstrate the fragility of current perturbation-based defenses and highlight the need for more robust protection mechanisms against evolving voice-cloning and speaker verification threats.
【20】WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks
标题:WebGym:通过现实任务扩展视觉Web代理的训练环境
链接:https://arxiv.org/abs/2601.02439
作者:Hao Bai,Alexey Taymanov,Tong Zhang,Aviral Kumar,Spencer Whitehead
摘要:我们提出了WebGym,最大的最新的开源环境,用于训练逼真的视觉网络代理。真正的网站是不固定的和多样化的,使人为的或小规模的任务集不足以强大的政策学习。WebGym包含近300,000个任务,并在不同的真实世界网站和难度级别上进行基于规则的评估。我们用一个简单的强化学习(RL)配方来训练代理,该配方在代理自己的交互轨迹(推出)上进行训练,使用任务奖励作为反馈来指导学习。为了实现扩展RL,我们通过开发专门为Web代理设计的高吞吐量异步推出系统来加快WebGym中轨迹的采样。我们的系统实现了4- 5倍的推出加速相比,天真的实现。第二,我们扩展任务集的广度、深度和大小,这会导致持续的性能改进。在WebGym上微调一个强基础视觉语言模型Qwen-3-VL-8B-Instruct,可以将分发外测试集的成功率从26.2%提高到42.9%,显著优于基于GPT-4 o和GPT-5-Thinking等专有模型的代理,分别达到27.1%和29.8%。这种改进是实质性的,因为我们的测试集只包括在训练过程中从未见过的网站上的任务,不像许多其他以前的训练可视化Web代理的工作。
摘要:We present WebGym, the largest-to-date open-source environment for training realistic visual web agents. Real websites are non-stationary and diverse, making artificial or small-scale task sets insufficient for robust policy learning. WebGym contains nearly 300,000 tasks with rubric-based evaluations across diverse, real-world websites and difficulty levels. We train agents with a simple reinforcement learning (RL) recipe, which trains on the agent's own interaction traces (rollouts), using task rewards as feedback to guide learning. To enable scaling RL, we speed up sampling of trajectories in WebGym by developing a high-throughput asynchronous rollout system, designed specifically for web agents. Our system achieves a 4-5x rollout speedup compared to naive implementations. Second, we scale the task set breadth, depth, and size, which results in continued performance improvement. Fine-tuning a strong base vision-language model, Qwen-3-VL-8B-Instruct, on WebGym results in an improvement in success rate on an out-of-distribution test set from 26.2% to 42.9%, significantly outperforming agents based on proprietary models such as GPT-4o and GPT-5-Thinking that achieve 27.1% and 29.8%, respectively. This improvement is substantial because our test set consists only of tasks on websites never seen during training, unlike many other prior works on training visual web agents.
【21】STIPP: Space-time in situ postprocessing over the French Alps using proper scoring rules
标题:STIPP:使用适当的评分规则在法国阿尔卑斯山上空进行时空现场后处理
链接:https://arxiv.org/abs/2601.02882
作者:David Landry,Isabelle Gouttevin,Hugo Merizen,Claire Monteleoni,Anastase Charantonis
备注:17 pages, 11 figures
摘要:我们提出了时空原位后处理(STIPP)，这是一种机器学习模型，可为站点位置网络生成时空一致的天气预报。传统数值天气预报或数据驱动模式的网格化预报往往由于未能分辨局地效应而缺乏必要的精度。典型的统计后处理方法可以纠正这些偏差，但往往会破坏时空相关结构。最近基于生成式建模的工作成功改善了空间相关结构，但必须对每个预报时效独立进行预测。相比之下，STIPP进行联合的时空预报，与基线方法相比，提高了地表温度、风、相对湿度和降水的预报准确性。它仅需一份六小时分辨率的确定性预报即可生成逐小时的集合预测，模糊了后处理与时间插值之间的界限。通过利用多变量适当评分规则进行训练，STIPP也为仅以分布边缘作为监督的数据驱动大气模型这一正在进行的研究方向做出了贡献。
摘要:We propose Space-time in situ postprocessing (STIPP), a machine learning model that generates spatio-temporally consistent weather forecasts for a network of station locations. Gridded forecasts from classical numerical weather prediction or data-driven models often lack the necessary precision due to unresolved local effects. Typical statistical postprocessing methods correct these biases, but often degrade spatio-temporal correlation structures in doing so. Recent works based on generative modeling successfully improve spatial correlation structures but have to forecast every lead time independently. In contrast, STIPP makes joint spatio-temporal forecasts which have increased accuracy for surface temperature, wind, relative humidity and precipitation when compared to baseline methods. It makes hourly ensemble predictions given only a six-hourly deterministic forecast, blending the boundaries of postprocessing and temporal interpolation. By leveraging a multivariate proper scoring rule for training, STIPP contributes to ongoing work on data-driven atmospheric models supervised only with distribution marginals.
【22】Mitigating Long-Tailed Anomaly Score Distributions with Importance-Weighted Loss
标题:用重要性加权损失缓解长尾异常评分分布
链接:https://arxiv.org/abs/2601.02440
作者:Jungi Lee,Jungkwon Kim,Chi Zhang,Sangmin Kim,Kwangsun Yoo,Seok-Joo Byun
备注:8 pages, Published as a conference paper at IJCNN 2025
摘要:在工业应用中,异常检测对于识别罕见和不可见的模式以确保系统可靠性至关重要。传统的模型,在单一类别的正常数据上训练,与正常数据呈现不同模式的真实世界分布作斗争,导致类别不平衡和长尾异常分数分布(LTD)。这种不平衡会扭曲模型训练并降低检测性能,特别是对于少数实例。为了解决这个问题,我们提出了一种新的重要性加权损失,专为异常检测。与以往的LTD分类方法相比,我们的方法不需要正常数据类的先验知识。相反,我们引入了一个加权损失函数,该函数结合了重要性采样,以使异常分数的分布与目标高斯分布保持一致,从而确保正常数据的平衡表示。在三个基准图像数据集和三个真实世界的高光谱成像数据集上的大量实验证明了我们的方法在减轻LTD引起的偏差方面的鲁棒性。我们的方法将异常检测性能提高了0.043,突出了其在实际应用中的有效性。
摘要:Anomaly detection is crucial in industrial applications for identifying rare and unseen patterns to ensure system reliability. Traditional models, trained on a single class of normal data, struggle with real-world distributions where normal data exhibit diverse patterns, leading to class imbalance and long-tailed anomaly score distributions (LTD). This imbalance skews model training and degrades detection performance, especially for minority instances. To address this issue, we propose a novel importance-weighted loss designed specifically for anomaly detection. Compared to the previous method for LTD in classification, our method does not require prior knowledge of normal data classes. Instead, we introduce a weighted loss function that incorporates importance sampling to align the distribution of anomaly scores with a target Gaussian, ensuring a balanced representation of normal data. Extensive experiments on three benchmark image datasets and three real-world hyperspectral imaging datasets demonstrate the robustness of our approach in mitigating LTD-induced bias. Our method improves anomaly detection performance by 0.043, highlighting its effectiveness in real-world applications.
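A rough sketch of the reweighting idea from this abstract: estimate the current anomaly-score density, and weight each sample's loss by the ratio of a target Gaussian density to that estimate, so over-represented head scores are down-weighted and tail scores are not. The density estimator, target parameters, and normalization below are assumptions for illustration, not the paper's exact loss.
```python
# Importance weights that pull the empirical anomaly-score distribution toward
# a target Gaussian, then applied to per-sample training losses.
import numpy as np

def importance_weights(scores, target_mean=None, target_std=None, bandwidth=0.1):
    scores = np.asarray(scores, dtype=float)
    target_mean = scores.mean() if target_mean is None else target_mean
    target_std = scores.std() if target_std is None else target_std
    # Gaussian kernel density estimate of the current score distribution.
    diffs = (scores[:, None] - scores[None, :]) / bandwidth
    empirical = np.exp(-0.5 * diffs ** 2).mean(axis=1) / (bandwidth * np.sqrt(2 * np.pi))
    target = np.exp(-0.5 * ((scores - target_mean) / target_std) ** 2) / (
        target_std * np.sqrt(2 * np.pi))
    w = target / np.maximum(empirical, 1e-12)
    return w / w.mean()                              # normalize to mean 1

# Toy long-tailed scores: a dense head plus a sparse tail of large scores.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.2, 0.05, 950), rng.normal(1.0, 0.2, 50)])
per_sample_loss = scores ** 2                        # stand-in reconstruction losses
weighted_loss = (importance_weights(scores) * per_sample_loss).mean()
print(float(per_sample_loss.mean()), float(weighted_loss))
```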
机器翻译由腾讯交互翻译提供,仅供参考
点击“阅读原文”获取带摘要的学术速递