cs.LG: 315 papers today
Large language models (42 papers)
【1】LLM Benchmark Datasets Should Be Contamination-Resistant
Link: https://arxiv.org/abs/2605.19999
Authors: Ali Al-Lawati, Jason Lucas, Dongwon Lee, Suhang Wang
Comments: Accepted to ICML 2026 Position Paper Track
Abstract: Benchmark datasets are critical for reproducible, reliable, and discriminative evaluation of LLMs. However, recent studies reveal that many benchmark datasets are included in pretraining corpora, i.e., $\textit{contaminated}$, which diminishes their value as reliable measures of model generalization. In this paper, we argue that benchmark datasets should be $\textit{contamination-resistant}$, i.e., $\textit{unlearnable}$, while still supporting $\textit{inference}$. To accomplish this, we first highlight the wide prevalence of benchmark dataset contamination and outline the properties of contamination-resistant datasets. Second, we highlight how the asymmetry between the inference and training pipelines in the Transformer architecture can be leveraged to support contamination-resistance. Third, we outline mathematical advancements to make these datasets interoperable across various LLM architectures. Based on the above, we call on the community to ensure the reliability of LLM benchmarking by: (i) advancing novel contamination-resistant methodologies, (ii) developing supporting methods and platforms, and (iii) adopting contamination-resistant benchmarks into existing evaluation pipelines.
【2】PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents
Link: https://arxiv.org/abs/2605.19932
Authors: Zhuohan Gu, Qizheng Zhang, Omar Khattab, Samuel Madden
Abstract: Large language model (LLM) agents increasingly operate over long and recurring external contexts, like document corpora and code repositories. Across invocations, existing approaches preserve either the agent's trajectory, passive access to raw material, or task-level strategies. None of them preserves what we argue is most needed for repeated same-context workloads: reusable orientation knowledge (e.g., what the context contains, how it is organized, and which entities, constants, and schemas have historically been useful) about the recurring context itself. We introduce PEEK, a system that caches and maintains this orientation knowledge as a context map: a small, constant-sized artifact in the agent's prompt that gives it a persistent peek into the external context. The map is maintained by a programmable cache policy with three modules: a Distiller that extracts transferable knowledge from inference-time signals, a Cartographer that translates it into structured edits, and a priority-based Evictor that enforces a fixed token budget. On long-context reasoning and information aggregation, PEEK improves over strong baselines by 6.3-34.0% while using 93-145 fewer iterations and incurring 1.7-5.8x lower cost than the state-of-the-art prompt-learning framework, ACE. On context learning, PEEK improves solving rate and rubric accuracy by 6.0-14.0% and 7.8-12.1%, respectively, at 1.4x lower cost than ACE. These gains generalize across LMs and agent architectures, including OpenAI Codex, a production-grade coding agent. Together, these results show that a context map helps long-context LLM agents interact with recurring external contexts more accurately and efficiently.
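The abstract above mentions a priority-based Evictor that enforces a fixed token budget. As a rough illustration of that general idea only, the sketch below keeps the highest-priority entries of a map until they fit a budget. The function name `evict_to_budget`, the `(priority, n_tokens, text)` tuple layout, and the example entries are all hypothetical and are not taken from PEEK.

```python
def evict_to_budget(entries, budget):
    """Keep the highest-priority entries whose total token count fits the budget.

    entries: list of (priority, n_tokens, text) tuples (hypothetical layout).
    budget:  maximum total number of tokens allowed in the map.
    """
    # Sort from highest to lowest priority so the tail holds eviction candidates.
    kept = sorted(entries, key=lambda e: e[0], reverse=True)
    total = sum(e[1] for e in kept)
    # Drop the lowest-priority entry until the map fits the budget.
    while kept and total > budget:
        total -= kept.pop()[1]
    return kept


# Hypothetical map entries: (priority, token count, content).
context_map = [
    (0.9, 40, "schema"),
    (0.2, 50, "stale note"),
    (0.6, 30, "constants"),
]
kept = evict_to_budget(context_map, budget=80)
print([text for _, _, text in kept])
```

A real policy would presumably also update priorities over time; this sketch assumes they are given.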
【3】Prior Knowledge or Search? A Study of LLM Agents in Hardware-Aware Code Optimization
Link: https://arxiv.org/abs/2605.19782
Authors: Dmitry Redko, Albert Fazlyev, Konstantin Sozykin, Maria Ivanova, Evgeny Burnaev, Egor Shvetsov
Abstract: LLM discovery and optimization systems are increasingly applied across domains, implementing a common propose-evaluate-revise loop. Such optimization or discovery progresses via context conditioning on received feedback from an environment. However, as modern LLM agents are increasingly complex in their structure, it is difficult to evaluate which components contribute the most, and when and how this exploration may fail. We answer these questions through three controlled experiments. Our findings: (1) In pure black-box optimization, LLMs act as greedy optimizers. (2) In zero-shot kernel generation, providing explicit input-size information has no measurable effect: models converge to the same kernel parameters regardless of size or temperature, as though the size instruction were invisible. Moreover, when tasked to perform kernel optimization for uncommon kernel sizes, performance sharply degrades regardless of the language used. (3) In feedback-loop kernel optimization, CUDA improves monotonically under iterative feedback, while TVM IR actively degrades, which demonstrates that kernel optimization degrades when models operate with low-density language. Our results indicate that LLMs in code optimization tasks depend heavily on pretrained priors rather than on provided feedback or agentic structure.
【4】EngiAI: A Multi-Agent Framework and Benchmark Suite for LLM-Driven Engineering Design
Link: https://arxiv.org/abs/2605.19743
Authors: Gioele Molinari, Florian Felten, Soheyl Massoudi, Mark Fuge
Comments: 26 pages, 10 figures, to be published at IDETC 2026
Abstract: Large Language Model (LLM) agents are increasingly applied to engineering design tasks, yet existing evaluation frameworks do not adequately address multi-agent systems that combine simulation, retrieval, and manufacturing preparation. We introduce a benchmark suite with three evaluation dimensions: (1) a workflow benchmark with seven prompt styles targeting distinct cognitive demands, including direct tool use, semantic disambiguation, conditional branching, and working-memory tasks; (2) a Retrieval-Augmented Generation (RAG) benchmark with gated scoring isolating retrieval contributions to parameter selection; and (3) a High Performance Computing (HPC) benchmark evaluating end-to-end ML training orchestration on a SLURM cluster. Alongside the benchmark we present EngiAI, a Multi-Agent System (MAS) reference implementation built on LangGraph that operationalizes the benchmark by coordinating seven specialized agents through a supervisor architecture, unifying topology optimization, document retrieval, HPC job orchestration, and 3D printer control. Across four LLM backends and two EngiBench problems, proprietary models achieve 96-97% average task completion on Beams2D, while open-source 4B-parameter models reach 55-78%, with clear generational improvement. Conditional branching proves most challenging, with task completion dropping to 20-53% for the conditional style on Photonics2D. RAG gating confirms near-perfect retrieval-augmented scores ($\approx 1.0$) versus near-zero without retrieval, validating the evaluation design. On HPC orchestration, one model completes all pipeline steps in 100% of runs while another drops to 50%, revealing that multi-step instruction following degrades over long-running workflows.
【5】OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond
Link: https://arxiv.org/abs/2605.19660
Authors: Zunhai Su, Rui Yang, Chao Zhang, Yaxiu Liu, Yifan Zhang, Wei Wu, Jing Xiong, Dayou Du, Xialie Zhuang, Yulei Qian, Yuchen Xie, Yik-Chung Wu, Hongxia Yang, Ngai Wong
Comments: Under review
Abstract: The rapid advancement toward long-context reasoning and multi-modal intelligence has made the memory footprint of the Key-Value (KV) cache a dominant memory bottleneck for efficient deployment. While the established per-channel quantization effectively accommodates intrinsic channel-wise outliers in Key tensors, its efficacy diminishes under extreme compression. In this work, we revisit the inherent limitations of the per-channel quantization paradigm from both empirical and theoretical perspectives. Our analysis identifies Token Norm Imbalance (TNI) as the primary bottleneck to quantization fidelity. We demonstrate that TNI systematically amplifies errors when shared quantization parameters are required to span token groups exhibiting substantial norm disparities. Instead of relying on intricate quantization pipelines (e.g., TurboQuant), we propose OScaR (Omni-Scaled Canalized Rotation), an accurate and lightweight KV cache compression framework for X-LLMs (i.e., text-only, multi-modal, and omni-modal LLMs). Advancing the per-channel paradigm, OScaR employs Canalized Rotation followed by Omni-Token Scaling to mitigate TNI-induced sequence-dimensional variance both effectively and efficiently, further supported by our optimized system design and CUDA kernels. Extensive evaluations across X-LLMs show that OScaR consistently outperforms existing methods and achieves near-lossless performance under INT2 quantization, establishing it as a robust, low-complexity, and universal framework that defines a new Pareto front. Compared with the BF16 FlashDecoding-v2 baseline, our OScaR implementation achieves a decoding speedup of up to 3.0x, reduces memory footprint by 5.3x, and increases throughput by 4.1x. The code for OScaR is publicly available at https://github.com/ZunhaiSu/OScaR-KV-Quant.
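To make the Token Norm Imbalance claim above concrete, here is a minimal sketch of plain per-channel 2-bit quantization, where each channel shares one scale and offset across all tokens in a group. It is not OScaR and does not implement Canalized Rotation or Omni-Token Scaling; the function names and the toy key values are assumptions chosen for illustration.

```python
def quantize_per_channel(keys, bits=2):
    """Asymmetric uniform quantization with one (scale, offset) per channel.

    keys: list of token vectors (tokens x channels). The quantization
    parameters of each channel are shared by every token in the group.
    Returns the dequantized values.
    """
    levels = 2 ** bits - 1
    n_tokens, n_channels = len(keys), len(keys[0])
    restored = [[0.0] * n_channels for _ in range(n_tokens)]
    for c in range(n_channels):
        column = [row[c] for row in keys]
        lo, hi = min(column), max(column)
        # Fall back to 1.0 for constant channels to avoid dividing by zero.
        scale = (hi - lo) / levels or 1.0
        for t in range(n_tokens):
            q = round((keys[t][c] - lo) / scale)
            restored[t][c] = lo + q * scale
    return restored


def relative_error(original, restored):
    """L2 reconstruction error of one token, relative to its norm."""
    num = sum((a - b) ** 2 for a, b in zip(original, restored)) ** 0.5
    den = sum(a * a for a in original) ** 0.5
    return num / den


# Toy keys: three tokens of similar norm, then the same group plus
# one token with a much larger norm (hypothetical values).
balanced = [[1.0, -1.0], [0.5, 0.8], [-0.7, 0.3]]
imbalanced = balanced + [[50.0, -60.0]]

err_balanced = relative_error(balanced[1], quantize_per_channel(balanced)[1])
err_imbalanced = relative_error(imbalanced[1], quantize_per_channel(imbalanced)[1])
print(err_balanced, err_imbalanced)
```

In this toy setup the large-norm token stretches the shared range of each channel, so the small-norm token is reconstructed far less accurately in the imbalanced group than in the balanced one.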
【6】The Silent Hyperparameter: Quantifying the Impact of Inference Backends on LLM Reproducibility
Link: https://arxiv.org/abs/2605.19537
Authors: David Pape, Jonathan Evertz, Lea Schönherr
Abstract: Progress in LLMs is increasingly measured through standardized benchmarks, where state-of-the-art improvements are often separated by fractions of a percentage point. At the same time, the computational cost of evaluating modern LLMs has driven widespread adoption of specialized inference backends, software systems that execute trained models efficiently at inference time. While critical for scalability, system-level optimizations, such as custom CUDA kernels and reduced-precision arithmetic, can alter token probabilities and introduce non-determinism, possibly cascading into divergent generation. In this work, we first survey the inference landscape, identifying 200 distinct engines, and analyze 35,000 ML publications, finding that the specific inference stack is rarely reported despite this widespread diversity. We then present a systematic empirical study of how inference backends affect LLM benchmark results. Holding model weights, decoding parameters, and hardware constant, we evaluate five widely used inference engines, including vLLM, SGLang, and llama.cpp, across multiple open-weight models and established benchmarks. We show that the choice of backend alone can shift benchmark scores by up to 16.6 percentage points and induce high rates of output disagreement. By isolating backend optimizations and tracing the execution pipeline, we find this divergence is driven by system-level optimizations like prefix caching and CUDA graphs, custom kernels, and engine-specific defaults in logit processing. Our findings identify the inference backend as a previously unreported but consequential hyperparameter in the evaluation of LLMs and advocate standardized reporting of inference stacks to improve the reproducibility and interpretability of benchmark comparisons.
【7】Drifting Objectives for Refining Discrete Diffusion Language Models
Link: https://arxiv.org/abs/2605.19470
Authors: Daisuke Oba, Hiroki Furuta, Naoaki Okazaki
Comments: Project page: https://daioba.github.io/tokendrift/
Abstract: Discrete diffusion language models (DDLMs) generate text by iteratively denoising categorical token sequences, while recent drifting methods for continuous generators suggest that part of this sampling-time correction can instead be absorbed into training through an anti-symmetric fixed-point objective. We study how to transfer this principle to DDLMs, where the main challenge is the interface with discrete text: hard token samples are non-differentiable, and categorical predictions do not directly provide continuous samples to drift. We formulate TokenDrift, a drifting objective that lifts categorical predictions to soft-token features, applies anti-symmetric drifting in a frozen semantic space, and backpropagates the resulting stop-gradient feature target to DDLM logits. In controlled continual-training experiments with masked and uniform-state diffusion backbones, TokenDrift improves fixed-NFE generation quality over matched continuation baselines, reducing Gen.-PPL at 4 NFEs by 89% on MDLM and 86% on DUO. These results suggest that drifting can provide a practical refinement objective for DDLMs.
【8】The Evaluation Game: Beyond Static LLM Benchmarking
Link: https://arxiv.org/abs/2605.19377
Authors: Paul Wang, Jade Garcia-Bourrée, Anne-Marie Kermarrec, Vincent Corruble
Comments: 36 pages
Abstract: As jailbreaks, adversarially crafted inputs that bypass safety constraints, continue to be discovered in Large Language Models, practitioners increasingly rely on fine-tuning as a defensive strategy. Yet the theoretical foundations underlying this robustness fine-tuning remain underexplored. We introduce a game-theoretic framework in which the interaction between an evaluator (auditing the model for jailbreaks) and a trainer is formalized as a two-player game. A key feature of our approach is the use of group actions, a mathematical structure that captures symmetries and transformations, to formally represent data augmentation. The simplest non-trivial instance is the circle with cyclic translation groups, where we exhibit various regimes depending on the trainer's generalization range. Below a critical threshold, the evaluator maintains a constant miss ratio for linearly many rounds, whereas other settings can yield very different behaviors. We further provide empirical evidence supporting locality-dependence of the model: for the three model families we tested (Llama, Qwen and Mistral), we have significant evidence that fine-tuning on adversarial prompts induces only local generalization, with refusal rates on test examples highly correlated with the distance to the fine-tuning prompts. Our framework recasts the central object of adversarial evaluation: a benchmark is not a static set of prompts but an orbit under the evaluator's group action, and audit protocols that ignore trainer-side adaptation cannot distinguish a genuine fix from a memorized patch.
【9】Language models struggle with compartmentalization
Link: https://arxiv.org/abs/2605.19284
Authors: Thomas Vincent Howe, David Wingate
Comments: 9 pages, 8 figures, plus 9 pages of appendices. Submitted to NeurIPS 2026. Code: https://github.com/vinhowe/compartmentalization. Eval data: https://doi.org/10.5281/zenodo.20171021
Abstract: In the training data used by large language models (LLMs), the same latent concept is often presented in multiple distinct ways: the same facts appear in English and Swahili; many functions can be expressed in both Python and Haskell; we can express propositions in both formal and natural language. We show that LLMs can exhibit compartmentalization, where they fail to identify and share statistical strength between distinct presentations of unified concepts. In the worst case, LLMs simply learn parallel internal representations of each presentation of the concept, saturating model capacity with redundancies and decreasing sample efficiency with the number of such presentations. We also demonstrate that synthetic parallel data can fail to improve this despite being easily learned itself. Under this framework, we find that, for small models, early multilingual learning is nearly entirely compartmentalized. Finally, all interventions that we study exhibit a phase transition in which their effectiveness depends on the number of distinct presentations, suggesting that the language modeling objective may only inconsistently unify representations.
【10】OpenCompass: A Universal Evaluation Platform for Large Language Models
Link: https://arxiv.org/abs/2605.19276
Authors: Maosong Cao, Kai Chen, Haodong Duan, Yixiao Fang, Tong Gao, Ge Jiaye, Mo Li, Hongwei Liu, Junnan Liu, Yuan Liu, Chengqi Lyu, Han Lyu, Ningsheng Ma, Zerun Ma, Yu Sun, Zhiyong Wu, Linchen Xiao, Jun Xu, Haochen Ye, Zhaohui Yu, Yike Yuan, Songyang Zhang, Yufeng Zhao, Fengzhe Zhou, Peiheng Zhou, Dongsheng Zhu, Lin Zhu, Jingming Zhuo
Abstract: In recent years, the field of artificial intelligence has undergone a paradigm shift from task-specific small-scale models to general-purpose large language models (LLMs). With the rapid iteration of LLMs, objective, quantitative, and comprehensive evaluation of their capabilities has become a critical link in advancing technological development. Currently, the mainstream evaluation methods based on static benchmark datasets face challenges such as the diversity of task types, inconsistent evaluation criteria, and fragmentation of data and processing workflows, making it difficult to efficiently conduct cross-domain and large-scale model evaluation. To address these issues, this paper proposes and open-sources OpenCompass, a one-stop, scalable, high-concurrency general-purpose LLM evaluation platform. Adhering to the design philosophy of modularization and component decoupling, the platform has three core advantages: high compatibility, flexibility, and high concurrency. The core architecture of OpenCompass comprises five key components: the Configuration System, Task Partitioning Module, Execution and Scheduling Module, Task Execution Unit, and Result Visualization Module. Its workflow provides rule-based, LLM-as-a-Judge, and cascaded evaluators to adapt to the requirements of different task scenarios. Supporting mainstream benchmark datasets across multiple domains, including knowledge, reasoning, computation, science, language, and code, the platform offers a unified and efficient LLM evaluation tool for both academia and industry, facilitating the accurate identification of strengths and weaknesses of LLMs as well as their subsequent optimization.
【11】Backdooring Masked Diffusion Language Models
Link: https://arxiv.org/abs/2605.19262
Authors: Daniel Yiming Cao, Chengzhong Wang, Sheng-Yen Chou, Chengyu Huang, Pin-Yu Chen, Shengwei An
Abstract: Masked diffusion language models (MDLMs) are emerging as a compelling new paradigm for text generation, but their training-time security remains largely unexplored. Existing backdoor attacks on Gaussian diffusion models or autoregressive language models do not directly apply to MDLMs because MDLMs rely on discrete state corruption and iterative denoising rather than continuous noising or left-to-right prediction. In this work, we present the first systematic study of training-time backdoor attacks on MDLMs. We propose SHADOWMASK, a backdoor attack that modifies the MDLM forward corruption process by replacing the standard all-mask terminal distribution with a trigger-mask mixture prior. This creates a dedicated denoising pathway from trigger-corrupted states to attacker-specified targets while preserving clean denoising behavior. We further provide a principled mathematical formulation by defining the backdoored forward process, deriving the reverse-time posterior, and obtaining the continuous-time training objective. Evaluations on DiT-based MDLM and LLaDA-8B-Instruct across WikiText-103, OpenWebText, and Alpaca show that SHADOWMASK achieves near-100% attack success, substantially outperforms standard data poisoning, largely preserves clean utility, remains effective under full-model and parameter-efficient fine-tuning, and is robust against representative defenses.
【12】Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution
Link: https://arxiv.org/abs/2605.19228
Authors: Xiaoou Liu, Tiejin Chen, Dengjia Zhang, Yaqing Wang, Lu Cheng, Hua Wei
Comments: Accepted by ICML 2026
Abstract: Large Language Models have achieved strong performance on reasoning tasks with objective answers by generating step-by-step solutions, but diagnosing where a multi-step reasoning trace might fail remains difficult. Confidence estimation offers a diagnostic signal, yet existing methods are restricted to final answers or require internal model access. In this paper, we introduce Stepwise Confidence Attribution (SCA), a framework for closed-source LLMs that assigns step-level confidence based only on generated reasoning traces. SCA applies the Information Bottleneck principle: steps aligning with consensus structures across correct solutions receive high confidence, while deviations are flagged as potentially erroneous. We propose two complementary methods: (1) NIBS, a non-parametric IB approach measuring consistency without graph structures, and (2) GIBS, a graph-based IB model that learns subgraphs through a differentiable mask to capture logical variability. Extensive experiments on mathematical reasoning and multi-hop question answering show that SCA reliably identifies low-confidence steps strongly correlated with reasoning errors. Moreover, using step-level confidence to guide self-correction improves the correction success rate by up to 13.5% over answer-level feedback.
【13】Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering
Link: https://arxiv.org/abs/2605.19220
Authors: Tiejin Chen, Longchao Da, Xiaoou Liu, Hua Wei
Comments: Accepted by ICML 2026 Position Paper Track
Abstract: Uncertainty Quantification (UQ) is widely regarded as the primary safeguard for deploying Large Language Models (LLMs) in high-stakes domains. However, we argue that the field suffers from a category error: mainstream UQ methods for LLMs are just unsupervised clustering algorithms. We demonstrate that most current approaches inherently quantify the internal consistency of the model's generations rather than their external correctness. Consequently, current methods are fundamentally blind to factual reality and fail to detect "confident hallucinations," where models exhibit high confidence in stable but incorrect answers. Therefore, current UQ methods may create a deceptive sense of safety when deploying models with uncertainty. In detail, we identify three critical pathologies resulting from this dependence on internal state: a hyperparameter sensitivity crisis that renders deployment unsafe, an internal evaluation cycle that conflates stability with truth, and a fundamental lack of ground truth that forces reliance on unstable proxy metrics to evaluate uncertainty. To resolve this impasse, we advocate for a paradigm shift in UQ and outline a roadmap for the research community to adopt better evaluation metrics and settings, implement mechanism changes for native uncertainty, and anchor verification in objective truth, ensuring that model confidence serves as a reliable proxy for reality.
【14】Sequential Consensus for Multi-Agent LLM Debates: A Wald-SPRT compute governor with calibration-based failure detection
Link: https://arxiv.org/abs/2605.19193
Authors: Andrea Morandi
Abstract: Multi-agent LLM debate improves factuality and reasoning, but most recipes pick a fixed round count, over-spending on easy items and under-spending on hard ones. We adapt Wald's Sequential Probability Ratio Test (SPRT) as a plug-in compute governor for LLM debates. After each round, an LLM judge emits a [0,1] consensus score on the latest agent positions; a Wald monitor accumulates the log-likelihood ratio of "useful convergence" vs "not yet useful" under a Beta likelihood family, and stops when either boundary is crossed or returns a capped best-effort outcome at R_max. Under i.i.d. assumptions the rule inherits SPRT type-I/type-II error guarantees; in deployment the calibration itself is the more important object, since it estimates whether the judge score actually separates useful from unhelpful convergence in a given domain. We evaluate two tracks: (i) a Monte-Carlo study under calibrated Beta models characterising working curves, error rates, capping behaviour, and sensitivity; and (ii) a real-LLM evaluation on 200 attempted MMLU and 200 attempted GSM8K items with three heterogeneous agents (gpt-5, claude-opus-4-6, gemini-2.5-pro) and a claude-opus-4-6 judge, using disjoint 40-item calibration subsets. On GSM8K the rule stops in 1.01 average rounds (4.06 LLM calls) at 97.0% accuracy vs 99.0% for fixed-5 debate at 15 calls: a 3.7x call reduction at -2pp accuracy. On MMLU the calibrated KL collapses to about 0 and the rule caps on 99.5% of items at 2.1x cost. The takeaway is not that SPRT makes debate more accurate, but that a classical sequential test serves as a cheap compute-control and failure-detection layer for multi-agent LLM systems.
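The stopping rule described above can be sketched with a textbook Wald SPRT over Beta-distributed judge scores. This is only an illustration under stated assumptions: the Beta parameters `h1` and `h0`, the error rates, and the function names are hypothetical placeholders, whereas the paper calibrates its likelihoods on held-out items.

```python
import math


def beta_logpdf(x, a, b):
    """Log density of Beta(a, b) at x in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x)


def sprt_debate(scores, h1=(8.0, 2.0), h0=(2.0, 8.0),
                alpha=0.05, beta_err=0.05, r_max=5):
    """Return (decision, rounds_used) for a sequence of judge scores.

    h1 / h0 are assumed Beta parameters for "useful convergence" and
    "not yet useful"; alpha and beta_err are the target type-I and
    type-II error rates that set Wald's two boundaries.
    """
    upper = math.log((1 - beta_err) / alpha)   # crossing accepts H1
    lower = math.log(beta_err / (1 - alpha))   # crossing accepts H0
    llr = 0.0
    rounds = 0
    for rounds, score in enumerate(scores[:r_max], start=1):
        # Clamp so log(0) cannot occur for scores of exactly 0 or 1.
        score = min(max(score, 1e-6), 1 - 1e-6)
        llr += beta_logpdf(score, *h1) - beta_logpdf(score, *h0)
        if llr >= upper:
            return "converged", rounds
        if llr <= lower:
            return "not_converged", rounds
    # Neither boundary crossed within r_max rounds: capped outcome.
    return "capped", rounds


print(sprt_debate([0.6, 0.6, 0.6]))
```

With uninformative scores near 0.5 the log-likelihood ratio barely moves, so the rule runs to the cap. That mirrors the failure-detection reading in the abstract: frequent capping signals that the judge score does not separate the two hypotheses.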
【15】Be Kind, Rewrite: Benign Projections via Rewriting Defend Against LLM Data Poisoning Attacks
Link: https://arxiv.org/abs/2605.19147
Authors: John T. Halloran, Noopur S. Bhatt
Comments: 15 pages, 2 figures, 5 tables
Abstract: Large language models (LLMs) are highly susceptible to backdoor attacks (BAs), wherein training samples are poisoned using trigger-based harmful content. Furthermore, existing defenses have proven ineffective when extensively tested across BA patterns. To better combat BAs, we explore the use of LLM rewriting as a proactive defense against data poisoning. First, we theoretically show that when LLM rewriting utilizes open-book benign samples, termed open-book benign rewriting (OBBR), the probability of a rewritten output being benign is strictly greater than that of closed-book rewriting. Thus, OBBR neutralizes harmful content by projecting training samples to the space of benign prompts. We then show that, in contrast to previous defenses, OBBR effectively mitigates a large number of existing BAs: across five known BAs and four widely used LLMs, OBBR increases safety performance by an average of 51% compared to state-of-the-art BA defenses and 25.7% compared to closed-book rewriting methods. Finally, we show that OBBR is computationally efficient relative to other BA defenses, does not degrade model performance on natural language tasks after fine-tuning, and is capable of defending against non-trigger-based data poisoning attacks.
【16】Heterogeneity-Aware Dataset Scheduling for Efficient Audio Large Language Model Training
Link: https://arxiv.org/abs/2605.19101
Authors: Yanru Wu, Jianning Wang, Chongxin Gan, Yang Li
Abstract: Training general-purpose Audio Large Language Models (ALLMs) across diverse datasets is essential for holistic audio understanding, yet it faces significant challenges due to dataset heterogeneity, which often leads to conflicting gradients and slow convergence. Despite its impact, how to explicitly manage this heterogeneity during training remains underexplored, with current practices relying primarily on uniform mixture. In this work, we analyze multi-dataset AudioQA training from a convergence perspective and propose Grouped Sequential Training (GST). GST strategically organizes datasets into affinity-aware groups and introduces them via a progressive scheduling protocol, effectively balancing the stability of parallel training with the efficiency of sequential optimization. To ensure scalability, we develop gradient-based affinity metrics that capture inter-dataset relationships without the prohibitive cost of empirical transferability estimation. Extensive evaluations on 14 AudioQA datasets spanning speech, music, and environmental sounds demonstrate that GST achieves 30-40% faster convergence than standard parallel training while maintaining or even surpassing the performance of mix-all training. Our results provide both theoretical insights and a practical, model-agnostic framework for efficient large-scale ALLM optimization.
【17】ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models
Link: https://arxiv.org/abs/2605.19095
Authors: Aaron Defazio
Abstract: Schedule-Free Learning has shown promise as a practical anytime training method for machine learning, showing success across dozens of standard benchmark problems. However, strong performance for LLM training has only been demonstrated at small scales. We identify a number of fixes necessary to scale up Schedule-Free Learning to larger batch sizes and model sizes, and present a learning-rate-free and schedule-free method (ScheduleFree+) for training large language models which greatly outperforms Warmup-Stable-Decay (WSD) schedules. We also demonstrate that Schedule-Free Learning is most effective for long-duration training, and at 1000 tokens per parameter, it outperforms SOTA schedules by 31%. Schedule-Free Learning provides a theoretical foundation for the use of model averaging and checkpoint merging during pretraining.
【18】HypergraphFormer: Learning Hypergraphs from LLMs for Editable Floor Plan Generation
Link: https://arxiv.org/abs/2605.18932
Authors: Nikita Klimenko, Hesam Salehipour, Parham Eftekhar, Amir Khasahmadi, Ramon Elias Weber
Abstract: In this work, we propose HypergraphFormer, a novel and efficient approach to floor plan generation based on learning hypergraph representations with a large language model (LLM). The model is trained via supervised fine-tuning to generate a hypergraph-based textual representation that encodes spatial relationships and connectivity information within floor plans. We train and evaluate our approach on the RPLAN dataset, and further demonstrate its generalizability on a separate out-of-distribution dataset, which we release in this paper. Our method outperforms state-of-the-art techniques based on rasterized or vectorized representations across a diverse set of metrics. We also show improved data efficiency, particularly under distribution shift. The hypergraph formulation enables the generation of floor plans for arbitrary, irregular, user-specified boundaries by decoupling apartment footprints from their functional and geometric subdivisions. Furthermore, we show that the proposed methodology offers a high degree of editability, making it particularly well suited to design-oriented workflows supported by LLMs.
【19】OEP: Poisoning Self-Evolving LLM Agents via Locally Correct but Non-Transferable Experiences
Link: https://arxiv.org/abs/2605.18930
Authors: Kaixiang Wang, Jiong Lou, Zhaojiacheng Zhou, Jie Li
Abstract: Memory-augmented large language model (LLM) agents use iterative reflection and self-evolution to solve complex tasks, but these mechanisms introduce security risks. Existing agentic memory attacks require privileged access or explicit malicious content, making them detectable by advanced safety filters. This leaves a subtler attack surface underexplored: whether adversaries can induce an agent to generate experiences that appear locally correct and semantically plausible yet induce harmful generalization during reflection. We find that reflective agents are vulnerable to such clean experiences, especially when paired with severe but plausible hypothetical consequences. Based on this observation, we introduce Obsessive Experience Poisoning (OEP), a low-privilege black-box attack requiring no direct control over the system prompt or memory database. OEP constructs adversarial clean edge cases that combine locally correct solutions, non-transferable methods, and severe consequences, biasing reflection toward risk-averse rule formation. During memory consolidation, agents may over-trust self-generated reflections and distill localized experiences into high-priority but over-generalized rules, causing downstream failures. Evaluations across three domains show that OEP achieves ASR above 50% with GPT-4o agents, and outperforms existing attacks under LLM auditing defense.
【20】Don't Let Bandit Feedback Pull Continual LLM-Recommender Updates Off Target
Link: https://arxiv.org/abs/2605.18899
Authors: Taesan Kim, Hyeongjun Yun, Jaegul Choo, Chung Park
Abstract: Generative LLM-based recommenders (LLM-Rec) require continual post-deployment updates, yet deployment logs provide only policy-shaped contextual bandit feedback: outcomes are observed solely for items exposed by a prior serving policy, inducing exposure bias and yielding partial, asymmetric signals consisting of relatively reliable positive responses and ambiguous no-responses. We propose an Anchored Bandit Policy Optimization (ABPO) framework for continual LLM-Rec updates that combines group-relative policy optimization (GRPO) with explicit treatment of exposure bias and feedback ambiguity. Specifically, we insert the exposed recommendation as a logged anchor into each GRPO rollout group, so that group-relative normalization is calibrated against the action actually exposed by the prior policy rather than against newly sampled rollouts alone. Because both positive responses and no-responses are observed only through prior-policy exposure, we apply self-normalized inverse propensity scoring to the fixed anchor for both feedback types to correct for policy mismatch. At the same time, we treat the two feedback types asymmetrically in reliability: positive responses provide relatively direct endorsement signals, whereas no-responses remain ambiguous because they may reflect either true disinterest or unobserved external factors. To avoid overly aggressive updates from ambiguous no-responses, we temper their penalties with self-certainty, using the model's output-token confidence as a verifier-free reliability signal. Across five domains from Amazon Reviews and MovieLens, our method yields consistent post-update gains in recommendation accuracy while mitigating prior-policy-induced exposure bias more effectively than prior baselines.
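The abstract above relies on self-normalized inverse propensity scoring (SNIPS) to correct for the mismatch between the logging policy and the updated policy. Below is a minimal sketch of the generic SNIPS estimator on logged bandit data, not the ABPO objective itself. The function name and the toy probabilities are hypothetical.

```python
def snips(rewards, new_probs, logged_probs):
    """Self-normalized inverse propensity estimate of the new policy's value.

    rewards:      observed reward for each logged (exposed) action
    new_probs:    probability the new policy assigns to that action
    logged_probs: probability the logging policy assigned to it (> 0)
    """
    # Importance weight: how much more (or less) likely the new policy
    # is to take the logged action than the logging policy was.
    weights = [p_new / p_old for p_new, p_old in zip(new_probs, logged_probs)]
    total = sum(weights)
    if total == 0:
        return 0.0
    # Dividing by the weight sum (rather than the sample count) is the
    # self-normalization step; it trades a small bias for lower variance.
    return sum(w * r for w, r in zip(weights, rewards)) / total


# Toy log: 1 = positive response, 0 = no response (hypothetical values).
estimate = snips(
    rewards=[1, 0, 1],
    new_probs=[0.5, 0.2, 0.1],
    logged_probs=[0.25, 0.4, 0.1],
)
print(estimate)
```

Here the weights are 2, 0.5, and 1, so the estimate is 3 / 3.5: the first positive response counts double because the new policy favors that item more than the logging policy did.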
【21】To Call or Not to Call: Diagnosing Intrinsic Over-Calling Bias in LLM Agents
标题:调用还是不调用:诊断LLM代理内在的过度调用偏差
链接:https://arxiv.org/abs/2605.18882
作者:Wei Shi,Ziheng Peng,Sihang Li,Xiting Wang,Xiang Wang,Mengnan Du,Na Zou
摘要:LLM agents exhibit a consistent tendency to over-call, invoking tools even in situations where none is needed. On the When2Call benchmark, six models from three families show high call accuracy but much lower no-call accuracy, leaving overall accuracy in the 55%-70% range. We trace this to an Intrinsic Bias Hypothesis (IBH): the call/no-call decision mapping carries an activation-independent call offset, so the model favors call even at activation parity. Using Sparse Autoencoders (SAEs), we recover behavior-aligned feature bases for the call/no_call decision, reduce them to a signed activation margin, and estimate the offset directly. Across all six models, the model is decision-neutral only when no_call activation outweighs call activation, consistent with IBH. We then causally test IBH with Adaptive Margin-Calibrated Steering (AMCS), a closed-form counter-bias shift along SAE decoder directions. Cancelling the diagnosed offset mitigates over-calling and improves overall accuracy with a negligible drop in call accuracy. Our work recasts over-calling from an empirical phenomenon into a mechanistic object amenable to causal correction. Code is available at https://github.com/SKURA502/agent-sae/.
【22】ZeroUnlearn: Few-Shot Knowledge Unlearning in Large Language Models
标题:ZeroUnlearn:大型语言模型中的少样本知识遗忘
链接:https://arxiv.org/abs/2605.18879
作者:Yujie Lin,Chengyi Yang,Zhishang Xiang,Yiping Song,Jinsong Su
摘要:Large language models inevitably retain sensitive information, defined as inputs that may induce harmful generations, due to training on massive web corpora, raising concerns for privacy and safety. Existing machine unlearning methods primarily rely on retraining or aggressive fine-tuning, which are either computationally expensive or prone to degrading related knowledge and overall model utility. In this work, we reformulate machine unlearning as a precise knowledge re-mapping problem via model editing. We propose ZeroUnlearn, a few-shot unlearning framework. It overwrites sensitive inputs by mapping them to a neutral target state and removing their original representations. ZeroUnlearn enforces representational orthogonality through a multiplicative parameter update with a closed-form solution, enabling efficient and targeted unlearning. We further extend ZeroUnlearn to a gradient-based variant for multi-sample unlearning. Experiments demonstrate that our approach outperforms existing baselines while preserving general model utility. Our code is available on GitHub: https://github.com/XMUDeepLIT/ZeroUnlearn.
【23】Distributional Energy-Based Models for Uncertainty-Aware Structured LLM Reasoning
标题:用于不确定性感知结构化LLM推理的分布型基于能量的模型
链接:https://arxiv.org/abs/2605.18871
作者:Shireen Kudukkil Manchingal,Abhey Kalia,Fernanda Gonçalves,Shebin Rawther
摘要:When Large Language Models produce structured outputs such as travel plans, code solutions, or multi-step proofs, individual reasoning steps may appear correct while the output as a whole violates budgets, fails test cases, or contradicts earlier deductions. We propose a decomposed energy function that combines a learned quality scorer with deterministic analytical constraint penalties for verifying structured LLM outputs. The quality scorer is a heterogeneous ensemble of low-rank adapters on a single frozen encoder (3% trainable parameters); the ensemble mean ranks candidates while the standard deviation quantifies epistemic uncertainty, driving a two-pass inference loop that triggers targeted regeneration or abstention. Across five benchmarks (GSM8K, MuSR, TravelPlanner, TACO, Knights & Knaves), our 149M-parameter verifier orchestrating a pool of 7-26B open generators outperforms single-shot Qwen-72B on every benchmark, matches Claude Sonnet 4.6 on MuSR (67.7% vs. 68.0%), and reduces constraint violations by 53% relative to Opus 4.6 on TravelPlanner (oracle 0.028, random 0.231). The two routes are complementary: structural verification wins when constraints are checkable (the verifier captures signal frontier models cannot self-detect), while pretraining-scale priors win where they are not (narrative inference, code semantics). A cross-dataset confounding analysis confirms genuine quality discrimination on four reasoning tasks and identifies a model-identity shortcut on code, mitigated via last-layer retraining. Scorers trained on difficult data transfer zero-shot: a MuSR-trained scorer achieves 93.9% on GSM8K without seeing a math problem.
【24】DarkLLM: Learning Language-Driven Adversarial Attacks with Large Language Models
标题:DarkLLM:使用大型语言模型学习语言驱动的对抗攻击
链接:https://arxiv.org/abs/2605.18868
作者:Ye Sun,Xin Wang,Jiaming Zhang,Yifeng Gao,Yixu Wang,Yifan Ding,Qixian Zhang,Henghui Ding,Xingjun Ma,Yu-Gang Jiang
备注:23 pages, 13 figures
摘要:While vision and multimodal foundation models underpin critical tasks from perception to complex reasoning, they remain highly vulnerable to adversarial attacks. However, traditional adversarial attacks are typically limited to single, predefined objectives, tightly coupling each attack to a specific model or task, which restricts their scalability and flexibility in real-world scenarios. In this work, we present DarkLLM, a novel attack framework that trains an LLM to translate natural-language attack instructions into latent attack vectors, which are then decoded into visual adversarial perturbations. By leveraging natural-language instruction tuning, DarkLLM not only unifies targeted, untargeted, segmentation, and multi-model attacks within a single framework, but also achieves flexible and controllable adversarial generation, enabling each instruction to produce a perturbation that induces desired behaviors across heterogeneous models. Through extensive experiments across 4 tasks, 13 datasets, and 15 models, we demonstrate that DarkLLM with only 1B parameters can follow attacker instructions and generate highly effective attacks against CLIP, SAM, and frontier LLMs, revealing a systemic vulnerability in modern foundation models.
【25】SAGE: Shaping Anchors for Guided Exploration in RLVR of LLMs
标题:SAGE:在LLM的RLVR中塑造用于引导式探索的锚点
链接:https://arxiv.org/abs/2605.18864
作者:Chanuk Lee,Minki Kang,Sung Ju Hwang
备注:Preprint
摘要:Recent studies observe that reinforcement learning with verifiable rewards (RLVR) reliably improves pass@1 on reasoning tasks, yet often fails to yield comparable gains in pass@k, raising the question of whether RLVR genuinely enables large language models to acquire novel reasoning abilities or merely enhances the efficiency of sampling reasoning modes already present in the base model. Prior analyses largely support the latter view, attributing this limitation to structural properties of standard RLVR objectives that result in insufficient exploration pressure. In this work, we argue that a central structural constraint arises from reverse-KL regularization, which stabilizes training but inherently anchors the policy to the reference distribution, thereby suppressing the emergence of alternative reasoning modes. However, we show that neither removing the KL term nor replacing it with forward-KL provides a satisfactory solution, as both disrupt the efficiency-coverage trade-off by either inducing reward hacking or allocating probability mass to off-target regions. To resolve this tension, we propose SAGE, a principled framework that enables controllable empirical support expansion by reshaping the reverse-KL anchor distribution itself through a guide function q(x,y), achieving consistent improvements in both pass@1 and pass@k across challenging mathematical reasoning benchmarks. Our code is available at https://github.com/tally0818/SAGE.
【26】From Llama to Cria: Scaling Down Neural Networks via Neuron-Level Spectral Structural Importance Evaluation
标题:从Llama到Cria:通过神经元级谱结构重要性评估缩小神经网络规模
链接:https://arxiv.org/abs/2605.18860
作者:Yongyu Wang
摘要:This paper proposes a neuron pruning framework based on neuron-level spectral structural importance evaluation. Given a trained neural network, we record the hidden states of each hidden layer during inference and model neurons as graph nodes, with hidden states treated as graph signals. Using ideas from graph signal processing, we infer layer-wise input and output graphs that characterize the structural relationships among neurons before and after each layer transformation. We then evaluate the spectral structural importance of neurons by analyzing the transformation between these graphs based on spectral graph theory. Neurons with high spectral structural importance are regarded as strongly involved in the internal representation transformation and are therefore preserved, while neurons with low importance scores are selected as pruning candidates. The pruning process is conducted iteratively until a predefined effective parameter reduction target is reached. Instead of fine-tuning after every pruning step, the proposed strategy first removes low-importance neurons to obtain a compact architecture and then applies a final recovery fine-tuning stage to restore task performance. By connecting neuron pruning with graph signal processing and spectral structural analysis, the proposed framework offers a principled way to reduce neural network size while maintaining solution quality. Experimental results on CIFAR-10 image classification and SST-2 sentiment classification show that our method can effectively remove low-importance neurons and achieve compact networks with competitive performance after recovery fine-tuning.
【27】TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing
标题:TwinRouterBench:面向真实代理式LLM路由的快速静态和实时动态评估
链接:https://arxiv.org/abs/2605.18859
作者:Pei Yang,Wanyi Chen,Tongyun Yang,Pengbin Feng,Jiarong Xing,Wentao Guo,Yuhang Yao,Yuhang Han,Hanchen Li,Xu Wang,Zeyu Wang,Jie Xiao,Anjie Yang,Liang Tian,Lynn Ai,Eric Yang,Tianyu Shi
摘要:LLM routing matters most in long-horizon applications such as coding agents, deep research systems, and computer-use agents, where a single user request triggers many model calls. Routing each call to the cheapest sufficient model can cut costs without sacrificing quality, yet existing router benchmarks evaluate routers only on one-shot prompts. They never expose the router-visible prefix at an intermediate agent step, never test whether a cheaper replacement preserves downstream task success, and often rely on online LLM judges at evaluation time. We introduce TwinRouterBench, a step-level routing benchmark with two tracks. The static track provides 970 router-visible prefixes from 520 instances across SWE-bench, BFCL, mtRAG, QMSum, and PinchBench, each paired with an execution-verified target tier estimated under a released downgrade-and-cascade protocol; scoring is deterministic arithmetic over tier labels, trajectory membership, and token costs, with no online evaluator-side LLM judge. The dynamic track supplies a harness that runs routers on the full 500-case SWE-bench Verified suite; in this paper we report a 100-case held-out evaluation disjoint from the static SWE supervision split. At each LLM call the router selects a concrete model from a locked pool, and success is measured by official task resolution and realized API spend. The two tracks support fast offline iteration followed by end-to-end validation under live agent execution. Code and data are available at https://github.com/CommonstackAI/TwinRouterBench.
【28】Robust Checkpoint Selection for Multimodal LLMs via Agentic Evaluation and Stability-Aware Ranking
标题:通过代理式评估和稳定性感知排名为多模态LLM进行稳健检查点选择
链接:https://arxiv.org/abs/2605.18852
作者:Qinwu Xu,Zhuoheng Li,Jessie Salas
摘要:Checkpoint selection for multimodal large language models (MLLMs) presents significant challenges when performance differentials are marginal and evaluation signals are prone to noise. Existing methodologies rely heavily on static benchmarks or pointwise scoring, which frequently misalign with in-the-wild usage and lack robust uncertainty estimation, particularly in OCR-heavy scenarios. In this work, we formulate checkpoint selection as a robust decision problem under evaluation uncertainty. We propose a multi-stage framework that integrates curated real-world data, structured LLM-based judgment, and multi-stage ranking protocols. The evaluation system orchestrates progressive refinement via pointwise filtering, listwise ranking, and pairwise comparison. To enhance reliability, we introduce subsampling-based confidence estimation and a percentile-based scoring formulation that captures distributional characteristics while penalizing tail failures. Furthermore, we demonstrate that data quality, specifically OCR readability, is a critical determinant of evaluation validity.
【29】STRIDE: Learnable Stepwise Language Feedback for LLM Reasoning
标题:STRIDE:面向LLM推理的可学习逐步语言反馈
链接:https://arxiv.org/abs/2605.18851
作者:Junjie Zhang,Guozheng Ma,Shunyu Liu,Zetian Hu,Yongcheng Jing,Ting-En Lin,Yongbin Li,Dacheng Tao
摘要:Recent advances in Reinforcement Learning (RL) have underscored its potential for incentivizing reasoning capabilities of Large Language Models (LLMs). However, existing step-level efforts suffer from costly annotations that limit domain coverage, while scalar scores further impose an information bottleneck, offering insufficient semantic bandwidth to improve intermediate decisions. Alternative language-critique approaches, which rely on frozen or external critics, provide richer textual feedback but lack the scalability needed for sustained policy improvement. In this work, we propose language-driven stepwise trajectory redirection, termed as STRIDE, a novel training framework that shifts process supervision from scalar rewards to learnable stepwise language feedback. Specifically, we co-train a generator and a generative verifier using only outcome-based rewards, eliminating external annotations, while delivering sustained policy improvement through jointly aligned verifier training. The verifier's stepwise language critiques explicitly localize and explain failures, enabling the generator to redirect reasoning trajectories at intermediate steps toward alternative decisions. The trajectory redirection design guarantees harmless policy improvement, even under noisy or suboptimal verifier feedback. Experiments on diverse reasoning benchmarks show that STRIDE significantly outperforms state-of-the-art baselines, as well as achieving breakthroughs on zero-pass-rate problems where scalar methods yield no learning signal in our ablation studies, demonstrating the effectiveness of learnable stepwise language feedback for enhancing LLM reasoning.
【30】TEMPO: Temporal Enforcement via Mode-Separated Policy Optimization for Trustworthy LLM Backtesting
标题:TEMPO:通过模式分离的策略优化实现时间约束,用于可信的LLM回测
链接:https://arxiv.org/abs/2605.18843
作者:Zeyu Zhang,Bradly C. Stadie
备注:9 pages in main context
摘要:Backtesting large language models on historical events requires reasoning exclusively from information available before a specified cutoff date. Yet models routinely leak post-cutoff knowledge from pre-training into their reasoning, inflating apparent accuracy and undermining evaluation validity. Prompt-based constraints fail when suppressed content is causally related to the prediction, and knowledge unlearning cannot address this problem because temporal compliance is instance-specific: the same fact may be legitimate evidence for one cutoff date and a violation for another. Rather than erasing knowledge, the model must learn temporal discipline: selecting evidence conditioned on each instance's cutoff date. We propose TEMPO (Temporal Enforcement via Mode-separated Policy Optimization), which trains this discipline via two contributions: (1) a two-mode reward where a leakage mode drives post-cutoff claims to zero as a hard prerequisite before a performance mode optimizes task performance; and (2) a GRPO-based training pipeline that enables the model to discover temporally valid reasoning strategies. We prove that training monotonically decreases leakage, converges to the leak-free optimum, and improves task performance once compliance is achieved. On three prediction tasks and two models, TEMPO reduces leakage from 2-13% to 0.6-3.7% across all conditions, with task performance improving 6-13% where strong pre-cutoff signals exist and maintained where the prediction task is inherently difficult from valid information alone.
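The two-mode reward described in the abstract above can be pictured as a lexicographic rule: any post-cutoff claim puts a response in the leakage mode, and task performance only counts once leakage is zero. The sketch below is one possible reading, not the reward defined in the paper. The function name, the `leak_penalty` parameter, the assumption that `task_score` lies in [0, 1], and the existence of an upstream detector that counts leaked claims are all hypothetical.

```python
def two_mode_reward(leaked_claims, task_score, leak_penalty=1.0):
    """Hypothetical two-mode reward.

    leaked_claims: number of post-cutoff claims flagged by some detector (assumed given).
    task_score: task performance in [0, 1] (assumed range).
    """
    if leaked_claims > 0:
        # Leakage mode: the reward depends only on how much was leaked,
        # so the task score cannot compensate for a violation.
        return -leak_penalty * leaked_claims
    # Performance mode: reached only when the response is leak-free.
    return task_score
```

With `task_score` in [0, 1] and a positive penalty, every leak-free response scores at least 0 and every leaking response scores below 0, which is what makes compliance a prerequisite in this sketch.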
【31】Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling
标题:撒谎只是一个阶段:语言模型缩放中隐藏的对齐转变
链接:https://arxiv.org/abs/2605.18838
作者:Adil Amin
备注:15 pages, 8 figures, 2 tables. Companion paper: "The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next." Code: https://github.com/adilamin89/cape-scaling. Dashboard: https://zehenlabs.com/cape/
摘要:Scaling laws predict loss from compute but not how capabilities interact. We measure the coupling between reasoning and truthfulness across 63 base models from 16 families and find a regime change invisible to loss curves: below a family-dependent critical scale $N_c$, capabilities anticorrelate; above it, they cooperate. $N_c \approx 3.5$B parameters [2.9B, 13.4B] (bootstrap 95% CI), but model size is not the only variable that determines phase. Architecture, data curation, and training recipe each shift $N_c$ independently: curated training eliminated the coupling dip between Qwen generations ($0.025 \to 0.830$ at matched scale), Gemma-4 at 4B achieves coupling 0.871, characteristic of 13B+ standard-trained models, through distillation and architectural innovation, and Phi at 1B matches web-trained coupling at 10B through data curation alone. Width normalization eliminates the anticorrelation across all tested families, supporting an output-projection bottleneck. Internally, 38 of 40 models show zero competing attention heads. A sparse-regression ODE cross-predicts held-out Llama-2 at 5.6% error. The diagnostic requires no model internals -- only public benchmark scores across a model family. The cooperative regime extends to the frontier ($r = +0.72$, 34 models, 10 labs). Code, data, and an open-source activation-steering tool for any open-weight model are released alongside an interactive dashboard that diagnoses any model's coupling phase, suggests concrete interventions (data curation, width, benchmark rotation), and provides ODE scaling predictions, frontier diagnostics, and eigenstructure analysis: https://zehenlabs.com/cape/.
【32】Code-Guided Reasoning for Small Language Models: Evaluating Executable MCQA Scaffolds
标题:小型语言模型的代码引导推理:评估可执行的MCQA脚手架
链接:https://arxiv.org/abs/2605.18827
作者:Prateek Biswas,Dhaval Patel,Vedant Khandelwal,Shuxin Lin,Amit Sheth
备注:28 Pages, 18 Figures
摘要:Multiple-choice QA benchmarks usually evaluate small language models (SLMs) as direct answerers, but deployed language-model systems increasingly rely on external scaffolds such as tools, code, and repeated model calls. We introduce Code-Guided Reasoning (CGR), an evaluation protocol and generated-program resource for measuring when executable reasoning scaffolds improve SLM performance on MCQA tasks. CGR standardizes six components: a normalized item interface, a direct solver prompt, a generator prompt, a Python scaffold, solver-call and extraction helpers, and a three-channel result record. On 20,498 retained result rows from a locally prepared MCQA bundle and six metadata-registered solver models, the observed non-zero-baseline partition shows 66.21% macro assisted accuracy versus 38.11% direct accuracy, a +28.10 percentage-point difference with a pair-bootstrap interval of [20.32, 36.43]. Under a stricter Ab > 30% direct-signal gate, the macro difference is +14.11 points. These estimates are descriptive. Assisted inference uses a larger solver-call budget, answer extraction is brittle, Time-MQA contains the observed regressions, and some generated programs violate the no-hard-coding instruction. CGR provides the trace package needed to interpret these results, including direct, assisted, and generator-side answers, partition definitions, generated programs, response metadata, and audits.
【33】Not All Tokens Are Worth Caching: Learning Semantic-Aware Eviction for LLM Prefix Caches
标题:并非所有token都值得缓存:为LLM前缀缓存学习语义感知的驱逐策略
链接:https://arxiv.org/abs/2605.18825
作者:Shaoke Fang,Ziang Li,Wenfei Wu,Jiatong Ji,Qingsong Liu,Ruizhi Pu
摘要:Prefix caching is a key optimization in Large Language Model (LLM) serving, reusing attention Key-Value (KV) states across requests with shared prompt prefixes to reduce expensive prefill computation. However, its benefit depends critically on the eviction policy as GPU memory is scarce, and existing policies such as LRU largely treat cached blocks uniformly. This view ignores a fundamental property of LLM prompts: not all tokens are equally worth caching. We show that different token types within a prompt, including system prompts, user queries, tool outputs, model responses, and chain-of-thought reasoning, exhibit up to 756x variation in reuse rates, yet no existing eviction policy exploits this signal. In this paper, we present SAECache (Semantic-Adaptive Eviction for prefix caches), a semantic-adaptive prefix cache eviction policy that addresses this gap through three innovations: (1) a multi-queue architecture that routes KV blocks to task-specific queues with tailored priority metrics, capturing both session reuse in multi-turn requests and structural reuse in templated single-turn requests; (2) a semantic-aware token weighting mechanism that learns the reuse value of different token types online through eviction feedback; and (3) a fully adaptive online learning schema for all parameter updates, including log-normal timing parameters, position decay power, queue weights, and meta-parameters, which eliminates manual tuning and enables automatic adaptation to deployment-specific workload characteristics. Through extensive evaluation across heterogeneous workloads, we demonstrate that SAECache achieves 1.4x-2.7x TTFT improvement over production-style baselines, while fixed-parameter alternatives can degrade by up to 2.7x under workload mismatch -- a failure mode our adaptive approach avoids entirely.
【34】Operationalizing Document AI: A Microservice Architecture for OCR and LLM Pipelines in Production
标题:文档AI的工程化落地:生产环境中OCR和LLM管道的微服务架构
链接:https://arxiv.org/abs/2605.18818
作者:Yao Fehlis,Benjamin Bengfort,Zhangzhang Si,Vahid Eyorokon,Prema Roman,Patrick Deziel,Devon Slonaker,Steve Veldman,Ben Johnson,Joyce Rigelo,Michael Wharton,Steve Kramer
摘要:Academic research tends to focus on new models for document understanding, creating a wide gap in the literature between model definition and running models at production scale. To close that gap, we present a microservice architecture that encapsulates pipelines of multiple models for classification, optical character recognition (OCR), and large language model structured field extraction as well as our experience running this pipeline on thousands of multi-page documents per hour. We describe our primary design decisions, including a hybrid classification, separation of GPU-bound inference from CPU-bound orchestration, use of asynchronous processing for the many IO-bound operations in the pipeline, and an independent, horizontal scaling strategy. Using batch profiling, we identified two surprising qualitative findings that shape production deployments: OCR, not language-model parsing, dominates end-to-end latency, and the system saturates at a concurrency determined by shared GPU-inference capacity rather than worker count. Our goal is to provide practitioners with concrete architectural patterns for building document understanding systems that work beyond the benchmark, effectively operationalizing models in production.
【35】DynaTrain: Fast Online Parallelism Switching for Elastic LLM Training
标题:DynaTrain:面向弹性LLM训练的快速在线并行策略切换
链接:https://arxiv.org/abs/2605.18815
作者:Yuanqing Wang,Yuchen Zhang,Hao Lin,Junhao Hu,Chunyang Zhu,Quanlu Zhang,Boxun Li,Guohao Dai,Zhi Yang,Daning Cheng,Yunquan Zhang,Yu Wang
备注:GitHub Repo: https://github.com/infinigence/ElasticMegatron
摘要:Modern large language model (LLM) training is inherently dynamic: resource fluctuations, RLHF phase shifts, and cluster elasticity continually reshape the optimal parallelism layout, posing a significant challenge to existing training frameworks built around a static execution model. We present DynaTrain, a distributed training system for sub-second, online reconfiguration across arbitrary multi-dimensional parallelism. At its core, we propose a Virtual Parameter Space (VPS) abstraction that unifies all distributed training states under one logical coordinate space, turning any parallelism configuration into a deterministic mapping and collapsing complex transition into manageable geometric intersections. On top of VPS, a state routing-and-transition layer executes rank-local transfers under a memory-aware, deadlock-free schedule, and an Elastic Device Manager overlaps new-world construction with ongoing training to mask topology-change cost. On dense and MoE models up to 235B parameters, DynaTrain reconfigures a 70B dense model in under 2s and a 235B MoE model in 4.36s, outperforming state-of-the-art checkpoint-based and elastic systems by up to three orders of magnitude while preserving correctness.
【36】PASC: Pipeline-Aware Conformal Prediction with Joint Coverage Guarantees for Multi-Stage NLP and LLM Pipelines
标题:PASC:具有多级NLP和LLM管道联合覆盖保证的管道感知共形预测
链接:https://arxiv.org/abs/2605.18812
作者:Varun Kotte
摘要:Modern NLP and LLM systems are pipelines: named entity recognition (NER) -> entity disambiguation (NED) -> entity typing, retrieval-augmented generation (retriever -> reader), and agentic chains of planner -> tool -> critic. Errors compound across stages, but existing uncertainty quantification methods either calibrate each stage independently (no joint coverage) or apply a Bonferroni union bound (joint coverage, but conservative). We present PASC (Pipeline-Aware Split Conformal), which reduces multi-stage joint coverage to a single scalar conformal prediction problem on the joint maximum nonconformity score. PASC provides a finite-sample distribution-free guarantee that all K stages are simultaneously covered with probability at least 1 - alpha, and is nearly tight up to a 1/(n+1) factor. On a three-stage NER -> NED -> entity-typing pipeline over CoNLL-2003, PASC achieves 96.4% end-to-end coverage versus 93.4% for Bonferroni and 86.5% for independent CP, at identical average prediction set size (1.083). Under distribution shift to WNUT-17 Twitter and WikiNEuRal Wikipedia data, PASC empirically maintains the target coverage in the tested shift settings while independent CP collapses to 59%. PASC requires a single quantile computation, runs 1.7x faster than Bonferroni, and scales to K = 6 stages where independent CP drops to 0.53 end-to-end coverage. The same joint-maximum-score reduction applies directly to compound LLM systems and agent pipelines.
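The reduction described in the abstract above, collapsing K per-stage nonconformity scores into their maximum and running ordinary split conformal prediction on that scalar, can be sketched as follows. This is a minimal illustration under stated assumptions, not the PASC code: the function names and data layout are hypothetical, lower scores are assumed to mean better conformity, and the scores of all stages are assumed to be on comparable scales.

```python
import math


def joint_max_threshold(cal_scores, alpha):
    """cal_scores: one list of K stage scores (for the true outputs) per calibration example."""
    # Reduce each calibration example to a single scalar: its worst stage score.
    joint = sorted(max(stage_scores) for stage_scores in cal_scores)
    n = len(joint)
    # Standard split-conformal rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return joint[rank - 1]


def prediction_sets(candidate_scores, threshold):
    """candidate_scores: one dict {label: score} per stage; keep labels at or below the threshold."""
    return [
        {label for label, score in stage.items() if score <= threshold}
        for stage in candidate_scores
    ]
```

Because a single threshold is applied to every stage, all K true outputs fall inside their sets exactly when the joint maximum score is at or below that threshold, which is why one quantile computation suffices.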
【37】Compositional Literary Primitives in Instruction-Tuned LLMs: Cross-Architectural SAE Features for Self, Style, and Affect
标题:指令微调LLM中的组合式文学基元:关于自我、风格和情感的跨架构SAE特征
链接:https://arxiv.org/abs/2605.18808
作者:Joao Paulo Cavalcante Presa,Savio Salvarino Teles de Oliveira
备注:36 pages, 6 figures
摘要:We characterize a compositional architecture of literary primitives in two instruction-tuned large language models (Llama 3.1 8B-Instruct and Gemma 2 9B-IT) via sparse autoencoders on mid-depth residual streams. Four feature classes emerge: naming-gates that promote lexical tokens of a target affect, an eleven-self cluster of first-person register features, stylistic register modulators (show-don't-tell and defamiliarization), and compositional emotions that arise only from multi-feature steering. Under a forced-choice 5-LLM judge panel applied to a 27-category emotion taxonomy (Cowen-Keltner), Llama reaches full 27/27 coverage by combining naming-gates, multi-feature recipes, and single self-feature steering; Gemma reaches 23/27 with adoration as the single residual strict-fail. Under random judging, the per-cell pass probability is on the order of $10^{-3}$ and the expected number of two-seed false-positive cells across the catalog is negligible, so the observed coverage is not consistent with chance. A cross-architectural asymmetry sits in the strict-versus-soft judge contrast: on the same generations, judges agree more often on Llama outputs than on Gemma outputs because Llama outputs name the target affect more directly while Gemma outputs evoke it through scene and imagery. Both architectures contain self-features that serve simultaneously as register markers and as emotion emitters, including a single most-RLHF-loaded self-feature per architecture that intensifies the institutional Helper-AI persona at one operating regime and produces affect-categorizable output at the same calibrated coefficient. Methodologically, the paper presents a three-stage validation pipeline (logit-lens, LLM-rate, 5-LLM judge) with documented anti-patterns; the total compute is single-GPU and about 15 minutes per emotion-feature discovery cycle.
【38】RecoAtlas: From Semantic Plausibility to Set-Level Utility in LLM Recommendation Agents
标题:RecoAtlas:LLM推荐代理中从语义合理性到集合级效用
链接:https://arxiv.org/abs/2605.18805
作者:Imad Aouali,Flavian Vasile,Otmane Sakhi,Alexandre Gilotte,Benjamin Heymann
备注:Benchmark on LLM Recommendation Agents
摘要:LLM recommendation agents increasingly produce structured recommendation reports: sets of items accompanied by natural-language justifications. Yet existing evaluations often reduce this setting to reranking small shortlisted candidate sets or judge reports mainly by semantic plausibility. We introduce Recommendation Atlas (Agentic Tool-Level Assessment for Shopping), or RecoAtlas, a benchmark and toolkit for evaluating shopping agents with behavior-grounded metrics. RecoAtlas complements held-out interaction metrics with learned utility proxies for relevance, complementarity, and diversity derived from interaction data, while separately measuring semantic coherence and explanation quality. Its controlled tool environment exposes agents to either semantic, behavior-aligned, or faulty tools, enabling diagnosis of whether performance gains arise from stronger reasoning, better signals, or more effective tool-use policies. Across controlled experiments, we show that RecoAtlas exhibits key properties of a meaningful benchmark for agentic systems: performance scales with model capacity and test-time compute, improves with stronger and better-aligned tools, degrades under noisy or misaligned signals, and reveals that semantic plausibility does not necessarily capture behavior-grounded utility. RecoAtlas provides a foundation for developing and evaluating shopping assistants that optimize not only for plausible recommendations, but also for coherent, behaviorally grounded recommendation sets.
【39】Position: Let's Develop Data Probes to Fundamentally Understand How Data Affects LLM Performance
标题:立场:让我们开发数据探针,从根本上了解数据如何影响LLM性能
链接:https://arxiv.org/abs/2605.18801
作者:Shiqiang Wang,Herbert Woisetschläger,Hans Arno Jacobsen,Mingyue Ji
备注:Accepted to ICML 2026 Position Paper Track
摘要:Data is fundamental to large language models (LLMs). However, understanding of what makes certain data useful for different stages of an LLM workflow, including training, tuning, alignment, in-context learning, etc., and why, remains an open question. Current approaches rely heavily on extensive experimentation with large public datasets to obtain empirical heuristics for data filtering and dataset construction. These approaches are compute intensive and lack a principled way of understanding the essence of how specific data characteristics drive LLM behavior. In this position paper, we advocate for the need of developing systematic methodologies for generating synthetic sequences from appropriately defined random processes, with the goal that these sequences can reveal useful characteristics when they are used in one or multiple stages of the LLM workflow. We refer to such sequences as data probes. By observing LLM behavior on data probes, researchers can systematically conduct studies on how data characteristics influence model performance, generalization, and robustness. The probing sequences exhibit statistical properties that can be viewed using theoretical concepts, such as typical sets, which are generalized to describe the behaviors of LLMs. This data-probe approach provides a pathway for uncovering foundational insights into the role of data in LLM training and inference, beyond empirical heuristics.
【40】UCCI: Calibrated Uncertainty for Cost-Optimal LLM Cascade Routing
标题:UCCI:用于成本最优LLM级联路由的校准不确定性
链接:https://arxiv.org/abs/2605.18796
作者:Varun Kotte
备注:9 pages, 2 figures, 4 tables. Code: https://github.com/varunkotte6/ucci
摘要:LLM cascades and model routing promise lower inference cost by sending easy queries to a small model and escalating hard ones to a large model, but most deployed routers use uncalibrated confidence scores and require per-workload threshold tuning. We present UCCI, a calibration-first router that maps token-level margin uncertainty to a per-query error probability via isotonic regression and selects the escalation threshold by constrained cost minimization. Under three explicit assumptions, threshold policies on the calibrated score are cost-optimal, and isotonic calibration achieves O(n^{-1/3}) sample complexity for expected calibration error (ECE). On a production named entity recognition workload of 75,000 queries served by 4B and 12B instruction-tuned LLMs on H100 GPUs, UCCI cuts inference cost by 31% (95% CI: [27%, 35%]) at micro-F1 = 0.91 while reducing ECE from 0.12 to 0.03. At the same operating point, UCCI beats entropy thresholding, split-conformal routing, and a FrugalGPT-style learned threshold. All cascade results use end-to-end routing on actual model outputs and measured H100 latency, not simulated routing from global accuracies or nominal API prices.
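The calibrate-then-threshold idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the cost model (small model always runs, escalation adds `cost_large`), the assumed large-model error rate `large_error`, the error budget `max_error`, and all function names are hypothetical choices made only for illustration.

```python
from typing import List, Optional, Sequence, Tuple


def isotonic_fit(scores: Sequence[float], errors: Sequence[int]) -> List[Tuple[float, float]]:
    """Pool-adjacent-violators fit of a non-decreasing map: uncertainty score -> error rate.

    Returns (max_score_in_block, calibrated_error_probability) blocks sorted by score.
    """
    blocks: List[list] = []  # each block is [sum_of_errors, count, max_score]
    for s, e in sorted(zip(scores, errors)):
        blocks.append([float(e), 1, s])
        # Merge backwards while the previous block's mean exceeds the current one.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            tot, cnt, hi = blocks.pop()
            blocks[-1][0] += tot
            blocks[-1][1] += cnt
            blocks[-1][2] = hi
    return [(hi, tot / cnt) for tot, cnt, hi in blocks]


def calibrated_error(blocks: List[Tuple[float, float]], score: float) -> float:
    """Look up the calibrated error probability for a raw uncertainty score."""
    for hi, p in blocks:
        if score <= hi:
            return p
    return blocks[-1][1]


def pick_threshold(probs: Sequence[float], large_error: float, max_error: float,
                   cost_small: float = 1.0, cost_large: float = 4.0) -> Optional[Tuple[float, float]]:
    """Cheapest threshold t (escalate when p > t) whose expected error stays within budget.

    Returns (threshold, total_cost), or None if no threshold meets the budget.
    """
    best = None
    for t in [-1.0] + sorted(set(probs)):  # t = -1 escalates every query
        kept = [p for p in probs if p <= t]
        n_escalated = len(probs) - len(kept)
        expected_error = (sum(kept) + large_error * n_escalated) / len(probs)
        cost = cost_small * len(probs) + cost_large * n_escalated
        if expected_error <= max_error and (best is None or cost < best[1]):
            best = (t, cost)
    return best


# Toy calibration set: higher score means the small model was less certain.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
errors = [0, 0, 1, 0, 0, 1, 1, 1]
blocks = isotonic_fit(scores, errors)
probs = [calibrated_error(blocks, s) for s in scores]
choice = pick_threshold(probs, large_error=0.05, max_error=0.2)
```

Because the calibrated score is monotone in the raw uncertainty, a single threshold on it corresponds to a single threshold on the raw score, which is what makes a simple scan over candidate thresholds sufficient in this toy setting.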
【41】A Reproducibility Analysis of PO4ISR: Diagnosing and Mitigating Semantic Drift in LLM-Based Session Recommendation
标题:PO4ISR的可复现性分析:诊断和缓解基于LLM的会话推荐中的语义漂移
链接:https://arxiv.org/abs/2605.18780
作者:Aditya Tiwari,Konduri Naga Lakshmi Rekha,Rajesh Kumar Mundotiya
摘要:Reasoning-based Large Language Models (LLMs) like PO4ISR have set new benchmarks in session-based recommendation. However, the reproducibility of their reasoning capabilities across diverse semantic domains remains unexplored. In this work, we conduct a rigorous reproducibility study of PO4ISR to assess its generalization limits. Our analysis reveals a critical failure mode: standard reasoning prompts suffer from severe contextual drift in long sessions, leading to performance degradation on semantically complex datasets like Games and Bundle. To quantify and resolve this stability gap, we introduce PO4ISR++, a robustness-enhanced implementation that integrates reflexive prompting and consistent rank detection. Unlike the original static prompting strategy, our approach dynamically adapts to cross-domain cues. We benchmark both the original implementation and our robust variant on ML-1M, Games, and Bundle. Our results confirm that while the original model struggles in new domains, our reproducible extension restores performance, yielding a stabilized gain of up to 54% on Games and 96% on Bundle. We release open-source artifacts, including the reproduced baseline and our enhanced framework, to facilitate reliable future research in LLM-based recommendation.
【42】Brain alignment of reasoning and action representations from vision-language and action models during naturalistic gameplay
标题:自然主义游戏过程中视觉语言和动作模型的推理和动作表示的大脑对齐
链接:https://arxiv.org/abs/2605.19352
作者:Subba Reddy Oota,Anant Khandelwal,Khushbu Pahwa,Satya Sai Srinath Namburi,Tanmoy Chakraborty,Bapi S. Raju,Manish Gupta
备注:21 pages, 11 figures
摘要:Understanding how humans and artificial intelligence systems predict and plan by interacting with their environment is a fundamental challenge at the intersection of neuroscience and machine learning. Most brain-encoding studies focus on aligning artificial models with brain activity during language comprehension or passive visual processing, while interactive brain-alignment studies have to date been largely limited to reinforcement-learning (RL) agents and theory-based models. To address this gap, we study brain alignment of representative models from two foundation-model families, namely vision-language models (VLMs) and large-action models (LAMs), using fMRI recordings from participants playing naturalistic Atari-style video games. Specifically, we examine how action-focused and reasoning-focused prompts shape models' internal representations and align with fMRI brain activity. First, we find that both VLMs and LAMs exhibit significantly higher voxel-wise encoding performance than RL baselines, with the advantage holding even under matched feature dimensionality. Second, prompt-driven gains scale with the cortical processing hierarchy: the largest improvements appear in frontal-parietal and motor-planning regions, while early visual cortex gains roughly half as much. Third, variance partitioning reveals a qualitatively different representational organization: VLM is prompt-symmetric (12.5% unique action vs. 13.6% unique reasoning), whereas LAM is prompt-asymmetric (27% unique action vs. -5% unique reasoning), with the asymmetry strongest in frontal-motor cortex. Together, these results demonstrate that action-specialized fine-tuning reorganizes multimodal representations toward action-relevant neural computations even when whole-brain prediction accuracy is statistically equivalent between VLM and LAM.
Graph相关(图学习|图神经网络|图优化等)(16篇)
【1】CAMERA: Adapting to Semantic Camouflage in Unsupervised Text-Attributed Graph Fraud Detection
标题:CAMERA:适应无监督文本属性图欺诈检测中的语义伪装
链接:https://arxiv.org/abs/2605.20032
作者:Junjun Pan,Yixin Liu,Yu Zheng,Lianhua Chi,Alan Wee-Chung Liew,Shirui Pan
备注:Accepted by IJCAI 2026
摘要:Text-attributed graph fraud detection (TAGFD) plays a critical role in preventing fraudulent activities on online social and e-commerce platforms. However, to evade detection, fraudsters continuously evolve their camouflaging strategies by deliberately mimicking textual responses of benign users, thereby concealing their malicious purposes. This phenomenon, referred to as semantic camouflage, fundamentally undermines commonly relied assumptions on how structural and attribute cues can be exploited to identify fraudsters, and makes it difficult to spot fraudsters with unsupervised TAGFD. To bridge the gaps, we propose a Case-Adaptive Multi-cue Expert fRAmework (CAMERA) for unsupervised TAGFD. CAMERA employs an ego-decoupled mixture-of-experts architecture, where each expert specializes in modeling a distinct type of fraud-indicative cue. A context-informed gating model is introduced to jointly consider the ego node representation and its local neighborhood context for adaptive integration of cues learned by different experts. Furthermore, CAMERA leverages the inherent rarity of fraudsters to support unsupervised one-class learning with expert-level objectives that encourage modeling dominant benign patterns, thereby enabling reliable unsupervised detection of camouflaged fraudsters. Experiments on 4 challenging datasets show that CAMERA consistently outperforms competitors, showing its effectiveness against semantically camouflaged fraudsters. Code available at https://github.com/CampanulaBells/CAMERA
【2】Graph Neural Networks for Community Detection in Graph Signal Analysis
标题:图信号分析中用于社区检测的图神经网络
链接:https://arxiv.org/abs/2605.19733
作者:Roberto Cavoretto,Alessandra De Rossi,Enrico Montini
摘要:Community detection is a central problem in graph analysis, with applications ranging from network science to graph signal processing. In recent years, Graph Neural Networks (GNNs) have emerged as effective tools for learning low-dimensional representations of graph-structured data and have shown strong performance in clustering tasks, particularly on large and high-dimensional graphs. This paper investigates the use of GNN-based community detection within a graph signal interpolation framework. After reviewing the main classes of GNN architectures for community detection according to a standard taxonomy, we integrate the resulting graph communities into a Partition of Unity Method (PUM) for interpolation with Graph Basis Functions (GBFs). In this approach, GNN-derived communities are used to construct local subdomains on which GBF interpolants are computed and subsequently combined into a global approximation. Numerical experiments on benchmark datasets, including geometric and urban network examples, demonstrate that the proposed combination of GNN-based clustering and GBF-PUM interpolation yields accurate signal reconstructions. The results indicate that deep learning-based community detection can provide effective graph partitions for localized interpolation schemes, supporting its use in scalable graph signal analysis.
【3】Projecting Latent RL Actions: Towards Generalizable and Scalable Graph Combinatorial Optimization
标题:投影潜在RL动作:迈向可推广和可扩展的图组合优化
链接:https://arxiv.org/abs/2605.19721
作者:Franco Terranova,Guillermo Bernardez,Albert Cabellos-Aparicio,Nina Miolane,Abdelkader Lahmadi
备注:Preprint
摘要:Graph combinatorial optimization (GCO) has attracted growing interest, as many NP-hard problems naturally admit graph formulations, yet their combinatorial explosion renders exact methods computationally intractable. Recent advances in Reinforcement Learning (RL) combined with Graph Neural Networks (GNNs) have significantly improved learning-based GCO solvers. However, existing approaches face limitations in both generalization across diverse graph instances and computational scalability as action spaces grow. To address both challenges, we introduce projection agents, a novel RL-GCO approach that operates directly in a continuous GNN-based action embedding space, predicting a desired latent action in a single forward pass and subsequently decoding it into a valid discrete action. Additionally, we enable fair comparison across RL methods through a shared embedding space for both observations and actions. Across diverse benchmarks, our approach achieves up to 16.2x faster inference and up to 40% better generalization than existing solutions using only simple nearest-neighbor decoding, while opening the door to strong RL performance in super-linear decision spaces with multiple interdependent variables. Finally, we release LaGCO-RL, a Python library that automates latent action-space construction and supports existing RL-GCO solutions, promoting reproducibility and adaptation to new GCO benchmarks.
【4】Inferring Sensitive Attributes from Knowledge Graph Embeddings: Attack and Defense Strategies
标题:从知识图嵌入推断敏感属性:攻击和防御策略
链接:https://arxiv.org/abs/2605.19644
作者:Yasmine Hayder
摘要:Knowledge Graphs (KGs) are a powerful representation of linked data, offering flexibility, semantic richness, and support for knowledge enrichment and reasoning. They help data owners organize and exploit heterogeneous data to provide insightful services (e.g., recommendations), yet real-world KGs are often incomplete, hiding true facts or missing valuable insights. Knowledge graph embedding techniques are commonly used to infer valuable missing information. However, reasoning over KGs can inadvertently expose sensitive user information, even when such data is not explicitly stored. In this work, we investigate the privacy risks associated with KGE-based reasoning, focusing on attribute inference attacks where adversaries attempt to deduce sensitive user attributes from seemingly non-sensitive outputs. We propose and evaluate a framework that mitigates these privacy risks by applying post processing sanitization techniques to KGE outputs. Preliminary results demonstrate the effectiveness of these attacks on the outputs of KGE models, and explore the trade-off between recommendation quality and privacy protection when applying randomization based approaches, highlighting the need to experiment with more advanced techniques in future work to address this issue.
【5】Physics-Informed Graph Neural Network Surrogates for Turbulent Nanoparticle Dispersion in Dental Clinical Environments
标题:牙科临床环境中湍流纳米颗粒分散的物理知识图神经网络替代品
链接:https://arxiv.org/abs/2605.19589
作者:Takshak Shende,Viktor Popov
备注:40 pages, 12 figures
摘要:Dental aerosol procedures produce sub-50 micrometre nuclei that can remain airborne for long periods in enclosed clinics, creating pathways for airborne pathogen transmission. Reynolds-Averaged Navier-Stokes (RANS) simulations with Euler-Lagrange particle tracking capture this transport accurately but require very long run times per scenario, which precludes real-time clinical decision support in 3D. We present the Eulerian-Lagrangian Graph Interaction Network (ELGIN), a physics-informed graph surrogate that jointly predicts carrier-flow dynamics on the OpenFOAM polyhedral mesh and the per-parcel motion of the polydisperse spray cloud. ELGIN couples a multi-head Graph Transformer with Jacobi-preconditioned learnable pressure projection and a turbulence-closure head to a sigmoid-gated Lagrangian Interaction Network through differentiable inverse-distance mesh-parcel coupling, and advances parcels with a symplectic Stormer-Verlet integrator. A four-stage physics-informed curriculum stabilises 260-step autoregressive rollouts without gradient explosion. A parameter sweep with foam-extend 4.1 OpenFOAM reactingParcelFoam across clinically relevant ventilation rates and handpiece spray speeds provides CFD ground truth. This article reports a single-case demonstration in which both ELGIN and a Lagrangian-only baseline (M0) are trained and evaluated on Sweep_Case_03 of a twenty-case sweep; full 16/2/2 retraining is in progress and will replace all reported metrics. On this case, ELGIN tracks the foam-extend particle cloud much more closely than M0: mean parcel displacement error falls from 19.56% to 16.20% of room width and cloud radius-of-gyration error from 9.85% to 6.58%. A 26-second rollout completes in ~64 s on a 4 GB GPU, approximately 37x faster than the foam-extend reference pipeline, toward per-appointment infection-risk screening once the multi-case checkpoint is in place.
【6】Planner-Admissible Graph-PDE Value Extensions for Sparse Goal-Conditioned Planning
标题:稀疏目标条件规划的规划者容许图-PDE值扩张
链接:https://arxiv.org/abs/2605.19185
作者:Shiheng Zhang
摘要:Sparse goal-conditioned planning with few cost-to-go labels can be viewed as a graph-PDE Dirichlet extension problem: extend sparse labels on a goal-dependent boundary to unlabelled graph vertices so that greedy rollouts reach the goal. We study which graph value extensions are planner-admissible under the operational argmin-Q planner. Our main result is a local action-gap certificate: if the surrogate value error along the rollout stays below half the true action gap, then the greedy rollout reaches the goal. Absolutely Minimal Lipschitz Extension (AMLE), the p=infinity endpoint of the graph p-Laplacian family, instantiates this certificate through a comparison-principle fill-distance bound. Harmonic extension, by contrast, can mis-rank local actions because its values reflect boundary hitting probabilities rather than shortest-path greedy order. On 120 AntMaze layout-derived graph configurations, harmonic extension achieves 0.584 aggregate rollout success, while AMLE reaches 0.970. Finite high-p methods also enter a high-success regime, with success 0.903 for p=4, 0.973 for p=8, and 0.982 for a fixed-budget p=16 solver, though the p=16 row is not used as a converged endpoint ranking due to incomplete solver certification. Mechanism audits show that many rollout decisions occur in AMLE-compatible but harmonic-incompatible local geometry, and that AMLE corrects most harmonic inversions on the rollout-weighted decision scope.
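The half-gap condition in this abstract can be motivated by a short argument at a single state. The sketch below is a simplified illustration, not the paper's theorem: it assumes a unique best action $a^\star$, a true action cost $Q$, a surrogate $\hat{Q}$, and a uniform error bound $\varepsilon$ at state $s$; these symbols are introduced here and do not appear in the abstract.

```latex
\begin{align*}
\Delta(s) &= \min_{a \neq a^\star} Q(s,a) - Q(s,a^\star), \qquad
\bigl|\hat{Q}(s,a) - Q(s,a)\bigr| \le \varepsilon < \tfrac{1}{2}\,\Delta(s)
  \quad \text{for all } a, \\[4pt]
\hat{Q}(s,a^\star)
  &\le Q(s,a^\star) + \varepsilon
   < Q(s,a^\star) + \Delta(s) - \varepsilon
   \le Q(s,a) - \varepsilon
   \le \hat{Q}(s,a)
  \quad \text{for all } a \neq a^\star .
\end{align*}
```

Under these assumptions the surrogate argmin picks the same action as the true argmin at $s$. Applying the same argument at every state visited by the rollout suggests why a greedy rollout on the surrogate would follow the true greedy path to the goal.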
【7】GRASP: Deterministic argument ranking in interaction graphs
标题:GRASP:交互图中的确定性论点排名
链接:https://arxiv.org/abs/2605.19141
作者:Diganta Misra,Antonio Orvieto,Rediet Abebe,Volkan Cevher
备注:Preprint
摘要:Large language models are increasingly deployed as automated judges to evaluate the strength of arguments. As this role expands, their legitimacy depends on consistency, transparency, and the ability to separate argumentative structure from rhetorical appeal. However, we show that holistic judging - a common LLM-as-a-Judge practice where a model provides a global verdict on a debate - suffers from substantial inter-model disagreement. We argue that this instability arises from collapsing a debate's complex interaction structure into a single opaque score. To address this, we propose GRASP (Gradual Ranking with Attacks and Support Propagation), a deterministic framework that aggregates stable local interaction judgments into a global ranking via a convergent attack-defense propagation operator. We show that local interaction judgments are more reproducible than holistic rankings in LLM-as-a-Judge evaluations, allowing GRASP to produce more consistent global rankings. We further show that GRASP scores do not correlate with human "convincingness" labels, highlighting a vital sociotechnical distinction: GRASP does not measure persuasion, factuality, or rhetorical appeal, but structural sufficiency - a defense-aware notion of argument robustness over the explicit interaction graph. Overall, GRASP offers a transparent and auditable alternative to holistic LLM judging.
【8】GOAL: Graph-based Objective-Aligned Diffusion Solvers for Dynamic Multi-Objective Optimization
标题:GOAL:用于动态多目标优化的基于图的目标对齐扩散求解器
链接:https://arxiv.org/abs/2605.19119
作者:Xingyu Li
摘要:Existing neural combinatorial optimization solvers frame solution search as imitation of optimal decisions, inherently limiting their utility to single-objective minimization and static constraints. We propose GOAL, a conditioned diffusion solver over relational graph representations that enables controllable decision generations by conditioning on human-specified objectives. We introduce a heterogeneous graph encoding in which distinct edge types, corresponding to different classes of constraints, define the message passing structure of the graph neural network, which allows information to propagate selectively according to the ontology of each constraint. GOAL is instantiated and evaluated on three canonical scheduling benchmarks of various constraint complexity: the Flow Shop Problem (FSP), the Job Shop Scheduling Problem (JSP), and the Flexible Job Shop Scheduling Problem (FJSP). Generalization is demonstrated across structurally distinct constraint regimes and problem types without architectural modification. On all three benchmarks, GOAL achieves 100% solution feasibility and near-zero MAPE (below 0.20%) on multiple objectives for problem sizes up to 20 jobs and 60 operations, outperforming NSGA-II and MOEA/D in both solution quality and inference speed by up to 25x.
【9】SCAFDS: Edge-Feature Graph Attention for Interbank Fraud Detection with Attribution-Grounded SAR Generation
标题:SCAFDS:用于银行间欺诈检测的边特征图注意力与基于归因的SAR生成
链接:https://arxiv.org/abs/2605.18913
作者:Mohammad Nasir Uddin
摘要:The U.S. financial system processes approximately 1.3 million interbank transactions daily, yet no system in the reviewed literature models fraud propagation across the interbank network using fraud co-occurrence edge features. Prior interbank GNN architectures model credit contagion using credit distress supervision signals, producing systems misaligned for fraud forensics. No existing system generates SAR narratives with per-assertion forensic traceability to specific numerical detection outputs, creating regulatory auditability gaps in FinCEN-submitted reports. This paper introduces SCAFDS (Systemic Contagion-Aware Fraud Detection System), a seven-stage integrated surveillance pipeline addressing five structural limitations of prior art: (1) fraud-specific interbank topology encoding using fraud co-occurrence frequency metrics f(u,v,t) derived from FinCEN SAR registry records; (2) edge-feature-informed graph attention where coefficients are computed from both node representations and fraud co-occurrence edge features; (3) bilinear fraud co-occurrence risk fusion producing institution-level systemic fraud risk scores; (4) attribution-conditioned SAR narrative generation with per-assertion significance thresholds ensuring each FinCEN SAR assertion is traceable to a specific numerical pipeline output; and (5) topology-aware adaptive forensic feedback updating graph attention weights from regulatory dispositions. Experiments on the IEEE-CIS Fraud Detection Dataset (590,540 transactions) and a synthetic FDIC-aligned interbank network (8,103 institutions, 169,800 edges) show SCAFDS achieves AUPRC=0.515+/-0.032 and AUROC=0.802+/-0.018, representing +15.9pp and +13.7pp improvements over GraphSAGE-AML. Partial validation on FDIC enforcement action records (n=4,279) confirms consistent model ranking. USPTO Provisional Patent Application No. 64/061,083, filed May 8, 2026.
【10】Position: Graph Condensation Needs a Reset -- Move Beyond Full-dataset Training and Model-Dependence
标题:立场:图浓缩需要重置——超越全数据集训练和模型依赖
链接:https://arxiv.org/abs/2605.18893
作者:Mridul Gupta,Samyak Jain,Vansh Ramani,Hariprasad Kodamana,Sayan Ranu
摘要:Graph Neural Networks (GNNs) are powerful tools for learning from graph-structured data, but their scalability is increasingly strained by the size of real-world graphs in domains like recommender systems, fraud detection, and molecular biology. Graph condensation -- the task of generating a smaller synthetic graph that retains the performance of models trained on the original -- has emerged as a promising solution. However, the dominant approach of gradient matching introduces a fundamental contradiction: it requires training on the full dataset to create the compressed version, thereby undermining the goal of efficiency. Worse still, these methods suffer from high computational overhead, poor generalization across GNN architectures, and brittle reliance on specific model configurations. Equally concerning is the community's reliance on misleading evaluation protocols such as node compression ratios, which fail to reflect true resource savings, condensation overhead, and illusory application to neural architecture search. These shortcomings are not incidental -- they are systemic, and they obstruct meaningful progress. In this position paper, we argue that graph condensation, in its current form, needs a reset. We call for moving beyond full-dataset training and model-dependent design, and instead advocate for methods that are lightweight, architecture-agnostic, and practically deployable. By identifying key methodological flaws and outlining concrete research directions, we aim to reorient the field toward approaches that deliver on the true promise of condensation: efficient, generalizable, and usable GNN training at scale.
【11】Graph-Driven Cross-Industry Real-Time Monitoring Framework for Anti-Money Laundering Detection in Converged Mobility-Energy Supply Chain Networks
标题:图形驱动的跨行业实时监控框架,用于融合移动能源供应链网络中的反洗钱检测
链接:https://arxiv.org/abs/2605.18844
作者:Rong Liu,Xiaojun Xiao,Zhanqing Su
摘要:With the deep integration of the travel and energy industries, cross-industry supply chain finance has gradually become a high-risk field for hidden money laundering incidents. For this reason, this work proposes a graph-driven cross-industry real-time anti-money laundering monitoring framework (GCRMF) for integrated travel-energy supply chain networks. First, a cross-industry heterogeneous graph (CIHG) covering new energy vehicle rental platforms, energy suppliers, fintech institutions, etc., is constructed, and industry semantics are integrated through a Temporal Dual-GAT (Temporal Dual-Graph Attention Network), dynamically encoding capital flow paths and their evolution over time. Subsequently, in order to identify structural fraud behavior jointly produced by colluding subjects, a meta-path subgraph reasoning module based on contrastive learning and hierarchical graph sampling is proposed to enhance the discrimination of recurring cross-industry money laundering behavior. Meanwhile, a self-supervised online learning mechanism is adopted for real-time adaptation and continuous optimization to new money laundering strategies. The experimental results show that, compared with existing graph neural network methods in cross-industry scenarios, GCRMF improves the F1 score by more than 17.8% and greatly reduces the false positive rate.
【12】Automated Big Data Quality Assessment using Knowledge Graph Embeddings
标题:使用知识图嵌入自动化大数据质量评估
链接:https://arxiv.org/abs/2605.18833
作者:Hadi Fadlallah,Rima Kilany,Mitri Haber,Ali Jaber
备注:17 pages, 10 figures
摘要:Automated data quality assessment is crucial for managing big data, but existing solutions face challenges in achieving accurate context-aware assessment. This paper presents a novel knowledge-based approach to enhance automated data quality assessment. Our approach utilizes knowledge graph embeddings to predict missing edges between the input dataset's context representation and the relevant quality rules and dimensions within a knowledge graph representing contextual data characteristics and the required quality assessment operations. We surpass conventional practices by integrating diverse representations within the knowledge graph, drawing insights from contextual information from a thorough literature investigation. This integration allows us to develop a comprehensive and context-specific data quality assessment plan tailored to each context. Leveraging the knowledge graph improves our understanding of the input dataset's context, overcoming the limitations of traditional methods that rely solely on strict matching and overlook contextual characteristics. By injecting numerical edge attributes, we assign corresponding weights to each predicted quality measurement, providing a comprehensive data quality assessment plan for the input dataset. To evaluate our approach, we leverage AmpliGraph, a framework developed and benchmarked by AccentureLabs. The evaluation involves employing a real-world radiation sensors dataset provided by the Lebanese Atomic Energy Commission (LAEC-CNRS). The results obtained from this evaluation demonstrate the capability of our solution to generate a comprehensive data quality assessment plan for the given input dataset.
【13】Diffusion Graph Posterior Sampling for Nonlinear Inverse Problems with Application to Electrical Impedance Tomography
标题:非线性反问题的扩散图后验抽样及其在电阻抗断层扫描中的应用
链接:https://arxiv.org/abs/2605.19621
作者:Giovanni S. Alberti,Damiana Lazzaro,Serena Morigi,Matteo Santacesaria,Shibo Wang
摘要:Deep generative models have emerged as state-of-the-art for solving inverse problems, but applying them to inverse problems for PDEs, like electrical impedance tomography (EIT) remains challenging. Because physical domains are naturally discretized as unstructured meshes rather than regular grids, standard convolutional architectures are often inadequate. In this paper, we propose a novel framework that extends diffusion posterior sampling (DPS) to graph-structured data. We develop an unconditional score-based diffusion model directly on a 2D triangular mesh to learn an accurate prior over the physical solution space. Furthermore, we introduce a regularized variant, RDPS, which incorporates explicit regularization terms, such as total variation and generalized Tikhonov, to complement the implicit diffusion prior and mitigate severe ill-posedness. Extensive experiments on synthetic and real 2D EIT datasets demonstrate that RDPS produces stable, physically plausible reconstructions. Our approach generalizes well to out-of-distribution inclusion geometries, is highly robust to measurement noise, and outperforms current state-of-the-art solvers (e.g., GPnP-BM3D, DP-SGS) in reconstruction accuracy and artifact reduction.
【14】A Unified Framework for Structure-Aware Clustering and Heterogeneous Causal Graph Learning
标题:结构感知聚类与异构因果图学习的统一框架
链接:https://arxiv.org/abs/2605.19313
作者:Honglin Du,Muxuan Liang,Xiang Zhong
摘要:In complex multivariate systems, interactions among variables are defined by dependency structures, often encoded as directed acyclic graphs (DAGs). However, dependency structures can vary across subjects, and ignoring this structural heterogeneity introduces bias and obscures subpopulation-specific dependencies. To address this, we propose Directed Acyclic Graph-based Dependency Clustering via Alternating Direction Method of Multipliers (DAG-DC-ADMM), a unified framework built upon Structural Equation Modeling (SEM) that jointly learns cluster assignments and cluster-specific dependency structures. We encode acyclicity via a smooth constraint and integrate a groupwise truncated Lasso fusion penalty (gTLP) to cluster subjects based on their structural similarity. This yields a nonconvex optimization problem that incorporates sparsity, acyclicity, and structural consensus constraints. We address the nonconvexity by using the augmented Lagrangian method and solve it with an adapted version of the Alternating Direction Method of Multipliers (ADMM) for difference-of-convex programs. For certain graph structures, such as upper triangular adjacency matrices, our algorithm is guaranteed to converge to a Karush-Kuhn-Tucker (KKT) point. Experiments demonstrate that our method recovers cluster-specific causal dependency structures with a high true positive rate and a low false discovery rate. This capability enables the robust discovery of heterogeneous dependencies across subjects where the subpopulation label is unknown.
【15】Do Better Volatility Forecasts Lead to Better Portfolios? Evidence from Graph Neural Networks
标题:更好的波动率预测会带来更好的投资组合吗?来自图神经网络的证据
链接:https://arxiv.org/abs/2605.19278
作者:Rylan Wade
摘要:This paper tests whether graph neural networks improve realized volatility forecasts and whether those forecasts improve portfolio performance. Using weekly realized volatility for 465 S&P 500 equities from 2015-2025, Heterogeneous Autoregressive and Long Short-Term Memory baselines are compared against GraphSAGE models built on rolling correlation, sector, and Granger-causal graphs, with and without macro regime features. The empirical finding is that the model with the lowest forecast MSE, the model with the highest cross-sectional ranking accuracy, and the model with the highest portfolio Sharpe ratio are three different models. Forecast accuracy, ranking quality, and portfolio performance are related but not interchangeable objectives. Graph volatility models add value only when the portfolio rule can exploit the cross-sectional structure they encode.
【16】Bayesian Latent Space Models for Graphs Are Misspecified: Toward Robust Inference via Generalized Posteriors
标题:图的Bayesian潜空间模型被错误指定:通过广义后验实现鲁棒推理
链接:https://arxiv.org/abs/2605.18927
作者:Aldric Labarthe
摘要:Bayesian latent space models offer a principled approach to network representation, but rely on correct specification of both geometry and link function. Real-world networks often violate these assumptions, exhibiting geometric mismatch and structural anomalies that break standard metric properties. We show that such misspecification pushes the data-generating distribution outside the model class, causing Bayesian inference to become overconfident and poorly calibrated. To address this, we propose a generalized posterior framework for random geometric graphs. We introduce Link-Sequential R-SafeBayes, a method that exploits dyadic conditional independence to estimate prequential risk and adaptively tune posterior regularization. Experiments on synthetic and real-world networks demonstrate improved calibration, better link prediction performance, and a reliable criterion for selecting latent geometries across Euclidean, spherical, and hyperbolic spaces.
Transformer(10篇)
【1】Position: The Turing-Completeness of Real-World Autoregressive Transformers Relies Heavily on Context Management
标题:立场:现实世界自回归Transformer的图灵完备性严重依赖于上下文管理
链接:https://arxiv.org/abs/2605.19514
作者:Guanyu Cui,Zhewei Wei,Kun He
备注:Accepted to the ICML 2026 Position Paper Track
摘要:Many works make the eye-catching claim that Transformers are Turing-complete. However, the literature often conflates two distinct settings: (i) a fixed Transformer system setting, in which a fixed autoregressive Transformer is coupled with a fixed context-management method to process inputs of different lengths step by step, and (ii) a scaling-family setting, in which a family of different models (with increasing context-window length or numerical precision) is used to handle different input lengths. Existing proofs of Transformer Turing-completeness are frequently established in setting (ii), whereas real-world LLM deployment and the standard notion of Turing-completeness correspond more naturally to setting (i). In this paper, we first formalize the fixed-system setting, thereby providing a concrete characterization of how real-world LLMs operate. We then argue that results proved in the scaling-family setting provide theoretically meaningful resource bounds but do not establish Turing-completeness, thereby clarifying a common misinterpretation of existing results. Finally, we show that different context-management methods can yield sharply different computational power, and we advocate the position that context management is a central component that critically determines the computational power of real-world autoregressive Transformers.
【2】CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs
标题:CODA:重写Transformer Blocks作为GEMM尾声程序
链接:https://arxiv.org/abs/2605.19269
作者:Han Guo,Jack Zhang,Arjun Menon,Driss Guessous,Vijay Thakkar,Yoon Kim,Tri Dao
摘要:Transformer training systems are built around dense linear algebra, yet a nontrivial fraction of end-to-end time is spent on surrounding memory-bound operators. Normalization, activations, residual updates, reductions, and related computations repeatedly move large intermediate tensors through global memory while performing little arithmetic, making data movement an increasingly important bottleneck in otherwise highly optimized training stacks. We introduce CODA, a GPU kernel abstraction that expresses these computations as GEMM-plus-epilogue programs. CODA is based on the observation that many Transformer operators exposed as separate framework kernels can be algebraically reparameterized to execute while a GEMM output tile remains on chip, before it is written to memory. The abstraction fixes the GEMM mainloop and exposes a small set of composable epilogue primitives for scaling, reductions, pairwise transformations, and accumulation. This constrained interface preserves the performance structure of expert-written GEMMs while remaining expressive enough to cover nearly all non-attention computation in the forward and backward pass of a standard Transformer block. Across representative Transformer workloads, both human- and LLM-authored CODA kernels achieve high performance, suggesting that GEMM-plus-epilogue programming offers a practical path toward combining framework-level productivity with hardware-level efficiency.
【3】Performance Monitoring of Proton Exchange Membrane Water Electrolyzer by Transformers-Based Machine Learning Model
标题:基于Transformer的机器学习模型监测质子交换膜水电解槽的性能
链接:https://arxiv.org/abs/2605.19107
作者:Bingqing Chen,Ivan Batalov,Qiu Chen,Weiqi Ji,Lei Cheng
摘要:Green hydrogen plays an essential role in decarbonization, with capacity projected to scale to 560 GW by 2030 (vs. 1.39 GW in 2023) in net-zero settings. Proton exchange membrane (PEM) electrolysis is one of the most promising technology routes to green hydrogen production, and real-time system health monitoring of PEM electrolyzers is essential for their scalable deployment. In lab settings, performance degradation can be characterized through electrochemical testing protocols by periodic pauses of normal operation. Such interruption is not practical for full-scale stack deployments, limiting system operators' ability to make real-time assessments of state-of-health (SoH). We present a machine learning (ML) framework that performs virtual electrochemical characterization during normal operation. The method uses an encoder-decoder transformer, conditioned on operational data, to reconstruct characterization outputs, focusing here on polarization curves. Inspired by patch-based sequence tokenization, we segment the inputs into patches and encode them to form meaningful tokens, which substantially improves learning efficiency. Across four longitudinal runs, lasting up to 478 hours on different test cells and loading cycles, the model accurately reconstructed polarization curves and achieved 10x reduction in mean squared error (MSE) compared to a vanilla transformer. This proof-of-concept demonstrates that ML models can enable continuous performance monitoring for PEM electrolyzers and that the encoder captures meaningful latent representations of SoH, opening up opportunities to derive interpretable indicators in future work.
【4】A Two-Parameter Weibull Framework for Diagnosing Transformer Weight Distributions
标题:诊断Transformer权重分布的双参数威布尔框架
链接:https://arxiv.org/abs/2605.18898
作者:Tiexin Ding
备注:27 pages, 14 figures. Companion library npm-weibull-py and benchmark database available at https://github.com/tiexinding/NPM-Weibull-public
摘要:We apply the Weibull distribution -- a two-parameter family from extreme-value theory -- as a diagnostic framework for element-wise weight magnitude distributions in transformers. At initialization, i.i.d. Gaussian weights give |w| ~ HalfNormal, yielding k ~ 1.20 via middle-80% probability-plot fit (the protocol used throughout this work). This anchor makes k a principled, architecture-independent measuring stick for training dynamics; fitting each weight matrix independently at every layer at every checkpoint enables per-component, per-layer, and per-step diagnostics that aggregate statistics cannot resolve. Applying this framework to 12 model entries spanning 7 architectural families (Pythia, OLMo-1/2, LLaMA-3, Mistral, Qwen2.5/3) reveals three findings. First, FFN modules and the attention output projection W_o -- the Transmission Class -- fall in a narrow k band: median terminal k in [1.186, 1.204] across 12 entries (cross-family CV = 0.51%), shared across SwiGLU/GeLU activations, Pre-LN/QK-Norm placements, and 70M-14B sizes. Second, the attention input projections W_q, W_k -- the Selection Class -- depart from the Weibull family, with severity shaped by storage: separately-stored Q/K (OLMo-1, OLMo-2) yields k in [0.76, 0.99] (deep); GQA models yield k in [1.10, 1.16] (mild); Pythia's merged W_qkv occupies a transitional zone tracking training budget T/tau monotonically. Third, lambda grows substantially during training and scales with sqrt(eta/lambda_wd) within the Pythia family (Pearson r = 0.94, three Transmission kinds), directionally consistent with Fan et al. (2025). The two parameters carry independent information: k labels the functional class, lambda labels training progress. We release npm-weibull-py v0.4 (Python library) and DATABASE_v9_1 at https://github.com/tiexinding/NPM-Weibull-public .
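The half-normal anchor quoted in this abstract can be checked numerically. The following is a minimal sketch, not the authors' npm-weibull-py library: it assumes that "middle-80% probability-plot fit" means a least-squares line through the Weibull probability plot restricted to the 10th through 90th percentiles. The function name `weibull_k_middle80`, the plotting positions (i - 0.5)/n, and the initialization scale 0.02 are illustrative assumptions.

```python
import numpy as np

def weibull_k_middle80(weights):
    """Estimate Weibull shape k and scale lambda from |w| (hypothetical helper).

    Uses the linearized Weibull CDF: ln(-ln(1 - F)) = k*ln(x) - k*ln(lambda),
    fitted only on the middle 80% of the empirical distribution.
    """
    x = np.sort(np.abs(np.asarray(weights, dtype=float).ravel()))
    n = x.size
    F = (np.arange(1, n + 1) - 0.5) / n           # assumed plotting positions
    keep = (F >= 0.1) & (F <= 0.9) & (x > 0)      # middle 80%, avoid log(0)
    lx = np.log(x[keep])
    ly = np.log(-np.log1p(-F[keep]))              # ln(-ln(1 - F))
    k, intercept = np.polyfit(lx, ly, 1)          # slope is the shape k
    lam = np.exp(-intercept / k)                  # intercept = -k*ln(lambda)
    return k, lam

# i.i.d. Gaussian weights, as at initialization (the scale 0.02 is arbitrary).
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=200_000)
k, lam = weibull_k_middle80(w)
print(f"k = {k:.3f}, lambda = {lam:.4f}")
```

Under these assumptions the fitted shape should land close to the k ~ 1.20 anchor, and because k is scale-invariant the chosen standard deviation only affects lambda.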
【5】Multi-Headed Transformer Architectures as Time-dependent Wasserstein Gradient Flows
标题:作为含时Wasserstein梯度流的多头Transformer结构
链接:https://arxiv.org/abs/2605.18870
作者:Alex Massucco,Leonardo Del Grande,Marcello Carioni,Christoff Brune,Carola-Bibiane Schönlieb
摘要:In recent years, transformer architectures have revolutionized the field of language processing, opening the door to previously unforeseen possibilities. However, from a theoretical point of view, the mathematical models proposed in the literature often lack direct contact with the actual architectures and depend on strong simplifying assumptions. In this paper, we reduce this gap by modelling the data flow in multi-headed transformer architectures as time-dependent gradient flows for a suitable interaction energy capturing the design of the attention mechanism. The explicit dependence on time allows us to consider different weights for each head and for each layer, without imposing constraints on the initialization method. Moreover, we prove that, under a suitable integrability assumption on the evolution of the weights, each element of the $ω$-limit set of the gradient flows is a stationary point of the interaction energy at a limiting weight distribution. Finally, we analyse the stability of the gradient flows considering perturbations of both the initial data and the weights. Specifically, on the one hand, we study the robustness of the proposed models with respect to noisy inputs, establishing a continuous dependence of the gradient flows on the initial data and uniqueness of the flows. On the other hand, we prove the $Γ$-convergence of the perturbed interaction energy to the unperturbed one, leading to the convergence of the corresponding gradient flows. We complement these theoretical results with numerical experiments that confirm the predicted energy-dissipation identity and clarify the asymptotic behavior of the dynamics in both the autonomous-like (Ornstein--Uhlenbeck) and the genuinely non-autonomous (oscillating-weights) regimes.
【6】Transformers Linearly Represent Highly Structured World Models
标题:Transformer线性表示高度结构化的世界模型
链接:https://arxiv.org/abs/2605.18847
作者:Roman Kniazev,Nathanaël Fijalkow
摘要:Do transformers, when trained on sequential reasoning traces, build internal models of the underlying task? And if so, does the structure of those internal representations mirror the structure of the domain? We train an 8-layer transformer on Sudoku solving traces and perform a mechanistic analysis of its internal computation. We establish two results. First, the model builds a substructure world model: it does not represent the board state cell by cell, as a human analyst would expect, but organizes information around the rows, columns, and boxes that Sudoku's constraints act on. Second, we identify a naked-single circuit: a small set of dedicated neurons in the final MLP layer, each individually detecting when exactly one digit remains possible for a specific cell, and reliably promoting that digit. These findings show that the geometry of an emergent world model is shaped by the constraint algebra of the domain, not its surface presentation, and that the resulting decision circuit is sparse, monosemantic, and fully interpretable. More broadly, they demonstrate that mechanistic interpretability tools can recover an end-to-end algorithmic account of how a transformer solves a combinatorial reasoning task.
【7】Precision Tracked Transformer via Kalman Filtering, Kriging and Process Noise
标题:通过卡尔曼滤波、克里金和过程噪声实现精度跟踪Transformer
链接:https://arxiv.org/abs/2605.18832
作者:Bo Long,Deepak Agarwal,Jelena Markovic-Voronov,Yi Wang,Liuqing Li
摘要:The Transformer is the foundational building block of modern AI, yet offers no principled handling of \emph{uncertainty}, which is prevalent in real applications: cold-start tokens with sparse histories in sequential recommendation, heterogeneous signal quality in language models, and attention sinks induced by unconstrained softmax. Every token is treated with uniform confidence. We show this uniformity is a degenerate case of our \emph{Bayesian Filtering Transformer} (BFT): attention becomes precision-weighted kriging, the residual connection becomes a Kalman update with adaptive gain, and the FFN becomes a dynamics model propagating precision via a Jacobian--plus--process-noise rule. Observation precision comes from a parameter-free Restricted Maximum Likelihood (REML) estimator with a conjugate Bayesian prior. BFT replaces any Transformer layer with negligible overhead. On sequential recommendation, BFT applied to three major architectures yields significant gains on six benchmarks, with the largest improvements on cold-start users and rare items where uncertainty is highest. On supervised fine-tuning of large language models with noisy data, BFT improves robustness in two regimes: noisy supervision (token-label corruption in question answering) and noisy context (retrieval-augmented QA with real RAG distractors). A single principled modification -- restoring precision -- unlocks substantial headroom across both classical sequence-modeling and modern LLM regimes.
【8】Simply Stabilizing the Loop via Fully Looped Transformer
标题:通过全回路Transformer简单地稳定回路
链接:https://arxiv.org/abs/2605.18797
作者:Rao Fu,Zixuan Yang,Jiankun Zhang,Jing Ma,Hechang Chen,Yu Li,Yi Chang
摘要:Scaling model performance typically requires increasing model size. Looped Transformer offers a compelling alternative by iteratively reusing the same Transformer blocks, trading additional computation for improved performance without increasing parameter count or context length. Because the number of loop iterations can be adjusted at inference, it also provides a natural mechanism for balancing performance and test-time compute. However, Looped Transformer still suffers from training instability when the number of loop iterations increases. Our analysis reveals that this instability stems from two sources: gradient oscillation and residual explosion. To address these two problems, we propose the Fully Looped Transformer, which introduces two parameter-free modifications: (1) Fully Looped Architecture, which distributes inter-loop signals across all layers to mitigate residual explosion; (2) Attention Injection, which reuses the existing attention block to suppress gradient oscillation. These modifications stabilize training dynamics, enabling the Fully Looped Transformer to be trained stably up to 12 loop iterations, whereas other baseline looped models collapse in this regime. In milder settings where Looped Transformer does not collapse, Fully Looped Transformer still improves average downstream-task performance by up to 13.2\%. Overall, our experiments demonstrate that Fully Looped Transformer improves training stability, enhances downstream performance, and provides preliminary adaptability under different test-time compute budgets by varying loop iterations at inference.
【9】Robust Basis Spline Decoupling for the Compression of Transformer Models
标题:用于压缩Transformer模型的鲁棒B样条解耦
链接:https://arxiv.org/abs/2605.18794
作者:Joppe De Jonghe,Van Tien Pham,Mariya Ishteva
摘要:Decoupling is a powerful modeling paradigm for representing multivariate functions as compositions of linear transformations and univariate nonlinear functions. A single-layer decoupling can be viewed as a fully connected neural network with a single hidden layer and flexible activation functions, providing a direct link with neural networks. Because of this, the use of decoupling methods has gained increasing attention in neural network domains, particularly compression, since it enables structured approximations with reduced parameter complexity. Existing tensor-based decoupling methods typically rely on polynomial or piecewise-linear parameterizations of the internal nonlinear functions, which can suffer from numerical instability or limited expressiveness. In this work, we introduce a B-spline-based decoupling framework that generalizes these existing approaches. By exploiting the local support and flexible smoothness control of B-splines, the proposed formulation yields a more numerically stable and expressive representation. We derive a constrained coupled matrix-tensor factorization and propose a robust alternating least-squares algorithm, called R-CMTF-BSD, incorporating normalization and Tikhonov regularization. The proposed method is validated through experiments on synthetic data and transformer model compression. Results on the Vision and Swin Transformer architectures demonstrate that B-spline decoupling enables substantial parameter reduction while maintaining competitive accuracy, making the R-CMTF-BSD algorithm a promising tool for structured neural network compression.
【10】Cross-Subject Intracranial EEG Reconstruction from Scalp Recordings Using Multi-Scale Cross-Attention Transformers
标题:使用多尺度交叉注意力Transformer根据头皮记录进行跨受试者颅内脑电重建
链接:https://arxiv.org/abs/2605.18897
作者:Tien-Dat Pham,Xuan-The Tran
摘要:Intracranial EEG (iEEG) provides high-fidelity neural recordings essential for clinical and brain-computer interface applications, but acquiring these signals requires invasive surgery. While recent studies have attempted to estimate iEEG from non-invasive scalp EEG, most rely on patient-specific models, creating a circular dependency: if surgery is required to collect training data, the non-invasive model offers limited practical benefit. In this study, we address the challenge of cross-subject iEEG reconstruction by predicting intracranial signals for unseen patients using models trained on other individuals. We propose CAST (Cross-Attention Spatial-Temporal Transformer), a machine learning framework that translates scalp EEG into multi-channel iEEG waveforms through a two-stage transfer learning strategy. First, a temporal encoder extracts multi-scale neural representations at three different resolutions. Then, because electrode placements vary substantially across patients, a channel-aware decoder is calibrated using only a few minutes of data from the target subject. We evaluated the proposed method using leave-one-subject-out cross-validation on two public datasets comprising 1,282 iEEG channels. Experimental results demonstrate that CAST reconstructs cortical signals located near the scalp surface substantially better than deep subcortical activity. In highly observable sensorimotor regions, the model achieved peak correlations of up to r=0.864 in the precentral gyrus. Furthermore, with a channel selection strategy, CAST obtained a mean correlation of r=0.545 on viable subjects, outperforming previous within-subject baselines. These findings indicate that cortical iEEG signals can be reconstructed for unseen subjects from scalp EEG without extensive patient-specific training, and that only a brief calibration phase is sufficient to adapt the model to new hardware configurations.
GAN|对抗|攻击|生成相关(9篇)
【1】When Critics Disagree: Adaptive Reward Poisoning Attacks in RIS-Aided Wireless Control System
标题:当评论家意见不一致时:RIS辅助无线控制系统中的自适应奖励投毒攻击
链接:https://arxiv.org/abs/2605.20037
作者:Deemah H. Tashman,Soumaya Cherkaoui
摘要:Reward-poisoning attacks present a significant risk to learning-based wireless control systems. Given this, we propose a Disagreement-Guided Reward Poisoning (DGRP) adaptive attack on a Soft Actor-Critic (SAC) agent. In a Cognitive Radio Network (CRN) environment assisted by Reconfigurable Intelligent Surfaces (RIS), the SAC agent is tasked with maximizing the long-term secondary users' (SUs) rate by simultaneously optimizing the transmission power of the SU transmitter and the RIS phase shifts. DGRP corrupts rewards, particularly when the SAC dual critics exhibit substantial disagreement, especially in high-leverage, high-uncertainty states, resulting in distorted value estimations and guiding the policy towards suboptimal actions. Our findings demonstrate that DGRP substantially diminishes the performance improvements typically provided by RIS and degrades transmission quality. We further investigate key attack parameters and determine their impact on learning. In comparison to periodic-timing and exploration-triggered baselines, DGRP consistently causes greater damage, highlighting the necessity of considering disagreement-aware threats when evaluating the robustness of Deep Reinforcement Learning (DRL) in RIS-assisted networks.
【2】Detecting Fluent Optimization-Based Adversarial Prompts via Sequential Entropy Changes
标题:通过序贯熵变化检测基于优化的流畅对抗提示
链接:https://arxiv.org/abs/2605.19966
作者:Mohammed Alshaalan,Miguel R. D. Rodrigues
备注:Accepted at ICML 2026; 20 pages, including 9 pages main text, references, and appendix
摘要:Optimization-based adversarial suffixes can jailbreak aligned large language models (LLMs) while remaining fluent, weakening static and windowed perplexity-based detectors. We cast adversarial suffix detection as an online change-point detection problem over the token-level next-token entropy stream. Using the LLM system prompt to estimate a robust baseline, we standardize user-token entropies and apply a one-sided CUSUM statistic. The resulting detector, CPD Online (CPD), is model-agnostic, training-free, runs online, and localizes the adversarial suffix onset. On a benchmark of 1,012 optimization-based suffix attacks (GCG, AutoDAN, AdvPrompter, BEAST, AutoDAN-HGA) and 1,012 perplexity-controlled benign prompts, CPD improves F1 over the strongest windowed-perplexity baseline on all six open-weight chat models (LLaMA-2-7B/13B, Vicuna-7B/13B, Qwen2.5-7B/14B). On LLaMA-2-7B at the canonical CUSUM setting ($k=0$), CPD reaches AUROC $0.88$ and F1 $0.82$. Beyond prompt-level detection, CPD concentrates 79.6% of its triggers inside the adversarial suffix, versus 17-46% for windowed perplexity. Finally, when used as a lightweight gate for LLaMA Guard, CPD reduces guard calls by 17-22% on a high-volume, benign-dominated deployment while preserving guard-level detection quality.
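The detection recipe in this abstract (a robust baseline from the system prompt, standardized user-token entropies, a one-sided CUSUM) can be illustrated with a small sketch. This is not the authors' CPD implementation: the median/MAD baseline, the alarm threshold `h`, the function name `cusum_alarm`, and the toy entropy values are all assumptions made for illustration; only the one-sided CUSUM recursion and the setting k = 0 come from the text.

```python
import statistics

def cusum_alarm(baseline_entropies, user_entropies, k=0.0, h=5.0):
    """Return the index of the first user token where CUSUM exceeds h, else None.

    baseline_entropies: next-token entropies over the system prompt (assumed input).
    user_entropies:     next-token entropies over the user prompt, in order.
    k: CUSUM reference value (k = 0 is the setting quoted in the abstract).
    h: alarm threshold (hypothetical value, not taken from the paper).
    """
    med = statistics.median(baseline_entropies)
    mad = statistics.median([abs(x - med) for x in baseline_entropies])
    scale = 1.4826 * mad if mad > 0 else 1e-8     # MAD scaled to a std-like unit
    s = 0.0
    for t, x in enumerate(user_entropies):
        z = (x - med) / scale                     # standardized entropy
        s = max(0.0, s + z - k)                   # one-sided (upward) CUSUM
        if s > h:
            return t
    return None

# Toy data: entropy jumps upward at index 4 of the user stream.
baseline = [2.0, 2.1, 1.9, 2.0, 2.2, 1.8, 2.0]
stream = [2.0, 1.9, 2.0, 1.95, 3.0, 3.0, 3.0, 3.0, 3.0]
print(cusum_alarm(baseline, stream))
```

With k = 0 the statistic also accumulates ordinary positive fluctuations, so in practice the threshold h has to be chosen against benign traffic; the value used here is only a placeholder.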
【3】PhyWorld: Physics-Faithful World Model for Video Generation
标题:PhyWorld:视频生成的物理忠实世界模型
链接:https://arxiv.org/abs/2605.19242
作者:Pu Zhao,Juyi Lin,Timothy Rupprecht,Arash Akbari,Chence Yang,Rahul Chowdhury,Elaheh Motamedi,Arman Akbari,Yumei He,Chen Wang,Geng Yuan,Weiwei Chen,Yanzhi Wang
摘要:World simulators can provide safe and scalable environments for training Physical AI systems before real-world deployment. Large video generation models are emerging as a promising basis for such simulators because they can generate diverse and realistic visual futures. However, using them as world simulators requires physically faithful video continuations, namely, generated videos that preserve the physical state implied by the conditioning input, and evolve in ways consistent with basic physical principles. We propose PhyWorld, a video generation world model designed to produce temporally coherent and physically faithful scene continuations through two-stage post-training. In the first stage, we improve video-to-video continuation with flow matching fine-tuning, encouraging stable visual attributes and coherent motion dynamics across frames. In the second stage, we align generated dynamics with physical principles using Direct Preference Optimization (DPO) over physics preference pairs, guiding the model toward outputs with higher physical plausibility. To evaluate PhyWorld, we use both standard video-quality benchmarks and a dedicated physical-faithfulness benchmark with per-law scoring. Experiments show that PhyWorld improves video consistency, achieving an average score of 0.769 on VBench compared with 0.756 or below for state-of-the-art baselines. PhyWorld also improves physical plausibility, reaching an average score of 3.09 on our physical-faithfulness benchmark compared with 2.99 for the strongest baseline. These results suggest that post-training large video generation models with continuation and physics-preference signals can make them more effective world simulators for Physical AI.
【4】Generative Pseudo-Force Fields for Molecular Generation
标题:分子生成的生成伪力场
链接:https://arxiv.org/abs/2605.19050
作者:Stefaan Simon Pierre Hessmann,Khaled Kahouli,Stefan Gugler,Michael Plainer,Frank Noé,Klaus-Robert Müller,Niklas Wolf Andreas Gebauer
摘要:Generating stable molecular conformations typically forces a tradeoff between the physical realism of energy-based relaxation and the sampling efficiency of data-driven generative models. While machine learning force fields (MLFFs) can sample stable conformations by relaxing molecular geometries according to physical forces, they require costly ab-initio training data. Conversely, diffusion models (DMs) learn from equilibrium data alone but are dependent on noise schedules and time-step conditioning. In this work, we propose generative pseudo-force fields (GPFFs) to bridge these paradigms by training an MLFF on a quadratic pseudo-potential energy surface relative to reference equilibrium structures. Because no ab-initio calculations are required for the perturbed geometries, non-equilibrium training data can be generated on the fly by perturbing the equilibria with Gaussian noise. We show that GPFFs constitute a time-step-agnostic variant of variance exploding DMs: the score comes from the predicted pseudo-forces but because force magnitudes implicitly encode the noise level, no time-step conditioning is needed. Our GPFF can hence be used as a drop-in replacement in standard diffusion sampling (ancestral, Heun) but also facilitates more efficient, adaptive variants and an MLFF inspired direct denoising scheme. Our proposed sampling algorithms support arbitrary structural priors and geometric constraints. On QM9, GPFF has 100 % validity at 256 neural function evaluations (NFE) and over 50 % at just 6 NFE, outperforming diffusion baselines across all samplers. Combined with custom priors, we showcase the fast and accurate generation process of our method in a molecular editor for a drug design setting, where a molecule is generated in real time.
【5】Guiding Neuro-Symbolic Scenario Generation with Spatio-Temporal Logic
标题:用时空逻辑指导神经符号场景生成
链接:https://arxiv.org/abs/2605.19038
作者:Lorenzo Bonin,Francesco Giacomarra,Luca Bortolussi,Jyotirmoy V. Deshmukh,Francesca Cairoli
摘要:The rapid advancement of autonomous driving (AD) technologies has outpaced the development of robust safety evaluation methods. Conventional testing relies on exposing AD systems to vast numbers of real-world traffic scenes -- a brute-force approach that is prohibitively expensive and statistically ineffective at capturing the rare, safety-critical edge cases essential for validating real-world robustness. To address this fundamental limitation, we introduce STRELGen, a scalable framework for the targeted generation of safety-critical driving scenarios. STRELGen synergistically combines a multi-agent trajectory-generation diffusion model (DM) with Spatio-Temporal Logic (STREL) specifications that encode complex safety and realism properties through a highly interpretable formalism. Crucially, monitoring satisfaction levels of these specifications is differentiable, enabling gradient-based search. At inference time, we optimize directly over the DM latent space to maximize STREL formula satisfaction. The result is efficient generation of highly plausible yet safety-critical multi-agent scenarios that lie within the learned data distribution. STRELGen thus provides a flexible, interpretable, and powerful tool for stress-testing autonomous driving systems, moving beyond the limitations of brute-force data collection.
【6】MoCo-EA: Exploiting Adversarial Mode Connectivity for Efficient Evolutionary Attacks
标题:MoCo-EA:利用对抗模式连接性进行高效进化攻击
链接:https://arxiv.org/abs/2605.18919
作者:Hyo Seo Kim,Gang Luo,Can Chen,Binghui Wang,Yue Duan,Ren Wang
摘要:Evolutionary algorithms for adversarial attacks leverage population-based search to discover perturbations without gradient information, but suffer from inefficient crossover operations that destroy adversarial properties through discrete interpolation. We introduce Mode Connectivity Evolutionary Attack (MoCo-EA), which replaces traditional crossover with a novel Bézier crossover operator that optimizes perturbations along a continuous Bézier curve between parent perturbations. Our key insight is that adversarial examples lie on connected manifolds where intermediate points maintain and often enhance attack effectiveness. We demonstrate three findings: (1) Successful adversarial perturbations exhibit mode connectivity; (2) Intermediate points along optimized paths achieve higher transferability than endpoints; (3) Bézier crossover dramatically outperforms discrete genetic operations while reducing convergence time and query requirements. By exploiting the geometric structure of adversarial space through path optimization, MoCo-EA provides an efficient and reliable method. Our work challenges the traditional view of adversarial examples as isolated points and opens new directions for both attack generation and defense research.
【7】GenAI-FDIA: Physics-Informed Generative Models for False Data Injection Attacks
标题:GenAI-FDIA:面向虚假数据注入攻击的物理信息生成模型
链接:https://arxiv.org/abs/2605.18873
作者:Mohammad A. Razzaque,Muta Tah Hira
备注:Submitted to IEEE Transactions on Smart Grid
摘要:Training and evaluating false data injection attack (FDIA) detectors for power systems is constrained by data scarcity. Operational grid measurements are commercially sensitive, and hand-crafted attacks fail to capture complex distributional structures imposed by network physics. We present \textsc{GenAI-FDIA}, a framework benchmarking a pool of $P{=}20$ architectures for physics-compliant FDIA synthesis, spanning Wasserstein GANs, MMD-VAEs, normalising flows, diffusion models, and cross-family hybrids. These are evaluated across three IEEE testbeds (14-bus DC, 30-bus DC, and 14-bus AC) under a 60/20/20 chronological split using data-driven Bad Data Detection (BDD) threshold calibration. Our empirical results verify that these models generate high-fidelity attacks, with all architectures achieving evasion rates of $ε_{\text{BDD}} \ge 86.6\%$ on the 14-bus network; additionally, limiting an attacker's topological knowledge induces a measurable degradation in stealthiness ($p \le 0.0022$). Crucially, we identify a previously unreported failure mode: applying affine physics projections directly in normalised feature spaces critically displaces the attack vector, collapsing BDD evasion from ${\sim}55\%$ to $
【8】Fine-Grained Benchmark Generation for Comprehensive Evaluation of Foundation Models
标题:用于基础模型综合评估的细粒度基准生成
链接:https://arxiv.org/abs/2605.18824
作者:Mohammed Saidul Islam,Negin Baghbanzadeh,Farnaz Kohankhaki,Afshin Cheraghi,Ali Kore,Shayaan Mehdi,Elham Dolatabadi,Arash Afkanpour
摘要:Evaluation of foundation models often relies on aggregate scores from benchmarks that lack comprehensive coverage and metadata for a fine-grained evaluation. We introduce a framework for automated benchmark generation. Our framework generates evaluation problems grounded in reference material, such as textbooks, producing benchmarks with broad coverage, rich metadata, and robustness to contamination. The pipeline employs a multi-agent architecture for problem generation and a solution-graph-driven strategy that significantly improves the reliability of ground truth solutions. Using the framework, we generate three benchmarks in Machine Learning, Corporate Finance, and Personal Finance. Expert review finds a significantly lower ground-truth error rate than previous benchmarks such as MMLU and GSM8K. Evaluation of 12 commercial and open-source models shows that our benchmarks achieve near-uniform competency coverage and surface performance differences across models that existing benchmarks fail to capture. We will open-source the framework and our curated benchmarks soon.
【9】Quantum Adversarial Machine Learning: From Classical Adaptations to Quantum-Native Methods
标题:量子对抗机器学习:从经典适应到量子原生方法
链接:https://arxiv.org/abs/2605.18821
作者:Roozbeh Razavi-Far,Mohammad Meymani,Erfan Mahmoudinia,Dorsa Vazirzade,Peyman Paknezhad,Fateme Ghasemi,Saeed Saravani,Somayeh Nikkhoo,Kimia Haghjooei
摘要:Machine learning has revolutionized numerous industrial domains. Despite recent advances, machine learning models remain vulnerable to adversarial threats. Adversarial machine learning is a field that studies these vulnerabilities to build robust machine learning models. Quantum machine learning is an interdisciplinary field that bridges quantum computing and classical machine learning. While quantum machine learning shows potential to outperform classical machine learning in complex tasks such as regression, classification, and generative modeling, it remains vulnerable to adversarial attacks. Given the recent advancements in quantum computing and machine learning, the quantum adversarial machine learning field has emerged to study the vulnerabilities of quantum machine learning, possible attacks, and novel quantum-enhanced defense strategies. In this survey, we provide a detailed overview of quantum adversarial machine learning and explore the existing attacks and countermeasures. We also review the theoretical underpinnings of this area, emerging trends, and critical challenges.
半/弱/无/有监督|不确定性|主动学习(8篇)
【1】StruMPL: Multi-task Dense Regression under Disjoint Partial Supervision and MNAR Labels
标题:StruMPL:不相交部分监督和MNAR标签下的多任务密集回归
链接:https://arxiv.org/abs/2605.19931
作者:Reza M. Asiyabi,Juan Alberto Molina-Valero,The SEOSAW Partnership,Steven Hancock,Casey M. Ryan
备注:10 pages with 3 figures and 4 tables, References and Appendix 12 pages with 1 figure and 4 tables
摘要:Estimating forest aboveground biomass (AGB) from Earth observation combines two structurally incompatible label sources: spaceborne lidar provides canopy structure at millions of locations but no biomass estimate, and ground-based plots provide biomass at thousands of biased locations but no metrics of structure. No single training sample carries labels for all target variables, plot labels are missing not at random (MNAR), and biomass is linked to the structural variables by known but biome-specific allometric laws. We formalise this as multi-task dense regression under heterogeneous disjoint partial supervision with MNAR labels and inter-task physical constraints, and propose StruMPL to address it jointly. A shared encoder feeds per-variable regression, imputation, and propensity heads for spatial MNAR correction, and a learnable physics module that evaluates the inter-task constraint on the model's own predictions at every pixel. The supervised loss uses an Augmented IPW (AIPW) pseudo-outcome with stop-gradients on the propensity and on the imputation baseline; we show analytically and empirically that both are necessary for joint optimisation to recover IPW-weighted stationary points while keeping the loss bounded. On two ecologically distinct biomes, StruMPL outperforms ablation variants and the closest published method on AGB RMSE and bias, with a stratified analysis showing AIPW reduces high-AGB bias by ~54%.
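For readers unfamiliar with the AIPW pseudo-outcome mentioned in this abstract, the textbook form can be written in a few lines. This is a sketch of the standard augmented inverse-probability-weighting construction, not StruMPL's actual loss: the function name, the argument names, and the propensity clipping constant `eps` are hypothetical, and plain Python cannot express the stop-gradients the abstract describes.

```python
def aipw_pseudo_outcome(y, observed, m_hat, pi_hat, eps=1e-6):
    """Standard AIPW pseudo-outcome for one sample (illustrative helper).

    y:        the label, used only when observed is True.
    observed: whether this sample carries a label (the missingness indicator).
    m_hat:    imputation-head prediction of the label (assumed input).
    pi_hat:   propensity-head estimate of P(label observed | x) (assumed input).
    eps:      lower clip on the propensity to keep the target bounded (assumption).
    """
    if not observed:
        return m_hat                       # unlabeled: fall back to the imputation
    pi = max(pi_hat, eps)
    return m_hat + (y - m_hat) / pi        # labeled: inverse-propensity correction

print(aipw_pseudo_outcome(10.0, True, 8.0, 0.5))   # labeled sample
print(aipw_pseudo_outcome(None, False, 8.0, 0.5))  # unlabeled sample
```

In an autodiff framework, `m_hat` and `pi_hat` would presumably be detached before building this target, which is one plausible reading of the stop-gradients named in the abstract.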
【2】Fast and Featureless Node Representation Learning with Partial Pairwise Supervision
标题:采用部分成对监督的快速、无特征节点表示学习
链接:https://arxiv.org/abs/2605.19916
作者:Sujan Chakraborty,Saptarshi Bej
摘要:We introduce Contrastive FUSE, a fast and unified framework for scalable node representation learning in graphs with partially available pairwise node labels and no available node features. Unlike existing methods, we directly optimize a spectral contrastive objective that integrates community-aware structural signals with signed pairwise constraints. To support large-scale training, we replace the expensive modularity gradient with a lightweight approximation, which preserves the structure-seeking behavior of modularity while reducing the computational cost significantly. This yields an efficient optimization scheme with a natural gradient decomposition and adaptive learning-rate scaling, enabling fast iterative updates even on million-edge graphs. Extensive experiments on benchmark citation networks, large co-purchase graphs, and OGB datasets show that Contrastive FUSE achieves competitive or superior contrastive classification performance without relying on node features, while offering substantial runtime gains over existing baselines. These results highlight the effectiveness of coupling modularity-inspired structural learning with contrastive supervision for efficient and scalable contrastive node representation learning.
【3】Distribution-Free Uncertainty Quantification for Continuous AI Agent Evaluation
标题:连续AI Agent评估的无分布不确定性量化
链接:https://arxiv.org/abs/2605.19779
作者:Yuxuan Gao,Megan Wang,Yi Ling Yu
备注:6 pages, 7 figures, 2 tables. Accepted at the ICML 2026 Workshop on Agentic Uncertainty Quantification (AgenticUQ) - Poster
摘要:We adapt split conformal prediction and adaptive conformal inference (ACI) to continuous AI agent evaluation, providing distribution-free coverage guarantees for forecasted quality scores. Conformal intervals achieve calibration error below 0.02 across all nominal levels at the 24h horizon, while ACI correctly widens intervals by 35% following agent releases then reconverges. We further develop compositional uncertainty bounds for multi-agent pipelines (validated via simulation across inter-stage correlations rho in [-0.5, 0.9]), a conformal abstention rule for pairwise rankings with controlled false-ranking rate, and FDR-corrected abstention for leaderboard-scale multiple testing. Evaluating 50 agents via 18 real-time signals collected hourly, we show that per-agent conditional coverage is well-concentrated around the nominal level (mean 80.4%, 90% of agents within [72%, 90%]), and that cross-source sentiment divergence predicts ranking instability (r=0.64, p<0.01). A circularity-controlled validation confirms the framework captures signal beyond benchmarks (rho_s=0.52, p<0.01, n=35). Code and data are released under CC BY 4.0.
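As a rough illustration of the two building blocks named in this abstract, the sketch below computes a split conformal interval half-width from absolute calibration residuals and applies the standard ACI step-size update. It is a generic sketch, not the authors' released code; the function names, the residual score, and the default `gamma` are assumptions.

```python
import math
import numpy as np

def split_conformal_halfwidth(cal_true, cal_pred, alpha=0.2):
    """Half-width q such that [pred - q, pred + q] has >= 1 - alpha marginal coverage."""
    scores = np.sort(np.abs(np.asarray(cal_true, float) - np.asarray(cal_pred, float)))
    n = scores.size
    # Finite-sample rank: the ceil((n + 1)(1 - alpha))-th smallest score.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points for this level
    return float(scores[k - 1])

def aci_update(alpha_t, covered, target_alpha=0.2, gamma=0.05):
    """Adaptive conformal inference: lower alpha (wider intervals) after a miss."""
    err = 0.0 if covered else 1.0
    return alpha_t + gamma * (target_alpha - err)

q = split_conformal_halfwidth([1, 2, 3, 4], [1.5, 2.0, 2.0, 6.0], alpha=0.5)
alpha_next = aci_update(0.2, covered=False)
```

A miss lowers the working miscoverage level, which is the mechanism by which intervals widen after a distribution shift such as an agent release.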
【4】Quantifying the Pre-training Dividend: Generative versus Latent Self-Supervised Learning for Time Series Foundation Models
标题:量化预训练红利:时间序列基础模型的生成式与潜在自监督学习
链接:https://arxiv.org/abs/2605.19462
作者:Noam Major,Kathy Razmadze,Yoli Shavit
摘要:The success of self-supervised learning (SSL) in vision and NLP has motivated its rapid adoption for time series. However, research has focused primarily on Generative paradigms and forecasting tasks, leaving the broader utility of learned representations unquantified. We establish a controlled framework to evaluate the "pre-training dividend": the value added by SSL across diverse temporal tasks. We systematically compare Generative paradigms against Latent Alignment architectures, introducing adaptations of LeJEPA and DINO for time series. These adaptations utilize Discrete Wavelet Transform (DWT) augmentations to enforce invariance to local fluctuations. Our analysis reveals that the pre-training dividend is highly asymmetric: SSL yields gains of up to 375% for anomaly detection and classification, yet remains marginal for forecasting. We demonstrate that representational utility is non-universal, governed by a precision-invariance trade-off where the specific signal resolution required by the task must align with the objective. Finally, we show that representation quality is largely independent of data origin and saturates at moderate architectural depths, suggesting a path to scaling via massive synthetic generation. Our code is available at: https://github.com/noammajor/Models
【5】GAE Falls Short in Imperfect-Information Self-Play Reinforcement Learning
标题:GAE在不完全信息自博弈强化学习中表现不佳
链接:https://arxiv.org/abs/2605.19235
作者:Zhiyuan Fan,Gabriele Farina
摘要:Competitive multi-agent reinforcement learning in imperfect-information games requires agents to act under partial observability and against adversarial opponents, necessitating stochastic policies. While self-play reinforcement learning with Proximal Policy Optimization (PPO) has achieved strong empirical success, its standard advantage estimator, generalized advantage estimation (GAE), suffers from additional variance due to the sampling of stochastic future actions. This variance is amplified in equilibrium self-play because of the stochastic nature of the equilibrium policy and persists even when the critic is exact. We address this bottleneck by introducing $Q$-boosting, a variance-reduced advantage estimator based on a centralized action-value critic, and propose Variance-Reduced Policy Optimization (VRPO), incorporating this new estimator. The algorithm replaces sampled multi-step backups with a multi-step Expected SARSA($\lambda$) trace, computing policy expectations at each step to average out action-sampling noise, while retaining PPO's clipped objective and on-policy actor updates. Empirically, VRPO consistently achieves strong performance from mid-sized to large-scale games including Dou Dizhu and Heads-Up No-Limit Texas Hold'em.
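To make the idea of replacing sampled backups with policy expectations concrete, the sketch below accumulates Expected-SARSA TD errors, r + γ·E_π[Q(s', ·)] − Q(s, a), with decay γλ. This is one textbook on-policy form and is only an assumption about what the trace looks like; the paper's $Q$-boosting estimator may differ, and all names here are hypothetical.

```python
import numpy as np

def policy_expectation(pi, q):
    """V_bar(s) = sum_a pi(a|s) * Q(s, a), computed row-wise."""
    return (np.asarray(pi, float) * np.asarray(q, float)).sum(axis=-1)

def expected_sarsa_lambda_targets(rewards, q_taken, v_bar_next, gamma=0.99, lam=0.95):
    """Lambda-return targets for Q(s_t, a_t) built from expectation-based TD errors.

    rewards[t]    : reward after taking a_t in s_t
    q_taken[t]    : critic value Q(s_t, a_t)
    v_bar_next[t] : policy expectation of Q at s_{t+1} (0 at episode end)
    """
    rewards = np.asarray(rewards, float)
    q_taken = np.asarray(q_taken, float)
    v_bar_next = np.asarray(v_bar_next, float)
    # The expected bootstrap averages out next-action sampling noise.
    deltas = rewards + gamma * v_bar_next - q_taken
    trace = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        trace[t] = running
    return q_taken + trace

targets = expected_sarsa_lambda_targets(
    rewards=[1.0, 0.0], q_taken=[0.5, 0.2], v_bar_next=[0.3, 0.0], gamma=1.0, lam=1.0
)
```

With `lam=0` the target reduces to the one-step Expected SARSA backup, so the sampled next action never enters the bootstrap term.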
【6】Descriptive versus Regulatory Uncertainty in Bounded Predictive Systems
标题:有界预测系统中的描述性与调节性不确定性
链接:https://arxiv.org/abs/2605.18909
作者:Ahmed Gamal Eldin
摘要:Any system that models the world under finite representational capacity must compress; any compression entails a prior; and the prior is the system's bias. What has not been established is whether uncertainty participates in the dynamics governing future behavior, or merely describes the output distribution without consequence. We introduce a structural distinction between descriptive uncertainty, which does not recursively modulate the system's policy, and regulatory uncertainty, which directly enters the optimization landscape and drives persistent adaptive restructuring. We prove formally that current transformer architectures are confined to descriptive uncertainty at inference. We ground this in thermodynamics via Landauer's principle: for uncertainty to be regulatory, epistemic error must cost real energy; in a decoupled system, hallucinations and correct derivations dissipate identical energy. We test this empirically across three locally-deployed language models (3B, 8B, 70B parameters). Token-level Shannon entropy is statistically invariant across tasks spanning pattern retrieval, causal operator application, and out-of-distribution causal generalization in all three models (all pairwise p >= 0.568; within-model ranges 0.011-0.028 nats), while task accuracy varies substantially across the same conditions (0%-100%). Entropy and accuracy are orthogonal. The decoupling is scale-invariant: larger models achieve higher accuracy but identical entropy flatness. This structural incapacity is not resolvable by additional parameters or training data. Genuine epistemic grounding requires physical coupling between thermodynamic substrate state and information processing cost.
【7】Emergence of Frontier Superposition: Möbius attractor and Cascade Supervision
标题:前沿叠加的出现:莫比乌斯吸引子和级联监督
链接:https://arxiv.org/abs/2605.18820
作者:Hongyu Gu,Jingwen Fu
备注:40 pages, 3 figures
摘要:Superposition allows Transformers to reason in depth, carrying an entire reasoning frontier in parallel through a bounded-depth forward pass instead of unrolling serial chain-of-thought tokens. While Zhu et al. (2025) hand-crafted an equal-weight breadth-first frontier in a single residual stream for graph reachability, it remained open whether gradient descent could ever find this target amidst permutation-symmetric saddles. We close this gap on Reachability-by-Superposition over Erdős-Rényi graphs by isolating architectural and supervisional contributions. Architecturally, we identify a Möbius attractor: under $S_n$-symmetry in the tree regime, layerwise dynamics reduce to a 1D Möbius map whose zero set is a codimension-one manifold of global optima containing the equal-weight superposition state. On the supervision side, we identify Cascade Supervision: a loss class whose backward pass simultaneously delivers (A) selectivity bootstrap, (B) gradient persistence across depth, and (C) per-step discrimination (e.g., $\mathcal{L}_{sup}$ and $\mathcal{L}_{node}$). End-to-end supervision fails condition (B) and is provably insufficient: internal gradients at layer $c$ decay as $(np)^{-(D-c-2)/2}$ in the graph fan-out and stall before the manifold is reached. Our thesis: Möbius attractor + Cascade Supervision = emergence of superposition reasoning. The parameter-free decay law predicts a final-step cosine of 0.35 vs. 0.71 (end-to-end vs. cascade) at depth $D=3$; experiments confirm 0.37 vs. 0.69, matching within 0.02 at every step.
【8】Hyrax: An Extensible Framework for Rapid ML Experimentation and Unsupervised Discovery in the Era of Rubin, Roman, and Euclid
标题:Hyrax:鲁宾、罗曼和欧几里得时代快速ML实验和无监督发现的可扩展框架
链接:https://arxiv.org/abs/2605.18959
作者:Aritra Ghosh,Drew Oldag,Michael Tauraso,Andrew J. Connolly,Peter Ferguson,Derek Jones,Gourav Khullar,Argyro Sasli,Samarth Venkatesh,Gracia Wang,Maxine West,Dylan Berry,Neven Caplar,Colin Orion Chandler,Tanawan Chatchadanoraset,Michael W. Coughlin,Melissa DeLucchi,Alexandra Junell,Diego Miura,Felipe Fontinele Nunes,Wilson Beebe,Doug Branton,Sandro Campos,Liam Cunningham,Mi Dai,Jeremy Kubica,Konstantin Malanchev,Rachel Mandelbaum,Sean McGuire,Imad Pasha,Dan S. Taranu,Tianqing Zhang
备注:28 pages, 20 figures, submitted to AJ
摘要:The NSF-DOE Vera C. Rubin Observatory, Roman Space Telescope, Euclid, and other next-generation surveys will deliver imaging, spectroscopic, and time-domain data at scales that increasingly shift the bottleneck in astronomical machine learning (ML) projects from model design to infrastructure. We present Hyrax, an open-source, modular, GPU-enabled Python framework that supports the full ML lifecycle in astronomy: from data acquisition and training to inference and experiment comparison, with capabilities including multimodal dataset support, integrated vector databases for similarity search, and interactive two- and three-dimensional latent-space exploration for unsupervised discovery. We demonstrate Hyrax's versatility through five representative applications on real survey data: (i) unsupervised representation learning on $\sim 4\times10^5$ Rubin Legacy Survey of Space and Time (LSST) Data Preview 1 (DP1) galaxies, surfacing new merger and low-surface-brightness candidates missing from reference Euclid and Dark Energy Survey catalogs, while also isolating imaging artifacts -- all without labeled training data; (ii) hybrid density-based clustering for identifying cluster-scale gravitational lens candidates in DP1 data; (iii) multimodal early-time transient classification in the Zwicky Transient Facility leveraging light curves, spectra, images, and metadata; (iv) supervised false-positive filtering in shift-and-stack searches for distant solar system objects in the Dark Energy Camera Ecliptic Exploration Project survey; and (v) supervised detection of semi-resolved dwarf galaxies in Hyper Suprime-Cam and LSST-like imaging using synthetic source injection. Together, these results demonstrate that Hyrax provides astronomy-specific ML infrastructure that enables systematic discovery and rapid methodological iteration across next-generation astronomical surveys.
迁移|Zero/Few/One-Shot|自适应(16篇)
【1】TrajTok: Adaptive Spatial Tokenization for Trajectory Representation Learning
标题:TrajTok:用于轨迹表示学习的自适应空间令牌化
链接:https://arxiv.org/abs/2605.20134
作者:Zhen Xiong,Shang-Ling Hsu,Cyrus Shahabi
摘要:Learning generalizable trajectory representations from raw GPS traces remains difficult because the data is continuous, noisy, and irregularly sampled. Spatial tokenization is also challenging: fine grids yield sparse cells with weak embeddings, while coarse grids merge heterogeneous movement patterns into the same token. We present TrajTok, a trajectory encoder with a simple pretraining recipe for transferable trajectory embeddings. TrajTok first learns a multi-resolution hexagonal cell partition from the spatial distribution of GPS points, converting noisy GPS sequences into discrete cell tokens. To capture both geometry and kinematics, it uses a factorized transformer encoder with early per-modality self-attention blocks, cross-attention fusion layers, and spatiotemporal rotary position embeddings, ST-RoPE, to encode where and when each token occurs. TrajTok is pretrained with masked-token modeling that recovers both geometric structure and kinematic patterns from partial trajectory observations. On the Porto dataset, a frozen TrajTok encoder with lightweight task adapters achieves strong performance across trajectory similarity search, classification, estimated time of arrival, and full travel-time regression, outperforming multiple task-specific methods. The same frozen encoder supports both geometry-dominated and kinematics-dominated tasks, suggesting that TrajTok learns transferable trajectory structure rather than task-specific shortcuts. These results indicate that learned multi-resolution spatial tokenization combined with masked-token pretraining is a promising direction for general-purpose trajectory foundation models.
【2】Fine-Tuning Without Forgetting via Loss-Adaptive Learning Rates
标题:通过损失自适应学习率实现无遗忘微调
链接:https://arxiv.org/abs/2605.20005
作者:Parjanya Prajakta Prashant,Jiongli Zhu,Aldan Creo,Babak Salimi
备注:25 pages
摘要:Fine-tuning large language models on new data improves task performance but degrades capabilities learned during pretraining, a phenomenon known as catastrophic forgetting. Existing methods mitigate this by modifying the fine-tuning objective to suppress high-loss tokens or sequences, but these tokens are essential for learning new tasks, especially those with poor pretraining coverage. In such settings, hard tokens should still contribute to learning, so forgetting must be controlled without suppressing them. We identify a simple mechanism for doing so: per-step forgetting is bounded by the product of the learning rate and the square root of the current training loss. This suggests that high-loss batches are especially prone to inducing forgetting. Motivated by this observation, we introduce FINCH, a loss-adaptive learning-rate schedule that reduces the learning rate on high-loss batches and increases it as the model converges, while leaving the fine-tuning objective unchanged. Across knowledge acquisition, science, and low-resource language adaptation benchmarks, FINCH reduces forgetting by 93% on average while matching the task performance of standard fine-tuning. On Qwen3-4B knowledge acquisition, FINCH cuts TruthfulQA degradation by 5x and reverses HaluEval degradation, while better preserving confidence calibration. Overall, our results show that learning-rate schedules are an effective tool to shape model behavior during fine-tuning, beyond just target-task optimization.
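The bound stated in this abstract (per-step forgetting is at most proportional to learning rate times the square root of the training loss) suggests one simple way a loss-adaptive schedule could work: scale the rate so that this product never exceeds a fixed cap. The sketch below is a hypothetical rule built only from that sentence, not FINCH's actual schedule; `ref_loss`, `max_scale`, and the function name are assumptions.

```python
import math

def loss_adaptive_lr(batch_loss, base_lr=1e-5, ref_loss=1.0, max_scale=1.0):
    """Keep lr * sqrt(loss) <= base_lr * sqrt(ref_loss) (hypothetical rule).

    High-loss batches get a smaller step; as the loss falls toward `ref_loss`
    the rate rises again, capped at `max_scale * base_lr`.
    """
    scale = math.sqrt(ref_loss / max(batch_loss, 1e-12))
    return base_lr * min(scale, max_scale)

lr_hard = loss_adaptive_lr(4.0, base_lr=1e-3)   # high-loss batch
lr_easy = loss_adaptive_lr(0.25, base_lr=1e-3)  # low-loss batch, capped
```

The fine-tuning objective itself is untouched in this sketch; only the step size changes per batch, which matches the abstract's description at a high level.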
【3】Concept-Guided Noisy Negative Suppression for Zero-Shot Classification and Grounding of Chest X-Ray Findings
标题:用于胸部X射线发现Zero-Shot分类与定位的概念引导噪声负样本抑制
链接:https://arxiv.org/abs/2605.19374
作者:Chenyu Lian,Hong-Yu Zhou,Chun-Ka Wong,Jing Qin
备注:Early accepted by MICCAI 2026
摘要:Vision-language alignment using chest X-rays and radiology reports has emerged as an advanced paradigm for zero-shot classification and grounding of chest X-ray findings. However, standard contrastive learning typically treats radiographs and reports from different patients simply as negative pairs. This assumption introduces noisy negatives, as different patients frequently exhibit similar findings. Such noisy negatives cause semantic ambiguity and degrade performance in zero-shot understanding tasks. To address this challenge, we propose CoNNS, a concept-guided noisy-negative suppression framework. To support the negative suppression mechanism, unlike previous methods that use raw reports or templatized texts, we construct a hierarchical concept ontology using large language models. The ontology structures 41 key clinical concepts by explicitly modeling presence, attributes (location and characteristics), and texts (evidential segment and presence statement). Leveraging this ontology, we implement a cross-patient pair relabeling strategy comprising three steps: (1) Fine-Grained Breakdown to categorize pairs based on finding presence; (2) Noisy Negative Filtering to resolve semantic conflicts by removing false negatives; and (3) Hard Negative Mining to identify subtle attribute discrepancies using a lightweight language model. Finally, we propose a Concept-Aware NCE loss to align visual features with text while suppressing the identified noisy negatives. Extensive experiments across multi-granularity zero-shot grounding tasks and five zero-shot classification datasets validate that CoNNS outperforms existing state-of-the-art models. The code is available at https://github.com/DopamineLcy/conns.
【4】Skinned Motion Retargeting with Spatially Adaptive Interaction Guidance
标题:基于空间自适应交互引导的蒙皮运动重定向
链接:https://arxiv.org/abs/2605.19355
作者:Soojin Choi,Seokhyeon Hong,Chaelin Kim,Junghyun Nam,Junhyuk Jeon,Junyong Noh
备注:SIGGRAPH 2026 / ACM TOG. Project page available at https://suzyn.github.io/space_page/
摘要:Retargeting motion across characters with varying body shapes while preserving interaction semantics, such as self-contact and near-body proximity, remains a challenging problem. While recent geometry-aware approaches address this by maintaining spatial relationships between predefined corresponding regions, their reliance on static correspondences often struggles when the target character exhibits exaggerated body proportions. In this paper, we present a geometry-aware motion retargeting framework that preserves interaction semantics by performing proximity matching over spatially adaptive anchors. Unlike prior methods with static anchor definitions, the proposed method dynamically repositions anchors to reachable regions on the target character. This is achieved via a Transformer-based anchor refinement strategy that predicts anchor displacements and constrains the translated anchors to remain on the target character geometry through differentiable soft projection. By incorporating pose-dependent spatial structures from the source character, the adapted anchors provide structurally coherent guidance for interaction-aware retargeting. Conditioned on these anchors, a graph-based autoencoder predicts target skeletal motion that preserves the spatial configuration of the source. To encourage task-aligned optimization between anchor adaptation and motion retargeting, we adopt an alternating training scheme in which each module is optimized in turn. Through extensive evaluations, we demonstrate that our method outperforms state-of-the-art approaches in preserving interaction fidelity across diverse character geometries.
【5】A Two-Phase Adaptive Balanced Penalty Method for Controllable Pareto Front Learning under Split Feasibility Conditions
标题:分裂可行性条件下可控Pareto前沿学习的两阶段自适应平衡罚方法
链接:https://arxiv.org/abs/2605.19306
作者:Nguyen Viet Hoang,Dung D. Le,Tran Ngoc Thang
备注:36 pages, 18 figures, 12 tables. Submitted to Neural Networks (Elsevier)
摘要:We address the open problem of training hypernetworks for Controllable Pareto Front Learning (CPFL) under split feasibility conditions with rigorous theoretical guarantees. We reformulate the constrained Pareto problem as a Bi-Level Scalarized Split Problem (BSSP) and propose the Adaptive Balanced Penalty (ABP) algorithm, whose three gradient components -- optimality, set feasibility, and image feasibility -- are blended through an adaptive indicator driven by a computable lower bound. Using a novel convex surrogate technique, we prove full-sequence convergence under standard convexity and Robbins-Monro step-size assumptions. The ABP penalty structure is then translated into a two-phase, feasibility-first training strategy for Hyper-MLP and HyperTrans architectures (ABP-HyperNet). To evaluate constrained CPFL, we introduce the Expected Feasible Hypervolume (EFHV), which jointly captures solution quality and constraint satisfaction. Experiments on five multi-objective benchmarks validate the ABP solver against ground truth, while three multi-task learning datasets demonstrate that ABP-HyperNet achieves up to 2.3x higher EFHV than unconstrained baselines by raising feasibility from 36-49% to 87-100%.
【6】Cross-Paradigm Knowledge Distillation: A Comprehensive Study of Bidirectional Transfer Between Random Forests and Deep Neural Networks for Big Data Applications
标题:跨范式知识蒸馏:面向大数据应用的随机森林与深度神经网络间双向迁移的综合研究
链接:https://arxiv.org/abs/2605.19299
作者:Mahdi Naser Moghadasi
摘要:The exponential growth of big data has intensified the need for efficient and interpretable machine learning models that can handle diverse data characteristics while maintaining computational efficiency. Knowledge distillation has primarily focused on neural network-to-neural network transfer, leaving cross-paradigm knowledge transfer largely unexplored. This paper presents the first comprehensive study of bidirectional knowledge distillation between Random Forests (RF) and Deep Neural Networks (DNN), addressing critical gaps in ensemble learning and model compression for big data applications. We propose novel methodologies including progressive multi-stage distillation, multi-teacher ensemble distillation from diverse tree models, and uncertainty-aware cross-paradigm transfer mechanisms. Through 144 comprehensive experiments across 6 diverse datasets encompassing classification and regression tasks, we demonstrate that bidirectional RF-DL distillation achieves competitive performance while providing complementary benefits: interpretability from tree models and expressiveness from neural networks. Our results show that multi-teacher ensemble distillation consistently outperforms traditional approaches, with NN-COMPACT achieving 98.13% classification accuracy and NN-WIDE reaching 92.6% R^2 score in regression tasks. The proposed framework enables deployment flexibility in big data environments, allowing optimal model selection based on computational constraints and interpretability requirements. This work establishes a new research direction in cross-paradigm knowledge transfer with significant implications for interpretable AI and scalable model deployment in resource-constrained big data systems.
【7】Domain-Adaptive Communication-Rate Optimization for Sim-to-Real Humanoid-Robot Wireless XR Teleoperation
标题:模拟到真实的人形机器人无线XR遥操作的域自适应通信速率优化
链接:https://arxiv.org/abs/2605.19293
作者:Caolu Xu,Zhiyong Chen,Meixia Tao,Li Song,Feng Yang,Wenjun Zhang
备注:submitted to IEEE journal
摘要:Wireless extended reality (XR) teleoperation provides embodied interaction capability for collecting humanoid robot demonstrations, but large-scale adoption is restricted by the overhead of high-frequency motion transmission. This paper develops a system framework that integrates sampling, transmission, interpolation, and reconstruction and formulates a communication-rate optimization that aims to minimize the communication energy while maintaining the reconstruction accuracy of robot motion trajectories through dimension-wise sampling-rate control. Since acquiring real-time feedback from physical robots is limited by hardware costs, it is necessary to solve the problem through simulator interaction with offline real-domain data correction. To guide sim-to-real adaptation, we provide a PAC-Bayes generalization characterization that reveals the effects of latent density-ratio estimation, finite-sample deviation, and encoder bias. Building on this analysis, we propose a proximal policy optimization (PPO) method with density-ratio weighting and trust-region regularization. Experiments on a public humanoid teleoperation dataset show that the proposed method improves the tradeoff between reconstruction error and communication energy consumption under sim-to-real distribution shift. We further analyze the effectiveness of the proposed algorithm across various wireless channels and dynamic motion trajectories.
【8】MANGO: Meta-Adaptive Network Gradient Optimization for Online Continual Learning
标题:MANGO:在线连续学习的元自适应网络梯度优化
链接:https://arxiv.org/abs/2605.19080
作者:Ankita Awasthi,Marco Apolinario,Kaushik Roy
摘要:In Online Continual Learning (OCL), a neural network sequentially learns from a non-stationary data stream in a single pass with access only to a limited memory replay buffer. This contrasts sharply with offline continual learning, where training runs for multiple epochs over large datasets. The main challenge faced by OCL is to overcome catastrophic forgetting of past tasks (stability) while learning new ones efficiently (plasticity). Existing methods counter forgetting via replay-based rehearsal, output-level distillation, fixed regularization, or meta-learning on the current data. However, these methods have limitations: rehearsal introduces a stored-sample bias; distillation operates on output distributions without modulating parameter updates; fixed regularization penalizes parameters irrespective of sensitivity; stream-only meta-learning lacks a feedback-controlled parameter update. We propose Meta-Adaptive Network Gradient Optimization (MANGO), an OCL framework that balances stability and plasticity via gradient-gating and meta-learned regularization. Gradient-gating scales parameter updates based on sensitivity, preventing destructive updates. Meta-learned regularization adapts stability coefficients, evaluating the effect of parameter update on replay. In MANGO, replay acts as both a training signal and a forgetting evaluator. We evaluated our method on three standard OCL benchmark datasets. MANGO outperforms strong baselines, achieving state-of-the-art results with consistent performance across replay sizes. In domain incremental learning on CLEAR-10 and class incremental learning on CIFAR-100 and Tiny-ImageNet, it achieves the highest accuracy among all baselines and achieves positive Backward Transfer, overcoming forgetting on CLEAR-10.
【9】SAGA: A Sequence-Adaptive Generative Architecture for Multi-Horizon Probabilistic Forecasting with Adaptive Temporal Conformal Prediction
标题:SAGA:结合自适应时间共形预测的多时域概率预测序列自适应生成架构
链接:https://arxiv.org/abs/2605.19014
作者:Gustav Olaf Yunus Laitinen-Fredriksson Lundström-Imanov,Hafize Gonca Cömert
备注:14 pages, 3 figures, 12 tables, 5 appendices, 45 references. Submitted to IEEE TPAMI. Source code at https://github.com/olaflaitinen/saga (archived: doi:10.5281/zenodo.20260366). Synthetic equivalent dataset: doi:10.5281/zenodo.20260287. Empirical work conducted on the Swedish LISA register via SCB MONA (project SCB-MONA-2026-147); ethical approval Swedish Ethical Review Authority 2026-04127-01
摘要:Microsimulation models used by ministries of finance and central banks rely on parametric processes for lifetime earnings that capture only first and second moments of the conditional distribution and miss long-range nonlinear structure. We propose SAGA, a decoder-only transformer for irregular tabular panel sequences, paired with a split conformal calibration wrapper that delivers individual-level prediction intervals with finite-sample marginal coverage guarantees. Trained on the longitudinal Swedish LISA register over 1990 to 2022, comprising 2,143,817 individuals and 61,284,903 person-years, the model forecasts annual labor earnings at horizons of one to thirty years and aggregates them by Monte Carlo into present-discounted lifetime earnings distributions. Against the canonical Guvenen, Karahan, Ozkan, and Song parametric process and tabular and recurrent baselines, SAGA reduces continuous ranked probability score by 31.9 percent at the ten-year horizon and mean absolute error by 37.7 percent at the twenty-year horizon. Conformal intervals achieve nominal coverage to within 0.4 percentage points marginally and within 2.4 percentage points on the worst-case demographic subgroup. The reconstructed lifetime earnings Gini coefficient is 0.327 against the partially observed truth of 0.341 and the GKOS estimate of 0.378. Model weights, calibration tables, and a synthetic equivalent dataset are released for replication outside the protected SCB MONA environment.
【10】Distance-Aware Muon: Adaptive Step Scaling for Normalized Optimization
标题:距离感知Muon:归一化优化的自适应步长缩放
链接:https://arxiv.org/abs/2605.18999
作者:Yury Demidovich,Abhishek Chakraborty,Grigory Malinovsky,Angelia Nedić,Peter Richtárik
摘要:Muon and related normalized optimizers decouple the choice of update direction from the choice of step scale, but their practical performance remains sensitive to the scale of the normalized step. We study adaptive scaling rules for Muon in general norm geometries and develop three complementary algorithms. For smooth non-convex objectives, we introduce Distance-Adaptive Muon, whose trust-region radius is set from the radius explored by the trajectory, and prove a stationarity guarantee under a bounded-trajectory assumption. We then turn to star-convex objectives, a tractable model of the favorable global geometry often used to reason about the empirical loss landscapes of deep neural networks, where objective-gap guarantees are possible. In this setting, we first introduce Scale-Calibrated Muon, which keeps Muon's exponential moving average but sets the step length from a local descent certificate computed from the current gradient and momentum. For this method, we prove a last-iterate O(1/T) objective-gap bound under a bounded initial sublevel-set assumption, where the corresponding radius parameter appears only in the analysis and not in the algorithm. Finally, we develop Distance-Free Muon, a recentered trust-region method that uses a scalar distance certificate and a majorized one-dimensional search to select the trust-region radius without requiring the unknown distance from the initialization to a global minimizer. Experiments on Transformer language modeling (GPT-124M/WikiText-103) and image classification (ViT-Tiny/CIFAR-100) show that the proposed adaptive scaling rules reduce sensitivity to manual scale tuning and match or improve tuned fixed-scale Muon baselines under the tested budgets.
【11】Safe Continual Reinforcement Learning under Nonstationarity via Adaptive Safety Constraints
标题:通过自适应安全约束实现非平稳性下的安全连续强化学习
链接:https://arxiv.org/abs/2605.18842
作者:Timofey Tomashevskiy
备注:Preprint version
摘要:Safe reinforcement learning in nonstationary environments requires safety mechanisms that adapt as environmental conditions change. Standard safe reinforcement learning methods often assume fixed constraints or stable environmental conditions, which can become inadequate under distribution shift. We propose LILAC+, a framework for safe continual reinforcement learning under nonstationarity that combines three adaptive safety mechanisms: context-based safety constraints, adaptation-speed constraints, and budget-to-state safety enforcement. Context-based constraints adjust safety requirements using inferred and predicted environmental context. Adaptation-speed constraints tighten safety requirements when the rate of environmental change exceeds the agent's ability to adapt safely. Budget-to-state enforcement converts cumulative safety requirements into local state-level control constraints that can be enforced at decision time. Together, these mechanisms provide a unified approach for proactive and reactive safety adaptation in continual reinforcement learning. We evaluate the framework in simulated driving environments under stationary, seen nonstationary, and unseen nonstationary conditions. The results show that adaptive safety constraints substantially reduce safety violations under distribution shift while maintaining competitive task performance compared with unconstrained and fixed-constraint baselines. These findings suggest that safe continual reinforcement learning requires adaptive constraint mechanisms that respond not only to current state information but also to predicted environmental context, adaptation demand, and remaining safety budget.
【12】From Cumulative Constraints to Adaptive Runtime Safety Control for Nonstationary Reinforcement Learning
标题:从累积约束到非平稳强化学习的自适应运行时安全控制
链接:https://arxiv.org/abs/2605.18841
作者:Timofey Tomashevskiy
备注:13 pages. Preprint version
摘要:Safety in reinforcement learning is often specified through cumulative cost constraints, but these trajectory-level guarantees do not directly prevent unsafe individual decisions, especially under nonstationarity. In continual and nonstationary settings, the difficulty is amplified because the risk associated with the same action can vary across contexts, while a fixed state-level threshold may be either too conservative or too weak. We propose Constraint Projection Safety Shield (CPSS), a runtime mechanism that converts a cumulative safety budget into adaptive state-level control constraints during execution. CPSS tracks the remaining safety budget, projects it into a time-varying admissible risk threshold, and filters policy actions whose predicted safety cost exceeds the active threshold. The threshold is adjusted online using contextual signals so that enforcement becomes stricter in more demanding or rapidly changing regimes and less restrictive when the available safety budget is sufficient. We analyze the resulting shielded policy and show that the mechanism guarantees per-state threshold satisfaction for executed actions, induces finite-horizon cumulative cost bounds, and yields a performance degradation bound in terms of intervention frequency and per-step reward distortion. We evaluate CPSS in nonstationary highway merging scenarios using highway-env. Across multiple seeds, CPSS substantially reduces proximity-based safety violations and increases separation margins while intervening selectively rather than dominating the learned policy. These results support adaptive budget-to-threshold projection as a practical way to transform cumulative safety specifications into effective local safety control for continual reinforcement learning systems.
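The budget-to-threshold idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' CPSS implementation: the projection rule (remaining budget spread evenly over remaining steps), the `strictness` context factor, the fallback to the lowest-cost action, and all names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class BudgetShield:
    """Hypothetical shield: turns a cumulative cost budget into a per-step threshold."""
    budget: float       # total cost allowed over the episode
    horizon: int        # number of decision steps in the episode
    spent: float = 0.0  # cost incurred so far
    step: int = 0       # steps taken so far

    def threshold(self, strictness: float = 1.0) -> float:
        # Assumed projection: remaining budget divided over remaining steps,
        # tightened when the context signal `strictness` (> 0) is larger.
        remaining_steps = max(self.horizon - self.step, 1)
        remaining_budget = max(self.budget - self.spent, 0.0)
        return remaining_budget / (remaining_steps * strictness)

    def filter(self, ranked_actions, predicted_cost, strictness: float = 1.0):
        # Keep the policy's ranking, but skip actions whose predicted cost
        # exceeds the active threshold. If none is admissible, fall back to
        # the lowest-cost action (an assumption; the threshold is then violated).
        limit = self.threshold(strictness)
        admissible = [a for a in ranked_actions if predicted_cost[a] <= limit]
        if admissible:
            chosen = admissible[0]
        else:
            chosen = min(ranked_actions, key=lambda a: predicted_cost[a])
        intervened = chosen != ranked_actions[0]
        return chosen, intervened

    def record(self, incurred_cost: float) -> None:
        # Update the budget tracker after the action is executed.
        self.spent += incurred_cost
        self.step += 1


shield = BudgetShield(budget=10.0, horizon=10)
action, intervened = shield.filter(["merge", "wait"], {"merge": 2.0, "wait": 0.5})
shield.record(0.5)
```

In this toy call the top-ranked action `merge` is filtered out because its predicted cost exceeds the initial per-step threshold, so the shield executes `wait` and reports an intervention.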
【13】Hybrid-LoRA: Bridging Full Fine-Tuning and Low-Rank Adaptation for Post-Training
标题:Hybrid-LoRA:为后训练桥接全量微调与低秩适应
链接:https://arxiv.org/abs/2605.18822
作者:Chengqian Zhang,Wei Zhu,Kyumin Lee
摘要:Post-training has become essential for adapting large language models (LLMs) to complex downstream behaviors, including instruction following, preference alignment, and multi-step reasoning. Reinforcement learning with verifiable rewards (RLVR) has recently emerged as a particularly effective post-training paradigm for improving reasoning capabilities, with critic-free algorithms such as GRPO and GSPO enabling scalable optimization. However, RLVR post-training with full fine-tuning (FFT) requires substantial GPU memory and incurs high training costs. Although parameter-efficient fine-tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), effectively reduce computational costs, they often suffer from a noticeable performance gap compared to full fine-tuning in post-training for complex reasoning tasks. In this paper, we propose Hybrid-LoRA, an efficient hybrid post-training framework that selectively applies full fine-tuning to a small subset of modules less suited to low-rank adaptation, while adapting the remaining components with LoRA. We introduce a novel Hybrid-LoRA Score to rank candidate modules according to their sensitivity to low-rank adaptation under a fixed parameter budget. Experiments show that Hybrid-LoRA closely matches full fine-tuning performance under a 10% full fine-tuning module budget, with the remaining candidate modules adapted by LoRA, consistently outperforming four state-of-the-art PEFT post-training baselines, achieving improvements of up to 5.65% and on average 4.36% over the best baseline.
【14】Adaptive Multi-Scale Goodness Aggregation for Forward-Forward Learning
标题:用于前向-前向(Forward-Forward)学习的自适应多尺度优度聚合
链接:https://arxiv.org/abs/2605.18804
作者:Salar Beigzad,Vansh Verma
备注:6 pages, 5 tables, IEEE format
摘要:We propose Adaptive Multi-Scale Goodness Aggregation (AMSGA), a novel extension of the Forward-Forward (FF) algorithm designed to improve stability, robustness, and generalization in local-learning neural networks. AMSGA addresses several limitations of the original FF framework by introducing multi-scale goodness aggregation across local, intermediate, and global representations; adaptive curriculum-guided hard negative mining; layer-dependent adaptive thresholds; and a warm-up cosine annealing learning-rate schedule for improved optimization stability. Together, these modifications strengthen the FF paradigm while preserving its biologically plausible and memory-efficient properties. Experiments on MNIST and Fashion-MNIST demonstrate consistent performance improvements over the baseline FF algorithm, achieving up to +1.45% improvement on MNIST and +1.50% improvement on Fashion-MNIST without significant computational overhead. Our results suggest that local learning methods can become substantially more competitive when goodness estimation and training dynamics are carefully designed.
【15】HELLoRA: Hot Experts Layer-Level Low-Rank Adaptation for Mixture-of-Experts Models
标题:HELLoRA:面向混合专家模型的热门专家层级低秩适应
链接:https://arxiv.org/abs/2605.18795
作者:Jia Wei,Zhonghao Zhang,Ping Chen,Qianyang li,Yancheng Pan,Shaoxun Wang,Ziyi Qiu,Longxiang Wang
摘要:Low-Rank Adaptation (LoRA) dominates parameter-efficient fine-tuning of large language models, yet most variants target dense architectures. Mixture-of-Experts (MoE) models scale parameters at near-constant per-token compute, and their sparse activation patterns create untapped opportunities for more efficient adaptation. We propose Hot-Experts Layer-level Low-Rank Adaptation (HELLoRA), which attaches LoRA modules only to the most frequently activated experts at each layer. This simple mechanism reduces trainable parameters and adapter-induced FLOPs while improving downstream performance, an effect we attribute to a form of structured regularization that preserves pretrained expert specialization. To stress-test HELLoRA under extreme parameter budgets, we further compose it with LoRI to form HELLoRI, which freezes the up-projection and sparsifies the down-projection. Across three MoE backbones, namely OlMoE-1B-7B, Mixtral-8x7B, and DeepSeekMoE, and three task families covering mathematical reasoning, code generation, and safety alignment, HELLoRA consistently outperforms strong PEFT baselines. Relative to vanilla LoRA on OlMoE, HELLoRA uses 15.7% of the trainable parameters, reduces adapter FLOPs by 38.7%, achieves 1.9x the training throughput, and improves accuracy by 9.2%. On DeepSeekMoE, HELLoRA outperforms LoRA while using only 23.2% of its trainable parameters. These results demonstrate that activation-aware adapter placement is an effective and practical route to scaling PEFT for MoE language models.
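The placement rule described above (adapters only on the most frequently activated experts in each layer) can be sketched as a frequency count over routing decisions. The trace format, the function name `hot_experts`, and the fixed per-layer `k` are assumptions for illustration; the abstract does not say how the paper chooses the number of hot experts.

```python
from collections import Counter


def hot_experts(routing_trace, k):
    """Return, per layer, the k experts that were routed to most often.

    routing_trace: dict mapping layer index -> list of expert ids, one entry
    per routed token (a hypothetical log format).
    """
    selection = {}
    for layer, expert_ids in routing_trace.items():
        counts = Counter(expert_ids)
        # most_common(k) yields the k highest-count (expert, count) pairs.
        selection[layer] = sorted(expert for expert, _ in counts.most_common(k))
    return selection


trace = {
    0: [1, 1, 2, 3, 1, 2],
    1: [0, 4, 4, 4, 0, 5],
}
placement = hot_experts(trace, k=2)
```

Only the experts listed in `placement` would receive LoRA modules in this sketch; all other experts stay frozen and adapter-free.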
【16】Posterior Contraction of Lévy Adaptive B-spline Regression in Besov Spaces
标题:Besov空间中Lévy自适应B样条回归的后验收缩
链接:https://arxiv.org/abs/2605.19610
作者:Jeunghun Oh,Sewon Park,Jaeyong Lee
摘要:We investigate the asymptotic properties of the Lévy Adaptive B-spline (LABS) regression model, a Bayesian nonparametric method that incorporates B-spline kernels into the Lévy Adaptive Regression Kernel (LARK) model. LABS applies splines of varying degrees with independently defined knots, yielding a flexible model class capable of adapting to irregular and locally structured features of the true function. Within the nonparametric regression framework with univariate random design and Gaussian errors, we establish that the LABS posterior contracts around the true function in Besov classes at nearly minimax-optimal rates, up to a logarithmic factor, while adapting automatically to unknown smoothness. This study contributes to filling a gap in the literature, where theoretical results on posterior contraction of the LARK model in Besov spaces remain scarce. Simulation experiments on standard test functions in Besov spaces, including Blocks, Bumps, HeaviSine, and Doppler, complement the theoretical results and demonstrate the practical utility of LABS.
强化学习(7篇)
【1】ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders
标题:ARC-RL:受ARC Raiders启发的强化学习游乐场
链接:https://arxiv.org/abs/2605.19503
作者:Carlo Romeo,Andrew D. Bagdanov
摘要:Reinforcement learning for legged locomotion has matured into a stack of multi-component reward functions and physics-engine benchmarks whose morphologies are uniformly derived from real commercial hardware. Game NPCs, however, are bound by stylistic constraints absent from sim-to-real robotics and routinely take the form of creatures with no real-robot counterpart. We introduce ARC-RL, a suite of four MuJoCo continuous-control environments featuring robotic morphologies inspired by the bestiary of ARC Raiders: the 18-DoF tall hexapod Queen, the 12-DoF armoured hexapod Bastion, the 18-DoF compact hexapod Tick, and the 12-DoF quadruped Leaper. All four robots share a unified observation template, action convention, simulation cadence, and a single closed-form multi-component reward function whose only per-morphology variation lives in a small set of weights and parameters. The reward fuses a velocity-tracking tent, a healthy survive bonus, a phase-locked gait-compliance bonus/cost pair, action regularisers, three safety penalties, and a posture anchor; no motion-capture data enters the reward at any point. We additionally provide hand-crafted Central Pattern Generator demonstrators per morphology, which serve both as fixed expert references and as sources of prior data for offline-to-online training. On this playground, we conduct a controlled empirical study comparing standard online algorithms (SAC, SPEQ, SOPE-EO) and methods augmented with prior data (SACfD, SPEQ-O2O, SOPE), and characterise how each paradigm copes with the playground's morphological diversity and animation-style stylistic constraints.
【2】Sampling-Based Safe Reinforcement Learning
标题:基于采样的安全强化学习
链接:https://arxiv.org/abs/2605.19469
作者:Luca Vignola,Bruce D. Lee,Manish Prajapat,Manuel Wendl,Melanie Zeilinger,Andreas Krause,Yarden As
摘要:Safe exploration remains a fundamental challenge in reinforcement learning (RL), limiting the deployment of RL agents in the real world. We propose Sampling-Based Safe Reinforcement Learning (SBSRL), a model-based RL algorithm that maintains safety throughout the learning process by enforcing constraints jointly across a finite set of dynamics samples. This formulation approximates an intractable worst-case optimization over uncertain dynamics and enables practical safety guarantees in continuous domains. We further introduce an exploration strategy based on constraining epistemic uncertainty, eliminating the need for explicit exploration bonuses. Under regularity conditions, we derive high-probability guarantees of safety throughout learning and a finite-time sample complexity bound for recovering a near-optimal policy. Empirically, SBSRL achieves safe and efficient exploration both in simulation and in real robotic hardware, and readily extends to practical deep-ensemble implementations that scale to high-dimensional continuous control problems.
【3】When the Majority Votes Wrong, the Intervention Timing for Test-Time Reinforcement Learning Hides in the Extinction Window
标题:当大多数人投票错误时,测试时强化学习的干预时间隐藏在灭绝窗口中
链接:https://arxiv.org/abs/2605.19444
作者:Hongxiang Lin,Zhirui Kuai,Erpeng Xue,Lei Wang
摘要:Test-time reinforcement learning (TTRL) reports substantial accuracy gains on mathematical reasoning benchmarks using majority vote as a pseudo-label signal. We argue these gains are systematically misinterpreted: most reflect sharpening of already-solvable problems rather than genuine learning, while problems corrupted from correct to incorrect outnumber truly learned ones, and this damage is irreversible once majority vote locks onto a wrong answer. Per-problem tracking reveals that correct-answer signals in low-ability problems are briefly active before being permanently suppressed, a phenomenon we term the Correct-Answer Extinction Window, with Flip Rate (FR) as its leading indicator. We thus propose TTRL-Guard, a lightweight framework with three mechanisms targeting the extinction window: Flip-Rate-Aware Reward Scaling (FRS) down-weights at-risk updates as FR declines, Minority-Preserving Sampling (MPS) retains gradient signal from minority correct answers, and Risk-Conditioned Sparse Updatings (RCSU) suspends updates on polarized problems. Experiments across three models and four benchmarks show that TTRL-Guard achieves the best average pass@1 on Qwen2.5-7B-Instruct and Qwen3-4B, and improves over TTRL by a relative 54% on AIME 2025. Our code and implementation details are available at https://github.com/linhxkkkk/TTRL-Guard.
【4】RLFTSim: Realistic and Controllable Multi-Agent Traffic Simulation via Reinforcement Learning Fine-Tuning
标题:RLFTSim:通过强化学习微调实现真实且可控的多智能体交通模拟
链接:https://arxiv.org/abs/2605.19033
作者:Ehsan Ahmadi,Hunter Schofield,Behzad Khamidehi,Fazel Arasteh,Jinjun Shan,Lili Mou,Dongfeng Bai,Kasra Rezaee
备注:CVPR 2026 Highlight; Project page at https://ehsan-ami.github.io/rlftsim
摘要:Supervised open-loop training has been widely adopted for training traffic simulation models; however, it fails to capture the inherently dynamic, multi-agent interactions common in complex driving scenarios. We introduce RLFTSim, a reinforcement-learning-based fine-tuning framework that enhances scenario realism by aligning simulator rollouts with real-world data distributions and provides a method for distilling goal-conditioned controllability in scenario generation. We instantiate RLFTSim on top of a pre-trained simulation model, design a reward that balances fidelity and controllability, and perform comprehensive experiments on the Waymo Open Motion Dataset. Our results show improvements in realism, achieving state-of-the-art performance. Compared with other heuristic search-based fine-tuning methods, RLFTSim requires significantly fewer samples due to a proposed low-variance and dense reward signal, and it directly addresses the realism alignment issue by design. We also demonstrate the effectiveness of our approach for distilling traffic simulation controllability through goal conditioning. The project page is available at https://ehsan-ami.github.io/rlftsim.
【5】Emergence of a Flow-Assisted Casting Strategy for Olfactory Navigation via Memory-Augmented Reinforcement Learning
标题:通过记忆增强强化学习涌现出用于嗅觉导航的流辅助投射策略
链接:https://arxiv.org/abs/2605.18881
作者:Changxu Zhao,Dongxiao Zhao,Xin Bian,Gaojin Li
摘要:In dynamic flow fields, various animals exhibit remarkable odor search capabilities despite relying on stochastic detections. Interestingly, there exists an optimal time window for integrating these detections that maximizes search efficiency. To understand the underlying mechanism, we investigate the navigation performance of Reinforcement Learning (RL) agents in unsteady flows under varying memory lengths and flow conditions. Without any predefined models, the agents develop a flow-assisted casting strategy and adaptively adjust both the geometry of their search trajectories and the concentration threshold for initiating casting to maximize the success rate. The agent's average speed toward the odor source exhibits a non-monotonic dependence on memory length, which can be explained by the "sector-search" model.
【6】ReCrit: Transition-Aware Reinforcement Learning for Scientific Critic Reasoning
标题:ReCrit:用于科学批判性推理的转换感知强化学习
链接:https://arxiv.org/abs/2605.18799
作者:Wanghan Xu,Yuhao Zhou,Hengyuan Zhao,Shuo Li,Dianzhi Yu,Zhenfei Yin,Yaowen Hu,Fengli Xu,Wanli Ouyang,Wenlong Zhang,Lei Bai
摘要:Large language models can fail in critic interaction not only by answering incorrectly, but also by abandoning an initially correct scientific solution after user criticism. This is especially risky in scientific reasoning, where user criticism can turn a valid answer into an incorrect one. We frame critic interaction as an inter-turn correctness-transition problem rather than a final-answer accuracy problem, and identify three challenges: transition awareness, decoupling useful correction from harmful sycophancy, and scalable rollout. We propose ReCrit, a transition-aware reinforcement learning framework that decomposes Initial-to-Critic behavior into four quadrants: Correction, Sycophancy, Robustness, and Boundary. ReCrit rewards correction and robustness, penalizes sycophancy, and treats persistent errors as weak boundary signals. To make interaction training practical, ReCrit further uses dynamic asynchronous rollout with tail-adaptive completion to reduce rollout waiting. On three scientific reasoning benchmarks, ChemBench, TRQA, and EarthSE, ReCrit improves average Critic accuracy from 38.15 to 51.49 on Qwen3.5-4B and from 45.40 to 55.59 on Qwen3.5-9B. Ablations show that final-answer rewards provide little interaction-level gain, while transition-aware rewards and quadrant weighting produce more distinguishable training signals and larger net Critic-stage improvement. The code is available at https://github.com/black-yt/ReCrit .
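The four-quadrant decomposition above maps directly onto a pair of correctness flags (before and after the critic turn). The sketch below shows that mapping with a transition-aware reward; the numeric weights are hypothetical placeholders, since the abstract only states that correction and robustness are rewarded, sycophancy is penalized, and persistent errors act as weak boundary signals.

```python
def quadrant(initial_correct: bool, final_correct: bool) -> str:
    """Classify an Initial-to-Critic transition into one of four quadrants."""
    if initial_correct:
        # Correct answer kept vs. abandoned after criticism.
        return "robustness" if final_correct else "sycophancy"
    # Wrong answer fixed vs. still wrong after criticism.
    return "correction" if final_correct else "boundary"


# Hypothetical weights, chosen only to respect the signs described in the abstract.
QUADRANT_REWARD = {
    "correction": 1.0,
    "robustness": 1.0,
    "sycophancy": -1.0,
    "boundary": -0.1,
}


def transition_reward(initial_correct: bool, final_correct: bool) -> float:
    return QUADRANT_REWARD[quadrant(initial_correct, final_correct)]
```

A final-answer reward would score `sycophancy` and `boundary` identically (both end wrong); keeping the quadrants separate is what lets the reward distinguish a harmful flip from a persistent error.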
【7】Precision Physical Activity Prescription via Reinforcement Learning for Functional Actions
标题:通过功能动作的强化学习实现精确的身体活动处方
链接:https://arxiv.org/abs/2605.19208
作者:Gefei Lin,Rui Miao,Jennifer Sacheck,Xiaoke Zhang
摘要:Physical activity (PA) plays an important role in maintaining and improving health. Daily steps have been a key PA measure that is easily accessible with common wearable devices. However, methods are lacking to recommend a personalized optimal distribution of daily steps over a period of time for the best of certain health biomarkers. In this paper, we fill this void based on the data from the All of Us Research Program which includes months of step counts as well as repeated measurements of key health biomarkers. We develop a new offline reinforcement learning (RL) algorithm to learn personalized and optimal PA distributions associated with cardiometabolic risk, where the action is a function representing the daily step distribution over a period of time. Simulation studies demonstrate the advantage of the proposed approach over existing continuous-action RL methods. The learned optimal policy from the All of Us data generally suggests people take more daily steps and also follow a more consistent pattern of PA over time while offering tailored recommendations for subgroups in blood glucose level, body mass index, blood pressure, age, and sex.
元学习(1篇)
【1】Beyond Binary Success: A Diagnostic Meta-Evaluation Framework for Fine-Grained Manipulation
标题:超越二元成功:细粒度操纵的诊断元评估框架
链接:https://arxiv.org/abs/2605.19986
作者:He-Yang Xu,Pengyuan Zhang,Zongyuan Ge,Xiaoshuai Hao,Serge Belongie,Xin Geng,Yuxin Peng,Xiu-Shen Wei
备注:Project page: https://metafine.github.io/
摘要:Fine-grained manipulation marks a regime where global scene context no longer suffices, and success hinges on the tight coupling of local attribute grounding, high-fidelity spatial perception, and constraint-respecting motor execution. However, current embodied AI benchmarks collapse these capacities into binary success rates, systematically inflating reported capabilities by up to 70% and masking the architectural bottlenecks that impede real-world deployment. We introduce MetaFine, a diagnostic meta-evaluation framework that disentangles manipulation competency along three axes: understanding, perception, and controlled behavior. Built on a compositional task graph, MetaFine absorbs heterogeneous external benchmarks and reconstructs them into diagnostic scenarios of varying complexity under a unified protocol. Evaluating state-of-the-art vision-language-action (VLA) models through this lens exposes severe dimension-specific failures invisible to conventional metrics. Through targeted causal intervention, we identify the visual encoder's ability to preserve local spatial structure as a key bottleneck for fine-grained precision: improving it directly unlocks previously inaccessible manipulation capabilities without modifying downstream policies. MetaFine further supports hybrid real-sim validation, using limited paired real-world rollouts to calibrate scalable simulation-based estimates for more stable physical benchmarking. By shifting evaluation from ranking to diagnosis, MetaFine turns benchmarking into an actionable compass for repairing the layered capacities underlying genuine physical dexterity. The MetaFine framework, benchmarks, and supporting resources will be publicly released at our project page: https://metafine.github.io/.
分层学习(2篇)
【1】Hierarchical Contrastive Learning for Multi-Domain Protein-Ligand Binding
标题:多结构域蛋白质-配体结合的分层对比学习
链接:https://arxiv.org/abs/2605.19902
作者:Shuo Zhang,Rongqi Hong,Huifeng Zhang,Jian K. Liu
备注:Accepted by ISBRA2026
摘要:Predicting protein-ligand binding affinity remains intractable for multi-domain proteins, where inter-domain dynamics govern molecular recognition. Existing geometric deep learning methods typically treat proteins as monolithic static graphs, suffering from rigid-body assumptions and aleatoric noise in flexible regions. To address this, we introduced HCLBind, a self-supervised framework that decouples geometric representation learning from affinity regression. HCLBind leverages a general-to-specific pre-training paradigm on the Q-BioLiP database to learn a robust physical grammar of binding. We propose a novel hierarchical decoy strategy: the model learns local physicochemical constraints through protein coordinate perturbation in single-domain proteins and global conformational geometry through inter-domain rotation in multi-domain complexes. Our hybrid architecture integrates a domain-gated graph attention network and cross-modal attention to explicitly prioritize domain interfaces. Furthermore, we employ LoRA on protein and ligand foundation models, ensuring efficient optimization while preserving evolutionary knowledge. Experiments on PDBBind demonstrate that HCLBind effectively learns discriminative interface features and provides robust uncertainty estimation, overcoming the limitations of standard supervised learning. The code is available at https://github.com/jiankliu/HCLBind.
【2】Towards Family-Grouped Hierarchical Federated Learning on Sub-5KB Models: A Feasibility Study of Privacy-Preserving ECG Monitoring for Ultra-Resource-Constrained Wearables
标题:迈向Sub-5 KB模型上的家庭分组分层联邦学习:超资源限制可穿戴设备的隐私保护心电图监测的可行性研究
链接:https://arxiv.org/abs/2605.18862
作者:Hangyu Wu
备注:Supported by Shenzhen Coddie Technology Co., Ltd. This is a preprint and has not been peer-reviewed
摘要:Cardiovascular disease remains the leading cause of death worldwide, and early detection of arrhythmias through continuous ECG monitoring on wearable devices can prevent life-threatening events. Federated Learning (FL) enables privacy-preserving collaborative training by keeping raw ECG data on device, yet standard FL incurs prohibitive communication overhead and standard deep learning models cannot fit on ultra-low-power microcontrollers. We propose Family-Grouped Hierarchical Federated Learning (Family-FL), a three-tier architecture that uses the family as a natural privacy boundary for intra-family aggregation before global synchronization. We further design a hardware-constrained Tiny CNN-LSTM architecture with only 669 parameters, INT8-quantized to occupy merely 4.65KB Flash and 2.95KB RAM, meeting the constraints of STC32G12K128-class microcontrollers. Experiments on the MIT-BIH Arrhythmia Database (mean of 5 independent runs with different seeds) demonstrate that Family-FL reduces communication volume by 76.7% compared to FedAvg while maintaining comparable accuracy. Family-FL-Tiny achieves 91.9 +/- 1.2% accuracy with macro-F1 of 0.483 +/- 0.031, reducing total communication to 0.31% of FedAvg. The model achieves reliable ventricular arrhythmia detection (per-class F1 = 0.80), the most clinically critical abnormality for home-based preliminary screening. These results demonstrate the technical feasibility of privacy-preserving federated learning on ultra-resource-constrained microcontrollers through simulation-based evaluation. We honestly discuss limitations: no hardware deployment, single-dataset validation (MIT-BIH, 47 subjects), reduced rare-class sensitivity, and absence of formal differential privacy guarantees.
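The aggregation order described above (intra-family aggregation before global synchronization) can be sketched with sample-weighted averaging at both levels. Using FedAvg-style weighting by sample count, flat parameter lists, and the function names below are all assumptions; the abstract does not specify the aggregation rule.

```python
def weighted_average(updates):
    """Average parameter vectors weighted by sample count.

    updates: list of (parameters, n_samples) pairs, parameters as a flat list.
    Returns the averaged parameters and the total sample count.
    """
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    averaged = [sum(params[i] * n for params, n in updates) / total for i in range(dim)]
    return averaged, total


def family_fl_round(families):
    """One round: devices -> family aggregate -> global aggregate.

    families: list of families, each a list of (parameters, n_samples) device updates.
    """
    # Tier 2: each family aggregates its own devices locally.
    family_models = [weighted_average(devices) for devices in families]
    # Tier 3: the server only receives one model per family.
    global_model, _ = weighted_average(family_models)
    return global_model


families = [
    [([1.0, 0.0], 10), ([3.0, 0.0], 10)],  # family with two wearables
    [([0.0, 4.0], 20)],                    # family with one wearable
]
global_params = family_fl_round(families)
```

With sample-count weighting the two-level average equals the flat average over all devices, so in this sketch the hierarchy changes how many models travel to the server, not the aggregated result.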
医学相关(7篇)
【1】Interpretable Computer Vision for Defect Detection in X-ray Tomography of Aerospace SiC/SiC Composites
标题:可解释计算机视觉用于航空航天SiC/SiC复合材料X射线断层扫描中的缺陷检测
链接:https://arxiv.org/abs/2605.20159
作者:Antonio Peña Corredor,Julien Lesseur,Romain Nunez,Paul Rivalland,Thomas Philippe
摘要:Non-destructive testing of aerospace SiC/SiC composites via X-ray computed tomography (XCT) relies on expert visual assessment, with current workflows offering limited traceability for accept/reject decisions. Deep convolutional networks can automate defect detection, yet their black-box nature conflicts with the transparency that industrial inspection practice demands. To close this gap, we introduce p-ResNet-50, a convolutional framework extended with a prototype layer that couples high detection accuracy with case-based explanations. Six learned prototypes are explicitly aligned with expert-defined semantic categories-healthy matrix, matrix--air interfaces, pores, line-like defects, and mixed morphologies-so that every classification is traceable to a physically meaningful reference. Two novel regularisation terms, anchor-based and medoid-based, tether prototypes to expert-selected patches and prevent prototype collapse, addressing a known limitation of prototype networks. Latent-space analysis via UMAP delineates semantically coherent sub-domains and maps zones of uncertainty where misclassifications concentrate, giving inspectors an explicit picture of where the model is-and is not-reliable. The framework is validated on an XCT patch dataset of approximately 12,000 patches extracted from four defect-rich SiC/SiC laboratory specimens. Taking a black-box ResNet-50 as a baseline (ROC-AUC = 0.991), the prototype extension achieves comparable performance (accuracy 0.957 vs. 0.959; ROC-AUC 0.994 vs. 0.993) while trading a slight reduction in sensitivity for higher precision and specificity. Each decision is backed by representative evidence patches, and the model explicitly flags its uncertainty regions. Beyond defect mapping, the framework establishes a reusable methodology for embedding domain-expert knowledge into prototype networks, applicable to other XCT inspection scenarios requiring traceable, auditable decisions.
【2】Neuron Incidence Redistribution for Fairness in Medical Image Classification
标题:基于神经元关联重分布的医学图像公平分类
链接:https://arxiv.org/abs/2605.19393
作者:Abin Shoby,Lyle John Palmer,Nikhil Cherian Kurian
备注:4 Pages, 1 Figure
摘要:Deep learning models for medical image classification are susceptible to subgroup performance disparities across demographic attributes such as age, gender, and race. We identify a latent representational mechanism underlying these disparities: in transfer-learned models, the dominant penultimate-layer activation channel under positive predictions is co-activated by both disease-positive samples and privileged demographic groups (male, older patients), producing over-diagnosis; conversely, the dominant channel under negative predictions is co-activated by disadvantaged groups (female, younger patients), producing systematic under-diagnosis. To address this, we propose Neuron Incidence Redistribution (NIR), a lightweight regularization method that penalizes the variance of predicted-probability-weighted mean activations across penultimate-layer neurons, requiring no demographic labels at training time. On HAM10000, TPR disparity drops from 10.81% to 0.93% across age groups and from 12.04% to 0.74% across gender, with a marginal AUC improvement of 0.51 points. On Harvard OCT-RNFL, NIR reduces FPR disparity for race (from 15.68% to 10.66%) and age (from 12.69% to 1.80%), demonstrating that distributing latent disease evidence across the full penultimate layer is a principled and effective strategy for improving demographic fairness in medical AI.
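The NIR penalty described above (variance, across penultimate-layer neurons, of predicted-probability-weighted mean activations) can be written out in a few lines. This is a plain-Python sketch under assumptions: the weighted mean is normalized by the sum of the predicted probabilities, a single positive-class probability is used per sample, and the name `nir_penalty` is hypothetical. The paper's exact formulation may differ.

```python
def nir_penalty(activations, probs, eps=1e-12):
    """Variance across neurons of probability-weighted mean activations.

    activations: list of per-sample activation rows (N samples x D neurons).
    probs: list of N predicted probabilities used as sample weights.
    """
    total = sum(probs) + eps
    n_neurons = len(activations[0])
    # Weighted mean activation of each neuron over the batch.
    means = [
        sum(p * row[j] for p, row in zip(probs, activations)) / total
        for j in range(n_neurons)
    ]
    # Population variance of those means across neurons.
    center = sum(means) / n_neurons
    return sum((m - center) ** 2 for m in means) / n_neurons


# Evidence concentrated in one neuron gives a large penalty ...
concentrated = nir_penalty([[2.0, 0.0], [2.0, 0.0]], [0.9, 0.1])
# ... while evenly spread evidence gives none.
spread = nir_penalty([[1.0, 1.0], [1.0, 1.0]], [0.5, 0.5])
```

Adding this term to the training loss pushes the model away from relying on a single dominant channel, and it needs no demographic labels because only activations and predicted probabilities enter the computation.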
【3】ExECG: An Explainable AI Framework for ECG models
标题:ExECG:用于心电图模型的可解释人工智能框架
链接:https://arxiv.org/abs/2605.19258
作者:Jong-Hwan Jang,Yong-yeon Jo
摘要:Deep learning has enabled ECG diagnostic models with strong performance in tasks such as arrhythmia classification and abnormality detection. However, accuracy alone is insufficient for clinical deployment because it does not explain why a specific output was produced, limiting justification, error analysis, and trust. Although ECG XAI has been extensively investigated and steadily improved, practical pipelines and reporting conventions vary across studies, hindering reuse and reproducibility. To address these issues, we present Explainable AI framework for ECG models (ExECG), a Python framework that provides a three-stage pipeline: Wrapper standardizes access across heterogeneous ECG formats and intermediate representations, Explainer unifies diverse XAI methods under a shared execution protocol, and Visualizer supports consistent cross-method comparison within a unified interface. We demonstrate end-to-end usage with concise examples and two case studies, highlighting interoperable and reproducible ECG explainability.
【4】Worst-Group Equalized Odds Regularization for Multi-Attribute Fair Medical Image Classification
标题:面向多属性公平医学图像分类的最差组均等化几率正则化
链接:https://arxiv.org/abs/2605.19214
作者:Nikhil Cherian Kurian,Victor Caquilpan Parra,Abin Shoby,Luke Whitbread,Lauren Oakden-Rayner,Robert Vandersluis,Jessica Schrouff,Lyle J. Palmer,Mark Jenkinson
备注:11 Pages, 2 Figures
摘要:Diagnostic performance in medical AI varies systematically across demographic groups, yet subgroup AUC can mask clinically important disparities. At a fixed inference-time operating point, some groups may exhibit over-diagnostic behaviour, characterized by elevated true and false positive rates, while others show under-diagnostic patterns with reduced true and false positive rates. These opposing tendencies can cancel in aggregate AUCs while producing meaningful inequities in clinical decision-making. Motivated by the need to assess and mitigate such disparities at the operating point and across multiple demographic attributes simultaneously, we propose a worst-group equalized-odds margin regularizer. The proposed regularizer explicitly targets subgroup-level deviations on both the true positive and false positive sides at inference. At each update, the method identifies subgroups defined by explicit demographic attributes (e.g., age, sex, and race) that exhibit the most extreme margin deviations and applies a unified penalty, enabling fairness optimization across multiple demographic axes without requiring explicit intersectional constraints. Across two medical imaging datasets in realistic multi-label settings, our method consistently reduces disparities in Equalized Odds and Equalized Opportunity with minimal impact on AUC, preserving diagnostic performance while improving fairness.
【5】Quantized Machine Learning Models for Medical Imaging in Low-Resource Healthcare Settings
标题:低资源医疗保健环境中医学成像的量化机器学习模型
链接:https://arxiv.org/abs/2605.19207
作者:Sumanth Meenan Kanneti,Aryan Shah
摘要:Deep learning models have shown strong performance in medical image analysis, but deploying them in low-resource clinical environments remains difficult due to computational, memory, and power constraints. This paper presents a multi-strategy compression framework for brain tumor classification from MRI, encompassing quantization-aware training, knowledge distillation from a DenseNet-101 teacher to a compact DenseNet-32 student with low-bit post-training quantization, and Float16 post-training quantization on a lightweight MobileNetV2 backbone. Using a multi-class brain tumor MRI dataset containing glioma, meningioma, pituitary tumors, and healthy controls, we provide full experimental validation of the MobileNetV2-based pipeline, training the classifier through a three-stage transfer learning process and applying Float16 quantization via TensorFlow Lite. The DenseNet-based distillation and quantization-aware training strategies are described as complementary compression approaches within the framework, with their complete empirical evaluation reserved for future work. Experimental results on the MobileNetV2 pipeline show that the quantized model achieves 82.37 percent validation accuracy compared to the 82.20 percent full-precision baseline, reducing model size from 35.34 MB to 5.76 MB, a 6.14x compression ratio with no meaningful accuracy loss. Per-class evaluation confirms that quantization preserves diagnostic performance uniformly across all four tumor categories. These findings demonstrate that lightweight quantized models can deliver clinically viable brain tumor screening in resource-constrained healthcare settings.
【6】Spectral Gradient Surgery for Domain-Generalizable Dataset Distillation
标题:域可概括数据集蒸馏的光谱梯度手术
链接:https://arxiv.org/abs/2605.18836
作者:Minyoung Oh,Najeong Chae,Jae-Young Sim
备注:17 pages
摘要:Dataset Distillation (DD) synthesizes a compact synthetic dataset that preserves the training utility of a full dataset. However, its standard formulation assumes that test data follow the same distribution as training data, an assumption that rarely holds in practice. A straightforward extension, applying post-hoc Domain Generalization (DG) techniques to distilled data, is ill-suited because existing DG methods rely on the natural diversity of real datasets, which compact synthetic sets inherently lack, while also incurring substantial augmentation overhead that conflicts with the efficiency objective of dataset distillation. To address this limitation, we introduce Domain Generalizable Dataset Distillation (DGDD), a new problem setting that explicitly targets out-of-distribution (OOD) generalization of distilled datasets. We study this problem through a widely adopted DD baseline of Distribution Matching (DM). We attribute the OOD vulnerability of DM to the entanglement of class-discriminative and domain-specific information within the compressed synthetic set, and propose Spectral Gradient Surgery (SGS) to disentangle the two. The key insight of SGS is that cross-domain agreement among domain-wise gradients in the spectral domain reveals which gradient components are shared across source domains (and are therefore class-discriminative) and which are domain-specific. Based on this observation, SGS augments the standard DM update with two complementary gradients: one that reinforces cross-domain shared components and another that explicitly promotes diversity within the distilled dataset. Extensive experiments on diverse-scale benchmarks demonstrate that SGS substantially improves OOD generalization while remaining plug-and-play compatible with existing DM methods.
【7】Learning Interpretable Point-Based Clinical Risk Scores via Direct Optimization
标题:通过直接优化学习可解释的基于点的临床风险评分
链接:https://arxiv.org/abs/2605.19113
作者:Ying Cui,Albert M Li,Vivek Charu,Yeon-Mi Hwang,Tina Hernandez-Boussard,Lu Tian
备注:23 pages, 4 figures
摘要:Many clinical risk scores are deployed as additive rules with nonnegative integer points assigned to relevant binary predictive features. These integer weights not only make the score easier to use in practice but also promote sparsity in the resulting prediction model. Such risk scores are often derived by first fitting a regression model and then rounding the estimated coefficients to the nearest integer after appropriate scaling. This approach is computationally fast but does not guarantee optimality of the resulting score. Alternatively, one may search over all possible integer weights to directly optimize a value function by posing the problem as an integer programming task. However, the associated computational burden can be substantial, especially when the value function is nonconcave or even discontinuous. In this paper, we develop new machine learning algorithms that employ a flexible greedy optimization strategy to learn such additive scoring directly under explicit and sensible optimality objectives. We apply the proposed method to a large electronic health record (EHR) cohort in Epic Cosmos to construct an integer-weighted comorbidity score for measuring the risk of post-discharge mortality. We also conduct a simulation study to examine the finite-sample operating characteristics.
蒸馏|知识提取(6篇)
【1】Towards Distillation Guarantees under Algorithmic Alignment for Combinatorial Optimization
标题:面向组合优化的算法对齐下的蒸馏保证
链接:https://arxiv.org/abs/2605.20074
作者:Thien Le,Melanie Weber
备注:22 pages
摘要:Distillation transfers knowledge from a large model trained on broad data to a smaller, more efficient model suitable for deployment. In structured prediction settings, prior knowledge about the task can guide the choice of a target architecture that is algorithmically aligned with the underlying problem. Building on recent learning-theoretic analyses of decision-tree (DT) distillation (Boix-Adsera, 2024), we study when distillation succeeds for combinatorial optimization tasks. We focus on the case where the target model is a graph neural network whose architecture is aligned with a dynamic programming (DP) algorithm for the task. Assuming that the source model is sufficiently rich, formalized through the linear representation hypothesis (LRH) (Elhage et al., 2022; Park et al., 2024), we show that the distillation problem can be solved efficiently in the complexity parameters of the DP transition function, represented as a DT. Our results provide a rigorous sufficient condition for successful distillation in the flavour of algorithmic alignment.
【2】Fast Tensorization of Neural Networks via Slice-wise Feature Distillation
标题:通过逐片特征蒸馏的神经网络快速张量化
链接:https://arxiv.org/abs/2605.19842
作者:Safa Hamreras,Sukhbinder Singh,Román Orús
摘要:We propose a scalable tensorization framework for neural network compression based on slice-wise feature distillation. Unlike conventional tensor decomposition methods that rely on costly global finetuning, our approach decomposes the network into slices consisting of either individual layers or blocks (e.g., convolutional layers or MLPs), or small groups of consecutive layers, and tensorizes each slice independently to reproduce the intermediate representations of the original pretrained model. This modular strategy improves accuracy recovery, reduces data requirements, and enables efficient parallel optimization. Experiments on ResNet-34 show significant gains over conventional global tensorization, achieving near-lossless compression at moderate compression rates with faster optimization. Results on GPT-2 XL further demonstrate the scalability of the method and its applicability to large-scale models, particularly in distributed settings.
【3】CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization
标题:CEPO:使用对比证据策略优化的RLVR自蒸馏
链接:https://arxiv.org/abs/2605.19436
作者:Ahmed Heakl,Abdelrahman M. Shaker,Youssef Mohamed,Rania Elbadry,Omar Fetouh,Fahad Shahbaz Khan,Salman Khan
备注:9 pages
摘要:When a model produces a correct solution under reinforcement learning with verifiable rewards (RLVR), every token receives the same reward signal regardless of whether it was a decisive reasoning step or a grammatical filler. A natural fix is to condition the model on the correct answer as a teacher, identifying tokens it would have generated differently had it known the answer. Prior work shows this either corrupts training by leaking the answer into the gradient, or produces a weak signal that cannot distinguish decisive steps from filler, since both look equally surprising relative to the model's baseline. We propose Contrastive Evidence Policy Optimization (CEPO), which asks a sharper question at every token: not just "does the correct answer favor this token?" but "does the correct answer favor it while the wrong answer disfavors it?" A token satisfying both is a genuine reasoning step; one satisfying neither is filler. The wrong-answer teacher is constructed from rejected rollouts already in the training batch, incurring no additional sampling cost. We prove CEPO inherits all structural safety guarantees of the prior state of the art while strictly sharpening credit at decisive tokens, with the improvement vanishing exactly at filler positions. Empirically, CEPO achieves 43.43% and 60.56% average accuracy across five multimodal mathematical reasoning benchmarks at 2B and 4B scale, respectively, versus 41.17% and 57.43% for GRPO under identical training budgets. Distribution-matching self-distillation methods (OPSD, SDPO) fall below the untrained baseline, empirically confirming the information leakage our theory predicts. Our code is available at https://github.com/ahmedheakl/CEPO.
【4】Distilling Linearized Behavior for Effective Task Arithmetic
标题:蒸馏线性化行为以实现有效的任务算术
链接:https://arxiv.org/abs/2605.18993
作者:Thomas Sommariva,Francesca Morandi,Simone Calderara,Angelo Porrello
备注:Accepted at ICML 2026
摘要:Task vector composition has emerged as a promising paradigm for editing pre-trained models, enabling model merging through addition and unlearning through subtraction. Fine-tuning in the tangent space of a pre-trained model (linear fine-tuning) has proven effective, as it produces task vectors that are naturally disentangled and resistant to interference. However, linearized models suffer from limited expressivity during training and incur higher computational costs at inference time, which restrict their practical applicability. In this work, we bridge the gap between linear and standard non-linear fine-tuning. We show that linearity with respect to weight perturbations, a property defined in parameter space, can be enforced through constraints in activation space during training. Concretely, we distill hidden representations from a curvature-regularized linearized teacher into a non-linear student trained via conventional fine-tuning. We find that the resulting model inherits key properties of linearized models for task arithmetic, enabling effective composition of task vectors and achieving strong performance across vision and language benchmarks without incurring any inference-time overhead.
【5】From Sparsity to Simplicity: Enabling Simpler Sequential Replacements via Sparse Attention Distillation
标题:从稀疏到简单:通过稀疏注意力蒸馏实现更简单的顺序替换
链接:https://arxiv.org/abs/2605.18865
作者:Yuxin Ren,Maxwell D Collins,Miao Hu,Huanrui Yang
摘要:Self-attention serves as the core foundation of large-scale transformer pretraining, but its quadratic token interaction cost makes inference expensive. Replacing attention with simpler sequential modules is appealing, yet naive substitution is often lossy, especially at larger scales. This paper revisits attention replacement through the lens of sparsity. Based on the observation of diverse sparsity patterns across transformer layers, we posit that pretrained transformers decompose the complex token dependency across tokens into various sequence-to-sequence mappings of diverse complexities, where some layer functionalities can be approximated and replaced with much simpler sequential modules without loss. We evaluate this premise using a plug-and-play layer-wise distillation framework to approximate and replace attention functionalities in pretrained vision transformer models. Controlled group-wise replacements under a fixed training budget reveal a clear pattern: substituting layers with sparser attention incurs substantially smaller accuracy drops than replacing denser ones. We further impose explicit attention sparsity on the pretrained ViT via AViT-style token retention and perform sparsity-guided distillation for sequential replacing models, where we see increasing teacher sparsity consistently reduces the student-teacher gap. The proposed method achieves efficient attention replacement for reduced parameter size and latency through the guidance of attention sparsity.
【6】Lossless Anti-Distillation Sampling
标题:无损反蒸馏采样
链接:https://arxiv.org/abs/2605.18829
作者:Zibo Diao,Jingchu Gai,Xinyue Ai,Zhang Zhang,Zhenyu He,Di He
摘要:Frontier commercial generative models face a growing threat from distillation, whereby a distiller harvests generated responses and trains a competing model of its own at drastically lower cost. Existing defenses either rely on modifying the model's outputs, thereby sacrificing response quality for benign users, or on behavioral detection methods, which can be readily circumvented by distributing queries across multiple accounts. In this work, we propose Lossless Anti-Distillation Sampling (LADS), a novel sampling scheme specifically designed to counter multi-account distillation while maintaining a lossless experience for benign users. Concretely, LADS derives the randomness underlying each generation from a private seed determined by the semantic content of the query and the number of times the user has queried the model. By construction, every benign user receives a response independently sampled from the original model at each visit, and thus experiences no distortion. In contrast, for a distiller, different accounts share latent randomness whenever their queries fall in the same semantic bucket. As a result, the harvested data becomes correlated, potentially reducing sample diversity and degrading generalization. Using uniform convergence theory, we show that LADS provably degrades the convergence rate of the distiller's generalization gap relative to standard i.i.d. sampling in both unconditional and conditional generation settings. Experiments on image generation, mathematical reasoning, and code generation confirm that LADS substantially degrades the performance of distilled students while preserving exact statistical fidelity for individual users.
聚类(1篇)
【1】A Multi-Dimensional Clustering Approach for Identifying Inborn Errors of Immunity
标题:识别先天性免疫缺陷的多维聚类方法
链接:https://arxiv.org/abs/2605.18880
作者:Nishad Kulkarni,Alexandra K. Martinson,Nicholas L. Rider,Michael Keller,Syed Muhammad Anwar
备注:Accepted at EMBC 2026
摘要:Rare diseases such as inborn errors of immunity (IEI) require early diagnosis to prevent end organ damage and improve quality of life. Hurdles in accessing and curating large scale electronic health record (EHR) data limit routine data driven analyses to remain on the forefront of IEI and other rare disease trends. Development of machine learning (ML) algorithms in IEI for pattern recognition as well as published methodology examining how to systematically process and integrate complex medical data is limited. Our proposed pipeline, including data curation and ML clustering algorithms, is designed to recognize novel rare disease patterns and extract IEI-associated features from a national data registry. Our methodology for EHR data formatting and processing presents the pipeline that transforms raw immunologic lab data into vectors. This is further combined with hyperparameter tuning for disease pattern recognition via clustering. This study refines IEI feature awareness, develops data tool kits for rare disease population analysis, and expands on transforming complex medical records into data structures interpretable by unsupervised ML.
自动驾驶|车辆|车道检测等(1篇)
【1】D$^3$-Subsidy: Online and Sequential Driver Subsidy Decision-Making for Large-Scale Ride-Hailing Market
标题:D$^3$-Subsidy:大规模网约车市场的在线序贯司机补贴决策
链接:https://arxiv.org/abs/2605.20036
作者:Taijie Chen,Rui Su,Siyuan Feng,Laoming Zhang,Hongyang Zhang,Haijiao Wang,Zhaofeng Ma,Jintao Ke
备注:14 pages, 14 figures
摘要:Ride-hailing platforms like DiDi Chuxing operate in highly dynamic environments where balancing driver supply and passenger demand is critical. Although driver-side subsidies serve as a primary lever to align these forces and improve key KPIs like completed rides (\texttt{Rides}) and gross merchandise value (\texttt{GMV}), optimizing them in production requires simultaneously meeting three constraints: (i) responsiveness to stochastic shocks, (ii) strict subsidy-rate caps, and (iii) low-latency execution at city scale. These requirements rule out expensive per-order optimization, calling for a forward-looking, constraint-aware city-level controller for online sequential decision making. To meet these requirements, we introduce D$^3$-Subsidy (Dynamic Driver-side Diffusion-based Subsidy), a hierarchical diffusion-based framework for deployable city-wide subsidy control. To bridge the train-inference gap, D$^3$-Subsidy employs a prefix-conditioned diffusion model that samples plausible future trajectories from immutable historical observations, ensuring the training protocol aligns with the fixed-history nature of online deployment. These generated plans are then decoded by a context-conditioned inverse module into low-dimensional city-level control signals. For scalable execution, we bridge the gap between city-level planning and fine-grained dispatch via a Lagrangian-dual-derived mapping, which embeds subsidy-rate caps directly into order-driver incentives without iterative optimization. Additionally, a multi-city pretraining strategy with parameter-efficient fine-tuning enables robust transfer across heterogeneous cities. Extensive offline evaluations demonstrate that D$^3$-Subsidy improves \texttt{Rides} and \texttt{GMV} while enhancing cap compliance, and a real-world A/B test confirms significant uplift while keeping budget-related violation metrics within operational thresholds.
联邦学习|隐私保护|加密(3篇)
【1】General Lower Bounds for Differentially Private Federated Learning with Arbitrary Public-Transcript Interactions
标题:具有任意公开转录交互的差分隐私联邦学习的一般下界
链接:https://arxiv.org/abs/2605.19813
作者:Yicheng Li
摘要:We prove a general lower bound for differentially private federated learning protocols with arbitrary public-transcript interactions. The protocol may use any number of adaptive rounds, and each client's local samples may be reused across rounds. For parameter estimation under squared \(\ell_2\) loss, we establish a federated van Trees lower bound for every estimator satisfying a total clientwise sample-level zero-concentrated differential privacy (zCDP) constraint. The main technical ingredient is a privacy-information contraction inequality for complete public transcripts. We illustrate the bound through applications to mean estimation, linear regression, and nonparametric regression.
【2】FedMental: Evaluating Federated Learning for Mental Health Detection from Social Media Data
标题:FedMental:评估联邦学习以从社交媒体数据中进行心理健康检测
链接:https://arxiv.org/abs/2605.18936
作者:Nuredin Ali Abdelkadir,Anjali Ratnam,Zeerak Talat,Stevie Chancellor
备注:Association for Computational Linguistics (ACL) 2026 Main Conference
摘要:Social media text data are often used to train Machine Learning (ML) models to identify users exhibiting high-risk mental health behaviors. However, sharing this sensitive data poses privacy risks and limits the growth of benchmark datasets. We comprehensively evaluate whether privacy-preserving ML techniques can enable safer data sharing while preserving performance. Specifically, we apply federated learning (FL) and Differentially Private FL for two widely-studied mental health prediction tasks: depression detection on X (Twitter) and suicide crisis detection on Reddit. We simulate realistic data-sharing scenarios by treating each user as a client in a non-IID setting, evaluating across different client fractions, aggregation strategies, and privacy budgets. While FL achieves comparable performance to centralized training (centralized F1 = 85.63; best FL model F1 = 83.16) on depression identification, we find that Differentially Private FL has a large performance-privacy trade-off (up to F1 = 27.01 drop) even with low levels of noise (epsilon = 50). This is due to the distortion of highly informative yet sparse linguistic markers related to mental health, like health topics and emotion words. This research empirically demonstrates the potential and limitations of current privacy preservation techniques for mental health inference tasks.
【3】Data-Free Client Contribution Estimation via Logit Maximization for Federated Learning
标题:通过联邦学习的Logit最大化进行无数据客户贡献估计
链接:https://arxiv.org/abs/2605.18892
作者:Asim Ukaye,Nurbek Tastan,Mubarak Abdu-Aguye,Karthik Nandakumar
备注:22 pages, 7 figures
摘要:Federated learning (FL) enables collaborative learning of computer vision models, where privacy and regulatory constraints prevent centralizing data across devices or organizations. However, practical FL deployments often exhibit severe class imbalance and label skew, causing standard aggregation protocols to overfit dominant clients and degrade minority-class performance. We propose a data-free, class-wise contribution estimation and aggregation framework based on logit maximization (CELM) that does not require sharing raw data, client metadata, or auxiliary public datasets. The FL server probes client updates to obtain class-wise evidence scores and assembles a cross-client evidence matrix, which quantifies both per-class competence and class coverage. Using this matrix, we compute contribution weights that upweight clients providing strong, discriminative evidence for underrepresented classes. The resulting aggregation is stable due to simplex constraints and momentum smoothing, and it remains compatible with standard FL training pipelines. We evaluate the approach on representative vision benchmarks under controlled non-IID and pathological label splits, demonstrating that CELM-based aggregation improves robustness to imbalance and statistical heterogeneity, while yielding better performance without requiring any additional data exchange.
推理|分析|理解|解释(17篇)
【1】Multi-axis Analysis of Image Manipulation Localization
标题:图像操纵定位的多轴分析
链接:https://arxiv.org/abs/2605.20174
作者:Keanu Nichols,Divya Appapogu,Giscard Biamby,Dina Bashkirova,Anna Rohrbach,Bryan A. Plummer
备注:28 pages, 5 figures, 5 tables
摘要:Advanced image editing software enables easy creation of highly convincing image manipulations, which has been made even more accessible in recent years due to advances in generative AI. Manipulated images, while often harmless, could spread misinformation, create false narratives, and influence people's opinions on important issues. Despite this growing threat, there is limited research on detecting advanced manipulations across different visual domains. Thus, we introduce Analysis Under Domain-shifts, qualIty, Type, and Size (AUDITS), a comprehensive benchmark designed for studying axes of analysis in image manipulation detection. AUDITS comprises over 530K images from two distinct sources (user and news photos). We curate our dataset to support analysis across multiple axes using recent diffusion-based inpaintings, spanning a diverse range of manipulation types and sizes. We conduct experiments under different types of domain shift to evaluate robustness of existing image manipulation detection methods. Our goal is to drive further research in this area by offering new insights that would help develop more reliable and generalizable image manipulation detection methods.
【2】Optimal Representation Size: High-Dimensional Analysis of Pretraining and Linear Probing
标题:最优表示维度:预训练和线性探测的高维分析
链接:https://arxiv.org/abs/2605.20105
作者:Valentina Njaradi,Clémentine Dominé,Rachel Swanson,Marco Mondelli,Andrew Saxe
摘要:Learning to generalise from limited data is a fundamental challenge for both artificial and biological systems. A common strategy is to extract reusable structure from abundant unlabelled data, enabling efficient adaptation to new tasks from limited labelled data. This two-stage paradigm is now standard in modern training pipelines, where pretraining is followed by fine-tuning or linear probing. We provide an analytical model of this process: structure extraction is formalized as principal component analysis on unlabelled data, and downstream learning as linear regression on a separate labelled dataset. In the high-dimensional regime, we derive exact expressions for training and generalisation error showcasing their dependence on representation dimensionality, unlabelled and labelled sample sizes, and task alignment. Our results show that pretrained representations strongly influence downstream generalisation, and we characterize the optimal representation size as a function of task parameters: with abundant pretraining data but scarce downstream data, maximally compressed representations are optimal, whereas with limited pretraining data, higher-dimensional representations generalise better. Furthermore, we establish an exact trade-off between pretraining and supervision, quantifying how much unlabelled data is required to replace a single labelled sample. Beyond our idealised model, we observe similar phenomenology in autoencoders and pretrained LLMs. Altogether, we highlight that optimising representation size is critical, giving conditions for when compression during pretraining improves generalisation.
【3】A Measure-Theoretic Analysis of Reasoning: Structural Generalization and Approximation Limits
标题:推理的度量理论分析:结构概括和逼近极限
链接:https://arxiv.org/abs/2605.19944
作者:Yuyang Zhang,Yifu Zhang,Xuehai Zhou,Xiaoyin Chen
备注:Preprint
摘要:While empirical scaling laws for LLM reasoning are well-documented, the theoretical mechanisms governing out-of-distribution (OOD) generalization remain elusive. We formalize reasoning via optimal transport, projecting discrete trajectories into a continuous metric space to quantify domain shifts using the Wasserstein-1 distance. Invoking Kantorovich duality, we bound OOD generalization via architectural Lipschitz continuity and functional approximation limits. This exposes two primary constraints. First, position-dependent attention (e.g., Absolute Positional Encoding) fails to preserve shift invariance, yielding an $\Omega(1)$ Lipschitz constant and expected risk, whereas shift-invariant mechanisms (e.g., Rotary Embeddings) preserve equivariance and bound the error. Second, by mapping sequential backtracking to a Dyck-$k$ language, we establish a strict circuit depth lower bound for $\text{TC}^0$ Transformers. Scaling physical layer depth is necessary to avert representation collapse, a constraint that scaling representation width cannot bypass due to irreducible approximation bounds in Barron spaces. Evaluations across 54 Transformer configurations on combinatorial search corroborate these bounds, demonstrating that generalization risk degrades monotonically with the Wasserstein domain shift.
【4】B-cos GNNs: Faithful Explanations through Dynamic Linearity
标题:B-cos GNNs:通过动态线性实现忠实解释
链接:https://arxiv.org/abs/2605.19778
作者:Joschka Groß,Mohammad Shaique Solanki,Verena Wolf
摘要:We introduce B-cos GNNs, an inherently explainable class of graph neural networks whose predictions decompose exactly into per-node, per-feature contributions via a single input-dependent linear map. B-cos GNNs use linear (sum-based) aggregation and replace non-linear message and update functions with B-cos transforms. This induces meaningful, task-specific weight-input alignment that is directly accessible through the model's dynamic linearity. Instance-level explanations follow from a single forward and backward pass, requiring no auxiliary explainer, modified learning objective, or perturbation procedure. Instantiated as a GIN, our approach trades small losses in predictive accuracy for state-of-the-art explainability across diverse synthetic and real-world benchmarks, producing explanations orders of magnitude faster than post-hoc baselines.
【5】A Family of Divergence Measures for Evaluating the Reconstruction Quality of Explainable Ensemble Trees
标题:评估可解释集成树重建质量的一族散度度量
链接:https://arxiv.org/abs/2605.19618
作者:Massimo Aria,Agostino Gnasso,Carmela Iorio
摘要:Validating interpretable surrogate models for ensemble learners requires measuring agreement between the ensemble's internal representation and its surrogate approximation, rather than mere association. Correlation-based approaches are scale-invariant and fail to detect systematic discrepancies in co-occurrence structure. We propose a statistical framework grounded in the agreement-association distinction, centered on the normalized Loss of Interpretability (nLoI). Rooted in the Cressie-Read power divergence family with lambda equal to 2, the nLoI admits a closed-form decomposition into within-node and between-node components, providing a unique diagnostic capability to identify precisely where and why reconstruction fails. The framework incorporates four complementary measures capturing distinct structural facets of approximation quality. A unified permutation testing procedure delivers valid inference for all measures within a single resampling pass. Theoretical properties, including boundedness and symmetry, are established for each metric. Monte Carlo simulations and empirical evaluations confirm exact Type I error control and demonstrate that these measures detect reconstruction fidelity gradients invisible to correlation-based alternatives. The framework is developed and illustrated in the context of Explainable Ensemble Trees (E2Tree), and empirical evaluation on three benchmark datasets illustrates the practical utility of the framework.
【6】Understanding Dynamics of Adam in Zero-Sum Games: An ODE Approach
标题:理解零和博弈中Adam的动力学:ODE方法
链接:https://arxiv.org/abs/2605.19392
作者:Yi Feng,Weiming Ou,Xiao Wang
摘要:The remarkable success of Adam in training neural networks has naturally led to the widespread use of its descent-ascent counterpart, Adam-DA, for solving zero-sum games. Despite its popularity in practice, a rigorous theoretical understanding of Adam-DA still lags behind. In this paper, we derive ordinary differential equations (ODEs) that serve as continuous-time limits of Adam-DA. These ODEs closely approximate the discrete-time dynamics of Adam-DA, providing a tractable analytical framework for understanding its behavior in zero-sum games. Using this ODE approach, we investigate two fundamental aspects of Adam-DA: local convergence and implicit gradient regularization. Our analysis reveals that the roles of the first- and second-order momentum parameters in zero-sum games are exactly the opposite of their well-documented effects in minimization problems. We validate these predictions through GAN experiments across multiple architectures and datasets, demonstrating the practical implications of this reversed momentum effect.
【7】Accurate, Efficient, and Explainable Deep Learning Approaches for Environmental Science Problems
标题:环境科学问题的准确、高效且可解释的深度学习方法
链接:https://arxiv.org/abs/2605.19366
作者:Jimeng Shi
备注:161 pages
摘要:Environmental science plays a pivotal role in safeguarding ecosystems, a domain driven by large-scale, heterogeneous data. In the big data era, artificial intelligence (AI) has emerged as a transformative tool for learning patterns and supporting decision-making. This dissertation develops AI-based approaches tailored to complex environmental science problems to achieve Environmental Intelligence, studying three specific challenges. First, we focus on flood prediction and management in coastal river systems. Conventional physics-based models are computationally intensive, limiting real-time application. To overcome this, we propose a deep learning (DL)-based model, WaLeF, for water level forecasting, and a forecast-informed DL model, FIDLAr, to manage water levels. Evaluated in a flood-prone coastal system in South Florida characterized by extreme rainfall and sea level fluctuations, FIDLAr outperforms baselines in accuracy and efficiency while providing interpretable outputs. Second, we target global weather prediction, which is challenged by massive data scale. Traditional physics methods are deterministic and computationally heavy. We propose CoDiCast, a conditional diffusion model tailored for probabilistic weather forecasting. Adapted from generative AI for predictive tasks, experiments show CoDiCast achieves accurate, efficient forecasts with explicit uncertainty quantification. Lastly, we address scientific question-answering in environmental science. When answering in-domain questions, large language models (LLMs) often suffer from hallucinations due to out-of-date or limited knowledge. While retrieval-augmented generation (RAG) retrieves domain-specific knowledge, existing methods trade off accuracy, efficiency, or explainability. We propose Hypercube-RAG, built on a structured text cube framework, which successfully exhibits all three properties simultaneously.
【8】IMLJD: A Computational Dataset for Indian Matrimonial Litigation Analysis
标题:IMLJD:印度婚姻诉讼分析的计算数据集
链接:https://arxiv.org/abs/2605.19346
作者:Joy Bose
备注:8 pages, 2 figures, 5 tables. Dataset available at huggingface.co/datasets/joyboseroy/imljd and Code at github.com/joyboseroy/imljd
摘要:We present IMLJD, an open dataset of 3,613 Indian court judgments covering matrimonial disputes under IPC Section 498A, the Protection of Women from Domestic Violence Act, and CrPC Section 482. The dataset covers the Supreme Court of India from 2000 to 2024 (1,474 cases) and the Karnataka High Court from 2018 to 2024 (2,139 cases), with structured outcome labels, metadata-derived indicators, and a knowledge graph. We find that 57.6% of quashing petitions succeed at the Supreme Court level compared to 39.7% at the Karnataka High Court level. On a matched 2018 to 2024 period, the SC quash rate is 59.3%, widening the differential to 19.6 percentage points and confirming the finding is robust to temporal adjustment. The dataset, code, and knowledge graph are released openly at https://github.com/joyboseroy/imljd and https://huggingface.co/datasets/joyboseroy/imljd.
【9】Inference-Time Scaling in Diffusion Models through Iterative Partial Refinement
标题:通过迭代部分细化实现扩散模型的推理时扩展
链接:https://arxiv.org/abs/2605.19317
作者:Taegu Kang,Jaesik Yoon,Sungjin Ahn
备注:Accepted at the ICLR 2026 Workshop on AI with Recursive Self-Improvement
摘要:Inference-time scaling has emerged as a major approach for improving reasoning capabilities, and has been increasingly applied to diffusion models. However, existing inference-time scaling methods for diffusion models typically rely on external verifiers or reward models to rank and select samples, limiting their scalability to settings where such evaluators are available and reliable. Moreover, while recent diffusion models perform sequential inference with region-wise, mixed-noise conditioning, inference-time scaling tailored to this setting remains relatively underexplored. We propose Iterative Partial Refinement (IPR), an inference-time scaling method for sequential diffusion that requires no external verifier. Starting from an already-generated sample, IPR re-noises a subset of regions and regenerates them conditioned on the remaining regions, enabling the model to revise earlier decisions under a richer context than was available during the initial generation. This iterative partial refinement produces more globally consistent samples without external verification. On reasoning tasks requiring global constraint satisfaction, IPR consistently improves performance: on MNIST Sudoku, the valid solution rate increases from 55.8% to 75.0%. These results show that iterative partial refinement alone can serve as an effective inference-time scaling strategy for diffusion models in sequential, mixed-noise settings. Code is available at: https://github.com/ahn-ml/IPR
【10】Counterfactual Likelihood Tests for Indirect Influence in Private Reasoning Channels
标题:私有推理通道中间接影响的反事实似然检验
链接:https://arxiv.org/abs/2605.19092
作者:Alexander Boesgaard Lorup
备注:12 pages, 4 figures, 5 tables
摘要:Reasoning systems increasingly separate intermediate computation into private and public channels, creating evaluation cases that look similar in transcripts: independent co-derivation, direct access to private content, and indirect influence through public communication. This paper presents a counterfactual likelihood test for measuring influence between private reasoning channels. The method replaces an upstream private block with a length-matched donor block, holds the public token sequence and downstream target fixed, and measures the downstream target's negative-log-likelihood shift. On a 7B role-channel reasoning model used for validation, textual probes are unreliable: raw n-gram overlap overstates leakage, corrected overlap remains noisy, and canary reproduction reports no discrimination. Counterfactual likelihood separates unmasked and masked conditions, while length matching controls a RoPE positional confound. In the hardened masked validation, reverse B-to-A influence is near zero, while A-to-B influence persists through public-speech hidden states. A multi-checkpoint validation across three checkpoints, five seeds, and 13,734 valid directional contrasts replicates this asymmetry. A graph-separation control that blocks private-to-public carrier edges produces bit-identical natural and counterfactual scores across all 13,734 control evaluations, identifying the tested public-channel pathway as the complete carrier of the measured counterfactual signal under the implemented role-visibility mask. The results show that private-channel evaluation should report direct and indirect influence separately, and that counterfactual likelihood probes provide a practical default for measuring these boundaries.
【11】A Geometric Analysis of Sign-Magnitude Asymmetry in a ReLU + RMSNorm Block under Ternary Quantization
标题:三值量化下ReLU + RMSNorm块中符号-幅度不对称性的几何分析
链接:https://arxiv.org/abs/2605.18933
作者:Lei Dong
备注:53 pages, 2 figures, 21 tables, 7 appendices
摘要:Pre-norm Transformers with RMSNorm tolerate ternary {-1,0,+1} weight quantization with surprisingly small loss (Ma et al., 2024). We give a geometric explanation via sign-magnitude decomposition of weight perturbations. In a two-layer ReLU + RMSNorm model with i.i.d. Gaussian weights, sign-flips produce $π/(π-2) \approx 2.75$ times more transverse output energy than sign-preserving magnitude perturbations of equal Frobenius norm, as the flip rate $p \to 0$ (Theorem 3). The mechanism: ReLU creates a hidden-space directional asymmetry between the two perturbation types, which RMSNorm's transverse-projection Fréchet derivative selectively exposes. Sign-quantization error is itself a sign-preserving perturbation with angular alignment $\cos^2 \to 2/π$ (Theorem 4); its post-ReLU radial fraction ($0.365$) matches the pre-ReLU value $1-2/π$ within $0.4\%$, so ReLU is approximately transparent to ternary error. Multi-layer compounding of the $2.75\times$ factor is not experimentally supported; the gap to real-model sign sensitivity arises from outlier features violating delocalization. For an input dimension with amplitude $α$, a single sign-flip produces post-ReLU energy amplified by $R \approx nα^2$ relative to a delocalized entry. On TinyLlama-1.1B, at linear response ($p \leq 0.5\%$), count-matched NLL leverage stabilizes at $\sim 10\times \approx n\mathbb{E}[α^2]$, matching the per-entry theory; the all-column NLL ratio of $5.0\times$ falls within $R_{\mathrm{col}} \leq 19$ ($67\times$ PPL gap reflects metric nonlinearity). Measured outlier $α$ at layer 12 (median $0.024$, max $0.26$) confirms heavy-tailed concentration. The Bussgang constant $2/π$, RMSNorm geometry, and ReLU half-space structure together explain sign-magnitude asymmetry in pre-norm models, with $R \propto nα^2$ accounting for real-model deviations.
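The closed-form constants quoted in this abstract are easy to reproduce numerically. The sketch below evaluates them and runs a small Monte Carlo check of the Bussgang constant $2/π$, read here as the squared cosine between a standard Gaussian vector and its sign vector. It assumes pure sign quantization with no zero level, a simplification of the ternary setting, and the function name `sign_alignment_cos2` is hypothetical.

```python
import math
import random

# Closed-form constants quoted in the abstract.
flip_vs_magnitude = math.pi / (math.pi - 2)  # about 2.75
bussgang = 2 / math.pi                       # about 0.637
one_minus_bussgang = 1 - 2 / math.pi         # about 0.363


def sign_alignment_cos2(n, seed=0):
    """Squared cosine between a Gaussian vector w and sign(w)."""
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(n)]
    dot = sum(abs(x) for x in w)        # <w, sign(w)> equals sum of |w_i|
    norm_w_sq = sum(x * x for x in w)
    return dot * dot / (norm_w_sq * n)  # ||sign(w)||^2 equals n


estimate = sign_alignment_cos2(200_000)
print(round(flip_vs_magnitude, 3), round(one_minus_bussgang, 3))
print(round(estimate, 3), round(bussgang, 3))
```

For large `n` the estimate approaches $(\mathbb{E}|w|)^2 / \mathbb{E}[w^2] = 2/π$, since $\mathbb{E}|w| = \sqrt{2/π}$ for a standard Gaussian.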
【12】Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era
标题:推理可移植性:指导RLVR时代MLLM的持续学习
链接:https://arxiv.org/abs/2605.18903
作者:Qiuhe Hong,Yuyang Liu,Shuo Yang,Tiantian Peng,Fei Zhu,Yonghong Tian
摘要:Vision-Language Models in Continual Learning (VLM-CL) aim to continuously adapt to new multimodal tasks while retaining prior knowledge. The emerging paradigm that couples Multimodal Large Language Models (MLLMs) with Reinforcement Learning with Verifiable Rewards (RLVR) calls for a new pattern to guide continual adaptation. Advances in reasoning capability now make it feasible to impose constraints at the reasoning level. We formalize portability, a sample-level measure of how reusable the previous policy's behavior is on a new task, and empirically show that reasoning-level signals remain reliable on out-of-distribution samples while answer-level signals do not. We instantiate this as Reasoning Portability (RP) and propose Reasoning-based Dynamic Balance Continual Learning (RDB-CL), which modulates the per-sample Kullback-Leibler regularization in RLVR according to RP: a tight anchor preserves reusable reasoning on high-RP samples, while a relaxed anchor on low-RP samples permits exploration of new reasoning pathways. Experiments show that RDB-CL consistently outperforms baselines, improving Last accuracy by +12.0% over the vanilla RLVR baseline.
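The idea of a tighter anchor on high-RP samples and a looser one on low-RP samples can be illustrated with a per-sample KL weight. This is a minimal sketch under stated assumptions: the abstract does not give the mapping from RP to the KL coefficient, so the linear interpolation, the range `beta_min` to `beta_max`, and both function names are hypothetical.

```python
def kl_coefficient(rp, beta_min=0.01, beta_max=0.1):
    """Map a portability score in [0, 1] to a per-sample KL weight.

    Linear interpolation is an assumption made for illustration only.
    """
    rp = min(1.0, max(0.0, rp))  # clamp to the unit interval
    return beta_min + (beta_max - beta_min) * rp


def regularized_objective(rewards, kls, rps):
    """Mean of reward minus the per-sample weighted KL penalty."""
    terms = [r - kl_coefficient(rp) * kl for r, kl, rp in zip(rewards, kls, rps)]
    return sum(terms) / len(terms)


# High RP gives a tight anchor (large weight); low RP gives a relaxed one.
tight = kl_coefficient(0.9)
relaxed = kl_coefficient(0.1)
print(tight, relaxed)
print(regularized_objective([1.0, 0.0], [0.5, 2.0], [0.9, 0.1]))
```

With this shape, a sample whose previous-policy reasoning transfers well is held close to the reference policy, while a low-RP sample pays a smaller penalty for moving away from it.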
【13】Auditing Reasoning-Trace Memorization Claims after Unlearning with Head-Conditioned Canaries
标题:使用头部条件金丝雀审计遗忘学习后的推理轨迹记忆声明
链接:https://arxiv.org/abs/2605.18891
作者:Yanhang Li,Zhichao Fan,Zexin Zhuang
摘要:Evaluations of unlearning on reasoning models sometimes show a bypass pattern. The answer side looks unlearned, but the model's own thinking trace keeps emitting the forgotten content, and the gap is taken as evidence that the weights still remember. We audit this reading on DeepSeek-R1-Distill-Qwen-7B with LoRA-memorized fictional authors and NPO unlearning, conditioned on a six-token canary head. On one seed, swapping the thinking trace for a short non-canary prefill on the same weights drops the answer rate by as much as the bypass gap itself, whether the prefill mimics the training template or not. On a second seed the bypass gap shrinks rather than vanishing, and the prefill swap reverses direction and brings the answer rate to ceiling. A positive parser-split bypass gap thus does not by itself identify hidden weight-level memorization, and does not rule it out either. On a different distillate the same metric flips sign because the parser cannot find the closing tag. We recommend a decode-time template swap as a cheap sanity check alongside the canonical audit.
【14】SPHERICAL KV: Angle-Domain Attention and Rate-Distortion Retention for Efficient Long-Context Inference
标题:Spherical KV:面向高效长上下文推理的角度域注意力与率失真保留
链接:https://arxiv.org/abs/2605.18856
作者:Anay Chauhan,Gurucharan Marthi Krishna Kumar,Arion Das,Amit Dhanda,Vinija Jain,Aman Chadha,Amitava Das
摘要:Long-context inference is increasingly constrained by the KV cache: resident memory grows with context length, and decoding becomes limited by repeated High Bandwidth Memory (HBM) streaming rather than arithmetic. Existing methods such as eviction, windowing, quantization, and offloading reduce footprint, but often leave the critical-path bottleneck only partially addressed, especially when compressed states must still be reconstructed into dense vectors during decoding. We present Spherical KV, a long-context inference method that treats KV allocation as a rate-distortion problem grounded in attention geometry for efficient decoding. The method is built on two ideas: (i) represent directional information cheaply in the decode hot loop, and (ii) allocate retention and precision according to estimated future utility. Its first component, Angle-Domain Attention (ADA), stores keys in a spherical parameterization consisting of a scalar radius and compact angle codes, and computes attention logits directly from these codes without reconstructing dense keys. This preserves a paged, block-local, fusion-friendly decode path and directly targets HBM traffic in realistic serving settings. Its second component, Rate-Distortion Retention (RDR), jointly chooses keep/drop decisions and precision tiers per token and head under a fixed budget, producing tier-homogeneous pages with lightweight metadata and coalesced reads. Together, ADA and RDR provide a deployment-oriented mechanism for reducing KV residency while preserving decode efficiency.
【15】INAR-VL: Input-Aware Routing for Edge-Cloud Vision-Language Inference
标题:INAR-VL:边缘云视觉语言推理的输入感知路由
链接:https://arxiv.org/abs/2605.18853
作者:Ahmed Šabanović,Paul Joe Maliakel,Ivona Brandić
备注:8 pages, 3 figures
摘要:Edge deployment of Vision-Language Models (VLMs) faces a tradeoff between latency and accuracy: cloud execution provides high-quality predictions but incurs communication delay and energy cost, while edge-only execution is faster but less accurate due to limited model capacity. This trade-off is further complicated by heterogeneity in image quality and reasoning complexity, making static placement suboptimal. We present INAR-VL, a lightweight edge-cloud routing system for multimodal inference in a two-tier deployment. INAR-VL maintains complementary VLMs across edge and cloud and uses lightweight image and text complexity signals to guide routing and model selection, executing simple queries locally while offloading complex ones when beneficial. Evaluation on visual question answering shows that INAR-VL executes 36% of requests on the edge, reduces latency by 24%, lowers energy by 26%, and preserves 97% of cloud-level accuracy.
【16】Accurate Evaluation of Quickest Changepoint Detectors via Non-parametric Survival Analysis
标题:通过非参数生存分析准确评估最快变点检测器
链接:https://arxiv.org/abs/2605.18798
作者:Taiki Miyagawa,Akinori F. Ebihara
备注:Accepted to ICML 2026. GitHub: https://github.com/TaikiMiyagawa/Kaplan-Meier-Average-Run-Length
摘要:We propose non-parametric estimators for the average run length (ARL) and average detection delay (ADD) in quickest changepoint detection (QCD) under finite and irregular sequence lengths. Although ARL and ADD are widely used as optimality criteria in theoretical and simulation studies, their application to real-world datasets is hindered by limited and irregular sequence lengths. To address this issue, we propose non-parametric estimators for the ARL and ADD, termed KM-ARL and KM-ADD, by drawing an analogy between QCD and survival analysis to model detection probabilities under sequence truncation. We derive estimation bias bounds and prove that they are asymptotically unbiased unless extrapolation is required. Experiments on simulated and real-world datasets demonstrate their practical utility, enhancing robustness against limited and irregular sequence lengths, improving interpretability, and facilitating empirical, intuitive model selection. Our Python code is provided at https://github.com/TaikiMiyagawa/Kaplan-Meier-Average-Run-Length, offering ready-to-use implementations for practitioners.
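The analogy with survival analysis can be made concrete by treating a sequence that ends before any alarm as a censored observation. The sketch below is a textbook Kaplan-Meier estimate followed by a restricted mean; it is not the authors' KM-ARL estimator (see their repository for that), and the function names and toy data are hypothetical.

```python
def kaplan_meier(times, detected):
    """Return [(t, S(t))] at each distinct alarm time, with S(t) = P(T > t)."""
    event_times = sorted({t for t, d in zip(times, detected) if d})
    surv, curve = 1.0, []
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        alarms = sum(1 for u, d in zip(times, detected) if d and u == t)
        surv *= 1.0 - alarms / at_risk
        curve.append((t, surv))
    return curve


def restricted_mean_run_length(times, detected):
    """Sum S(t) for t = 0 .. tau-1, where tau is the longest observed sequence."""
    curve = kaplan_meier(times, detected)
    tau = max(times)
    total, surv, idx = 0.0, 1.0, 0
    for t in range(tau):
        while idx < len(curve) and curve[idx][0] <= t:
            surv = curve[idx][1]
            idx += 1
        total += surv
    return total


# Toy data: alarm time if detected, otherwise the (truncated) sequence length.
times = [3, 5, 5, 8, 10]
detected = [True, False, True, True, False]

alarm_times = [t for t, d in zip(times, detected) if d]
naive_mean = sum(alarm_times) / len(alarm_times)
km_mean = restricted_mean_run_length(times, detected)
print(naive_mean, km_mean)
```

Averaging only the sequences that raised an alarm ignores the two truncated sequences and gives about 5.33, while the Kaplan-Meier restricted mean gives about 7.0 because censored sequences still count as "no alarm yet" up to their length.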
【17】Prognostic Value of Lung Ultrasound Biomarkers for Readmission Risk in Congestive Heart Failure: A Pilot Data-Driven Analysis
标题:肺超声生物标志物对充血性心力衰竭再入院风险的预后价值:一项初步数据驱动分析
链接:https://arxiv.org/abs/2605.18878
作者:Jana Armouti,Laura Hutchins,Jacob Duplantis,Thomas Deiss,Thales Nogueira Gomes,Keyur H. Patel,Seema Walvekar,Shane Guillory,Thomas H. Fox,Amita Krishnan,Ricardo Rodriguez,Bennett DeBoisblanc,Deva Ramanan,John Galeotti,Gautam Gare
摘要:Hospital readmission within 30 days of discharge is a leading driver of morbidity, mortality, and avoidable healthcare expenditure in congestive heart failure (CHF). Current clinical risk stratification tools rely primarily on non-imaging data and exhibit limited predictive performance. Point-of-care lung ultrasound (LUS) offers a sensitive, noninvasive window into the pulmonary congestion that characterizes CHF decompensation, yet its prognostic utility for readmission prediction remains largely unexplored. We present a pilot feasibility study, the first systematic machine learning study using B-mode LUS acquired during hospitalization to predict 30-day CHF readmission. Quantitative spatiotemporal embeddings are extracted from a pretrained Temporal Shift Module (TSM) ResNet-18 encoder, and interpretable biomarker features are separately evaluated. Through structured ablations over lung view, temporal representation, multi-view fusion, and cross-lung augmentation, we identify the key imaging factors driving readmission risk. Our findings reveal that (1) dependent lower-lung regions (Left-3, Right-3) carry the strongest prognostic signal, consistent with their greater susceptibility to hydrostatic congestion; (2) temporal difference features between sequential examinations substantially outperform single-timepoint representations, highlighting the importance of capturing disease trajectory; and (3) multi-view feature concatenation yields the best overall performance, with our top MLP model achieving an F1 score of 0.80 (95% CI: 0.62-0.96). Biomarker analysis further reveals that pleural-line abnormalities, including breaks and indentations, are as informative as the canonical A-line and B-line markers. These results support POCUS-derived biomarkers as practical, interpretable tools for noninvasive CHF risk stratification.
检测相关(5篇)
【1】SAGE: Scalable Automatic Gating Ensemble for Confident Negative Harvesting in Fraud Detection
标题:SAGE:用于欺诈检测中高置信负样本采集的可扩展自动门控集成
链接:https://arxiv.org/abs/2605.20157
作者:Sudheer Tubati,Amit Goyal
摘要:Music streaming fraud, where bad actors artificially inflate stream counts to manipulate chart rankings and royalty payments, poses a significant threat to streaming services and legitimate content creators. Traditional fraud detection approaches struggle with a critical challenge: many legitimate edge cases, including super-fans and sleep-music sessions, exhibit activity patterns that closely mimic those of coordinated fraud. We present SAGE, a novel counterfactual-aware negative harvesting approach that combines SimHash-based stratified sampling with a modular gating ensemble for confident negative identification from unlabeled data. Our ensemble architecture employs pluggable statistical gates (currently instantiated with Mahalanobis distance and k-NN density) with configurable voting thresholds enabling adaptive precision-recall trade-offs. This addresses the representation bias problem in Positive-Unlabeled learning by ensuring comprehensive coverage of rare behavioral cohorts through floor-constrained sampling. Evaluation demonstrates strong precision and recall on held-out data. The approach generalizes across fraud detection domains, achieving strong performance on both customer-level and artist-level fraud without modification to the core methodology.
【2】Your Neighbors Know: Leveraging Local Neighborhoods for Backdoor Detection in Decentralized Learning
标题:你的邻居知道:在去中心化学习中利用局部邻域进行后门检测
链接:https://arxiv.org/abs/2605.19969
作者:Sayan Biswas,Antoine Boutet,Davide Frey,Romaric Gaudel,Rachid Guerraoui,Maxime Jacovella,Anne-Marie Kermarrec,Dimitri Lerévérend,François Taïani,Martijn de Vos
备注:41 pages, 10 figures
摘要:Decentralized learning (DL) is an emerging machine learning paradigm where nodes collaboratively train models without a central server. However, the collaborative nature of DL makes it vulnerable to backdoor attacks, where a model is taught to behave normally on standard inputs while executing hidden, malicious actions when encountering data with specific triggers. Backdoor attacks in DL remain understudied and existing defenses often overlook DL constraints. We introduce Argus, a novel backdoor detection framework native to DL that requires neither a central coordinator nor prior knowledge of the trigger. In Argus, honest nodes locally analyze received model updates to identify potential backdoor triggers. Nodes then collectively share their triggers with their neighbors and use a structural similarity metric to separate true backdoors from false alarms induced by data heterogeneity. A key insight is that false positive triggers exhibit inconsistencies across participants while true positive ones show consistent patterns. Model updates that fail this collaborative test are rejected, and persistently malicious senders are eventually evicted. We provide the first theoretical convergence guarantees for a DL-specific backdoor detection mechanism, showing that filtering out suspicious model updates with high probability preserves a convergence rate comparable to standard DL. We implement and evaluate Argus on three standard datasets and against three state-of-the-art baselines. Across settings, Argus reduces attack success rates by up to 90 points compared to no defense, while preserving model utility within 5 percentage points of an omniscient oracle. Furthermore, the effectiveness of Argus compared to baselines improves as data heterogeneity increases.
【3】Scalable, Energy-Efficient Optical-Neural Architecture for Multiplexed Deepfake Video Detection
标题:用于多路Deepfake视频检测的可扩展、节能的光神经架构
链接:https://arxiv.org/abs/2605.19360
作者:Parnian Ghapandar Kashani,Shiqi Chen,Aydogan Ozcan
备注:30 Pages, 8 Figures
摘要:The rapid proliferation of AI-generated visual media has created an urgent need for efficient, trustworthy deepfake detection systems. However, existing deep learning-based detection methods rely on computationally intensive and energy-demanding inference algorithms, limiting their scalability. Here, we present a hybrid digital-analog deepfake video detection framework that combines a lightweight digital front-end with a spatially multiplexed optical decoding back-end for massively parallel analog inference through a programmable spatial light modulator. By simultaneously processing 15 or more video streams within a single optical propagation pass, the system enables high-throughput and accurate video-level authenticity prediction at reduced computational cost compared with purely digital methods. We validated this hybrid deepfake video processor using different datasets spanning classical face-swapping, real-world deepfake recordings, and fully AI-generated videos. Using a spatially multiplexed experimental set-up operating in the visible spectrum, we achieved average deepfake detection accuracy, sensitivity and specificity of 97.79%, 99.86% and 95.72%, respectively, on the Celeb-DF video dataset with 15 videos tested in parallel in a single optical pass per inference. The multiplexed optical decoder also demonstrates resilience against various types of video degradation, noise, compression, experimental misalignments and black-box adversarial attacks. Our results show that integrating optical computation into AI inference enables simultaneous gains in throughput, energy efficiency, and adversarial robustness - three properties that are difficult to achieve together in purely digital systems.
【4】Quantum Machine Learning for Cyber-Physical Anomaly Detection in Unmanned Aerial Vehicles: A Leakage-Free Evaluation with Proxy-Audited Feature Sets
标题:用于无人飞行器网络物理异常检测的量子机器学习:使用代理审计特征集的无泄漏评估
链接:https://arxiv.org/abs/2605.19233
作者:Carlos A. Durán Paredes,Javier E. León Calderón,Nicolás Sánchez Perea,German Darío Díaz,Camilo Segura Quintero
备注:10 pages, 7 figures, 1 table; open Qiskit 2.x implementation available at https://github.com/Carlosandp/TLM-UAV-Quantum-Anomaly-Detection
摘要:Unmanned aerial vehicles (UAVs) are cyber-physical systems whose attack surface spans networked avionics and on-board sensor fusion: a compromised GPS or battery module can mimic a benign mission segment and evade naive anomaly detectors. We present a leakage-free evaluation of quantum machine learning for UAV anomaly detection on the multi-sensor TLM:UAV benchmark. Three contributions support the study. (i) A group-aware temporal protocol (B2) partitions the dataset into ten contiguous TimeUS blocks and evaluates over ten seeds, eliminating the inflation produced by random stratified splits that mix neighbouring samples. (ii) A three-mode feature audit (full/loose/strict) quantifies how much accuracy stems from instantaneous physical signals versus contextual proxies (cumulative energy, battery state, GPS trajectory). (iii) A hybrid XGBoost + Data Reuploading (DRU) classifier is benchmarked against five paired non-linear controls (raw, PCA, polynomial-2, random-RBF, and an untrained DRU map) under identical budgets. The standalone DRU does not consistently match the strongest classical baseline across seeds; however, the trained-DRU hybrid is the only model whose mean F1 macro shifts upward from full to strict (+0.05), a directional signal that the per-seed standard deviations prevent from being interpreted as a statistically established difference. The trained-DRU hybrid also records the lowest mean false-alarm rate under proxy-free evaluation, subject to the inter-seed variance reported. We frame this as an incremental, reproducible quantum-enhanced hybrid benefit, and provide an open Qiskit 2.x implementation as a benchmark for cybersecurity analytics in NISQ-era aerospace systems.
【5】Fast and Lightweight Backdoor Detection via Head Random Probing
标题:通过头部随机探测快速、轻量级的后门检测
链接:https://arxiv.org/abs/2605.18908
作者:Yinbo Yu,Xueyu Yin,Jing Fang,Chunwei Tian,Qi Zhu,Jiajia Liu,Daoqiang Zhang
摘要:Deep neural networks (DNNs) remain critically vulnerable to backdoor attacks. Existing post-training detectors often require clean or surrogate data, gradients, or iterative trigger reconstruction, leading to high computational costs and limited robustness under practical model-auditing scenarios. In this paper, we propose HTell, a fast and lightweight data-free backdoor detector based on head random probing. Instead of reconstructing diverse trigger patterns, HTell inspects their unified manifestation in the prediction head: backdoored models tend to exhibit abnormal response concentration on the target class under random latent probes. HTell generates architecture-aware random latent probes, feeds them directly into the model head, and detects backdoors by analyzing class-wise response statistics, without accessing real or surrogate data, model gradients, or parameter optimization. We evaluate HTell on a large-scale benchmark containing more than 6,000 backdoored models and over 700 clean models, covering 4 datasets, 14 architectures, and 21 types of backdoor attacks. HTell achieves 99.03% true positive rate and 2.11% false positive rate with only 12.69 ms/model detection latency, reducing the time cost by over 30,000$\times$ compared with representative gradient-based detectors. These results demonstrate that head random probing provides an accurate, robust, and efficient solution for large-scale data-free backdoor model auditing.
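The probing idea can be illustrated on a toy linear head: feed random latent vectors straight into the head and look at how concentrated the predicted classes are. This is a rough sketch, not HTell itself. The linear head, the inflated bias used as a stand-in for a backdoored target class, and the max-share statistic are all assumptions made for illustration, and every name in the code is hypothetical.

```python
import random


def make_head(num_classes, dim, rng, target=None, target_bias=0.0):
    """Toy linear head: Gaussian weights, zero bias except an optional inflated class."""
    weights = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_classes)]
    bias = [0.0] * num_classes
    if target is not None:
        bias[target] = target_bias  # crude stand-in for a backdoor target class
    return weights, bias


def concentration_score(head, dim, rng, num_probes=2000):
    """Largest share of random latent probes assigned to any single class."""
    weights, bias = head
    counts = [0] * len(weights)
    for _ in range(num_probes):
        probe = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        logits = [sum(w * x for w, x in zip(row, probe)) + b
                  for row, b in zip(weights, bias)]
        counts[logits.index(max(logits))] += 1
    return max(counts) / num_probes


rng = random.Random(0)
clean_head = make_head(10, 16, rng)
suspect_head = make_head(10, 16, rng, target=3, target_bias=20.0)
clean_score = concentration_score(clean_head, 16, rng)
suspect_score = concentration_score(suspect_head, 16, rng)
print(clean_score, suspect_score)
```

No data, gradients, or optimization are involved: the only inputs are random probes and the head's parameters, which is why this style of check can be fast.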
分类|识别(6篇)
【1】INSHAPE: Instance-Level Shapelets for Interpretable Time-Series Classification
标题:INSHAPE:用于可解释时间序列分类的实例级shapelet
链接:https://arxiv.org/abs/2605.20088
作者:Seongjun Lee,Seokhyun Lee,Changhee Lee
备注:Accepted to IJCAI 2026. 25 pages
摘要:Discovering shapelets -- i.e., discriminative temporal patterns within time series -- has been widely studied to address the inherent complexity of time-series classification (TSC) and to make model decision-making processes more transparent. However, existing methods primarily focus on population-level shapelets optimized across the entire dataset, which leads to two fundamental limitations: (i) population-level patterns often misalign with instance-specific features, resulting in suboptimal performance and potentially misleading interpretations, and (ii) most methods treat shapelets as independent entities, overlooking important temporal dependencies and interactions among multiple patterns. To address these limitations, we propose INSHAPE, an interpretable TSC framework that discovers variable-length, discriminative temporal patterns specific to each time series. INSHAPE identifies these patterns as non-overlapping segments and models their temporal dependencies, thereby providing clear instance-level interpretations while achieving strong predictive performance. Furthermore, INSHAPE bridges local and global interpretability through a bottom-up approach, aggregating instance-level shapelets into prototypical (population-level) shapelets. Extensive experiments on 128 UCR and 30 UEA benchmark datasets show that INSHAPE consistently outperforms state-of-the-art shapelet-based methods while providing more intuitive and interpretable insights.
【2】MSAlign: Aligning Molecule and Mass Spectra Foundation Models for Metabolite Identification
标题:MSAlign:对齐分子与质谱基础模型以进行代谢物识别
链接:https://arxiv.org/abs/2605.19752
作者:Paul Krzakala,Gabriel Melo,Camille Lançon,Charlotte Laclau,Rémi Flamary,Etienne Thévenot,Florence d'Alché-Buc
摘要:Accurately identifying metabolites, i.e., small molecules, from mass spectrometry data remains a core challenge in metabolomics, with broad applications in drug discovery, environmental analysis, and clinical research. We address the Molecule Retrieval task, which consists in recovering the chemical structure of a metabolite from its MS/MS spectrum given a set of candidate molecules. While the recent release of benchmark datasets such as MassSpecGym and Spectraverse has considerably accelerated the development of novel machine learning approaches, the complexity of data preprocessing pipelines and the lack of unified implementations make methods and results difficult to reproduce and compare. We make three contributions. First, we propose a unified framework encompassing recent approaches based on representation alignment and contrastive learning. Second, we introduce MSAlign, inspired by multimodal alignment in vision-language models, which learns a shared representation space by aligning two frozen foundation models (DreaMS for mass spectra and ChemBERTa for molecules) through lightweight MLP projections trained with a candidate-based contrastive objective. MSAlign is simple to implement, fast to train and consistently outperforms existing approaches across all benchmarks. Third, we investigate a long-standing evaluation problem: data splitting strategies in molecule retrieval implicitly trade off data leakage against domain shift. We formalize this tension by introducing a quantitative measure of distribution shift, and use it to evaluate splitting strategies in existing benchmarks. All datasets, splits, candidate sets, and a unified implementation of MSAlign and baselines are publicly released to support reproducible research.
【3】MAM-CLIP: Vision-Language Pretraining on Mammography Atlases for BI-RADS Classification
标题:MAM-CLIP:基于乳腺X线摄影图谱的视觉语言预训练,用于BI-RADS分类
链接:https://arxiv.org/abs/2605.19359
作者:Halil Ibrahim Gulluk,Olivier Gevaert
摘要:Deep learning methods have demonstrated promising results in predicting BI-RADS scores from mammography images. However, the interpretation of these images can vary, leading to discrepancies even among radiologists. Given the inherent complexity of mammograms, training classification models solely on image labels often yields limited performance. To address this challenge, we curated 2313 mammogram images and their corresponding captions from two mammography atlases. Our proposed approach employs a multi-modal model that uses a pretrained PubMedBERT as the language component. By training this model on image-text pairs with contrastive learning, we enable the vision encoder to absorb the rich information contained in the captions, thereby improving its understanding of mammography findings. We then fine-tune the vision encoder on two datasets for BI-RADS prediction, achieving superior performance compared with models trained without this pretraining, particularly when labeled samples are scarce. The improvement in the 3-class average F1 score ranges from +1% to +14%: a +1% increase with 40K training samples, and a +14% increase with 1K samples. Furthermore, our experiments reveal that 2K image-text pairs from mammography atlases can be more informative than 2K labeled samples for label prediction, with an average margin of +1.1% when more than 10K training samples are available. Overall, our work provides a vision-language model for mammography and highlights the value of textual information from mammography atlases. In addition, we publicly release preprocessed mammography images of the TEKNOFEST dataset. The training code, pre-trained model weights, data extraction scripts, and the released dataset are publicly available at: https://github.com/igulluk/MAM-CLIP
【4】An Objective Performance Evaluation of the LSTM Networks in Time Series Classification
标题:LSTM网络在时间序列分类中的客观性能评估
链接:https://arxiv.org/abs/2605.19311
作者:Sooraj Sunil,Balakumar Balasingam
备注:Accepted in 2026 29th International Conference on Information Fusion
摘要:The rapid adoption of deep learning has increasingly led to data-driven models replacing classical model-based algorithms, even in domains governed by well-understood physical laws. While data-driven models, such as long short-term memory (LSTM) networks, have become a popular choice for time-series analysis, their performance relative to model-based approaches in structured environments is rarely evaluated objectively. This paper presents a performance evaluation framework comparing an LSTM classifier against a model-based expectation maximization (EM) classifier for binary time-series classification. The evaluation is conducted on two scalar linear Gaussian state space models differing only in their noise statistics, where the Kalman filter likelihood ratio test with true parameters serves as a reference for the best achievable classification performance. Through Monte Carlo simulations, the classifiers are evaluated across three axes: task difficulty, controlled by the separation in process or measurement noise between the two models; sequence length; and training dataset size. The results show that the EM classifier, which exploits the known model structure, performs strongly when the data conform to the assumed model class. The LSTM classifier requires a larger separation in noise statistics to achieve reliable classification, and its performance saturates below the reference classifier when the models differ only in measurement noise, regardless of sequence length or training dataset size.
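The reference classifier described here, a Kalman filter likelihood ratio test with true parameters, can be sketched for a scalar linear Gaussian state space model. The model form $x_k = a x_{k-1} + w_k$, $z_k = x_k + v_k$ and all parameter values below are assumptions chosen for illustration; the paper's actual settings are not given in the abstract, and the function names are hypothetical.

```python
import math
import random


def kalman_loglik(z, a, q, r, x0=0.0, p0=0.0):
    """Log-likelihood of observations z under x_k = a*x_{k-1} + w, z_k = x_k + v."""
    x, p, ll = x0, p0, 0.0
    for zk in z:
        x_pred = a * x                 # predicted state
        p_pred = a * a * p + q         # predicted state variance
        s = p_pred + r                 # innovation variance
        innov = zk - x_pred
        ll += -0.5 * (math.log(2 * math.pi * s) + innov * innov / s)
        k = p_pred / s                 # Kalman gain
        x = x_pred + k * innov
        p = (1 - k) * p_pred
    return ll


def simulate(n, a, q, r, rng):
    """Draw n observations; the state starts at exactly 0 (hence p0 = 0 above)."""
    x, z = 0.0, []
    for _ in range(n):
        x = a * x + rng.gauss(0.0, math.sqrt(q))
        z.append(x + rng.gauss(0.0, math.sqrt(r)))
    return z


def classify(z, model0, model1):
    """Likelihood ratio test with equal priors: return 1 if model1 is more likely."""
    return int(kalman_loglik(z, *model1) > kalman_loglik(z, *model0))


rng = random.Random(0)
model0 = (0.9, 0.1, 1.0)  # (a, q, r): the two models differ only in process noise q
model1 = (0.9, 2.0, 1.0)
trials, correct = 40, 0
for i in range(trials):
    label = i % 2
    a, q, r = model1 if label else model0
    z = simulate(200, a, q, r, rng)
    correct += classify(z, model0, model1) == label
accuracy = correct / trials
print(accuracy)
```

Shrinking the gap between the two `q` values, or shortening the sequences, makes the task harder, which corresponds to the task-difficulty and sequence-length axes mentioned in the abstract.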
【5】CLIC: Contextual Language-Informed Cardiac Pathology Classification
标题:CLIC:上下文语言引导的心脏病理分类
链接:https://arxiv.org/abs/2605.19132
作者:Giovani D. Lucafo,Rafael da Costa Silva,João Lucas Luz Lima Sarcinelli,Andre Guarnier De Mitri,Diego Furtado Silva
备注:6 pages, 2 figures, accepted at the ICLR 2026 Workshop on Time Series in the Age of Large Models (TSALM)
摘要:The electrocardiogram (ECG) is the gold standard for non-invasive diagnosis of cardiac pathologies and is a fundamental pillar of cardiovascular medicine. Recent progress in deep learning has led to the development of robust automated classifiers that achieve high performance by processing raw physiological signals. However, in clinical practice, diagnosis is rarely based solely on the signal. Cardiologists commonly support their interpretation with the patient's characteristics and the specific data-acquisition context. Despite this, most current algorithms remain restricted to signal-only analysis, failing to integrate technical metadata and demographic variables. This paper proposes Contextual Language-Informed Cardiac pathology classification (CLIC), a multimodal framework that significantly enhances diagnostic precision by encoding these variables through natural language. We demonstrate that translating patient-level contextual data into descriptive text provides an informative anchor that helps the model disambiguate complex physiological patterns. We further investigate the use of Large Language Models to synthesize richer clinical descriptions and observe that, while these generated texts remain competitive, controlled template-based contextual clinical text leads to consistent improvements in downstream classification performance.
【6】Navigating the Emotion Tree: Hierarchical Hyperbolic RAG for Multimodal Emotion Recognition
标题:导航情感树:用于多模式情感识别的分层双曲RAG
链接:https://arxiv.org/abs/2605.18884
作者:Zeheng Wang,Bo Zhao,Yijie Zhu,Zhishu Liu,Hui Ma,Ruixin Zhang,Shouhong Ding,Qianyu Xie,Zitong Yu
摘要:Multimodal emotion recognition aims to integrate text, audio, and video sources to understand human affective states. Although multimodal large language models excel at multimodal reasoning, they typically treat emotion categories as independent labels, ignoring the rich hierarchical taxonomy of human psychology. Moreover, lacking external contextual knowledge makes them highly susceptible to over-interpreting noisy cues, further complicating fine-grained emotion classification. To address these issues, we propose \textbf{HyperEmo-RAG}, a retrieval-augmented generation framework that leverages a structured emotional knowledge base. Our framework introduces two key innovations. 1) Hierarchical hyperbolic grounding. Recognizing the inherent hierarchical tree structure of emotion taxonomies, we jointly embed hierarchical emotion labels and multimodal samples into a continuous hyperbolic space (Poincaré ball) and design a hierarchical beam-search deliberation process that progressively retrieves samples from coarse to fine-grained levels. 2) Structured evidence injection. Based on the retrieved evidence, we construct an evidence graph and inject the structured knowledge as explicit cognitive context into the LLM through a Tree-Aware Attention mechanism and an EmotionGraphFormer, preserving the integrity of graph-structured information. Experiments on multiple datasets demonstrate that HyperEmo-RAG significantly outperforms existing methods.
表征(6篇)
【1】Atoms of Thought: Universal EEG Representation Learning with Microstates
标题:思想原子:使用微观状态的通用脑电表示学习
链接:https://arxiv.org/abs/2605.20182
作者:Xinyang Tian,Ruitao Liu,Ziyi Ye,Siyang Xue,Xin Wang,Xuesong Chen
备注:Accepted by the 3rd International Workshop on Multimodal and Responsible Affective Computing (MRAC 2025). 8 pages of main text, 23 pages total, 5 figures, 4 tables
摘要:Learning universal representations from electroencephalogram (EEG) signals is a cutting-edge approach in the field of neuroinformatics and brain-computer interfaces (BCIs). Conventionally, EEG is treated as a multivariate temporal signal, where time- or frequency-domain features are extracted for representation learning. This paper investigates a simple yet effective EEG representation, i.e., microstates. Microstates represent the building blocks of brain activity patterns at a microscopic time scale. We build a universal microstate tokenizer from a large medical EEG dataset by clustering continuous EEG signals into sequences of discrete microstates. The microstate tokenizer is then adopted universally across a series of downstream tasks, including sleep staging, emotion recognition, and motor imagery classification. Experimental results show that EEG representation learning with microstates outperforms traditional time-domain and frequency-domain features under different models and across different tasks. Further analysis shows that microstates offer greater interpretability and scalability, thereby opening up applications in both cognitive neuroscience and clinical research.
【2】What Makes a Representation Good for Single-Cell Perturbation Prediction?
标题:是什么让表示适合单细胞扰动预测?
链接:https://arxiv.org/abs/2605.19343
作者:Wenkang Jiang,Yuhang Liu,Yichao Cai,Erdun Gao,Jiayi Dong,Ehsan Abbasnejad,Lina Yao,Javen Qinfeng Shi
备注:Accepted to ICML 2026
摘要:Single-cell perturbation modeling is fundamental for understanding and predicting cellular responses to genetic perturbations. However, existing approaches, from causal representation learning to foundation models, often struggle with an overlooked challenge: gene expression is dominated by perturbation-invariant information, while perturbation-specific signals are intrinsically sparse. As a result, learned representations either entangle invariant and perturbation-specific information, leading to spurious and non-generalizable predictors, or suppress perturbation-specific signals altogether, rendering them ineffective for prediction. To address this, we propose PerturbedVAE, a general framework designed to resolve this signal imbalance. The framework explicitly separates perturbation-specific information from dominant invariant structure and recovers causal representations to effectively utilize such information for prediction. We further provide an identifiability analysis that characterizes the conditions under which sparse perturbation effects can be reliably recovered, thereby clarifying how the framework can be concretely specified under such conditions. Empirically, PerturbedVAE achieves state-of-the-art performance on a widely used benchmark across multiple evaluation settings, yielding significant gains on out-of-distribution combinatorial predictions and uncovering interpretable perturbation-response programs.
【3】Identifiable Multimodal Causal Representation Learning under Partial Latent Sharing
标题:部分潜在共享下的可识别多模式因果表示学习
链接:https://arxiv.org/abs/2605.19135
作者:Manal Benhamza,Marianne Clausel,Myriam Tami
摘要:Causal representation learning (CRL) seeks to uncover meaningful latent variables and their corresponding causal structure from high-dimensional observational data. Identifiability is a crucial property in CRL, as it ensures the recovery of the mechanisms behind the data generation process, and hence the interpretability and robustness of the representation. Proving identifiability in CRL is intrinsically difficult, and in this work we address an even more challenging setting: multimodality. We consider multimodal observed data with a partially shared latent structure. Each modality is generated, through nonlinear mixing functions, from a specific subset of causal latent variables. Under flexible assumptions and without imposing any parametric distribution on the latent variables, we establish component-wise identifiability guarantees for the causal latent representation. Furthermore, our identifiability results apply to the undercomplete scenario where we have, for each modality, more observed than latent variables. To instantiate our theoretical analysis, we introduce a Wasserstein-based module to recover the partially shared latent structure. Because it is differentiable, the module can be easily integrated into all types of architecture, requiring only minimal changes. Extensive experiments on synthetic and realistic datasets validate the superiority of our approach over SOTA methods.
【4】Embedding by Elicitation: Dynamic Representations for Bayesian Optimization of System Prompts
标题:诱导式嵌入:系统提示Bayesian优化的动态表示
链接:https://arxiv.org/abs/2605.19093
作者:Zhiyuan Jerry Lin,Benjamin Letham,Samuel Dooley,Maximilian Balandat,Eytan Bakshy
摘要:System prompts are a central control mechanism in modern AI systems, shaping behavior across conversations, tasks, and user populations. Yet they are difficult to tune when feedback is available only as aggregate metrics rather than per-example labels, failures, or critiques. We study this aggregate feedback setting as sample-constrained black-box optimization over discrete, variable-length text. We introduce ReElicit, a Bayesian optimization framework based on \emph{embedding by elicitation}. Given a task description, previously evaluated prompts, and scalar scores, an LLM elicits a compact, interpretable feature space and maps prompts into it. Leveraging a probabilistic Gaussian process surrogate, an acquisition function then selects target feature vectors, which the LLM realizes and refines into deployable system prompts. Re-eliciting the feature space as new evaluations arrive lets the representation adapt to the observed prompt-score history. We evaluate the setting using offline benchmark accuracy as a controlled aggregate proxy: the optimizer observes one scalar score per prompt and no per-example labels, errors, or critiques. Across ten system prompt optimization tasks with a 30 total evaluation budget, ReElicit achieves the strongest aggregate performance profile among representative aggregate-only prompt-optimization baselines. These results suggest that LLMs can serve as adaptive semantic representation builders, not only prompt generators, for Bayesian optimization over natural-language artifacts.
【5】VCR: Learning Valid Contextual Representation for Incomplete Wearable Signals
标题:VCR:学习不完整可穿戴信号的有效上下文表示
链接:https://arxiv.org/abs/2605.18837
作者:Yuxuan Weng,Wenhan Luo,Qijia Shao
摘要:Wearable devices enable continuous health monitoring from multimodal signals, but real-world deployment is hindered by limited labeled data and pervasive sensor incompleteness. While large-scale self-supervised pretraining reduces label dependence, most existing methods assume full modality availability. Current approaches for handling modality missingness often reconstruct entire absent signals, which can encourage hallucinating modality-specific details that are not inferable from the observed sensor signals and degrade robustness. We propose VCR, a self-supervised framework that learns to extract valid representations robust to modality missingness. VCR employs an orthogonal tokenizer to enforce strict orthogonal disentanglement by rectifying latent manifolds and applying a geometric projection, separating each modality into shared semantics and modality-specific residuals. This design preserves complete information integrity while serving as a structural foundation for robust learning under modality missingness. The resulting tokens are processed by a missing-aware mixture-of-experts backbone that adapts to varying patterns of modality availability. By constraining the objective to reconstruct only the shared components of missing modalities, VCR effectively mitigates hallucinations of non-inferable modality-specific details. Across multiple health monitoring tasks, VCR consistently improves performance and robustness under full, single-missing, and multiple-missing modality settings compared with strong supervised and self-supervised baselines.
【6】Cross-View Attention Fusion Net: A Prior-Guided Dual-View Representation Learning for Cardiac Output Estimation from Short-Term PPG Signals
标题:交叉视图注意力融合网络:根据短期PPG信号估计心输出量的先验引导双视图表示学习
链接:https://arxiv.org/abs/2605.19666
作者:Yaowen Zhang,Bo Cui,Libera Fresiello,Peter H. Veltink,Dirk W. Donker,Ying Wang
摘要:Accurate cardiac output (CO) estimation from photoplethysmography (PPG) is promising for unobtrusive hemodynamic monitoring, but remains difficult since CO is jointly determined by cardiac function and vascular tone. Conventional feature-based models use physiologically meaningful PPG descriptors, yet depend on accurate pulse detection and may miss latent temporal relationships. In contrast, fully end-to-end deep learning models learn directly from raw PPG but often underuse established PPG-derived prior information. Here, we introduce the Cross-View Attention Fusion Network (CVAF-Net), a prior-guided dual-view deep learning model for CO estimation from short, fixed-length PPG segments. CVAF-Net processes raw PPG as a temporal view and a feature sequence map (FSM) as a structured prior-guided view, and fuses the two representations through cross-view attention. The model was independently evaluated using 5-, 15-, and 30-s segments from three datasets: simulated pulse waves (3323 subjects), vasoconstriction provocation (79 subjects), and resting/cycling activities (10 subjects), and was compared with multiple machine learning and deep learning benchmarks. CVAF-Net outperformed most benchmark methods and achieved performance comparable to a state-of-the-art Transformer-based model, with a mean absolute error (MAE) of 0.19 L/min (MAPE: 3.95%) on simulated data and high accuracy in real-world settings (minimum MAE: 1.20 L/min). Importantly, CVAF-Net reduced FLOPs by twelvefold compared with the leading Transformer-based model. Plausibility analysis showed physiologically consistent CO estimates, with expected correlations with age ($ρ= -0.274$), heart rate ($ρ= 0.894$), and systemic vascular resistance ($ρ= -0.740$). These findings indicate that CVAF-Net provides an accurate, computationally efficient, and generalizable approach for continuous wearable-based CO monitoring.
3D|3D重建等相关(1篇)
【1】CompoSE: Compositional Synthesis and Editing of 3D Shapes via Part-Aware Control
标题:CompoSE:通过部件感知控制进行3D形状的组合式合成与编辑
链接:https://arxiv.org/abs/2605.19350
作者:Habib Slim,Shariq Farooq Bhat,Mohamed Elhoseiny,Yifan Wang,Mike Roberts
摘要:Creating and editing high-quality 3D content remains a central challenge in computer graphics. We address this challenge by introducing CompoSE, a novel method for Compositional Synthesis and Editing of 3D shapes via part-aware control. Our method takes as input a set of coarse geometric primitives (e.g., bounding boxes) that represent distinct object parts arranged in a particular spatial configuration, and synthesizes as output part-separated 3D objects that support localized granular (i.e., compositional) editing of individual parts. The key insight that enables our method is our use of a diffusion transformer architecture that alternates between processing each part locally and aggregating contextual information across parts globally, and features a novel conditioning technique that ensures strong adherence to the user's input. Importantly, our method learns to infer part semantics and symmetries directly from the user's coarse layout guidance, and does not require part-level text prompts. We demonstrate that our method enables powerful part-level editing capabilities, including context-aware substitution, addition, deletion, and style-preserving resizing operations. We show through extensive experiments that our method significantly outperforms existing approaches on guided synthesis, as measured by objective metrics and LLM-based evaluations.
编码器(1篇)
【1】Variational Diffusion Channel Decoder
标题:变分扩散通道解码器
链接:https://arxiv.org/abs/2605.18902
作者:Chengwei Zhang,Yifan Du,Siyu Liao
摘要:Neural channel decoders, as a data-driven channel decoding strategy, have shown very promising improvements in error-correcting capability over classical methods. However, the success of these deep learning-based decoders comes at the cost of drastically increased model storage and computational complexity, hindering their practical adoption in real-world time-sensitive, resource-sensitive communication and storage systems. To address this challenge, we propose an efficient variational diffusion model-based channel decoder, which effectively integrates the domain-specific belief propagation process into the modern diffusion model. By combining the low cost of belief propagation with the strong learning capability of diffusion models, our proposed neural decoder simultaneously achieves very low cost and high error-correcting performance. Experimental results show that, compared with state-of-the-art neural channel decoders, our model provides a feasible solution for practical deployment by achieving the best decoding performance with significantly reduced computational cost and model size.
优化|敛散性(15篇)
【1】Take It or Leave It: Intent-Controlled Partial Optimal Transport
标题:接受或放弃:意图控制的部分最优运输
链接:https://arxiv.org/abs/2605.20030
作者:Salil Parth Tripathi,Bertrand Chapron,Fabrice Collard,Nicolas Courty,Ronan Fablet
摘要:While optimal transport (OT) enforces a rigid constraint by requiring two measures to be matched exactly, partial optimal transport relaxes this requirement by allowing mass to remain unmatched through a global budget, scalar rebate, or uniform rejection rule. However, many applications call for more structured, pointwise rejection mechanisms, where the decision to leave mass unmatched depends on side-specific reliability, support geometry, or external information about which components should participate in the comparison. We introduce \emph{intent-controlled partial optimal transport} (IC-POT), a targeted generalization of partial transport that replaces the global rejection paradigm with pointwise rejection costs over both measures. We show that the resulting optimization problem admits a dual interpretation in terms of local acceptance thresholds and can be solved by recasting it as a balanced Kantorovich OT problem on an augmented support. Beyond theoretical analysis, we demonstrate the practical relevance of IC-POT in settings where rejection is driven by side information. In positive-unlabeled learning and open-partial domain adaptation, incorporating pointwise rejection rules that encode statistical structure improves fixed baseline pipelines. Finally, we motivate the use of IC-POT with a practical geophysical case: multi-modal satellite ocean measurements, for which physical and sensor priors naturally inform the rejection mechanism and define the retrieved comparable signal information.
【2】Training Neural Networks with Optimal Double-Bayesian Learning
标题:利用最佳双Bayesian学习训练神经网络
链接:https://arxiv.org/abs/2605.20009
作者:Vy Bui,Hang Yu,Karthik Kantipudi,Ziv Yaniv,Stefan Jaeger
备注:13 pages, 4 figures; see also arXiv:2410.12984 [cs.LG]
摘要:Backpropagation with gradient descent is a common optimization strategy employed by most neural network architectures in machine learning. However, finding optimal hyperparameters to guide training has proven challenging. While it is widely acknowledged that selecting appropriate parameters is crucial for avoiding overfitting and achieving unbiased outcomes, this choice remains largely based on empirical experiments and experience. This paper presents a new probabilistic framework for the learning rate, a key parameter in stochastic gradient descent. The framework develops classic Bayesian statistics into a double-Bayesian decision mechanism involving two antagonistic Bayesian processes. A theoretically optimal learning rate can be derived from these two processes and used for stochastic gradient descent. Experiments across various classification, segmentation, and detection tasks corroborate the practical significance of the theoretically derived learning rate. The paper also discusses the ramifications of the proposed double-Bayesian framework for network training and model performance.
【3】Minimax Optimal Variance-Aware Regret Bounds for Multinomial Logistic MDPs
标题:多项逻辑MDPs的Minimax最优方差感知后悔界
链接:https://arxiv.org/abs/2605.19768
作者:Pierre Boudart,Pierre Gaillard,Alessandro Rudi
摘要:We study reinforcement learning for episodic Markov Decision Processes (MDPs) whose transitions are modelled by a multinomial logistic (MNL) model. Existing algorithms for MNL mixture MDPs yield a regret of $\smash{\tilde{O}(dH^2\sqrt{T})}$ (Li et al., 2024), where $d$ is the feature dimension, $H$ the episode length, and $T$ the number of episodes. Inspired by the logistic bandit literature (Abeille et al., 2021; Faury et al., 2022; Boudart et al., 2026), we introduce a problem-dependent constant $\bar{σ}_T \leq 1/2$, measuring the normalised average variance of the optimal downstream value function along the learner's trajectory. We propose an algorithm achieving a regret of $\smash{\tilde{O}(dH^2\bar{σ}_T\sqrt{T})}$, which recovers the existing bound in the worst case and improves upon it for structured MDPs. For instance, for KL-constrained robust MDPs, $\bar{σ}_T = O(H^{-1})$, reducing the horizon dependence by a factor $H$. We further establish a matching $\smash{Ω(dH^2\bar{σ}_T\sqrt{T})}$ lower bound, proving minimax optimality (up to logarithmic factors) and fully characterising the regret complexity of MNL mixture MDPs for the first time.
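As a quick check of the two claims about the bound, the values of the variance constant given in the abstract can be substituted directly into the upper bound. This is a worked substitution based only on the abstract, not a derivation from the paper:

```latex
\begin{align*}
\text{worst case } \bar{\sigma}_T \le \tfrac{1}{2}: \quad
  & \tilde{O}\!\left(dH^2\,\bar{\sigma}_T\sqrt{T}\right)
    \subseteq \tilde{O}\!\left(dH^2\sqrt{T}\right), \\
\text{KL-constrained robust MDPs, } \bar{\sigma}_T = O(H^{-1}): \quad
  & \tilde{O}\!\left(dH^2 \cdot H^{-1}\sqrt{T}\right)
    = \tilde{O}\!\left(dH\sqrt{T}\right).
\end{align*}
```

The first line recovers the existing bound, and the second shows the factor-$H$ reduction in horizon dependence.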
【4】Optimal Reconstruction from Linear Queries
标题:基于线性查询的最优重建
链接:https://arxiv.org/abs/2605.19625
作者:Yuval Filmus,Shay Moran,Elizaveta Nesterova
备注:Accepted to COLT 2026. 46 pages, 4 figures
摘要:We study the problem of reconstructing an unknown point in $\mathbb{R}^d$ from approximate linear queries. This setting arises naturally in applications ranging from low-dimensional remote sensing and signal recovery to high-dimensional data analysis and privacy-sensitive inference. Our main goal is to characterize the optimal reconstruction error as a function of the number of queries $T$, the ambient dimension $d$, and the noise parameter $δ$. We first analyze the limit $T \to \infty$ and show that the optimal reconstruction error converges to the explicit value $\sqrt{2d/(d+1)} δ$, which plays a role analogous to the Bayes optimal error in supervised learning. When the dimension is fixed, we show that the excess error above this limit decays doubly exponentially fast as $T \to \infty$, a rate that is significantly faster than those typically encountered in learning curves. When the dimension grows, we show that a number of queries on the order of $\exp(d)$ is necessary and sufficient to achieve vanishing excess error. Finally, we introduce and analyze an improper variant of the reconstruction problem. From a technical perspective, our main contribution is a generalization of Jung's theorem (1901). The classical theorem bounds the maximum possible radius of a set of diameter 1 and characterizes extremal bodies. Our generalization provides a robust variant that characterizes near-extremal bodies and is proved via geometric and dynamical arguments exploiting symmetry and Lie group actions.
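The limiting value $\sqrt{2d/(d+1)} δ$ matches the classical Jung bound applied to a set of diameter $2δ$: Jung's theorem bounds the circumradius of a diameter-$D$ set in $\mathbb{R}^d$ by $D\sqrt{d/(2(d+1))}$. The short sketch below checks that identity numerically. Reading the set of points consistent with the queries as having diameter at most $2δ$ is an assumption made here for illustration, and the function names are hypothetical.

```python
import math

def jung_radius(d, diameter=1.0):
    """Jung's bound on the circumradius of a set with the given diameter in R^d."""
    return diameter * math.sqrt(d / (2 * (d + 1)))

def limiting_error(d, delta):
    """Limiting reconstruction error sqrt(2d/(d+1)) * delta stated in the abstract."""
    return math.sqrt(2 * d / (d + 1)) * delta

delta = 0.1  # hypothetical noise level
for d in (1, 2, 3, 10, 1000):
    # The two columns agree: sqrt(2d/(d+1)) * delta == jung_radius(d, 2 * delta)
    print(d, limiting_error(d, delta), jung_radius(d, diameter=2 * delta))
```

In dimension 1 the value is exactly $δ$, and it increases toward $\sqrt{2}\,δ$ as $d$ grows.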
【5】Learning-Accelerated Optimization-based Trajectory Planning for Cooperative Aerial-Ground Handover Missions
标题:基于学习加速优化的空地合作移交任务轨迹规划
链接:https://arxiv.org/abs/2605.19562
作者:Jingshan Chen,Bochen Yu,Henrik Ebel,Peter Eberhard
备注:Preprint of a contribution accepted for publication in the RoManSy 2026 Springer proceedings
摘要:This paper presents a learning-augmented trajectory planning framework for cooperative unmanned aerial vehicle (UAV) and unmanned ground vehicle (UGV) handover missions. While centralized trajectory optimization ensures dynamic feasibility and task optimality, its high computational cost limits real-time applicability. We propose a neural surrogate planner utilizing decoupled encoder-decoder long short-term memory (LSTM) networks to generate coordinated handover trajectory predictions from the task specifications. These predictions serve as informed warm starts for the downstream centralized optimizer, thereby accelerating convergence to dynamically feasible solutions. Benchmark evaluations demonstrate that the learning-augmented planning framework achieves more than a threefold speedup and 100% optimization success rate compared to cold start optimization. The results indicate that combining data-driven inference with model-based refinement enables fast and reliable trajectory generation for heterogeneous multi-robot systems.
【6】MOCHA: Multi-Objective Chebyshev Annealing for Agent Skill Optimization
标题:MOCHA:用于智能体技能优化的多目标Chebyshev退火
链接:https://arxiv.org/abs/2605.19330
作者:Md Mehrab Tanjim,Jayakumar Subramanian,Xiang Chen,Branislav Kveton,Subhojyoti Mukherjee,Anlan Zhang,Sungchul Kim,Somdeb Sarkhel,Sunav Choudhury
备注:Preprint. 25 pages, 14 figures, 5 tables
摘要:LLM agents organize behavior through skills - structured natural-language specifications governing how an agent reasons, retrieves, and responds. Unlike monolithic prompts, skills are multi-field artifacts subject to hard platform constraints: description fields are truncated for routing, instruction bodies are compacted via progressive disclosure, and co-resident skills compete for limited context windows. These constraints make skill optimization inherently multi-objective: a skill must simultaneously maximize task performance and satisfy platform limits. Yet existing prompt optimizers either ignore these trade-offs or collapse them into a weighted sum, missing Pareto-optimal variants in non-convex objective regions. We introduce MOCHA (Multi-Objective Chebyshev Annealing), which replaces single-objective selection with Chebyshev scalarization - covering the full Pareto front, including non-convex regions - combined with exponential annealing that transitions from exploration to exploitation. In our experiments across six diverse agent skills - where all methods share the same multi-objective mutation operator and baselines receive identical per-objective textual feedback - existing optimizers fail to improve the seed skill on 4 of 6 tasks: 1000 rollouts yield zero progress. MOCHA breaks through on every task, achieving 7.5% relative improvement in mean correctness over the strongest baseline (up to 14.9% on FEVER and 10.4% on TheoremQA) while discovering twice as many more Pareto-optimal skill variants.
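The claim that a weighted sum misses Pareto-optimal variants in non-convex regions, while Chebyshev scalarization can reach them, can be seen on a toy example. The three candidate score vectors, the function names, and the use of the per-objective maximum as the ideal point are all assumptions for illustration. The annealing schedule is not sketched, since the abstract does not specify it.

```python
import numpy as np

# Three hypothetical skill variants scored on two objectives to maximize.
# All three are Pareto-optimal, but C lies below the segment joining A and B,
# i.e. in a non-convex part of the front.
names = ["A", "B", "C"]
scores = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [0.45, 0.45]])

def weighted_sum_pick(scores, w):
    """Index of the candidate maximizing the weighted sum of objectives."""
    return int(np.argmax(scores @ w))

def chebyshev_pick(scores, w, ideal=None):
    """Index of the candidate minimizing the weighted Chebyshev distance to the ideal point."""
    ideal = scores.max(axis=0) if ideal is None else ideal
    dist = np.max(w * (ideal - scores), axis=1)
    return int(np.argmin(dist))

# No weight vector on this grid lets the weighted sum select C.
picked = {names[weighted_sum_pick(scores, np.array([w1, 1.0 - w1]))]
          for w1 in np.linspace(0.0, 1.0, 21)}
print("weighted sum selects:", sorted(picked))

# Balanced Chebyshev weights select C.
print("Chebyshev selects:", names[chebyshev_pick(scores, np.array([0.5, 0.5]))])
```

Varying the Chebyshev weights moves the selection along the whole front, for example weights of (0.1, 0.9) pick B here.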
【7】Flash PD-SSM: Memory-Optimized Structured Sparse State-Space Models
标题:Flash PD-SSM:内存优化的结构化稀疏状态空间模型
链接:https://arxiv.org/abs/2605.19150
作者:Aleksandar Terzić,Francesco Carzaniga,Nicolas Menet,Yannick Biehl,Michael Hersche,Thomas Hofmann,Abbas Rahimi
摘要:State-space models (SSMs) face a fundamental trade-off between efficiency and expressivity that is mainly dictated by the structure of the model's transition matrix. Unstructured transition matrices enable maximal expressivity, as measured by their ability to model finite-state automaton (FSA) transitions, but come at a prohibitively high compute and memory cost. In contrast, most structured transition matrix forms are highly efficient both in runtime and memory consumption, but suffer from limited expressivity. Building on recent work on structured sparse SSMs, we propose Flash PD-SSM, a novel SSM that achieves comparable throughput to widely-used structured SSMs with significantly better expressivity guarantees. Flash PD-SSM maintains a trainable set of structured sparse matrices, a single one of which is discretely selected at each time-step, enabling FSA expressiveness at the level of unstructured matrices while maintaining the efficiency required for training models at scale. First, we validate Flash PD-SSM against a suite of alternative models on synthetic mechanistic and state-tracking tasks, finding that its theoretical expressivity is achieved in practice. Second, on multivariate time-series tasks involving sequences of length over 17,000, we find that Flash PD-SSM defines a new state-of-the-art (SoTA) accuracy among competing SSM methods. Finally, we demonstrate that Flash PD-SSM is an effective drop-in replacement for hybrid LLMs, yielding improvements both in natural language state-tracking and in common language modeling scenarios. The model exhibits increased throughput and decreased memory consumption compared to SSMs widely used in frontier language models.
【8】EUPHORIA: Efficient Universal Planning via Hybrid Optimization for Robust Industrial Robotic Assembly
标题:EUPHORIA:通过混合优化实现高效通用规划,实现稳健的工业机器人装配
链接:https://arxiv.org/abs/2605.18872
作者:Shih-Yu Lai,Chia-Ching Yen,Yang-Ting Shen,Peter Yichen Chen,Yu-Lun Liu,Bing-Yu Chen
摘要:Robotic assembly in architectural construction faces a persistent bottleneck: existing planners are either highly specialized, requiring prohibitive retraining for every new geometric design, or operationally inefficient, treating structural sequencing and kinematic motion as disjoint processes. We present EUPHORIA, a unified framework that achieves universal few-shot adaptability and dynamic efficiency through a hybrid optimization strategy. To overcome the retraining bottleneck, we propose a Meta-Geometric Encoder based on Graph Hypernetworks: unlike standard contrastive learning, which performs only feature-level recognition, our hypernetwork dynamically generates policy parameters from a minimal support set, enabling parameter-level adaptation to complex topologies (e.g., domes, arches) without gradient-based retraining. For structural reasoning, we introduce a Physics-Informed Graph Transformer trained via Soft Actor-Critic (SAC), with a Physics-Bias Attention mechanism that modulates attention scores using contact forces from Discrete Element Model (DEM) simulations, guiding the planner toward structurally critical connections. We further ensure operational efficiency through Kinematics-Aware Sequencing, where the SAC objective penalizes high-energy transitions. Finally, we bridge the Sim2Real gap via Residual Stability Correction, a differentiable optimization layer that fine-tunes coarse assembly actions by minimizing a joint energy-stability cost prior to execution. Experiments show that EUPHORIA significantly reduces energy consumption over decoupled baselines and achieves state-of-the-art success rates on unseen, non-standard geometries with minimal few-shot examples, fusing meta-learning, physics-informed attention, and residual optimization into a cohesive, generalized planner.
【9】MO-CAPO: Multi-Objective Cost-Aware Prompt Optimization
标题:MO-CAPO:多目标成本意识提示优化
链接:https://arxiv.org/abs/2605.18869
作者:Jan Büssing,Moritz Schlager,Timo Heiß,Tom Zehle,Matthias Feurer
摘要:Large language models (LLMs) achieve strong performance across a wide range of tasks but are highly sensitive to prompt design, motivating the need for automatic prompt optimization. Existing methods predominantly focus on performance alone, ignoring competing objectives such as inference cost or latency. At the same time, existing work on multi-objective prompt optimization relies on off-the-shelf NSGA-II, ignoring optimization efficiency. As a remedy, we introduce MO-CAPO, a novel multi-objective prompt optimization algorithm that jointly optimizes performance and inference cost while leveraging budget allocation for cost-efficient optimization. We further propose a deployment-oriented cost objective that captures the full computational profile of LLM inference. We evaluate our approach across four tasks and three LLMs and compare it to an NSGA-II-based multi-objective method and state-of-the-art single-objective prompt optimizers. Results show that MO-CAPO consistently identifies strong, robust, and diverse Pareto front approximations while maintaining cost-efficiency. It outperforms the NSGA-II baseline on 8 out of 12 cases in terms of the noisy R2 metric and achieves competitive performances often already at a considerably lower budget. The discovered solution sets span diverse performance-cost trade-offs that are omitted by single-objective optimizers, yet the top-performance candidates remain competitive with single-objective solutions. Additionally, we conduct the first evaluation of multi-objective machine learning experiments that considers generalization and robustness through noisy R2 and approximation gap, enabling a more realistic assessment of solution quality. MO-CAPO enables practitioners to select from an efficiently discovered set of multiple prompts offering different trade-offs between performance and cost.
【10】Efficient Conditioning: Why Pseudo-Observation Batch Bayesian Optimization Works and When It Does Not
标题:高效条件化:伪观测批量贝叶斯优化为何有效、何时失效
链接:https://arxiv.org/abs/2605.18819
作者:Kumbha Nagaswetha,Rabi Pathak
摘要:Constant Liar (CL), Kriging Believer (KB), and fantasy models are widely used for batch selection in parallel Bayesian Optimization, yet a unified theory explaining their effectiveness and the conditions under which they fail has been lacking. We identify efficient conditioning as the key surrogate property: the ability to update predictions in closed form when data is augmented. We prove that Gaussian Processes satisfy this requirement, producing provably distinct batch points with separation of order l, and that this holds for any acquisition function monotonically non-decreasing in posterior uncertainty (EI, UCB, PI), with qualitatively similar behavior for Thompson Sampling. We unify CL, KB, and fantasy models as instances of a single conditioning mechanism differing only in the lie-value distribution, and draw quantitative connections to Local Penalization (LP) and qualitative connections to Determinantal Point Processes (DPPs). To disentangle model structure from optimizer randomness, we introduce the Structural Diversity Diagnostic (SDD), a reusable methodology for testing surrogate compatibility. Experiments on Hartmann 6D, Ackley 8D, Levy 10D, and SVM hyperparameter tuning validate all theoretical predictions: the implicit penalty of CL/KB matches or outperforms explicit LP; greedy conditioning achieves convergence on par with joint qEI; efficient conditioning extends to Multiquadric RBF networks; and parametric surrogates produce degenerate batches even when fully retrained (random forests), while neural networks regain diversity only at 15x the wall-clock cost of GP conditioning. Robustness is confirmed across multiple initial datasets and under observation noise.
【11】PROWL: Prioritized Regret-Driven Optimization for World Model Learning
标题:PROWL:用于世界模型学习的优先级遗憾驱动优化
链接:https://arxiv.org/abs/2605.18803
作者:Ahmet H. Güzel,Jenny Seidenschwarz,Benjamin Graham,Jonathan Sadeghi,Jeffrey Hawke,Jack Parker-Holder,Ilija Bogunovic
摘要:Modern action-conditioned video world models achieve strong short-horizon visual realism, yet remain unreliable on rare, interaction-critical transitions that dominate downstream planning and policy performance. Because passive demonstration data systematically under-samples these high-impact regimes, improving robustness requires actively eliciting model failures rather than relying on their natural occurrence. We introduce a KL-constrained adversarial curriculum in which a policy is trained to expose high-error trajectories of a diffusion-based world model while remaining close to the behavior distribution. The world model is continuously fine-tuned on these adversarially discovered trajectories, yielding an adversarial training loop that converts rare failures into a stable, near-distribution training signal without drifting into out-of-distribution exploitation. To maintain pressure on unresolved weaknesses as the model improves, we propose a Prioritized Adversarial Trajectory (PAT) buffer that re-ranks trajectories based on prediction error, action fidelity, and learning progress, focusing training on unresolved failure modes rather than repeatedly revisiting solved cases. We implement our approach in the MineRL framework and evaluate it on held-out out-of-distribution trajectories; PROWL improves robustness over models trained on passive data alone, reveals reward-hacking behaviors under weak behavioral constraints, and demonstrates that effective adversarial world-model training critically depends on balancing exploratory failure discovery with explicit behavioral regularization. Our results suggest that scalable world models benefit not only from larger datasets, but also from selectively generating informative training data.
【12】Theory-optimal Quantization Based on Flatness
标题:基于平坦性的理论最优量化
链接:https://arxiv.org/abs/2605.18800
作者:Xiusheng Huang,Zhe Li,Xuanwu Yin,Lu Wang,Yequan Wang,Dong Li,Emad Barsoum,Kang Liu
备注:16 pages, 2 figures
摘要:Post-training quantization has emerged as a widely adopted technique for compressing and accelerating the inference of Large Language Models (LLMs). The primary challenges in LLM quantization stem from activation outliers, which significantly degrade model performance, especially at lower bit precision. While recent approaches attempt to mitigate outliers through linear transformations across feature dimensions, our analysis reveals that the transformed weights and activations still exhibit persistent outlier patterns with concentrated magnitude distributions. In this paper, we first model the mathematical relationship between quantization error and outliers, and then introduce a new metric, Flatness, to quantify the distribution of outliers. Based on this, we derive the theoretical optimal solution with respect to Flatness. Building on these insights, we propose Bidirectional Diagonal Quantization (BDQ), a novel post-training quantization framework that effectively disperses outlier patterns through optimized matrix transformations. BDQ strategically distributes outlier magnitudes across matrix dimensions via learned diagonal operations. Extensive experiments demonstrate that BDQ establishes a new quantization benchmark. It achieves less than 1\% accuracy drop in W4A4 quantization on the LLaMA-3-8B model. In the more challenging W2A4KV16 experiment, compared to state-of-the-art approaches, BDQ reduces the performance gap by 39.1\% on the DeepSeek-R1-Distill-LLaMA-70B model.
【13】Goal-Oriented Lower-Tail Calibration of Gaussian Processes for Bayesian Optimization
标题:面向目标的高斯过程的Bayesian优化下尾校准
链接:https://arxiv.org/abs/2605.20145
作者:Aurélien Pion,Emmanuel Vazquez
摘要:Bayesian optimization (BO) selects evaluation points for expensive black-box objectives using Gaussian process (GP) predictive distributions. Kernel choice and hyperparameter selection can lead to miscalibrated predictive distributions and an inappropriate exploration-exploitation trade-off. For minimization, sampling criteria such as expected improvement (EI) depend on the predictive distribution below the current best value, so lower-tail miscalibration directly affects the sampling decision. This article studies goal-oriented calibration of GP predictive distributions below a low threshold $t$ in the noiseless setting, for standard GP models with hyperparameters selected by maximum likelihood. A framework for predictive reliability below $t$ is introduced, based on two notions of spatial calibration: occurrence calibration over the design space and thresholded $μ$-calibration on sublevel sets of the form $\{x\in\mathbb{X}, f(x)\le t\}$. Building on this framework, we propose tcGP, a post-hoc method that calibrates GP predictive distributions below $t$, and we show that the resulting EI-based global optimization algorithm remains dense in the design space. Experiments on standard benchmarks show improved lower-tail calibration and BO performance relative to standard GP models and globally calibrated GP models.
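To make concrete why lower-tail calibration matters for EI, here is a minimal sketch of the standard closed-form expected improvement for minimization under a Gaussian predictive distribution. It is the textbook formula, not the tcGP method; the function name and the numbers in the example are assumptions made for illustration.

```python
import math

def expected_improvement_min(mu: float, sigma: float, f_best: float) -> float:
    """Closed-form EI for minimization with predictive N(mu, sigma^2).

    EI = (f_best - mu) * Phi(z) + sigma * phi(z), with z = (f_best - mu) / sigma.
    Only the predictive mass below f_best contributes to the value.
    """
    if sigma <= 0.0:
        return max(f_best - mu, 0.0)  # degenerate, noise-free prediction
    z = (f_best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (f_best - mu) * cdf + sigma * pdf

# Same predictive mean, two different predictive standard deviations (hypothetical):
ei_overconfident = expected_improvement_min(mu=1.0, sigma=0.5, f_best=0.0)
ei_wider = expected_improvement_min(mu=1.0, sigma=1.0, f_best=0.0)
```

Because EI grows with the predictive standard deviation at a fixed mean, a GP that understates its lower-tail uncertainty will undervalue such points, which is the kind of sampling distortion the abstract describes.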
【14】Convergence of Consensus-Based Particle Methods for Nonconvex Bi-Level Optimization
标题:非凸双层优化中基于共识的粒子方法的收敛性
链接:https://arxiv.org/abs/2605.19667
作者:Yutong Chao,Xudong Sun,Konstantin Riedl,Majid Khadiv,Jalal Etesami
摘要:In this paper, we study a consensus-based optimization method for nonconvex bi-level optimization, where the objective is to minimize an upper-level function over the set of global minimizers of a lower-level problem. The proposed approach is derivative-free, and constructs its consensus point via smooth quantile selection combined with a Gibbs-type Laplace approximation. We establish convergence guarantees for both the associated \textit{mean-field} dynamics and its \textit{finite-particle} approximation. In particular, under suitable assumptions on smooth quantile localization, error bounds, and stability, we show that the mean-field law reaches any arbitrary prescribed Wasserstein neighborhood of the target bi-level solution with an explicit exponential rate up to the hitting time. Numerical experiments on a two-dimensional constrained problem and neural network training further support the theoretical results.
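For readers unfamiliar with consensus-based optimization, the sketch below computes the Gibbs-weighted consensus point used in the standard single-level method. It deliberately omits the smooth quantile selection that the abstract combines with it for the bi-level case; the particles, objective, and `alpha` values are hypothetical.

```python
import math
from typing import Callable, List, Sequence

def consensus_point(
    particles: Sequence[Sequence[float]],
    objective: Callable[[Sequence[float]], float],
    alpha: float,
) -> List[float]:
    """Weighted particle average with Gibbs weights exp(-alpha * f(x)).

    Subtracting the minimum objective value before exponentiating leaves the
    normalized weights unchanged and avoids underflow for large alpha.
    """
    values = [objective(x) for x in particles]
    v_min = min(values)
    weights = [math.exp(-alpha * (v - v_min)) for v in values]
    total = sum(weights)
    dim = len(particles[0])
    return [
        sum(w * x[d] for w, x in zip(weights, particles)) / total
        for d in range(dim)
    ]

def sphere(x: Sequence[float]) -> float:
    return sum(xi * xi for xi in x)

swarm = [(0.0, 0.0), (1.0, 1.0), (3.0, -2.0)]
sharp = consensus_point(swarm, sphere, alpha=50.0)  # concentrates on the best particle
flat = consensus_point(swarm, sphere, alpha=0.0)    # plain average of the swarm
```

As `alpha` grows, the weighted average approaches the best particle (the Laplace-principle behaviour the method relies on), and no gradient of the objective is ever evaluated.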
【15】A Nonlinear Complexity Index for Wearable PPG Cardiovascular Stability: Multiscale Validation, Systematic Evaluation Correction, and Bayesian Parameter Optimization
标题:可穿戴式PPG心血管稳定性的非线性复杂性指数:多尺度验证、系统评估修正和Bayesian参数优化
链接:https://arxiv.org/abs/2605.18802
作者:Timothy Oladunni,Farouk Ganiyu Adewumi
摘要:Cardiovascular stability estimation from wearable photoplethysmography (PPG) requires a principled nonlinear framework, yet major gaps persist in heuristic parameter selection and evaluation protocols that inflate reported performance. We introduce a Stability-Constrained Cardiovascular Stability Index (SCSI) grounded in Cardiac Stability Theory and validate it across 176,742 segments from four heterogeneous PPG datasets at three temporal scales. Cross-dataset analysis demonstrates a large Kruskal-Wallis effect size (eta2 = 0.351, p < 0.001), strong cross-scale consistency (kappa > 0.97), and significant correlation with respiratory rate across 53 ICU records (Spearman r = 0.346, p = 0.011). We identify three evaluation artifacts that inflate heuristic AUC from a true baseline of 0.573 to 0.752: segment-level cross-validation leakage, test-set normalization leakage, and pooled-AUC overweighting that conceals per-patient failure. Correcting these artifacts and applying Bayesian optimization over 15 joint parameters yields SCSI with cross-validation AUC of 0.720. On 18 held-out records, SCSI achieves pooled AUC of 0.757 (95% CI: 0.686-0.828) and negative predictive value of 0.966 for tachypnea screening, while per-record AUC of 0.497 +/- 0.207 is disclosed for transparency. External validation on 42 elective-surgery records yields AUC of 0.621, confirming cross-population generalization. Ablation analysis identifies the nonlinear complexity module as the dominant component. A sparse three-component architecture is proposed as the minimal deployable configuration. The corrected protocol provides a reproducible benchmark for future wearable cardiovascular stability indices.
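The gap the abstract reports between pooled AUC and per-record AUC can be reproduced on toy data. The sketch below uses two synthetic records whose scores differ in baseline level but carry no within-record signal; all scores and labels are invented for illustration and have no connection to the paper's datasets.

```python
from typing import Sequence

def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Record A: high baseline scores, mostly positive segments (synthetic).
scores_a, labels_a = [0.9, 0.8, 0.9, 0.8, 0.85], [1, 1, 1, 1, 0]
# Record B: low baseline scores, mostly negative segments (synthetic).
scores_b, labels_b = [0.1, 0.2, 0.1, 0.2, 0.15], [0, 0, 0, 0, 1]

auc_a = auc(scores_a, labels_a)
auc_b = auc(scores_b, labels_b)
auc_pooled = auc(scores_a + scores_b, labels_a + labels_b)
```

Here each record has an AUC of 0.5 while the pooled AUC is 0.8: pooling rewards the index for telling the two records apart, not for ranking segments within a patient.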
预测|估计(22篇)
【1】HaorFloodAlert: Deseasonalized ML Ensemble for 72-Hour Flood Prediction in Bangladesh Haor Wetlands
标题:HaorFloodAlert:用于孟加拉国Haor湿地72小时洪水预测的去季节化机器学习集成
链接:https://arxiv.org/abs/2605.20167
作者:Salma Hoque Talukdar Koli,Fahima Haque Talukder Jely,Md. Samiul Alim,Md. Zakir Hossen
备注:9 pages, 9 figures. To be submitted to raaicon.org
摘要:Flash floods in Bangladesh's haor wetlands show up with almost no warning. They wreck the annual boro rice harvest. Current setups, built for riverine floods, miss backwater dynamics entirely. These basins are flat. Water does not behave like it does on the Brahmaputra. We built HaorFloodAlert, a deseasonalized machine learning ensemble that forecasts 72-hour flood probability for the Sunamganj Haor (approximately 8,000 km²). Temperature was acting as a seasonal cheat code - it inflated accuracy by 6.9 pp just because floods happen in warm months. We caught that. We also built an upstream Barak River Sentinel-1 SAR proxy from Silchar, Assam, giving about 36 hours of lead time. Otsu-thresholded SAR change detection validates at 84-91 percent spatial match. The operational ensemble (RF 0.5625 + XGBoost 0.4375) hits 89.6 percent LOOCV accuracy, 87.5 percent recall, and 0.943 AUC-ROC on 77 real Sentinel-1 events. A three-tier alert pipeline and a BRRI-calibrated boro rice damage estimator are included.
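The ensemble weights quoted in the abstract can be read as a weighted soft vote. The sketch below assumes that the weights are applied to each model's predicted flood probability, which the abstract does not state explicitly; the two input probabilities are hypothetical.

```python
# Weights quoted in the abstract; applying them to class probabilities is an assumption.
W_RF, W_XGB = 0.5625, 0.4375

def ensemble_probability(p_rf: float, p_xgb: float) -> float:
    """Weighted average (soft vote) of two flood-probability estimates."""
    return W_RF * p_rf + W_XGB * p_xgb

# Hypothetical outputs of the random forest and XGBoost models for one forecast.
p_flood = ensemble_probability(0.80, 0.40)
```

Since the two weights sum to one, the combined value stays a valid probability whenever both inputs are.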
【2】Toto 2.0: Time Series Forecasting Enters the Scaling Era
标题:Toto 2.0:时间序列预测进入缩放时代
链接:https://arxiv.org/abs/2605.20119
作者:Emaad Khwaja,Chris Lettieri,Gerald Woo,Eden Belouadah,Marc Cenac,Guillaume Jarry,Enguerrand Paquin,Xunyi Zhao,Viktoriya Zhukov,Othmane Abou-Amal,Chenghao Liu,Ameet Talwalkar,David Asker
备注:Code: https://github.com/DataDog/toto Weights: https://huggingface.co/collections/Datadog/toto-20
摘要:We show that time series foundation models scale: a single training recipe produces reliable forecast-quality improvements from 4M to 2.5B parameters. We release Toto 2.0, a family of five open-weights forecasting models trained under this recipe. The Toto 2.0 family sets a new state of the art on three forecasting benchmarks: BOOM, our observability benchmark; GIFT-Eval, the standard general-purpose benchmark; and the recent contamination-resistant TIME benchmark. This report describes our experimental results and details the design decisions behind Toto 2.0: its architecture and training recipe, training data, and the u-muP hyperparameter transfer pipeline. All five base checkpoints are released under Apache 2.0.
【3】Beyond Isotropy in JEPAs: Hamiltonian Geometry and Symplectic Prediction
标题:超越JEPA中的各向同性:汉密尔顿几何和辛预测
链接:https://arxiv.org/abs/2605.20107
作者:Robert Jenkinson Alvarez
摘要:JEPAs often regularize one-view embeddings toward an isotropic Gaussian, implicitly baking Euclidean symmetry into the representation. We show that this is not merely a benign default. For a known structured downstream geometry $H\succ0$, the minimax and maximum-entropy covariance under a Hamiltonian energy budget is $(c/d)H^{-1}$, and Euclidean isotropy incurs a closed-form price of isotropy. More importantly, when the downstream geometry is unknown, no geometry-independent fixed marginal target is canonical: every fixed covariance shape can be maximally misaligned for some structured geometry. We further show that even oracle one-view marginals do not identify the JEPA view-to-view predictive coupling. These results suggest that the structural bias in JEPAs should enter the cross-view coupling rather than a fixed encoder marginal. We instantiate this principle with \textbf{HamJEPA}, which encodes each view as a phase-space state $(q,p)$ and predicts view-to-view transitions with a learned Hamiltonian leapfrog map, while non-isotropic scale and spectral floors prevent collapse. In a deliberately headless token protocol, HamJEPA improves over SIGReg on CIFAR-100 by $+4.89$ kNN@20 and $+3.52$ linear-probe points at 30 epochs, and by $+6.45$ kNN@20 and $+10.64$ linear-probe points at 80 epochs, while a matched MLP predictor ablation shows that the symplectic coupling is the ingredient driving the neighborhood-geometry gain. On ImageNet-100, HamJEPA-$q$ improves by $+4.82$ kNN@20 and $+7.52$ linear-probe points at 45 epochs.
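As background for the "Hamiltonian leapfrog map" mentioned in the abstract, here is a minimal sketch of the classical leapfrog integrator for a separable Hamiltonian $H(q,p) = p^2/2 + V(q)$. HamJEPA learns its Hamiltonian; this example instead fixes a harmonic potential, and the step size and step count are arbitrary choices.

```python
from typing import Callable, Tuple

def leapfrog(
    q: float,
    p: float,
    grad_v: Callable[[float], float],
    step: float,
    n_steps: int,
) -> Tuple[float, float]:
    """Integrate dq/dt = p, dp/dt = -V'(q) with the leapfrog (kick-drift-kick) scheme."""
    for _ in range(n_steps):
        p = p - 0.5 * step * grad_v(q)  # half kick
        q = q + step * p                # full drift
        p = p - 0.5 * step * grad_v(q)  # half kick
    return q, p

def energy(q: float, p: float) -> float:
    """Harmonic-oscillator energy, V(q) = q^2 / 2 (fixed here, not learned)."""
    return 0.5 * p * p + 0.5 * q * q

q0, p0 = 1.0, 0.0
q1, p1 = leapfrog(q0, p0, grad_v=lambda q: q, step=0.1, n_steps=1000)
drift = abs(energy(q1, p1) - energy(q0, p0))

# Time reversibility: flip the momentum and integrate the same number of steps.
q_back, p_back = leapfrog(q1, -p1, grad_v=lambda q: q, step=0.1, n_steps=1000)
```

The map is symplectic and time-reversible, so the energy error stays bounded over long rollouts instead of accumulating, which is the structural property a leapfrog-based predictor inherits.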
【4】Learning with Foresight: Enhancing Neural Routing Policy via Multi-Node Lookahead Prediction
标题:前瞻性学习:通过多节点前瞻预测增强神经路由策略
链接:https://arxiv.org/abs/2605.19975
作者:Xia Jiang,Yaoxin Wu,Yew-Soon Ong,Yingqian Zhang
备注:Accepted by the 35th International Joint Conference on Artificial Intelligence
摘要:Neural policies have shown promise in solving vehicle routing problems due to their reduced reliance on handcrafted heuristics. However, current training paradigms suffer from a fundamental limitation: they primarily focus on next-node prediction for solution construction, resulting in myopic decision-making that undermines long-horizon planning capacity. To this end, we introduce Multi-node Lookahead Prediction (MnLP), a novel training strategy that extends the supervised learning paradigm to predict multiple future nodes simultaneously. We incorporate causal and discardable MnLP modules that operate exclusively during training, facilitating models to anticipate multi-step decisions while preserving inference-time efficiency. By incorporating multi-depth auxiliary supervision into the loss function, MnLP equips neural policies with the ability of long-range contextual understanding. Experimentally, MnLP outperforms existing training methods, improving the generalization capability of neural policies across various problem sizes, distributions, and real-world benchmarks. Moreover, MnLP can be seamlessly integrated into diverse neural architectures without introducing additional inference overhead.
【5】A Closed-loop, State-centric, Multi-agent Framework for Passenger Load Estimation from Heterogeneous Data Streams
标题:用于从异类数据流中估计乘客负荷的闭环、以状态为中心的多代理框架
链接:https://arxiv.org/abs/2605.19834
作者:Yiyao Xu,Hao Zhou,Yuhang Wang,Jingran Sun
备注:Preprint version of a paper accepted by the 2026 IEEE 29th International Conference on Intelligent Transportation Systems (ITSC). 7 pages, 4 figures
摘要:To support operations and passenger-facing services, transit agencies need reliable passenger load trajectories. Currently, load estimates are typically inferred from imperfect sensing systems rather than fully observed, and the accuracy of modern automatic passenger counting (APC) systems still varies with station layout, flow intensity, and operating conditions. To address the challenges of robust passenger load estimation from heterogeneous data streams, including incremental count errors, evidence conflicts, and context-dependent sensor reliability, we propose a closed-loop, state-centric, multi-agent framework. This method enforces physical feasibility at every step, allocates trust dynamically among evidence sources, and feeds physics-derived violation residuals back into training for robustness improvement. The architecture consists of a unified stop-event backbone, a coupled Perception--Physical--Fusion loop for stop-by-stop inference, and optional trip-level macro-correction and closed-loop calibration modules.
【6】Beyond Extrapolation: Knowledge Utilization Paradigm with Bidirectional Inspiration for Time Series Forecasting
标题:超越外推:时间序列预测的双向知识利用范式
链接:https://arxiv.org/abs/2605.19249
作者:Liu Chong,Yingjie Zhou,Hao Li,Pengyang Wang,Qingsong Wen,Ce Zhu
备注:Accepted to ICML 2026. 18 pages, 6 figures
摘要:Time-series forecasting is critical in various scenarios, such as energy, transportation, and public health. However, most existing forecasters rely primarily on one-way inference, \textit{i.e.}, mapping \textbf{history} to \textbf{target}, and overlook the structural information provided by a revised natural chain (``\textbf{history} (model input) -- \textbf{target} (ground-truth output) -- \textbf{post-target continuation}''). The post-target continuation records how trajectories evolve after the target, which can help stabilize forecasting, but it is not observable at inference time. In this work, we aim to obtain an approximate proxy of the post-target continuation for the current input, providing structural knowledge for bidirectional forecasting. This idea is instantiated as KUP-BI (Knowledge Utilization Paradigm with Bidirectional Inspiration), a new time-series modeling paradigm that distills continuation-style knowledge (as an approximate post-target continuation proxy) from a \emph{train-only} historical library and integrates it into standard forecasting backbones. The input stream and the continuation-proxy stream are fused via a lightweight feature-level gating module. This design does not introduce information beyond what is already contained in the training trajectories; instead, it provides a structured inductive bias that helps backbones exploit typical continuation patterns rather than relying solely on parametric extrapolation. Experimental results on six public datasets show that KUP-BI consistently improves the forecasting performance of state-of-the-art models, with small additional overhead.
【7】DeRegiME: Deep Regime Mixtures for Probabilistic Forecasting under Distribution Shift
标题:DeRegiME:分布偏移下概率预测的深度机制混合
链接:https://arxiv.org/abs/2605.19231
作者:Kieran Wood,Stefan Zohren,Stephen J. Roberts
摘要:We introduce DeRegiME -- Deep Regime Mixture of Experts -- a direct multi-horizon probabilistic forecaster that separates latent uncertainty regimes from the underlying signal and softly assigns each forecast location to learned recurring regimes using a sparse variational Gaussian process (GP) whose nonstationary regime-mixing kernel and Student-t likelihood combine per-regime sub-kernels and noise processes via a shared gate. This yields a single sparse-GP posterior, not a mixture of GP experts. DeRegiME addresses a key limitation of neural forecasters: point forecasts discard residual uncertainty, and probabilistic heads -- whether single marginals, uninterpreted mixtures, quantile sets, or diffusion samples -- rarely expose the regime structure of the residual. Yet distribution shift in noisy heteroskedastic time series may be abrupt, gradual, or horizon-dependent and often appears in residual uncertainty rather than the conditional mean. DeRegiME yields an interpretable mean-residual-noise decomposition with a direct-sum feature-space representation that anchors regimes as clusters of residual similarity whose transitions surface as implicit changepoints. The effective number of regimes is pruned by the stick-breaking gate. We prove kernel validity and predictive-density propriety, and across ten benchmarks and three encoder grids DeRegiME improves negative log predictive density (NLPD) by 20.3% over the strongest encoder-matched baseline, a DeepAR/GluonTS-style dynamic Student-t head, with parallel gains on CRPS (3.0%) and MSE (4.7%). Improvements are consistent across all datasets, which span abrupt, gradual, and seasonal shifts.
【8】EgoTraj: Real-World Egocentric Human Trajectory Dataset for Multimodal Prediction
标题:EgoTraj:用于多模式预测的真实世界以自我为中心的人类轨迹数据集
链接:https://arxiv.org/abs/2605.19004
作者:Ahmad Yehia,Abduallah Mohamed,Tianyi Wang,Jiseop Byeon,Kun Qian,Junfeng Jiao,Christian Claudel
备注:21 pages, 14 figures. Project page: https://github.com/yehiahmad/EgoTraj
摘要:Accurately forecasting human trajectories from an egocentric perspective plays a central role in applications such as humanoid robotics, wearable sensing systems, and assistive navigation. However, progress in this direction remains limited due to the scarcity of egocentric trajectory datasets collected in real-world environments. Addressing this need, we introduce EgoTraj, an egocentric multimodal open dataset recorded using Meta Quest Pro (MQPro). EgoTraj contains 75 sequences of human navigation collected from multiple MQPro wearers in real-world urban environments. Each recording provides synchronized RGB video along with ground-truth data, including continuous time-synchronized 6-degree-of-freedom head poses, per-frame 3D eye gaze vectors, and scene annotations. To the best of our knowledge, EgoTraj differs from typical egocentric trajectory datasets by capturing long-horizon, self-directed navigation across diverse urban routes with broad participant diversity. To demonstrate the potential of the dataset, we benchmark several state-of-the-art methods for egocentric trajectory prediction and conduct ablation studies to analyze the contributions of gaze, scene, and motion cues. The results highlight the utility of EgoTraj for AR-based perception, navigation, and assistive systems. The EgoTraj dataset, code, and EgoViz Dashboard are publicly available at https://github.com/yehiahmad/EgoTraj.
【9】Does Your Wildfire Prediction Model Actually Work, or Just Score Well?
标题:您的野火预测模型实际上有效吗?还是只是得分很好?
链接:https://arxiv.org/abs/2605.18911
作者:Yangshuang Xu,Yuyang Dai,Liling Chang,Qi Wang,Yushun Dong
备注:25 pages
摘要:Wildfire prediction is important for early warning and resource allocation, yet existing Earth foundation models (Earth FMs) are pretrained for general atmospheric and geophysical objectives rather than wildfire forecasting. To address this gap, we introduce WILDFIRE-FM, the first foundation model pretrained specifically for wildfire prediction using weather, active-fire observations, topography, vegetation, and static environmental data. However, introducing a domain-specific backbone alone does not solve the evaluation problem: wildfire events are sparse in space and time, making transfer conclusions highly sensitive to matching rules and evaluation settings. To address this problem, we introduce a fixed-contract evaluation framework with two controlled checks: a fixed-output check for matching-rule effects and a fixed-feature check for head-selection effects. Under matched contracts, we compare WILDFIRE-FM with ten Earth-FM baselines across occupancy, spread, retrieval, and regression tasks. Our results show that wildfire transfer conclusions depend strongly on evaluation design and task formulation. We hope this framework and WILDFIRE-FM provide a foundation for future wildfire-specific Earth-FM research and benchmarking. Our code is available at https://anonymous.4open.science/r/Wildfire-fm-evaluation-contracts-5AE9/.
【10】Prediction Is Not Physics: Learning and Evaluating Conserved Quantities in Neural Simulators
标题:预测不是物理学:在神经模拟器中学习和评估守恒量
链接:https://arxiv.org/abs/2605.18883
作者:Andrew Bukowski,Aditya Kothari,Simba Shi,Ishir Rao
备注:10 pages
摘要:A diffusion model trained on Hamiltonian trajectories can achieve rollout MSE near $10^{-3}$, but the standard deviation of its energy over time is between 7500 and 36000 times larger than the ground-truth energy standard deviation, indicating a failure to preserve conservation laws. This gap motivates our central question of whether neural networks can learn or select globally conserved quantities from physical trajectories. We investigate this across three Hamiltonian systems: projectile motion, pendulum, and spring-mass. We use a structured $T(v)+V(q)$ energy model, a black-box Conservation Discovery Network (CDN), a polynomial CDN, and a conditional diffusion baseline. The structured network reaches $R^2 \geq 0.9999$ against analytical energy on clean data, while the black-box CDN reaches $R^2 \geq 0.996$ when trained with temporal consistency plus a small alignment loss to analytical energy at $t=0$ ($λ_{\mathrm{align}}=0.2$). With $λ_{\mathrm{align}}=0$, CDN Pearson $R^2$ collapses on pendulum and spring-mass ($< 10^{-3}$), showing that temporal consistency alone is not enough to reliably identify the true energy. Under $1\%$ additive Gaussian noise, the CDN outperforms the structured model on the projectile and spring-mass systems, suggesting that the CDN may be more robust to noisy inputs in this setting. However, the polynomial CDN is sensitive to training configuration: it achieves $R^2=0.78$ under a short training schedule on the pendulum system, but reaches $R^2=0.9998$ with more training time and data, regardless of whether noise is added.
【11】First-Passage Prediction of Grokking Delay: A Calibrated Law under AdamW with Causal Validation
标题:Grokking延迟的首达时间预测:AdamW下的校准定律与因果验证
链接:https://arxiv.org/abs/2605.18845
作者:Truong Xuan Khanh,Truong Quynh Hoa,Luu Duc Trung,Phan Thanh Duc
备注:51 pages, 7 figures, 6 tables. Preprint
摘要:We give the first quantitative prediction of grokking delay under AdamW. Treating the delay as a first-passage time, we derive a closed-form law T_grok - T_mem = (1 / 2 kappa_LL eta lambda) log(V_mem / V_star), where V_t = ||theta_t||^2 is the squared parameter norm, V_star is an architecture-dependent threshold, and kappa_LL absorbs the AdamW correction to the clean-SGD contraction rate 2 eta lambda. Calibrating (kappa_LL, V_star) on a single hyperparameter cell predicts grokking delays on 26 held-out runs with MAPE 17.7% over a 41x delay range; the law generalises to MLPs (MAPE 18.0%, N=34) and degrades to 23.3% on cross-task extension (N=46, 43.5x range), with a structured residual in which V_star / V_mem stays comparatively stable within architecture (CV about 14% on the 1L transformer). First-passage of V_t is necessary but not sufficient. A quantile-margin theorem establishes that positive delay requires both norm separation V_mem > V_post and angular reachability of a threshold alpha_star = arcsin(C / V_T_mem^(1/2)), where C is computable from the empirical NTK feature map and the validation-margin quantile. Calibrating C on modulus p=89 predicts alpha_star = 47.2 degrees at p=97 (observed 47.8 degrees, error 1.3%) as a prior cross-cell prediction. Causal interventions that freeze the norm or remove weight decay at memorisation eliminate grokking (0/6 vs. 3/3 baseline), trapping the angular displacement near 12 degrees. kappa_LL is empirically measured per architecture rather than derived from (beta_1, beta_2, epsilon); within-architecture CV stays at most 15% across four architectures, but values differ by about 2x between architectural variants beyond depth alone. Empirical scope is algorithmic tasks (modular arithmetic, sparse parity) under AdamW; whether the law transfers to natural-language scale models is open.
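The closed-form law in the abstract can be evaluated directly. The sketch below reads the prefactor as $1/(2\,\kappa_{LL}\,\eta\,\lambda)$, which matches the stated contraction rate $2\eta\lambda$ (a squared norm decaying as $V_{mem}\,e^{-2\kappa_{LL}\eta\lambda t}$ first reaches $V_\star$ at exactly this time). That reading is an assumption about the plain-text formula, and all numerical values are hypothetical rather than taken from the paper.

```python
import math

def grokking_delay(kappa_ll: float, eta: float, lam: float,
                   v_mem: float, v_star: float) -> float:
    """T_grok - T_mem = log(V_mem / V_star) / (2 * kappa_LL * eta * lambda)."""
    return math.log(v_mem / v_star) / (2.0 * kappa_ll * eta * lam)

# Hypothetical calibration: learning rate 1e-3, weight decay 1.0, kappa_LL = 1.
delay = grokking_delay(kappa_ll=1.0, eta=1e-3, lam=1.0, v_mem=100.0, v_star=25.0)

# Consistency check: the exponentially contracting norm reaches V_star at `delay`.
v_at_delay = 100.0 * math.exp(-2.0 * 1.0 * 1e-3 * 1.0 * delay)
```

Under this reading the delay scales inversely with the product of learning rate and weight decay and only logarithmically with the norm ratio, so halving the weight decay doubles the predicted delay.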
【12】An Integrated Forecasting Prototype for Emergency Department Boarding Time to Support Proactive Operational Decision Making
标题:急诊科滞留时间的集成预测原型,用于支持主动运营决策
链接:https://arxiv.org/abs/2605.18839
作者:Orhun Vural,Abdulaziz Ahmed,Ferhat Zengul,James Booth,Bunyamin Ozaydin
备注:22 pages, including supplementary materials
摘要:Overcrowding in emergency departments (ED) remains a persistent operational challenge worldwide, causing delays in care delivery and downstream congestion. ED boarding time, defined as the duration admitted patients remain in the ED while awaiting inpatient bed placement, is a key indicator of this congestion. Predicting ED boarding time in advance enables proactive operational decision making before congestion escalates. We developed and evaluated a multi-horizon time series forecasting framework to predict ED boarding time at 6, 8, 10, 12, and 24-hour horizons. Real-world data from a university-affiliated urban hospital in the United States were utilized and integrated with external contextual data sources, including weather, holidays, and major local events. Decomposition-based Linear (DLinear) and Normalization-based Linear (NLinear) time series forecasting deep learning models showed superior performance across multiple horizons. Models were also evaluated under extreme congestion scenarios characterized by elevated boarding times. In addition, a Machine Learning Operations (MLOps) web application prototype was developed to support translation of the forecasting framework into practice through integrated data ingestion, forecast visualization, experimentation, and retraining.
【13】StampFormer: A Physics-Guided Material-Geometry-Coupled Multimodal Model for Rapid Prediction of Physical Fields in Sheet Metal Stamping
标题:StampFormer:一种物理引导的材料-几何耦合多模态模型,用于快速预测板材冲压中的物理场
链接:https://arxiv.org/abs/2605.18835
作者:Jiajie Luo,Mohamed Mohamed,Osama Hassan,Haosu Zhou,Yingxue Zhao,Haoran Li,Xinrun Li,Zhutao Shao,Yang Long,Nan Li,Jichun Li
摘要:Traditional sheet metal forming relies on time-consuming and expensive Finite Element Analysis (FEA) for design validation, a process that significantly prolongs design cycles. While surrogate models offer faster iteration, current approaches have limitations: scalar-based methods cannot capture comprehensive field-based FEA results, while existing image-based models often ignore the critical role of material properties by focusing solely on geometry. To address this gap, we develop a physics-guided deep learning framework, namely StampFormer, which simultaneously uses component geometry and material stress-strain responses to predict FEA outcomes. The StampFormer framework uses three core components to process data. A Material-Augmented Geometric Network (MAGN) first fuses geometric and material data. This information is then integrated at various levels by a Hierarchical Material Embedding Injection Unit (HMEIU) before being processed by the primary network backbone, an adapted Swin-UNet. We evaluated our model on the stamping of a crossmember panel with two simulation datasets for steel and aluminium panels, and results demonstrate that StampFormer provides high-fidelity predictions of critical physical fields - including thinning, major strain, minor strain, plastic strain, and displacement - in under a second. Compared with ground truth FEA, our model achieved an average relative error of less than 8.5% on the four 2D fields and a mean squared error of less than 1.2 mm² for the 3D displacement field. In summary, we introduce a practical and efficient framework that integrates multimodal information, namely geometry and material properties, to provide fast and accurate predictions, enabling designers to perform real-time manufacturability assessments.
【14】Multi-Token Residual Prediction
Title: Multi-Token Residual Prediction
Link: https://arxiv.org/abs/2605.18817
Authors: Yufeng Xu, Zishuo Bao, Qian Wang, Zeshen Zhang, Haoqi Zhang, Bowen Peng, Ang Li, Rahul Chalamala, Yucheng Lu
Abstract: Diffusion Language Models (DLMs) generate text by iteratively denoising masked token sequences, offering a tradeoff between parallelism and quality compared to autoregressive models. In current practice, the number of tokens decoded per step is controlled by a confidence threshold, and quality degrades monotonically as more tokens are denoised per step. We introduce Multi-token Residual Prediction (MRP), a lightweight module that enables dependency-aware multi-token denoising within a single backbone forward pass. MRP exploits a key property of the denoising process: the logit distributions at adjacent denoising steps are remarkably similar. Rather than running the backbone a second time to obtain the next-step logits, MRP predicts the residual between steps from the backbone's hidden states, effectively denoising more tokens per backbone forward at a fraction of the cost. We deploy MRP in two inference modes: direct decoding, which uses the corrected logits without verification for a tunable quality-speed tradeoff; and speculative decoding, which verifies MRP's proposals against the backbone for lossless acceleration. Experiments on SDAR models at the 1.7B, 4B, and 8B scales across reasoning and code generation benchmarks demonstrate up to $1.42\times$ lossless speedup in SGLang.
【15】Dimensional Balance Improves Large Scale Spatiotemporal Prediction Performance
Title: Dimensional Balance Improves Large Scale Spatiotemporal Prediction Performance
Link: https://arxiv.org/abs/2605.18793
Authors: Jing Chen, Shixiang Pan, Yujie Fan, Haocheng Ye, Haitao Xu, Wenqiang Xu
Abstract: Accurate spatiotemporal pattern analysis is critical in fields such as urban traffic, meteorology, and public health monitoring. However, existing methods face performance bottlenecks, typically yielding only incremental gains and often exhibiting limited cross-domain transferability. We analyze this bottleneck through spatial and temporal entropy measures, which are used as diagnostic indicators of spatiotemporal complexity mismatch rather than as guarantees that entropy alignment alone yields better forecasting. Empirically, larger mismatch is often accompanied by higher prediction uncertainty, especially under a fixed model-capacity budget. Guided by this diagnostic, we propose a scalable, adaptive framework that harmonizes spatial and temporal feature representations. Spatial dimensionality is compressed via low-rank matrix embedding to preserve essential structure, while an extended temporal horizon captures long-range dependencies and mitigates cumulative errors arising from temporal heterogeneity. Extensive experiments on urban traffic, meteorological, and epidemic datasets demonstrate substantial accuracy gains and broad applicability across the evaluated domains, suggesting that the framework is promising for a wide range of spatiotemporal tasks beyond the current study. The code is available on GitHub at https://github.com/ST-Balance/ST-Balance.
【16】Beyond Prediction Accuracy: Target-Space Recovery Profiles for Evaluating Model-Brain Alignment
Title: Beyond Prediction Accuracy: Target-Space Recovery Profiles for Evaluating Model-Brain Alignment
Link: https://arxiv.org/abs/2605.20127
Authors: Ken Nakamura, Tomoya Nakai, Ryuto Yashiro, Ayumu Yamashita, Kaoru Amano
Comments: 34 pages, 12 figures, 5 tables
Abstract: Artificial vision models are often evaluated against the human visual cortex by measuring how accurately their internal representations predict brain responses. However, prediction accuracy alone does not indicate which dimensions of the target brain's response space are recovered. Here, we introduce a unified framework for evaluating both model-brain and brain-brain alignment by identifying the response dimensions recovered by prediction. Using repeated fMRI measurements, we first identify target-brain response dimensions that can be reproducibly predicted across independent trial splits. We then predict target-brain responses from either another subject's brain responses or a vision model's internal representations, and quantify how strongly each of these reproducible response dimensions is recovered. Applying this framework to a subset of the Natural Scenes Dataset, in which eight subjects viewed the same natural images during fMRI, we find that the early-to-intermediate visual-cortex responses contain a low-dimensional set of reproducible dimensions. Brain-to-brain comparisons identify which of these dimensions are consistently recoverable from other subjects' brains, providing a diagnostic human reference rather than only a scalar benchmark. In some cases, pretrained and randomly initialized models achieve similar prediction accuracy while showing distinct recovery profiles across these response dimensions. These results show that prediction accuracy alone can mask model-brain mismatches. By making explicit which reproducible brain response dimensions are recovered by prediction, our framework provides a more diagnostic evaluation of alignment between artificial vision models and the human visual cortex.
【17】Optimizing Computational-Statistical Runtime for Wasserstein Distance Estimation
Title: Optimizing Computational-Statistical Runtime for Wasserstein Distance Estimation
Link: https://arxiv.org/abs/2605.20122
Authors: Peter Matthew Jacobs, Jeff M. Phillips
Abstract: Squared Wasserstein distance is a frequently used tool to measure discrepancy between probability distributions. This distance is typically computed between empirical measures of size $n$ from two underlying random samples. Unfortunately, even in lower dimensional Euclidean space problems $\left( d \in \{2,3\} \right)$, algorithms for Wasserstein distance computation with approximate or exact precision guarantees scale poorly in the runtime as a function of $n$ and the desired precision. In response, we consider the computational-statistical runtime, where the goal is to estimate from samples the Wasserstein distance between potentially smooth measures up to $\varepsilon$-additive error in expectation with respect to the sampling; we allow $O(1)$ computational cost for collecting a sample. Towards this, we develop a Sample-Sketch-Solve paradigm where we introduce a regular Cartesian grid sketch of the samples. We show that (especially under $\alpha$-Hölder smooth distributions) this can compress the data without increasing asymptotic error, and also regularizes the structure which enables faster exact algorithms. Ultimately, we approximate $W_2^2(P,Q)$ within $\varepsilon$ error in $\varepsilon^{-\max(2,\frac{d+1+o(1)}{1+\alpha})}$ time for $0 < \alpha < 1$ Hölder smooth distributions $P,Q$ on $(0,1)^{d}$; an optimal $\Theta(\varepsilon^{-2})$ for $\alpha > 1/2$ when $d=2$ and nearly optimal as $\alpha \to 1$ when $d = 3$.
【18】Variance-Reduced Manifold Sampling via Polynomial-Maximization Density Estimation
Title: Variance-Reduced Manifold Sampling via Polynomial-Maximization Density Estimation
Link: https://arxiv.org/abs/2605.19938
Authors: Serhii Zabolotnii
Comments: 15 pages, 5 figures, 3 tables. Code supplement: https://github.com/SZabolotnii/Ku-PMM-MASEM-code-supplement
Abstract: Uniform sampling on implicitly defined manifolds is a core primitive in motion planning, constrained simulation, and probabilistic machine learning. MASEM addresses this problem by entropy-maximizing resampling, but its resampling weights depend on a local k-nearest-neighbour density estimate whose errors can be amplified by aggressive resampling temperatures. We ask whether a polynomial-maximization moment estimator can replace the plug-in density rule without changing the surrounding MASEM architecture. The proposed PMM-MASEM module computes shell spacings from nested k-nearest-neighbour radii, estimates their standardized cumulants, and uses a gated PMM2/PMM3 estimator only when the spacing distribution departs from the flat Exp(1) regime; otherwise it falls back to the plug-in/MLE rule. This fallback is essential: on a flat homogeneous manifold the plug-in estimator is already the MLE, so PMM should not outperform it. A local Known-DGP Monte Carlo experiment confirms this gate: the selector returns MLE on flat Exp(1) spacings and reduces density MSE by 22-36% on asymmetric gamma and boundary-spacing regimes. The evidence is not uniformly positive: PMM3 worsens a platykurtic uniform spacing law, and a lightweight resampling-proxy experiment improves seven-lobes coverage but degrades the sine and swiss-roll proxies. The current evidence therefore supports an applicability-boundary result rather than a general MASEM improvement claim.
【19】Probabilistic Multivariate Time Series Forecasting with Diffusion Copulas
Title: Probabilistic Multivariate Time Series Forecasting with Diffusion Copulas
Link: https://arxiv.org/abs/2605.19685
Authors: David Huk, Dongshan Wang, Miha Bresar
Comments: ICLR 2026 Workshop Advances in Financial AI
Abstract: Accurately assessing financial risk requires capturing both individual asset volatility and the complex, asymmetric dependence structures that emerge during extreme market events. While modern diffusion-based models have advanced multivariate forecasting, they often suffer from a "normality bias" when trained end-to-end, sacrificing marginal calibration for joint coherence and consistently underestimating tail risk. To address this, we propose a Diffusion-Copula framework that explicitly decouples the learning of marginal distributions from their dependence structure. We employ deep Mixture Density Networks to capture heavy-tailed asset dynamics, followed by a Classification-Diffusion Copula to model the joint dependence. Applied to cryptocurrency markets, our approach demonstrates superior performance over state-of-the-art baselines in forecasting systemic extremes of both marginal and joint events. Crucially, we demonstrate that while baseline models classify simultaneous market crashes as statistically impossible "Black Swans" (high surprise), our framework identifies them as "Expected Crashes" (low surprise), successfully preserving the correlation structure necessary for robust risk management during contagion events.
【20】Information Processing Capacity of Stationary Physical Systems: Theory, Data-efficient Estimation Methods, and Photonic Demonstration
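Title: Information Processing Capacity of Stationary Physical Systems: Theory, Data-efficient Estimation Methods, and Photonic Demonstration
Link: https://arxiv.org/abs/2605.19152
Authors: Rahul Uma Ramachandran, Serge Massar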
【21】Conformal Prediction via Transported Beta Laws
Title: Conformal Prediction via Transported Beta Laws
Link: https://arxiv.org/abs/2605.19024
Authors: Thiago R. Ramos, Helton Graziadei, Luben M. C. Cabezas
【22】From Division to Decision: Leveraging Temporal Cell-Stage Segmentation for Embryo Transferability Prediction
Title: From Division to Decision: Leveraging Temporal Cell-Stage Segmentation for Embryo Transferability Prediction
Link: https://arxiv.org/abs/2605.18923
Authors: Yasmine Hachani, Patrick Bouthemy, Elisa Fromont, Véronique Duranthon, Ludivine Laffont, Alline de Paula Reis
Other neural networks | deep learning | models | modeling (45 papers)
【1】When Does Model Collapse Occur in Structured Interactive Learning?
Title: When Does Model Collapse Occur in Structured Interactive Learning?
Link: https://arxiv.org/abs/2605.20151
Authors: Yuchen Wu, Kangjie Zhou, Weijie Su
Comments: 57 pages, 12 figures
【2】Normative Networks for Source Separation via Local Plasticity and Dendritic Computation
Title: Normative Networks for Source Separation via Local Plasticity and Dendritic Computation
Link: https://arxiv.org/abs/2605.19965
Authors: Bariscan Bozkurt, Efe Ali Gorguner, Francesco Innocenti, Rafal Bogacz
【3】Learning Orthonormal Bases for Function Spaces
Title: Learning Orthonormal Bases for Function Spaces
Link: https://arxiv.org/abs/2605.19959
Authors: Hamidreza Kamkari, Mohammad Sina Nabizadeh, Justin Solomon
【4】Exploiting Non-Negativity in DAG Structure Learning
Title: Exploiting Non-Negativity in DAG Structure Learning
Link: https://arxiv.org/abs/2605.19947
Authors: Samuel Rey, Madeline Navarro, Gonzalo Mateos
【5】Set-Valued Policy Learning
Title: Set-Valued Policy Learning
Link: https://arxiv.org/abs/2605.19830
Authors: Laura Fuentes-Vicente, Mathieu Even, Gaëlle Dormion, Antoine Chambaz, Uri Shalit, Julie Josse
【6】Stitched Value Model for Diffusion Alignment
Title: Stitched Value Model for Diffusion Alignment
Link: https://arxiv.org/abs/2605.19804
Authors: Hyojun Go, Hyungjin Chung, Prune Truong, Goutam Bhat, Li Mi, Zhaochong An, Zixiang Zhao, Dominik Narnhofer, Serge Belongie, Federico Tombari, Konrad Schindler
Comments: Project page: https://gohyojun15.github.io/StitchVM/
【7】Awakening the Hydra: Stabilizing Multi-Concept Backdoor Injection in Text-to-Image Diffusion Models
Title: Awakening the Hydra: Stabilizing Multi-Concept Backdoor Injection in Text-to-Image Diffusion Models
Link: https://arxiv.org/abs/2605.19698
Authors: Kai Wang, Jiale Zhang, Chengcheng Zhu, Chuang Ma, Songze Li
Comments: Preprint. 18 pages
【8】MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models
Title: MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models
Link: https://arxiv.org/abs/2605.19619
Authors: Feihu Huang, Yuning Luo, Songcan Chen
Comments: 25 pages
【9】Provable Fairness Repair for Deep Neural Networks
Title: Provable Fairness Repair for Deep Neural Networks
Link: https://arxiv.org/abs/2605.19549
Authors: Jianan Ma, Jingyi Wang, Qi Xuan, Zhen Wang
Comments: 15 pages, 6 figures, 7 tables. Full version of the paper accepted by ASE 2025
【10】Boosting Text-to-Image Diffusion Models via Core Token Attention-Based Seed Selection
Title: Boosting Text-to-Image Diffusion Models via Core Token Attention-Based Seed Selection
Link: https://arxiv.org/abs/2605.19532
Authors: Yunzhe Zhang, Hongfu Liu, Pengyu Hong
Comments: Preprint
【11】Base Models Look Human To AI Detectors
Title: Base Models Look Human To AI Detectors
Link: https://arxiv.org/abs/2605.19516
Authors: Yixuan Even Xu, Ziqian Zhong, Aditi Raghunathan, Fei Fang, J. Zico Kolter
Comments: 39 pages, 9 figures
【12】Implicit Bias of Mirror Flow in Homogeneous Neural Networks: Sparse and Dense Feature Learning
Title: Implicit Bias of Mirror Flow in Homogeneous Neural Networks: Sparse and Dense Feature Learning
Link: https://arxiv.org/abs/2605.19458
Authors: Tom Jacobs, Guido Montufar
Comments: 36 pages, 14 figures
【13】Unlocking the Potential of Continual Model Merging: An ODE Perspective
Title: Unlocking the Potential of Continual Model Merging: An ODE Perspective
Link: https://arxiv.org/abs/2605.19409
Authors: Lihong Lin, Haidong Kang
Comments: 21 pages, 8 figures
【14】Conflict-Free Replicated Data Types for Neural Network Model Merging: A Two-Layer Architecture Enabling CRDT-Compliant Model Merging Across 26 Strategies
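Title: Conflict-Free Replicated Data Types for Neural Network Model Merging: A Two-Layer Architecture Enabling CRDT-Compliant Model Merging Across 26 Strategies
Link: https://arxiv.org/abs/2605.19373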
【15】HalluWorld: A Controlled Benchmark for Hallucination via Reference World Models
Title: HalluWorld: A Controlled Benchmark for Hallucination via Reference World Models
Link: https://arxiv.org/abs/2605.19341
Authors: Emmy Liu, Varun Gangal, Michael Yu, Zhuofu Tao, Karan Singh, Sachin Kumar, Steven Y. Feng
Comments: HalluWorld benchmark (code and data) at github.com/DegenAI-Labs/HalluWorld
【16】From Simple to Complex: Curriculum-Guided Physics-Informed Neural Networks via Gaussian Mixture Models
Title: From Simple to Complex: Curriculum-Guided Physics-Informed Neural Networks via Gaussian Mixture Models
Link: https://arxiv.org/abs/2605.19263
Authors: Jianan Yang, Yiran Wang, Shuai Li, Fujun Cao, Xuefei Yan, Junmin Liu
Comments: 23 pages, 15 figures
【17】On-Device Continual Learning with Dual-Stage Buffer and Dynamic Loss for Point-of-Care Pneumonia Diagnosis
Title: On-Device Continual Learning with Dual-Stage Buffer and Dynamic Loss for Point-of-Care Pneumonia Diagnosis
Link: https://arxiv.org/abs/2605.19201
Authors: Danu Kim
Comments: Presented at 32nd Samsung Humantech Paper Awards
【18】Bridge: Retrieval-Augmented Spatiotemporal Modeling for Urban Delivery Demand
Title: Bridge: Retrieval-Augmented Spatiotemporal Modeling for Urban Delivery Demand
Link: https://arxiv.org/abs/2605.19172
Authors: Yihong Tang, Tong Nie, Junlin He, Qianjun Huang, Dingyi Zhuang, Lijun Sun
【19】COBALT: Crowdsourcing Robot Learning via Cloud-Based Teleoperation with Smartphones
Title: COBALT: Crowdsourcing Robot Learning via Cloud-Based Teleoperation with Smartphones
Link: https://arxiv.org/abs/2605.19138
Authors: Ayush Agarwal, Ansh Gandhi, Jeremy A. Collins, Omar Rayyan, Aryan Sarswat, Ranjani Koushik, Masoud Moghani, Ajay Mandlekar, Animesh Garg
【20】EgoBabyVLM: Benchmarking Cross-Modal Learning from Naturalistic Egocentric Video Data
Title: EgoBabyVLM: Benchmarking Cross-Modal Learning from Naturalistic Egocentric Video Data
Link: https://arxiv.org/abs/2605.19130
Authors: Dongyan Lin, Phillip Rust, Angel Villar Corrales, Alvin W. M. Tan, Mahi Luthra, Charles-Éric Saint-James, Rashel Moritz, Sheila Krogh-Jespersen, Vanessa Stark, Surya Parimi, Jiayi Shen, Youssef Benchekroun, Yosuke Higuchi, Martin Gleize, Tom Fizycki, Nicolas Hamilakis, Manel Khentout, Sho Tsuji, Balázs Kégl, Juan Pino, Michael C. Frank, Emmanuel Dupoux
【21】Chessformer: A Unified Architecture for Chess Modeling
Title: Chessformer: A Unified Architecture for Chess Modeling
Link: https://arxiv.org/abs/2605.19091
Authors: Daniel Monroe, George Eilender, Philip Chalmers, Zhenwei Tang, Ashton Anderson
Comments: International Conference on Learning Representations (2026)
【22】Riemannian Networks over Full-Rank Correlation Matrices
Title: Riemannian Networks over Full-Rank Correlation Matrices
Link: https://arxiv.org/abs/2605.19073
Authors: Ziheng Chen, Xiaojun Wu, Bernhard Schölkopf, Nicu Sebe
Comments: Accepted to ICML 2026
【23】Learning When to Adapt
Title: Learning When to Adapt
Link: https://arxiv.org/abs/2605.19028
Authors: Ali Zindari, Xiaowen Jiang, Rotem Mulayoff, Sebastian U. Stich
Comments: Preprint
【24】Learn-by-Wire Training Control Governance: Bounded Autonomous Training Under Stress for Stability and Efficiency
Title: Learn-by-Wire Training Control Governance: Bounded Autonomous Training Under Stress for Stability and Efficiency
Link: https://arxiv.org/abs/2605.19008
【25】TabQL: In-Context Q-Learning with Tabular Foundation Models
Title: TabQL: In-Context Q-Learning with Tabular Foundation Models
Link: https://arxiv.org/abs/2605.18979
Authors: Qisai Liu, Zhanhong Jiang, Timilehin Ayanlade, Ashutosh Kumar Nirala, Yang Li, Aditya Balu, Soumik Sarkar
【26】Shaping the Prior: How Synthetic Task Distributions Determine Tabular Foundation Model Quality
Title: Shaping the Prior: How Synthetic Task Distributions Determine Tabular Foundation Model Quality
Link: https://arxiv.org/abs/2605.18971
Authors: Mohamed Bouadi, Nassim Bouarour, Varun Kulkarni, Shivam Dubey, Aditya Tanna, Vinay Kumar Sankarapu
【27】Stability and Discretization Error of State Space Model Neural Operators
Title: Stability and Discretization Error of State Space Model Neural Operators
Link: https://arxiv.org/abs/2605.18905
Authors: Abderrahim Bendahi, Adrien Fradin, Johan Peralez, Julie Digne, Madiha Nadri
【28】Dynamic Model Merging Made Slim
Title: Dynamic Model Merging Made Slim
Link: https://arxiv.org/abs/2605.18904
【29】Soft Learning
Title: Soft Learning
Link: https://arxiv.org/abs/2605.18889
Authors: Mohammed Aledhari, Ali Aledhari, Fatimah Aledhari, Mohamed Rahouti
【30】EVA-0: Test-Time Model Evolution with Only Two Forward Passes per Sample
Title: EVA-0: Test-Time Model Evolution with Only Two Forward Passes per Sample
Link: https://arxiv.org/abs/2605.18867
Authors: Guohao Chen, Shuaicheng Niu, Geng Li, Yunbei Zhang, Shilin Shan, Chunyan Miao, Jianfei Yang
【31】When Individually Calibrated Models Become Collectively Miscalibrated
Title: When Individually Calibrated Models Become Collectively Miscalibrated
Link: https://arxiv.org/abs/2605.18858
Authors: Zhaohui Wang
Comments: 42 pages, 1 main figure, multiple tables. Accepted at ProbML 2026
【32】The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next
Title: The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next
Link: https://arxiv.org/abs/2605.18840
Authors: Adil Amin
Comments: 13 pages, 5 figures, 4 tables. Companion paper: "Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling." Code: https://github.com/adilamin89/cape-scaling. Dashboard: https://zehenlabs.com/cape/
【33】In-Context Learning Operates as Concept Subspace Learning
Title: In-Context Learning Operates as Concept Subspace Learning
Link: https://arxiv.org/abs/2605.18830
Authors: Wei Tang, Xinyan Jiang, Fakhri Karray, Lijie Hu
【34】Composition of Memory Experts for Diffusion World Models
Title: Composition of Memory Experts for Diffusion World Models
Link: https://arxiv.org/abs/2605.18813
Authors: Sebastian Stapf, Pablo Acuaviva Huertos, Aram Davtyan, Paolo Favaro
【35】Metric-Gradient Projection for Stable Multi-Agent Policy Learning
Title: Metric-Gradient Projection for Stable Multi-Agent Policy Learning
Link: https://arxiv.org/abs/2605.18809
Authors: Zuyuan Zhang, Sizhe Tang, Mahdi Imani, Tian Lan
【36】Density-Ratio Losses for Post-Hoc Learning to Defer
Title: Density-Ratio Losses for Post-Hoc Learning to Defer
Link: https://arxiv.org/abs/2605.19557
Authors: Alexander Soen, Ragnar Thobaben, Joakim Jaldén, Richard Nock
Comments: Preprint
【37】Tweedie's Formulae and Diffusion Generative Models Beyond Gaussian
Title: Tweedie's Formulae and Diffusion Generative Models Beyond Gaussian
Link: https://arxiv.org/abs/2605.19391
Authors: Wenpin Tang, Nizar Touzi, Zikun Zhang, Xun Yu Zhou
Comments: 27 pages, 18 figures
【38】A Cloud-Based Tool for Meteorite Recovery Using Drones and Machine Learning
Title: A Cloud-Based Tool for Meteorite Recovery Using Drones and Machine Learning
Link: https://arxiv.org/abs/2605.19179
Authors: Seamus L. Anderson, Hadrien A. R. Devillepoix, Lewis Lakerink, Sawitchaya Tippaya, Dale P. Giancono, Martin C. Towner, Iona Clemente, Martin Cupák, Ashley F. Rogers, John H. Fairweather, Mia Walker, Daniel Burgin, Michael A. Frazer, Sophie E. Deam, Veronika Pazderová, Eleanor K. Sansom, Benjamin A. D. Hartig, Hely C. Branco, Thomas Stevenson, Isabella Hatty, Anna Zappatini, Anthony Lagain, Tom Lovelock, Auriane Egal, Lucy Forman, David Belton, Simon Windsor, Shibli Saleheen, Asher Leslie, Gregory B. Poole, Andrew Langendam, Rachel S. Kirby, Andrew G. Tomkins
Comments: 23 pages, 3 figures
【39】Activation Functions, Statistics and Learning of Higher-Order Interactions in Restricted Boltzmann Machines
Title: Activation Functions, Statistics and Learning of Higher-Order Interactions in Restricted Boltzmann Machines
Link: https://arxiv.org/abs/2605.19178
Authors: Giovanni di Sarra, Yasser Roudi
Comments: 38 pages, 27 figures
【40】Reducing Diffusion Model Memorization with Higher Order Langevin Dynamics
Title: Reducing Diffusion Model Memorization with Higher Order Langevin Dynamics
Link: https://arxiv.org/abs/2605.19170
Authors: Benjamin Sterling, Mónica F. Bugallo, Tom Tirer
【41】Atomistic Modeling of Chemical Disorder in Materials: Bridging Classical Methods and AI-Assisted Approaches
Title: Atomistic Modeling of Chemical Disorder in Materials: Bridging Classical Methods and AI-Assisted Approaches
Link: https://arxiv.org/abs/2605.19124
Authors: Jiayu Peng, Peichen Zhong
【42】Dual-Channel Tensor Neural Networks: Finite-Sample Theory and Conformal Structure Selection
Title: Dual-Channel Tensor Neural Networks: Finite-Sample Theory and Conformal Structure Selection
Link: https://arxiv.org/abs/2605.19122
Authors: Elynn Chen, Jiayu Li, Zheshi Zheng, Jian Pei
【43】Markov Chain Decoders Overcome the Heavy-Tail Limitations of Lipschitz Generative Models
Title: Markov Chain Decoders Overcome the Heavy-Tail Limitations of Lipschitz Generative Models
Link: https://arxiv.org/abs/2605.18931
Authors: Abdelhakim Ziani, Andras Horvath, Paolo Ballarini
【44】A Logistic Regression Model to Predict Malaria Severity in Children
Title: A Logistic Regression Model to Predict Malaria Severity in Children
Link: https://arxiv.org/abs/2605.18900
Authors: Mary Opokua Ansong, Asare Yaw Obeng, Samuel King Opoku
【45】Noise scheduling and linear dynamics in diffusion models on Lie groups
Title: Noise scheduling and linear dynamics in diffusion models on Lie groups
Link: https://arxiv.org/abs/2605.17326
Authors: Javad Komijani
Comments: 5 pages
Other (68 papers)
【1】k-Inductive Neural Barrier Certificates for Unknown Nonlinear Dynamics
Title: k-Inductive Neural Barrier Certificates for Unknown Nonlinear Dynamics
Link: https://arxiv.org/abs/2605.20108
Authors: Ben Wooding, Hongchao Zhang, Taylor T. Johnson, Abolfazl Lavaei
Comments: 18 pages, 5 figures, 3rd International Conference on Neuro-Symbolic Systems (NeuS)
【2】Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding
Title: Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding
Link: https://arxiv.org/abs/2605.20104
Authors: Yuhao Shen, Tianyu Liu, Xinyi Hu, Quan Kong, Baolin Zhang, Jun Dai, Jun Zhang, Shuang Ge, Lei Chen, Yue Li, Mingcheng Wan, Cong Wang
【3】What Do Evolutionary Coding Agents Evolve?
Title: What Do Evolutionary Coding Agents Evolve?
Link: https://arxiv.org/abs/2605.20086
Authors: Nico Pelleriti, Sree Harsha Nelaturu, Zhanke Zhou, Zongze Li, Max Zimmer, Bo Han, Sebastian Pokutta
Comments: 28 pages, 12 figures, 12 tables
【4】Probability-Conserving Flow Guidance
Title: Probability-Conserving Flow Guidance
Link: https://arxiv.org/abs/2605.20079
Authors: Parsa Esmati, Junha Hyung, Amirhossein Dadashzadeh, Jaegul Choo, Majid Mirmehdi
【5】Smooth Partial Lotteries for Stable Randomized Selection
Title: Smooth Partial Lotteries for Stable Randomized Selection
Link: https://arxiv.org/abs/2605.20069
Authors: Alexander Goldberg, Giulia Fanti, Nihar B. Shah
【6】Active Context Selection Improves Simple Regret in Contextual Bandits
Title: Active Context Selection Improves Simple Regret in Contextual Bandits
Link: https://arxiv.org/abs/2605.20040
Authors: Mohammad Shahverdikondori, Jalal Etesami, Negar Kiyavash
【7】Training-Free Bayesian Filtering with Generative Emulators
Title: Training-Free Bayesian Filtering with Generative Emulators
Link: https://arxiv.org/abs/2605.20028
Authors: Thomas Savary, François Rozet, Gilles Louppe
Comments: Accepted as a spotlight paper at the International Conference on Machine Learning 2026
【8】Minimalist Visual Inertial Odometry
Title: Minimalist Visual Inertial Odometry
Link: https://arxiv.org/abs/2605.19990
Authors: Francesco Pasti, Jeremy Klotz, Nicola Bellotto, Shree K. Nayar
Comments: This work has been submitted to the IEEE for possible publication
【9】Block-Sphere Vector Quantization
Title: Block-Sphere Vector Quantization
Link: https://arxiv.org/abs/2605.19972
Authors: Heesang Ann, Joongkyu Lee, Min-hwan Oh
【10】Real-Time Parallel Counterfactual Regret Minimization
Title: Real-Time Parallel Counterfactual Regret Minimization
Link: https://arxiv.org/abs/2605.19928
Authors: Boning Li, Longbo Huang
Comments: 13 pages, 3 figures
【11】JAXenstein: Accelerated Benchmarking for First-Person Environments
Title: JAXenstein: Accelerated Benchmarking for First-Person Environments
Link: https://arxiv.org/abs/2605.19926
Authors: Ruo Yu Tao, George Konidaris
Comments: Main paper: 5 pages, supplementary material: 3 pages
【12】StableGrad: Backward Scale Control without Batch Normalization
Title: StableGrad: Backward Scale Control without Batch Normalization
Link: https://arxiv.org/abs/2605.19856
Authors: Jose I. Mestre, Alberto Fernández-Hernández, Cristian Pérez-Corral, Manuel F. Dolz, Enrique S. Quintana-Ortí
【13】Auditing Privacy in Multi-Tenant RAG under Account Collusion
Title: Auditing Privacy in Multi-Tenant RAG under Account Collusion
Link: https://arxiv.org/abs/2605.19847
Authors: Florian A. D. Burnat, Brittany I. Davidson
【14】Smooth Piecewise Cutting for Neural Operator to Handle Discontinuities and Sharp Transitions
Title: Smooth Piecewise Cutting for Neural Operator to Handle Discontinuities and Sharp Transitions
Link: https://arxiv.org/abs/2605.19823
Authors: Ha Dang, Sebastian Schmidt, Juergen Hesser
【15】ST-TGExplainer: Disentangling Stability and Transition Patterns for Temporal GNN Interpretability
Title: ST-TGExplainer: Disentangling Stability and Transition Patterns for Temporal GNN Interpretability
Link: https://arxiv.org/abs/2605.19822
Authors: Hongjiang Chen, Xin Zheng, Pengfei Jiao, Huan Liu, Zhidong Zhao, Huaming Wu, Feng Xia, Shirui Pan
【16】FLUXtrapolation: A benchmark on extrapolating ecosystem fluxes
Title: FLUXtrapolation: A benchmark on extrapolating ecosystem fluxes
Link: https://arxiv.org/abs/2605.19812
Authors: Anya Fries, Jacob A Nelson, Martin Jung, Markus Reichstein, Jonas Peters
【17】LionMuon: Alternating Spectral and Sign Descent for Efficient Training
Title: LionMuon: Alternating Spectral and Sign Descent for Efficient Training
Link: https://arxiv.org/abs/2605.19811
Authors: Arman Bolatov, Artem Riabinin, Nikita Kornilov, Andrey Veprikov, Samuel Horváth, Martin Takáč, Aleksandr Beznosikov
Comments: 38 pages, 13 figures, 4 tables
【18】Latent Laplace Diffusion for Irregular Multivariate Time Series
Title: Latent Laplace Diffusion for Irregular Multivariate Time Series
Link: https://arxiv.org/abs/2605.19805
Authors: Zinuo You, Jin Zheng, John Cartlidge
Comments: Camera-ready Spotlight paper at ICML 2026. 27 pages, 5 figures. Code: https://github.com/pixelhero98/LLapDiffusion
【19】AR1-ZO: Topology-Aware Rank-1 Zeroth-Order Queries for High-Rank LoRA Fine-Tuning
Title: AR1-ZO: Topology-Aware Rank-1 Zeroth-Order Queries for High-Rank LoRA Fine-Tuning
Link: https://arxiv.org/abs/2605.19767
Authors: Ziye Chen, Hongbin Lin, Chenyu Zhang, Xiangda Yan, Yongjie Yang, Yao Shu
【20】Operationalising Artificial Intelligence Bills of Materials (AIBOMs) for Verifiable AI Provenance and Lifecycle Assurance
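Title: Operationalising Artificial Intelligence Bills of Materials (AIBOMs) for Verifiable AI Provenance and Lifecycle Assurance
Link: https://arxiv.org/abs/2605.19755
Authors: Petar Radanliev, Omar Santos, Carsten Maple, Kay Atefi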
【21】Agentic Discovery of Cryomicroneedle Formulations
Title: Agentic Discovery of Cryomicroneedle Formulations
Link: https://arxiv.org/abs/2605.19677
Authors: Hao Li, Lifu Du, Nurul Hameed, Shemonti Saha Authai, Zlata Stefanovic, Chenjie Xu
【22】optimize_anything: A Universal API for Optimizing any Text Parameter
Title: optimize_anything: A Universal API for Optimizing any Text Parameter
Link: https://arxiv.org/abs/2605.19633
Authors: Lakshya A Agrawal, Donghyun Lee, Shangyin Tan, Wenjie Ma, Karim Elmaaroufi, Rohit Sandadi, Sanjit A. Seshia, Koushik Sen, Dan Klein, Ion Stoica, Joseph E. Gonzalez, Omar Khattab, Alexandros G. Dimakis, Matei Zaharia
Comments: 16 pages, 11 figures; Blog: https://gepa-ai.github.io/gepa/blog/2026/02/18/introducing-optimize-anything/
【23】Spectral Integrated Gradients for Coarse-to-Fine Feature Attribution
Title: Spectral Integrated Gradients for Coarse-to-Fine Feature Attribution
Link: https://arxiv.org/abs/2605.19607
Authors: Soyeon Kim, Seongwoo Lim, Kyowoon Lee, Jaesik Choi
Comments: 21 pages, 13 figures, 9 tables. Accepted to ACM KDD 2026; includes appendix
【24】Online Market Making and the Value of Observing the Order Book
Title: Online Market Making and the Value of Observing the Order Book
Link: https://arxiv.org/abs/2605.19584
Authors: Davide Maran, Marcello Restelli
Comments: Accepted at COLT2026
【25】TORQ: Two-Level Orthogonal Rotation for MXFP4 Quantization
Title: TORQ: Two-Level Orthogonal Rotation for MXFP4 Quantization
Link: https://arxiv.org/abs/2605.19561
Authors: Zukang Xu, Xing Hu, Dawei Yang
Comments: 17 pages, 4 figures, 13 tables
【26】A dynamical systems view of training generative models and the memorization phenomenon
Title: A dynamical systems view of training generative models and the memorization phenomenon
Link: https://arxiv.org/abs/2605.19483
Authors: Siva Athreya, Chiranjib Bhattacharya, Vivek S. Borkar
Comments: 12 pages
【27】When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR
Title: When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR
Link: https://arxiv.org/abs/2605.19425
Authors: Yuchun Miao, Sen Zhang, Yuqi Zhang, Yaorui Shi, Qi Gu, Xunliang Cai, Lefei Zhang
Comments: 23 pages, 10 figures
【28】A Bitter Lesson for Data Filtering
Title: A Bitter Lesson for Data Filtering
Link: https://arxiv.org/abs/2605.19407
Authors: Christopher Mohri, John Duchi, Tatsunori Hashimoto
【29】TIDE: Asymmetric Neural Circuits for Stabilized Temporal Inhibitory-Excitatory Dynamics
Title: TIDE: Asymmetric Neural Circuits for Stabilized Temporal Inhibitory-Excitatory Dynamics
Link: https://arxiv.org/abs/2605.19403
Authors: Alexander Kyuroson, Denis Kleyko, Marcus Liwicki
【30】An Exterior Method for Nonnegative Matrix Factorization
Title: An Exterior Method for Nonnegative Matrix Factorization
Link: https://arxiv.org/abs/2605.19325
Authors: Qiujing Lu, Tonmoy Monsoor, Ehsan Ebrahimzadeh, Kartik Sharma, Vwani Roychowdhury
Comments: Accepted to ICML 2026
【31】BrainDyn: A Sheaf Neural ODE for Generative Brain Dynamics
Title: BrainDyn: A Sheaf Neural ODE for Generative Brain Dynamics
Link: https://arxiv.org/abs/2605.19324
Authors: Siddharth Viswanath, Panayiotis Ketonis, Chen Liu, Michael Perlmutter, Dhananjay Bhaskar, Smita Krishnaswamy
【32】Matérn Noise for Triangulation-Agnostic Flow Matching on Meshes
Title: Matérn Noise for Triangulation-Agnostic Flow Matching on Meshes
Link: https://arxiv.org/abs/2605.19305
Authors: Tianshu Kuai, Arman Maesumi, Daniel Ritchie, Noam Aigerman
Comments: In ACM Transactions on Graphics (SIGGRAPH 2026). Project page: https://matern-fm.github.io/
【33】EviTrack: Selection over Sampling for Delayed Disambiguation
Title: EviTrack: Selection over Sampling for Delayed Disambiguation
Link: https://arxiv.org/abs/2605.19283
Authors: Omer Haq
Comments: https://github.com/Haq94/EviTrack
【34】Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR
Title: Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR
Link: https://arxiv.org/abs/2605.19282
Authors: Chongyu Fan, Gaowen Liu, Mingyi Hong, Ramana Rao Kompella, Sijia Liu
【35】Euclidean Embedding of Data Using Local Distances
Title: Euclidean Embedding of Data Using Local Distances
Link: https://arxiv.org/abs/2605.19243
【36】Robust Mitigation of Age-Dependent Confounding Effects via Sample-Difficulty Decorrelation
Title: Robust Mitigation of Age-Dependent Confounding Effects via Sample-Difficulty Decorrelation
Link: https://arxiv.org/abs/2605.19230
Authors: Nikhil Cherian Kurian, Victor Caquilpan Parra, Abin Shoby, Luke Whitbread, Lyle J. Palmer
Comments: 10 pages, 3 figures
【37】A Heuristic Approach for Performance Tuning in RL-based Quadrotor Control via Reward Design and Termination Conditions
Title: A Heuristic Approach for Performance Tuning in RL-based Quadrotor Control via Reward Design and Termination Conditions
Link: https://arxiv.org/abs/2605.19166
Authors: Fausto Mauricio Lagos Suarez, Akshit Saradagi, Vidya Sumathy, George Nikolakopoulos
Comments: Accepted in the 34th Mediterranean Conference on Control and Automation
【38】How Far Are We From True Auto-Research?
Title: How Far Are We From True Auto-Research?
Link: https://arxiv.org/abs/2605.19156
Authors: Zhengxin Zhang, Ning Wang, Sainyam Galhotra, Claire Cardie
【39】PMF-CL: Pareto-Minimal-Forgetting Continual Learner for Conflicting Tasks
Title: PMF-CL: Pareto-Minimal-Forgetting Continual Learner for Conflicting Tasks
Link: https://arxiv.org/abs/2605.19145
Authors: Srijith Nair, Atilla Eryilmaz, Jia Liu
Comments: 25 pages, 4 figures, 4 algorithms
【40】The impact of observation density on Bayesian inversion of latent dynamics in shock-dominated flows
Title: The impact of observation density on Bayesian inversion of latent dynamics in shock-dominated flows
Link: https://arxiv.org/abs/2605.19076
Authors: Bipin Tiwari, Muhammad Abid, Omer San
【41】Mapping Uncharted Symmetries: Machine Discovery in Combinatorics
Title: Mapping Uncharted Symmetries: Machine Discovery in Combinatorics
Link: https://arxiv.org/abs/2605.19063
Authors: Eugenio Cainelli, Lorenzo Luccioli, Alessandro Iraci, Michele D'Adderio, Giovanni Paolini
Comments: 20 pages
【42】KVBuffer: IO-aware Serving for Linear Attention
Title: KVBuffer: IO-aware Serving for Linear Attention
Link: https://arxiv.org/abs/2605.19049
【43】Deep Neural Sheaf Diffusion
Title: Deep Neural Sheaf Diffusion
Link: https://arxiv.org/abs/2605.19021
Authors: Remi Bourgerie, Sarunas Girdzijauskas, Viktoria Fodor
Comments: Under review at GFM@ICML2026
【44】LoRA vs. Full Fine-Tuning: A Theoretical Perspective
Title: LoRA vs. Full Fine-Tuning: A Theoretical Perspective
Link: https://arxiv.org/abs/2605.19018
Authors: Ali Zindari, Rotem Mulayoff, Sebastian U. Stich
Comments: Preprint
【45】FLUIDSPLAT: Reconstructing Physical Fields from Sparse Sensors via Gaussian Primitives
Title: FLUIDSPLAT: Reconstructing Physical Fields from Sparse Sensors via Gaussian Primitives
Link: https://arxiv.org/abs/2605.18866
Authors: Huaxi Huang, Meng Li, Zhengqing Gao, Xi Zhou, Xiaoshui Huang, Xiao Sun
Comments: 23 pages, 4 figures, preprint
【46】The 99% Success Paradox: When Near-Perfect Retrieval Equals Random Selection
Title: The 99% Success Paradox: When Near-Perfect Retrieval Equals Random Selection
Link: https://arxiv.org/abs/2605.18857
Authors: Vyzantinos Repantis, Harshvardhan Singh, Tony Joseph, Cien Zhang, Akash Vishwakarma, Svetlana Karslioglu, Michael Wyatt Thot, Ameya Gawde
Comments: 12 pages, 2 figures, 7 tables. Accepted at ICLR 2026 Blog Track, https://iclr-blogposts.github.io/2026/blog/2026/bits-over-random/
【47】Delta Attention Residuals
Title: Delta Attention Residuals
Link: https://arxiv.org/abs/2605.18855
Authors: Cheng Luo, Zefan Cai, Junjie Hu
【48】Evaluating Memory Condensation Strategies for Coding Agents in Data-Driven Scientific Discovery
Title: Evaluating Memory Condensation Strategies for Coding Agents in Data-Driven Scientific Discovery
Link: https://arxiv.org/abs/2605.18854
Authors: Renuka Chintalapati, Sid Raskar, Anurag Acharya, Jared Willard, Patrick Emami, Sameera Horawalavithana
【49】INSIGHTS: Demonstration-Based Summaries of Time Series Predictors
Title: INSIGHTS: Demonstration-Based Summaries of Time Series Predictors
Link: https://arxiv.org/abs/2605.18849
Authors: Bar Eini Porat, Rom Gutman, Uri Shalit, Ofra Amir
【50】Exact Linear Attention
Title: Exact Linear Attention
Link: https://arxiv.org/abs/2605.18848
Authors: Weinuo Ou
Comments: 8 pages, 16 figures, journal
【51】Lost and Found in Translation: Variational Diagnostics for Neural Codebook Channels
Chinese title: 翻译中的失落与发现:神经码本信道的变分诊断
Link: https://arxiv.org/abs/2605.18846
Authors: Yusuke Hayashi
Notes: 9 pages, 2 figures
【52】The Routing and Filtering Structure of Attention
Chinese title: 注意力的路由和过滤结构
Link: https://arxiv.org/abs/2605.18826
Authors: Shafayeth Jamil, Rehan Kapadia
Notes: 13 pages, 7 figures
【53】Multi-Pedestrian Safety Warning at Urban Intersections Use Case of Digital Twin
Chinese title: 城市交叉口多行人安全预警:数字孪生用例
Link: https://arxiv.org/abs/2605.18823
Authors: Yongjie Fu, Qi Gao, Mahshid Ghasemi Dehkordi, Gil Zussman, Xuan Di
【54】Symmetry in the Wild: The Role of Equivariance in Neural Fluid Surrogates
Chinese title: 真实场景中的对称性:等变性在神经流体代理模型中的作用
Link: https://arxiv.org/abs/2605.18816
Authors: Patryk Rygiel, Julian Suk, Kak Khee Yeung, Christoph Brune, Jelmer M. Wolterink
【55】How Faithful Is Trajectory-Based Data Attribution? Error Sources, Remedies, and Practical Guidelines
Chinese title: 基于轨迹的数据归因有多可信?误差来源、补救措施和实用指南
Link: https://arxiv.org/abs/2605.18814
Authors: Junwei Deng, Pingbang Hu, Suliang Jin, Hao Lu, Jiachen T. Wang, Shichang Zhang, Jiaqi W. Ma
【56】D-PACE: Dynamic Position-Aware Cross-Entropy for Parallel Speculative Drafting
Chinese title: D-PACE:用于并行推测式草稿生成的动态位置感知交叉熵
Link: https://arxiv.org/abs/2605.18810
Authors: Tianyu Wu, Yu Yao, Zhenting Qi, Han Zheng, Zhuohan Wang, Haoran Ma, Lawrence Liao, Himabindu Lakkaraju, Ju Li, Yilun Du
【57】Block-Based Double Decoders
Chinese title: 基于块的双解码器
Link: https://arxiv.org/abs/2605.18807
Authors: Asher Labovich, Benjamin Bradley, Vanessa Alexander, Chaitanya Harsha
Notes: 8 pages main, 13 pages total
【58】Decentralized autonomous organization and blockchain-based incentivization framework for community-based facilities management
Chinese title: 社区设施管理的去中心化自治组织和基于区块链的激励框架
Link: https://arxiv.org/abs/2605.18773
Authors: Reachsak Ly, Alireza Shojaei, Xinghua Gao, Philip Agee, Abiola Akanmu
Notes: 29 pages, 17 figures, 3 tables
【59】FiLark: a streaming-first software framework for end-to-end exploration, annotation, and algorithm integration in distributed acoustic sensing
Chinese title: FiLark:流优先软件框架,用于分布式声学传感中的端到端探索、标注和算法集成
Link: https://arxiv.org/abs/2605.20132
Authors: Jintao Li, Weichang Li, Kai Tong, Xaingyu Guo
【60】Tail Annealing for Heavy-Tailed Flow Matching
Chinese title: 用于重尾流匹配的尾部退火
Link: https://arxiv.org/abs/2605.20068
Authors: Jean Pachebat
Notes: 18 pages
【61】BCI-sift: An automated feature selection toolbox for Brain Computer Interface applications
Chinese title: BCI-sift:用于脑机接口应用的自动特征选择工具箱
Link: https://arxiv.org/abs/2605.19646
Authors: Elena C Offenberg, Dirk Keller, Mariska J Vansteensel, Zachary V Freudenburg, Nick F Ramsey, Julia Berezutskaya
Notes: 19 pages, 12 figures
【62】Increasing Missingness to Reduce Bias: Richardson-SGD with Missing Data
Chinese title: 增加缺失以减少偏差:缺失数据下的Richardson-SGD
Link: https://arxiv.org/abs/2605.19641
Authors: Ferdinand Genans, Erwan Scornet
【63】Gaussian Approximation and Multiplier Bootstrap for Federated Linear Stochastic Approximation
Chinese title: 联邦线性随机逼近的高斯逼近和乘数自助法
Link: https://arxiv.org/abs/2605.19629
Authors: Ilya Levin, Maksim Shuklin, Eric Moulines, Paul Mangold, Sergey Samsonov
【64】HiLiftAeroML: High-Fidelity Computational Fluid Dynamics Dataset for High-Lift Aircraft Aerodynamics
Chinese title: HiLiftAeroML:用于高升力飞机空气动力学的高保真计算流体动力学数据集
Link: https://arxiv.org/abs/2605.19565
Authors: Neil Ashton, Adam Clark, Liam Heidt, Christopher Ivey, Sanjeeb Bose, Rahul Agrawal, Konrad Goc, Rishi Ranade, Corey Adams, Peter Sharpe, Sheel Nidhan, Semit Akkurt, Daniel Leibovici, Jean Kossaifi
【65】Factor Augmented High-Dimensional SGD
Chinese title: 因子增广高维SGD
Link: https://arxiv.org/abs/2605.19291
Authors: Shubo Li, Yuefeng Han, Xiufan Yu
【66】Provably Data-driven Lagrangian Relaxation for Mixed Integer Linear Programming
Chinese title: 混合整数线性规划的可证明数据驱动拉格朗日松弛
Link: https://arxiv.org/abs/2605.19052
Authors: Tung Quoc Le, Anh Tuan Nguyen, Viet Anh Nguyen
Notes: Accepted to ICML 2026
【67】Towards Discovery of Polymers for Insulin Delivery via Physics-Grounded Agentic Workflows
Chinese title: 通过基于物理的智能体工作流发现用于胰岛素输送的聚合物
Link: https://arxiv.org/abs/2605.18831
【68】SpecX: A Large-Scale Benchmark for Multi-Modal Spectroscopy and Cross-Paradigm Evaluation
Chinese title: SpecX:多模态光谱学和跨范式评估的大规模基准
Link: https://arxiv.org/abs/2605.18791
Authors: Chengrui Xiang, Tengfei Ma, Yujie Chen, Tong Wang, Haowen Chen, Xiangxiang Zeng
Notes: 9 pages, 1 figure
Machine translation of titles provided by Tencent Interactive Translation (腾讯交互翻译); for reference only.