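Machine Learning Academic Digest [6.12]

2026-06-12 | cs.LG Machine Learning | 122 papers in total

[Institution] information is generated by AI analysis and may contain errors. It is for reference only; the affiliations shown in the paper itself take precedence.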

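Quick navigation

1. Deep Learning Architectures and Training Methods, 23 papers

2. Representation Learning, Self-Supervised and Contrastive Learning, 6 papers

3. Reinforcement Learning and Sequential Decision-Making, 9 papers

4. Generative Models and Probabilistic Modeling, 10 papers

5. Optimization, Generalization, and Theoretical Analysis, 11 papers

6. Efficient Learning, Compression, and Deployment, 8 papers

7. Federated Learning, Privacy, and Security, 2 papers

8. Robustness, Uncertainty, and Trustworthy Learning, 8 papers

9. Graph Learning and Structured Data, 6 papers

10. Transfer, Meta-Learning, and Continual Learning, 2 papers

11. Datasets, Benchmarks, and Evaluation, 18 papers

12. Machine Learning Applications, 17 papers

13. Other / General Machine Learning, 2 papers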

1. Deep Learning Architectures and Training Methods | 23 papers

1. Boltzmann Attention: Learnable Ising Couplings for Cooperative Attention

AI summary: Proposes Boltzmann attention, which strengthens inter-position interactions in the attention mechanism through learnable Ising couplings. It outperforms standard softmax attention on character-level language modeling and bracket matching, and the paper shows that training with quantum annealing is effective.

Link: https://arxiv.org/abs/2606.12478

Institution: Yonsei University

Authors: Gilhan Kim, Daniel K. Park

Abstract: Attention mechanisms are central to modern sequence models, yet standard attention computes relevance primarily through individual query-key similarities. Although softmax normalization introduces competition among positions, a standard attention layer does not explicitly parameterize learnable interactions between attention decisions. This limits its ability to directly model cooperative or antagonistic co-attention structure within the attention mechanism itself. We propose Boltzmann attention, an energy-based generalization in which attention patterns are governed by an interacting Ising model. The method augments the usual data-dependent local fields with learnable pairwise couplings, allowing the model to represent inter-position correlations beyond those captured by softmax or sigmoid attention. Experiments on character-level language modeling and synthetic bracket matching show that Boltzmann attention consistently improves over standard softmax attention within a standard Transformer architecture, with the advantage becoming more pronounced as sequence length increases. A four-way ablation confirms that the improvement arises from the learnable pairwise couplings. These results suggest that explicit inter-position interactions provide a principled enhancement for attention-based sequence modeling. Moreover, the Ising formulation opens a natural path toward quantum-computing-based sampling strategies: we demonstrate that diabatic quantum annealing provides a practical training method while maintaining competitive performance with exact Boltzmann computation.
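
The idea of local fields plus pairwise couplings can be illustrated with a tiny exact Boltzmann computation. This is a minimal sketch, not the paper's implementation: it assumes binary attention decisions s_i in {0, 1}, an energy of the form E(s) = -(h·s + ½ sᵀJs) with a symmetric, zero-diagonal coupling matrix J, and brute-force enumeration that is only feasible for very small n. The function name and the energy form are assumptions.

```python
import itertools
import numpy as np

def boltzmann_marginals(h, J):
    """Exact marginals P(s_i = 1) of a small Ising-style model by enumeration.

    h: (n,) local fields (in attention these would come from query-key scores).
    J: (n, n) symmetric couplings with zero diagonal (assumed learnable).
    Assumed energy: E(s) = -(h @ s + 0.5 * s @ J @ s), with s in {0, 1}^n.
    """
    n = len(h)
    states = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
    neg_energy = states @ h + 0.5 * np.einsum("ki,ij,kj->k", states, J, states)
    weights = np.exp(neg_energy - neg_energy.max())  # stabilised Boltzmann weights
    probs = weights / weights.sum()
    return probs @ states  # (n,) marginal probability that each position is "on"

h = np.array([0.5, -0.2, 0.1])

# Without couplings the positions are independent and each marginal is sigmoid(h_i).
m_independent = boltzmann_marginals(h, np.zeros((3, 3)))

# A positive coupling between positions 0 and 1 makes them cooperate.
J = np.zeros((3, 3))
J[0, 1] = J[1, 0] = 1.0
m_coupled = boltzmann_marginals(h, J)
```

With J set to zero the result reduces to independent sigmoid attention, which is one way to see the couplings as the only new ingredient in this toy setting.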

2. $μ$VLA: On Recurrent Memory for Partially Observable Manipulation in VLA Models

AI summary: Addresses the lack of memory in VLA models under partial observability by adding minimal recurrent memory, using only learnable memory tokens and truncated backpropagation through time. On MIKASA-Robo the average success rate on training tasks rises from 0.42 to 0.84, and on LIBERO performance under full observability is preserved.

Link: https://arxiv.org/abs/2606.12497

Institutions: CogAI Lab, Moscow, Russia; MIRAI, Moscow, Russia

Authors: Egor Cherepanov, Nikita Kachaev, Daniil Zelezetsky, Aydar Bulatov, Artem Pshenitsyn, Yuri Kuratov, Alexey Skrynnik, Aleksandr I. Panov, Alexey K. Kovalev

Abstract: Vision-language-action (VLA) models predict chunks of future actions from the current observation, an assumption that fails under partial observability, where decisions depend on information no longer visible. Existing memory-augmented VLAs simultaneously introduce recurrence, retrieval, compression modules, auxiliary objectives, hierarchical memory, or task-specific architectural changes, so the contribution of recurrence itself remains entangled with surrounding machinery. We present a controlled isolation study of recurrence in a strong pretrained VLA backbone. Our formulation augments the transformer with a small set of learnable memory tokens carried across timesteps and updated through self-attention, trained end to end with truncated backpropagation through time, with no auxiliary losses and no architectural changes. We instantiate this as $μ$VLA, a family of OpenVLA-OFT variants parameterized by memory width m, TBPTT length K, and the memory update rule (cross-step gradients or a detached EMA), so that recurrence is the only varying factor. On MIKASA-Robo, $μ$VLA improves average success rate on five training tasks from 0.42 to 0.84 at the strongest setting and reaches 0.23 on held-out tasks with the same memory structure versus 0.07 for the memoryless baseline. On tasks requiring different memory structure, performance remains near baseline. On LIBERO, the strongest recurrent variant achieves 96.2% average success, indicating no regression under full observability. We interpret these results as a calibration of the capability envelope of minimal in-backbone recurrence, identifying the regime in which it is sufficient and the regime where additional memory structure is required. Demos and videos can be found at https://avanturist322.github.io/mu-vla/.

3. Rubric-Guided Self-Distillation: Post-Training Without Rubric Verifiers

AI summary: Proposes RGSD, which distills a rubric-conditioned teacher into the student model and so enables dense per-token learning without a verifier. In medical and science domains it reaches rubric satisfaction comparable to judge-based GRPO.

Link: https://arxiv.org/abs/2606.12507

Institution: Scale AI

Authors: MohammadHossein Rezaei, Anas Mahmoud, Zihao Wang, Utkarsh Tyagi, Advait Gosai, Razvan-Gabriel Dumitru, Aakash Sabharwal, Bing Liu, Yunzhong He

Abstract: Rubrics have emerged as an alternative to RLVR in open-ended domains where a single ground-truth final answer is not available. Existing rubric-based training methods rely on an LLM verifier that scores each rollout against rubrics. This introduces substantial training-time overhead, exposes optimization to verifier-specific biases, and reduces rubric feedback to a sparse end-of-trajectory signal. We propose Rubric-Guided Self-Distillation (RGSD), a verifier-free training method in which the base policy, conditioned on the rubric, serves as the teacher for the unconditioned student. RGSD distills the rubric-conditioned teacher distribution into the student token-by-token, replacing sparse trajectory-level rewards with dense per-token learning signals and removing the LLM judge from the training loop entirely. Across Qwen-2.5 (3B, 7B) and Qwen3-Thinking (4B, 8B) models on medical and science domains, RGSD achieves rubric satisfaction comparable to judge-based GRPO while using one on-policy rollout per prompt and no training-time verifier calls. Ablations show that raw rubrics provide a stronger teacher enrichment signal than self-generated reference responses, while a stronger GRPO judge can outperform RGSD in some settings, positioning RGSD as a complementary verifier-free alternative when verifier cost or reliability is the bottleneck.
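
The contrast between a sparse trajectory-level reward and a dense per-token signal can be sketched as a token-by-token divergence between two sets of logits. This is an illustration only: the abstract does not state which divergence or direction RGSD uses, so KL(teacher || student) is an assumption here, and the random logits stand in for real model outputs.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def per_token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at every token position.

    Both inputs have shape (seq_len, vocab_size); the result has shape (seq_len,),
    i.e. one learning signal per token rather than one per trajectory.
    """
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return (np.exp(t) * (t - s)).sum(axis=-1)

rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(5, 11))  # placeholder: policy conditioned on the rubric
student_logits = rng.normal(size=(5, 11))  # placeholder: same policy without the rubric
kl = per_token_kl(teacher_logits, student_logits)
loss = kl.mean()
```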

4. Deep Unfolded Latent Optimally Partitioned-l2/l1 Networks for Data-driven Block-Sparse Recovery

AI summary: To address the convex LOP-l2/l1 method's reliance on manual tuning and the numerical instability of differentiating its proximal operator, proposes two deep unfolding architectures, one based on implicit differentiation and one on Deep Weight Factorization, that learn the parameters automatically. The resulting networks are competitive on block-sparse recovery and resilient to impulsive noise.

Link: https://arxiv.org/abs/2606.12740

Institutions: Nagoya Institute of Technology; RIKEN Center for Advanced Intelligence Project

Authors: Takanobu Furuhashi, Hidekata Hontani, Qibin Zhao, Tatsuya Yokota

Abstract: The convex Latent Optimal Partition (LOP)-l2/l1 approach enables block-sparse signal recovery with unknown partitions but relies on manual hyperparameter tuning. Additionally, numerical instability in differentiating its proximal operator prevents its automatic parameter tuning via Deep Unfolding (DU). To address these limitations, we propose two architectures: a stable framework utilizing implicit differentiation and a flexible variant leveraging Deep Weight Factorization (DWF). The DWF-based approach also supports nonconvex smooth data fidelity terms. Numerical experiments demonstrate that DU-LOP-l2/l1 yields competitive performance and high resilience against impulsive noise.

5. CLARITree: Cholesky and Lookahead Accelerations for Regression with Interpretable Piecewise Linear Trees

AI summary: Proposes an algorithm that combines lookahead search with rank-one Cholesky updates to build near-optimal sparse piecewise linear regression trees, achieving a favorable trade-off among computational efficiency, predictive accuracy, and sparsity.

Link: https://arxiv.org/abs/2606.12840

Authors: Yixiao Wang, Hayden McTavish, Varun Babbar, Margo Seltzer, Cynthia Rudin

Abstract: Regression trees are among the most interpretable yet expressive model classes in machine learning. Historically, greedy induction has been the dominant approach for constructing well-performing regression trees. While optimal methods based on dynamic programming and branch-and-bound exist, they are computationally prohibitive for general linear regression trees, despite often achieving substantially better performance than greedy approaches. Recent work has shown that specialized lookahead strategies can dramatically improve runtime while maintaining near-optimal performance, primarily in classification settings. In this work, we develop a novel algorithm for near-optimal, sparse, piecewise linear regression trees that combines a lookahead-style search strategy with efficient rank-one Cholesky updates of the Gram matrix. We demonstrate, both theoretically and empirically, that our method achieves a favorable trade-off between computational efficiency, predictive accuracy, and sparsity, and scales significantly better than the current state of the art.

6. TimeROME-DLM: Temporal Causal Tracing and Low-Rank Inference-Time Knowledge Editing for Masked Diffusion Language Models

AI summary: Proposes TimeROME-DLM, the first training-free, gradient-free, inference-time knowledge-editing framework for masked diffusion language models. It locates the key coordinate through temporal causal tracing and applies a low-rank residual edit, removing facts efficiently while preserving model utility.

Link: https://arxiv.org/abs/2606.12841

Institutions: Shanghai Jiao Tong University; Nanyang Technological University; National University of Singapore; University of Science and Technology of China

Authors: Zhengtao Yao, Liuyang Song, Hongbo Zhang, Chenhao Wei, Haoyan Xu, Guang Yang, Siheng Wang

Abstract: Masked diffusion language models (MDLMs) such as LLaDA now rival autoregressive (AR) LLMs, but every existing knowledge-editing and unlearning method (ROME, MEMIT, etc.) targets AR transformers and either makes assumptions that fail under iterative denoising, or requires gradient updates whose backward-pass activations cost tens of GB of extra VRAM and which collapse MDLMs at standard learning rates. We introduce TimeROME-DLM, the first training-free, gradient-free, inference-time knowledge-editing framework for MDLMs. It couples two components: a Temporal Indirect Effect (TIE) causal-tracing protocol that identifies, for each fact, the coordinate whose intervention most strongly drives the object prediction at later denoising steps; and a closed-form, low-rank residual edit memory that aggregates subject keys and target deltas across all forget facts and applies a single ridge-regularised update at that coordinate at every diffusion forward pass, with sparsification to limit utility spillover. Backbone weights stay frozen; only three hyperparameters (alpha, lambda, q) are tuned on a small validation split. On TOFU forget01 with TOFU-finetuned LLaDA-8B-Base, TimeROME-DLM cuts forget-set log-probability by roughly 83 nats. The same configuration transfers to LLaDA-8B-Instruct, Dream-7B, MMaDA-8B, DiffuLLaMA-7B, and LLaDA-MoE-1.4B. It keeps retain-set log-probability nearly flat (within ~1 nat at the utility-safe operating point) across 50 sequentially inserted facts, delivers a four- to fourteen-fold wall-clock speedup with zero additional VRAM over the strongest converged training-time baseline, and scales sub-linearly to 400 facts. TimeROME-DLM closes the locate-then-edit gap between AR LLMs and MDLMs at a fraction of the computational cost.

7. LongSpike: Fractional Order Spiking State Space Models for Efficient Long Sequence Learning

AI summary: Proposes the LongSpike framework, which brings fractional-order state-space models (f-SSM) into spiking neural networks and uses long-memory kernels for efficient long-sequence learning, outperforming existing SNNs on several benchmarks.

Link: https://arxiv.org/abs/2606.12895

Authors: Xinrui He, Qiyu Kang, Xuhao Li, Zheng-Jun Zha

Abstract: Spiking Neural Networks (SNNs) are well-regarded for their biological plausibility and energy efficiency in processing sequential data. However, dominant SNN architectures typically rely on first-order Ordinary Differential Equations (ODEs) to govern neuronal state transitions. This first-order assumption imposes a "memoryless" bottleneck, limiting the model's capacity to capture the complex, long-range dependencies inherent in long-sequence tasks. In this work, we propose LongSpike, a novel SNN framework that integrates fractional-order State-Space Modeling, or f-SSM, from control theory into the spiking domain. By extending traditional integer-order SSMs to the fractional-calculus regime, LongSpike enables the hierarchical integration of neuronal dynamics with long-memory kernels. To mitigate the computational overhead and parallelization challenges typically associated with fractional operators, we leverage a state-space formulation that supports efficient, parallel training. Empirical evaluations on challenging benchmarks, including Long Range Arena (LRA), large-scale WikiText-103, and Speech Commands, demonstrate that LongSpike outperforms state-of-the-art SNNs in accuracy while preserving sparse synaptic computation. The code is available at https://github.com/xinruihe389-commits/LongSpike.
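
Why a fractional-order derivative carries long memory, while a first-order one does not, can be seen from the Grünwald-Letnikov discretisation weights. This sketch uses that classical discretisation purely as an illustration; the abstract says LongSpike relies on a state-space formulation, which may be implemented quite differently.

```python
import numpy as np

def gl_weights(alpha, n_terms):
    """Grünwald-Letnikov weights w_k = (-1)^k * binom(alpha, k) for k = 0..n_terms-1.

    A discretised derivative of order alpha at step t is proportional to
    sum_k w_k * x[t - k], so nonzero w_k at large k means dependence on the far past.
    """
    w = np.empty(n_terms)
    w[0] = 1.0
    for k in range(1, n_terms):
        w[k] = w[k - 1] * (1.0 - (alpha + 1.0) / k)
    return w

first_order = gl_weights(1.0, 6)   # only the current and previous step contribute
fractional = gl_weights(0.5, 6)    # every past step contributes, with decaying weight
```

For alpha = 1 the weights collapse to a plain first difference, the "memoryless" case the abstract refers to; for 0 < alpha < 1 they form a slowly decaying tail over the whole history.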

8. Where Computation Lives Inside TabPFN: Causal Localisation of Attention Head Function

AI summary: Using activation patching, ablation, and attention entropy analysis, finds that in TabPFN 2.5 one attention head's causal necessity at its peak layer is 2 to 5 times that of the other heads, that its dominant layer shifts with task complexity, and that the remaining heads show symmetric late-layer profiles.

Link: https://arxiv.org/abs/2606.12917

Authors: Atharva Gupta, Dhruv Kumar, Murari Mandal, Saurabh Deshpande

Abstract: We present the first causal mechanistic analysis of a tabular foundation model, investigating how TabPFN 2.5's feature-wise attention heads distribute computation across layers. Using activation patching, ablation, and attention entropy across two synthetic regression datasets, we find clear temporal specialisation: one head's causal necessity dominates that of the others by 2 to 5 times at peak layer, with its dominant layer shifting across tasks of different complexity, while the remaining heads exhibit symmetric late-layer profiles. Attention entropy and patching provide convergent evidence for the computationally active layers of the dominant head. We additionally investigate inference-time steerability via contrastive activation steering, which fails to transfer across samples. We attribute this result to TabPFN's in-context learning mechanism, which encodes task structure through context-dependent attention rather than the stable parametric directions that make steering tractable in language models.
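
Attention entropy, one of the three diagnostics named in the abstract, is straightforward to compute from an attention matrix. The sketch below is generic and assumes attention rows that already sum to one; it does not touch TabPFN itself, and the function name is illustrative.

```python
import numpy as np

def attention_entropy(attn, eps=1e-12):
    """Shannon entropy (in nats) of each attention row.

    attn: array of shape (..., n_queries, n_keys) whose last axis sums to 1.
    Low entropy means a head focuses on few keys; high entropy means diffuse attention.
    """
    return -(attn * np.log(attn + eps)).sum(axis=-1)

uniform = np.full((1, 4), 0.25)                 # one query spread evenly over 4 keys
focused = np.array([[1.0, 0.0, 0.0, 0.0]])      # one query attending to a single key
h_uniform = attention_entropy(uniform)
h_focused = attention_entropy(focused)
```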

9. Circuit Synchronization Precedes Generalization: Causal Evidence from Fourier Structure in Grokking Transformers

AI summary: Introduces the Frequency Synchronization Degree (FSD) metric, finds that on modular arithmetic it synchronizes 500-3,000 steps before grokking, and provides causal evidence, through weight-decay interventions, that the gap between the two phases is a regularization phenomenon.

Link: https://arxiv.org/abs/2606.12966

Institution: New York University

Authors: Achyuthan Sivasankar

Abstract: Grokking, where a transformer on modular arithmetic suddenly transitions from near-chance to near-perfect validation accuracy, is attributed to a Fourier circuit, but its timing, causal structure, and controllability remain poorly understood. We introduce the Frequency Synchronization Degree (FSD), a normalised, permutation-tested metric for Fourier circuit synchronisation requiring no prior circuit knowledge. Across nine modular addition configurations (primes p in {53, 71, 97, 113, 131}, three seeds), FSD synchronises 500-3,000 steps before grokking (mean lead +1,722 steps; all nine positive, sign-test p~0.004), and precedes a restricted-logit loss baseline (Nanda et al.'s excluded loss) in all nine cases, making it the earliest available predictor. We provide direct causal evidence that the inter-phase gap is a regularisation phenomenon: forking training at the FSD-ceiling step and varying weight decay lambda produces strictly monotone earlier grokking, with Delta_t proportional to 1/lambda. This law replicates across three primes (p in {53,97,131}; R^2=1.00 and R^2=0.99 for two clean cases), captured as Delta_t ~ C/lambda, consistent with (1/lambda)*log(||W_mem||/tau). Architecture ablations show an attention-only model groks with a strong FSD precursor; an MLP-only model never groks; a single-layer model's FSD lags, confirming the precursor is a multi-block circuit property.
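
The reported law Delta_t ~ C/lambda is a one-parameter model that can be fitted by least squares on 1/lambda. The sketch below shows that fit and the corresponding R² on synthetic, made-up numbers generated exactly from the law; none of the values are results from the paper, and the fitting procedure the author used is not stated in the abstract.

```python
import numpy as np

def fit_inverse_law(lams, delta_ts):
    """Least-squares fit of delta_t = C / lam with no intercept. Returns (C, r_squared)."""
    x = 1.0 / np.asarray(lams, dtype=float)
    y = np.asarray(delta_ts, dtype=float)
    C = (x @ y) / (x @ x)                      # closed-form slope through the origin
    residual = y - C * x
    r_squared = 1.0 - (residual @ residual) / np.sum((y - y.mean()) ** 2)
    return C, r_squared

# Synthetic values for illustration only (not measurements from the paper).
lams = np.array([0.1, 0.2, 0.5, 1.0])
delta_ts = 150.0 / lams
C, r_squared = fit_inverse_law(lams, delta_ts)
```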

10. EPM-JEPA: Operator-Side Experience Modulation in JEPA-Family World Models

AI summary: Proposes EPM-JEPA, which modulates the predictor at the weight level through LoRA to cope with test-time dynamics shift. It improves on a no-memory baseline, but the pre-registered comparison with EI-JEPA yields a null result; the analysis identifies three independent dynamical processes behind the observed trajectory.

Link: https://arxiv.org/abs/2606.12979

Institution: School of Artificial Intelligence and Data Engineering (SAIDE), Indian Institute of Technology Jodhpur

Authors: Vedant Pandya

Abstract: JEPA-family world models use a static predictor whose weights do not adapt when test-time dynamics diverge from training. We compare two mechanisms for incorporating accumulated experience into a JEPA predictor under distribution shift: operand-side injection, where a compressed experience representation is added as a residual to the predictor's hidden state (EI-JEPA), and operator-side modulation, where the same representation generates low-rank weight deltas via LoRA applied to the predictor's weights (EPM-JEPA). On a pre-registered comparison (Moving MNIST, gravity shift), EPM-JEPA (D_shift^{n=50} = 0.7848 +/- 0.0078, three seeds) differs from EI-JEPA (0.8238) by delta = 4.74%, which is Outcome C (a null result) and, by our stated criterion, a valid outcome. As a secondary, non-pre-registered observation, EPM-JEPA improves 1.90% over a no-memory baseline (0.8000), consistently across seeds, while EI-JEPA underperforms the baseline, indicating the benefit is specific to weight-level modulation. Our primary contribution is a mechanism analysis: the D_shift^{n=50} trajectory reflects three independent dynamical processes (buffer cycling, EMA target drift, and an intrinsic LoRA settling transient of +0.021) rather than convergence to equilibrium. These findings motivate PEM-JEPA, a physics-grounded successor addressing this dynamical-peak limitation.
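
The difference between operand-side injection and operator-side modulation can be shown on a single linear layer. This is a toy sketch under stated assumptions: the predictor is reduced to one weight matrix, the experience vector is random, and the linear generators G_A and G_B that turn it into LoRA factors are hypothetical stand-ins for whatever network the paper uses.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2
W = rng.normal(size=(d, d))   # frozen predictor weight (toy stand-in)
h = rng.normal(size=d)        # predictor hidden state
e = rng.normal(size=d)        # compressed experience representation (placeholder)

# Operand-side injection (EI-style): experience is added to the hidden state.
out_operand = W @ (h + e)

# Operator-side modulation (EPM-style): the same vector generates a low-rank weight delta.
G_B = rng.normal(size=(d * r, d)) * 0.1   # hypothetical generator for the B factor
G_A = rng.normal(size=(r * d, d)) * 0.1   # hypothetical generator for the A factor
B = (G_B @ e).reshape(d, r)
A = (G_A @ e).reshape(r, d)
delta_W = B @ A                            # rank at most r
out_operator = (W + delta_W) @ h
```

In the first case the experience changes what the layer sees; in the second it changes what the layer does, while the frozen weight W itself is never overwritten.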

11. Emotional regulation improves deep learning-based image classification

AI summary: Proposes Emotional Regulation, a framework that models emotion in deep learning through artificial subjective experience. ResNet and ViT backbones are pre-trained on emotional datasets and evaluated on CIFAR-10/100 image classification, where the method outperforms prior emotion-augmented approaches and sets a new state of the art for emotion-augmented deep learning.

Link: https://arxiv.org/abs/2606.13081

Institutions: Mare Group; NOVA LINCS; Institute of Engineering (ISE), University of Algarve; Department of Energy Technologies and Renewable Sources, ENEA Casaccia Research Center

Authors: Riccardo Emanuele Landi, João M. F. Rodrigues, Marta Chinnici

Abstract: Emotion significantly influences cognition, enhancing memory and learning under certain conditions. Drawing on this principle, emotion-augmented deep learning investigates how affective states can improve neural network architectures and learning paradigms, achieving better generalization than non-emotional models. However, existing methods often rely solely on objective neurophysiological factors, neglecting the role of subjectivity in emotion. To bridge this gap, the present study introduces Emotional Regulation, a novel framework for modeling emotion in deep learning through artificial subjective experience. The method employs pre-training based on affective stimuli, balancing non-emotional and emotionally-influenced responses in downstream task optimization. Extensive experimentation was conducted in image classification, pre-training ResNet and ViT architectures on four emotional datasets, using CIFAR-10 and -100 as target benchmarks. Results reveal improvements over the aforementioned backbones, providing evidence of Emotional Regulation as a promising method for defining emotion-augmented deep learning through artificial subjective experience. Furthermore, the proposed approach outperforms related work in image classification based on CIFAR, revealing Emotional Regulation as the new state-of-the-art in emotion-augmented deep learning for large-scale vision datasets. The study also reinforces evidence of the impact of affective states in improving machine learning tasks' optimization, encouraging further investigation on emotion-inspired architectures.

12. Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning

AI summary: Proposes the SWITCH framework, which uses discrete boundary tokens to make hidden-state-recurrence reasoning compatible with on-policy reinforcement learning and open to causal mechanistic analysis. Experiments show it outperforms prior hidden-state-recurrence approaches at similar scale.

Link: https://arxiv.org/abs/2606.13106

Institutions: HKUST(GZ) (The Hong Kong University of Science and Technology (Guangzhou)); University of Cambridge; NTU (Nanyang Technological University); JoinQuant; HKUST (The Hong Kong University of Science and Technology)

Authors: Jiayu Yang, Chao Chen, Shengen Wu, Yinhong Liu, Yuxuan Fan, Lujundong Li, Songning Lai, Chengwei Qin, Zhijiang Guo

Abstract: Latent chain-of-thought compresses reasoning by replacing visible reasoning traces with continuous hidden-state recurrence, but existing formulations are difficult to optimize with standard on-policy reinforcement learning (RL) and hard to interpret causally. Our key insight is that a single pair of explicit boundary tokens can address both issues at once: discrete entry and exit anchors make the latent block compatible with standard on-policy RL, and the same anchors offer a natural foothold for mechanistic analysis. Motivated by this, we propose SWITCH, a switchable latent reasoning framework. The model emits one boundary token to enter latent mode and another to exit. Because the boundaries are ordinary discrete tokens, the GRPO policy ratio is well-defined at every decision point. The same anchors also expose the latent steps to direct probing and causal intervention. We train the model with a visible-to-latent curriculum and a Switch-GRPO objective that propagates gradients through recurrent latent computation. SWITCH consistently outperforms prior hidden-state-recurrence latent reasoning approaches at similar scale. Mechanistic analysis through the boundary tokens further reveals three findings: (i) emission of the entry token is a sharply localised, learned switching policy rather than a stylistic artefact; (ii) the latent step it opens performs problem-specific, causally important computation rather than acting as an inert placeholder; and (iii) that computation is concentrated at a single hidden-state transition on entry. Together, these results show that hidden-state-recurrence latent reasoning is both RL-trainable and open to direct mechanistic analysis, including of how on-policy RL itself improves the model from the inside.

13. MP3: Multi-Period Pattern Pre-training for Spatio-Temporal Forecasting

AI summary: To address the temporal mirage caused by short-window inputs in spatio-temporal data, proposes MP3, a multi-period pattern pre-training plugin that improves the forecasting performance of existing STGNNs through multi-period temporal modeling, multi-period spatial modeling, and cross-period causal interaction.

Link: https://arxiv.org/abs/2606.13119

Institutions: School of Computing and Artificial Intelligence, Southwest Jiaotong University; Eindhoven University of Technology

Authors: Lilan Peng, Yandi Liu, Qingren Yao, Chongshou Li, Tianrui Li

Abstract: Spatio-temporal forecasting is crucial in diverse fields, such as transportation, climate, and energy. Urban spatio-temporal data exhibits temporal mirage: similar short-window inputs have divergent future trends, and vice versa. Existing spatio-temporal graph neural networks (STGNNs) cannot effectively identify such mirages. We argue that the core reason lies in the short-window inputs, which have incomplete period observation, heterogeneous global spatial correlation, and cross-period superposition causality. To bridge this gap, we develop a novel Multi-Period Pattern Pre-training (MP3), a plug-and-play pre-training plugin for distinguishing temporal mirages. MP3 presents two core innovations: (1) The multi-period pattern learning is designed to learn multi-period patterns from long time series. Specifically, multi-period temporal modeling leverages edge convolution to identify different multi-period patterns. Multi-period spatial modeling uses a bottleneck projection and a global memory bank to capture heterogeneous global spatial relations efficiently. Cross-period pattern interaction employs a causality-enhanced Transformer to capture dependencies across different period patterns. (2) This plugin can seamlessly integrate into existing STGNN backbones to strengthen their forecasting performance. Experiments on five STGNN baselines across five real-world datasets (including a large-scale dataset, CA) verify the effectiveness, superior scalability, and strong adaptability of MP3, which brings consistent and robust performance improvements across all evaluated baselines. On average, MP3 reduces MAE by 4.7% and RMSE by 5.0%. The code is available at https://github.com/YAN-outlook/MP3.

14. Select and Improve: Understanding the Mechanics of Post-Training for Reasoning

AI summary: Through controlled experiments, shows that reinforcement learning post-training improves reasoning through two mechanisms, strategy selection and strategy improvement, and identifies the distinct roles played by SFT data and RL data.

Link: https://arxiv.org/abs/2606.13125

Institutions: Microsoft Research NYC; UIUC (University of Illinois Urbana-Champaign)

Authors: Akshay Krishnamurthy, Audrey Huang, Nived Rajaraman

Abstract: Reinforcement learning has rapidly emerged as a key component in the training of reasoning and coding models, yet it remains poorly understood from a mechanistic perspective. We study how and through what underlying processes capabilities are acquired or enhanced via reinforcement learning post-training. Our analysis, based on controlled math reasoning experiments with Qwen-2.5-1.5B, reveals two core mechanisms: strategy selection and strategy improvement. Our results highlight the role of SFT data and reinforcement learning data in activating these mechanisms, in particular showing how supervising the model on diverse reasoning strategies can enable strategy selection and how increasing difficulty in reinforcement learning data can enable strategy improvement. Taken together, our results provide mechanistic insight into RL training and suggest practical interventions to continue scaling reasoning capabilities.

15. When Does Routing Become Interpretable? Causal Probes on Block Attention Residuals

AI summary: Studies the interpretability of routing in Block Attention Residuals. Structured depth routing emerges only when routing is part of training, and routing weight dissociates from causal importance, so routing summaries must be validated with causal interventions.

Link: https://arxiv.org/abs/2606.13168

Institution: ETH Zurich

Authors: Aydin Javadov

Abstract: Block Attention Residuals (Block AttnRes) replace fixed additive residuals with a learned softmax over earlier depth-source representations, surfacing cross-layer routing as an inspectable tensor in the forward pass. This is a tempting interpretability target: information flow normally inferred indirectly is now directly observable. We ask whether such exposure suffices for mechanistic interpretation. We probe two same-scale ($0.6$B) Block AttnRes checkpoints under identical routing-ablation interventions: a vanilla Qwen3 inference-wrapped through a deterministic recency-bias schedule that the codebase admits as a routing-equivalent loading path, and a Block AttnRes Qwen3 trained from scratch with routing as part of optimisation. The wrapped baseline's routing weights are content-independent and reproduce the schedule's analytic prediction. The trained AttnRes checkpoint instead exhibits three localised routing motifs: an embedding-source pathway through early-layer MLP, a current-state pathway through early-layer attention and MLP, and an older-history pathway through late-layer attention. Beyond this stratification, we find a sharp dissociation between average routing mass and causal importance: in both sublayers, the largest mass slice is not the largest causal contribution, and one source family carries appreciable mass with no detectable causal role under intervention. Architectural exposure of routing is therefore necessary but not sufficient for mechanistic interpretation: structured depth routing emerges only when routing has been part of training, and even then, descriptive routing summaries should be treated as candidate hypotheses to be tested by causal interventions, not as evidence of mechanism in their own right.
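
The two quantities the abstract contrasts, routing mass and causal importance, can be separated in a few lines. The sketch below is a toy model under assumptions: routing logits are given directly rather than computed by a learned module, and "causal importance" is approximated by how much the mixed representation moves when one source is masked out. Function names are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def routed_input(sources, scores):
    """Mix earlier depth-source representations with softmax routing weights.

    sources: (n_sources, d); scores: (n_sources,) routing logits (assumed given).
    Returns the mixed vector and the weights, i.e. the inspectable routing tensor.
    """
    weights = softmax(scores)
    return weights @ sources, weights

def ablate_source(sources, scores, j):
    """Routing ablation: remove source j by setting its logit to -inf, then renormalise."""
    masked = scores.astype(float).copy()
    masked[j] = -np.inf
    return routed_input(sources, masked)

rng = np.random.default_rng(0)
sources = rng.normal(size=(4, 6))
scores = np.array([2.0, 0.5, 0.0, -1.0])

mixed, weights = routed_input(sources, scores)              # weights = routing mass
ablated, ablated_weights = ablate_source(sources, scores, 0)
effect = np.linalg.norm(mixed - ablated)                    # intervention-based effect
```

Reading `weights` alone is the descriptive summary; `effect` is what an intervention measures, and the abstract's point is that the two need not rank sources the same way.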

16. Distributional Loss for Robust Classification

AI summary: Proposes a distributional loss based on a bimodal Gaussian distribution. The softened target implicitly captures class ambiguity, mitigates overfitting, and yields more robust decision boundaries, with the largest gains in low-data regimes.

Link: https://arxiv.org/abs/2606.13223

Authors: Kathleen Anderson, Thomas Martinetz

Abstract: This paper proposes a novel loss concept for supervised classification tasks. Rather than enforcing a direct mapping from each input sample to a single assigned label, we define an optimization objective over all classifier outputs as a bimodal Gaussian distribution. This softer target formulation implicitly captures class ambiguity, mitigates overfitting, and encourages the learning of more robust decision boundaries, all without requiring additional label information. Experimental results demonstrate consistent improvements in robustness, with particularly pronounced gains in low-data regimes, while requiring only minimal modifications to standard training pipelines.

17. Different Layers, Different Manifolds: Module-Wise Weight-Space Geometry in Transformer Optimization

AI summary: Investigates whether different Transformer modules prefer different manifold geometries. Assigning Stiefel constraints to attention layers and DGram constraints to MLP layers gives the best performance among the tested configurations in GPT-2 pretraining.

Link: https://arxiv.org/abs/2606.13276

Institution: School of Engineering Science, The University of Osaka

Authors: Kirato Yoshihara

Abstract: Weight-space geometry plays a central role in neural network optimization, yet manifold constraints are often applied uniformly across all weight matrices. In this work, we ask whether different transformer modules prefer different manifold geometries. We study Manifold Muon for GPT-2 pretraining and compare layer-wise assignments of Stiefel and DGram constraints across attention and MLP blocks. Our results show a clear asymmetry: constraining attention layers with Stiefel geometry while assigning DGram geometry to MLP layers gives the best performance among the tested configurations, whereas the inverted assignment and all-DGram configuration become unstable under the shared hyperparameter setting. We trace this failure to singular value growth in DGram-constrained attention weights, which can amplify attention logits and induce softmax saturation. These findings suggest that symmetry-aware and geometry-aware optimization for transformers should be module-specific rather than uniform.
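
A Stiefel constraint keeps a weight matrix's columns orthonormal, which pins every singular value at one. The sketch below shows the standard projection onto the Stiefel manifold through the polar factor of an SVD; it is a generic illustration and does not reproduce Manifold Muon's update rule or the DGram constraint, neither of which is specified in the abstract.

```python
import numpy as np

def project_to_stiefel(W):
    """Nearest matrix (in Frobenius norm) with orthonormal columns: the polar factor U @ Vt."""
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4)) * 3.0     # an unconstrained weight with large singular values
Q = project_to_stiefel(W)

singular_before = np.linalg.svd(W, compute_uv=False)
singular_after = np.linalg.svd(Q, compute_uv=False)
```

Since attention logits scale with the singular values of the projection weights, fixing them at one is a plausible reading of why the Stiefel assignment avoids the softmax saturation described for DGram-constrained attention.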

18. How Much Memory Do We Need? Adaptive Memory Gate for Neural Operators

AI summary: To address the limited adaptability of fixed memory weights in existing neural operators, proposes AMGFNO, which modulates the memory weight dynamically through a learnable gate and reduces nRMSE by 55-79% at low resolution.

Link: https://arxiv.org/abs/2606.13443

Authors: Jihyeon Hur, Yongseok Kwon, Min-Gi Jo, Jeongwhan Choi, Noseong Park

Abstract: Neural operators have emerged as a powerful data-driven approach for solving time-dependent PDEs. Among recent advances, memory-augmented neural operators explicitly incorporate past states and have achieved remarkable performance under low-resolution observation settings. However, existing approaches apply a fixed memory weight regardless of observation conditions, such as resolution or physical parameters, limiting their adaptability. Our preliminary experiments reveal that optimal memory weight varies with resolution and viscosity, implying that a fixed memory weight cannot simultaneously optimize performance across diverse settings. We propose AMGFNO, which dynamically modulates memory weight through a learnable gate. On the Kuramoto-Sivashinsky and Burgers' equations, AMGFNO achieves a 55-79% nRMSE reduction at low resolution, with the learned gate value automatically decreasing from $\bar{g} \approx 0.7$ to near-zero as resolution increases.
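
A gate that scales the memory contribution can be sketched as a sigmoid applied to a learnable logit. Everything about the form below is an assumption made for illustration: the abstract does not say how AMGFNO computes its gate or how the gated memory is combined with the current features, so the additive combination and the names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_memory(current, memory, gate_logit):
    """Combine current features with a memory term scaled by a gate g in (0, 1).

    Assumed form: output = current + g * memory, with g = sigmoid(gate_logit).
    A gate near 0 ignores memory; a gate near 1 uses it fully.
    """
    g = sigmoid(gate_logit)
    return current + g * memory, g

rng = np.random.default_rng(0)
current = rng.normal(size=16)
memory = rng.normal(size=16)

out_low_res, g_low = gated_memory(current, memory, gate_logit=0.85)    # g is about 0.7
out_high_res, g_high = gated_memory(current, memory, gate_logit=-20.0) # g is near zero
```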

19. Adjusted Cup-Product Neural Layer

AI summary: Proposes the adjusted cup-product neural layer, which hard-wires the cup product together with an adjustment term from higher gauge theory to produce a gauge-invariant readout, and proves that on a closed cycle the adjustment coefficient is the only source of signal.

Link: https://arxiv.org/abs/2606.13568

Authors: Snigdha Chandan Khilar

Abstract: Many important observables in physics and geometry are cup products of cochains. This paper introduces the adjusted cup product neural layer, a neural primitive that hard-wires the cup product with an adjustment term from higher gauge theory. This creates a readout that is gauge invariant by design. The main theoretical result shows that on a closed cycle the output relies entirely on the adjustment coefficient. Setting this coefficient to zero removes the output completely regardless of other parameters. Thus the adjustment is the only source of gauge invariant signal. The paper proves that this observable is a nonzero quadratic form and is exactly invariant under one and two gauge transformations.

20. Existence Precedes Value: Joint Modeling of Observational Existence and Evolving States in Time Series Forecasting

AI summary: Proposes the Timeflies framework, which jointly models whether a future observation will occur (existence) and what its value will be, coupling an observation stream and a value stream through dedicated modules to improve forecasting on time series with missing values.

Link: https://arxiv.org/abs/2606.13571

Institution: Ant International

Authors: Yifan Hu, Hongzhou Chen, Peiyuan Liu, Yiding Liu, Zewei Dong, Jiang-Ming Yang

Abstract: Real-world time series are often highly incomplete and irregular due to sensor dormancy, transmission delays, and event-driven sampling, making reliable forecasting fundamentally challenging. Existing methods have evolved from impute-then-forecast pipelines to continuous-time models such as Neural ODEs and continuous-time graph networks. While these approaches improve the modeling of historical irregularity, they still rely on an implicit oracle assumption at inference time: the timestamps of future valid observations are presumed to be known in advance. This assumption limits practical relevance, since in many real systems the more fundamental question is not only what the future value will be, but also whether a valid observation will occur at all. In this paper, we propose Timeflies, a unified framework that reformulates forecasting as a joint problem of future observability inference and value estimation. To explicitly model the interaction between observation dynamics and state evolution, Timeflies adopts an observation stream and a value stream, coupled through three dedicated modules for reliability-aware embedding, observation-guided dependency modeling, and joint prediction. We further construct Shadow, a benchmark that combines natural missingness from public datasets with real-world industrial data, and introduce the Observation-Value Joint Entropy (OVJE) metric to comprehensively evaluate this coupled predictability. Extensive experiments show that Timeflies consistently outperforms existing methods, highlighting the importance of explicitly modeling future observability in time series forecasting with missing values. Code and dataset are available at https://github.com/ant-intl/Timeflies.

21. Simplex-Constrained Sparse Bagging: Transitioning from Uniform Priors to Sparse Posteriors in Ensemble Learning

AI summary: Proposes the SCSB framework, which jointly optimizes ensemble pruning and calibration over the probability simplex by minimizing out-of-bag loss, introduces a concave quadratic penalty to resolve the L1-simplex paradox, and achieves up to 96% compression with improved calibration.

Link: https://arxiv.org/abs/2606.13589

Affiliations: Georgia Institute of Technology

Authors: Meher Sai Preetam, Meher Bhaskar

Abstract: We present Simplex-Constrained Sparse Bagging (SCSB), a mathematically rigorous framework for post-training compression and probability calibration of bootstrap-based bagging ensembles. Standard bagging ensembles (such as Random Forests, Bagged SVMs, and Bagged Neural Networks) assign uniform voting power to all constituent estimators. However, this naive uniform prior ignores the varying local competence of base estimators and contributes to model overconfidence. We formulate ensemble pruning and calibration as a joint optimization problem over the probability simplex by minimizing the Out-Of-Bag (OOB) loss. To induce sparsity, we address the theoretical "L1-simplex paradox" -- the mathematical reality that the L1 norm is constant on the simplex and fails to prune -- by introducing a concave quadratic penalty. SCSB is model-agnostic and achieves up to 96% ensemble compression, yielding linear inference speedups and superior probability calibration (lowered Expected Calibration Error) while preserving or enhancing generalization accuracy.
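
The "L1-simplex paradox" mentioned in this abstract can be checked numerically. The sketch below is illustrative only: the penalty form `1 - ||w||_2^2` is an assumed example of a concave quadratic penalty, not necessarily the one used in the paper, and the weight vectors are made up.

```python
import numpy as np

def l1_norm(w):
    # On the probability simplex all weights are non-negative and sum to 1,
    # so the L1 norm equals 1 for every feasible point.
    return float(np.abs(w).sum())

def concave_penalty(w):
    # Assumed illustrative penalty: 1 - ||w||_2^2 (concave in w).
    # It is 0 at a vertex and largest at the uniform point.
    return float(1.0 - np.dot(w, w))

# Three hypothetical weightings of a 4-member ensemble.
uniform = np.full(4, 0.25)
sparse = np.array([0.7, 0.3, 0.0, 0.0])
vertex = np.array([1.0, 0.0, 0.0, 0.0])

for name, w in [("uniform", uniform), ("sparse", sparse), ("vertex", vertex)]:
    print(name, l1_norm(w), round(concave_penalty(w), 4))
```

The L1 column is the same for all three weightings, so an L1 term cannot distinguish them, while the assumed concave term decreases as mass concentrates on fewer estimators.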

22. Beyond the Commitment Boundary: Probing Epiphenomenal Chain-of-Thought in Large Reasoning Models

AI summary: Estimates the causal importance of chain-of-thought steps via early exit and finds a "commitment boundary" in reasoning, a transition from transient guesses to a stable answer; later steps are epiphenomenal, so exiting early shortens reasoning by up to 55% without affecting performance.

Link: https://arxiv.org/abs/2606.13603

Authors: Daniel Scalena, Sara Candussio, Luca Bortolussi, Elisabetta Fersini, Malvina Nissim, Gabriele Sarti

Abstract: Chain-of-thought (CoT) reasoning is the dominant paradigm for inference-time scaling in language models, yet the causal influence of individual steps on the final answer remains poorly understood. We estimate each step's causal importance via early exit and use this measure to study how answers form across the reasoning traces of several model families. Across diverse tasks, we find that reasoning typically crosses a commitment boundary -- a sharp transition from transient intermediate guesses to a stable, high-confidence answer. This transition often happens in a single step, well before the model's reasoning block ends, and is followed by epiphenomenal CoT steps that leave the final answer probability unaltered. Using attention probes, we show that answer-formation stages can be linearly decoded from intermediate reasoning steps with high accuracy and generalize robustly to unseen reasoning tasks. We exploit this signal to early-exit reasoning blocks at the commitment boundary, reducing the length of CoTs by up to 55% on average with negligible impact on model performance.
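
One way to picture the commitment boundary is as the first step after which the early-exit probability of the final answer stays high. The sketch below is a simplified reading of the abstract, not the authors' procedure: the function name, the fixed `threshold`, and the probability trace are all assumptions made for illustration.

```python
def commitment_boundary(answer_probs, threshold=0.9):
    """Return the first step index from which the early-exit probability of the
    final answer stays at or above `threshold`, or None if it never stabilizes.

    `answer_probs[i]` is assumed to be the probability of the final answer
    when the reasoning is cut off after step i.
    """
    boundary = None
    for step, p in enumerate(answer_probs):
        if p >= threshold:
            if boundary is None:
                boundary = step  # candidate start of the stable phase
        else:
            boundary = None      # confidence dropped, so reset the candidate
    return boundary

# Hypothetical trace: transient guesses, then a sharp, stable jump at step 3.
probs = [0.10, 0.35, 0.20, 0.95, 0.97, 0.96, 0.98]
boundary = commitment_boundary(probs)
skipped = len(probs) - 1 - boundary  # steps after the boundary that could be skipped
```

In this toy trace the steps after the boundary leave the answer probability essentially unchanged, which is the pattern the abstract calls epiphenomenal.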

23. Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation

AI summary: Analyzes the sparsity and geometry of parameter updates in on-policy distillation (OPD), finding that updates are sparse and concentrated on small-weight coordinates, and validates the effectiveness of the sparse subnetwork.

Link: https://arxiv.org/abs/2606.13657

Affiliations: School of Artificial Intelligence, Nanjing University; National Key Laboratory for Novel Software Technology, Nanjing University; Amap, Alibaba Group

Authors: Guo Yu, Wenlin Liu, Yulan Hu, Hao-Xuan Ma, Jun-Peng Jiang, Han-Jia Ye

Abstract: On-policy distillation (OPD) has recently become a prominent post-training recipe as it combines two desirable ingredients: on-policy student trajectories and dense teacher supervision. Yet how this hybrid changes a model's parameters remains unclear. Across several language and vision-language model pairs and use cases, our analysis yields two main findings. On sparsity, OPD-style updates are small and coordinate-sparse. They are distributed across layers and are usually FFN-heavy. This sparse structure is operationally useful: training only the discovered subnetwork recovers nearly the same performance as full OPD. However, the sparsity-inducing SGD optimizer underperforms AdamW in our optimizer ablation, likely because dense teacher supervision preserves heterogeneous coordinate-wise gradient scales where AdamW's adaptive scaling remains useful. On geometry, the updates are numerically full-rank but spectrally concentrated; they lie mostly away from the principal singular subspaces of the source weights and fall disproportionately on coordinates where the source weights are close to zero. These findings suggest that dense teacher supervision does not turn OPD into ordinary dense parameter rewriting; instead, OPD retains important geometric signatures of on-policy post-training.

2. Representation Learning, Self-Supervised and Contrastive Learning | 6 papers

24. Representing Time Series as Structured Programs for LLM Reasoning

AI summary: Proposes T2SP, which decomposes a time series into trends, periods, and salient events and represents it as a structured symbolic program, letting LLMs reason efficiently without fine-tuning; it outperforms raw-sequence representations on editing, captioning, and question answering.

Link: https://arxiv.org/abs/2606.12481

Affiliations: Korea University; Mila, University of Montreal

Authors: Jaeho Kim, Changhun Oh, Seokhyun Lee, Irina Rish, Changhee Lee

Abstract: Large language models (LLMs) have demonstrated strong reasoning and instruction-following capabilities, making them potentially powerful tools for time-series analysis. However, time series lie outside their native textual modality, raising a fundamental question: how should time series be represented so that LLMs can reason about them effectively? Existing work typically serializes raw numerical sequences or fine-tunes pre-trained LLMs on time-series data. These approaches place the burden of extracting temporal structure directly on the LLM, creating a modality mismatch that often degrades performance on long sequences and introduces substantial computational overhead. In this work, we introduce Time-Series-to-Structured-Program representation (T2SP), a deterministic, training-free method that represents a time series as a structured symbolic program. T2SP decomposes time series into trends, periods, and salient events, expressing them in a program-friendly format aligned with the textual and code-like modalities on which LLMs are natively trained. By shifting temporal-structure extraction from the model to the representation itself, T2SP enables off-the-shelf LLMs to leverage their existing reasoning capabilities for time-series understanding. We evaluate T2SP on three reasoning tasks -- editing, captioning, and question answering -- where it consistently improves performance, reduces reasoning time, and lowers failure rates compared with raw-string representations. Our results demonstrate that T2SP provides an effective interface between time series and LLMs.

25. A Stationary (and Therefore Compatible) Representation is All You Need

AI summary: Proves that stationary representations learned with d-Simplex fixed classifiers satisfy the formal definition of compatibility, and captures higher-order dependencies through a convex combination of cross-entropy and contrastive losses, enabling retrieval services that need no gallery reprocessing when the model is updated.

Link: https://arxiv.org/abs/2606.12488

Affiliations: Media Integration and Communication Center (MICC), Dipartimento di Ingegneria dell’Informazione, Università degli Studi di Firenze

Authors: Niccolò Biondi, Federico Pernici, Simone Ricci, Alberto Del Bimbo

Abstract: Learning compatible representations aims to learn feature representations that can be used interchangeably over time whenever a model undergoes updates. In this paper, we demonstrate that stationary representations learned by $d$-Simplex fixed classifiers imply compatibility in the sense of its formal definition. This result establishes a foundation for future work and can be directly exploited in practical learning scenarios. We address the challenge of learning compatibility using $d$-Simplex fixed classifiers when the model is sequentially fine-tuned. Learning according to a $d$-Simplex fixed classifier with the cross-entropy loss aligns feature distributions at the first-order statistics. Consequently, it may not fully capture higher-order dependencies in the representation between model updates. To address this issue, we demonstrate that training the model using a $d$-Simplex fixed classifier through a convex combination of the cross-entropy loss and a contrastive loss not only captures higher-order dependencies, but is also equivalent to learning with the cross-entropy under the compatibility constraints. We confirm our findings with extensive experiments, including a new scenario where a pre-trained model is sequentially fine-tuned and occasionally replaced with an improved model. We show that stationary representations enable uninterrupted retrieval services (without reprocessing gallery images) while improving performance during model updates and replacements, achieving state-of-the-art results. Code at https://github.com/miccunifi/iamcl2r.
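
To make the idea of a fixed simplex classifier concrete, the sketch below builds class prototypes at the vertices of a regular simplex and keeps them fixed. It is a generic construction under stated assumptions, not the code from the linked repository: `simplex_classifier` is a hypothetical helper, and the prototypes are embedded in a K-dimensional space (they span a (K-1)-dimensional subspace) rather than written in d = K-1 coordinates.

```python
import numpy as np

def simplex_classifier(num_classes):
    """Return `num_classes` unit-norm rows forming a regular simplex.

    Built by centering the standard basis and normalizing each row; every
    pair of distinct rows then has cosine similarity -1 / (num_classes - 1).
    """
    eye = np.eye(num_classes)
    centered = eye - eye.mean(axis=0, keepdims=True)
    return centered / np.linalg.norm(centered, axis=1, keepdims=True)

W = simplex_classifier(5)   # fixed prototypes, never updated during training
cosines = W @ W.T           # pairwise cosine similarities between prototypes

# Logits for a batch of hypothetical features: only the feature extractor
# that produces `features` would be trained, while W stays constant.
features = np.random.default_rng(0).standard_normal((3, 5))
logits = features @ W.T
```

Because the prototypes never move, features learned at different times are pulled toward the same class directions, which is the intuition behind the stationarity the abstract refers to.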

26. Dolph2Vec: Self-Supervised Representations of Dolphin Vocalizations

AI summary: Proposes Dolph2Vec, the first self-supervised model trained on five years of longitudinal dolphin recordings; it significantly outperforms general-purpose baselines on signature whistle classification and whistle detection and reveals interpretable acoustic units.

Link: https://arxiv.org/abs/2606.12503

Affiliations: École Normale Supérieure, Paris, France; Not Diamond, San Francisco, USA; Institut du Cerveau, Paris, France; Champalimaud Foundation, Lisbon, Portugal

Authors: Chiara Semenzin, Faadil Mustun, Roberto Dessi, Pierre Orhan, Alexis Emanuelli, Yair Lakretz, Gonzalo de Polavieja, German Sumbre

Abstract: Self-supervised learning (SSL) has opened new opportunities in bioacoustics by enabling scalable modeling of animal vocalizations without the need for expensive manual annotation. However, current SSL models in this domain prioritize broad generalization across species and are not optimized for uncovering the fine-grained structure of individual communication systems. In this work, we collect and release a novel dataset of over five years of longitudinal recordings, from five known dolphins in a semi-naturalistic marine environment, an unprecedented resource for studying dolphin communication. We adapt the Wav2Vec2.0 architecture (Baevski et al., 2020) to this domain and introduce Dolph2Vec, the first large-scale, species-specific SSL model trained exclusively on this data. We benchmark our model on two biologically relevant tasks: signature whistle classification and whistle detection. Dolph2Vec significantly outperforms general-purpose baselines in both tasks. Beyond performance, we show that learned embeddings and codebook structure capture interpretable acoustic units aligned with dolphin whistle categories and possibly sub-whistle structure, enabling fine-grained analysis of communication patterns. Our findings demonstrate how SSL can serve as both a model and a scientific tool to explore hypotheses in animal communication research.

27. Viral Proteins Reveal Geometry of Protein Language Models

AI summary: Studies how protein language models represent viral proteins under imbalanced training data, finding a dominant "nativeness" axis in embedding space that orders sequences by model perplexity; scaling affects viral families unevenly, yet the embeddings still retain viral-specific signal.

Link: https://arxiv.org/abs/2606.12609

Authors: Arthur Bigot, Harmon Bhasin, Core Francisco Park, Eugene Shakhnovich, Dianzhuo Wang

Abstract: Protein language models are trained on highly imbalanced datasets, raising the question of how they represent underrepresented biological sequences. Using viral proteins as a case study across ESM model families, we identify a dominant nativeness axis in embedding space, aligned with masked reconstruction perplexity, that orders sequences from well-modeled cellular proteins through viral proteins to shuffled and random sequences. Scaling contracts this axis unevenly across viral families. Despite this, protein language model embeddings retain viral-specific signal: viral proteins remain linearly separable beyond zero-shot perplexity and shallow sequence features. Together, these results suggest that pLM representations are structured by a general notion of nativeness while preserving information specific to distinct biological groups.

28. Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns

AI summary: Proposes the Bag of Dims framework, showing that the standard basis of Transformer hidden states already serves as a training-free feature basis in which per-dimension sign patterns encode semantics, and validates it on three models.

Link: https://arxiv.org/abs/2606.12629

Affiliations: Amazon Web Services

Authors: Varun Reddy Nalagatla

Abstract: We show that the standard basis of transformer hidden states already provides a training-free, architecture-general feature basis. Individual dimensions encode semantic content via their signs and confidence via their magnitudes, functioning as independent binary registers. We validate this Bag of Dims framework across three model families (Qwen 3.5-4B, Gemma 3-4B, Mistral 7B) through four progressive experiments. Sign patterns alone carry predictive content: replacing all magnitudes with unity achieves 72-93% top-5 next-token accuracy through the LM head, and pure Hamming scoring without any decoder reaches 80-90% top-4096. These sign patterns organize into semantic features: using a single-token type cache (one forward pass per vocabulary token, no context), we discover 175 categories via per-dimension sign consistency (mean AUC 0.80) from 50 anchors with zero training. A trained probe adds only +0.018 AUC and converges to axis-aligned weights, confirming negligible cross-dimension structure. This structure extends to attention: all 175 categories remain discoverable in K and V projections. On the write side, static FFN weight inspection links 20% of features to individual writer neurons (>0.70 agreement; random controls: 0%), with top-200 neuron coalitions achieving >0.70 agreement on 99.9% of prototypes via majority vote. Fully unsupervised discovery (random seeds, no labels) scales to 1500 features at 100% yield and 99% sparsity across all three models, with pairwise MI of 0.0014 bits confirming low inter-dimension coupling. These results establish that the standard basis already suffices for feature reading throughout the transformer compute pathway, requiring no training, no optimization, and no GPU-days beyond a single forward pass per vocabulary token.
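
The two operations named in this abstract, replacing magnitudes with unity and scoring by Hamming agreement of sign patterns, can be sketched in a few lines of NumPy. This is an illustration on random vectors, not the paper's pipeline: the helper names, the 64-dimensional size, and treating zero as positive are all assumptions.

```python
import numpy as np

def unit_magnitude(h):
    # Keep only the sign of each dimension: +1 for non-negative, -1 otherwise.
    return np.where(np.asarray(h) >= 0, 1.0, -1.0)

def hamming_scores(query, candidates):
    # Number of dimensions whose sign agrees with the query, per candidate row.
    q = np.asarray(query) >= 0
    c = np.asarray(candidates) >= 0
    return (c == q).sum(axis=1)

rng = np.random.default_rng(0)
candidates = rng.standard_normal((5, 64))  # stand-ins for cached hidden states

# A rescaled copy of row 3: its magnitudes differ but every sign is the same.
query = 3.0 * candidates[3]

scores = hamming_scores(query, candidates)
best = int(scores.argmax())
```

Since only signs enter the score, rescaling the query leaves the ranking unchanged, which mirrors the claim that sign patterns alone carry predictive content.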

29. Extracting Governing Equations from Latent Dynamics via Multi-View Contrastive Learning

AI summary: Proposes the DYSCO algorithm, which uses multi-view temporal contrastive learning to jointly recover latent trajectories and governing dynamics from noisy high-dimensional observations, achieves symbolic recovery through a structured functional basis, and comes with theoretical guarantees of strong identifiability.

Link: https://arxiv.org/abs/2606.13260

Affiliations: EPFL

Authors: Paolo Muratore, Mackenzie Weygandt Mathis

Abstract: Identifying latent dynamical systems from noisy, high-dimensional measurements is a central problem at the intersection of representation learning, system identification, and scientific discovery. We present DYSCO, a multi-view temporal contrastive learning algorithm that jointly recovers latent trajectories and the governing dynamics from such observations, by leveraging multiple independent noisy views of the same underlying process to disentangle signal from noise. By parameterizing the dynamics in a structured functional basis, our framework further enables symbolic recovery of the governing equations within an affine gauge. We offer theoretical guarantees for strong identification up to an affine indeterminacy, extending prior identifiability results to the realistic setting of noisy nonlinear observations. Empirically, we demonstrate accurate recovery of both latent trajectories and flow fields across a diverse set of dynamical regimes (e.g., chaotic, oscillatory, and metastable) under both Gaussian and Poisson observation noise, the latter being particularly relevant for neural recordings.

3. Reinforcement Learning and Sequential Decision-Making | 9 papers

30. ReCal: Reward Calibration for RL-based LLM Routing

AI summary: Proposes the ReCal framework, which calibrates reward signals through hierarchical reward decomposition and distribution-aware optimization, addressing multi-objective conflicts and optimization bias across heterogeneous tasks to improve LLM routing performance and stability.

Link: https://arxiv.org/abs/2606.12479

Affiliations: Zhejiang University; Ant Group; Shanghai AI Laboratory

Authors: Qihang Yu, Hanwen Tong, Zhengqi Zhang, Bo Zheng, Feng Wei, Shengyu Zhang, Zemin Liu, Fei Wu

Abstract: Large language model (LLM) routing has emerged as an effective paradigm for leveraging the complementary strengths of multiple LLMs through dynamic model and reasoning-strategy selection. Recent reinforcement learning (RL)-based routing methods further improve routing quality by optimizing routing policies from interaction feedback. However, they still struggle to provide informative and comparable learning signals under heterogeneous tasks with varying difficulty. In practice, multiple objectives (e.g., correctness, format behavior) are aggregated into a single scalar reward, leading to ambiguous credit assignment and conflicting optimization signals. Moreover, reward signals exhibit significant variability across instances, where some instances produce higher or more variable rewards, introducing optimization bias that favors trivial samples over informative ones. To address these issues, we propose ReCal, a Reward Calibration framework for RL-based LLM routing. We first introduce a hierarchical reward decomposition mechanism with component-wise advantage estimation. We further propose a distribution-aware optimization strategy that calibrates optimization variability through variance-aware reweighting and per-dataset normalization. Experiments on seven datasets demonstrate that ReCal consistently improves routing performance and training stability over baselines. Code is available at https://anonymous.4open.science/r/ReCal.

31. Speculative Rollback Correction for Quality-Diverse Web Agent Imitation

AI summary: Proposes the Speculative Rollback Correction (SRC) framework, which uses fixed-horizon branch review and rollback to reduce teacher queries while preserving trajectory diversity; on WebArena-Infinity it collects 977 verifier-passing trajectories and 9,183 next-action examples.

Link: https://arxiv.org/abs/2606.12485

Affiliations: Beihang University; Institute of Software, Chinese Academy of Sciences; The Hong Kong University of Science and Technology; Northwestern Polytechnical University; Tsinghua University; The Hong Kong University of Science and Technology (Guangzhou); Peking University

Authors: Longkun Hao, Hongyu Lin, Hao Li, Zhichao Yang, Haojie Hao, Dongshuo Huang, Haitao Yang, Hongyu Ge, Ming jie Xie, Yanjun Wu, Zi Hao Yin, Yan Bai, Yihang Lou

Abstract: Training interactive web agents through imitation learning from expert trajectories has emerged as a highly effective approach. However, determining the optimal timing for expert intervention presents a critical challenge in this context. Delayed intervention often leads to the accumulation of early-stage errors, pushing the page state into an irrecoverable regime. Conversely, premature or excessive intervention causes the agent to become overly reliant on expert policies, trapping the model in local optima characterized by a single, rigid trajectory. We propose Speculative Rollback Correction (SRC), a branch-level imitation framework for resettable agent environments. Instead of requesting teacher labels at every visited state or correcting only after a completed trajectory, SRC uses fixed-horizon branch review: the student executes a short speculative segment before teacher review, and the teacher localizes the first harmful deviation only when local progress breaks. Rollback preserves useful prefixes, while successful rollouts are filtered by a hard verifier and retained in a lightweight quality-diversity archive. The resulting data supports next-action supervised fine-tuning on both localized corrections and verifier-passing trajectories. On WebArena-Infinity, SRC collects 977 verifier-passing trajectories and 9,183 next-action examples; fixed-horizon review improves the recovery-versus-query tradeoff over step-level review while retaining verifier-passing solution variants. Code is available at https://github.com/LongkunHao/SRC_gui_agent.

32. Boosting Direct Preference Optimization with Penalization

AI summary: Proposes DPOP, which adds to the DPO loss a gated penalty on the reference model's greedy response, active only when the current policy assigns the preferred response a lower probability than the rejected one; it significantly improves win rate on AlpacaEval 2.0.

Link: https://arxiv.org/abs/2606.12505

Authors: Pengwei Sun

Abstract: Offline preference optimization has become a practical substitute for reinforcement learning from human feedback, but pairwise objectives such as Direct Preference Optimization (DPO) and its variants use only the chosen and rejected responses stored in a static dataset. This leaves a useful signal unused: the response that the reference model itself would generate for the same prompt. We propose Direct Preference Optimization with Penalization (DPOP), a simple extension of DPO that augments the base preference loss with a gated penalty on reference-greedy responses. DPOP activates this penalty only when the current policy still assigns a lower likelihood to the preferred response than to the rejected response. On AlpacaEval 2.0, DPOP improves length-controlled win rate over DPO, SimPO, and AlphaDPO on both Llama-3-8b-it and Gemma-2-9b-it, achieving relative gains of 5.3% and 4.4% over baselines on the two models, respectively. Ablations further show that a SimNPO-style length-normalized penalty is stronger than NPO and token-level unlikelihood in this setting.
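
The gating rule described in this abstract can be sketched on scalar sequence log-likelihoods. The DPO term below follows the standard DPO loss; the penalty term is only an assumed length-normalized form chosen for illustration, and `gated_loss`, `lam`, and the example numbers are hypothetical rather than taken from the paper.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def dpo_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    # Standard DPO loss on sequence log-likelihoods of the chosen (w) and
    # rejected (l) responses under the policy and the reference model.
    margin = beta * ((pol_w - ref_w) - (pol_l - ref_l))
    return -log_sigmoid(margin)

def gated_loss(pol_w, pol_l, ref_w, ref_l, pol_g, len_g, beta=0.1, lam=1.0):
    # Gate: active only while the policy still ranks the rejected response
    # above the preferred one (the condition stated in the abstract).
    gate = 1.0 if pol_w < pol_l else 0.0
    # Assumed penalty: shrinks as the policy's per-token log-likelihood of the
    # reference-greedy response (pol_g over len_g tokens) decreases.
    penalty = -log_sigmoid(-beta * pol_g / len_g)
    return dpo_loss(pol_w, pol_l, ref_w, ref_l, beta) + lam * gate * penalty

# Policy already prefers the chosen response: the gate is closed.
closed = gated_loss(-10.0, -12.0, -11.0, -11.0, pol_g=-30.0, len_g=20)
# Policy still prefers the rejected response: the gate opens.
opened = gated_loss(-12.0, -10.0, -11.0, -11.0, pol_g=-30.0, len_g=20)
```

With the gate closed the objective reduces exactly to DPO, so the extra term only affects pairs the policy currently gets wrong.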

33. Keep Policy Gradient in Charge: Sibling-Guided Credit Distillation for Long-Horizon Tool-Use Agents

AI summary: To address the sparse trajectory-level advantage signal in long-horizon tool-use reinforcement learning, proposes Sibling-Guided Credit Distillation (SGCD), which dynamically samples successful and failed rollouts and has an external LLM contrast them into a stepwise credit reference, yielding dense credit assignment and significant gains on AppWorld and τ³-airline.

Link: https://arxiv.org/abs/2606.12634

Affiliations: Amazon Web Services

Authors: Tianyu Ding, Jianhong Xin, Juan Pablo De la Cruz Weinstein

Abstract: Long-horizon tool-use reinforcement learning can learn from outcome verification, but its trajectory-level advantage is broadcast across many reasoning, API, and answer tokens. Self-distillation promises a denser signal by reusing a policy's own rollouts or a privileged teacher. We show, however, that direct token-level self-distillation can silently destroy tool use: it rehearses teacher behavior without knowing which actions the verifier rewards, so useful skills and harmful shortcuts are amplified together. We introduce Sibling-Guided Credit Distillation (SGCD), which uses distillation for credit assignment rather than as a competing actor loss. Dynamic sampling produces mixed successful and failed sibling rollouts; an external LLM summarizes their contrast into a training-only stepwise credit reference; dense teacher/student divergence drives credit reassignment; and bounded detached credit weights reshape GRPO token advantages. The deployed student sees no external LLM, sibling evidence, or oracle. Across AppWorld and $τ^3$-airline, SGCD improves over matched GRPO comparators: AppWorld TGC $42.9 \to 45.6$ on test_normal and $24.7 \to 27.0$ on test_challenge, and $τ^3$-airline pass@1 $0.583 \to 0.602$.

34. Individual Control Barrier Functions-Guided Diffusion Model for Safe Offline Multi-Agent Reinforcement Learning

AI summary: Proposes an offline multi-agent reinforcement learning algorithm that embeds neural individual control barrier functions into a diffusion model and recovers control policies through inverse dynamics, substantially improving the safety of trajectory generation while maintaining reward.

Link: https://arxiv.org/abs/2606.12640

Affiliations: Department of Electrical Engineering and Automation, Aalto University; School of Computing and Data Science, Xiamen University Malaysia; Department of Computer Science, University of Toronto

Authors: Qingyun Guo, Junyi Shi, Jianuo Huang, Tianyu Shi

Abstract: Offline reinforcement learning allows control policies to be learned directly from data without online interaction, making it suitable for safety-critical tasks. Recent studies have applied diffusion models to offline reinforcement learning to leverage their strong capacity for modeling complex data distributions. However, existing approaches primarily focus on single-agent settings, leaving the safety challenges in multi-agent environments largely unexplored. In this work, we propose a safe offline multi-agent reinforcement learning algorithm that embeds neural individual control barrier functions into the diffusion model to enhance safety during trajectory generation, with control policies recovered through inverse dynamics. We evaluate our algorithm across diverse benchmarks, demonstrating substantial safety improvements while maintaining competitive rewards.

35. ProPlay: Procedural World Models for Self-Evolving LLM Agents

AI summary: Proposes ProPlay, a procedural world model that uses procedure-level preplay and a causal procedure graph to let LLM agents self-evolve in partially observable environments without external supervision.

Link: https://arxiv.org/abs/2606.12780

Affiliations: University of Notre Dame; University of Connecticut

Authors: Yijun Ma, Zehong Wang, Yiyang Li, Ziming Li, Xiaoguang Guo, Weixiang Sun, Chuxu Zhang, Yanfang Ye

Abstract: Self-evolving agents are expected to improve through interaction without external supervision, but this remains difficult in partially observable environments where agents must explore actively, learn from limited feedback, and decide when to trust prior experience. Existing LLM-agent methods often rely on memory or planning modules, yet they rarely close the loop between them to continually refine an internal understanding of environment dynamics. We introduce ProPlay, a procedural world model that supports procedure-level preplay, where agents can rehearse future procedural paths using the learned world knowledge. Rather than representing experience as isolated rules or low-level action constraints, ProPlay abstracts successful trajectories into procedures and organizes them in a procedure graph that captures causal transitions among task stages. Each transition is associated with a reliability record embedding to estimate its task-specific contribution from past outcomes. Before each episode, ProPlay simulates future procedural trajectories over known graph structures as structured soft guidance; after execution, it refines the graph using environment feedback. Experiments on public benchmarks show that ProPlay consistently improves environment understanding and self-evolution capability over strong baselines. Our code has been released at https://github.com/antman9914/proplay.

36. SymQNet: Amortized Acquisition for Low-Latency Adaptive Hamiltonian Learning

AI summary: Proposes SymQNet, an amortized reinforcement learning approach that learns a posterior-conditioned acquisition policy offline and runs a fast forward pass online, substantially reducing acquisition latency in adaptive Hamiltonian learning.

Link: https://arxiv.org/abs/2606.12808

Authors: Yash Vardhan Tomar, Dheeraj Peddireddy, Vaneet Aggarwal

Abstract: Adaptive Hamiltonian learning is central to calibrating and characterizing quantum devices. In an adaptive controller, choosing the next experiment is itself a computation. Bayesian design rules are recomputed after every posterior update, and that step can take seconds. Across hundreds of shots, those seconds become a significant wall-clock cost for adaptivity. We introduce SymQNet, an amortized reinforcement-learning approach for low-latency adaptive Hamiltonian learning. SymQNet learns a posterior-conditioned acquisition policy offline, then uses a fast policy forward pass online while retaining Bayesian posterior feedback. On transverse-field Ising benchmarks, SymQNet substantially reduces acquisition latency relative to bounded Fisher-information search and bounded two-step Bayesian active learning by disagreement (BALD). At five qubits, it reduces acquisition-only decision latency by $47.1\times$ and $72.6\times$ relative to these online baselines; at twelve qubits, full simulated steps take $1.02$ s for SymQNet versus $13.27$ s for bounded two-step BALD. Overall, we show that learned acquisition can make adaptive Hamiltonian learning practical for repeated low-latency workloads.

37. Reinforcement Learning for Neural Model Editing

AI summary: Formalizes neural model editing as a reinforcement learning problem in which editing policies are learned from reward feedback, with good results on bias mitigation and machine unlearning.

Link: https://arxiv.org/abs/2606.13461

Authors: Shaivi Malik

Abstract: Editing pretrained neural networks requires specialized algorithms tailored to specific objectives. Designing such algorithms is often time-consuming and demands significant effort. We present an exploratory framework that formulates neural model editing as a reinforcement learning problem, where agents modify models using reward feedback. We introduce two environments: MaskWorld, where agents scale weights multiplicatively, and ShiftWorld, where agents apply additive weight updates. The reward function combines a utility-preservation objective with a task-specific editing objective, enabling agents to learn targeted modifications while maintaining overall model performance. We evaluate the framework on bias mitigation in text classification and machine unlearning in image classification, both of which traditionally rely on specialized algorithms. Our results show that the learned policies reduce forget set accuracy to nearly 0% while preserving over 90% retain set accuracy on the unlearning task. In the bias mitigation setting, the learned policies improve bias-related performance by more than 5% while maintaining general classification utility. Our findings show that neural model editing can be cast as a reinforcement learning problem, allowing editing policies to be learned from reward feedback rather than manually engineered for each task.

38. MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling

AI summary: Proposes the MaxProof framework, which combines generative-verifier reinforcement learning with population-level test-time scaling to achieve competition-level mathematical proof on the MiniMax-M3 series, exceeding the human gold-medal threshold on IMO 2025 and USAMO 2026.

Link: https://arxiv.org/abs/2606.13473

Affiliations: MiniMax; The Chinese University of Hong Kong; Fudan University; Peking University; Tsinghua University

Authors: Jiacheng Chen, Xinyu Zhang, Shunkai Zhang, Yanmohan Wang, Lin Li, Tiancheng Qin, Qin Wang, Zhengmao Zhu, Tianle Li, Jingyang Li, Zehan Li, Binyang Jiang, Jin Zhu, Han Ding, Fei Yu, Chenyu Du, Zijian Song, Jiayuan Song, Zhi Zhang, Yunan Huang, Weiyu Cheng, Pengyu Zhao, Yu Cheng

Abstract: We present MaxProof, a population-level test-time scaling framework for competition-level mathematical proof in the MiniMax-M3 series. M3 first trains three proof-oriented capabilities -- proof generation, proof verification, and critique-conditioned proof repair -- using a defense-in-depth generative verifier engineered for low false-positive rate. These capabilities are merged into a single released M3 model. At test time, MaxProof treats the model as a generator, verifier, refiner, and ranker, searches over a population of candidate proofs, and returns one final proof through tournament selection. With MaxProof test-time scaling, the M3 model reaches 35/42 on IMO 2025 and 36/42 on USAMO 2026, exceeding the human gold-medal threshold on both.

4. Generative Models and Probabilistic Modeling | 10 papers

39. Net-Ev$^2$: A Generative Simulator for Network Event Evolution

AI summary: Proposes Net-Ev$^2$, a generative simulator that combines event cues with network topology, simulating network event evolution through structure-guided masked pre-training and a topology-aware diffusion process, and achieving state-of-the-art performance on multiple road-network datasets.

Link: https://arxiv.org/abs/2606.12494

Institutions: NYU Shanghai

Authors: Guangyu Wang, Zhaonan Wang

Abstract: Reducing real-world trial and error has long been a central goal of decision making, and generative simulators advance this goal by modeling the evolution of future states. An even more challenging yet meaningful task is simulating how disturbance events (e.g., accidents) propagate their impacts across real-world networks. Existing approaches fall short of modeling both the structured attributes and the unstructured semantics of events, and of capturing topological structures when simulating network event evolution. Therefore, we are motivated to propose Net-Ev$^2$ ($\underline{\textbf{Net}}$work $\underline{\textbf{Ev}}$ent $\underline{\textbf{Ev}}$olution), a novel generative simulator that jointly leverages event cues while preserving network topology in simulations. Specifically, the framework consists of two stages, namely structure-guided masked pre-training and topology-aware diffusion process, which is achieved by U-Net-like graph downsampling and upsampling during denoising. At inference time, Net-Ev$^2$ can generate simulations using natural-language event input only, with greater flexibility for practical usage. Furthermore, we introduce Net-Ev$^2$-6.5M, a multimodal benchmark of aligned event and network traffic data across four large-scale road networks, as well as a new topology-aware metric, namely JL-MMD, to evaluate topological fidelity in generated network dynamics. Extensive experiments demonstrate the state-of-the-art performance and strong generalization ability of Net-Ev$^2$. Code is made available at https://github.com/Guangyu4/Net-Ev-2.

40. A Stabilized Path-Space Approach to Diffusion-Based Posterior Sampling

AI summary: Proposes a stabilized path-space framework that uses stochastic optimal control and trust-region optimization to achieve accurate and robust posterior sampling in nonlinear inverse problems.

Link: https://arxiv.org/abs/2606.12710

Institutions: Oden Institute for Computational Engineering and Sciences, The University of Texas at Austin; Mitsubishi Electric Research Laboratories (MERL); Department of Biomedical Engineering, The University of Texas at Austin

Authors: Evan Scope Crafts, Umberto Villa, Saviz Mowlavi, Yanting Ma, Hassan Mansour, Wael H. Ali

Abstract: Diffusion models provide expressive data-driven priors for Bayesian inverse problems, but many diffusion posterior samplers rely on heuristic guidance approximations that can fail for nonlinear operators and multimodal posteriors. In this work, we develop a stabilized path-space framework for diffusion-based posterior sampling. Starting from a base diffusion process whose terminal marginal represents the prior, we define a likelihood-weighted target measure on trajectories and cast posterior sampling as learning a controlled stochastic process whose path measure matches this target. This formulation connects diffusion posterior sampling to stochastic optimal control while preserving the Bayesian structure needed for uncertainty quantification. We introduce a time reparameterization that makes the path-space control problem well posed by removing the bias induced by the unknown initial value function, without auxiliary training. We then learn the control via a trust-region path-space optimization method with log-variance objectives. The path-space perspective also unifies our learned control approach with existing guidance-based samplers, quantifies the sampling error induced by approximate controls, and yields importance sampling corrections for asymptotically exact posterior expectations. We evaluate the proposed framework on a suite of benchmark inverse problems with analytically characterized or high-quality reference posteriors, enabling principled assessment of sampling accuracy and uncertainty quantification. These experiments provide insight into the behavior of diffusion-based posterior samplers and demonstrate improved accuracy and robustness over leading approaches.

41. The Geometry of Phase Transitions in Generative Dynamics via Projection Caustics

AI summary: Explains phase-transition behavior in generative dynamics through the geometry of projection caustics, and proposes the Critical Boundary Detector (CBD) to diagnose score-direction instability, localize mode commitment, and support control in sensitive regions.

Link: https://arxiv.org/abs/2606.13191

Institutions: Institute for the Advanced Study of Human Biology, Institute for Advanced Study, Kyoto University; Graduate School of Engineering, The University of Tokyo

Authors: Ryosuke Sakamoto, Kotaro Sakamoto

Abstract: Continuous-state generative samplers, including diffusion and flow-matching models, evolve through continuous reverse-time dynamics, yet their samples often undergo abrupt qualitative changes: trajectories commit to modes, semantic alternatives collapse, and small perturbations in narrow time windows can produce large downstream effects. This paper develops a geometric account of such phase-transition-like behaviour. We view denoising as gradient descent on a free energy landscape and show that sharp transitions arise near projection caustics, where the nearest-point projection onto the data support ceases to be unique. Motivated by this perspective, we introduce the Critical Boundary Detector (CBD) as a practical diagnostic for score-direction instability. Across toy models, standard diffusion models, and latent text-to-image diffusion models, CBD localises mode commitment, predicts intervention-sensitive windows, and supports targeted control in geometrically sensitive regions. Our results connect the geometry of data with the dynamics of diffusion generation.

42. Towards More General Control of Diffusion Models Using Jeffrey Guidance

AI summary: Proposes the Jeffrey guidance framework, which updates marginal distributions via Jeffrey's rule of conditioning, extending diffusion-model control to applications that standard guidance cannot express; it substantially reduces FID on CIFAR-10 and FFHQ and achieves fairness control on CelebA-HQ.

Link: https://arxiv.org/abs/2606.13240

Institutions: Inria, CNRS, I3S, Maasai Université Côte d’Azur; Technical University of Denmark; Inria, CNRS, LJAD, Maasai Université Côte d’Azur

Authors: Raphaël Razafindralambo, Rémy Sun, Frédéric Precioso, Jes Frellsen, Pierre-Alexandre Mattei

Abstract: A key strength of diffusion models lies in their flexibility, since their outputs can be controlled at sampling time through guidance. However, beyond simple cases such as conditional sampling, the target distribution is often left implicit, defined only through a sampling rule or a heuristic energy function. To address this, we propose Jeffrey guidance, a principled framework that extends diffusion-model control to applications beyond what standard guidance can express. It leverages Jeffrey's rule of conditioning to update marginal distributions towards a prescribed target, preserving the conditional structure and minimally perturbing the joint distribution. We first demonstrate Jeffrey guidance by targeting a prescribed embedding distribution. With Inception embeddings as the target, this leads to substantial reductions in FID on both CIFAR-10 and FFHQ. We further apply Jeffrey guidance to fairness on CelebA-HQ, updating an unconditional diffusion model to enforce independence between attributes.
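
As a rough illustration of the rule this abstract builds on, the sketch below applies the textbook discrete form of Jeffrey's rule of conditioning, $P'(a, b) = P(a \mid b)\, q(b)$, to a toy joint table. It is not the paper's diffusion guidance method; the function name `jeffrey_update` and the toy numbers are assumptions made for illustration.

```python
import numpy as np

def jeffrey_update(joint, target_marginal):
    """Textbook Jeffrey's rule on a discrete joint P(a, b).

    Replaces the marginal of B with target_marginal while keeping every
    conditional P(a | b) unchanged. Assumes every column of `joint` has
    positive mass. (Hypothetical helper, not from the paper.)
    """
    joint = np.asarray(joint, dtype=float)
    target_marginal = np.asarray(target_marginal, dtype=float)
    marginal_b = joint.sum(axis=0)        # P(b), one entry per column
    conditional = joint / marginal_b      # P(a | b), normalised column-wise
    return conditional * target_marginal  # P'(a, b) = P(a | b) * q(b)

# Toy joint over A in {0, 1} (rows) and B in {0, 1} (columns).
joint = np.array([[0.30, 0.10],
                  [0.20, 0.40]])
updated = jeffrey_update(joint, target_marginal=[0.8, 0.2])
```

The updated table has the prescribed marginal over B, and its columns, once renormalised, match the original conditionals, which is the "preserving the conditional structure" property the abstract refers to.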

43. Enhanced Low-Density Region Exploration in Classifier-Guided Diffusion Models Through Modified Reverse Diffusion Sampling

AI summary: Proposes a training-free, sampling-time, density-aware method that modifies the classifier gradient to steer trajectories toward low-confidence regions while also guiding sampling toward the predicted real image, improving diffusion models' exploration of low-density regions.

Link: https://arxiv.org/abs/2606.13347

Authors: Jagriti Singh, Shekhar Verma, Muneendra Ojha

Abstract: Diffusion models have emerged as state-of-the-art generative models for high-fidelity image synthesis, particularly in their classifier-free guided and classifier-guided forms. However, standard classifier guidance concentrates probability mass around high-density class means, leading to poor coverage of rare samples in the tails of the class-conditional distributions. Recent work on diffusion-based tail sampling mitigates this by training an additional low-density-seeking classifier with a synthetic-vs-real discriminator, at the cost of additional networks and training. In parallel, a number of samplers and distillation techniques accelerate or refine diffusion sampling, but do not explicitly address long-tail coverage. We propose a purely sampling-time, density-aware extension of classifier-guided conditional diffusion models that targets low-density regions without any additional training. We apply guidance to the noisy images rather than to the predicted noise, unlike most diffusion models. Starting from a pretrained conditional diffusion model and classifier on ImageNet, we modify the guided reverse dynamics by steering trajectories toward low-confidence regions via the modified classifier gradient, and at each time step, we also guide the sampling process toward the predicted real image. The first guidance term helps explore low-probability samples, and the second keeps generated samples close to the real data manifold. The proposed sampler consistently improves ADM model recall at 64x64 resolution while maintaining a comparable FID, and with a 256x256 ADM model, we show results visually for different combinations of the two guidance terms. We also show that standard ADM classifier guidance, combined with predicted real image guidance, helps generate high perceptual quality samples with a 256x256 ADM model on ImageNet.

44. VideoMDM: Towards 3D Human Motion Generation From 2D Supervision

AI summary: Proposes the VideoMDM framework, which learns 3D motion priors with a diffusion model from 2D poses in monocular video, using a depth-weighted 2D reprojection loss to approximate 3D supervision, and approaches fully 3D-supervised performance on HumanML3D.

Link: https://arxiv.org/abs/2606.13364

Institutions: Technion; NVIDIA

Authors: Amir Mann, Gal Michael Harari, Merav Keidar, Or Litany

Abstract: We introduce VideoMDM, a diffusion-based framework that trains 3D human motion priors directly from accurate 2D poses extracted from monocular videos, without any 3D ground truth. A pretrained 2D-to-3D lifter provides approximate 3D pose sequences that serve as a noisy teacher: these are diffused, denoised by the model in 3D, and supervised in 2D by reprojecting the prediction and comparing against accurate keypoints. We show that, under mild assumptions, a depth-weighted 2D reprojection loss is equivalent in expectation to direct 3D supervision, and we adapt standard 3D motion regularizers - velocity consistency and over-parameterized representation alignment - to this 2D setting. Unlike methods that lift 2D to 3D only at inference, VideoMDM learns a coherent 3D motion manifold during training. On HumanML3D it nearly closes the gap to fully 3D-supervised MDM (FID 0.88 vs 0.54); on the real video datasets Fit3D and NBA, the method learns to generate motions consistently preferred by humans, with strong quantitative results.

45. Hölder++: Improving the Quality-Coherence Trade-off in Multimodal VAEs

AI summary: To address the trade-off between generative quality and semantic coherence in multimodal VAEs, proposes Hölder++, which uses exact Hölder pooling, an extended architecture, and hierarchical inference to improve coherence while preserving generative quality.

Link: https://arxiv.org/abs/2606.13381

Authors: Huyen Vo, María Martínez-García, Isabel Valera

Abstract: Existing approaches for multimodal variational autoencoders (VAEs) face a trade-off between generative quality and coherence, i.e., they struggle to generate realistic and diverse samples that, at the same time, are semantically consistent across modalities. A recent work shows that using a simple approximation to Hölder pooling as an aggregation method improves coherence over the SOTA MMVAE+, despite assuming a single shared representation across all modalities. Yet, it slightly compromises sample diversity. Inspired by this insight, we propose Hölder++, a novel multimodal VAE that improves the generative quality-coherence trade-off through: (i) the first implementation of Hölder pooling without any approximation for multimodal VAEs; (ii) an extended architecture that models distinct shared and private (i.e., modality-specific) representations (Hölder+); and (iii) hierarchical inference that further enhances the disentanglement between the shared and private representations (Hölder++). Our experiments corroborate that Hölder++ consistently improves the generative quality-coherence trade-off, yields more structured latent spaces, and learns shared representations that are informative for downstream tasks.

46. PolyFlow: Safe and Efficient Polytope-Constrained Flow Matching with Constraint Embedding and Projection-free Update

AI summary: Proposes PolyFlow, a polytope-constrained flow matching framework that embeds constraints directly into the model and flow dynamics; its discrete-time flow formulation and projection-free architecture eliminate discretization error and strictly satisfy arbitrary polyhedral constraints, achieving zero constraint violation and lower inference latency on planning and control tasks.

Link: https://arxiv.org/abs/2606.13400

Authors: Jianming Ma, Qiyue Yang, Yang Zhang, Liyun Yan, Zhanxiang Cao, Yazhou Zhang, Yue Gao

Abstract: While flow-based generative models have demonstrated strong performance across a wide range of domains, deploying them in safety-critical physical systems remains challenging due to strict constraint requirements. Existing approaches typically enforce safety through post-hoc corrections, which incur substantial computational overhead and may distort the learned distribution. We propose PolyFlow, a polytope-constrained flow matching framework that embeds constraints directly into the model and flow dynamics. PolyFlow introduces a discrete-time flow formulation and a projection-free architecture, which eliminate the discretization error and guarantee strict satisfaction of arbitrary polyhedral constraints, without the need for expensive iterative solvers. Experimental results show that PolyFlow achieves zero constraint violation while maintaining high distributional fidelity across a range of planning and control tasks. Compared to state-of-the-art constrained generation baselines, PolyFlow significantly reduces inference latency and demonstrates a favorable trade-off between safety, efficiency, and generative quality. Code is available at https://github.com/MJianM/PolyFlow.

47. Accelerating Speculative Diffusions via Block Verification

AI summary: Proposes a speculative sampling scheme for diffusion models that raises draft acceptance rates through block verification; the training-free Free Drafter achieves up to a 6.3% speedup.

Link: https://arxiv.org/abs/2606.13426

Authors: Alexander Soen, Hisham Husain, Valentin De Bortoli, Arnaud Doucet

Abstract: Speculative decoding speeds up LLM inference by using a draft model to generate tokens, with an acceptance-rejection scheme that ensures that the output matches the target distribution. Adapting this to continuous diffusions is difficult because speculative sampling requires drawing from a residual distribution. While straightforward in discrete spaces, efficiently sampling this residual in continuous space is non-trivial. Consequently, existing diffusion adaptations either use computationally inefficient sampling techniques or rely on an alternative scheme. In this work, we introduce a novel scheme that efficiently implements the original speculative sampling mechanism for diffusion models. Our approach offers a critical advantage over current methods: it enables us to adapt block verification from LLMs to diffusions -- which provably improves the acceptance rate of drafts. Furthermore, we formalize and analyze the Free Drafter, a heuristic self-speculative drafter for diffusions that requires no training. By enabling block verification, our Free Drafter yields up to a 6.3% speedup over existing speculative methods with no additional training and negligible overhead beyond the existing parallel verification pass.
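
To make the "residual distribution" concrete, here is a minimal sketch of the standard single-token speculative sampling step in a discrete space, the easy case the abstract contrasts with continuous diffusions. It does not implement the paper's diffusion scheme or block verification; the function names and the toy distributions are assumptions made for illustration.

```python
import numpy as np

def residual_distribution(p_target, q_draft):
    """Normalised positive part of (p - q), sampled from after a rejection."""
    residual = np.maximum(p_target - q_draft, 0.0)
    total = residual.sum()
    if total == 0.0:
        # p == q everywhere: drafts are never rejected, so any valid
        # distribution works here; return the target itself.
        return p_target.copy()
    return residual / total

def speculative_sample(p_target, q_draft, rng):
    """Draw one token whose law is exactly p_target, using q_draft as drafter."""
    x = rng.choice(len(q_draft), p=q_draft)            # draft proposal
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):
        return x                                       # draft accepted
    # Draft rejected: resample from the residual distribution.
    return rng.choice(len(p_target), p=residual_distribution(p_target, q_draft))

p = np.array([0.5, 0.3, 0.2])   # toy target distribution
q = np.array([0.2, 0.5, 0.3])   # toy draft distribution
rng = np.random.default_rng(0)
token = speculative_sample(p, q, rng)

# Exact output law: accepted mass plus rejected mass routed to the residual.
accepted = np.minimum(p, q)
output_law = accepted + (1.0 - accepted.sum()) * residual_distribution(p, q)
```

In the discrete case the residual is a finite vector that can be normalised directly, and `output_law` equals `p` exactly. The difficulty the abstract describes is that no such direct normalisation is available for densities in continuous space.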

48. A2D2: Fine-Tuning Any-Length Discrete Diffusion for Adaptive Decoding

AI summary: Proposes the A2D2 framework for reward-guided fine-tuning of any-length discrete diffusion models via joint optimization of the insertion and unmasking policies together with a quality-based inference schedule, with theoretically guaranteed convergence to the reward-tilted distribution; experiments show improved reward optimization along with greater generation flexibility and accuracy.

Link: https://arxiv.org/abs/2606.13565

Authors: Sophia Tang, Yuchen Zhu, Molei Tao, Pranam Chatterjee

Abstract: Discrete diffusion models offer a simple and stable likelihood-based framework for sequence generation, recently extended to any-length settings via token insertion. Principled reward-guided fine-tuning for any-length discrete diffusion, however, remains largely unexplored. We introduce Fine-Tuning Any-Length Discrete Diffusion for Adaptive Decoding (A2D2), a unified framework for reward-guided fine-tuning of any-length discrete diffusion models via joint optimization of the insertion and unmasking policies together with a quality-based inference schedule. We derive the Radon-Nikodym derivative for the joint insertion-unmasking path measures, enabling theoretically guaranteed convergence to the intractable reward-tilted sequence distribution without requiring target samples. Building on this, we establish unmasking and insertion quality as tractable approaches for minimizing decoding error and introduce the Adaptive Joint Decoding (AJD) loss, which provably yields the optimal path measure that generates the reward-tilted distribution. Empirically, A2D2 improves reward optimization while enhancing generation flexibility and accuracy over prior fixed-length fine-tuning and inference-time guidance methods.

5. Optimization, Generalization, and Theoretical Analysis | 11 papers

49. Two-Layer Linear Auto-Regressive Models Estimate Latent States

AI summary: Proves that two-layer linear auto-regressive models trained by empirical risk minimization approximate Kalman filtering and recover latent state estimates, with finite-sample guarantees.

Link: https://arxiv.org/abs/2606.12691

Authors: Yahya Sattar, Sunmook Choi, Leo Maynard-Zhang, Yassir Jedra, Maryam Fazel, Sarah Dean

Abstract: Auto-regressive models have emerged as powerful tools for sequential data, from language to video. Understanding how and why these models learn latent representations remains an open theoretical question. In this work, we demonstrate that when trained by empirical risk minimization on data from partially observed linear dynamical systems, two-layer linear auto-regressive models naturally learn to approximate Kalman filtering. In particular, we show that the learned hidden representation coincides, up to a similarity transformation, with the state estimates produced by the optimal (Kalman) filter, even though the model has no explicit knowledge of the underlying dynamics or state. The result follows from three main insights. First, we establish that the Kalman filter is well approximated by an auto-regressive model with bounded truncation error. Second, we show that despite non-convexity, the two-layer optimization landscape is benign, i.e., all stationary points are either strict saddles or global minima. Finally, as our main contribution, we provide finite-sample guarantees on prediction error, parameter estimation error, and latent state recovery. Numerical simulations support the theoretical results and demonstrate that the latent representations of auto-regressive models recover state estimates.

50. Adaptive Weighted Averaging

AI summary: Proposes strategies for selecting the largest of several unknown values given a single unbiased estimate of each; the strategies are admissible and never worse than a given baseline, and an application to stochastic optimization yields online-to-batch conversion bounds.

Link: https://arxiv.org/abs/2606.12763

Institutions: University of Utah; Boston University; Google

Authors: Aditya Bhaskara, Ashok Cutkosky, Ravi Kumar, Manish Purohit

Abstract: We study the problem of selecting the largest among $n$ unknown values $x_1,\dots,x_n$ given only a single unbiased estimate $y_i$ for each $x_i$. We design strategies that are simultaneously admissible (not uniformly dominated by any other strategy) and also never worse than a given baseline such as uniform random selection. We provide an application to stochastic optimization, where we obtain online-to-batch conversion bounds with a desirable "no-compromise" guarantee: they are never worse than standard random iterate selection, and yet can be significantly better in benign settings.

51. LoRA-Muon: Spectral Steepest Descent on the Low-Rank Manifold

AI summary: Proposes the LoRA-Muon optimizer, which applies Muon's spectral steepest-descent rule to low-rank finetuning, addressing LoRA's sensitivity to initialization and the poor transfer of optimal learning rates across ranks; on TinyShakespeare, a rank-32 run reaches lower validation loss than the dense baseline.

Link: https://arxiv.org/abs/2606.12921

Institutions: Ateneo de Manila University; EleutherAI; NaXys, UNamur

Authors: Franz Louis Cesista, Katherine Crowson, Cédric Simal, Stella Biderman

Abstract: Low-Rank Adaptation (LoRA) significantly reduces compute and memory costs for finetuning deep learning models but is often harder to tune than dense training: when using factor-wise optimizers such as AdamW, it is sensitive to initialization choices, its optimal learning rates transfer poorly across ranks, and it often fails to beat dense baselines. We derive LoRA-Muon by applying the Muon optimizer's spectral steepest-descent rule to the low-rank setting. Along with our split weight-decay rule, our main claim is that LoRA-Muon is a good low-rank proxy for full-rank Muon and Shampoo-family optimizers. Its optimal learning rates transfer across rank, width, depth, and factor-rescaling. In our compute-matched TinyShakespeare study, a rank-$2$ proxy recovers the dense best tested learning rate, and a rank-$32$ LoRA-Muon run attains lower mean validation loss than the dense baseline in the seed-averaged sweep. We further show that the Spectron optimizer depends on arbitrary factor scaling, so it would likely be a poor fit when finetuning starts from badly imbalanced factors, and that LoRA-RITE's simplified QR-coordinate core implements the same spectral update. LoRA-Muon computes that update without QR-decomposition and avoids storing second moments, making it more accelerator-friendly and memory-efficient.
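
For readers unfamiliar with the dense rule being adapted, the sketch below computes the steepest-descent direction under the spectral norm for a full gradient matrix: if $G = U S V^\top$, the direction is $U V^\top$. This is only the dense idea, computed here with an exact SVD for clarity (Muon itself approximates it with a Newton-Schulz iteration); it is not the low-rank LoRA-Muon update, which the abstract does not spell out. The function name and toy gradient are assumptions.

```python
import numpy as np

def spectral_steepest_direction(grad):
    """Return U @ V^T for grad = U S V^T (the orthogonalised gradient).

    Among all matrices with spectral norm at most 1, this one has the
    largest inner product with grad, so a descent step moves along its
    negative. (Hypothetical helper using exact SVD, for illustration only.)
    """
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt

rng = np.random.default_rng(0)
grad = rng.normal(size=(4, 3))          # toy gradient, full rank almost surely
direction = spectral_steepest_direction(grad)

# A plain (momentum-free) update would then be: weights -= lr * direction
```

Every singular value of `direction` equals 1, so the step size is set by the learning rate alone rather than by the gradient's scale.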

52. Is Spurious Correlation Removal Always Learnable?

AI summary: Studies computational barriers to invariant learning when the invariant structure is statistically identifiable, showing conditionally that there exist samplable multi-environment instances with a one-dimensional invariant subspace on which no polynomial-time algorithm can reach constant accuracy, and quantifies how environment diversity affects identifiability and risk.

Link: https://arxiv.org/abs/2606.12930

Authors: Yibo Zhou, Bo Li, Hai-Miao Hu, Hanzi Wang, Xiaokang Zhang, Ruifan Zhang

Abstract: Invariant learning can fail even when the invariant structure is statistically identifiable. We show a conditional computational barrier: under a black-box samplable supervised sparse recovery primitive motivated by average-case sparse-recovery reductions, there exist \emph{samplable} multi-environment instances with a one-dimensional predictive invariant subspace ($k=1$) that are learnable with polynomial samples by exhaustive search, while any polynomial-time constant-accuracy recovery algorithm would contradict the primitive. We further quantify environment diversity by a separation parameter $\gamma$, which controls identifiability and the curvature of invariance objectives. Under sufficient diversity and local Gaussian regularity, the minimax risk is $\mathbb{E}[\mathrm{dist}(\hat{V},V_{\mathrm{inv}})^2]=\Theta(k(d-k)/(n|\mathcal{E}|))$, and under label-induced shifts a phase transition occurs at $n^*\propto k(d-k)/(|\mathcal{E}|\gamma^2)$ with refined estimation error scaling proportional to $1/\gamma^2$. Synthetic and real datasets illustrate the predicted gaps and transitions and motivate simple diversity diagnostics.

53. Exposure Bias as Epistemic Underidentification in Recursive Forecasting

AI summary: Shows that exposure bias in recursive multi-step forecasting is not only distribution shift but also an epistemic underidentification problem under partial observability, and proposes a provenance-variable-based error decomposition and correction method.

Link: https://arxiv.org/abs/2606.12990

Institutions: University of Bristol

Authors: Riku Green, Zahraa S. Abdallah, Telmo M Silva Filho

Abstract: Recursive multi-step forecasting is usually framed as distribution shift: models are trained on observed histories but deployed on their own predictions. We show this framing is incomplete by proving that, under partial observability or state truncation, recursive rollout is also an epistemic underidentification problem. Even with deterministic latent dynamics, one-step Bayes supervision identifies behavior only on observed contexts and need not identify the deployed recursive predictor once rollout queries self-generated induced states whose correct local targets are not determined by numeric state alone. We formalize this with induced states $Z$ and provenance variables $P$, and derive a decomposition of induced-state error into teacher-forcing/rollout mismatch, representation--class approximation, and provenance information gaps. Empirically, we show that rollout enters a distinct induced-state regime, that fixed induced states define a distinct local corrective task, and that closed-loop gains arise not only from local adaptation but also from changing the induced states visited during rollout. Using a simple binary provenance encoding, provenance-aware correction can further improve performance, though gains are conditional rather than uniform. These results recast exposure bias as reasoning under self-induced epistemic uncertainty.

54. Limits of spectral learning under noise

AI summary: Studies how additive label noise affects spectral methods in supervised regression, deriving a closed-form expression for the overlap between noisy and noiseless coefficient vectors and revealing a universal degradation curve governed by a single intrinsic noise scale.

Link: https://arxiv.org/abs/2606.13067

Institutions: Jožef Stefan Institute; Faculty of Mathematics and Physics, University of Ljubljana; Department of Chemical Engineering, Universitat Rovira i Virgili; Center for Computational Science and Applied Mathematics (ComSCIAM), Universitat Rovira i Virgili; ICREA

Authors: Sabin Roman, Ljupco Todorovski, Saso Dzeroski, Marta Sales-Pardo, Roger Guimera

Abstract: Learning functional relationships from noisy data is a central problem in scientific inference. Spectral methods approximate unknown functions by expanding them in a basis and estimating the corresponding coefficients from data, but the stability of these coefficients under noise remains poorly understood. Here we study supervised regression with additive label noise using sparse spectral representations across multiple bases and dimensions. We show that noise induces a predictable drift in the learned coefficient vector whose magnitude depends on the effective number of active spectral modes. After whitening the empirical feature geometry, we derive a closed-form expression for the overlap between noisy and noiseless coefficient vectors, revealing a universal degradation curve governed by a single intrinsic noise scale. Numerical experiments across Fourier, Legendre, Bessel, and Haar bases confirm the theoretical prediction. The results demonstrate that spectral learning exhibits a fundamental noise threshold beyond which coefficient estimates become unstable, placing intrinsic limits on recovering functional structure from noisy data.

55. Scale Buys Interpolation, Structure Buys a Horizon: Certified Predictability for Equivariant World Models

AI summary: For equivariant latent world models, proposes a computable multi-step certificate of the predictable horizon, proving that $T$-step rollout error is constant over each symmetry orbit and stratified by the Lyapunov spectrum, and that the certificate is exclusive to equivariant models.

Link: https://arxiv.org/abs/2606.13092

Authors: Hongbo Wang

Abstract: Scale buys interpolation; structure buys a certified horizon. A world model's average error says nothing about whether a particular prediction can be trusted, or for how long. For equivariant latent world models we give a computable, multi-step certificate of the predictable horizon: $T$-step rollout error is provably constant over each symmetry orbit (Theorem A) and stratified channel-by-channel by the predictor's Lyapunov spectrum, $T_j(\varepsilon)\sim\log(1/\varepsilon)/\lambda_j$. The horizon is two-sided -- a matching lower bound makes approximate equivariance provably horizon-limited -- and the certificate is exclusive to structure: orbit-constant error characterizes equivariance, so no non-equivariant model has it at any scale. Empirically, on 40-D Lorenz-96 only a $\mathbb{Z}_N$-equivariant network recovers the full Lyapunov spectrum ($R^2{=}0.98$); dense and recurrent baselines fail. Because the spectrum is faithful, the certificate acts, a priori: under a fixed sensing budget a $c\times$-inflated certificate provably needs $c\times$ the budget, and the equivariant certificate meets a budget its inflated dense counterpart cannot -- with zero calibration data. The same read-out, unchanged, audits public pretrained world models training-free: TD-MPC2 checkpoints land on the certificate's own scope taxonomy -- calibrated where strongly expansive (ratio 0.94-1.02), optimistic where weakly expansive, correctly abstaining where contracting -- a map a deployed monitor replicates cell-by-cell, out-of-sample. Across the official 1M-317M multitask ladder, calibration does not improve with parameters. On V-JEPA 2-AC (1B, real robot data) the measured cross-check correctly overrides an over-promising tangent spectrum -- the cross-validated audit, not the raw number, is the deployable object. Scale buys interpolation, not a calibrated horizon.
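
The horizon scaling quoted in this abstract, $T_j \sim \log(1/\varepsilon)/\lambda_j$ with $\varepsilon$ the error scale and $\lambda_j$ the $j$-th Lyapunov exponent, can be evaluated directly. The sketch below does only that leading-order arithmetic. It reads $\varepsilon$ as an initial error relative to a unit tolerance and returns an infinite horizon for non-expanding channels; both readings are assumptions, as are the function name and the example spectrum. It is not the paper's certificate.

```python
import math

def predictable_horizon(epsilon, lyapunov_exponent):
    """Leading-order horizon log(1/epsilon) / lambda for one channel.

    Assumes 0 < epsilon < 1. A channel with lambda <= 0 does not amplify
    errors, so this sketch reports an unbounded horizon for it.
    """
    if not 0.0 < epsilon < 1.0:
        raise ValueError("epsilon must lie in (0, 1)")
    if lyapunov_exponent <= 0.0:
        return math.inf
    return math.log(1.0 / epsilon) / lyapunov_exponent

# Hypothetical per-step Lyapunov spectrum, for illustration only.
spectrum = [0.5, 0.1, -0.2]
horizons = [predictable_horizon(1e-2, lam) for lam in spectrum]
```

The more strongly expanding a channel is, the shorter its horizon, which is the sense in which the spectrum stratifies predictability channel by channel.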

56. Loss-Shift Transfer via Bayes Quotients

AI summary: Studies loss shift, where the data distribution is fixed but the loss function changes; uses Bayes quotients to formalize a refinement order on losses, proves that a minimal representation for a coarser loss is insufficient for a strictly finer loss, and gives an exact quantitative identity for finite-output log loss.

Link: https://arxiv.org/abs/2606.13178

Institutions: Athena Research Center; Democritus University of Thrace; International Hellenic University

Authors: Vasileios Sevetlidis

Abstract: Transfer learning is usually studied as a consequence of distribution shift. This paper identifies an orthogonal failure mode in which the data distribution is fixed and the loss changes. This setting is called \emph{loss shift}. A loss determines which information in $X$ is Bayes-relevant, and two losses may therefore require different representations even under the same joint law $P(X,Y)$. The idea is formalized using Bayes quotients, which allow losses to be ordered by refinement. In the Bayes-quotient formulation, strict refinement gives an immediate qualitative obstruction. A source-minimal representation for a coarser loss is insufficient for a strictly finer target loss. For finite-output log loss, this obstruction becomes an exact quantitative identity. The excess risk is the conditional information about $Y$ discarded by the representation. Experiments in controlled, learned, synthetic-image, and real-image settings show the predicted effect, i.e., classification-equivalent representations can have different optimal log-loss performance under a fixed data distribution.

57. Clipping Makes Distributed and Federated Asynchronous SGD Robust to Stragglers

AI summary: Proves theoretically that gradient clipping removes the dependence of asynchronous SGD's complexity on the maximum delay under a sub-Weibull gradient-noise model, and gives the first high-probability convergence result for asynchronous optimization.

Link: https://arxiv.org/abs/2606.13287

Institutions: KTH Royal Institute of Technology

Authors: Samuel Erickson, Mikael Johansson

Abstract: In modern machine learning, parallelization of training is an important strategy for increasing scale. Asynchronous stochastic gradient descent (ASGD) maximizes the utilization of available hardware by avoiding waiting for slow workers. However, with constant step sizes, the convergence of ASGD is nonetheless affected negatively by slow workers due to large delays in updates. At the same time, it has been empirically observed in asynchronous training of deep learning models that gradient clipping "stabilizes" training. In this work, we provide a theoretical justification for this behavior, as we show that clipping removes the dependence of the oracle complexity on the maximum delay. We employ a sub-Weibull model of gradient noise which generalizes sub-Gaussian and sub-exponential distributions to more heavy-tailed distributions, motivated by empirical observations in deep learning. We show convergence in expectation and, for the first time in asynchronous optimization, convergence with high probability.
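
The abstract does not state which clipping operator or step-size schedule the analysis uses. As a point of reference only, the sketch below shows the common norm-clipping operator, applied to a gradient before a plain SGD step. The function name, threshold, and learning rate are assumptions, and no asynchrony or delay is simulated.

```python
import numpy as np

def clip_by_norm(grad, threshold):
    """Rescale grad so its Euclidean norm is at most threshold.

    Gradients already inside the ball are returned unchanged; larger ones
    keep their direction but are shrunk onto the ball's surface.
    """
    norm = np.linalg.norm(grad)
    if norm <= threshold:
        return grad
    return grad * (threshold / norm)

weights = np.zeros(2)
stale_grad = np.array([3.0, 4.0])   # stands in for a late, possibly large gradient
lr = 0.1                            # illustrative learning rate
weights = weights - lr * clip_by_norm(stale_grad, threshold=1.0)
```

With clipping, the size of any single update is bounded by `lr * threshold` regardless of how large a delayed gradient turns out to be, which gives some intuition for why clipping could blunt the effect of stragglers.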

58. Learning with Simulators: No Regret in a Computationally Bounded World

AI summary: Proposes the framework of simulatable processes, in which a simulator approximates an arbitrarily complex, dependent data-generating distribution; recovers VC-dimension-based error bounds and shows statistical and computational advantages of conditional sampling.

Link: https://arxiv.org/abs/2606.13576

Institutions: MIT; Microsoft Research

Authors: Sasha Voitovych, Abhishek Shetty, Noah Golowich, Alexander Rakhlin

Abstract: Understanding the minimal assumptions necessary for generalization is the fundamental question in learning theory. Unfortunately, most results rely heavily on independence (or some proxy thereof) of the data-generating process, while results for strongly dependent data are far more limited. Towards addressing this gap, we introduce the framework of simulatable processes, where the learner has access to a simulator that approximates the distribution generating the data (which may be an arbitrarily complex and dependent process). Surprisingly, given access to such a simulator, we show that we can recover the same learning guarantees as in the classical setting with independent data, namely, error bounds that depend on the VC dimension. Further, we use this framework to study the power of conditional sampling and show strict statistical and computational advantages in this setting. As a highlight of our framework, we exhibit a single algorithm that simultaneously learns any given VC class under all processes samplable in bounded polynomial time, with regret controlled by the time-bounded Kolmogorov complexity of the process. This provides a significant conceptual broadening of the classical PAC model.

59. Limitations of Learning Tanh Neural Networks with Finite Precision

有限精度下学习Tanh神经网络的局限性

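AI summary: Under finite-precision computation and L^p accuracy guarantees, constructs sharply localized bump functions to prove that no adaptive randomized algorithm converges faster than the Monte Carlo rate O(m^{-1/p}) in the L^p norm unless the sampling budget grows exponentially with the network parameters and architecture.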

链接：https://arxiv.org/abs/2606.11104

作者：Philipp Grohs, Matěj Trödler

英文摘要：We investigate limitations of learning $\tanh$ neural networks from point evaluations under finite-precision computations and $L^p$ accuracy guarantees, building on Berner, Grohs, and Voigtländer (2023). Our approach is based on a novel construction of sharply localized bump functions via iterated $\tanh$ activations. Using this mechanism, we show that, in a finite-precision setting, no adaptive randomized algorithm based on $m$ samples can achieve a convergence rate higher than the Monte Carlo rate $O(m^{-1/p})$ in the $L^p$ norm, unless the sampling budget grows exponentially with the size of the network parameters and architecture. The results reveal fundamental limitations imposed by finite precision on the learnability of classes containing localized bump functions, extending previous results for ReLU networks to the $\tanh$ setting.

6. Efficient Learning, Compression, and Deployment | 8 papers

60. DynamicPTQ: Mitigating Activation Quantization Collapse via Residual-Stream Dynamics

DynamicPTQ: 通过残差流动态缓解激活量化崩溃

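AI summary: Proposes DynamicPTQ, which analyzes the phase-wise dynamics of activations in the residual stream to identify quantization-sensitive layers and assign them 8-bit precision; under W4A4KV4 quantization it improves perplexity and zero-shot QA performance on LLaMA-2/3, with a 1.05-1.07x throughput improvement.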

链接：https://arxiv.org/abs/2606.12487

机构：City University of Hong Kong（香港城市大学）； Zhejiang University of Technology（浙江工业大学）

作者：Zimo Zhao, Maolin Wang, Bowen Yu, Bowen Liu, Xiao Han, Xiangyu Zhao

英文摘要：Post-training quantization (PTQ) is essential for efficient large language model inference, but reliably quantizing activations remains challenging when weights, activations, and KV caches are all quantized to 4-bit precision. A key difficulty lies in massive activations, whose extreme values dominate the activation range and amplify quantization errors. State-of-the-art methods mainly mitigate massive activations through transformation-based smoothing, such as orthogonal rotations and affine scaling, but overlook the cross-layer dynamics of the residual stream. In this paper, we show that massive activations emerge and disappear in a phase-wise pattern across network depth, triggering large residual changes. These changes cause newly injected layer-wise updates to dominate the 4-bit quantization scale and weaken historical residual information. To characterize this behavior, we introduce Jump Ratio and Historical Feature SNR. This suggests that static transformation-based smoothing cannot fully resolve dynamic quantization instability caused by cross-layer residual changes. Based on this analysis, we propose DynamicPTQ, a Dynamic Post-Training Quantization policy for phase-aware mixed-precision activation quantization. DynamicPTQ identifies quantization-sensitive layers from residual-stream dynamics and assigns 8-bit activation precision only to these layers, while keeping weights, KV caches, and other activations in 4-bit precision. It can be directly integrated with strong PTQ baselines such as QuaRot, SpinQuant, and FlatQuant. Experiments on LLaMA-2 and LLaMA-3 show that DynamicPTQ consistently improves perplexity and zero-shot QA performance under W4A4KV4 quantization, while achieving 1.05 to 1.07 times throughput improvement with modest memory overhead. These results demonstrate a practical path toward robust low-bit LLM inference.
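
To make concrete why a single massive activation dominates the quantization range, here is a minimal sketch of symmetric per-tensor fake quantization. The helper names and the toy values are hypothetical; this is not the DynamicPTQ policy, only the basic effect that motivates giving sensitive layers 8-bit activations.

```python
def quantize_symmetric(values, bits):
    """Round values to a signed integer grid, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1          # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / qmax              # the largest magnitude sets the step size
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def mean_abs_error(values, bits):
    """Average absolute reconstruction error after fake quantization."""
    deq = quantize_symmetric(values, bits)
    return sum(abs(a - b) for a, b in zip(values, deq)) / len(values)

typical = [0.5, -0.3, 0.8, -0.6, 0.1, 0.4]
with_outlier = typical + [60.0]         # one "massive activation"

err_clean_4bit = mean_abs_error(typical, 4)
err_outlier_4bit = mean_abs_error(with_outlier, 4)
err_outlier_8bit = mean_abs_error(with_outlier, 8)
```

With the outlier present, the 4-bit step size becomes so coarse that every ordinary value rounds to zero, while 8 bits recover part of the lost resolution.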

61. M*: A Modular, Extensible, Serving System for Multimodal Models

M*: 一个模块化、可扩展的多模态模型服务系统

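AI summary: Proposes the M* system, which represents models as dataflow graphs and introduces the Walk Graph abstraction to serve composite multimodal models efficiently, lowering latency and raising throughput on several tasks.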

链接：https://arxiv.org/abs/2606.12688

机构：Stanford University（斯坦福大学）； University of Washington（华盛顿大学）； Carnegie Mellon University（卡内基梅隆大学）

作者：Atindra Jha, Naomi Sagan, Keisuke Kamahori, Irmak Sivgin, Rohan Sanda, Steven Gao, Mark Horowitz, Luke Zettlemoyer, Olivia Hsu, Jure Leskovec, Baris Kasikci, Stephanie Wang

英文摘要： We are entering a new era of composite model architectures that integrate diverse components such as vision encoders, language backbones, diffusion and flow heads, audio codecs, action generators, and world-model predictors. Such architectures underpin a broad class of multimodal models, including unified multimodal models, omni models, speech-language models, vision-language-action policies, and world models. However, existing model serving frameworks were built on narrow assumptions about model structure, making them ill-suited to accommodate this new architectural diversity. Here we present M*, a universal serving system for efficient serving of composite AI models. M* represents models as dataflow graphs, processing requests spanning diverse modalities and tasks as traversals over these graphs. The core insight is a modular abstraction that supports arbitrary composition of model components, flexible placement onto a physical cluster, and model-agnostic optimizations within a distributed runtime. We call this abstraction the Walk Graph and show how it can concisely capture composite models from a broad range of families. We instantiate M* on representative models and find that it achieves, on average, 20% lower end-to-end latency than vLLM-Omni for text-to-image workloads on BAGEL, while delivering up to 2.9x lower real-time factor and 2.7x higher throughput for text-to-speech workloads on Qwen3-Omni. M* also outperforms the V-JEPA 2-AC rollout baseline for robotic planning by up to 12.5x. Thus, our work paves the road towards more efficient serving of complex models with minimal developer effort.

62. Multi-Bitwidth Quantization for LLMs Using Additive Codebooks

使用加性码本的大语言模型多比特宽度量化

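AI summary: Proposes the Drop-by-Drop framework, grounded in information theory and successive refinement, which uses additive codebooks and Matryoshka-style supervision so that a single model supports multi-precision weight control at inference time, reducing storage overhead while maintaining performance.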

链接：https://arxiv.org/abs/2606.12876

机构：University of Toronto（多伦多大学）

作者：Liza Babaoglu, Shuangyi Chen, Ashish Khisti

英文摘要：As large language models (LLMs) are increasingly deployed across heterogeneous hardware with varying resource constraints, the ability to adaptively manage the trade-off between performance and efficiency without retraining is critical. We propose Drop-by-Drop, a novel multi-bitwidth post-training quantization framework that enables inference-time precision control over LLM weights from a single trained model. Our method is theoretically grounded in information theory and successive refinement. We establish that LLM weights, which commonly follow a Gaussian distribution, can be optimally reconstructed with increasing fidelity as additional bits are incorporated, under a weighted mean squared error distortion motivated by LLM loss functions. To realize this in practice, Drop-by-Drop incorporates Matryoshka-style supervision into the loss function, exploiting the structure of additive codebooks. Drop-by-Drop produces a single model where ordered subsets of codebooks yield accurate partial reconstructions at each precision level. This approach significantly reduces storage and memory overhead by allowing a single checkpoint to serve multiple bitwidths, while maintaining competitive perplexity and accuracy across major architectures, such as Qwen, LLaMA, Gemma, and Mistral.
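
The idea that ordered subsets of additive codebooks give progressively finer reconstructions can be sketched with greedy residual quantization of a single scalar. The codebooks, function names, and greedy encoder below are illustrative assumptions; the sketch does not include the Matryoshka-style training or the weighted distortion described in the abstract.

```python
def nearest(codebook, x):
    """Return the codeword closest to x."""
    return min(codebook, key=lambda c: abs(c - x))

def encode(x, codebooks):
    """Greedy residual encoding: each stage quantizes what earlier stages missed."""
    codes, residual = [], x
    for codebook in codebooks:
        c = nearest(codebook, residual)
        codes.append(c)
        residual -= c
    return codes

def decode(codes, num_stages):
    """Reconstruct from the first num_stages codewords only."""
    return sum(codes[:num_stages])

# Three 2-bit codebooks; each stage refines the previous one by a factor of 4.
codebooks = [
    [-1.5, -0.5, 0.5, 1.5],
    [-0.375, -0.125, 0.125, 0.375],
    [-0.09375, -0.03125, 0.03125, 0.09375],
]
codes = encode(0.8, codebooks)
reconstructions = [decode(codes, k) for k in (1, 2, 3)]
```

One stored code sequence serves three precisions: dropping trailing codewords lowers the bitwidth, and for inputs in [-2, 2] the worst-case error shrinks by a factor of four per retained stage.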

63. TWLA: Achieving Ternary Weights and Low-Bit Activations for LLMs via Post-Training Quantization

TWLA：通过训练后量化实现大语言模型的三值权重和低位激活

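AI summary: Proposes the TWLA framework, which uses post-training quantization to reach 1.58-bit weights and 4-bit activations, handling heavy-tailed activation distributions and accelerating inference.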

链接：https://arxiv.org/abs/2606.13054

作者：Zhixiong Zhao, Zukang Xu, Zhixuan Chen, Xing Hu, Zhe Jiang, Dawei Yang

英文摘要：Large language models (LLMs) exhibit exceptional general language processing capabilities, but their memory and compute costs hinder deployment. Ternarization has emerged as a promising compression technique, offering significant reductions in model size and inference complexity. However, existing methods struggle with heavy-tailed activation distributions and therefore keep activations in high precision, fundamentally limiting end-to-end inference acceleration. To overcome this limitation, we propose TWLA, a post-training quantization (PTQ) framework that achieves 1.58-bit weight compression and 4-bit activation quantization while maintaining high accuracy. TWLA comprises three components: (1) Euclidean-to-Manifold Asymmetric Ternary Quantizer (E2M-ATQ) minimizes layer-output error under weight ternarization via a two-stage optimization from Euclidean initialization to manifold relocation; (2) Kronecker Orthogonal Tri-Modal Shaping (KOTMS) applies a Kronecker-structured orthogonal rotation to reshape weights into ternary-friendly tri-modal distributions, while the shared rotation statistically suppresses activation outliers; and (3) Inter-Layer Aware Activation Mixed Precision (ILA-AMP) explicitly introduces adjacent-layer second-order interaction costs in bit allocation and jointly optimizes for the layer-wise disparity of activation quantization gains induced by the shared orthogonal transform, preventing cascades triggered by a few weak layers. Extensive experiments demonstrate that TWLA maintains high accuracy under W1.58A4, while delivering significant inference acceleration. The code is available at .
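
For readers unfamiliar with the "1.58-bit" figure: a ternary weight takes one of three values, so it carries log2(3) ≈ 1.58 bits. The sketch below uses simple absmean scaling, a common ternarization baseline, purely as an illustration; it is not TWLA's E2M-ATQ quantizer, and the function name is hypothetical.

```python
import math

def ternarize(weights):
    """Map weights to codes in {-1, 0, +1} plus one shared scale (absmean baseline)."""
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0.0:
        return [0] * len(weights), 0.0
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale

# Information content of one ternary symbol.
bits_per_weight = math.log2(3)

codes, scale = ternarize([0.9, -0.1, -1.2, 0.05])
dequantized = [c * scale for c in codes]
```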

64. MiniPIC: Flexible Position-Independent Caching in <100LOC

MiniPIC: 少于100行代码的灵活位置无关缓存

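AI summary: Proposes MiniPIC, which uses a positional-encoding-free KV cache and user-controlled cache-reuse primitives to realize multiple position-independent caching methods in vLLM, markedly improving prefill throughput and reducing time-to-first-token.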

链接：https://arxiv.org/abs/2606.13126

机构：IBM Research（IBM研究院）

作者：Nathan Ordonez, Thomas Parnell

英文摘要：Retrieval-augmented and agentic workloads repeatedly prefill recurring predictable structured inputs (which we call "spans") such as documents and code files. Yet, prefix caching in engines such as vLLM cannot reuse their KV entries unless they share identical prefixes with another request, while Position-Independent Caching (PIC) implementations within production-grade inference servers typically either require substantial server code changes or keep KV state outside the server, incurring host-to-device transfer overhead. We present Minimalistic PIC (MiniPIC): a minimal, flexible and fast vLLM design built from two ingredients: positional-encoding-free KV cache and user-controlled cache-reuse primitives. MiniPIC stores unrotated K vectors in the KV cache, applies RoPE to K tiles inside attention using per-request logical positions, and exposes three user-facing and token-level primitives: block-aligned padding, span separator (SSep), and prompt depend (PDep), that modify hashing behavior and effective block-level causal attention structure. With fewer than 100 lines of core-engine changes plus a custom attention backend, these primitives are sufficient to realize multiple PIC methods, including Block-Attention, EPIC, and Prompt Cache, within the same running vLLM instance, while natively integrating with KV cache CPU offload implementations. On 2WikiMultihopQA, MiniPIC with interleaved scheduling improves prefill throughput by 49% over baseline vLLM, reduces cached-span time-to-first-token by up to two orders of magnitude, preserves the linear prefill scaling of uncached spans, and incurs only 5.7% worst-case overhead.
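
Why storing unrotated K vectors enables position-independent reuse can be seen from a small RoPE sketch. The vectors and function names are hypothetical and this is not MiniPIC's attention backend; it only shows that a cached, unrotated key can be rotated at whatever logical position a request assigns, and that the attention score depends only on the query-key offset.

```python
import math

def rope(vec, position, base=10000.0):
    """Standard RoPE: rotate consecutive pairs of vec by position-dependent angles."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

cached_k = [0.3, -1.2, 0.7, 0.5]   # stored once, without any rotation
q = [1.0, 0.2, -0.4, 0.9]

# The same cached key placed at two different logical positions,
# with the query kept 3 tokens later in both cases.
score_a = dot(rope(q, 10), rope(cached_k, 7))
score_b = dot(rope(q, 110), rope(cached_k, 107))
```

Had the key been cached after rotation at position 7, reusing it at position 107 would require undoing and redoing the rotation; keeping it unrotated defers that choice to attention time.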

65. ReSET: Accurate Latency-Critical NVFP4 Reasoning via Step-Aware Temperature Scaling

ReSET: 通过步骤感知温度缩放实现精确的延迟关键型NVFP4推理

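AI summary: To address accuracy loss and latency in NVFP4 low-precision inference for large reasoning models, proposes ReSET, a temperature-scaling method based on reasoning-step entropy, and designs a CUDA-core small-M kernel; accuracy improves by up to about 2 points across several benchmarks, with roughly 2x faster decoding.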

链接：https://arxiv.org/abs/2606.13233

机构：Hanyang University（汉阳大学）； Xenoscube Korean Inc.（Xenoscube韩国公司）

作者：Sihwa Lee, Janghwan Lee, Donghoon Yoo, Jae Gon Kim, Hanyul Ryu, Soojung Ryu, Jungwook Choi

英文摘要：Large reasoning models (LRMs) improve complex problem-solving by generating long intermediate reasoning traces, but this substantially increases inference costs. NVFP4 inference offers a promising approach to reduce both computational and memory costs through hardware-supported low-precision execution. However, directly applying NVFP4 to LRMs introduces two practical limitations: reasoning accuracy degrades under quantization, and existing NVFP4 kernels do not fully realize latency benefits in small-batch autoregressive decoding. In this work, we analyze the effect of NVFP4 quantization on token-level uncertainty during reasoning. We show that quantization increases incorrect sampling at low-entropy symbolic tokens, while causing over-concentration on a small set of tokens in high-uncertainty reasoning steps. Based on this observation, we propose ReSET, a reasoning-step entropy-based temperature-scaling method that estimates step-level uncertainty online and adapts the decoding temperature using both token-level and step-level entropy signals. To address the latency gap, we further design a CUDA-core small-$M$ NVFP4 kernel for latency-critical autoregressive decoding. Across reasoning benchmarks and model scales, ReSET improves NVFP4 reasoning accuracy by up to ~2 points over the NVFP4 baseline. Our CUDA-core small-$M$ kernel further improves latency-critical decoding, delivering up to $2.5\times$ kernel-level speedup over NVFP4 vLLM and approximately $2\times$ end-to-end decoding speedup over BF16. Code is available at https://github.com/aiha-lab/ReSET.
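
The general shape of entropy-driven temperature scaling can be sketched as follows. The thresholds, temperatures, and the function `adaptive_temperature` are hypothetical placeholders chosen for illustration; the abstract does not give ReSET's actual rule, which also uses step-level entropy estimated online.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def adaptive_temperature(logits, low=0.5, high=1.5, t_sharp=0.7, t_flat=1.2):
    """Assumed rule: sharpen confident tokens, flatten highly uncertain ones."""
    h = entropy(softmax(logits))
    if h < low:
        return t_sharp   # low-entropy token: reduce the chance of a wrong sample
    if h > high:
        return t_flat    # high-entropy step: counter over-concentration
    return 1.0

confident = [8.0, 0.0, 0.0, 0.0]
uncertain = [0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
t_confident = adaptive_temperature(confident)
t_uncertain = adaptive_temperature(uncertain)
```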

66. Quantizing Time-Series Models As Dynamical Systems: Trajectory-Based Quantization Sensitivity Score

将时间序列模型量化为动力系统：基于轨迹的量化敏感度评分

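AI summary: Proposes the Trajectory-based Quantization Sensitivity Score (TQS), which analyzes the propagation of quantization error from a dynamical-systems stability perspective and enables mixed-precision quantization without calibration data.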

链接：https://arxiv.org/abs/2606.13300

作者：Mariya Pavlova, Harrison Bo Hua Zhu, Elizaveta Semenova, Yingzhen Li

英文摘要：We introduce the Trajectory-based Quantization Sensitivity Score (TQS), a metric that reframes post-training quantization (PTQ) through the lens of dynamical-systems stability. By modeling the network's rollout as a discrete-time dynamical system, TQS characterizes how quantization-induced errors propagate and amplify over the rollout horizon. Unlike conventional PTQ methods, where sensitivity analysis is often coupled to the quantization procedure, TQS enables a priori sensitivity estimation decoupled from quantizer selection and bit-width assignment. This separation allows for quantization budget planning even for black-box or compiled networks with fused operators. Building on this, we present TQS-PTQ, a flexible mixed-precision framework that requires no calibration data or costly second-order approximations. Our experiments show that a dynamical-systems perspective provides a robust, high-performing pathway for low-precision deployment in resource-constrained settings.

67. Positional Encoding in the Context of Memristor-Based Analog Computation for Automatic Speech Recognition

基于忆阻器的模拟计算在自动语音识别中的位置编码

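AI summary: Addresses the loss of analog-to-digital conversion precision caused by positional encodings in memristor-based analog computation; adjusting the proportion of ADC weight and precision bits, or removing encoding-related linear transformations, reduces the degradation by about 50% and 30% relative, respectively.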

链接：https://arxiv.org/abs/2606.13379

机构：Machine Learning and Human Language Technology Group, Faculty of Computer Science, RWTH Aachen University（亚琛工业大学计算机科学学院机器学习和人类语言技术组）； Apptek GmbH（Apptek 有限公司）

作者：Benedikt Hilmes, Nick Rossenbach, Ralf Schlüter

英文摘要：Memristors offer a new opportunity for resource-efficient computation of neural models for natural language processing by enabling analog execution of vector-matrix multiplication. Yet computations on these devices are currently subject to larger distortion, both in weight programming and in execution. In this work, we identify large output values of transformed positional encodings as a cause of major degradation within analog-to-digital conversion (ADC) as part of memristor-based computation. By adjusting the proportion of weight and precision bits of the ADC of specific memristor layers, we reduce the degradation of the execution by ~50% relative, while keeping the estimated energy consumption stable. Additionally, we investigate scenarios where the ADC cannot be modified. In that case, the degradation can be reduced by ~30% relative after removing encoding-related linear transformations.

7. Federated Learning, Privacy, and Security | 2 papers

68. Fed-FBD: Federated Functional Block Diversification for Isolation, Privacy, and Surgical Unlearning

Fed-FBD：用于隔离、隐私和精准遗忘的联邦功能块多样化

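AI summary: Proposes Fed-FBD, a modular federated architecture that decomposes a ResNet into six functional blocks and maintains a warehouse of color variants, providing block-level isolation, privacy by design, and sub-second surgical unlearning, trading a small accuracy cost for these guarantees on several datasets.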

链接：https://arxiv.org/abs/2606.12679

机构：University of Wisconsin–Madison（威斯康星大学麦迪逊分校）

作者：Weijie Chen, Alan B. McMillan

英文摘要：Federated learning (FL) enables collaborative model training without sharing raw patient data, but standard approaches such as FedAvg treat each client as a black box and provide no mechanism for isolating an adversarial contributor, auditing per-client influence, or honoring a departed participant's right to be forgotten. We present Fed-FBD (Federated Functional Block Diversification), a modular federated architecture that decomposes a ResNet backbone into six functional blocks (the stem, four residual groups, and the classification head) and maintains a warehouse of N color variants, each assembled from independently tracked and contributor-stamped blocks. Fed-FBD provides three capabilities absent in FedAvg: (i) architecturally guaranteed block-level isolation, so that an adversarial or mislabelled client cannot contaminate the clean colors; (ii) privacy-by-design, where membership inference advantage is already indistinguishable from chance before any privacy mechanism is applied; and (iii) surgical machine unlearning of a departed participant's contribution at sub-second cost and without retraining. Experiments on six MedMNIST-2D datasets, PathMNIST at 224x224, and CIFAR-10 show that Fed-FBD trades a modest 0.3%-3.1% IID accuracy gap on the adequately sized datasets for these guarantees, remains within 0.8%-4.0% of FedAvg at Dirichlet alpha=1.0 on three of four datasets, and confines all six adversarial attacks we study to the poisoned client's own blocks with at most +/-0.01 AUC drift on the clean colors.

69. Let's Ask Gauss: Improved One-Run Privacy Auditing

让我们问高斯：改进的单次运行隐私审计

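AI summary: Proposes a differential-privacy auditing framework based on an asymptotically Gaussian distribution: in white-box DP-SGD, the normalized sum of canary-aligned signals yields tighter privacy lower bounds from a single training run.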

链接：https://arxiv.org/abs/2606.12733

机构：Georgia Institute of Technology（佐治亚理工学院）； Rensselaer Polytechnic Institute（伦斯勒理工学院）； Purdue University（普渡大学）

作者：Adya Agrawal, Yu Wei, Jaspal Singh, Malik Magdon-Ismail, Vassilis Zikas

英文摘要：Privacy auditing provides an important safeguard by estimating the actual information leaked by a model, thus ensuring that theoretical privacy guarantees hold in practice. We study empirical privacy auditing for differentially private (DP) machine learning, focusing on efficient one-run methods for mechanisms such as DP-SGD. Prior one-run approaches threshold training examples or "canaries" into binary membership guesses, which discards useful information. We show that, in the white-box DP-SGD setting, canary-aligned signals naturally form a sequence of random variables whose normalized sum is asymptotically Gaussian. Leveraging this distributional perspective, we develop a DP-auditing framework that leads to tighter privacy lower bounds from a single training run.

8. Robustness, Uncertainty, and Trustworthy Learning | 8 papers

70. Robustness Verification of Recurrent Neural Networks with Abstraction Refinement

基于抽象精化的循环神经网络鲁棒性验证

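AI summary: Proposes an abstraction-refinement framework that partitions pre-activation intervals to remove nonlinear relaxation error and uses a SHAP-guided timestep selection strategy to limit combinatorial cost, improving the verification success rate for RNN robustness.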

链接：https://arxiv.org/abs/2606.12490

机构：National Science and Technology Council (NSTC), Taiwan（台湾国家科学与技术委员会）

作者：Li-Jen Lin, Chih-Duo Hong

英文摘要：Certified local robustness verification for recurrent neural networks (RNNs) is challenging because approximation errors introduced by nonlinear relaxations can propagate through recurrent connections and accumulate over time. As a result, scalable linear bound propagation methods often become overly conservative and fail to certify inputs that are in fact robust, especially when many pre-activation intervals cross zero. We propose an abstraction-refinement framework for RNN verification that partitions such intervals to remove the dominant relaxation error: on each refined branch, ReLU becomes exact, and smooth activations such as tanh and sigmoid admit substantially tighter linear envelopes. To control the combinatorial cost of splitting in long sequences, we introduce a SHAP-guided timestep selection strategy that ranks hidden states by their contribution to the verification objective and refines only the most critical timesteps in temporal order. Experiments on CIFAR10 and MNIST stroke benchmarks demonstrate consistent improvements in verification success and robustness-margin tightness over abstraction-only baselines, while exposing clear runtime trade-offs between ReLU and tanh models.
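
The claim that ReLU becomes exact on each refined branch can be checked with a small sketch of the standard triangle relaxation. The function names are hypothetical and the sketch covers a single neuron only; it does not implement bound propagation through time or the SHAP-guided selection.

```python
def relu_relaxation_gap(l, u):
    """Largest gap between ReLU and its linear upper bound on [l, u].

    For l < 0 < u the upper bound is the chord u * (x - l) / (u - l),
    and the gap peaks at x = 0 with value -u * l / (u - l).
    If the interval does not cross zero, ReLU is linear and the gap is 0.
    """
    if l >= 0.0 or u <= 0.0:
        return 0.0
    return -u * l / (u - l)

def split_at_zero(l, u):
    """Refine a zero-crossing interval into two branches."""
    if l < 0.0 < u:
        return [(l, 0.0), (0.0, u)]
    return [(l, u)]

gap_before = relu_relaxation_gap(-1.0, 1.0)
gaps_after = [relu_relaxation_gap(a, b) for a, b in split_at_zero(-1.0, 1.0)]
```

Each split doubles the number of branches to verify, which is why the choice of which timesteps to refine matters for long sequences.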

71. Policy-driven Conformal Prediction for Trustworthy QoT Estimation

策略驱动的可信QoT估计的保形预测

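AI summary: Proposes the Conformal QoT framework, which combines statistically guaranteed QoT estimation with operational decision policies for reliable lightpath-feasibility prediction under domain shift, raising accuracy from 92% to 99.6% on open datasets.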

链接：https://arxiv.org/abs/2606.12501

机构：Chalmers University of Technology（查尔姆斯理工大学）； University of Applied Sciences and Arts of Southern Switzerland（瑞士南方应用科学与艺术大学）

作者：Kiarash Rezaei, Omran Ayoub, Paolo Monti, Carlos Natalino

英文摘要：We propose Conformal QoT, a policy-driven framework that combines statistically guaranteed QoT estimation with operational decision policies, enabling reliable lightpath-feasibility predictions under domain shift and improving accuracy from 92% to 99.6% on open datasets.
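
As background for the "statistically guaranteed" estimate, here is a minimal sketch of generic split conformal prediction for a regression output. The calibration scores and function names are hypothetical, and the paper's decision policies and handling of domain shift are not represented.

```python
import math

def conformal_quantile(scores, alpha):
    """Finite-sample quantile of calibration scores for coverage 1 - alpha."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf          # too few calibration points for this alpha
    return sorted(scores)[rank - 1]

def prediction_interval(point_prediction, q):
    """Symmetric interval around a point prediction."""
    return (point_prediction - q, point_prediction + q)

# Absolute residuals |y - prediction| on a held-out calibration set (toy values).
calibration_scores = [0.5, 0.1, 0.9, 0.3, 0.7, 0.2, 0.4]
q = conformal_quantile(calibration_scores, alpha=0.25)
interval = prediction_interval(12.0, q)
```

A feasibility decision could then compare the lower end of the interval with a required threshold rather than the point estimate alone, which is one way such a guarantee can feed a decision policy.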

72. Towards Provably Fair Machine Learning: Bayesian Approaches For Consistent and Transparent Predictions

迈向可证明公平的机器学习：用于一致和透明预测的贝叶斯方法

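AI summary: Proposes the Fair Bayesian classifier, which enforces determinism and statistical consistency to achieve zero consistency error on several datasets while preserving accuracy and multicalibration, addressing inconsistent predictions for minority groups caused by regularisation.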

链接：https://arxiv.org/abs/2606.12615

机构：University College Dublin（都柏林大学学院）

作者：Owen O'Neill, Fintan Costello

英文摘要：ML classifiers deployed in high-stakes domains produce predictions whose quality varies systematically across subgroups. For granular subgroups defined by intersections of multiple features, predictions are often inconsistent with the observed data: the model's outputs contradict the evidence available for that subgroup. This problem is exacerbated by regularisation, which improves aggregate performance by collapsing small subgroups into larger groups, disproportionately affecting demographic minorities. We define two requirements for consistent prediction: determinism (identical individuals receive identical predictions) and statistical consistency (we cannot reject, at significance level alpha, the hypothesis that the predictions for a subgroup were drawn from the Bayesian optimal target distribution inferred for that subgroup). From these requirements we derive the Fair Bayesian classifier, which enforces both across every group and subgroup simultaneously and abstains whenever no consistent deterministic prediction is possible. On three benchmark datasets (Adult, COMPAS, and Bank Marketing), standard classifiers produce statistically inconsistent predictions for a substantial proportion of subgroups. Our classifier achieves zero consistency error by construction while exceeding baseline accuracy and multicalibration on every dataset tested. Statistical consistency provides a principled foundation for prediction quality with direct implications for algorithmic fairness. Minority demographics are disproportionately concentrated in small subgroups, precisely where frequentist inference is least reliable; addressing this inference problem is therefore a necessary step toward fair ML. By enforcing Bayesian consistency at the finest resolution the data supports, our classifier demonstrates that exhaustive subgroup fairness with principled abstention is achievable in practice.

73. Normative Robustness as a Frontier for Non-Verifiable Reasoning in LLMs

规范性鲁棒性作为LLM中不可验证推理的前沿

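AI summary: Proposes moral reasoning as a paradigmatic subdomain of non-verifiable reasoning, defines moral robustness, and introduces a scalable multi-turn adversarial evaluation framework; finds that models shift their reasoning toward the user's preferred view (by up to 6.5% on average) and are affected by premise order and conversation duration.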

链接：https://arxiv.org/abs/2606.12731

机构：DeepMind； Institute of Philosophy, School of Advanced Study, University of London（伦敦大学高等研究院哲学研究所）； Technische Universität Berlin（柏林工业大学）

作者：Elizaveta Tennant, Benjamin Henke, Anita Keshmirian, Murray Shanahan, Verena Rieser, Kristian Lum, Sydney Levine, Julia Haas

英文摘要：As LLMs increasingly serve in advisory and deliberative roles, users rely on them for non-verifiable reasoning in domains lacking objective ground truths. However, traditional evaluations of LLM reasoning focus almost exclusively on fact-based domains, such as mathematics and science, leaving uncertainty over whether and to what degree models can handle ambiguous, subjective, or value-laden problems over time. To address this concern, we propose moral reasoning as a paradigmatic subdomain of non-verifiable reasoning. We define moral robustness as a model's capacity to exhibit sound moral reasoning across time and contexts, and we introduce a scalable, adversarial, multi-turn evaluation framework to empirically measure this capability. We simulate 48,000 user-agent moral deliberations across four frontier LLMs, varying premise relevance, premise order, conversation duration, and the user's stated moral view. We find that models successfully ignore morally irrelevant distractors, but shift their reasoning by up to 6.5%, on average, towards the user's stated preferred moral view, and vary their reasoning depending on factors such as order (altering moral judgments by order in 13-22% of the cases) and duration (altering moral judgments between single-turn and multi-turn in 10-24% of the cases). Our analysis indicates that models tailor not just their final verdicts but their underlying justifications to align with a user's moral viewpoint, a failure mode we characterize as moral deliberative sycophancy.

74. PolicyGuard: Towards Test-time and Step-level Adversary Defense for Reinforcement Learning Agent

PolicyGuard：面向强化学习智能体的测试时和步级对抗防御

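AI summary: Proposes PolicyGuard, a test-time, step-level backdoor defense based on Gaussian Process posterior variance that adapts pseudo trajectories to compute per-step uncertainty, reaching average AUROC of 0.856 (perturbation-based attacks) and 0.859 (adversary-agent attacks) across seven RL games.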

链接：https://arxiv.org/abs/2606.12896

作者：Junfeng Guo, Heng Huang

英文摘要：While real-world applications of reinforcement learning (RL) are becoming increasingly popular, the security of RL systems deserves more attention and exploration. In particular, recent work has revealed that RL agents are vulnerable to backdoor attacks, where a victim agent behaves normally under standard conditions but executes malicious actions when a specific trigger is activated. Existing backdoor defenses for RL either require access to the agent's internal parameters, operate only at the model or trajectory level, or are limited to specific attack types. To ensure the security of RL agents, we propose PolicyGuard, a test-time, step-level backdoor defense that leverages Gaussian Process (GP) posterior variance and adapts pseudo trajectories to enable uncertainty computation for individual time steps. We also provide theoretical foundations to explain the efficacy of GP posterior variance. Extensive experiments across seven RL games demonstrate that PolicyGuard achieves state-of-the-art detection performance in most cases, with average AUROC of 0.856 for perturbation-based attacks and 0.859 for adversary-agent attacks.

75. Detecting Explanatory Insufficiency in Learned Representations: A Framework for Representational Vigilance

检测学习表示中的解释不充分性：表示警觉性框架

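AI summary: Proposes the VER framework, which monitors the adequacy of learned representations by identifying persistent residual structures, complementing conventional evaluation methods.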

链接：https://arxiv.org/abs/2606.13172

机构：Laboratory of Bioengineering and Nanosciences (LBN), University of Montpellier（蒙彼利埃大学生物工程与纳米科学实验室）； EuroMov Digital Health in Motion, University of Montpellier, IMT Mines Alès（蒙彼利埃大学EuroMov数字健康运动实验室，IMT阿莱斯矿业学院）； Certified Sophrologist, Sensorimotor Practice（认证心理放松治疗师，感觉运动实践）； Emeritus Professor, University of Montpellier（蒙彼利埃大学名誉教授）

作者：Jacques Raynal, Pierre Slangen, Elsa Raynal, Jacques Margerit

英文摘要： Learned representations are central to modern machine learning and are commonly evaluated through predictive performance, robustness, uncertainty estimation, or generalization. However, a learned representation may remain operationally successful while progressively failing to organize persistent residual structures that are not fully captured by conventional evaluation metrics. This article introduces VER, the Vigilant Evaluator of Representations, a conceptual framework for monitoring representational adequacy in learned representations. VER does not propose a new learning algorithm, loss function, or model architecture. Instead, it formalizes a diagnostic process through which persistent residual structures may be identified, analyzed, and interpreted as potential indicators of explanatory insufficiency. The framework distinguishes representational inadequacy from ordinary prediction error, uncertainty, noise, and distribution shift. It introduces a monitoring sequence based on representation identification, explanatory-domain delimitation, residual-structure detection, explanatory-resistance evaluation, and vigilance signaling. VER is intended as a contribution to representation diagnostics in machine learning. Its objective is not to replace existing evaluation methods but to complement them by treating representational adequacy as an explicit object of inquiry. A path toward empirical evaluation through representational-vigilance benchmarks is also outlined.

76. Understanding helpfulness and harmless tension in reward models

理解奖励模型中的有用性与无害性张力

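AI summary: Through activation analysis and ablation experiments, finds interference between the helpfulness and harmlessness objectives in reward models; shared neurons exert a disproportionate influence on model behaviour, contributing to alignment tension.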

链接：https://arxiv.org/abs/2606.13209

机构：University of Copenhagen（哥本哈根大学）

作者：Eshaan Tanwar, Pepa Atanasova

英文摘要：Reward models are a key component of reinforcement learning from human feedback (RLHF), aligning language models toward both helpful and harmless behaviour. However, the internal mechanisms underlying these objectives and their conflicts remain poorly understood. We study alignment tension in reward models trained under helpfulness-only, harmlessness-only, and mixed-objective settings. We find that mixed-objective models often underperform single-objective models, indicating interference between objectives. Using activation-based methods, we identify neurons associated with each objective and study their functional roles via targeted ablations. We find that these neurons causally support their corresponding objectives while often negatively affecting the opposing one. We find that a substantial proportion of neurons are shared between helpfulness and harmlessness, and that these shared neurons exert a disproportionate influence on model behaviour, contributing to alignment tension. Additionally, our results provide insights and mechanistic interpretation into how alignment objectives are represented in reward models and why multi-objective alignment remains challenging, motivating future work on disentangled and controllable alignment methods.

77. Uncertainty Estimation for Molecular Diffusion Models

分子扩散模型的不确定性估计

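AI summary: Proposes a post-hoc method that uses a Laplace approximation of the denoising network to estimate per-sample uncertainty in pretrained molecular diffusion models; the score is negatively correlated with sample quality and can be used to filter generated samples.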

链接：https://arxiv.org/abs/2606.13451

作者：Paul Seij, Christian A. Naesseth, Stephan Mandt, Metod Jazbec

英文摘要：Diffusion models have seen wide adoption for 3D molecular generation, yet they offer no principled signal of when a generated molecule is likely to be of low quality. We propose a post-hoc method for estimating per-sample uncertainty in pretrained molecular diffusion models. Building on a Laplace approximation of the denoising network, we measure the variability of the noise prediction across the generation trajectory. Empirically, we show that the resulting uncertainty score is informative of sample quality, exhibiting a negative correlation with established sample-level quality metrics. We further study how the proposed uncertainty score can be used to filter generated samples, improving model performance via test-time scaling.

9. Graph Learning and Structured Data | 6 papers

78. Physics-Aware Auxiliary Losses Improve Out-of-Distribution Generalization of a GNN Synthesizability Filter

物理感知辅助损失提升图神经网络可合成性滤波器的分布外泛化能力

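AI summary: Adds two auxiliary losses to a GNN, a topological complexity regression based on the Bertz index and a soft strain-energy penalty based on the MMFF94 force field, yielding a small but statistically significant out-of-distribution AUC improvement for the synthesizability filter (up to +0.0066).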

链接：https://arxiv.org/abs/2606.12651

作者：Riya Bisht, Dhruv Agarwal

英文摘要：Machine-learning drug-discovery pipelines increasingly rely on generative models that propose molecules far from the data used to train downstream synthesizability filters. Existing filters (SAScore, SCScore, RAscore, DeepSA) are purely statistical and degrade in exactly this out-of-distribution (OOD) regime. We ask whether cheap, closed-form physical priors, used as auxiliary supervision on a graph neural network (GNN), improve OOD generalization. We add two auxiliary losses to a GINE backbone: a topological complexity regression supervised by the Bertz index, and a strain-energy soft penalty supervised by MMFF94 force-field energy. On a 65,177-molecule corpus (HIV, Tox21, COCONUT) labeled by SAScore thresholds we reproduce a strong in-distribution baseline, then evaluate a 4-way ablation (baseline / +complexity / +strain / +both) on a single-source OOD split (train on drug-like HIV+Tox21, test on COCONUT natural products), repeated over 5 seeds with paired bootstrap confidence intervals. All three physics-aware variants give a small but statistically significant OOD improvement over the baseline (mean OOD AUC 0.9774): +complexity Delta = +0.0060 (95% CI [+0.0023, +0.0102]), +strain Delta = +0.0032 ([+0.0008, +0.0052]), +both Delta = +0.0066 ([+0.0038, +0.0093]); every interval excludes zero, and the combination is best. The variants are indistinguishable in-distribution, so the effect is visible only under OOD evaluation. We are explicit that the effects are modest, and we report a cautionary methodological finding: a single-seed version of this experiment produced a qualitatively different (non-monotone) story that did not survive multi-seed evaluation.

79. A Zero-shot Generalized Graph Anomaly Detection Framework via Node Reconstruction

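AI summary: Proposes AlignGAD, a framework for zero-shot cross-domain graph anomaly detection with three parts: a Global Unification Module that aligns heterogeneous features, a Clustering Module that captures group-level anomaly patterns, and a Node Discrepancy Scoring Module that aggregates anomaly evidence across multiple views.

Link: https://arxiv.org/abs/2606.12673

Institution: School of Computing, KAIST

Authors: Phan Nguyen, Dat Cao, Hien Chu, Khue Hoang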

Abstract: Cross-domain graph anomaly detection (GAD) aims to identify abnormal nodes in unseen target graphs, showing strong potential in real-world applications with heterogeneous graph data. However, existing methods often depend on dataset-specific feature semantics and structural patterns, which limits their ability to generalize across different domains. To address this challenge, we propose AlignGAD, a zero-shot generalized graph anomaly detection framework. Our framework is built upon three key components: a Global Unification Module that aligns heterogeneous node features and normalizes graph signals in the spectral domain; a Clustering Module that constructs cluster-aware graph views to capture group-level abnormal patterns; and a Node Discrepancy Scoring Module that measures reconstruction discrepancy and aggregates anomaly evidence from different graph views. Experiments on multiple real-world datasets demonstrate the effectiveness of AlignGAD under the zero-shot GAD setting.

80. Multimodal Graph Negative Learning

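AI summary: Proposes GraphMNL, a framework that uses negative learning to address node-level branch semantic imbalance in multimodal attributed graphs and to keep the bias of a dominant branch from propagating, achieving the best performance on the Grocery and Reddit M datasets.

Link: https://arxiv.org/abs/2606.12863

Authors: Zhengyu Wu, Xu Wang, Hongchao Qin, Xunkai Li, Guang Zeng, Rong-Hua Li, Guoren Wang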

Abstract: Multimodal attributed graphs (MAGs) integrate graph topology with heterogeneous modality attributes, such as text and images, thereby enabling richer modeling of complex relational systems. However, such expressiveness also makes learning on MAGs depend on multiple semantic sources, including structural topology, textual and visual attributes, each of which can be regarded as a branch for node representation. Node-level branch semantic imbalance arises when these branches differ across nodes in semantic informativeness and reliability: a branch that provides discriminative semantics for one node may mislead another due to bias in modality quality or structural context. Existing methods often mitigate such heterogeneity through cross-branch agreement or alignment, implicitly treating the dominant prediction as reliable supervision. When the dominant branch is biased, forced imitation may propagate its bias to other branches and suppress original semantics that are useful for classification. We propose GraphMNL, a graph-aware multimodal negative learning framework that addresses this issue by using Negative Learning as cross-branch guidance. Instead of forcing inferior branches to imitate a teacher prediction, the model teaches them which classes a node is unlikely to belong to. GraphMNL builds a branch library, identifies dominant and inferior branches via graph-aware reliability arbitration, gates unstable transfer, and applies target-preserving negative learning over non-target classes. This design decouples target supervision from branch guidance so that supervised losses learn the correct class, while Negative Learning suppresses unlikely alternatives when branch agreement is unreliable. In a comprehensive experimental evaluation, GraphMNL achieves the best performance, with 72.47% accuracy on the Grocery datasets and a 76.60 F1 score on the Reddit M datasets.

81. SMGFM: Spectral Multimodal Graph Pretraining for Multimodal-Attributed Graphs

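AI summary: Proposes SMGFM, a framework that uses graph spectral decomposition to separate structure-induced semantics from modality-intrinsic semantics and performs cross-modal fusion through frequency-band routing, achieving state-of-the-art performance on graph-level and modality-level tasks.

Link: https://arxiv.org/abs/2606.12867

Authors: Zhengyu Wu, Xu Wang, Hongchao Qin, Xunkai Li, Guang Zeng, Rong-Hua Li, Guoren Wang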

Abstract: Multimodal-attributed graphs (MAGs) couple graph topology with node semantics from text, images, and other modalities. Traditional graph learning contextualizes node semantics by coupling topology with node features. However, this coupling design becomes troublesome in MAGs, where structure-induced and modality-intrinsic semantics may contribute differently to downstream tasks. Structure-induced semantics promote relational consistency through smooth topological variation, whereas modality-intrinsic semantics often encode local, fine-grained distinctions that should not be uniformly smoothed or aligned. Therefore, the key challenge is to identify semantic roles before cross-modal fusion. To this end, we leverage graph-frequency variation as a prior, where low-frequency components capture topology-consistent semantics and high-frequency components preserve modality-specific semantics. Based on this intuition, we propose SMGFM, a spectral multimodal graph pretraining framework that decomposes each modality-specific node signal into graph-frequency bands and assigns band-level semantic roles before cross-modal interaction. Concretely, SMGFM constructs frequency-resolved modality tokens with scalable Chebyshev filters, estimates their coupling reliability through topology-conditioned routing, and performs band-modality interaction before fusion. Its frequency-routed objectives align smooth consensus routes while preserving modality-specific routes, mitigating spatial-domain entanglement and uniform cross-modal alignment. Extensive experiments conducted on the MAG datasets demonstrate that SMGFM achieves state-of-the-art performance across graph-level and modality-level tasks.

82. Clustering Node Attributed Networks with Graph Neural Networks and Self Learning

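AI summary: Proposes an unsupervised graph clustering framework based on graph neural networks and self learning that alternates between refining node representations and clustering over multiple rounds and uses a context graph to improve performance; it performs well on synthetic and real data.

Link: https://arxiv.org/abs/2606.13444

Institution: Systems Engineering and Computer Science (PESC), Federal University of Rio de Janeiro (UFRJ)

Authors: Rodrigo de Sapienza Luna, Daniel Ratton Figueiredo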

Abstract: Graph clustering - partitioning the node set of a graph into disjoint subsets that reflect some latent information - is a fundamental problem as it finds applications in a myriad of different scenarios. While this classic problem has been tackled for decades by different communities, a recent variation of the problem driven by real data considers the scenario where nodes have attributes that are also informative. This has triggered novel methods that simultaneously leverage network information (edges) and node information (attributes) in the design of novel clustering algorithms. This work proposes a novel framework that builds on prior works that have applied graph neural networks (GNN) to graph clustering. The proposed framework operates in rounds of self learning in a fully unsupervised setting. In each round, a GNN generates representations for nodes that are used to cluster the nodes. This clustering influences the graph used to generate the node representation in the next round. Moreover, a context graph built in each round using the original graph is used to generate the node representations. Empirical results show that the proposed methodology extracts information from both network edges and node attributes in synthetic data, outperforming algorithms focused solely on the network or attributes when neither is very informative. Multiple rounds of learning also improve the performance and always outperform a long single round of training (i.e., classic GNN graph clustering). When considering real datasets, empirical results indicate that the proposed methodology is competitive with state-of-the-art methods when cluster sizes are balanced.

83. Understanding Truncated Positional Encodings for Graph Neural Networks

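AI summary: Studies how truncated positional encodings (such as the first k eigenspaces or powers of the adjacency matrix) affect the expressive power of graph neural networks. The theory shows that several families of positional encodings differ fundamentally in expressive power once truncated and that truncated spectral positional encodings are no longer stronger than the 1-WL test; experiments show that a mix of truncated encodings outperforms any single family.

Link: https://arxiv.org/abs/2606.13671

Authors: James Flora, Mitchell Black, Weng-Keen Wong, Amir Nayyeri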

Abstract: Positional encodings (PEs) enhance the power of graph neural networks (GNNs), both theoretically and empirically. Two of the most popular families of PEs - spectral (e.g., Laplacian eigenspaces, effective resistance) and walk-based (polynomials of the adjacency matrix) - are theoretically equivalent in expressive power, with expressivity between the 1-WL and 3-WL tests. However, this equivalence assumes the GNN uses the "complete" version of these PEs, which requires $O(n^3)$ time and space complexity. Instead, practitioners commonly use truncated variants of these encodings, such as the first $k$ eigenspaces or powers of the adjacency matrix. However, the theoretical properties of these truncated PEs are unknown. In this work, we initiate the study of these truncated PEs. Theoretically, we show that, under truncation, several families of PEs are fundamentally different in expressive power. As a corollary, we show that truncated spectral PEs are no longer stronger than the 1-WL test. We also study a family of spectral PEs, the $k$-harmonic distances, to highlight the differences in expressive power of even closely related truncated PEs. Finally, we experimentally show that a mix of truncated PEs is preferable to any single family on real-world datasets.
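
To make "truncated walk-based PE" concrete, the sketch below computes one simple member of that family: for each node, the diagonal entries of the first k powers of the adjacency matrix (closed-walk counts). This particular encoding and the helper names `truncated_walk_pe` and `adjacency` are illustrative assumptions, not the constructions analyzed in the paper. The toy graphs show that the truncation level k changes what the encoding can tell apart.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][t] * b[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]


def truncated_walk_pe(adj, k):
    """Per-node features [A^1[v][v], ..., A^k[v][v]]: closed walks of length 1..k."""
    n = len(adj)
    power = [row[:] for row in adj]
    feats = [[] for _ in range(n)]
    for _ in range(k):
        for v in range(n):
            feats[v].append(power[v][v])
        power = matmul(power, adj)
    return feats


def adjacency(n, edges):
    """Build a symmetric 0/1 adjacency matrix from an undirected edge list."""
    adj = [[0] * n for _ in range(n)]
    for u, v in edges:
        adj[u][v] = adj[v][u] = 1
    return adj


# Two 2-regular graphs on six nodes: a 6-cycle and two disjoint triangles.
hexagon = adjacency(6, [(i, (i + 1) % 6) for i in range(6)])
two_triangles = adjacency(6, [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)])

same_at_k2 = truncated_walk_pe(hexagon, 2) == truncated_walk_pe(two_triangles, 2)
same_at_k3 = truncated_walk_pe(hexagon, 3) == truncated_walk_pe(two_triangles, 3)
print(same_at_k2, same_at_k3)
```

With k = 2 the features only encode node degree, so the two graphs look identical; at k = 3 the triangle counts separate them.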

10. Transfer, Meta-Learning, and Continual Learning | 2 papers

84. How Useful is Causal Invariance for Domain Adaptation in Finite-Sample Settings?

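AI summary: Studies how causal invariance can improve supervised domain adaptation in linear regression, derives matching upper and lower bounds in terms of the target-risk margins between candidate predictors and the finite-sample estimation error, and shows that adaptive aggregation avoids negative transfer when the margins are sufficiently large.

Link: https://arxiv.org/abs/2606.12680

Institution: Department of Computer Science, ETH Zurich; Causal Artificial Intelligence Lab, Columbia University; Department of Statistics, Columbia University

Authors: Julia Kostin, Kasra Jalaldoust, Elias Bareinboim, Samory Kpotufe, Fanny Yang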

Abstract: Machine learning models often degrade when they are deployed on a target distribution that differs from the source distributions they were trained on. Recent work in causality-based domain generalization has shown how shared causal structure between domains can induce invariant predictors, e.g., models on a subset of features which have stable risk across structured domain shifts. However, the extent to which such population-level causal invariances can lead to gains in finite-sample settings remains underexplored. In particular, in practice we often have access to a few labeled target samples, a setting called supervised domain adaptation (sDA). In this paper, we explore when (full or partial) causal knowledge can provably improve supervised domain adaptation. As a first step, we study linear regression, where full or partial causal knowledge specifies a collection of invariant or possibly invariant feature subsets, each yielding a source-trained candidate predictor. We derive matching upper and lower bounds showing that finite-sample gains are governed by the target-risk margins separating the candidates, together with the finite-source estimation error. When these margins are sufficiently large relative to $n_Q$, an adaptive aggregation procedure can match the best candidate predictor while avoiding negative transfer relative to target-only learning. On the other hand, when the margins are too small, no algorithm can reliably exploit the candidate collection to obtain faster finite-sample rates. We further connect these margins to structural shift magnitude in linear SCMs and validate the theory on real-world causal benchmarks.

85. The Stable Recovery Manifold: Geometric Principles Governing Recoverability in Continual Learning

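AI summary: Analyzes sequential learning of a ResNet-18 on Split CIFAR-100 and finds that forgotten knowledge remains compactly decodable after representational reorganization; proposes the Stable Recovery Manifold hypothesis, under which catastrophic forgetting is primarily a problem of accessibility and manifold alignment.

Link: https://arxiv.org/abs/2606.13637

Authors: Ayushman Trivedi, Bhavika Melwani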

Abstract: Catastrophic forgetting is often viewed as the destruction of previously learned knowledge during sequential learning. Building on the Accessibility Collapse framework, we investigate the geometric structure of recoverability in continual learning. Using Split CIFAR-100 and a sequentially trained ResNet-18, we analyze recoverability, representational drift, and recovery complexity across ten tasks. We introduce Recovery Subspace Dimensionality (k_t), a measure of the minimum number of singular directions required to preserve 90 percent of full probe performance. Contrary to our Recoverability Diffusion hypothesis, recovery dimensionality remains stable throughout training (mean k_t = 8.0) despite substantial representational drift. Principal-angle drift strongly predicts recoverability (r = -0.862), and a simple geometric model explains 82.2 percent of recoverability variance. These findings support the Stable Recovery Manifold hypothesis, suggesting that forgotten knowledge remains compactly decodable despite representational reorganization. The results indicate that catastrophic forgetting is primarily an accessibility and manifold-alignment problem rather than information destruction.

11. Datasets, Benchmarks, and Evaluation | 18 papers

86. Scalable anomaly detection via a univariate Christoffel function

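AI summary: Christoffel function methods are hard to apply to high-dimensional data because the matrix size grows exponentially with dimension. The paper proposes a univariate Christoffel function (UCF) based on the squared distance between the query point and the support points, which outperforms 14 baselines in Average Precision on the ADBench benchmark.

Link: https://arxiv.org/abs/2606.12483

Authors: Florian Grivet, Didier Henrion, Jean-Bernard Lasserre, Louise Travé-Massuyès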

Abstract: Anomaly detection plays a critical role in identifying unusual patterns across domains such as fraud detection, network intrusion, and system fault diagnosis. Recently, Christoffel function-based methods, rooted in polynomial optimization, have emerged as promising alternatives to deep learning due to their strong mathematical foundations and computational frugality. However, their practical applicability is hindered by the need to invert a matrix whose size grows exponentially with the data dimension, rendering the method intractable even for moderate-dimensional datasets. This paper addresses the dimensionality limitations of Christoffel function-based anomaly detection while preserving its key theoretical properties, i.e., the on-off support dichotomy behavior and the accurate support shape capture. We introduce UCF, a univariate Christoffel function which is based on the squared distance between the query point and the support points. Extensive experiments on the ADBench benchmark demonstrate that UCF consistently outperforms 14 state-of-the-art baselines in terms of Average Precision. By resolving the scalability bottleneck of the Christoffel Function, this work expands the toolkit of anomaly detection methods with a robust, theoretically grounded, and universally applicable approach.

87. Crossing the Validation Crisis: Cross-Validation Reduces Benchmarking Variance Surprisingly Well

AI summary: Shows that cross-validation markedly improves the confidence and stability of algorithm performance evaluation, introduces the concept of sample gain to quantify the resulting virtual data augmentation, and adds a dynamic early-stopping procedure to reduce computational cost.

Link: https://arxiv.org/abs/2606.12552

Institution: MIND Team, Université Paris-Saclay, Inria, CEA, Palaiseau, France; SODA Team, Inria, Palaiseau, France; Probabl

Authors: Célestin Eve, Gaël Varoquaux, Thomas Moreau

Abstract: Modern machine learning progresses through empirical work, benchmarking new methods to evaluate relative performance. However, the statistical variability inherent to evaluation - exacerbated by the stochastic nature of many algorithms - often makes performance estimation unreliable due to the limited test samples available, leading to a validation crisis in which genuine advances are difficult to discern. In this work, we show that cross-validation markedly improves confidence when evaluating and comparing learning algorithm performances. We introduce the concept of sample gain, which quantifies the virtual data augmentation achieved by using multiple cross-validation splits to reduce benchmarking variance. Experiments on both synthetic and real-world datasets (histopathologic scans and NLP fine-tuning) demonstrate that multiple splits can substantially improve the reliability and stability of performance estimates, with diminishing returns often setting in later than expected. We also introduce a procedure to dynamically early-stop cross-validation by estimating from the first few folds if subsequent folds will bring large sample gains. Our findings highlight the value of pushing cross-validation on available samples to achieve robust and reliable benchmarking.

88. Emerging Flexible Designs for Geospatial Multimodal Foundation Models

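AI summary: Systematically compares geospatial foundation models with different architectures, evaluating their flexibility and performance under a unified setup, and offers design guidance for multimodal reasoning.

Link: https://arxiv.org/abs/2606.12595

Institution: Oak Ridge National Laboratory

Authors: Philipe Dias, Waqwoya Abebe, Abhishek Potnis, Aristeidis Tsaris, Dan Lu, Xiao Wang, Dalton Lunga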

Abstract: Foundation models are rapidly transforming Earth observation by enabling scalable pretraining across diverse unlabeled geospatial modalities. However, their architectural diversity, ranging from encoder-only to encoder-decoder and masked autoencoding paradigms, makes it challenging to assess performance trade-offs in a consistent manner. In this work, we present an apples-to-apples comparison of leading FM architectures designed for geospatial multimodal reasoning, with a particular focus on flexibility across varied spectral band configurations. We standardize pretraining using identical self-supervised learning objectives and training datasets, and evaluate all models under consistent parameterization on the GEOBench benchmark across classification and segmentation tasks. Our results offer new insights into the design trade-offs between model flexibility, modality alignment, and downstream task performance. By highlighting architectural strengths and limitations under controlled conditions, this study provides practical guidance for building next-generation geospatial foundation models capable of robust multimodal reasoning.

89. Evaluation of AutoML Frameworks for IDS under Imbalanced Data Conditions of the NSL-KDD Dataset

AI summary: Studies how severe class imbalance in the NSL-KDD dataset affects AutoML frameworks for multiclass intrusion detection; ensemble learning and imbalance-aware optimization improve minority-class detection, and PyCaret performs best (66% macro-F1).

Link: https://arxiv.org/abs/2606.12611

Institution: Cybersecurity and Artificial Intelligence Laboratory (CS&I Lab), National Institute of Telecommunications (Inatel); Wireless and Artificial Intelligence Laboratory (WAI Lab), National Institute of Telecommunications (Inatel)

Authors: Wiliane Carolina Silva, Evandro César Vilas Boas, Felipe A. P. de Figueiredo

Abstract: This work investigates the impact of severe class imbalance on the performance of automated machine learning (AutoML) frameworks for multiclass network intrusion detection using the NSL-KDD dataset. Unlike previous studies that simplify the problem through binary classification or minority-class removal, we preserve the original five-class distribution, including highly underrepresented attacks such as R2L and U2R, enabling a realistic evaluation of imbalance-sensitive learning behavior. Nine open-source AutoML frameworks were analyzed under a unified and reproducible experimental protocol, considering differences in architectural design, ensemble strategies, validation procedures, hyperparameter optimization, and imbalance-handling mechanisms. The results demonstrate that frameworks incorporating ensemble learning and imbalance-aware optimization achieve better minority-class discrimination. PyCaret obtained the best overall performance, reaching 66% macro-F1, followed by AutoGluon with 55%, whereas frameworks lacking native balancing support exhibited significant degradation in minority-class detection capability. The analysis further shows that accuracy-oriented optimization alone is insufficient for highly imbalanced IDS scenarios, since high-weighted metrics may coexist with poor generalization on rare attack categories. As a contribution, this work establishes a standardized benchmark for AutoML-based intrusion detection under severe multiclass imbalance, highlighting current architectural limitations and the need for native integration of imbalance-aware optimization, resampling, and stratified evaluation strategies into automated learning pipelines. The source code is publicly available.
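
The abstract's point that accuracy can coexist with poor rare-class detection is easy to reproduce on a toy example. The sketch below computes accuracy and macro-F1 from scratch for a degenerate classifier that always predicts the majority class; the class counts are hypothetical and only borrow the `normal` and `U2R` label names from NSL-KDD.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so every class counts equally."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical split: 95 normal connections and 5 rare U2R attacks.
y_true = ["normal"] * 95 + ["U2R"] * 5
# A degenerate model that always predicts the majority class.
y_pred = ["normal"] * 100

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro-F1 = {macro_f1(y_true, y_pred):.3f}")
```

The model scores 95% accuracy while missing every attack; macro-F1 falls below 0.5 because the U2R class contributes an F1 of zero to the average.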

90. The Metric Picks the Winner: Evaluation Choice Flips Model Rankings for Drug-Response Prediction in Unseen Chemistry

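AI summary: Using data from the VCPI contest, shows that the ranking of drug-response prediction models flips with the evaluation metric: simple baselines win under a proxy metric, but under the contest's true metric the deep models win and the fusion decoder significantly beats the linear fingerprint baseline. The authors describe this as, to their knowledge, the first demonstration of the metric-calibration effect on real held-out drug chemistry.

Link: https://arxiv.org/abs/2606.12639

Authors: Dhruv Agarwal, Riya Bisht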

Abstract: Predicting how a cell's transcriptome responds to a drug it has never seen is a core, hard problem in computational cell biology: recent benchmarks show complex models often fail to beat trivial baselines once test compounds are held out by chemistry. We study one cell line and assay, THP-1 cells profiled by DRUG-seq, scored by the active-compound weighted MSE (wMSE) of the VCPI prediction contest. We propose a staged approach: dumb baselines (untreated control and mean training-compound response) that the field keeps failing to beat; non-parametric retrieval (a Tanimoto-weighted average of a held-out compound's nearest training compounds); and a fusion stage combining a frozen chemistry embedding with retrieval-support features to predict the residual over the mean, with an uncertainty head and gene programs. On the released VCPI THP-1 DRUG-seq data (14,026 training compounds), under a Bemis-Murcko scaffold split, the model ranking inverts depending on the metric. Under an inverse-variance per-gene proxy, a regularized linear regression on Morgan fingerprints appears to win over the deep models, retrieval, and ChemBERTa -- the textbook "simple baselines win" result. But under the contest's true active-set metric (per-(gene, compound) Mejia weights, validated against the official scorer; mean baseline 0.535 vs the organizers' 0.507 reference), that reverses: the deep models win, our fusion decoder significantly beats the linear fingerprint baseline (-0.012 wMSE, paired bootstrap p < 10^-4), and the proxy's winner becomes the worst chemistry-aware predictor. Picking the metric picks the winner -- to our knowledge the first demonstration on real held-out drug chemistry of the metric-calibration effect established largely on genetic perturbation. We release a reproducible pipeline wired to the official scorer that emits a valid submission over the real 1064 x 12,995 grid.
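
The retrieval stage described in the abstract, a Tanimoto-weighted average over a held-out compound's nearest training compounds, can be sketched in a few lines. Everything below is an illustrative assumption: fingerprints are modeled as sets of on-bit indices, the value of `k` and the fallback to the mean training response are choices made here, and the tiny data set is invented.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def retrieve_response(query_fp, train_fps, train_responses, k=3):
    """Similarity-weighted average of the k most similar training responses."""
    scored = sorted(
        ((tanimoto(query_fp, fp), i) for i, fp in enumerate(train_fps)),
        reverse=True,
    )[:k]
    total = sum(score for score, _ in scored)
    n_genes = len(train_responses[0])
    if total == 0:
        # Assumed fallback: no similar neighbor, so use the mean training response.
        return [
            sum(resp[g] for resp in train_responses) / len(train_responses)
            for g in range(n_genes)
        ]
    return [
        sum(score * train_responses[i][g] for score, i in scored) / total
        for g in range(n_genes)
    ]


# Invented toy data: three training compounds, two genes each.
train_fps = [{1, 2, 3}, {3, 4, 5}, {7, 8}]
train_responses = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]

prediction = retrieve_response({1, 2, 3, 4}, train_fps, train_responses, k=2)
print(prediction)
```

In practice the bit sets would come from a cheminformatics toolkit (the abstract mentions Morgan fingerprints), and the response vectors would hold one value per measured gene.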

91. TEDD: Robust Detection of Unstable Temporal Features

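AI summary: Proposes TEDD, which uses a regression model to detect the features responsible for temporal distribution change; it requires no parameter tuning, is scalable, and detects univariate and multivariate drift in both numerical and categorical features.

Link: https://arxiv.org/abs/2606.12643

Institution: Feedzai

Authors: Ricardo Ribeiro Pereira, Bruno Casal Laraña, Nádia Soares, Miguel Araújo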

Abstract: When working with real-world temporal data, it is common to encounter features whose distribution is changing over time. The naive employment of Machine Learning models on this unstable data might lead to rapidly degrading performance, especially if the new distribution is much different from what was previously seen during training. In order to cope with this problem, it is critical to automatically identify features that are changing over time. With these features detected, data scientists and other practitioners will be able to mitigate the issue (for instance, by applying data transformations), deploying more robust models that retain high performance for longer periods of time. In this paper, we describe which temporal changes a feature should not suffer from, and propose TEDD, a technique to a) identify when a dataset might lead to an unstable Machine Learning model and b) automatically detect which features cause such lack of robustness. In order to achieve it, we leverage a regression model to highlight which features contribute to a good prediction of an instance's timestamp. We compare our approach to other methods in real and synthetic data, testing their detection capability on all simple change patterns. We show that our method: detects all types of basic changes, both for numerical and categorical features; can detect multivariate drifts; returns a comparable value measuring the amount of change of each feature; requires no parameter tuning; and is scalable both on number of features and instances of the dataset.

92. Out-of-Distribution (OOD) Detectors for Open-Set RF Fingerprinting

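AI summary: To handle the distribution shift that unknown transmitters and temporal drift cause in open-set RF fingerprinting, presents a unified information-theoretic framework for OOD detection and applies methods that need no OOD tuning data; on the POWDER dataset these perform close to baselines that have access to true OOD data.

Link: https://arxiv.org/abs/2606.12718

Authors: Sudeepta Mondal, Ganesh Sundaramoorthi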

Abstract: Radio-frequency (RF) fingerprinting systems must operate in open-world environments where signals from unknown transmitters and temporal drift introduce distribution shift at test time. Out-of-distribution (OOD) detection provides a natural framework for this problem, yet its application to RF fingerprinting (RFF) remains limited. A key barrier to adoption is that most OOD detectors require auxiliary OOD data for parameter tuning, an assumption that is difficult to satisfy in RF environments where representative OOD data is impractical to collect. In this work, we introduce a promising set of OOD detection methods from the machine learning literature to the open-set RFF domain. We present these methods within a unified mathematical framework based on information theory, which is a natural framework for communication systems. Our framework allows for the systematic analysis of methods and development of new methods. We further demonstrate the applicability of recent work on tuning OOD detectors without given OOD tuning data for open-set RFF. We evaluate on the POWDER RF fingerprinting dataset, showing that detectors tuned without any given OOD data achieve performance comparable to baselines with access to true OOD tuning data and greatly outperform baseline approaches without access to true OOD tuning data, showcasing the practical viability for the RFF problem.

93. Detecting Functional Memorization in Code Language Models

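AI summary: Studies functional memorization in code language models using a counterfactual setup that compares a model exposed to the target code with a reference model that was not, measured with textual and functional similarity; functional memorization extends beyond what textual overlap detects.

Link: https://arxiv.org/abs/2606.12764

Institution: Meta; Imperial College London

Authors: Matthieu Meeus, Anil Ramakrishna, Matthew Grange, Zheng Xu, Luca Melis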

Abstract: Large language models (LLMs) are increasingly used to generate code at scale. Meanwhile, prior work has investigated whether training data may be recoverable from model outputs, by auditing the textual overlap between training examples and model generations. Code, however, can be functionally equivalent while textually dissimilar. In this work, we study functional memorization: extraction of functional logic beyond what verbatim metrics detect. We construct a counterfactual setup for Olmo-3-32B, comparing a midtrained model (exposed to target code) against a pretrained reference (not exposed). We prompt both models with Python function signatures and measure both textual and functional similarity (i.e., LLM-as-a-judge, execution-based). Our results show clear evidence of functional memorization, highlighting the need for auditing metrics that go beyond textual overlap.

94. Selecting Samples on Graphs: A Unified Dataset Pruning Framework for Lossless Training Acceleration

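AI summary: Proposes a unified graph-based dataset pruning framework that models the dataset as a weighted graph, selects samples by solving a Maximum Weight Clique Problem with a greedy algorithm, and outperforms existing methods across pruning ratios, cutting training time on ImageNet-1k by over 40% without loss of accuracy.

Link: https://arxiv.org/abs/2606.12913

Authors: Dongyue Wu, Zilin Guo, Xiaoyu Li, Jiajia Liu, Jingdong Chen, Nong Sang, Changxin Gao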

Abstract: The rapid growth of modern training datasets has significantly increased computational cost, motivating dataset pruning (DP) methods which retain only a subset of informative samples to reduce training cost. Existing pruning criteria typically rely on either intrinsic signals that assess samples independently or extrinsic signals that promote diversity via pairwise relations. While effective in their own specific regimes, each captures only one aspect of sample utility and lacks robustness across different pruning ratios or data distributions. In this work, we present a unified graph-based DP framework. By modeling the dataset as a weighted graph, where node weights encode intrinsic value and edge weights encode extrinsic value, DP can be cast as a Maximum Weight Clique Problem (MWCP). Although MWCP is NP-hard, its structure admits a principled greedy solution based on sample-wise marginal gains. Under a few mild conditions, we further prove that this unified objective enjoys a formal approximation guarantee, which applies to a broad family of importance metrics and provides practical design guidelines. Extensive experiments show that our method outperforms existing DP methods while substantially reducing training cost, reducing training time by over 40% without sacrificing accuracy on ImageNet-1k with ResNet-50.
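
A greedy selection driven by sample-wise marginal gains, as the abstract describes, might look like the sketch below. The objective used here (sum of node weights of the kept samples plus the edge weights between every kept pair) is an assumed reading of the abstract, not the paper's exact formulation, and `greedy_prune` and the toy weights are hypothetical.

```python
def greedy_prune(node_w, edge_w, budget):
    """Greedily keep `budget` samples by largest marginal gain.

    node_w[i]    intrinsic value of sample i (e.g. a difficulty score).
    edge_w[i][j] extrinsic value of keeping i and j together (e.g. dissimilarity).
    The marginal gain of sample i is node_w[i] plus its edge weights to the
    samples that are already selected.
    """
    n = len(node_w)
    gain = list(node_w)          # with nothing selected, gain is the node weight
    remaining = set(range(n))
    selected = []
    for _ in range(min(budget, n)):
        # Break ties by lowest index so the result is deterministic.
        best = max(remaining, key=lambda i: (gain[i], -i))
        selected.append(best)
        remaining.remove(best)
        for i in remaining:
            gain[i] += edge_w[best][i]
    return selected


# Toy example: samples 0 and 1 are near-duplicates (edge weight 0),
# while sample 2 is less valuable alone but very different from both.
node_w = [1.0, 0.9, 0.2]
edge_w = [
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]
kept = greedy_prune(node_w, edge_w, budget=2)
print(kept)
```

A purely intrinsic criterion would keep the near-duplicate pair 0 and 1; adding the edge term makes the second pick sample 2, which is the kind of interaction between the two signals that the unified objective is meant to capture.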

95. DeepJEB++: Foundation Model-Driven Large-Scale 3D Engineering Dataset via 2D Latent Space Augmentation

AI summary: Proposes DeepJEB++, a framework that uses 2D latent-space augmentation and foundation models to expand a small set of jet engine bracket seed designs into a large simulation-labeled 3D dataset, a 40x expansion.

Link: https://arxiv.org/abs/2606.12994

Institution: Cho Chun Shik Graduate School of Mobility, Korea Advanced Institute of Science and Technology; Department of Mechanical Engineering, Hanyang University; Narnia Labs

Authors: Soyoung Yoo, Leekyo Jeong, Jinsu Ra, Dongeon Lee, Sunwoong Yang, Hyogu Jeong, Namwoo Kang

Abstract: Data-driven engineering design is constrained by the lack of large-scale 3D datasets that pair geometry with physics-based performance labels. In particular, existing 3D data augmentation techniques have limitations in preserving subtle and diverse geometric variations, and it remains difficult to automate the subsequent simulation-labeling process, where boundary conditions vary depending on the generated geometry. We present DeepJEB++, a foundation-model-driven data-augmentation framework that expands a small seed set of jet engine brackets into a large, simulation-labeled 3D dataset under constrained resources. Our key idea is to augment in the data-rich 2D latent space, then transfer to 3D. In Stage 1, we fine-tune a pretrained 2D latent diffusion model on multi-view renders and synthesize novel views by latent interpolation, retaining manufacturable designs through a vision-language-model (VLM) quality filter. In Stage 2, the validated images are lifted to 3D meshes by a domain-adapted generative foundation model. In Stage 3, an automated pipeline recognizes the load and bolt interfaces on each mesh and assigns finite-element labels -- mass, stress, and displacement -- without manual intervention. We assess augmentation quality along three intrinsic axes: manufacturability, label fidelity against the SimJEB ground truth, and distributional consistency. Starting from fewer than 400 seed designs, DeepJEB++ yields 15,360 simulation-labeled 3D brackets -- a 40x expansion -- using a single GPU per stage. The dataset will be made publicly available to support reproducible engineering-AI research.

96. Reliability of Probabilistic Emulation of Physical Systems

AI summary: Compares the reliability of generative models and CRPS-trained ensembles for probabilistic emulation of physical systems and finds that CRPS-trained ensembles typically achieve better coverage and faster inference.

Link: https://arxiv.org/abs/2606.12997

Institution: The Alan Turing Institute; Autodesk Research; PhysicsX; Orbital; University of Sheffield; University College London

Authors: Sam F. Greenbury, Radka Jersakova, Paolo Conti, Marjan Famili, Christopher Iliffe Sprague, Edwin Brown, Jason D. McEwen

Abstract: Two dominant approaches have emerged for generating probabilistic forecasts of physical systems: generative models, such as diffusion or flow matching; and ensembles of deterministic models with stochasticity injected, trained using the continuous ranked probability score (CRPS) loss. While both approaches have demonstrated strong predictive accuracy, the reliability of their uncertainties has not been systematically assessed. We address this gap by developing a framework to evaluate both approaches across diverse 2D spatiotemporal physical systems, under matched model size and computational budget. We assess the reliability of probabilistic emulation by inspecting the empirical coverage of predictive intervals, while also considering accuracy and computational efficiency metrics. CRPS-trained ensembles typically achieve more reliable uncertainties on both single-step prediction and autoregressive rollouts, demonstrating better coverage than the standard alternative of training generative models in a latent space. Moreover, the CRPS approach offers significantly faster inference. When generative models are trained in ambient rather than a compressed latent space, which is often infeasible for high-dimensional problems, they exhibit comparable coverage to CRPS-trained ensembles, though with substantially larger inference latency. In contrast, when CRPS-trained ensembles are trained in latent space they do not show a marked degradation in coverage with respect to ambient space. Both generative models and CRPS-trained ensembles demonstrate good predictive accuracy. To facilitate future research and application, we release AutoCast, a modular framework implementing both generative models and CRPS-trained ensembles, alongside AutoSim, a flexible dataset generation package for rapid prototyping.
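
For readers unfamiliar with the two quantities the abstract relies on, the sketch below computes the standard ensemble estimator of the CRPS for a scalar observation, and a simple empirical coverage check based on the ensemble's min-max range. It is not taken from AutoCast; the function names are hypothetical, and the min-max interval is only one of several ways to form a predictive interval.

```python
def ensemble_crps(members, observation):
    """Ensemble CRPS estimator: E|X - y| - 0.5 * E|X - X'| over ensemble members."""
    m = len(members)
    accuracy_term = sum(abs(x - observation) for x in members) / m
    spread_term = sum(abs(a - b) for a in members for b in members) / (2 * m * m)
    return accuracy_term - spread_term


def range_coverage(ensembles, observations):
    """Fraction of observations that fall inside the ensemble's min-max range.

    For M exchangeable members the nominal coverage of this interval is
    (M - 1) / (M + 1), so a well-calibrated ensemble should land near that value.
    """
    hits = sum(
        min(members) <= obs <= max(members)
        for members, obs in zip(ensembles, observations)
    )
    return hits / len(observations)


# A two-member ensemble around the truth scores better than a single point guess
# that is the same distance away.
print(ensemble_crps([0.0, 2.0], 1.0))
print(ensemble_crps([0.0], 1.0))
print(range_coverage([[0.0, 2.0], [0.0, 2.0]], [1.0, 3.0]))
```

Lower CRPS is better; with a single member the score reduces to the absolute error, which is why the CRPS is often described as a probabilistic generalization of MAE.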

97. Authority, Truth, and Citation Bias: A Large-Scale Multi-Domain Benchmark for Studying Epistemic Susceptibility in Large Language Models

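AI summary: Proposes the AuthorityBench benchmark, which uses a 2x2 factorial design to isolate the effect of citation-based authority signals on LLM epistemic behavior. Citation presence, whether real or fabricated, raises hallucination rates; pairing true claims with fabricated citations raises them by 3 to 22 percentage points.

Link: https://arxiv.org/abs/2606.13104

Authors: Aryan Khurana, Aravind Ramana RN, Dhruv Kumar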

Abstract: Large language models are increasingly deployed in citation-augmented settings, yet the effect of citation presence on model behavior independent of factual content remains poorly understood. We introduce AuthorityBench, a 220,564-prompt multi-domain benchmark that isolates how citation-based authority signals influence epistemic behavior in LLMs. The benchmark uses a fully balanced 2x2 factorial design crossing claim veracity with citation veracity, the first to do so, across four domains (general knowledge, science, law, and medicine), with controlled variation over 40 prompt templates, four venue prestige tiers, and a country-coded author name dataset. Evaluating seven models on 12 structured research questions, we find that citation presence, whether real or fabricated, consistently increases hallucination rates relative to a no-citation baseline. The effect is strongest when fabricated citations accompany true claims, raising hallucination rates by 3 to 22 percentage points and reaching 35 to 77% in the general knowledge domain, while legal claims are comparatively robust and venue prestige and author demographics show negligible impact. All datasets and evaluation code are available at: https://github.com/floating-reeds/AuthorityBench

98. Disparate Impact in Synthetic Data Generation

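AI summary: Revisits the disparate-impact notion of fairness for synthetic data generation (SDG) and notes that non-disparate impact is achieved when the synthetic and real distributions coincide. Analyzes why SDG can fail to reach that solution (limited expressive power, sampling error, and estimation error induced by differential privacy) and proposes a group-wise learning strategy that can improve both overall utility and its parity across groups.

Link: https://arxiv.org/abs/2606.13105

Institutions: Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 - CRIStAL

Authors: Paul Andrey, Michaël Perrot, Batiste Le Bars, Marc Tommasi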

Abstract: We revisit the fairness notion of disparate impact for synthetic data generation (SDG), which assesses whether the utility of generated records is the same across sensitive groups. Our approach departs from existing work on fair SDG, which addresses the problem of correcting for undue biases in the observed distribution, hence redefining SDG as learning a distribution that is not that of the real data. By contrast, non-disparate impact is notably achieved when the synthetic and real distributions are the same. We expose reasons why SDG may fail to reach that solution and discuss why approximation and estimation errors occur and can be disparate across groups. We notably look into the expressive power of SDG methods relative to distribution complexity, sampling errors due to group proportions, and estimation errors induced by differential privacy mechanisms. We illustrate cases of disparate impact on both artificial and real-world data, focusing on SDG methods that rely on probabilistic graphical models. We also introduce a strategy of learning group-wise SDG models and illustrate how it can improve both the overall utility and its parity in many settings.

99. WHAR Arena: Benchmarking the State of the Art in Efficient Wearable Human Activity Recognition

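AI summary: To address the comparability crisis in wearable human activity recognition, builds a large-scale benchmark that integrates 30 datasets and evaluates 17 architectures. Predictive performance appears to be saturating, while compact neural models and Random Forests define the Pareto frontier for deployment efficiency.

Link: https://arxiv.org/abs/2606.13194

Institutions: Karlsruhe Institute of Technology; IPAI Foundation gGmbH

Authors: Maximilian Burzer, Tobias King, Till Riedel, Michael Beigl, Tobias Röddiger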

Abstract: Deep learning has become the dominant paradigm in Wearable Human Activity Recognition (WHAR), yet progress is obscured by a comparability crisis. Results are often reported using inconsistent datasets, custom data processing, and varying evaluation protocols, making state-of-the-art claims fragile. We address this with a large-scale, open-source benchmark that integrates 30 diverse datasets under standardized processing, unified model interfaces, and a shared cross-subject evaluation protocol. Evaluating 17 representative architectures across 4760 training runs, we jointly measure predictive performance alongside on-device latency, peak memory, and model size on an Android reference device. Our results reveal that the WHAR state of the art is distributed rather than dominated by a single architecture. While CNN-HAR achieves the highest mean macro-F1, top-performing models cluster tightly, indicating contemporary architectures have converged near a predictive performance ceiling. When accounting for deployment efficiency, compact neural models, such as TinierHAR, and classical Random Forests define the practically relevant Pareto frontier, whereas larger recurrent and hybrid models incur high hardware costs without corresponding performance gains. Consequently, while predictive performance has plateaued, substantial potential for future progress remains in optimizing deployment efficiency and improving adaptation to domain shifts. We release our full framework to support transparent reuse and extension.

100. From Uncertain Judgments to Calibrated Rankings: Conformal Elo Estimation for LLM Evaluation

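AI summary: Proposes a two-level calibration approach that combines local uncertainty propagation with global conformal prediction, bringing LLM-as-a-judge Elo ratings within 17.9 Elo MAE of human-derived ratings and providing distribution-free prediction intervals.

Link: https://arxiv.org/abs/2606.13221

Institutions: ELLIS Institute Tübingen; OpenEuroLLM

Authors: Bora Kargi, David Salinas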

Abstract: Evaluating new large language models typically requires costly human annotation campaigns at scale. LLM-as-a-judge offers a cheaper alternative, but judge scores carry systematic errors, such as position bias, self-preference, or intransitivity, that can strongly miscalibrate the resulting rankings. We quantify the resulting judge-human disagreement at two complementary levels. At the local level, we estimate per-battle uncertainty from the judge's own score differences by propagating calibrated win probabilities rather than hard labels into the Bradley-Terry procedure. This alone provides a drastic improvement to Elo estimation accuracy, bringing LLM-derived ratings within 17.9 Elo MAE of human-derived ones when averaged over 55 held-out models on LMArena. At the global level, we apply split conformal prediction to the residual gap between LLM-derived and human-derived Elo ratings across held-out models, producing prediction intervals with distribution-free marginal coverage guarantees that account for irreducible LLM-human disagreement. Together, these two layers yield a low-cost evaluation tool that provides developers with calibrated Elo estimates and honest uncertainty bounds, without access to large-scale human annotations. To facilitate reproducibility, we release our code at https://github.com/kargibora/SoftElo.
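
The global layer described here, split conformal prediction on the gap between LLM-derived and human-derived Elo, follows a standard recipe that can be sketched as below. This is a generic illustration and not the SoftElo code: the function names are hypothetical, and it assumes absolute residuals as the conformity score with symmetric intervals.

```python
import math


def split_conformal_halfwidth(calib_residuals, alpha=0.1):
    """Half-width q such that |new residual| <= q with probability >= 1 - alpha."""
    n = len(calib_residuals)
    scores = sorted(abs(r) for r in calib_residuals)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for the requested level.
        return math.inf
    return scores[k - 1]


def conformal_interval(llm_elo, halfwidth):
    """Symmetric prediction interval around an LLM-derived Elo estimate."""
    return (llm_elo - halfwidth, llm_elo + halfwidth)
```

Here `calib_residuals` would hold (human Elo minus LLM Elo) for calibration models that have both ratings; the coverage guarantee is marginal and relies on exchangeability between calibration and new models.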

101. Navigating the Safety-Fidelity Trade-off: Massive-Variate Time Series Forecasting for Power Systems via Probabilistic Scenarios

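AI summary: Existing benchmarks cannot evaluate the safety-fidelity trade-off in massive-variate probabilistic forecasting. Introduces PowerPhase, a power-system benchmark with up to 36,964 channels, and PowerForge, a scenario-based quantile forecaster that achieves the best average rank on every grid.

Link: https://arxiv.org/abs/2606.13338

Institutions: ZJU-UIUC Institute, Zhejiang University

Authors: Kaijie Xu, Anqi Wang, Xilin Dai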

Abstract: Probabilistic forecasting models are increasingly deployed on multivariate systems with distinct channel physics and operational constraints, but existing benchmarks evaluate neither property at scale. Public canonical multivariate benchmarks cap out at 2,000 channels, while power-system benchmarks either lack temporal structure or probabilistic evaluation. We introduce PowerPhase, a probabilistic forecasting benchmark built on six transmission grids ranging from 2,000 to 36,964 jointly forecasted channels, more than an order of magnitude beyond popular canonical multivariate benchmarks. Each target trajectory is the output of an AC power-flow solve, and PowerPhase ships with constraint-aware metrics, including Safety_mBrier, NECV, and CVaR-alpha, that complement CRPS and Distortion. Across eight baselines and three seeds, distributional accuracy and constraint satisfaction rank models differently, a trade-off we term safety-fidelity. We further propose PowerForge, a scenario-based quantile forecaster with type-specific decoding heads and a causal bridge between variable groups, which achieves the best average rank on every grid.

102. SupraBench: A Benchmark for Supramolecular Chemistry

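AI summary: To evaluate large language models on supramolecular chemistry reasoning, releases SupraBench, the first supramolecular benchmark, developed with domain experts. It comprises four fundamental tasks plus an auxiliary vision-based task and is accompanied by SupraPMC, a 16M-token corpus.

Link: https://arxiv.org/abs/2606.13477

Institutions: University of Notre Dame; University of Connecticut

Authors: Tianyi Ma, Yijun Ma, Zehong Wang, Weixiang Sun, Ziming Li, Connor R. Schmidt, Chuxu Zhang, Matthew J. Webber, Yanfang Ye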

Abstract: Supramolecular chemistry, which includes the study of non-covalent host-guest assemblies, has advanced various applications. However, designing host-guest systems remains time-consuming, requiring days of dry-lab verification per candidate pair. Although LLMs have emerged as a fast alternative with strong performance on molecular binding tasks, no benchmark currently systematically evaluates LLMs for host-guest reasoning across fundamental supramolecular chemistry tasks, e.g., binding affinity prediction. To this end, we collaborate with domain experts to release the first Supramolecular Benchmark, called SupraBench, to evaluate LLMs in chemistry reasoning. Specifically, we design four fundamental tasks, i.e., binding affinity prediction, top-binder selection, solvent identification, and host-guest description, plus an auxiliary vision-based task for molecular identification. We also release SupraPMC, a curated 16M-token corpus of supramolecular chemistry articles distilled from Europe PMC, to support the adaptation to the supramolecular domain. We benchmark a broad range of open and proprietary LLMs and find that LLMs leave substantial headroom across all tasks. Domain adaptation pretraining over SupraPMC transfers cleanly to in-distribution regression but trades off against strict letter-format output. Moreover, the difficulty profile differs sharply across task families, revealing distinct failure modes that indicate specific gaps in current supramolecular chemistry reasoning. Our source code and benchmark datasets are available at https://github.com/Tianyi-Billy-Ma/SupraBench.

103. CRAFTIIF: Cross-Resolution Analytic Four-Type Interpretable Isolation Forest for Multivariate Time Series Anomaly Detection

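AI summary: Proposes CRAFTIIF, an unsupervised framework that uses four wavelet feature families and five Isolation Forests to detect point, distributional, temporal, and collective anomalies. On the mTSBench benchmark it reaches a mean F1 of 0.228 and improves VUS-PR by 40.7% over the previous best.

Link: https://arxiv.org/abs/2606.13486

Institutions: Avathon

Authors: William Smits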

Abstract: Anomaly detection in multivariate time series is challenged by four structurally distinct anomaly types -- point (isolated spikes), distributional (level shifts), temporal (rhythm changes), and collective (inter-sensor correlation breakdowns) -- each requiring different feature representations. Most unsupervised methods target only one or two types and provide limited interpretability. We present CRAFTIIF (Cross-Resolution Analytic Four-Type Interpretable Isolation Forest), a fully unsupervised framework targeting all four types without dataset-specific tuning. CRAFTIIF generates K=500 random analytic wavelet feature draws across four families (Morlet, DOG, Haar, Coiflet), each targeting a specific anomaly type, feeding five structured Isolation Forests -- one per type plus a meta-IF for compound anomalies. An adaptive Otsu/MAD threshold calibrates detection automatically across anomaly rates from 0.1% to 69.2%. Because each IF is trained exclusively on type-specific features, branch firing provides direct anomaly-type attribution by construction, without post-hoc explanation. Evaluated on all 19 datasets of the mTSBench benchmark (Zhou et al., TMLR 2026), CRAFTIIF achieves mean F1=0.228 (all 19 datasets) and F1=0.322 (13 detectable datasets), ranking first among all 25 evaluated methods on VUS-PR (0.463 vs. previous best 0.329, +40.7%). A diagnostic framework -- oracle F1, detectability limits, and branch separation ratios -- identifies 6 of 19 datasets as fundamentally undetectable by any unsupervised method. Ablation over 11 conditions confirms adaptive thresholding (+38% F1), four-branch structure (+20%), and meta-IF (+23%) are each essential. Code: https://github.com/smitswil/craftiif
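
The abstract mentions an adaptive Otsu/MAD threshold without giving its exact rule. As a rough illustration of the MAD half only, the sketch below applies a generic median-absolute-deviation cutoff to anomaly scores; the function names, the multiplier `k`, and the 1.4826 normal-consistency constant are assumptions and are not taken from the CRAFTIIF code.

```python
import statistics


def mad_threshold(scores, k=3.0):
    """Robust cutoff: median + k * 1.4826 * MAD of the anomaly scores."""
    med = statistics.median(scores)
    # Median absolute deviation from the median.
    mad = statistics.median(abs(s - med) for s in scores)
    return med + k * 1.4826 * mad


def flag_anomalies(scores, k=3.0):
    """Indices of scores that exceed the MAD-based cutoff."""
    cutoff = mad_threshold(scores, k)
    return [i for i, s in enumerate(scores) if s > cutoff]
```

Because the cutoff depends only on the median and MAD, a few extreme scores do not inflate it, which is the usual reason to prefer it over a mean-plus-standard-deviation rule when the anomaly rate is unknown.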

12. Machine Learning Applications | 17 papers

104. Quickest Detection of Hallucination Onset: Delay Bounds and Learned CUSUM Statistics

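AI summary: Formulates hallucination onset detection as a quickest change detection problem. Building on a first-order Markov model validated on RAGTruth, a learned CUSUM detects onset in 11-13 tokens at a matched false-alarm rate, outperforming a linear baseline and exposing delay structure that classification metrics conceal.

Link: https://arxiv.org/abs/2606.12476

Institutions: Independent Researcher

Authors: Igor Itkin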

Abstract: Token-level hallucination detectors are evaluated as classifiers, by AUC over all tokens, yet a streaming monitor is judged by its reaction time: the number of tokens that pass between the onset of a hallucination and the alarm. We formulate hallucination onset detection as a quickest change detection problem. A first-order Markov model of the latent faithful/hallucinated state, validated on RAGTruth, places the task inside classical change-point theory and yields Lorden's lower bound on detection delay: about 1.3 tokens at a false-alarm rate of 0.01. We then show that a causal recurrent labeler acts as a CUSUM with a learned increment; at a matched false-alarm rate it detects in 11-13 tokens, against 31 for a linear per-token baseline, and a controlled decomposition attributes most of this advantage to a better per-token score rather than to temporal accumulation. An information-rate optimality theorem of Donsker-Varadhan type explains the remaining order-of-magnitude gap: the learned score realizes only 1/4.5 of the divergence the features carry, a deficit that recalibration cannot remove, with the remainder a finite-horizon effect. Classification metrics conceal this delay structure; sequential analysis makes it measurable.
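
For readers unfamiliar with CUSUM, the classical recursion the abstract refers to can be sketched as follows. The per-token increments are assumed to be given (for example, log-likelihood-ratio-like scores that are positive when a token looks hallucinated); the function names and the delay convention (alarm index minus onset index) are illustrative assumptions, not the paper's code.

```python
def cusum_alarm(increments, threshold):
    """Index of the first token where the CUSUM statistic crosses the threshold."""
    stat = 0.0
    for t, s in enumerate(increments):
        # Accumulate evidence for a change, resetting at zero.
        stat = max(0.0, stat + s)
        if stat >= threshold:
            return t
    return None  # no alarm raised


def detection_delay(increments, onset, threshold):
    """Tokens between the true onset and the alarm, or None if missed or early."""
    alarm = cusum_alarm(increments, threshold)
    if alarm is None or alarm < onset:
        return None
    return alarm - onset
```

Raising `threshold` lowers the false-alarm rate at the cost of a longer delay, which is why the abstract compares detectors at a matched false-alarm rate.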

105. An Empirical Study on Predictive Maintenance for Component X in Heavy-Duty Scania Trucks

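AI summary: For truck fleets, proposes a condition-based predictive maintenance methodology that models the wear-and-tear state as a monotonically non-decreasing time series, selects the most recent observations, converts them to tabular form, and uses AutoML to simplify modeling. It reduces costs on the Scania Component X dataset.

Link: https://arxiv.org/abs/2606.12486

Institutions: SnT, University of Luxembourg; Scania CV AB

Authors: Valeriu Dimidov, Sasan Jafarnejad, Raphaël Frank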

Abstract: Condition-based Predictive Maintenance (PdM) for truck fleets has gained momentum in recent years. This maintenance strategy aims to minimize unplanned downtimes and reduce costs by monitoring the health status of vehicles and taking proactive action based on their condition. However, the implementation of condition-based PdM systems is challenging due to the large volume of data generated by the trucks, the inherent complexity of detecting failures through sensor data, and the difficulties in finding cost-effective trade-offs in the solution's implementation. In this paper, we define and validate a condition-based PdM methodology built on the assumption that the wear-and-tear state of the monitored component can be represented as a monotonically non-decreasing time series. It involves selecting only the most recent observations from the time series and transforming them into a tabular format for classification using machine learning (ML) models designed for tabular data. Our results indicate that the proposed methodology reduces costs on the Scania Component X dataset compared to current state-of-the-art (SOTA) approaches, while also simplifying the modeling process through AutoML.

106. Improving Crash Frequency Prediction from Simulated Traffic Conflicts Using Machine Learning Based Microsimulation

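AI summary: Replaces conventional rule-based behaviour models with a machine learning behaviour model in traffic microsimulation and analyses the simulated conflicts with Extreme Value Theory to predict crash frequency. On five signalised intersections in Leeds, UK, the ML model yields crash predictions in line with real-world crash data without location-specific calibration.

Link: https://arxiv.org/abs/2606.12500

Authors: Xian Liu, Carlo G. Prato, Gustav Markkula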

Abstract: Traffic microsimulation combined with surrogate safety measures has increasingly been used as a proactive alternative to historical crash data for predicting crash frequency for current or planned road infrastructure designs. However, existing microsimulation-based safety studies have adopted simplified rule-based behaviour models, which reproduce traffic flow reasonably well but often fail to generate realistic conflict dynamics, limiting crash prediction accuracy. Recent advances in machine learning (ML)-based behaviour models offer a promising opportunity to improve microsimulation realism and crash frequency predictions by learning human driving behaviour directly from large-scale trajectory datasets. To investigate this possibility, traffic microsimulation was conducted for five real-world signalised intersections in Leeds, UK, using both a standard rule-based model and a state-of-the-art ML model. Simulated vehicle trajectories were analysed using a two-dimensional Time-to-Collision metric to identify simulated conflicts, which were then modelled using Extreme Value Theory to predict crash frequency. Results show that conflicts from the ML model yielded crash predictions in line with the real-world crash data, whereas the rule-based model did not permit meaningful predictions, presumably due to a lack of model calibration to the specific simulated intersections. Directly using ML-generated simulated crashes to predict real-world crash frequency also yielded poor results, suggesting that while current ML models can realistically reproduce conflicts, they are not yet able to generate realistic crashes. Overall, the findings demonstrate that ML-based behaviour models are promising for improving crash prediction from simulated conflicts, without a need for location-specific model calibration, and suggest clear future directions for ML-based traffic microsimulation.

107. Physics-Informed Neural Networks for Chemotherapy Pharmacokinetics: Benchmarking the Clinical Estimator and Exposing Parameter Identifiability

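AI summary: Applies physics-informed neural networks (PINNs) to chemotherapy pharmacokinetics. The PINN matches the clinical standard method on the linear two-compartment model, exposes parameter non-identifiability in the Michaelis-Menten extension, and partially recovers identifiability from sparse tissue observations.

Link: https://arxiv.org/abs/2606.12658

Authors: Riya Bisht, Dhruv Agarwal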

Abstract: Physics-Informed Neural Networks (PINNs) are an attractive tool for partial-observation problems in biology, where the governing dynamics are known but some compartments cannot be measured. Chemotherapy pharmacokinetics (PK) is a clean instance: drug concentration in plasma is routinely measured, but concentration in tissue -- which determines tumour kill and off-target toxicity -- is not. We benchmark a PINN against the standard clinical baseline (nonlinear least-squares on the analytical biexponential plasma solution, hereafter NLS) and a physics-agnostic neural baseline (a data-only MLP) on two PK problems. On the linear two-compartment problem, NLS is near-optimal; the PINN matches it to within a small constant factor while also producing the tissue curve in a single training pass, whereas the data-only MLP fails on tissue by roughly 10x. On a Michaelis-Menten extension (saturable elimination), the biexponential closed form no longer exists, so NLS is mis-specified and silently returns meaningless rate constants. The PINN instead exposes a deeper fact: the Michaelis-Menten two-compartment model is non-identifiable from plasma alone, and the PINN reports this honestly by converging to a basin with k12 -> 0. Adding two sparse tissue observations largely resolves identifiability: across five seeds the PINN recovers k21 to within 1% of truth and Vmax, Km to within one standard-deviation bar, while k12 moves in the correct direction (0.02 -> 0.82) but remains ~2 sigma below truth -- a recovery the closed-form NLS estimator cannot attempt at all, because its biexponential ansatz describes only plasma. Our claim is not that PINNs beat NLS. It is that PINNs offer a uniform recipe that ties the textbook estimator on the textbook problem, exposes structural identifiability that the textbook estimator hides, and absorbs heterogeneous measurements within a single loss.
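
To make the linear two-compartment setting concrete, the sketch below integrates a textbook form of the model with a fixed-step RK4 scheme. It is an illustration under stated assumptions, not the paper's code: `k12` and `k21` follow the abstract's naming, while the elimination rate `k10`, equal compartment volumes, a bolus initial condition, and the function name are assumptions. The Michaelis-Menten extension is not covered.

```python
def simulate_two_compartment(c_plasma0, k10, k12, k21, t_end, dt=0.01):
    """RK4 integration of dCp/dt = -(k10 + k12) Cp + k21 Ct, dCt/dt = k12 Cp - k21 Ct."""

    def rhs(cp, ct):
        # Plasma loses drug to elimination and tissue, and regains it from tissue.
        return (-(k10 + k12) * cp + k21 * ct, k12 * cp - k21 * ct)

    cp, ct = c_plasma0, 0.0  # bolus into plasma, empty tissue compartment
    traj = [(0.0, cp, ct)]
    for i in range(int(round(t_end / dt))):
        a = rhs(cp, ct)
        b = rhs(cp + 0.5 * dt * a[0], ct + 0.5 * dt * a[1])
        c = rhs(cp + 0.5 * dt * b[0], ct + 0.5 * dt * b[1])
        d = rhs(cp + dt * c[0], ct + dt * c[1])
        cp += dt * (a[0] + 2 * b[0] + 2 * c[0] + d[0]) / 6
        ct += dt * (a[1] + 2 * b[1] + 2 * c[1] + d[1]) / 6
        traj.append(((i + 1) * dt, cp, ct))
    return traj
```

Only the plasma column of such a trajectory is observable in practice, which is the partial-observation setting the abstract describes; the tissue column is what the PINN is asked to reconstruct.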

108. Forecasting Is Not Attribution: Localizing Decoder Bypass in Graph-Based Neural Marketing Mix Models

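AI summary: Addresses graph-based neural marketing mix models that forecast accurately yet fail at attribution. Proposes the DICE-MMM framework, which diagnoses and localizes attribution bypass by restricting the decoder's communication paths; the experiments show that low forecasting error does not guarantee correct attribution.

Link: https://arxiv.org/abs/2606.12687

Institutions: University of California, Irvine; AdsGency AI

Authors: Yunbo Wang, Bolbi Liu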

Abstract: Marketing mix models are used to forecast business outcomes and to attribute those outcomes to marketing channels, but these goals are not equivalent. We study a failure mode in graph-based neural MMM called attribution bypass: a high-capacity decoder can obtain low forecasting error through target autoregression, dense communication, co-movement, context, or latent memory while failing to route counterfactual sensitivity through the graph used as the attribution object. We introduce DICE-MMM as a bounded diagnostic and training framework. We do not claim that observational neural MMM identifies causal effects. Instead, DICE separates three questions often conflated in graph-based MMM: graph recovery, forecasting accuracy, and whether the trained decoder's perturbation-induced influence is graph aligned. Stage 1 trains a graph encoder with a restricted graph-mediated decoder. Stage 2 freezes the selected encoder and trains a graph-safe latent decoder whose cross-node communication must pass through the supplied graph. Decoder use is evaluated with CIG, AR-CIG, and graph-swap tests. Across controlled R/d/T swaps and an external multi-graph rawlog stress test, DICE improves stable graph recovery over CausalMMM. The experiments show that forecasting accuracy is not an attribution certificate: in a sparse-target benchmark, no-graph and full-graph decoders achieve MSE@7 around 0.004 while AR-CIG nAUPRC remains near or below zero, whereas an oracle graph reaches 0.807 +/- 0.129 at comparable MSE. Frozen graph-swap localizes the bottleneck: the same DICE-hard-trained decoder moves from nAUPRC -0.044 +/- 0.006 under learned graph inputs to 0.894 +/- 0.027 with the oracle graph. The contribution is a stress test and failure-localization framework showing that low MSE can hide attribution bypass and that the unresolved bottleneck is graph-support selection, not forecasting or decoder capacity.

109. LLM-Powered Personalized Glycemic Assessment in Type 2 Diabetes with Wearable Sensor Data

AI summary: Proposes GlyLLM, a framework that uses a large language model to integrate wearable sensor data with structured metadata for personalized modeling of glycemic dynamics. It outperforms traditional ML methods by 13.66% in RMSE for glucose forecasting and 13.08% in AUROC for diabetes categorization.

Link: https://arxiv.org/abs/2606.12699

Institutions: Department of Information Systems and Cybersecurity, The University of Texas at San Antonio; School of Engineering Medicine, Texas A&M University; Department of Family and Community Medicine, The University of Texas at San Antonio

Authors: Yifan Gao, Yanmin Gong, Yun Shi, Yuanxiong Guo

Abstract: Type 2 Diabetes (T2D) poses an increasing global health threat, demanding effective glycemic assessment to support personalized and improved diabetes care. Wearable sensors such as continuous glucose monitors (CGM) and fitness trackers offer many valuable insights for glycemic assessment. However, effectively analyzing these data requires integration with essential individual-level context. Existing methods are often based on traditional machine learning (ML), rely primarily on historical blood glucose measurements, and overlook personalized information, which limits their performance across diverse diabetes populations. Recent advances in large language models (LLMs) have demonstrated their ability to integrate diverse data modalities while modeling sequential dependencies, motivating the exploration of their potential for personalized glycemic assessment. In this paper, we propose GlyLLM, an LLM-powered framework for modeling CGM-based glycemic dynamics through the integration of wearable sensor data and structured metadata. GlyLLM can leverage the extensive prior knowledge of pre-trained LLMs and achieve sensor-text semantic abstraction at decision time. Experiments on two related tasks on the AI-READI dataset demonstrate that our model outperforms traditional ML methods by an average of 13.66% in Root Mean Squared Error (RMSE) for glucose forecasting and 13.08% in Area Under the Receiver Operating Characteristic (AUROC) for diabetes categorization. Additionally, our ablation study shows that diabetes surveys and biometric tests are more critical than other health information for glycemic assessment. Our work presents a promising step toward harnessing the power of LLMs to advance personalized glycemic assessment in T2D care.

110. Physics-Informed Neural Networks and Radial Basis Functions for PDEs with Dirac Delta Sources

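AI summary: For PDEs with Dirac delta terms, interprets physics-informed neural networks as Residual Least Squares (RLS) methods, handles the delta terms directly through the weak form, and compares against a radial basis function (RBF) expansion. RBF-RLS proves more stable on transport problems.

Link: https://arxiv.org/abs/2606.12735

Institutions: Department of Civil and Environmental Engineering, University of Illinois Urbana-Champaign

Authors: Manuel Reyna, Alexandre Tartakovsky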

Abstract: Physics-Informed Neural Networks (PINNs) are a machine learning method for solving forward and inverse Partial Differential Equations (PDEs). When applied to PDEs with Dirac delta functions in the forcing terms, boundary conditions, or initial conditions, PINNs require approximating them with smooth surrogate functions, a practice that can introduce significant modeling errors. In this work, we exploit the interpretation of PINNs as Residual Least Squares (RLS) methods and show that this perspective enables direct treatment of Dirac delta terms by integrating the weak-form equation. Among RLS formulations other than PINN, we focus on the Radial Basis Function (RBF) expansion (also known as a single-layer RBF Network). We show that while integrating out the Dirac delta in PINNs causes residuals to fail to converge to zero, RBF-RLS consistently provides good forward and inverse solutions to transport problems. We explain this finding using the Neural Tangent Kernel (NTK) theory. We test both approaches on linear PDEs that represent groundwater flow and transport in porous media and rivers. We solve inverse problems to fit synthetic data, noisy synthetic data, and real-world measurements.

111. Interpretable Factor Decomposition for Decision Intelligence in Large-Scale Financial Markets: Evidence from China's A-Share Market

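AI summary: Proposes an interpretable machine learning pipeline that decomposes cross-sectional stock return predictability into auditable factor contributions. Using XGBoost with TreeSHAP on China's A-share market, it finds that behavioral signals account for 58.2% of predictive attribution.

Link: https://arxiv.org/abs/2606.12843

Authors: Xiao Han, Yao Xiao, Zhen Zhang, Moxuan Zheng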

Abstract: We present an interpretable machine learning pipeline to decompose cross-sectional equity return predictability into auditable factor contributions. We apply an XGBoost model with TreeSHAP attribution and conduct stress testing on 3632 Chinese A-share stocks from 2009 until 2019. Using 60-month rolling windows over 55 months of out-of-sample data, XGBoost obtains a mean AUC of 0.547 and a long-short spread of +2.38%/month between the top and bottom quintiles (Newey-West t = 5.94; annualized Sharpe 2.23). This alpha persists after adjusting for the Carhart four-factor model (+2.31%/month; t = 7.48). SHAP decomposition indicates that behavioral signals (turnover and momentum) account for 58.2% of predictive attribution compared to 10.7% for valuation ratios, on average, across 55 industry groups. Ablation analysis cross-validates this ranking and provides evidence that SHAP and ablation diverge in a manner that highlights a feature substitutability structure that is largely invisible to either method used in isolation.
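
The headline numbers in this abstract are a quintile long-short spread and an annualized Sharpe ratio. The arithmetic behind both can be sketched as below; this is a simplified illustration with hypothetical function names, assuming equal-weighted quintiles, zero risk-free rate, and square-root-of-12 annualization of monthly figures. It does not reproduce the paper's Newey-West or Carhart adjustments.

```python
import math
import statistics


def long_short_spread(scores, returns, n_groups=5):
    """Mean return of the top-scored group minus that of the bottom-scored group."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    size = len(order) // n_groups
    if size == 0:
        raise ValueError("need at least n_groups stocks")
    bottom, top = order[:size], order[-size:]
    return (statistics.mean(returns[i] for i in top)
            - statistics.mean(returns[i] for i in bottom))


def annualized_sharpe(monthly_spreads):
    """Monthly mean over monthly standard deviation, scaled by sqrt(12)."""
    mean = statistics.mean(monthly_spreads)
    std = statistics.stdev(monthly_spreads)  # sample standard deviation
    return mean / std * math.sqrt(12)
```

Under these conventions, a mean spread of 2.38% per month with an annualized Sharpe of 2.23 would correspond to a monthly spread volatility of roughly 3.7%.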

112. Predicting Cognitive Load from Speech and Interaction Dynamics in Dyadic Conversations

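AI summary: Studies whether speech and interaction dynamics predict perceived cognitive load in natural collaborative conversations. Conversational interaction features such as turn-taking prove useful for predicting cognitive-load dimensions including time pressure and mental work.

Link: https://arxiv.org/abs/2606.12971

Institutions: Department of Computer Science, Colby College

Authors: Tahiya Chowdhury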

Abstract: Estimating cognitive load from speech has largely been studied in controlled laboratory settings, with limited understanding of its reliability in natural collaborative conversations. We investigate whether speech and interaction dynamics predict perceived cognitive load during dyadic conversations. We analyze audio from 53 dyads performing nine collaborative tasks and extract static acoustic, dynamic, and interaction features to train a two-head Gated Recurrent Unit encoder to predict cognitive load scores. Results show conversational interaction provides useful signals for predicting cognitive load related to time pressure, mental work, effort, and task performance. Temporal demand is associated with turn-taking dynamics such as overlap and speaker switch, while mental demand is linked to imbalanced participation between speakers. These findings highlight the importance of task structure and conversational interaction for modeling cognitive load in natural collaborative settings.

113. scLLM-DSC: LLM-Knowledge Enhanced Cross-Modal Deep Structural Clustering for Single-Cell RNA Sequencing

AI summary: Proposes scLLM-DSC, a framework that uses LLM-derived knowledge to improve clustering of single-cell RNA sequencing data through cross-modal contrastive alignment of a knowledge-driven semantic view and a structure-aware topological view. It significantly outperforms eleven state-of-the-art baselines in clustering accuracy.

Link: https://arxiv.org/abs/2606.13007

Institutions: Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences; School of Computing and Information Technology, Great Bay University; School of Engineering, Westlake University

Authors: Ping Xu, Pengjiang Li, Tian Du, Zaitian Wang, Jiawei Gu, Ziyue Qiao, Pengfei Wang, Yuanchun Zhou

Abstract: Clustering is fundamental to scRNA-seq analysis, serving as a cornerstone for identifying cell populations and resolving tissue heterogeneity. However, existing methods focus on mining numerical statistical patterns, suffering from semantic agnosticism by neglecting the intrinsic biological functions encoded by genes. While Large Language Models (LLMs) offer promising semantic capabilities, their direct adaptation to cell clustering is hindered by the structural mismatch between generative pre-training objectives and discriminative downstream tasks. To bridge this gap, we propose scLLM-DSC, a novel LLM-Knowledge Enhanced Cross-Modal Deep Structural Clustering framework. Diverging from data-driven paradigms, scLLM-DSC establishes a semantically-grounded representation by synergizing two views: a Knowledge-Driven Semantic View derived from NCBI gene priors and contextualized Cell2Sentence embeddings, and a Structure-Aware Topological View extracted via a graph-guided encoder. Crucially, we introduce a cross-modal contrastive alignment mechanism to enforce consistency between biological semantics and transcriptomic features within a unified latent space. Extensive benchmarks demonstrate that scLLM-DSC significantly outperforms eleven state-of-the-art baselines in clustering accuracy.

114. CausalMoE: A Billion-Scale Multimodal Foundation Model for Granger Causal Discovery with Pattern-Routed Heterogeneous Experts

AI summary: Proposes CausalMoE, a billion-scale multimodal Granger causal foundation model. A pattern-routed mixture of heterogeneous experts decouples regime-specific mechanisms, and causality-aware self-attention combined with LLM/VLM priors recovers sparse causal graphs. The model reports state-of-the-art results on fully supervised benchmarks and generalizes to few-shot settings.

Link: https://arxiv.org/abs/2606.13024

Institutions: State Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University; National Institute of Health Data Science, and Institute for Artificial Intelligence, Peking University

Authors: Bo Liu, Di Dai, Jingwei Liu, Jiarui Jin, Xiaocheng Fang, Guangkun Nie, Hongyan Li, Shenda Hong

Abstract: Granger Causal Discovery (GCD) is fundamental for analyzing temporal dependencies in complex systems. However, existing neural GCD methods predominantly rely on a "one-size-fits-all" paradigm, struggling to capture distribution shifts and dynamic regime changes inherent in real-world time series. This often leads to entangled representations and spurious causal graphs. In this paper, we propose CausalMoE, a billion-scale multimodal Granger causal foundation model that explicitly models patch-level heterogeneity. CausalMoE introduces a Pattern-Routed Mixture of Heterogeneous Experts, which dynamically identifies latent temporal patterns and routes patches to specialized domain experts, effectively decoupling regime-specific mechanisms from shared dynamics. To ensure interpretable graph recovery, we design a Causality-Aware Self-Attention mechanism operating across variables, yielding sparse Granger causal graphs via proximal optimization. Furthermore, CausalMoE is the first to integrate LLMs and VLMs to align numerical signals with textual and visual priors, regularizing causal estimation in complex scenarios. Extensive experiments demonstrate that CausalMoE establishes a new state-of-the-art on fully supervised benchmarks, while effectively generalizing to few-shot settings where traditional methods fail.

115. A green solvent screening tool for emerging materials via uncertainty aware, transformer enhanced transfer learning

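AI summary: Proposes a transfer-learning approach that combines a pre-trained Transformer model with uncertainty quantification to predict solubility parameters accurately from very little data, and delivers a customizable green-solvent screening tool.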

Link: https://arxiv.org/abs/2606.13060

Institutions: Technical University of Munich; Institute of Structure of Matter – National Research Council Rome (ISM-CNR); University of Rome "Tor Vergata"

Authors: Ioannis Kouroudis, Simon Ternes, Zhaosu Gu, Gohar Ali Siddiqui, Marina Ustinova, Angelo Lembo, Alessio Gagliardi, Aldo Di Carlo

Abstract: Accurate prediction of solubility remains a central challenge across materials science and sustainable chemistry. In particular, due to emerging technologies like organic and hybrid photovoltaics, batteries, and catalysis, solvent usage is expected to increase significantly within the coming years. Therefore, substituting solvents with greener alternatives is vital. This is where machine learning can have substantial impact. However, the limited data on critical parameters of solubility significantly constrains machine learning efficacy. In this work, we transfer a foundational model pre-trained on QM9 targets to our application with minimal data requirements. Additionally, the pipeline integrates uncertainty quantification, allowing the user to gauge the confidence of the predictions. As a baseline, we succeed in predicting the Hansen solubility parameters and dielectric constant, for which extensive databases exist. Importantly, we achieve high model performance on additional targets, such as Gutmann donor and acceptor numbers, where the available data is extremely limited. Overall, we augment data on solubility descriptors by orders of magnitude with high-quality predictions. For effective dissemination, we deploy an easy-to-use, customizable tool, easily integrated with high-throughput labs, for ranking and screening possible solvent substitutes. Finally, we rediscovered known green solvent alternatives and proposed new candidates, demonstrating the tool's relevance for finding eco-friendly solvents.

116. Getting Better at Working With You: Compiling User Corrections into Runtime Enforcement for Coding Agents

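AI summary: Proposes TRACE, which rewrites user corrections as atomic rules and enforces them at runtime, substantially reducing preference violations by coding agents on later tasks and outperforming memory-only approaches.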

Link: https://arxiv.org/abs/2606.13174

Institutions: University of Notre Dame; IBM Research; Tencent AI Lab

Authors: Yujun Zhou, Kehan Guo, Haomin Zhuang, Xiangqi Wang, Yue Huang, Zhenwen Liang, Pin-Yu Chen, Tian Gao, Nuno Moniz, Nitesh V. Chawla, Xiangliang Zhang

Abstract: Interactive LLM agents are becoming part of daily work, but they do not reliably become easier to work with over time: a correction remembered in one session may still be violated in the next. We study this gap between preference access and preference compliance. In tasks derived from anonymized real-user friction cases, Mem0 memory still leaves 57.5% of applicable preference checks violated. We introduce Test-time Rule Acquisition and Compiled Enforcement (TRACE), a drop-in skill-layer pipeline for coding-agent runtimes that mines user corrections, rewrites them as atomic rules, and compiles them into runtime checks that must pass before an agent completes future tasks. Unlike runtime checks written ahead of time by developers, TRACE skills come from the user's own chat corrections. We evaluate TRACE with simulated user-in-the-loop experiments on ClawArena coding-agent tasks and MemoryArena-derived memory-intensive tasks. On ClawArena, TRACE reduces held-out preference violation from 100.0% to 37.6% on in-distribution tasks and from 100.0% to 2.0% on out-of-distribution tasks. On MemoryArena-derived tasks, TRACE reduces in-distribution violation from 100.0% to 60.5% while matching or exceeding the strongest memory baseline on task pass. These results suggest that compiling corrections into runtime enforcement can address a repeated-friction failure mode that memory alone does not reliably solve, reducing the need for users to restate the same correction across future sessions. Experiment code is available at https://github.com/YujunZhou/TRACE_exp, and the deployable skill is available at https://github.com/YujunZhou/tellonce.
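
To make the idea of "compiling corrections into runtime checks" concrete, here is a minimal sketch. It assumes each atomic rule can be expressed as a forbidden regular-expression pattern over the agent's output; the `Rule` class, the function names, and the example rules are hypothetical and do not reflect the actual TRACE implementation or its repositories.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Rule:
    """Hypothetical atomic rule: a name plus an executable compliance check."""
    name: str
    check: Callable[[str], bool]  # returns True when the output complies


def compile_correction(name: str, forbidden_pattern: str) -> Rule:
    """Turn one 'never do X' correction into an executable check."""
    pattern = re.compile(forbidden_pattern)
    return Rule(name=name, check=lambda text: pattern.search(text) is None)


def violations(output: str, rules: List[Rule]) -> List[str]:
    """Names of all rules that the candidate output violates."""
    return [rule.name for rule in rules if not rule.check(output)]


def may_complete(output: str, rules: List[Rule]) -> bool:
    """Gate: the task may finish only when every compiled check passes."""
    return not violations(output, rules)
```

In this sketch the difference from a memory-only approach is that a remembered preference is not just retrieved as text; it becomes a check that blocks completion until the output complies.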

117. Decoding Insect Song: A Multitask Semisupervised Orthoptera Bioacoustic Classifier

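AI summary: Proposes PULSE, a semi-supervised multi-task framework that combines weakly supervised classification, self-supervised learning, and knowledge distillation; it outperforms a general-purpose model on Orthoptera bioacoustic classification, and active learning improves performance further.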

Link: https://arxiv.org/abs/2606.13236

Institutions: University of Oxford

Authors: Olga Isupova, Danil Kuzin, Ella Browning, Tom Mills, Steven Reece

Abstract: Passive acoustic monitoring holds great promise for ecological inference, yet existing automated tools are typically narrowly trained and non-transferable. We address these limitations with PULSE, a semi-supervised, multi-task framework for Orthoptera bioacoustics, combining weakly-supervised species classification, self-supervised learning on unlabelled field audio, and knowledge distillation from a general-purpose bioacoustic model. Our domain-adapted specialist model outperforms a state-of-the-art general model across all metrics (macro F1: 0.21 vs. 0.07; AUC: 0.74 vs. 0.45; AP: 0.32 vs. 0.19), with active learning further raising F1 to 0.34 and AUC to 0.84. Beyond classification, the learned embeddings encode ecologically meaningful structure, exposed through an interactive visualisation tool for ecological discovery.

118. To GAN or Not To GAN: Segmentation Analysis on Mars DEM

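AI summary: Automatically detects mounds on Mars using supervised semantic segmentation and a generative adversarial approach, compares the two, and finds that adding artificially generated data did not improve results.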

Link: https://arxiv.org/abs/2606.13252

Institutions: University of Passau

Authors: Douglas Dziedzorm Agbeve, Aditya V. Handrale, Salim Fares, Seif E. Idani

Abstract: To better understand the Martian surface, which is needed to enable rovers to navigate Mars with ease, it is necessary to be able to determine the location of mounds. Detecting and studying these morphologies can also help us find evidence of extraterrestrial life, in this case, more specifically, water or signs of life-conducive environments. Detection of mounds was done by manually mapping morphological parameters onto Digital Elevation Models. This paper solves the problem by automatically detecting and/or predicting mounds on Mars using neural-network-based semantic segmentation methodologies. This is done by using a supervised semantic segmentation model and a generative adversarial approach. A comparison of the approaches shows that adding extra artificially generated data did not improve the result.

119. Once-for-All: Scalable Simultaneous Forecasting via Equilibrium State Estimation

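AI summary: Proposes the Equilibrium State Estimation (ESE) paradigm, which estimates the equilibrium state of multiple systems in a single pass and forecasts from the gap between the current state and that equilibrium, achieving a 10-70x speedup without loss of accuracy, with linear-time complexity and robustness to perturbations.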

Link: https://arxiv.org/abs/2606.13285

Institutions: RMIT University; Monash University; University of Adelaide

Authors: Beinan Xu, Andy Song, Jiti Gao, Feng Liu

Abstract: We introduce Equilibrium State Estimation (ESE), a novel paradigm for simultaneous prediction, where multiple interacting systems require separate yet coordinated forecasts. Such scenarios often arise in real-world settings such as economics and healthcare modeling. Unlike existing approaches that predict one system at a time, ESE forecasts all systems in a single pass. It first estimates the equilibrium state across systems, then generates holistic forecasts based on the difference between the current state and the estimated equilibrium. Extensive experiments on synthetic and real-world datasets, including currency exchange and COVID-19 spread modeling, demonstrate that ESE is at least as accurate as state-of-the-art (SOTA) methods while being significantly faster. In addition, ESE integrates seamlessly with conventional predictors, combining their accuracy with its exceptional efficiency and delivering a 10-70x speedup. With linear-time complexity, ESE scales far better than SOTA methods as the number of systems increases. Moreover, it remains accurate under diverse perturbations, establishing ESE as a fast, generalizable, robust, and scalable multi-prediction method.
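
The two-step structure described in the abstract (estimate an equilibrium, then forecast from each system's gap to it) can be sketched as a toy example. The abstract does not specify the estimator or the forecast rule, so the version below makes two loudly stated assumptions: the equilibrium is taken to be the cross-system mean, and each system relaxes linearly toward it at a fixed hypothetical `rate`. It is not the paper's method.

```python
def estimate_equilibrium(states):
    """Assumed estimator: the mean of the current states across systems."""
    return sum(states) / len(states)


def forecast_all(states, rate=0.5):
    """Forecast every system in one pass from its gap to the equilibrium.

    `rate` is a hypothetical relaxation coefficient in [0, 1].
    """
    equilibrium = estimate_equilibrium(states)
    return [s + rate * (equilibrium - s) for s in states]
```

Both functions touch each system once, so the cost of this toy version grows linearly with the number of systems, which matches the scaling behaviour the abstract claims for ESE.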

120. Rarity-Gated Context Conditioning for Offline Imitation Learning-Based Maritime Anomaly Detection

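AI summary: Proposes RGFiLM, a module that uses a rarity-controlled gate to adjust how strongly context modulates features, addressing the high false-alarm rate caused by rare contexts in contextual anomaly detection; it achieves the best F1-FPR trade-off on maritime trajectory anomaly detection.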

Link: https://arxiv.org/abs/2606.13311

Institutions: Department of Industrial Engineering, Ulsan National Institute of Science and Technology (UNIST)

Authors: Yongmin Kim, ByeongHoon Jeon, Sungil Kim

Abstract: Contextual anomaly detection aims to identify abnormal behavior conditional on context variables, but practical deployments often face highly imbalanced context distributions where rare regimes can carry critical information. Under such frequency bias, context-conditioned models can produce unstable decisions and excessive false alarms in rare contexts. We propose Rarity-Gated Feature-wise Linear Modulation (RGFiLM), a rarity-aware conditioning module that combines feature-wise modulation (i.e., context-conditioned scaling and shifting of hidden features) with a gate controlled by a data-driven rarity score. The rarity score is estimated from the empirical distribution of context variables and regulates how strongly context modulates intermediate representations: the gate becomes more decisive under rare contexts while remaining conservative under frequent contexts. We evaluate RGFiLM on maritime trajectory anomaly detection using AIS motion sequences with ERA5 environmental context in an environment-sensitive detour scenario. When instantiated in a sequential anomaly scoring pipeline, RGFiLM achieves the best mean F1–False Positive Rate (FPR) trade-off among the compared context-agnostic and context-conditioned methods. These results suggest that explicitly accounting for context rarity is an effective approach for reducing false alarms in context-sensitive anomaly detection.
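
A small sketch may help show how a rarity-controlled gate can scale feature-wise modulation. It assumes discrete context labels, a rarity score of one minus the empirical frequency, a sigmoid gate, and an interpolation between the unmodulated features and the FiLM output. These specific choices, along with the names `sharpness` and `threshold`, are assumptions for illustration; the paper's actual score and gate may differ.

```python
import math
from collections import Counter


def rarity_scores(contexts):
    """Assumed rarity score: 1 - empirical frequency of each context label."""
    counts = Counter(contexts)
    total = len(contexts)
    return {label: 1.0 - counts[label] / total for label in counts}


def gate(rarity, sharpness=10.0, threshold=0.5):
    """Sigmoid gate: near 1 for rare contexts, near 0 for frequent ones."""
    return 1.0 / (1.0 + math.exp(-sharpness * (rarity - threshold)))


def gated_film(h, gamma, beta, g):
    """Blend hidden features with their FiLM-modulated version.

    FiLM output is gamma * h + beta (feature-wise scale and shift);
    g = 0 leaves h unchanged, g = 1 applies full modulation.
    """
    return [x + g * ((ga * x + be) - x) for x, ga, be in zip(h, gamma, beta)]
```

With nine "calm" observations and one "storm" observation, for example, the storm context receives a gate close to 1 and is strongly modulated, while the calm context is left almost untouched.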

13. Other / General Machine Learning | 2 papers

121. The Mathematics of AI Winters: The mathematical Taxonomy of Paradigm Fragility in AI Winter

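AI summary: Offers a mathematical account of the AI winters, analyzing why early AI paradigms failed through formal bottlenecks (perceptron impossibility results, the complexity of neural-network training, high-dimensional nonparametric estimation rates, vanishing gradients, and statistical learning theory) and relating them to later breakthroughs.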

Link: https://arxiv.org/abs/2606.12610

Institutions: AIFI; Staq.io

Authors: Miquel Noguer i Alonso, David Pacheco Aznar

Abstract: Two major periods of reduced funding and confidence in artificial intelligence research, commonly called the first and second AI winters, are usually explained through engineering failure, commercial disappointment, and inflated expectations. This article develops a complementary thesis: that the dominant paradigms of those periods also met genuine formal barriers, including limitations of representation, optimisation, computational complexity, statistical learnability, and high-dimensional approximation. The contribution is synthetic rather than archival. We do not claim that particular theorems mechanically caused the winters; rather, we show that several central disappointments of early AI were aligned with mathematically precise bottlenecks. We analyse these bottlenecks through the perceptron impossibility results of Minsky and Papert, the complexity-theoretic hardness of exact neural-network training established by Blum and Rivest, minimax rates for nonparametric estimation in high dimension due to Stone, vanishing-gradient analyses by Hochreiter and by Bengio and collaborators, and classical statistical learning theory in the tradition of Vapnik and Chervonenkis, Valiant, and Blumer and collaborators. We then relate these barriers to the later breakthroughs that mitigated, rather than eliminated, them.
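
The best-known instance of the perceptron limitations mentioned above is that a single linear threshold unit cannot compute XOR, while it can compute AND. The sketch below illustrates this with a brute-force search over a coarse grid of weights and biases. A finite grid search is only an illustration, not a proof, and the grid and function names are choices made for this example rather than anything from the paper.

```python
from itertools import product

# The four binary input points shared by AND and XOR.
INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND = [0, 0, 0, 1]
XOR = [0, 1, 1, 0]

# Candidate weights and biases: -4.0 to 4.0 in steps of 0.5.
GRID = [i / 2 for i in range(-8, 9)]


def separable_on_grid(labels, grid):
    """Return (w1, w2, b) for a threshold unit that fits the labels, else None.

    The unit outputs 1 exactly when w1*x1 + w2*x2 + b > 0.
    """
    for w1, w2, b in product(grid, repeat=3):
        if all(
            (w1 * x1 + w2 * x2 + b > 0) == bool(y)
            for (x1, x2), y in zip(INPUTS, labels)
        ):
            return (w1, w2, b)
    return None
```

Calling `separable_on_grid(AND, GRID)` finds a solution, whereas `separable_on_grid(XOR, GRID)` returns `None`. The underlying reason is that XOR would require b <= 0, w1 + b > 0, w2 + b > 0, and w1 + w2 + b <= 0 at the same time, which is contradictory.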

122. Order Is Not Control

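AI summary: Argues that order is not control, proposes a receiver-gated response law, and examines it across biological, LLM, adapter, and stochastic-operator panels, indicating that control is local and measurable.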

Link: https://arxiv.org/abs/2606.12923

Institutions: Australian Broadcasting Corporation

Authors: Gareth Seneque, Lap-Hang Ho, Nafise Erfanian Saeedi, Jeffrey Molendijk, Tim Elson

Abstract: AI alignment, interpretability, steering, and neural perturbation studies identify order-inducing objects. We argue that order is not control. Control requires a receiver-gated response law: a denominator-indexed operator mapping material state, action/drive, bath, and receiver state to response displacement, sinks, effort, and basin projection. We identify it across biological, LLM, adapter, and stochastic-operator panels. The laws are local: an intervention can be admitted, saturated, sign-changing, leaky, or overdriven depending on medium, bath, receiver state, action port, and comparator. Control is assigned when finite effort moves a target or outcome-readout class under the same denominator while damage, null/evasive, invalid format, overdrive, and unnecessary effort stay bounded. Mouse ALM, C. elegans, and zebrafish panels provide physical response-operator evidence while excluding coordinate identity and controller conclusions. LLM panels show generated-output response laws: across four material conditions, response vectors are predictable at 72.8-73.7% component-sign accuracy, rising to 84.3-84.8% on nonzero components; held-out observers predict system-effect and target/oracle families at 93.6% and 91.7% accuracy. Constitution-conditioned adapters reshape susceptibility as prepared media, and stochastic-operator panels separate measured opportunity from deployable action policies. This gives a driven-dissipative response-system account at the mesoscopic control level: drives act through prepared media, baths, and receivers, producing admitted movement, impedance, sinks, or overdrive. The evidence supports local admitted control and measurable stochastic response operators, while leaving deployable pre-generation control, hidden/logit causal sufficiency, biological-to-LLM coordinate identity, and literal thermodynamic quantities outside scope.