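Machine Learning Academic Digest [6.17]

2026-06-17 | cs.LG Machine Learning | 95 papers in total

[Institutions] information is generated by AI analysis and may contain errors. It is for reference only; the paper itself is authoritative.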

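Quick navigation

1. Deep Learning Architectures and Training Methods | 13 papers

2. Representation Learning, Self-Supervised and Contrastive Learning | 3 papers

3. Reinforcement Learning and Sequential Decision-Making | 11 papers

4. Generative Models and Probabilistic Modeling | 8 papers

5. Optimization, Generalization, and Theoretical Analysis | 6 papers

6. Efficient Learning, Compression, and Deployment | 8 papers

7. Federated Learning, Privacy, and Security | 1 paper

8. Robustness, Uncertainty, and Trustworthy Learning | 5 papers

9. Graph Learning and Structured Data | 6 papers

10. Transfer, Meta-, and Continual Learning | 6 papers

11. Datasets, Benchmarks, and Evaluation | 6 papers

12. Machine Learning Applications | 20 papers

13. Other / General Machine Learning | 2 papers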

1. Deep Learning Architectures and Training Methods | 13 papers

1. Models Take Notes at Prefill: KV Cache Can Be Editable and Composable

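AI summary: The study finds that the KV cache stores conclusions like a notebook and supports both editing and composition. Editing a single field can correct the decision (accuracy 1.00 on an 8B model at ~1% of the compute), and precompiled skills can be spliced into any context (logit cosine similarity 0.90-0.999) at O(L) rather than O(L^2) time-to-first-token.

Link: https://arxiv.org/abs/2606.17107

Institutions: Pine AI

Author: Bojie Li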

Abstract: Prefix caching reuses prefill only across an exactly shared prefix, so one changed field invalidates the entire downstream cache. Yet overwriting the field's own key/value vectors and reusing the rest leaves the model acting on the old value. The reason, established causally across four model families: at prefill the model has already written the field-conditioned conclusion onto downstream notes; the field's own key/value drives under 1% of the decision. Read as a notebook of memoized conclusions, two capabilities follow. (1) It is editable. A salient erratum amends the notes; and with chain-of-thought, editing the field alone recovers the decision (1.00 at 8B, ~1% compute), while without CoT it is ignored. (2) It is composable. The notes are position-portable, so a precompiled skill can be RoPE-repositioned and spliced into any context, indistinguishable from full recompute (logit cosine 0.90-0.999, twelve models) at O(L) rather than O(L^2) time-to-first-token. A unified edit+compose agent stays decision-identical to recompute at up to 14.9x lower latency. The approach applies to any per-token attention KV cache, validated across scale, quantization, Mixture-of-Experts, and multimodal caches, and extends to several attention variants through small adapters. Because the erratum is append-only, it composes with production prefix caching: in an online vLLM benchmark it keeps the prefix cache-aligned (98.5% hit-rate), cutting p90 time-to-first-token by 53-398x.
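
The "RoPE-repositioned" step in the abstract relies on a general property of rotary embeddings: rotations add, so a key cached at one position can be moved to another by applying one more rotation. The sketch below checks only that algebraic fact. The helper `rope_rotate`, the interleaved-pair layout, and the base of 10000 are assumptions for illustration and are not taken from the paper; whether the spliced cache behaves like a full recompute is the paper's empirical claim and is not shown here.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate each 2-D pair of `x` by an angle proportional to `pos` (hypothetical helper)."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)  # one frequency per pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
key = rng.normal(size=64)                # un-rotated key vector

cached = rope_rotate(key, pos=5)         # key as stored in a cache at position 5
moved = rope_rotate(cached, pos=37)      # shift the cached key by 37 positions
direct = rope_rotate(key, pos=42)        # key computed directly at position 42

max_err = np.max(np.abs(moved - direct))
print(f"max abs difference: {max_err:.2e}")
```

Because the shift is a single elementwise rotation per cached key, repositioning costs time linear in the number of cached tokens.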

2. PowerOPD: Stabilizing On-Policy Distillation with Bounded Power Transformation

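AI summary: To address the training instability caused by the unbounded log-ratio reward in on-policy distillation, the paper proposes PowerOPD, a family of bounded, sign-consistent rewards based on the Box-Cox power transformation. On mathematical reasoning tasks it improves benchmark-averaged Avg@8/Pass@8 by up to +6.37/+5.71 over vanilla OPD while reducing wall-clock time by 59.2% and peak GPU memory by 23.1%.

Link: https://arxiv.org/abs/2606.17199

Institutions: Eastern Institute of Technology, Ningbo; The Hong Kong Polytechnic University; Shanghai Jiao Tong University; University of Waterloo

Authors: Anhao Zhao, Junlong Tong, Yingqi Fan, Ping Nie, Wenjie Li, Xiaoyu Shen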

Abstract: Standard on-policy distillation (OPD) for large language models estimates the reverse-KL objective using student-sampled tokens, yielding an unbiased single-sample Monte Carlo estimator that avoids vocabulary-wide computation. However, we show that this estimator suffers from severe training pathologies in practice: sample inefficiency, unstable generation dynamics, and a substantial performance gap compared to exact full-vocabulary OPD. Reward-level diagnosis traces these pathologies to the log-ratio reward, which is unbounded by construction, producing extremely high-variance gradients concentrated at early positions and persisting throughout training; standard post-hoc scaling fails as it operates only after this distortion occurs. To solve this problem, we propose PowerOPD: a family of natively bounded, sign-consistent rewards from the Box-Cox power transformation, parameterized by alpha > 0, of which the log-ratio is the degenerate alpha -> 0 limit. Across six mathematical reasoning benchmarks and four Qwen3 teacher-student pairs, PowerOPD achieves benchmark-averaged Avg@8/Pass@8 gains of up to +6.37/+5.71 over vanilla OPD, +3.01/+3.54 over post-hoc stabilization, and +2.59/+8.90 over full-vocabulary OPD, while reducing wall-clock time by 59.2% and peak GPU memory by 23.1%. Larger alpha generally improves accuracy, consistently shortens responses, and keeps gradient norms more than 3,000x smaller than vanilla OPD.
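
To make the "alpha -> 0 limit" concrete, here is a small sketch of the standard Box-Cox transform and of one way a bounded, sign-consistent reward could be built from it. The function `power_reward` (a difference of Box-Cox-transformed probabilities) is a hypothetical construction chosen for illustration; the abstract does not give the exact PowerOPD formula, so this should not be read as the authors' definition.

```python
import math

def box_cox(x, alpha):
    """Box-Cox power transform of x > 0; tends to log(x) as alpha -> 0."""
    if alpha == 0.0:
        return math.log(x)
    return (x ** alpha - 1.0) / alpha

def log_ratio_reward(p_teacher, p_student):
    """Unbounded log-ratio reward: log p_teacher - log p_student."""
    return math.log(p_teacher) - math.log(p_student)

def power_reward(p_teacher, p_student, alpha):
    """Hypothetical bounded reward (an assumption, not the paper's formula).

    For probabilities in (0, 1], each Box-Cox term lies in (-1/alpha, 0],
    so the difference lies in (-1/alpha, 1/alpha) and has the same sign
    as the log-ratio.
    """
    return box_cox(p_teacher, alpha) - box_cox(p_student, alpha)

# A token the teacher considers nearly impossible but the student sampled.
p_t, p_s = 1e-12, 1.0
print("log-ratio reward:", log_ratio_reward(p_t, p_s))
print("power reward, alpha=0.5:", power_reward(p_t, p_s, 0.5))
print("power reward, alpha=1e-6:", power_reward(0.3, 0.01, 1e-6))
print("log-ratio for comparison:", log_ratio_reward(0.3, 0.01))
```

With a larger alpha the bound 1/alpha tightens, which is consistent with the abstract's observation that larger alpha keeps gradient norms small, although the mechanism in the paper may differ from this sketch.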

3. The Discrete-Log Clock: How a Transformer Learns Modular Multiplication

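AI summary: Analysis with the multiplicative character transform shows that a transformer trained on modular multiplication learns a sparse Fourier spectrum: its embeddings and MLP neurons mainly encode a few multiplicative frequencies. This indicates that the model performs addition in discrete-log space, the "Discrete-Log Clock" algorithm.

Link: https://arxiv.org/abs/2606.17399

Institutions: Stanford University

Author: Huu Danh Nguyen (Stanford University)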

Abstract: When small transformers grok modular multiplication, prior work reports that the learned embedding has a "dense" Fourier spectrum requiring all frequencies. This contrasts with modular addition, where only a sparse set of key frequencies suffices. We show this density is an artifact of analyzing in the wrong basis. The natural Fourier transform for multiplication is not the standard additive DFT but the multiplicative character transform, which decomposes functions on the multiplicative group $(\mathbb{Z}/p\mathbb{Z})^*$ into its irreducible representations. Applying this transform to a grokked transformer trained on $a \cdot b \bmod 113$, we find the embedding spectrum becomes highly sparse (Gini coefficient 0.58 vs. 0.07 in the additive basis) with only 4 key frequencies carrying significant energy. Furthermore, 96.9% of MLP neurons are cleanly tuned to a single multiplicative frequency, and neuron activation heatmaps reveal 2D-periodic structure when reordered by the discrete logarithm. These results demonstrate the transformer reduces multiplication to addition in discrete-log space, implementing a "Discrete-Log Clock" algorithm analogous to Nanda et al.'s Clock algorithm for addition. The methodology generalizes: matching the analysis basis to the algebraic structure of the task reveals interpretable structure where standard tools see noise.
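
The reduction of multiplication to addition in discrete-log space can be verified directly for p = 113. The sketch below is plain number theory and not the paper's analysis code; the helper names are hypothetical. It finds a generator g of the multiplicative group, tabulates discrete logarithms, and multiplies nonzero residues by adding their logs modulo p - 1.

```python
P = 113  # the modulus used in the abstract

def find_primitive_root(p):
    """Return the smallest generator of the multiplicative group mod prime p."""
    for g in range(2, p):
        seen = set()
        x = 1
        for _ in range(p - 1):
            x = (x * g) % p
            seen.add(x)
        if len(seen) == p - 1:  # g hits every nonzero residue
            return g
    raise ValueError("no primitive root found")

g = find_primitive_root(P)

# dlog[a] = k such that g**k == a (mod P), for every nonzero residue a.
dlog = {pow(g, k, P): k for k in range(P - 1)}

def mul_via_dlog(a, b):
    """Multiply nonzero residues by adding discrete logs modulo P - 1."""
    return pow(g, (dlog[a] + dlog[b]) % (P - 1), P)

print("generator:", g)
print("7 * 45 mod 113 =", (7 * 45) % P, "| via discrete logs:", mul_via_dlog(7, 45))
```

Reordering residues by `dlog` is the same reindexing under which the abstract reports 2D-periodic neuron activations. Zero has no discrete logarithm and is excluded from the table.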

4. Reducing Learner Redundancy in Boosting via Residual Orthogonalization

AI summary: To address the learner redundancy caused by residual fitting in boosting, the paper proposes the SCBoost framework, which reduces redundancy through two mechanisms: Spectral Residual Projection and Covariance-Regularized Weighting. It proves geometric properties of the projection, and experiments show strong accuracy and F1 scores.

Link: https://arxiv.org/abs/2606.17567

Institutions: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; College of Information Science and Technology, Beijing University of Chemical Technology; Gaoling School of Artificial Intelligence, Renmin University of China; School of Computer Science, Central China Normal University; School of Computing, Engineering and Mathematical Sciences, La Trobe University

Authors: Ye Su, Jipeng Guo, Yong Liu, Xin Xu, Gangchun Zhang, Jinxin Chen, Di Wu, Longlong Zhao

Abstract: While sequential residual fitting is the bedrock of standard boosting frameworks, it inherently breeds learner redundancy by repeatedly revisiting correlated error components. To address this bottleneck, we propose a shift from residual fitting to residual orthogonalization and introduce SCBoost. Our framework tackles redundancy through two complementary mechanisms: Spectral Residual Projection (SRP) and Covariance-Regularized Weighting (CRW). During training, SRP projects each residual target onto the orthogonal complement of the historical prediction subspace, forcing successive learners to capture only novel empirical innovations. During aggregation, CRW optimizes ensemble weights on a validation set with an explicit covariance penalty to mitigate remaining correlations. Theoretically, we provide a finite-sample geometric characterization proving that SRP yields an exact additive residual-energy decomposition. Furthermore, under an isotropic-noise assumption, we rigorously establish the conditions under which this projection improves the effective Signal-to-Noise Ratio. Extensive experiments across ten benchmark datasets demonstrate that SCBoost delivers strong out-of-the-box performance, particularly in accuracy and F1 score. This work reinterprets boosting through a geometric lens, suggesting that explicit redundancy control is a principled and necessary step toward more efficient ensemble architectures.
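
Projecting a residual onto the orthogonal complement of earlier predictions can be sketched with an ordinary least-squares projection. This is a generic linear-algebra illustration, assuming the historical prediction subspace is the column space of a matrix of earlier learners' predictions; `project_out` and the random data are hypothetical, and the paper's SRP may include details not described in the abstract.

```python
import numpy as np

def project_out(residual, history):
    """Split `residual` into the part explained by `history` columns and the rest."""
    coef, *_ = np.linalg.lstsq(history, residual, rcond=None)
    explained = history @ coef        # component inside the historical subspace
    novel = residual - explained      # component in the orthogonal complement
    return novel, explained

rng = np.random.default_rng(0)
history = rng.normal(size=(50, 3))    # columns: predictions of 3 earlier learners
residual = rng.normal(size=50)        # current residual target

novel, explained = project_out(residual, history)

print("overlap with history:", np.abs(history.T @ novel).max())
print("energy of residual:  ", residual @ residual)
print("explained + novel:   ", explained @ explained + novel @ novel)
```

The last two printed values agree because the two components are orthogonal, which is the kind of additive energy decomposition the abstract refers to.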

5. When Dynamics Models Read the Wrong Time Steps: Label-Free Event Credit Re-Anchoring for Robust Global Readouts

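AI summary: To address temporal credit dilution in sequence-to-global interfaces, the paper proposes CREST, a training-free and label-free method that uses event-core estimation and contrastive re-anchoring to reduce out-of-distribution error and restore event credit.

Link: https://arxiv.org/abs/2606.17572

Author: Yifan Wang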

Abstract: Learned dynamics models often answer global physical questions, such as fault severity or impact stiffness, by pooling a per-step feature sequence into one readout vector. This sequence-to-global interface creates an under-studied temporal credit problem: with only trajectory-level supervision, a model can predict accurately in training conditions while reading from abundant smooth correlates rather than the brief physical events that determine the target. We call this failure temporal credit dilution. It is not exposed by the training loss and is not removed by standard physics-informed residuals, because the error lies in where the global readout assigns functional credit. We introduce Credit-in-Event, an interface-level probe for measuring how much pooled credit lands on event steps, and prove in closed form that a pooled linear reader routes credit to a spurious background channel as the event fraction shrinks. We then propose CREST, a training-free and label-free readout that estimates a transient event core from learned features and re-anchors the pooled representation through event-versus-rest contrast. Across simulated gear and impact systems, recurrent and attention encoders, and public bearing vibration data, CREST reduces out-of-distribution error while restoring event credit. Ablations show that stable-step selection and receptive-field shrinking fail, confirming that the gain comes from event-core credit re-anchoring rather than a generic locality or stability prior.

6. Conservation Laws for Modern Neural Architectures

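AI summary: The paper proposes a unified framework that characterizes gradient-flow conservation laws for feedforward networks with GELU, SiLU, and SwiGLU activations, multihead attention, and Mixture-of-Experts models; experiments validate the predicted invariants.

Link: https://arxiv.org/abs/2606.17816

Authors: Viet-Hoang Tran, Vinh Khanh Bui, Tan Lai Ngoc, Nam Nguyen, Tuan Dam, Tan M. Nguyen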

Abstract: Understanding gradient descent dynamics is key to explaining the success of over-parameterized models, where implicit bias manifests through conservation laws in gradient flow. While such laws are well understood for linear and ReLU networks, they remain largely unexplored for modern architectures. This work develops a unified framework to characterize conservation laws for contemporary models, including feedforward networks with GELU, SiLU, and SwiGLU activations, multihead attention with sinusoidal and rotary positional encodings, and Mixture-of-Experts architectures under diverse gating designs. Our theoretical findings are supported by experiments that validate the predicted invariants.
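
For readers unfamiliar with the linear-network case that the abstract calls well understood, the simplest instance is the scalar model f(x) = a * b * x with loss 0.5 * (a*b - y)^2. Under gradient flow, d(a^2 - b^2)/dt = 2a(-e*b) - 2b(-e*a) = 0 with e = a*b - y, so a^2 - b^2 is conserved. The sketch below checks this with small-step gradient descent, where the invariant holds only approximately. It illustrates the classical law only, not the paper's new results; the function name and the chosen values are hypothetical.

```python
def train_scalar_linear(a, b, y, lr=1e-3, steps=5000):
    """Gradient descent on 0.5 * (a*b - y)**2 for the two-parameter model a*b."""
    for _ in range(steps):
        err = a * b - y
        # Simultaneous update: both gradients use the old (a, b).
        a, b = a - lr * err * b, b - lr * err * a
    return a, b

a0, b0, target = 1.5, 0.5, 2.0
a1, b1 = train_scalar_linear(a0, b0, target)

invariant_before = a0 ** 2 - b0 ** 2
invariant_after = a1 ** 2 - b1 ** 2

print("product after training:", a1 * b1)
print("a^2 - b^2 before:", invariant_before)
print("a^2 - b^2 after: ", invariant_after)
```

Each discrete step rescales the invariant by a factor of 1 - (lr * err)^2, so the drift vanishes as the learning rate goes to zero, which recovers the gradient-flow statement.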

7. Functional Equivalence in Attention: A Comprehensive Study with Applications to Linear Mode Connectivity

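AI summary: The paper formally studies how positional encodings affect functional equivalence in Transformers. It finds that sinusoidal encodings preserve the symmetries of vanilla attention, whereas rotary encodings significantly shrink the symmetry group and thereby enhance expressivity. Using an alignment algorithm, it shows empirically that positional encoding plays a key role in linear mode connectivity.

Link: https://arxiv.org/abs/2606.17830

Authors: Viet-Hoang Tran, Vinh Khanh Bui, Van-Hoan Trinh, Tan Lai Ngoc, Tan M. Nguyen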

Abstract: Neural network parameter spaces are inherently non-injective, as distinct parameter configurations can realize identical functions through functional equivalence. While this symmetry is well understood in classical fully connected and convolutional models, it becomes substantially more intricate in modern attention-based architectures. Existing analyses of multihead attention have largely focused on the vanilla formulation, overlooking positional encodings that fundamentally reshape architectural symmetries. In this work, we provide a formal study of functional equivalence in Transformers with positional encodings. Focusing on the two most widely used variants--sinusoidal and rotary positional encodings (RoPE)--we show that sinusoidal encodings preserve the equivalence structure of vanilla attention, whereas rotary encodings significantly reduce the symmetry group, thereby enhancing expressivity. This offers a principled explanation for the growing prominence of RoPE in practice. We further examine how positional encodings affect linear mode connectivity, and through an alignment algorithm, empirically demonstrate that the presence and variability of connectivity across Transformer settings crucially depend on the positional encoding.
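
One familiar functional equivalence in vanilla attention is that the query and key projections can be transformed by any invertible matrix A and its inverse transpose without changing the attention scores. The sketch below checks that identity numerically. It covers only the vanilla case without positional encodings; the shapes and variable names are assumptions, and the paper's characterization of the full symmetry group, including how RoPE shrinks it, is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 6, 16, 8

X = rng.normal(size=(n_tokens, d_model))      # token representations
W_q = rng.normal(size=(d_model, d_head))      # query projection
W_k = rng.normal(size=(d_model, d_head))      # key projection

# A small perturbation of the identity, chosen so that A is safely invertible.
A = np.eye(d_head) + 0.1 * rng.normal(size=(d_head, d_head))

def attention_scores(X, W_q, W_k):
    """Pre-softmax scores Q K^T of a single vanilla attention head."""
    return (X @ W_q) @ (X @ W_k).T

# Different parameters, same function: (W_q A)(W_k A^{-T})^T = W_q W_k^T.
W_q2 = W_q @ A
W_k2 = W_k @ np.linalg.inv(A).T

scores_original = attention_scores(X, W_q, W_k)
scores_transformed = attention_scores(X, W_q2, W_k2)

print("max score difference:", np.abs(scores_original - scores_transformed).max())
```

With RoPE, a position-dependent rotation sits between the query and key projections, so a generic A no longer cancels; this is one intuition for why the abstract reports a smaller symmetry group in that setting.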

8. From Drift to Coherence: Stabilizing Beliefs in LLMs

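AI summary: The paper studies belief drift in LLMs on multiple-choice question answering and introduces prompted predictive resampling (PPR). It finds that the belief process self-stabilizes and converges, and then proposes a seed-answer prompting strategy and a self-consistency loss to accelerate stabilization and improve predictive coherence.

Link: https://arxiv.org/abs/2606.17832

Authors: SongEun Kim, Seungyoo Lee, Edwin Fong, Hyungi Lee, Juho Lee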

Abstract: Large language models (LLMs) are often hypothesized to perform implicit Bayesian inference, yet a key coherence condition, the martingale property of predictive beliefs, has been shown to fail in controlled synthetic in-context learning settings. We revisit this question in a more typical usage regime: generic multiple-choice question answering. Exploiting the discrete answer space, we compute exact predictive distributions and study belief dynamics induced by autoregressive answer resampling. We introduce prompted predictive resampling (PPR), where an LLM generates a sequence of answers to the same question. Empirically, PPR reveals early-stage belief drift, indicating martingale violations. However, after sufficient resampling steps, the belief process self-stabilizes and converges to a coherent predictive distribution. Based on this observation, we further propose (i) a seed-answer prompting strategy to accelerate stabilization, and (ii) a self-consistency loss that amortizes early-stage drift into the model via fine-tuning. Experiments on multiple-choice QA benchmarks show that our methods substantially reduce belief drift and improve predictive coherence without sacrificing accuracy.

9. Monotonic Kolmogorov-Arnold Networks: A Theoretical and Empirical Study of Monotonicity as an Inductive Bias

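AI summary: The paper proposes MKAN, which achieves hard monotonicity through exponential reparameterization of B-spline coefficients, positive edge weights, and a monotone base activation. It proves that a feature extractor satisfying the theorem's conditions admits a monotone realization with a bounded encoder size, and experiments show that MKAN is competitive with state-of-the-art monotone networks on a monotonicity benchmark while retaining KAN's per-edge functional transparency.

Link: https://arxiv.org/abs/2606.17886

Institutions: Jozef Stefan Institute

Authors: Mikhail Krasnov, Carolina Fortuna, Blaž Bertalanič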

Abstract: Monotonicity has been a long-running architectural inductive bias for neural networks, motivated by tabular, scientific, and economic settings where outputs are known to respond monotonically to certain inputs. Existing approaches are MLP- or flow-based and lack per-edge functional transparency; the only Kolmogorov-Arnold Network (KAN) variant with monotonicity, MonoKAN, enforces the constraint only on a restricted parameter subset and requires a projection-style training procedure. We close this gap with MKAN, a KAN with hard monotonicity guaranteed for all parameter values via exponential reparameterization of B-spline coefficients, positive edge weights, and a monotone base activation. Training reduces to standard unconstrained gradient descent. Our headline theoretical contribution is a representation-cost theorem: any $C^K$ ($K > 0$) feature extractor inducing a ball-shaped semantic-neighborhood partition admits a monotone realization of the equivalent neighborhood structure at $N' = N^* + k \le 2N^*$, where $k$ is the number of non-monotone coordinates of the original. The bound is architecture-agnostic and gives a principled sizing rule for monotone encoders. Empirically, MKAN is competitive with state-of-the-art monotone NNs on the SMM/ICML-2024 benchmark while being the only method that combines hard unconstrained monotonicity with KAN's per-edge functional transparency; the $2N^*$ prediction is validated in a self-supervised feature-size sweep on four real datasets, and on a controlled monotone-generative dataset MKAN recovers ground-truth factors with substantially higher Spearman alignment than KAN, MLP, and linear baselines.
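
The idea of guaranteeing monotonicity for all parameter values can be illustrated with a degree-1 (piecewise-linear) spline whose coefficients are built from exponentiated increments. This is a simplified sketch under stated assumptions: the helper names, the cumulative-sum parameterization, and the use of linear interpolation in place of higher-order B-splines are all choices made for this illustration, and MKAN's actual parameterization may differ.

```python
import numpy as np

def monotone_coefficients(theta):
    """Map unconstrained parameters to strictly increasing spline coefficients."""
    increments = np.exp(theta[1:])  # positive for every real theta
    return theta[0] + np.concatenate(([0.0], np.cumsum(increments)))

def monotone_edge_function(x, theta, x_min=-1.0, x_max=1.0):
    """Piecewise-linear spline on a uniform grid; increasing for any theta."""
    knots = np.linspace(x_min, x_max, len(theta))
    return np.interp(x, knots, monotone_coefficients(theta))

rng = np.random.default_rng(0)
theta = rng.normal(size=8)  # free parameters, no constraints or projection
xs = np.linspace(-1.0, 1.0, 201)
ys = monotone_edge_function(xs, theta)

print("smallest step between consecutive outputs:", np.diff(ys).min())
```

Because the constraint is built into the mapping from `theta` to coefficients, an optimizer can update `theta` freely, which matches the abstract's point that training reduces to unconstrained gradient descent.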

10. KANLib -- An Modular, Extensible and Fast Kolmogorov-Arnold Network Implementation

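AI summary: The paper presents the KANLib framework, which unifies existing KAN implementations and supports two basis function types and adaptive grid rescaling. It reproduces the predictive behavior of reference implementations while remaining flexible and fast.

Link: https://arxiv.org/abs/2606.17927

Authors: Julian Hoever, Gregor Schiele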

Abstract: Kolmogorov-Arnold Networks (KANs) have recently emerged as a promising alternative to traditional multilayer perceptrons by replacing linear weights with learnable univariate functions. Despite their theoretical advantages in interpretability and expressiveness, practical research on KANs remains difficult due to high computational costs and inconsistent feature support across existing frameworks. This paper introduces KANLib, a modular, extensible, and computationally efficient framework for developing and evaluating KAN architectures. KANLib unifies core concepts from existing implementations, including PyKAN, EfficientKAN, and FastKAN, within a consistent software architecture that emphasizes flexibility, feature parity, and high performance. The framework supports two basis function types, adaptive grid rescaling, grid extension, and fine-grained architectural customization while maintaining compatibility with standard PyTorch workflows. Experimental evaluation on the California Housing benchmark demonstrates that KANLib reproduces the predictive behavior of established reference KAN implementations while achieving competitive computational efficiency. Furthermore, the framework enables the exploration of architectural variations beyond standard KAN formulations with only minor impacts on predictive performance. Overall, KANLib provides a robust foundation for future research on scalable and extensible KAN architectures.

11. SoftMoE: Soft Differentiable Routing for Mixture-of-Experts in LLMs

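AI summary: The paper proposes SoftMoE, which replaces discrete routing with a soft top-k LapSum relaxation so that expert routing can be optimized by gradients, and which learns the number of active experts per layer. On language modeling it matches or exceeds sparse MoE performance while activating fewer experts.

Link: https://arxiv.org/abs/2606.17952

Authors: Mikołaj Zasada, Łukasz Struski, Jacek Tabor, Marcin Kurdziel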

Abstract: Sparse Mixture-of-Experts (MoE) architectures enable scaling LLM parameters under a fixed inference budget by activating only a small subset of experts via top-$k$ routing. While this preserves causality and suits autoregressive language models, the discrete top-$k$ operator is not differentiable, forcing a fixed number of active experts per input and resulting in inefficient use of computation. We propose SoftMoE, which replaces discrete routing with a truncated soft top-$k$ LapSum relaxation, allowing gradient-based optimization of expert routing. We further parameterize the mean number of active experts per layer and impose a global budget constraint, enabling the model to learn how to allocate expert capacity across layers. SoftMoE remains fully compatible with autoregressive modeling and achieves performance comparable to or better than sparse MoE on language modeling and downstream tasks, while activating significantly fewer experts. Notably, the learned allocation is highly non-uniform, with later layers activating more experts. The source code is publicly available.

12. LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling

AI summary: The paper studies loop-count selection for Parallel Loop Transformers (PLT). The two-loop variant delivers significant gains on code generation and related tasks, whereas variants with three or more loops regress, revealing a gain-cost trade-off.

Link: https://arxiv.org/abs/2606.18023

Institutions: Beihang University; IQuest Research; Langboat; Renmin University of China

Authors: Jian Yang, Shawn Guo, Wei Zhang, Tianyu Zheng, Yaxin Du, Haau-Sing Li, Jiajun Wu, Yue Song, Yan Xing, Qingsong Cai, Zelong Huang, Chuan Hao, Ran Tao, Xianglong Liu, Wayne Xin Zhao, Mingjie Tang, Weifeng Lv, Ming Zhou, Bryan Dai

Abstract: Looped Transformers scale latent computation by repeatedly applying shared blocks, but sequential looping increases latency and KV-cache memory with the loop count. Parallel loop Transformers (PLT) alleviate this cost through cross-loop position offsets (CLP) and shared-KV gated sliding-window attention, making loop count a practical design choice. We therefore study PLT loop-count selection through a gain-cost view: an extra loop may refine representations, but CLP also introduces a positional mismatch at each loop boundary. We instantiate this study by training LoopCoder-v2, a family of 7B PLT coders with different loop counts, from scratch on 18T tokens, followed by matched instruction tuning and evaluation. Empirically, the two-loop variant delivers broad gains over the non-looped baseline across code generation, code reasoning, agentic software engineering, and tool-use benchmarks, improving SWE-bench Verified from 43.0 to 64.4 points and Multi-SWE from 14.0 to 31.0 points. In contrast, variants with three or more loops regress, revealing a strongly non-monotonic loop-count effect. Our diagnostics show that loop 2 provides the main productive refinement, while later loops yield diminishing, oscillatory updates and reduced representational diversity. Because the CLP-induced mismatch remains roughly fixed as refinement gains shrink, the offset cost increasingly dominates. This gain-cost trade-off explains PLT's saturation at two loops and provides diagnostics for loop-count selection.

13. Looped World Models

AI summary: The paper proposes Looped World Models (LoopWM), which iteratively refine latent environment states through a parameter-shared transformer block, achieving up to 100x parameter efficiency and establishing iterative latent depth as a new scaling axis for world simulation.

Link: https://arxiv.org/abs/2606.18208

Institutions: FaceMind Research Asia

Authors: Hongyuan Adam Lu, Z.L. Victor Wei, Qun Zhang, Jinrui Zeng, Bowen Cao, Lingwei Meng, Mocheng Li, Zezhong Wang, Haonan Yin, Naifu Xue, Minyu Chen, Cenyuan Zhang, Zefan Zhang, Hao Wei, Jiawei Zhou, Haoran Xu, Hao Yang, Ronglai Zuo, Tongda Xu, Yonghao Li, Jian Chen, Hebin Wang, Zeyu Gao, Yang Li, Wei Zhao, Qimin Zhong, Siqi Liu, Yumeng Zhang, Leyan Cui, Zhangyu Wang, Wai Lam

Abstract: Current world models face a fundamental tension: faithful long-horizon simulation demands deep computation, but deeper models are expensive to deploy and prone to compounding errors. We resolve this by introducing Looped World Models (LoopWM), which are the first looped architectures for world modelling. Our method iteratively refines latent environment states through a parameter-shared transformer block. This yields up to 100x parameter efficiency over conventional approaches with adaptive computation that automatically scales depth to match the complexity of each prediction step. Orthogonal to scaling model size and training data, LoopWM establishes iterative latent depth as a new scaling axis for world simulation, which might significantly push the community forward.

2. Representation Learning, Self-Supervised and Contrastive Learning | 3 papers

14. FoundCause: Causal Discovery with Latent Confounders from Observational Data

AI summary: The paper proposes FoundCause, an amortized causal discovery model trained on synthetic data that maps a dataset directly to a causal graph in a single forward pass and explicitly models latent confounders. On 15 real-world datasets it outperforms 11 non-amortized and 4 amortized methods.

Link: https://arxiv.org/abs/2606.17516

Institutions: Amazon Web Services; Department of Statistics, University of California, Davis

Authors: Patrick Blöbaum, Krishnakumar Balasubramanian, Shiva Prasad Kasiviswanathan

Abstract: Causal discovery from observational data remains challenging due to the need to recover directed structure and latent confounding without interventions. We propose FoundCause, an amortized causal discovery model trained entirely on synthetic data that maps datasets directly to causal graphs in a single forward pass. By learning from large collections of simulated structural causal models, FoundCause captures transferable statistical patterns that generalize beyond individual datasets. The architecture incorporates several key inductive biases for causal discovery. It uses a permutation-invariant transformer encoder with alternating attention over samples and variables to jointly model cross-variable dependence and per-variable distributions. Pairwise statistical features derived from classical asymmetry measures are injected through statistics-conditioned attention, guiding the model toward known causal signals. A factorized decoder separates edge existence from direction, while a triangular refinement module enables reasoning over higher-order causal motifs such as chains and colliders. In addition, a dedicated confounder module based on learnable latent tokens explicitly models hidden common causes, and the model explicitly handles missing data via its masked input representation. To our knowledge, FoundCause is the first amortized causal discovery approach to explicitly model latent confounding. FoundCause outperforms 11 classical non-amortized methods (e.g., PC, GES, NOTEARS-style optimization) and 4 amortized causal discovery methods on 15 real-world datasets, achieving +9.6% improvement in $F_1$, +1.2% in AUROC, and an 18.9% reduction in structural Hamming distance relative to the strongest non-amortized methods, while performing inference in a single forward pass.

15. Expanding SPHERE-JEPA: A Family of Statistical Regularizers for the Hypersphere

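AI summary: Sliced statistical regularizers in self-supervised learning introduce projection variance through Monte Carlo sampling, which destabilizes optimization and slows convergence. The paper proposes full-dimensional MMD, KSD, and KL-divergence regularizers with rotationally invariant kernels, yielding more stable optimization and consistent improvements on ImageNet and Galaxy10.

Link: https://arxiv.org/abs/2606.17603

Authors: Léo Nicollier (CB, ATT), Enric Meinhardt-Llopis (CB), Max Dunitz (ATT), Marc Pic (ATT), Pablo Musé (CB, IFUMI), Gabriele Facciolo (CB)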

Abstract: In Self-Supervised Learning (SSL), preventing representation collapse by explicitly enforcing a uniform distribution on the unit hypersphere has proven to be effective. However, current frameworks typically rely on sliced statistical regularizers such as SIGReg (used in LeJEPA) and SUSReg (used in SPHERE-JEPA), which approximate this continuous objective via Monte Carlo sampling along random 1D directions. This stochasticity injects projection variance into the training gradients, destabilizing optimization and hindering convergence. In this work, we first show that analytically integrating out these random projections natively yields a deterministic Maximum Mean Discrepancy (MMD), bypassing the variance of sliced methods. Motivated by this equivalence, we formulate full-dimensional objectives for MMD, Kernel Stein Discrepancy (KSD), and Kullback-Leibler (KL) divergence directly on the sphere to enforce a uniform distribution. To prevent spatial bias, we equip these tests with rotationally invariant kernels constructed via spectral theory, systematically evaluating two canonical families: smooth exponential decay (Heat) and strict frequency cutoff (Bandlimited) filters. Empirically, removing projection-induced noise results in more stable optimization, faster convergence, and consistent improvements over stochastic sliced regularizers on ImageNet and Galaxy10. Furthermore, we reveal that the choice of the statistical test shapes the geometry of the learned latent space: MMD and KSD favor locally clustered organization suitable for object-centric domains, whereas the continuous KDE-based KL divergence promotes fine-grained instance separation, yielding the strongest results on unclustered procedural texture retrieval.

16. Blind Recovery of Latent Domains via Unsupervised Symmetry Discovery

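AI summary: The paper proposes an unsupervised framework that recovers latent domains and signals from unstructured observations by discovering symmetries of the data distribution, using a shallow group-convolutional network with stationarity and locality regularization.

Link: https://arxiv.org/abs/2606.17782

Institutions: Bogazici University

Authors: Onur Efe, Arkadas Ozakin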

Abstract: The primary motivation in blind inverse problems is to recover signals of interest from corrupted observations without knowing the obfuscating mechanism. Blind deconvolution is a prominent approach when the corruption is convolutional, but it is not applicable when general linear transformations obfuscate the domain structure. In this work, we propose an unsupervised framework for recovering latent domains and signals by discovering symmetries of the data distribution. Our framework models observations as linear measurements of signals sampled from a latent random field, and optimizes a shallow group-convolutional network by imposing stationarity and locality regularization at the model output. The model learns a latent symmetry action and an appropriate filter, thereby mapping unstructured observations to a symmetry-based representation that reveals latent signals. Experiments on stochastic processes, Ising models, shuffled and bit-scrambled images, and neural recordings show that the method recovers latent domains and signals from unstructured observations, suggesting symmetry discovery as a new direction for unsupervised structure learning and blind inverse problems.

3. Reinforcement Learning and Sequential Decision-Making | 11 papers

17. Rethinking Groups in Critic-Free RLVR

AI summary: To address the data inefficiency and synchronization barriers of group-based rollouts in critic-free RL, the paper proposes negative token filtering, which enables stable single-rollout training and achieves comparable performance on reasoning tasks and stronger performance on agentic tasks.

Link: https://arxiv.org/abs/2606.17250

Institutions: Université de Montréal; McGill University; Mila - Quebec AI Institute; University of Waterloo; The Chinese University of Hong Kong; Huawei Noah's Ark Lab

Authors: Yihong Wu, Liheng Ma, Lingfeng Xiao, Muzhi Li, Xinyu Wang, Yingxue Zhang, Jian-Yun Nie

Abstract: Reinforcement learning (RL) has become a central paradigm for post-training large language models. Existing critic-free RL methods typically generate a group of rollouts for the same question to estimate value baselines for advantage computation. However, this design suffers from data inefficiency, group synchronization barriers, and inflexibility with structured rollouts. In this work, we revisit the role of the "group" and show that its underlying function is not merely to estimate baselines but to prevent false penalties on negative samples. Building on this insight, we propose negative token filtering, a simple and effective strategy that enables stable single-rollout training. We apply it to two batch-level advantage methods, achieving comparable performance on reasoning tasks and stronger performance on agentic tasks relative to group-based RL techniques.

18. Decision-Driven Geosteering Under Uncertainty: A Unified Framework for Sequential Decision Optimization

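AI summary: The paper proposes a geosteering framework that combines particle filtering with reinforcement learning. It models geological uncertainty explicitly and evaluates three decision policies for real-time optimization of the well trajectory.

Link: https://arxiv.org/abs/2606.17331

Authors: Hibat Errahmen Djecta, Sergey Alyaev, Kristian Fossum, Reidar B. Bratvold, Ressi Bonti Muhammad, Apoorv Srivastava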

Abstract: Geosteering requires navigating a well trajectory through an unknown geological configuration, while sequentially updating decisions based on indirect measurements acquired during drilling. This work presents an uncertainty-aware geosteering framework that tightly integrates particle filtering for probabilistic subsurface interpretation with value-based reinforcement learning for sequential decision-making. Geological uncertainty ahead of the drill bit is represented explicitly through a particle filter (PF), enabling belief-informed control rather than deterministic trajectory correction. The framework couples PF belief updates with belief-informed decision policies and evaluates three decision-making options that operate under identical uncertainty representations: an interpretable Approximate Dynamic Programming (ADP) scheme, a Deep Q-learning baseline, and a Dual Deep Reinforcement Learning (Dual DRL) architecture trained with a target Q-network scheme for stability, using a dueling (value/advantage) decomposition for Q-value parameterization. Beyond final placement performance, we assess policy behavior using stability-oriented metrics that quantify steering smoothness over time, providing additional operational insight into how decision policies respond as uncertainty evolves. The framework is integrated with an API for validation within an industrial geosteering simulator under realistic measurement noise and drilling constraints. Using identical geological realizations, operational limits, and reward definitions across methods, the experiments provide a controlled and high-fidelity evaluation of how alternative decision policies behave throughout the drilling process, rather than evaluating performance solely from the final well trajectory.
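
A particle-filter belief update of the kind the abstract mentions can be sketched with a generic bootstrap filter. Everything specific here is an assumption for illustration: the state is a single static scalar (a hypothetical layer-boundary depth), the measurement is a direct noisy reading with Gaussian noise, and `pf_update` is a hypothetical helper. The paper's geological model, measurement physics, and simulator API are not described in the abstract and are not represented.

```python
import numpy as np

def pf_update(particles, weights, measurement, noise_std, rng):
    """One bootstrap-filter step: reweight by Gaussian likelihood, then resample."""
    likelihood = np.exp(-0.5 * ((measurement - particles) / noise_std) ** 2)
    weights = weights * likelihood
    weights = weights / weights.sum()
    # Multinomial resampling concentrates particles where the belief is high.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    uniform = np.full(len(particles), 1.0 / len(particles))
    return particles[idx], uniform

rng = np.random.default_rng(0)
true_depth = 12.0      # hypothetical boundary depth, unknown to the filter
noise_std = 1.0        # assumed measurement noise
n_particles = 2000

particles = rng.uniform(0.0, 30.0, size=n_particles)  # broad prior belief
weights = np.full(n_particles, 1.0 / n_particles)
prior_spread = particles.std()

for _ in range(5):     # five noisy measurements acquired while drilling
    z = true_depth + rng.normal(scale=noise_std)
    particles, weights = pf_update(particles, weights, z, noise_std, rng)

posterior_mean = particles.mean()
posterior_spread = particles.std()
print("prior spread:", prior_spread)
print("posterior mean and spread:", posterior_mean, posterior_spread)
```

A decision policy that is belief-informed in the abstract's sense would take the particle set, or statistics of it such as the mean and spread, as its input instead of a single point estimate.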

19. Performance-Driven Environment Abstraction with Multi-Timescale Learning

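AI summary: For large Markov decision processes, proposes a performance-driven environment abstraction method that uses multi-timescale reinforcement learning to jointly optimize the policy and a tree-structured abstraction, balancing performance against complexity and yielding state compression and improved sample efficiency.

Link: https://arxiv.org/abs/2606.17377

Institutions: Georgia Institute of Technology; University of North Carolina at Charlotte

Authors: Yue Guan, Dipankar Maity, Panagiotis Tsiotras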

Abstract: We study performance-driven environment abstraction for decision-making in large Markov decision processes. Rather than preserving geometric or topological structure, we seek abstractions that directly optimize decision quality. We model abstraction as a controlled approximation obtained by aggregating the state space and enforcing a shared action distribution within each aggregated state. For a fixed partition, we establish a performance guarantee that separates value-function approximation error from the loss introduced by action sharing. Guided by this analysis, we develop a multi-timescale reinforcement learning framework that jointly adapts the policy and a tree-structured environment abstraction. The resulting algorithm refines and coarsens regions of the state space based on Q-value discrepancies, balancing performance against abstraction size and complexity. Empirical results demonstrate substantial state compression, improved sample efficiency, and faster replanning compared to actor-critic baselines.

20. Memory-Efficient Meta-Reinforcement Learning for Adaptive Safety-Critical Control in Adversarial Spacecraft Proximity Operations

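AI summary: Studies meta-reinforcement learning for tuning the class-K functions of input-constrained control barrier functions, comparing three recurrent network architectures and two training algorithms; Mamba combined with PPO improves task completion, safety, and fuel efficiency in both cooperative and uncooperative scenarios.

Link: https://arxiv.org/abs/2606.17414

Institutions: MIT; University of Illinois, Urbana-Champaign

Authors: Alejandro Posadas-Nava, Richard Linares, Minduli Wijayatunga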

Abstract: Autonomous spacecraft rendezvous and proximity operations (RPO) require controllers that guarantee safety under thrust constraints while minimizing fuel expenditure. Input-constrained control barrier functions (ICCBFs) provide a control method for nonlinear systems with actuation constraints that constructs a forward-invariant safe set. Previous work has shown that learning class-$\mathcal{K}$ functions defining the ICCBF recursion via meta reinforcement learning (meta-RL) yields a robust, non-greedy approach to safety-critical control in RPO. This paper extends that framework by investigating the performance of three recurrent network architectures (Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Selective State Space Model (Mamba)) and two training algorithms (Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC)) to identify the best setup for tuning ICCBF class-$\mathcal{K}$ functions via meta-RL. In addition to cooperative test cases, performance is evaluated in the presence of adversarial behavior where the target spacecraft behaves in a way that worsens the safety of the chaser spacecraft. Results indicate that state space models such as Mamba, when used with PPO, achieve superior task completion, safety, and fuel savings compared to other architectures, across all cooperative and uncooperative scenarios tested.

21. Online LLM Selection via Constrained Bandits with Time-Varying Demand

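AI summary: For selecting among heterogeneous LLMs in edge-cloud inference systems, proposes an online learning algorithm based on confidence-bound estimates and demand predictions that achieves sublinear regret and sublinear constraint violation under hard budget and soft latency constraints.

Link: https://arxiv.org/abs/2606.17489

Institutions: Department of Electrical and Computer Engineering, University of Florida; Manning College of Information and Computer Sciences, University of Massachusetts Amherst

Authors: Yin Huang, Qingsong Liu, Jie Xu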

Abstract: Large Language Models (LLMs) are increasingly deployed in edge-cloud inference systems to handle diverse user tasks with heterogeneous accuracy, latency, and cost profiles. Selecting the appropriate LLM for each incoming task is critical for ensuring service quality and efficient resource utilization. However, model heterogeneity, stochastic and unknown performance characteristics, and time-varying task demands make static selection strategies inadequate. Real-world deployments often impose hard resource budgets such as monetary expenditure limits, along with soft service-level requirements such as latency guarantees. These constraints introduce additional challenges for online decision-making. We formulate this problem as a constrained stochastic bandit learning task, where the learner sequentially selects models under both packing-type (hard) and covering-type (soft) constraints, while adapting to time-varying task demand. The learner operates without access to the underlying reward, cost, or latency distributions and must rely on partial feedback. We develop a novel online learning algorithm that leverages confidence-bound estimates and demand predictions to balance reward maximization with long-term constraint satisfaction. We provide theoretical guarantees showing sublinear regret and sublinear covering constraint violations compared to an offline benchmark with full information. Experimental results on synthetic workloads demonstrate the effectiveness and robustness of our approach in dynamic, resource-constrained environments.

22. Learning to Refine Hidden States for Reliable LLM Reasoning

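AI summary: Proposes ReLAR, a framework that refines latent states under reinforcement-learning guidance, adaptively choosing the number and direction of refinement steps to improve accuracy and stability on complex multi-step reasoning while lowering inference overhead.

Link: https://arxiv.org/abs/2606.17524

Authors: Chia-Hsuan Hsu, Jui-Ming Yao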

Abstract: Large language models show strong reasoning ability, but their internal reasoning process can remain unstable in complex multi-step settings, where early hidden-state errors may propagate to incorrect predictions. We propose ReLAR, a reinforcement-guided latent refinement framework that iteratively updates hidden representations before decoding. ReLAR maintains a compact latent reasoning state and uses learned depth and action controllers to adaptively determine both the number and direction of refinement steps. The controllers are trained with a policy gradient objective based on step-wise likelihood improvement, enabling efficient input-dependent reasoning without explicit chain-of-thought generation. Experiments on medical, mathematical, multi-hop reasoning, and open-ended generation benchmarks show that ReLAR improves accuracy, generation quality, and reasoning stability with substantially lower inference overhead than explicit reasoning baselines.

23. Continuous-time Optimal Stopping through Deep Reinforcement Learning

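AI summary: Proposes CARLOS, an algorithm that uses an aggregate deep neural network to learn stopping rules at arbitrarily fine time resolution; with progressive time-grid refinement and adaptive sampling, its prices approach the American option upper bound.

Link: https://arxiv.org/abs/2606.17545

Institutions: Department of Statistics & Applied Probability, UC Santa Barbara

Authors: Cosmin Borsa, Michael Ludkovski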

Abstract: Simulation-based solvers for optimal stopping problems must discretize the stopping decision. Under classical dynamic programming, a coarse exercise grid with only a few stopping opportunities can materially undervalue the optimal expected reward, whereas on a very fine grid, approximation errors accumulate through the backward recursion. To remove this limitation, we develop a new reinforcement-learning-inspired algorithm that enables us to learn the exercise rule at arbitrarily fine time resolution. Our CARLOS (Continuous-time Adaptive Reinforcement Learning for Optimal Stopping) algorithm utilizes an aggregate deep neural network (ADNN) to learn a joint space-time decision boundary. Starting from a coarse time grid, we progressively increase the frequency of stopping opportunities, while in parallel training the ADNN to refine its timing-value estimates. We moreover design an adaptive sampling strategy that gradually concentrates training effort near the stopping boundary. Benchmarked results show that CARLOS delivers higher prices than existing Bermudan solvers, approaching the American upper bound, and achieves high computational efficiency relative to non-RL comparators.

24. Reversal Q-Learning

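AI summary: Proposes Reversal Q-learning (RQL), which uses the expanded MDP framework, virtual on-policy trajectories generated by reversing flows, and a bias-and-variance reduction technique to train flow policies from prior data, achieving the best average offline RL performance on 50 robotic tasks.

Link: https://arxiv.org/abs/2606.17551

Authors: Aditya Oberai, Seohong Park, Sergey Levine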

Abstract: Iterative generative modeling techniques, such as flow matching, provide powerful tools to model complex behaviors for effective offline reinforcement learning (RL). In this work, we propose a new off-policy RL algorithm that trains a flow policy based on prior data. Our idea starts from the "expanded" Markov decision process (MDP) framework, which treats individual flow refinement steps as separate actions in an MDP. To enable off-policy RL within this framework, we apply two techniques: we generate virtual on-policy trajectories (by "reversing" flows) to make this framework compatible with prior data, and we apply a bias-and-variance reduction technique to mitigate the curse of horizon in off-policy RL. We call the resulting algorithm Reversal Q-learning (RQL). RQL has several advantages over previous flow-based RL methods: it does not suffer from backpropagation through time, makes better use of the learned value function, and directly trains the full, expressive flow policy. Through our experiments on 50 challenging simulated robotic tasks, we show that RQL leads to the best average offline RL performance compared to state-of-the-art flow-based offline RL algorithms.

25. EnvRL: Learn from Environment Dynamics in Agentic Reinforcement Learning

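AI summary: Proposes EnvRL, a framework that incorporates environment dynamics learning into agentic reinforcement learning through two auxiliary objectives, state prediction and inverse dynamics, significantly improving success rates on long-horizon tasks.

Link: https://arxiv.org/abs/2606.17680

Institutions: Department of Computer Science and Technology, Tsinghua University; Shanghai AI Laboratory

Authors: Zhitong Wang, Songze Li, Hao Peng, Shuzheng Si, Yi Wang, Maosong Sun, Juanzi Li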

Abstract: Reinforcement learning (RL) has emerged as a powerful paradigm for training Large Language Models (LLMs) as agents. However, conventional RL methods for long-horizon agentic tasks often struggle with sparse outcome rewards. Intuitively, this overlooks the rich environment dynamics information contained in rollout interaction trajectories. We argue that the interaction experience inherently serves as an implicit supervision signal, reveals the underlying transition mechanisms of the environment, and enables the agent to construct a more accurate internal model of the environment. Therefore, in this work, we investigate how to leverage this additional signal to improve policy learning. Specifically, we propose EnvRL, a framework that incorporates environment dynamics learning into agentic RL via two auxiliary objectives: state prediction and inverse dynamics. By jointly optimizing with the primary RL objective, we encourage the agent to internalize environment dynamics from its own interaction experience. Extensive experiments on two long-horizon agentic benchmarks demonstrate that EnvRL achieves significant improvements in success rates over RL-only baselines, e.g., when trained with GRPO, lifting Qwen-2.5-1.5B-Instruct from 72.8% to 77.4% on ALFWorld, and from 56.8% to 67.0% on WebShop.

26. Deep Reinforcement Learning for Minimum Zero-Forcing Sets

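AI summary: Proposes SD-ZFS, a reinforcement learning framework that adapts the S2V-DQN architecture to the minimum zero-forcing set problem, and evaluates it on graphs of varying structure against the optimal solution and a greedy heuristic.

Link: https://arxiv.org/abs/2606.18106

Institutions: Department of Computing Sciences, Villanova University

Authors: Steve Halley, Maurício Gruppi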

Abstract: This paper explores the problem of finding the minimum zero-forcing set on undirected graphs and proposes an adapted machine-learning framework to solve the problem. The minimum zero-forcing set problem is a graph coloring problem where the color of an initial set of nodes propagates throughout a network. The set of nodes is zero-forcing if it forces all uncolored nodes to change color under the constraint of the color-change rule. There are several applications of this problem across different domains such as network science, network control, and designing logical circuits. Finding the minimum zero-forcing set is shown to be NP-hard. We propose a reinforcement learning framework, SD-ZFS, that adapts the S2V-DQN architecture to the ZFS problem. We train several models on this adapted framework and analyze the performance across graph datasets that have varying structures. We evaluate how the models trained on the framework generalize, scale, and transfer to different network types. The results demonstrate the effectiveness of the framework when compared against the optimal solution and greedy heuristic. We provide further insight into how the ZFS problem can be solved through machine learning and the influence of network structure on the problem.

27. Learning Fair Pareto-Optimal Policies in Multi-Objective Reinforcement Learning

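AI summary: Methods tied to fixed user preferences in multi-objective reinforcement learning cannot supply a diverse set of policies; this work proposes multi-policy methods based on the generalized Gini welfare function to learn a set of fair Pareto-optimal policies.

Link: https://arxiv.org/abs/2606.18111

Authors: Umer Siddique, Peilang Li, Yongcan Cao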

Abstract: Fairness is an important aspect of decision-making in multi-objective reinforcement learning (MORL), where policies must ensure both optimality and equity across multiple, potentially conflicting objectives. While single-policy MORL methods can learn fair policies for fixed user preferences using welfare functions such as the generalized Gini welfare function (GGF), they fail to provide the diverse set of policies necessary for dynamic or unknown user preferences. To address this limitation, we formalize the fair optimization problem in multi-policy MORL, where the goal is to learn a set of Pareto-optimal policies that ensure fairness across all possible user preferences. Our key technical contributions are threefold: (1) We show that for concave, piecewise-linear welfare functions (e.g., GGF), fair policies remain in the convex coverage set (CCS), which is an approximated Pareto front for linear scalarization. (2) We demonstrate that non-stationary policies, augmented with accrued reward histories, and stochastic policies improve fairness by dynamically adapting to historical inequities. (3) We propose three novel algorithms, which include integrating GGF with multi-policy multi-objective Q-Learning (MOQL), state-augmented multi-policy MOQL for learning non-stationary policies, and its novel extension for learning stochastic policies. We evaluate our algorithms across various domains and compare our methods against the state-of-the-art MORL baselines. The empirical results show that our methods learn a set of fair policies that accommodate different user preferences.
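
For readers unfamiliar with the generalized Gini welfare function, a minimal sketch follows. It uses the standard definition (objective values sorted in ascending order, paired with weights in descending order, so the worst-off objective counts most); the function name `ggf` and the example weights are illustrative choices, not taken from the paper.

```python
import numpy as np

def ggf(values, weights):
    """Generalized Gini welfare: dot product of ascending values and descending weights."""
    values = np.sort(np.asarray(values, dtype=float))          # worst objective first
    weights = np.sort(np.asarray(weights, dtype=float))[::-1]  # largest weight first
    return float(values @ weights)

# Two return vectors with the same total (6.0) but different balance across objectives.
balanced = ggf([2.0, 2.0, 2.0], [0.5, 0.3, 0.2])
skewed = ggf([3.0, 1.0, 2.0], [0.5, 0.3, 0.2])
```

Because the largest weight always lands on the smallest value, the balanced vector scores higher than the skewed one even though their sums match, which is the sense in which the function rewards equity.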

4. Generative Models and Probabilistic Modeling | 8 papers

28. Informative Missingness to Generate Irregular Clinical Time Series

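AI summary: Proposes a diffusion-based method for generating clinical time series that jointly models laboratory values and observation patterns; validated on the DACMI benchmark, it captures clinical dependencies between physiology and testing behavior.

Link: https://arxiv.org/abs/2606.17106

Institutions: Aalborg University; University of Pavia; Bowling Green State University

Authors: Hadi Mehdizavareh, Gabriele Santangelo, Giovanna Nicora, Simon Lebech Cichosz, Arianna Dagliati, Arijit Khan, Riccardo Bellazzi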

Abstract: Laboratory tests in electronic health records are collected irregularly, and the absence of a test order can be as informative as the measurement itself. Such missingness reflects clinicians' decisions and patient physiology, making it important to model it directly rather than treat it as a preprocessing artifact. Here we present a diffusion-based approach for generating clinical time series that jointly models laboratory values and their observation patterns using the public Data Analytics Challenge on Missing Data Imputation (DACMI) benchmark derived from MIMIC-III. To preserve realistic sampling, we align chart times into 4-hour intervals and segment admissions into 7-day windows, producing trajectories that pair each lab value with a corresponding observation indicator. Standard transformations and normalization are applied to stabilize training. Our method extends the TimeDiff framework to learn continuous lab values and discrete missingness patterns through complementary diffusion objectives. Experiments show that the generated data closely match real patient trajectories across individual lab distributions and joint value-missingness embeddings, demonstrating that diffusion models can capture clinically meaningful dependencies between patient physiology and clinicians' testing behavior under MNAR-like (missing-not-at-random) missingness. These preliminary results indicate that our model can serve as an initial component toward developing clinical foundation models. By producing synthetic priors that preserve key physiology-missingness relationships, this work motivates the subsequent training of Prior-Data Fitted Networks capable of leveraging informative missingness, which we will investigate in the extended work.

29. Constrained Diffusion Models with Primal-Dual Inference

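AI summary: Proposes primal-dual inference (PDI), which jointly infers the optimal primal distribution and its dual variable by alternating denoising and dual ascent during the reverse diffusion process, enabling sampling from entropy-regularized optimization problems with average constraints.

Link: https://arxiv.org/abs/2606.17192

Institutions: Department of Electrical and Systems Engineering, University of Pennsylvania

Authors: Samar Hadou, Yigit Berkay Uslu, Alejandro Ribeiro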

Abstract: This paper develops constrained diffusion models with primal-dual inference (PDI) to sample from optimal distributions of entropy-regularized optimization problems with average constraints. We formalize constrained sampling in the Lagrangian dual domain, where the optimal distribution takes the form of a Gibbs distribution indexed by the optimal dual variable. Rather than estimating this dual multiplier before sampling and freezing it throughout generation, PDI jointly infers the optimal primal distribution and its parametrizing dual variable. Each reverse diffusion step denoises using the score field associated with the current multiplier and then updates the multiplier through dual ascent using the estimated constraint violation of the denoised samples. To enable this conditional score field, we train a single dual-conditioned score network over the family of Gibbs distributions induced by the dual variables encountered during inference. We prove that the time average of the dual variables generated along the inference trajectory converges to a neighborhood of the dual optimum and bound the effect of residual dual mismatch on the terminal distribution through schedule-dependent stability factors. We evaluate PDI on constrained sampling from a mixture of Gaussians, wireless resource allocation, and portfolio management.
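
The alternation between sampling and dual ascent can be illustrated with a toy sketch under strong simplifying assumptions, none of which come from the paper: the base distribution is N(0, 1), the constraint is E[x] >= b, and the KL-regularized optimum is then the Gibbs (exponentially tilted) distribution N(lambda, 1). That distribution is sampled exactly here, standing in for a reverse diffusion step with a learned dual-conditioned score network. The names `toy_dual_ascent`, `eta`, and `n_samples` are hypothetical.

```python
import numpy as np

def toy_dual_ascent(b=1.0, eta=0.1, steps=500, n_samples=256, seed=0):
    """Projected dual ascent for: min KL(p || N(0,1)) subject to E_p[x] >= b."""
    rng = np.random.default_rng(seed)
    lam = 0.0
    history = np.empty(steps)
    for k in range(steps):
        # Primal step: draw from the Gibbs distribution N(lam, 1) for the current multiplier.
        x = rng.normal(loc=lam, scale=1.0, size=n_samples)
        # Dual step: ascend on the estimated constraint violation b - E[x], keep lam >= 0.
        lam = max(0.0, lam + eta * (b - x.mean()))
        history[k] = lam
    return history

lams = toy_dual_ascent()
tail_average = lams[len(lams) // 2:].mean()  # time average over the second half of the run
```

In this toy problem the optimal multiplier equals b, so the time-averaged multiplier should settle near 1.0; the sampling noise in each violation estimate is why the iterates hover in a neighborhood of the optimum rather than converging exactly.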

30. Discrete Autoregressive Transformer for Generative Mechanism Synthesis

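AI summary: Proposes a discrete autoregressive transformer that casts planar path synthesis as conditional sequence modeling, generating joint coordinates from a VAE latent and a mechanism-type token to produce diverse, accurate mechanism designs.

Link: https://arxiv.org/abs/2606.17409

Institutions: Computer-Aided Design and Innovation Lab, Department of Mechanical Engineering, Stony Brook University

Authors: Anar Nurizada, Anurag Purwar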

Abstract: Planar path synthesis requires mechanisms whose coupler curves match a prescribed trajectory; the mapping from curve to linkage is inherently one-to-many across four-, six-, and eight-bar topologies. We address this design problem with simulation-grounded evaluation on a curated corpus of over one million mechanisms, reporting Chamfer distance and dynamic time warping after forward kinematics and geometric alignment. We formulate synthesis as conditional autoregressive sequence modeling: joint coordinates are uniformly quantized to tokens and generated by a decoder-only transformer with a variational-autoencoder (VAE) latent of the target curve and an explicit mechanism-type token. Training combines token cross-entropy with a Gaussian-smoothed bin auxiliary loss that respects ordinal structure among bins. At inference, a bounded latent-noise schedule decodes all mechanism types at each noise level; we retain the top five candidates by geometric error, yielding diverse accurate families without dataset lookup. On held-out tests, aggregate mean Chamfer distance is $0.0132$ and mean dynamic time warping is $0.153$; a latent $k$-nearest-neighbor baseline that conditions on training-set neighbor latents in VAE space achieves matched-topology mean Chamfer distance $0.0071$ and mean dynamic time warping $0.117$ using the same decoder.
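
A minimal sketch of the two tokenization ideas named in the abstract, uniform quantization of a coordinate and a Gaussian-smoothed target over bins. The coordinate range, bin count, `sigma`, and both function names are assumptions made for illustration; the abstract does not give the exact form of the auxiliary loss, so the soft target below only shows how neighboring bins can receive graded credit.

```python
import numpy as np

def quantize(coord, low, high, num_bins):
    """Map a coordinate in [low, high] to a uniform bin index in [0, num_bins - 1]."""
    frac = (coord - low) / (high - low)
    return int(np.clip(np.floor(frac * num_bins), 0, num_bins - 1))

def smoothed_bin_target(index, num_bins, sigma=1.0):
    """Gaussian-smoothed distribution over bins, peaked at `index`."""
    bins = np.arange(num_bins)
    weights = np.exp(-0.5 * ((bins - index) / sigma) ** 2)
    return weights / weights.sum()

# Hypothetical setup: coordinates in [-1, 1] quantized to 4 bins.
token = quantize(0.0, low=-1.0, high=1.0, num_bins=4)
target = smoothed_bin_target(token, num_bins=4)
```

Unlike a one-hot target, the smoothed target gives a nearby bin more mass than a distant one, which is one way to respect the ordinal structure among bins.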

31. Perron--Frobenius Operator Matching for Generative Modeling

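AI summary: Proposes Perron--Frobenius Operator Matching (PFOM), a generative framework that matches density evolution via the integral PF operator and unifies flow, diffusion, and jump models; proves that among Bregman divergences only the KL divergence preserves equality between density-level and sample-conditioned objectives, and develops Nesterov-accelerated training and sampling.

Link: https://arxiv.org/abs/2606.17465

Institutions: Texas A&M University; City University of Hong Kong

Authors: Shiqi Zhang, Wuwei Wu, Jaemin Oh, Jie Chen, Xiaoning Qian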

Abstract: We introduce Perron--Frobenius Operator Matching (PFOM), a generative framework that matches density evolution via the integral PF operator, subsuming flow, diffusion, and jump models. We prove that among Bregman divergences, only Kullback--Leibler divergence preserves equality between density-level and sample-conditioned objectives, yielding a practical loss equivalent to Koopman path matching. We further develop Nesterov-accelerated training and sampling that stabilize discretization and accelerate convergence. On Gaussian mixtures and two-moons, PFOM achieves faster KL/$W_2$/MMD decrease and improved wall-clock efficiency with empirical validation. PFOM unifies operator-theoretic identification with modern generative modeling and opens paths to adaptive dictionaries and high-dimensional applications.

32. Recursive Scaling in Masked Diffusion Models

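AI summary: Proposes Recursive Masked Diffusion Models (R-MDMs), which add recursive depth by repeatedly applying the same denoising transformer within each diffusion step, giving parameter-efficient scaling; on structured generation tasks such as Sudoku and Countdown they match non-recursive baselines with fewer parameters.

Link: https://arxiv.org/abs/2606.18022

Authors: Alba Carballo-Castro, Julianna Piskorz, Paulius Rauba, Mihaela van der Schaar, Pascal Frossard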

Abstract: Masked diffusion models (MDMs) have recently emerged as a promising paradigm for sequence generation. Scaling MDMs is conventionally achieved by increasing the parameter count or the number of denoising steps. We introduce Recursive Masked Diffusion Models (R-MDMs), which add recursive depth as a third scaling axis by repeatedly applying the same denoising transformer within each diffusion step. Recursion enables iterative refinement of the output through parameter reuse, increasing effective model depth without increasing parameter count. Across structured generation tasks, including Sudoku and Countdown, we show that R-MDMs achieve substantially improved parameter efficiency: a model with $L$ recursive iterations often matches the performance of non-recursive baselines with roughly $L\times$ more parameters. Moreover, recursive refinement can partially substitute for additional denoising steps, allowing recursive models to reach the same generation quality with fewer forward passes at inference time. These results suggest that recursive depth is a practically useful scaling mechanism for MDMs, improving both parameter efficiency and the allocation of test-time compute.

33. NoiseTilt: Noise-Tilted Reverse Kernels for Diffusion Reward Alignment

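AI summary: Proposes the Noise-Tilted Reverse Kernel (NTRK), which performs reward-guided sampling by injecting reward gradients through the noise term, leaving the pretrained reverse kernel unchanged and requiring only a single sample per step; it outperforms existing methods on reward alignment tasks without losing sample quality.

Link: https://arxiv.org/abs/2606.18066

Institutions: KAIST; The University of Tokyo

Authors: Jisung Hwang, Yunhong Min, Jaihoon Kim, I-Chao Shen, Minhyuk Sung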

Abstract: We introduce the Noise-Tilted Reverse Kernel (NTRK), a reward-guided diffusion sampler that injects reward gradients through the noise term, leaving the pretrained reverse kernel unchanged and requiring only a single sample per step. Reward-guided sampling at inference time has greatly expanded the versatility of pretrained diffusion models. Yet existing methods face a trade-off. Gradient-based guidance shifts the reverse mean, steering generation but pushing intermediate states outside the region that the model was trained on and degrading quality. Search-based methods preserve quality but gain no gradient signal. No prior method achieves both. NTRK resolves this by keeping the reverse mean fixed and biasing the noise term toward high reward. We introduce a whitening operator, the central mechanism behind NTRK, that makes the reward gradient safe to inject as noise without losing its guiding signal. Across various reward alignment tasks, NTRK outperforms recent state-of-the-art baselines without losing sample quality. Remarkably, on aesthetic generation, NTRK surpasses the reward of the best baseline at 500 NFEs using only 25 NFEs, a 20$\times$ reduction in compute.

34. Volterra Generative Models

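AI summary: Proposes Volterra generative models, which inject path-dependent noise through fractional kernels and handle the resulting non-Markovian dynamics with Markovian lifts and residual-state learning, with experiments on MNIST and CIFAR-10.

Link: https://arxiv.org/abs/2606.18071

Institutions: The Hong Kong University of Science and Technology (Guangzhou)

Authors: Yusen Jia, Bingyan Han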

Abstract: Score-based diffusion models typically use Brownian perturbations, which provide tractable reverse-time dynamics but impose memoryless noising. We introduce Volterra generative models, a continuous-time score-based framework whose forward process injects path-dependent noise through fractional kernels. To handle the non-Markovian and non-semimartingale dynamics, we construct finite-dimensional Markovian lifts using Gaussian quadrature in both regimes and a hybrid finite-difference exponential approximation in the smooth regime. We prove squared error bounds, derive an augmented linear-Gaussian forward process, and show that the learning can remain data-dimensional by considering residual states and analytic auxiliary Gaussian scores. We also identify covariance and reverse-time degeneracies caused by shared Brownian factors and signed smooth-regime weights. The degeneracy motivates stabilized conditioning and, for stiff larger lifts, a Gaussian-bridge reconstruction sampler. Experiments on MNIST and CIFAR-10 show that persistent fractional perturbations with small Markovian lifts can improve score-based generation on MNIST and provide a promising extension to natural images, while the bridge sampler provides a stability mechanism for larger lifts.

35. Kolmogorov Regression for Robust Diffusion Policies

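AI summary: Introduces a backward Kolmogorov equation that lifts diffusion policies to a Cameron-Martin space, replacing stochastic score matching with a deterministic boundary-value PDE problem; a precision-weighted loss and a residual diagnostic yield convergence guarantees, improved trajectory regularity, and reward-free failure detection.

Link: https://arxiv.org/abs/2606.18186

Authors: Lekan Molu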

Abstract: Finite-dimensional (FD) diffusion policies exhibit temporal drift owing to discretization artifacts that degrade long-horizon performance (when deployed on physical systems). We introduce a backward Kolmogorov equation that lifts diffusion policies to a Cameron-Martin space (a subset of the Hilbert space), essentially replacing stochastic score matching with a deterministic boundary-value PDE problem. Our core innovation builds on Gaussian measure theory, whereby the diffusion noise covariance operator is realized from a colored noise distribution, which prescribes a notion of regularity on samples from the model at inference time. We train the diffusion model with a derived precision-weighted Cameron-Martin loss, and a Kolmogorov residual is introduced as a PDE diagnostic during inference. These substitutions yield (i) convergence guarantees where the bound's constants depend on the effective rank of the kernel rather than action dimension, (ii) improved trajectory regularity via spectral weighting, and (iii) a deterministic failure detector without reward signals. Validation across two application domains demonstrates substantial improvements: on the PushT manipulation benchmark, the Cameron-Martin loss achieves a 17% improvement in maximum episode reward (0.95 vs. 0.78 for MSE) and 67.6% reduction in inter-step drifts during inference via the introduced residual magnitude. Similarly, on a 6-station manufacturing line with constant work-in-process (CONWIP) flow control, we achieve 28.4% lower RMSE than classical LSTM baselines, high starvation-event recall (1.0 in test cycles), and effective bottleneck identification (Precision@1 = 1.0 in test set, 13x signal-to-noise ratio). We then certify the dispatch policies with Hamilton-Jacobi reachability theory, which reduces deadlock events by 96% compared to uncontrolled dispatch over 100 simulated runs (351 events prevented).

5. Optimization, Generalization, and Theoretical Analysis | 6 papers

36. Noise-Driven Escape from Metastable Phases explains Grokking in Deep Neural Networks

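AI summary: Using linear DNN models, shows that grokking is consistent with hysteresis in first-order phase transitions induced by L2 regularization: SGD noise drives the model out of a low-accuracy metastable state, with escape times following Arrhenius scaling.

Link: https://arxiv.org/abs/2606.17120

Institutions: Complexity Science Group, Institute of Physics and Astronomy, University of Potsdam

Authors: Ibrahim Talha Ersoy, Karoline Wiesner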

Abstract: Deep neural networks (DNNs) exhibit first-order phase transitions under variations of the L2 regularization strength, with each transition marking the onset of a new learnable feature. Below a critical regularization strength, all features are in principle learnable, but coexisting metastable states, separated by energy barriers, can trap the network and impede convergence. A strength of DNNs is their ability to generalize. But many open questions remain, among them the origin of so-called grokking: the abrupt, delayed onset of generalization after prolonged apparent overfitting. We show for linear DNNs that grokking is consistent with hysteresis in first-order L2 phase transitions: using L2 regularization to engineer deliberate trapping, we demonstrate that a model in a low-accuracy metastable state escapes only when SGD noise drives it across an energy barrier, with escape times following Arrhenius scaling. We reproduce grokking-like delayed convergence across two orders of magnitude in escape time by deliberately trapping models in metastable phases. Using sparse sub-sampling we also reproduce the canonical grokking curve where test error eventually approaches the final training error. Our work suggests that the number of metastable states equals the number of learnable features (one per singular value of the data covariance), so the potential for hysteresis grows naturally with task complexity. We provide evidence that the same mechanism likely operates in general nonlinear DNNs. Our results provide routes toward more efficient learning schemes.
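
For reference, Arrhenius scaling in its generic textbook form relates the mean escape time to the height of the barrier being crossed. Reading the effective temperature as the strength of SGD noise is an interpretive assumption here, and the symbols $\tau_0$, $\Delta E$, and $T_{\mathrm{eff}}$ are not taken from the paper.

```latex
\tau_{\mathrm{esc}} \;\approx\; \tau_0 \exp\!\left(\frac{\Delta E}{T_{\mathrm{eff}}}\right),
\qquad
\log \tau_{\mathrm{esc}} \;\approx\; \log \tau_0 + \frac{\Delta E}{T_{\mathrm{eff}}}.
```

Under this form the logarithm of the escape time grows linearly in the ratio of barrier height to noise strength, so modest changes in either can shift the escape time by orders of magnitude.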

37. Sum-of-Squares Degree Barriers for the Reweighted-Hinge Method in Robust Halfspace Learning: A Christoffel-Function Characterization

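AI summary: Uses the Christoffel function to characterize exactly the outlier mass that bounded-degree certificates cannot remove, revealing a fundamental tradeoff between the SoS degree of the certificate and outlier tolerance when the reweighted-hinge method learns γ-margin halfspaces under malicious noise.

Link: https://arxiv.org/abs/2606.17215

Authors: Xiaoyu Li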

Abstract: A certificate that removes outliers sees the data only through its low-degree moments, and an adversary exploits exactly this, hiding corruption where the clean data already looks typical, in the blind spot no bounded-degree test resolves. That blind spot turns out to have an exact size: the Christoffel function of the clean marginal, the very quantity modern data analysis thresholds to detect outliers, here read from the adversary's side as the corruption a bounded-degree certificate cannot remove. We turn this inversion into the organizing principle of the reweighted-hinge approach to robustly learning $\gamma$-margin halfspaces under malicious noise (Shen, 2025; Zeng and Shen, 2025): the governing resource is the Sum-of-Squares degree of the outlier-removal certificate, and the resolution principle states that the maximal corruption mass which can hide at a center $c$ from a degree-$2t$ certificate is exactly the Christoffel function $\lambda_{t+1}(c)$ of the clean marginal. Three consequences follow, all against the certificate method (not information-theoretic). A margin-degree tradeoff: certifying the dense pancake to error $\epsilon$ costs SoS degree $\Omega(\log(1/\epsilon))$ or margin $\Omega(\sqrt{\log(1/\epsilon)}/\sqrt{d})$, explaining why the $\log(1/\epsilon)$ margin Shen (2025) records is forced, with a weighted-Chebyshev reduction making the threshold $2t=\Theta((|c|/s)^2)$ tight modulo one classical weighted-extremal estimate. A degree-$2$ outlier barrier: the resolution principle realized as an explicit instance on which degree $2$ is stuck at $\eta^{1/2}$ while degree $4$ escapes, locating the method's small breakdown rate in the degree, not the analysis. And a degree-$2t$ algorithm tracing the frontier $\eta^{1-1/2t}$ (recovering Shen (2025) at $t=1$), whose gain is an explicit constant, capped by the pancake density and shown unimprovable by the degree-$2$ barrier.

38. Generalization Guarantees for Multi-Input Neural Operator Learning in Sobolev Spaces

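AI summary: For multi-input neural operators, establishes approximation and generalization error estimates in Sobolev norms, quantifies each input space's contribution to the error bound, and shows how input dimensions, regularities, and Sobolev orders interact in the balanced regime.

Link: https://arxiv.org/abs/2606.17419

Institutions: Georgia Institute of Technology; University of Notre Dame; Hong Kong Baptist University

Authors: Yahong Yang, Zecheng Zhang, Wei Zhu, Wenjing Liao, Hao Liu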

Abstract: We develop approximation and generalization error estimates for multi-input neural operators, with the output error measured in Sobolev norms. In contrast to standard operator-learning settings with a single input function, our framework allows multiple input functions defined on possibly different domains, with different dimensions and Sobolev regularities. The derived rates explicitly quantify the contribution of each input space to the final error bound. In particular, in the balanced regime, the approximation and generalization rates are governed by the interaction between the input dimensions, regularities, and Sobolev orders, while the dependence on the model complexity retains a $\log\log/\log$-type structure. Our analysis provides a general theoretical framework for multi-input operator learning, including Sobolev training, and is applicable to operator learning problems arising from partial differential equations and scientific computing.

39. MGUP: A Momentum-Gradient Alignment Update Policy for Stochastic Optimization

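AI summary: Proposes MGUP, a mechanism that augments momentum-based optimizers by applying larger step sizes to a fixed proportion of selected parameters and smaller, non-zero step sizes to the rest; convergence is guaranteed theoretically, and experiments show improved training efficiency and stability.

Link: https://arxiv.org/abs/2606.17526

Institutions: Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; Shenzhen University of Advanced Technology; Pengcheng Laboratory; University of Chinese Academy of Sciences

Authors: Da Chang, Ganzhao Yuan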

Abstract: Efficient optimization is essential for training large language models. Although intra-layer selective updates have been explored, a general mechanism that enables fine-grained control while ensuring convergence guarantees is still lacking. To bridge this gap, we propose MGUP, a novel mechanism for selective updates. MGUP augments standard momentum-based optimizers by applying larger step-sizes to a selected fixed proportion of parameters in each iteration, while applying smaller, non-zero step-sizes to the rest. As a nearly plug-and-play module, MGUP seamlessly integrates with optimizers such as AdamW, Lion, and Muon. This yields powerful variants such as MGUP-AdamW, MGUP-Lion, and MGUP-Muon. Under standard assumptions, we provide theoretical convergence guarantees for MGUP-AdamW (without weight decay) in stochastic optimization. Extensive experiments across diverse tasks, including MAE pretraining, LLM pretraining, and downstream fine-tuning, demonstrate that our MGUP-enhanced optimizers achieve superior or more stable performance compared to their original base optimizers. We offer a principled, versatile, and theoretically grounded strategy for efficient intra-layer selective updates, accelerating and stabilizing the training of large-scale models. The code is publicly available at this https URL.
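
The two-step-size idea in the abstract can be illustrated with a toy per-coordinate update. This is a minimal sketch, not the authors' implementation: the function name `mgup_step`, the plain-momentum base optimizer, and in particular the selection rule (ranking coordinates by the product of momentum and gradient) are assumptions made here for illustration, since the abstract does not state how the fixed proportion is chosen.

```python
def mgup_step(params, grads, momentum, lr=0.1, beta=0.9, ratio=0.5, small_scale=0.1):
    """One momentum step with two step sizes (toy, per-coordinate).

    ASSUMPTION: coordinates are ranked by momentum * gradient (their alignment);
    the abstract does not state the actual selection rule.
    """
    # Exponential moving average of gradients (plain momentum, no Adam scaling).
    new_momentum = [beta * m + (1.0 - beta) * g for m, g in zip(momentum, grads)]
    # Alignment score: positive when momentum and gradient agree in sign.
    scores = [m * g for m, g in zip(new_momentum, grads)]
    # A fixed proportion of coordinates receives the large step.
    k = max(1, round(ratio * len(params)))
    order = sorted(range(len(params)), key=lambda i: scores[i], reverse=True)
    selected = set(order[:k])
    # Selected coordinates use lr; the rest use a smaller, non-zero step.
    new_params = [
        p - lr * (1.0 if i in selected else small_scale) * m
        for i, (p, m) in enumerate(zip(params, new_momentum))
    ]
    return new_params, new_momentum, selected
```

Setting `small_scale=1.0` makes every coordinate take the same step, so the sketch falls back to ordinary momentum descent.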

40. Edge Flow: A Tractable and Predictive Continuous-Time Model for Gradient Descent at the Edge of Stability

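AI summary: For gradient descent dynamics at the edge of stability (EoS) in deep learning, proposes Edge Flow, a system of three coupled ordinary differential equations that decomposes the dynamics into a center, an oscillation direction, and an oscillation magnitude, giving a tractable and predictive model and exposing the sharpness self-stabilization mechanism.

Link: https://arxiv.org/abs/2606.18080

Institutions: Inria, École Normale Supérieure, PSL Research University

Authors: Pierre Marion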

Abstract: Gradient descent in deep learning may operate at the edge of stability (EoS), a regime in which the largest eigenvalue of the loss Hessian hovers near the stability threshold $2/\eta$, where $\eta$ is the learning rate. Classical analysis tools such as gradient flow and the descent lemma do not apply here, motivating the search for a continuous-time model valid at EoS. We propose Edge Flow, a system of three coupled ordinary differential equations that provides a tractable, faithful, and predictive model of gradient descent dynamics at EoS. Edge Flow decomposes the dynamics into a center, an oscillation direction, and an oscillation magnitude. The center follows a modified gradient flow on a symmetrized loss; the direction tracks a top eigenvector of the Hessian via Rayleigh quotient dynamics; and the magnitude grows or decays exponentially depending on whether the sharpness exceeds or falls below the threshold $2/\eta$. Crucially, sharpness stabilization emerges from the coupled dynamics via a self-stabilization feedback loop. Discretizing Edge Flow only requires two gradient evaluations and one Hessian-vector product at each iteration. We demonstrate empirically that Edge Flow tracks the dynamics of gradient descent at least as faithfully as previously proposed continuous-time EoS models, while in addition resolving the oscillation of the sharpness at the onset of EoS, and that it provides a principled framework for understanding and mitigating instabilities in this regime.
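
The stability threshold $2/\eta$ mentioned in the abstract can be seen on a one-dimensional quadratic. The sketch below is not Edge Flow itself; it only runs plain gradient descent on $f(x) = \tfrac{a}{2}x^2$, whose sharpness is $a$, to show that the iterate shrinks when $a < 2/\eta$ and blows up when $a > 2/\eta$. The function name `gd_quadratic` is chosen here for illustration.

```python
def gd_quadratic(sharpness, lr, x0=1.0, steps=50):
    """Run gradient descent on f(x) = 0.5 * sharpness * x**2 and return the last iterate."""
    x = x0
    for _ in range(steps):
        # The gradient is sharpness * x, so each step multiplies x by (1 - lr * sharpness).
        x -= lr * sharpness * x
    return x


if __name__ == "__main__":
    lr = 0.1  # stability threshold is 2 / lr = 20
    for sharpness in (19.0, 21.0):
        print(sharpness, abs(gd_quadratic(sharpness, lr)))
```

Each step multiplies the iterate by $1 - \eta a$, whose absolute value is below 1 exactly when $0 < a < 2/\eta$. The iterate flips sign at every step near the threshold, which is the oscillation whose magnitude the abstract describes as growing or decaying exponentially.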

41. Sign-Rank, Index, and List Replicability: Connections and Separations

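AI summary: Studies lower bounds on the sign rank of binary concept classes by comparing the $\mathbb{Z}_2$-index with the list replicability number. It proves that the $\mathbb{Z}_2$-index is upper-bounded by a linear function of the list replicability number, which yields a strong separation between sign rank and $\mathbb{Z}_2$-index and resolves an open question, and it further establishes upper bounds and a composition result for the list replicability number.

Link: https://arxiv.org/abs/2606.18236

Institutions: McGill University; Ohio State University

Authors: Ari Blondal, Hamed Hatami, Pooya Hatami, Chavdar Lalov, Sivan Tretiak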

Abstract: In learning theory, the sign rank of a binary concept class captures the smallest dimension in which it can be represented by points and halfspaces. Despite tremendous interest, lower bounds on sign rank are notoriously difficult to come by. Two recent approaches to the problem establish lower bounds on sign rank by measures that are easier to analyze: the $\mathbb{Z}_2$-index and the list replicability number. We order these measures, showing that the $\mathbb{Z}_2$-index is upper-bounded by a linear function of the list replicability number. As a main consequence, we obtain a strong separation between sign rank and $\mathbb{Z}_2$-index, thereby resolving a question of Frick, Hosseini, and Vasileuski. This motivates a thorough study of list replicability, the stronger of the two lower-bounding measures. We establish upper bounds on the list replicability number by two combinatorial measures: height and minimum star number. We also prove a fundamental composition result, showing that the product of two concept classes has list replicability number bounded by the sum of the list replicability numbers of the two classes.

6. Efficient Learning, Compression, and Deployment | 8 papers

42. MODE: Modality-Decomposed Expert-Level Mixed-Precision Quantization for MoE Multimodal LLMs

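AI summary: To address cross-modal and intra-vision biases in expert importance estimation for MoE multimodal LLMs, proposes MODE, a modality-decomposed expert-level mixed-precision quantization framework that decomposes selection frequency by modality, filters redundant vision tokens, and evaluates per-modality sensitivity to allocate bit-widths under a given budget, keeping average performance loss within 2.9% at W3A16.

Link: https://arxiv.org/abs/2606.17118

Institutions: Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Zhongguancun Academy

Authors: Yuanteng Chen, Peisong Wang, Zhilei Liu, Nanxin Zeng, Yuantian Shao, Shiqiang Lang, Tao Liu, Chuangyi Li, Qinghao Hu, Gang Li, Jing Liu, Jian Cheng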

Abstract: Mixture-of-Experts Multimodal Large Language Models (MoE-MLLMs) offer remarkable performance but incur prohibitive GPU memory costs, making compression essential. Among PTQ methods, expert-level mixed-precision quantization has proven effective for MoE-LLMs, yet suffers notable degradation on MoE-MLLMs due to two overlooked biases in expert importance estimation. (1) At the cross-modal level, the numerical dominance of vision tokens causes expert selection frequency to be dominated by vision tokens, masking experts that are critical to the text modality; (2) at the intra-vision level, the large proportion of redundant vision tokens further skews frequency statistics, obscuring experts critical for informative visual content. To bridge these gaps, we propose MODE, a modality-decomposed expert-level mixed-precision quantization framework for MoE-MLLMs that decomposes expert selection frequency by modality, filters redundant vision tokens to obtain denoised visual frequency, and further evaluates quantization sensitivity per modality as a complementary signal to frequency-based estimation. These signals are integrated into an Integer Linear Programming formulation to assign per-expert bit-widths under a given budget. Extensive experiments show that MODE is particularly well-suited for MoE-MLLMs, limiting average performance loss to within 2.9% at W3A16, with larger gains at the extreme 2-bit setting.

43. Operator Boosting Produces Pareto-Efficient PDE Surrogates

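AI summary: Proposes Operator Boosting, a residual-learning framework that builds compact neural-operator surrogates directly. Of 30 dataset-architecture pairs, 21 show positive mean accuracy gains, all boosted stacks cut trainable parameters by roughly 72-95%, and 7 of 10 completed PDE benchmarks show empirical Pareto improvements.

Link: https://arxiv.org/abs/2606.17460

Institutions: College of Computing, Georgia Institute of Technology; Department of Mathematics and Systems Engineering, Florida Institute of Technology

Authors: Lennon J. Shikhman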

Abstract: Neural operators are widely used as surrogate solution maps for partial differential equations (PDEs), but full-size models can be costly to store, deploy, and evaluate in many-query scientific workflows. This work introduces Operator Boosting, a stagewise residual-learning framework for constructing compact neural-operator surrogates directly, rather than training a large model and compressing it afterward. Starting from the empirical mean predictor in normalized output coordinates, the method trains a sequence of tiny same-family neural operators on residual fields and incorporates each correction through validation-selected shrinkage. We instantiate the framework with Fourier neural operators (FNOs), DeepONets, and convolutional neural operators (CNOs), and compare boosted tiny stacks against full-size monolithic baselines across one-, two-, and three-dimensional PDE benchmarks from PDEBench, APEBench, and The Well. Across 30 dataset-architecture pairs, 21 show positive mean accuracy gains and 17 have positive confidence intervals, while all boosted stacks reduce trainable parameter count by approximately 72-95%. Best-model comparisons show empirical Pareto improvements on 7 of 10 completed PDE benchmarks, including two-dimensional Navier-Stokes, shallow-water dynamics, Darcy flow, one-dimensional transport and reaction systems, and three-dimensional compressible Navier-Stokes. These results show that Operator Boosting often improves the empirical accuracy-parameter Pareto frontier of neural PDE surrogates, while also exposing PDE- and architecture-dependent regimes where residual boosting fails to offset compression.
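
The stagewise recipe in the abstract (start from the empirical mean, fit a tiny learner to the residual, add it with a shrinkage factor picked on validation data) can be sketched on scalar data. This is a toy analogue under stated assumptions: the weak learner here is a one-threshold stump rather than a neural operator, and the names `fit_stump`, `mse`, and `boost` as well as the candidate shrinkage values are hypothetical.

```python
def fit_stump(xs, residuals):
    """Tiny weak learner: best one-threshold piecewise-constant fit to the residuals."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right) if right else 0.0
        error = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def mse(predict, xs, ys):
    """Mean squared error of a predictor on (xs, ys)."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def boost(train_x, train_y, val_x, val_y, stages=3, shrinkages=(0.25, 0.5, 1.0)):
    """Stagewise residual boosting with validation-selected shrinkage."""
    base = sum(train_y) / len(train_y)  # stage 0: empirical mean predictor
    stack = []  # (shrinkage, learner) pairs in the order they were added

    def predict(x):
        return base + sum(s * h(x) for s, h in stack)

    for _ in range(stages):
        # Fit the next tiny learner to what the current stack still gets wrong.
        residuals = [y - predict(x) for x, y in zip(train_x, train_y)]
        learner = fit_stump(train_x, residuals)
        # Keep the shrinkage factor with the lowest validation error.
        shrink = min(shrinkages, key=lambda s: mse(lambda x: predict(x) + s * learner(x), val_x, val_y))
        stack.append((shrink, learner))
    return predict
```

For example, `boost([0, 1, 2, 3, 4, 5], [0, 0, 0, 1, 1, 1], [0.5, 4.5], [0, 1])` returns a predictor that recovers the step after the first stage; later stages fit zero residuals and leave it unchanged.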

44. ReRAM-aware Model Finetuning addressing I-V Non-linearity and Retention Errors

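AI summary: Proposes a finetuning-based hardware-aware training algorithm that mitigates I-V non-linearity with a range-shrunk sinh transformation and folds retention errors into a regularization loss, enabling DNN deployment on ReRAM with small accuracy loss on image classification and question answering.

Link: https://arxiv.org/abs/2606.17471

Authors: Ching-Yi Lin, Shamik Kundu, Arnab Raha, Sahil Shah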

Abstract: Traditional CPU, GPU, and NPU architectures are increasingly limited by the von Neumann bottleneck. While In-Memory Computing (IMC) using ReRAM crossbar arrays offers a high-density, energy-efficient alternative, its practical deployment is constrained by its non-idealities. Existing hardware-aware training frameworks often require training from scratch, which is computationally prohibitive for modern large-scale models. In this work, we propose a finetuning-based hardware-aware training algorithm that enables robust DNN deployment on ReRAM with minimal training overhead. Our approach mitigates I-V non-linearity by applying a range-shrunk sinh transformation and incorporates retention errors directly into a regularization loss during the finetuning process. We evaluate our framework across models and tasks such as image classification and question-answering (QA). Experimental results demonstrate that our method achieves accuracy similar to the base model on large-scale models like ResNet18 and DeiT-Tiny. On ImageNet, the MobileNetV3 family shows less than 2% accuracy degradation. Further, applying the technique on the SQuAD v2 dataset results in only a 1-point degradation in F1 score.

45. Reconfigurable Computing Challenge: Transformer for Jet Tagging on Versal AI Engines

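AI summary: For jet tagging at the CERN LHC, presents an initial implementation of a quantized, integer-only transformer on the AMD Versal AI Engine, along with a reusable software framework that automatically generates Vitis graph code.

Link: https://arxiv.org/abs/2606.17500

Authors: Gram Koski, Sean Lipps, Zhenghua Ma, G. Abarajithan, Ryan Kastner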

Abstract: Transformer-based models achieve strong performance for jet tagging at the CERN LHC, but deploying them in low-latency, resource-constrained trigger systems is challenging. We present an initial implementation of a quantized, integer-only transformer for jet tagging on the AMD Versal AI Engine (AIE), mapping dense and multi-head attention (MHA) layers to AIE tiles. The main contribution is a reusable software framework that represents transformer layers as composable AIE building blocks and automatically generates the corresponding Vitis graph code from a high-level Python model description. This framework provides a foundation for future research and is released as open-source software at this https URL.

46. Continual Self-Improvement with Lightweight Experiential Latent Memories

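AI summary: Proposes an online method that distills inference-time compute into lightweight, modular latent memories trained with self-generated test-time signals, enabling continual improvement while avoiding catastrophic forgetting.

Link: https://arxiv.org/abs/2606.17803

Institutions: Toyota Motor Europe; University of Trento

Authors: Vaggelis Dorovatas, Nancy Kalaj, Rahaf Aljundi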

Abstract: Large language models achieve strong reasoning performance by scaling inference-time compute, yet remain fundamentally stateless, discarding the rich, self-produced reasoning traces generated during this process. We investigate whether models can instead learn online from this experience, converting transient computation (reasoning traces) into persistent reusable knowledge, and without external supervision or access to future data. We show that In-Context Learning (ICL) over raw reasoning traces fails to generalize, reflecting a fundamental limitation of token-level reuse: individual traces lack the abstraction needed for transfer, even after refinement (e.g. self-reflection). In contrast, drawing inspiration from recent works on unsupervised reinforcement learning, we find that lightweight per-instance training with self-generated test-time signals (majority voting) as rewards yields substantial gains, often surpassing full-dataset offline training, motivating a shift from raw traces to learned latent representations. Building on this insight, we propose an online method that distills inference-time compute spent on encountered problems into compact modular latent memories capturing the underlying reasoning structure. These memories are stored and retrieved for future inputs, enabling continual improvement while avoiding catastrophic forgetting through modular design. Importantly, our method is highly efficient, parametrized as extremely lightweight soft prompt memories (~0.001% of model parameters) and trained with only a few gradient steps, yet achieving performance competitive with full parametric updates and offline training. Across challenging mathematical reasoning benchmarks, our approach significantly outperforms zero-shot and raw data ICL baselines, while transferring effectively across datasets.

47. AnchorKV: Safety-Aware KV Cache Compression via Soft Penalty with a Refusal Anchor

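AI summary: Proposes AnchorKV, a KV cache compression method that uses a soft penalty to bias token retention scores away from key-space directions associated with harmful prompts, substantially improving safety at a small cost in utility.

Link: https://arxiv.org/abs/2606.17872

Institutions: Department of Computer Science, Tufts University; Department of Electrical and Computer Engineering, Tufts University

Authors: Ning Ni, Yingjie Lao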

Abstract: Large language models (LLMs) outperform earlier architectures on generative inference and long-context tasks, but their large size introduces significant challenges in memory usage, energy cost, and on-device deployment. Since scaling pre-trained language models improves downstream capability [zhao2023survey], the key-value (KV) cache becomes a dominant inference bottleneck. Recent KV cache compression methods [jo2025fastkv, li2024snapkv, zhou2024dynamickv] reduce this cost by retaining only a subset of attention-relevant tokens. However, while these approaches preserve accuracy on benign workloads, their compression policies either fail to defend against jailbreak attacks [jiang2024robustkv] or degrade safety alignment under aggressive eviction. We propose AnchorKV, a drop-in modification to KV cache compression that biases token retention scores away from directions in key space associated with harmful prompts. AnchorKV constructs an offline safety anchor by adapting a difference-of-means representation engineering approach [arditi2024refusal, zou2023representation] to the layer-specific key projection space used in KV caching. Based on this anchor, a soft penalty token selection rule trades a small amount of utility for substantially improved safety alignment, while reducing to the original compressor when the penalty is zero.
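
A difference-of-means anchor and a soft-penalty selection rule of the kind the abstract describes might look roughly as follows. This is a sketch under assumptions, not the AnchorKV algorithm: the penalty form (retention score minus `penalty` times the cosine similarity between a token's key and the anchor) and all function names are hypothetical, and real compressors operate per layer and per head on tensors rather than Python lists.

```python
import math


def difference_of_means(harmful_keys, benign_keys):
    """Anchor direction: mean key on harmful prompts minus mean key on benign prompts."""
    dim = len(harmful_keys[0])
    mean = lambda rows, j: sum(row[j] for row in rows) / len(rows)
    return [mean(harmful_keys, j) - mean(benign_keys, j) for j in range(dim)]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_tokens(scores, keys, anchor, budget, penalty=0.0):
    """Keep `budget` tokens by penalized score (ASSUMED penalty form, see lead-in)."""
    adjusted = [s - penalty * cosine(k, anchor) for s, k in zip(scores, keys)]
    ranked = sorted(range(len(scores)), key=lambda i: adjusted[i], reverse=True)
    return sorted(ranked[:budget])  # retained token positions, in sequence order
```

With `penalty=0.0` the adjusted scores equal the original ones, so the rule reduces to plain top-k retention, mirroring the abstract's remark that the method reduces to the original compressor when the penalty is zero.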

48. S4oP: Operator-level Pruning of Structured State Space Models for Resource-Constrained Devices

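AI summary: Proposes an incremental operator-level pruning method for S4 and S4D models that interleaves structured masking with fine-tuning, substantially reducing inference cost while preserving predictive performance; the authors describe it as the first systematic study of structured operator pruning for SSMs.

Link: https://arxiv.org/abs/2606.18096

Authors: Marco Deano, Filippo Ziche, Nicola Bombieri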

Abstract: Structured State Space Models (SSMs), including the S4 and S4D architectures, have recently emerged as powerful alternatives to attention-based models for capturing long-range dependencies in sequential data. Despite their strong empirical performance, deploying these models in time- and resource-constrained settings remains challenging due to their computational and memory demands. In this paper, we propose a novel incremental, operator-level pruning approach for S4- and S4D-based models that significantly reduces inference cost while preserving predictive performance. To the best of our knowledge, this is the first work to systematically investigate structured operator pruning for SSMs. Our method progressively prunes model operators by interleaving structured masking with fine-tuning, while jointly monitoring accuracy and inference latency. We implement this approach within a unified training and evaluation framework that enables systematic exploration of efficiency-accuracy trade-offs. Experiments across multiple benchmark datasets show that pruning up to 70% of the model operators preserves the performance of the original models in most cases, while substantially reducing inference latency. These results demonstrate that structured operator pruning is an effective and previously unexplored strategy for improving the efficiency of SSMs and facilitates their deployment in practical, resource-constrained scenarios.

49. Ternary Mamba: Grouped Quantization-Aware Training of W1.58A16 State Space Models

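AI summary: Proposes grouped quantization-aware training (QAT) from a pretrained checkpoint combined with knowledge distillation, compressing Mamba-2 1.3B by 3.61x with a very small token budget (about 100M tokens), reaching zero-shot accuracy close to Bi-Mamba, and identifying zero-ratio collapse, an instability specific to QAT from pretrained checkpoints.

Link: https://arxiv.org/abs/2606.18114

Institutions: EdgeVerve Systems Limited

Authors: Ramprasath Ganesaraja, Sahil Dilip Panse, Swathika N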

Abstract: State Space Models (SSMs) such as Mamba-2 offer linear-time inference but their memory footprint limits edge deployment. Prior ternary SSM work (Slender-Mamba) trains from scratch on 150B tokens; we show a pretrained checkpoint suffices, reducing the marginal token budget by 1,000x. Using grouped quantization-aware training (QAT) with knowledge distillation from a frozen FP16 teacher, we compress Mamba-2 1.3B by 3.61x (2,687 to 744 MB) and achieve 48.1% zero-shot accuracy (7-task average) in just 102M tokens (4 GPU-hours, single H100), approaching Bi-Mamba's 48.4% (within +/-0.9pp CI). This QAT-from-pretrained setting reveals zero-ratio collapse, a novel instability caused by learnable quantization scales that does not arise in from-scratch training. We further show that post-hoc correction strategies effective for Transformers fail for SSMs due to error accumulation through the recurrence. These results demonstrate that ternary SSMs do not require expensive from-scratch training: QAT from pretrained checkpoints with KD is a data-efficient alternative.
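
To make "W1.58" and "grouped" concrete: a ternary weight takes one of three values, which carries log2(3) ≈ 1.58 bits, and grouping means each block of weights gets its own scale. The sketch below uses a fixed absolute-mean scale per group, which is an assumption made here for simplicity; the abstract states that the paper's scales are learnable, and the names `ternarize_grouped` and `zero_ratio` are hypothetical.

```python
def ternarize_grouped(weights, group_size):
    """Quantize weights to {-1, 0, +1} with one scale per group.

    ASSUMPTION: the scale is the group's mean absolute value (fixed, not learned).
    """
    quantized, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Fall back to 1.0 for an all-zero group to avoid dividing by zero.
        scale = sum(abs(w) for w in group) / len(group) or 1.0
        scales.append(scale)
        # Round to the nearest integer, then clip into the ternary range.
        quantized.extend(max(-1, min(1, round(w / scale))) for w in group)
    return quantized, scales


def zero_ratio(quantized):
    """Fraction of weights quantized to zero."""
    return quantized.count(0) / len(quantized)


if __name__ == "__main__":
    q, s = ternarize_grouped([0.9, -0.1, 0.05, -1.15, 4.0, -4.0, 0.4, 0.0], group_size=4)
    print(q, s, zero_ratio(q))
```

A dequantized weight is `q * scale` for the group it belongs to. The `zero_ratio` helper reports the quantity that the abstract's "zero-ratio collapse" refers to by name; the sketch does not reproduce the collapse itself.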

7. Federated Learning, Privacy, and Security | 1 paper

50. C2FL: Clustered Continual Federated Learning under Spatial and Temporal Drift

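AI summary: For privacy-preserving collective adaptation of nodes under spatial heterogeneity and temporal drift, proposes C2FL, in which nodes self-organize into learning groups through spatial clustering and combine experience replay with dwell-time-aware adaptive averaging to achieve robust collective adaptation.

Link: https://arxiv.org/abs/2606.18003

Authors: Davide Domini, Gianluca Aguzzi, Lorenzo Pellegrini, Mirko Viroli, Lukas Esterle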

Abstract: Collective Adaptive Systems (CAS) increasingly rely on machine learning to let each node learn from locally sensed data, aligning its behavior with the surrounding environment. Scaling this intelligence, however, raises fundamental challenges: sensed data is often privacy-sensitive, preventing centralized collection; nodes are mobile, traversing regions where nearby nodes perceive similar phenomena while distant ones observe radically different conditions, creating natural spatial clusters; and these distributions evolve over time due to mobility, introducing temporal drift that makes local models progressively stale. These dynamics arise across domains - vehicular sensing, drone-based monitoring, smartphone crowdsensing - yet the interplay of privacy, spatial heterogeneity, and temporal drift severely undermines conventional learning strategies. Therefore, we propose C2FL, a fully distributed Federated Learning (FL) approach where nodes self-organize into learning groups through spatial clustering, reflecting the geographic structure of the environment. To counteract temporal drift, each node combines experience replay with a dwell-time-aware adaptive averaging step, progressively incorporating the regional consensus as it remains longer within the same area, while preserving previously acquired knowledge under evolving distributions. We evaluate our approach on synthetic experiments that systematically reproduce spatial and temporal shifts, showing that standard federated strategies degrade significantly under these conditions and that our method restores robust collective adaptation.
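
The dwell-time-aware averaging step can be pictured as a blend between a node's local model and the regional consensus, with the blend weight growing the longer the node stays in one area. The sketch below is an illustration under assumptions: the saturating weight `1 - exp(-dwell_time / tau)`, the time constant `tau`, and both function names are hypothetical, since the abstract does not give the weighting schedule.

```python
import math


def dwell_weight(dwell_time, tau=10.0):
    """Weight on the regional consensus; 0 on arrival, approaching 1 for long stays.

    ASSUMPTION: exponential saturation with time constant tau.
    """
    return 1.0 - math.exp(-dwell_time / tau)


def adaptive_average(local_model, regional_model, dwell_time, tau=10.0):
    """Blend local and regional parameters according to time spent in the region."""
    alpha = dwell_weight(dwell_time, tau)
    return [(1.0 - alpha) * l + alpha * r for l, r in zip(local_model, regional_model)]
```

A node that has just entered a region keeps its own parameters (`dwell_time=0` gives weight 0), which matches the abstract's description of incorporating the regional consensus progressively.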

8. Robustness, Uncertainty, and Trustworthy Learning | 5 papers

51. MM++: Unsupervised Scale-Invariant Multilayer OOD Detection via Top-K Gated Feature Fusion

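AI summary: Proposes MM++, a framework that identifies discriminative intermediate layers through entropy density drops and stabilizes the fused feature space with a Ledoit-Wolf regularized covariance matrix, giving unsupervised, post-hoc, scale-invariant multilayer OOD detection that is robust in both near- and far-OOD settings.

Link: https://arxiv.org/abs/2606.17352

Institutions: School of Computing, State University of New York at Binghamton

Authors: Rahim Hossain, Md Tawheedul Islam Bhuian, Md Farhan Shadiq, Kyoung-Don Kang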

Abstract: We introduce MM++ (Multilayer Mahalanobis++), a fully unsupervised, strictly post-hoc, and scale-invariant framework for Out-of-Distribution (OOD) detection. To address the trade-off between scale invariance and hierarchical expressivity, MM++ constructs a principled joint feature space. It first identifies discriminative intermediate layers by measuring entropy density drops, which mark the boundaries of sharp semantic compression. By fusing these selected layers with the terminal representation, the framework captures latent cross-layer correlations while mitigating early-layer noise. Crucially, a Ledoit-Wolf regularized tied covariance matrix stabilizes this unified space, enabling reliable distance estimation. Requiring no auxiliary OOD data, classifier fine-tuning, or architectural modifications, MM++ delivers robust performance across distinct architectures for both near- and far-OOD detection.

52. MorphStrata: Layer-Specific Perturbations for Generating Morphence Students in Time-Series Moving Target Defense

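AI summary: Proposes MorphStrata, a strategy that generates structurally heterogeneous student models through selective, layer-specific stochastic noise injection. It maintains moving-target-defense robustness while adding less than 1% training overhead in most experiments, and on high-entropy periodic data it reduces RMSE by up to 24.11% under FGSM and 97.97% under BIM.

Link: https://arxiv.org/abs/2606.17435

Authors: Abhishek Bhardwaj, Arnav Doshi, Anusri Nagarajan, Thanh Quynh Nhu Ta, Mohammad Masum, Robert Chun, Jaydip Sen, Saptarshi Sengupta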

Abstract: Time-series forecasting models remain vulnerable to gradient-based adversarial attacks while existing defense mechanisms typically incur a trade-off in robustness for bounded response and compute cost. The problem is pronounced in Moving Target Defense where maintaining multiple randomized model instances substantially exacerbates the training overhead. In this work, we introduce MorphStrata, a student generation strategy with selective, layer-specific stochastic noise injection that extends the traditional Morphence defense. MorphStrata uses a Transformer backbone as the teacher and perturbs randomly selected architectural blocks to create structured heterogeneity across student models in response to varied data distributions and threat models. We evaluate against vanilla Transformer and Morphence backbones on a suite of benchmarks including the Jena Climate, Electricity Load Diagrams, and Appliances Energy Prediction using FGSM, BIM and PGD attacks across multiple attack strengths. Across datasets and attack regimes, the proposed ensemble maintains comparable adversarial RMSE. Specifically, for high entropy, periodic datasets as in the case of the AEP data, MorphStrata achieves the lowest RMSE across all attacks and perturbation budgets, improving over the static baseline by up to 24.11% and 97.97% under FGSM and BIM respectively at an epsilon value of 0.5 over 30 randomized trials. Targeting the layers to generate MorphStrata students accounts for a less than 1% increase in training time over the Morphence MTD baseline for most of the experiments, while accounting for double-digit gains in adversarial RMSE reduction. We also observe a positive correlation between higher pairwise L2 distance (among generated students) and overall defense effectiveness. In summary, MorphStrata maintains adversarial robustness as an MTD defense at marginal cost deltas when compared to existing baselines.

53. Geometry-Aware Post-Hoc Uncertainty Quantification in Operator Learning

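AI summary: Proposes REEF-GP, a framework that fits a Gaussian process to the residuals of a frozen neural operator and uses the operator's intrinsic coordinate-feature representations to build geometry-aware uncertainty, achieving calibrated uncertainty estimates on five PDE benchmarks at a fraction of the cost of deep ensembles.

Link: https://arxiv.org/abs/2606.17513

Institutions: Department of Mechanical and Aerospace Engineering, University of California, Irvine

Authors: Oriol Vendrell-Gallart, Nima Negarandeh, Ramin Bostanabad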

Abstract: Neural operators provide fast surrogates for PDEs but their deterministic predictions limit their use in tasks requiring uncertainty quantification (UQ), especially under geometric variability. Existing approaches primarily model uncertainty in network parameters, largely overlooking the geometry-aware representations learned by the operator itself. We propose REEF-GP (Residual on Embedded Features Gaussian Process), a post-hoc UQ framework that fits a GP to the residuals of a frozen neural operator whose internal embeddings define the kernel feature space. Rather than learning a separate feature map, REEF-GP adapts the operator's intrinsic coordinate-feature representations to construct geometry-aware uncertainties. To ensure stability and scalability on unstructured domains, REEF-GP incorporates spectral-normalized projections, heteroscedastic geometry-aware noise, and efficient subset-based training that avoids restrictive low-rank approximations. Across five PDE benchmarks with varying geometries, REEF-GP preserves predictive accuracy while achieving calibrated uncertainty estimates competitive with deep ensembles but at a fraction of their cost. Our approach remains robust under geometric distribution shift, with uncertainty concentrating in physically meaningful regions (e.g., shock fronts). Our results demonstrate that accurate and scalable post-hoc UQ for neural operators can be achieved directly in their learned feature space, offering a practical alternative to parameter-centric approaches.

54. A fairness-aware extension of Stochastic Multicriteria Acceptability Analysis for ranking

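AI summary: Proposes SMAA-Fair, which reweights simulated rankings according to their group fairness, measured with Statistical Parity, rKL, and nDKL, improving the representation of protected groups in favourable positions while preserving robustness to preference uncertainty.

Link: https://arxiv.org/abs/2606.17756

Institutions: Engineering School, Mackenzie Presbyterian University

Authors: Guilherme Dean Pelegrina, Renata Pelissari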

Abstract: Fairness has become a central concern in ranking problems involving individuals or social groups, particularly under the Responsible Artificial Intelligence agenda. In Multi-Criteria Decision Analysis, Stochastic Multicriteria Acceptability Analysis (SMAA) provides a robust framework for handling uncertainty and incomplete preference information, but it does not explicitly address fairness in the resulting rankings. This paper proposes SMAA-Fair, a fairness-aware extension of SMAA for ranking problems. The approach reweights the simulated rankings generated by SMAA according to their level of group fairness, so that fairer rankings contribute more strongly to the acceptability indices and central weights vector. The framework is independent of the aggregation model and can incorporate different fairness metrics. In this study, Statistical Parity, normalized discounted Kullback-Leibler divergence (rKL) and normalized discounted cumulative Kullback-Leibler divergence (nDKL) are adopted. Rankings are derived from the fairness-adjusted acceptability matrix using expected ranking and maximum acceptability ranking. We also derive the central weight according to the degree of fairness in the obtained rankings. Numerical experiments with synthetic and real data show that SMAA-Fair improves the representation of protected groups among favourable ranking positions, while preserving robustness to preference uncertainty.
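
The reweighting step can be illustrated with a fairness-weighted rank acceptability matrix. This is a minimal sketch, assuming the fairness weight of each simulated ranking is already given as a non-negative number (how it is derived from Statistical Parity, rKL, or nDKL is not shown), and the function name `weighted_acceptability` is hypothetical.

```python
def weighted_acceptability(rankings, fairness):
    """Fairness-weighted rank acceptability indices.

    rankings: list of simulated rankings; ranking[position] is the alternative's index.
    fairness: one non-negative weight per ranking (higher means fairer).
    Returns acc, where acc[alt][position] is the weighted share of simulations
    in which alternative `alt` lands at `position`.
    """
    n = len(rankings[0])
    total = sum(fairness)
    acc = [[0.0] * n for _ in range(n)]
    for ranking, weight in zip(rankings, fairness):
        for position, alt in enumerate(ranking):
            acc[alt][position] += weight / total
    return acc


if __name__ == "__main__":
    sims = [[0, 1, 2], [1, 0, 2], [1, 0, 2]]
    print(weighted_acceptability(sims, [1.0, 1.0, 1.0])[0][0])  # uniform weights
    print(weighted_acceptability(sims, [2.0, 1.0, 1.0])[0][0])  # first ranking counted as fairer
```

With equal weights the matrix is the ordinary share of simulations per rank; raising the weight of a ranking shifts acceptability toward the positions it assigns, which is the effect the abstract attributes to fairer rankings.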

55. No-Free-Fairness: Fundamental Limits and Trade-offs in Learning Systems

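AI summary: Presents the No-Free-Fairness theorems, which identify three inherent sources of disparity in learning systems: irreducible task cost forces a trade-off between performance and fairness, finite samples induce subgroup disparity, and limited model-class expressivity can make fairness unattainable. Unfairness thus stems from the structure of the decision problem, the finiteness of data, and model expressivity.

Link: https://arxiv.org/abs/2606.17810

Institutions: Hanoi University of Science and Technology

Authors: Khoat Than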

Abstract: In this paper, we establish a set of theoretical impossibility results, termed the No-Free-Fairness theorems, that identify three fundamental sources of disparity in learning systems. First, we show that when a task exhibits irreducible cost on a subgroup, any decision rule must trade off overall performance with disparity, yielding an inherent fairness-cost frontier. Second, we prove that even in ideal, noise-free settings where a perfectly fair and accurate solution exists, finite-sample learning alone induces nontrivial subgroup disparity, ruling out distribution-free fairness guarantees. More seriously, enforcing strict relative fairness creates a statistical bottleneck: achieving low cost may require exponentially many samples. Third, we show that limitations of the model class can independently induce disparity: if the model cannot represent accurate solutions for a subgroup, fairness remains unattainable regardless of data or training procedure. Overall, these results demonstrate that unfairness is not solely a consequence of biased data or suboptimal optimization, but arises from the intrinsic structure of decision problems, the constraints of finite data, and the expressivity of models. Our framework applies broadly beyond standard supervised learning, and suggests that achieving fairness requires explicit trade-offs and should be treated as a core design consideration.

9. Graph Learning and Structured Data | 6 papers

56. Towards Fast GNN Surrogates for CO2 Migration in Complex Geological Formations

AI summary: Proposes an end-to-end graph neural surrogate for forecasting CO2 plume migration in geological storage, using anisotropic message passing and an autoregressive residual formulation to produce competitive forecasts on the SPE11A benchmark.

Link: https://arxiv.org/abs/2606.17180

Institutions: Systems and Computer Engineering and High Performance Computing Center, NACAD - COPPE, Federal University of Rio de Janeiro; Civil Engineering and High Performance Computing Center, NACAD - COPPE, Federal University of Rio de Janeiro; Mechanical Engineering and High Performance Computing Center, NACAD - COPPE, Federal University of Rio de Janeiro; Shell Global Solutions International B.V.; TotalEnergies OneTech

Authors: Rodrigo S. Luna, Thiago H. N. Coelho, Luiz S. L. Neto, Roberto M. Velho, Adriano M. A. Cortes, Renato N. Elias, Alexandre G. Evsukoff, Fernando A. Rochinha, Mauricio Araya-Polo, Herve Gross, Alvaro L. G. A. Coutinho

英文摘要：This chapter discusses how a data-driven machine learning approach can reproduce key aspects of the physical behavior of multiphase flows in complex geological formations. We propose an end-to-end graph neural surrogate tailored to CO$_2$ plume migration forecasting in geological storage. The method is evaluated on the SPE11A benchmark, a well-known industry test case designed to assess CO$_2$ storage scenarios and characterized by sharp gas-water interfaces, strong advective transport, and rapid convective mixing with fingering development. The benchmark is reformulated as a graph in which nodes represent computational cells and edges encode transmissibility-based interactions enriched with geometric attributes. Directional transport arising from grid geometry, permeability contrasts, and geological heterogeneity is captured through an anisotropic message-passing mechanism, where interaction weights are computed via geometry-conditioned edge embeddings, biasing message aggregation toward physically relevant transport directions. Temporal evolution is modeled in latent space using an autoregressive residual formulation trained with multi-step supervision. The proposed model produces competitive forecasts of gas saturation and liquid-phase density, which are key indicators for CO$_2$ storage monitoring, with cumulative errors that remain moderate over extended forecasting horizons.

57. Finsler Geometry, Graph Neural Networks, and You

AI summary: Graph-Laplacian architectures approximate the Laplace-Beltrami operator and are therefore limited to isotropic operators. This work proposes a graph neural network layer built on the Finsler Laplacian, proves that the discrete estimates converge, and shows that the resulting networks recover the geometry underlying nonlinear diffusion equations.

Link: https://arxiv.org/abs/2606.17185

Institutions: Rice University

Authors: T. Mitchell Roddenberry, Richard G. Baraniuk

Abstract: Graph neural network architectures based on the graph Laplacian approximate the Laplace-Beltrami operator, thus limiting their application to isotropic operators. As a nonlinear alternative to the Laplace-Beltrami operator, we consider estimates of the Finsler Laplacian on point clouds sampled from a manifold. We prove that these discrete estimates converge to the true operator on the manifold as the number of point samples grows. Moreover, we show that this operator can be expressed as a graph neural network layer, which we use to define a family of Finslerian graph neural networks constrained to express Finsler geometry. We show that Finslerian graph neural networks recover the geometry underlying nonlinear diffusion equations in practice.

58. Non-negative Matrix Factorisation with Topological Regularisation

AI summary: Integrates persistent homology into the non-negative matrix factorisation objective as a topological regulariser, in order to learn interpretable basis functions for spatially coherent components, periodic structures, or clique-like graph signals.

Link: https://arxiv.org/abs/2606.17531

Institutions: Recursive Inc.; Graduate School of Science, Kyoto University; Institute of Mathematics for Industry, Kyushu University

Authors: Matias de Jong van Lier, Shizuo Kaji, Keunsu Kim

Abstract: We investigate the learning of interpretable bases in non-negative matrix factorisation (NMF) by regularising the topology of the learned basis functions. Our approach is motivated by the observation that many data modalities can be viewed as non-negative functions on a structured domain, where the quality of a basis is intrinsically linked to its topology. However, naive methods for incorporating the topology of the support are often hindered by discreteness and threshold dependence, rendering them unsuitable for continuous optimisation. We address these challenges by employing persistent homology as a stable, threshold-free topological quantifier and by designing topological scores that integrate into the NMF objective as regularisers. The resulting framework encompasses spatially coherent image components, periodic time-series structures, and clique-like graph signals within a unified modelling language.

59. LLM Features Can Hurt GNNs: Concatenation Interference on Homophilous Graph Benchmarks

AI summary: Finds that introducing LLM features into graph neural networks by pure input concatenation (rather than joint training) can systematically reduce accuracy on homophilous benchmarks, and reports Delta_sig, a measure of LLM-alone discriminability, to predict whether concatenation helps or hurts.

Link: https://arxiv.org/abs/2606.17579

Authors: Zhongyuan Wang, Pratyusha Vemuri

Abstract: Adding LLM-generated node features to graph neural networks (GNNs) is widely reported to improve accuracy on standard benchmarks. We document a contrasting observation: when LLM features are introduced through pure input concatenation (rather than joint training, distillation, or prompt-conditioning), they can systematically degrade accuracy on the same homophilous benchmarks where end-to-end LLM pipelines succeed. With an MLP backbone on the Planetoid public split and bag-of-words original features, concatenating SBERT-encoded GPT-4o-mini TAPE features reduces PubMed test accuracy by -17.0 +/- 0.3 pp and Cora by -4.3 +/- 0.6 pp (CiteSeer -0.6 +/- 0.8 pp, within seed noise). The drop attenuates as we relax each condition (GCN / GCNII / GAT backbones, random splits, smaller encoders) and reverses on medium-homophily WikiCS (+4.4 pp) and ogbn-arxiv (+11.7 pp). To predict when concatenation helps versus hurts, we report a simple measure of LLM-alone discriminability, Delta_sig. Across 9 datasets Delta_sig correlates with the concatenation cost more strongly than homophily at point estimate (r^2 = 0.38 vs. 0.06; N=9, bootstrap CIs overlap). The bootstrap-best change-point is tau = 13.8 pp, and the rule "Delta_sig <= tau predicts non-positive concat cost" classifies 7/9 datasets correctly; since 60% of bootstrap samples place tau in [5, 30] pp, we treat Delta_sig as an interpretive lens rather than a precision filter. A dimension-controlled ablation on PubMed places the LLM-feature drop between same-source PCA (-2.3 pp) and same-dim Gaussian noise (-37.3 pp), ruling out dimensionality and weight-decay artifacts. Nine PubMed configurations fit a power law |Delta_concat| proportional to (sqrt(d_l/n))^1.31 with r^2 = 0.97; the low-Delta_sig, small-n corner is exactly where the headline -17 pp PubMed deficit appears.
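
To make the power-law statement in this abstract concrete, the sketch below fits y = c * x^k by ordinary least squares in log-log space. The data points are synthetic and are generated with the exponent 1.31 quoted above purely for illustration; `fit_power_law` is a hypothetical helper, not code from the paper, and the authors' fitting procedure may differ.

```python
import math


def fit_power_law(x, y):
    """Least-squares fit of y = c * x**k in log-log space; returns (k, c)."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    # Slope of the regression line through the log-transformed points.
    k = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly)) / sum(
        (a - mean_x) ** 2 for a in lx
    )
    c = math.exp(mean_y - k * mean_x)
    return k, c


# Synthetic stand-ins for sqrt(d_l / n) and |Delta_concat| (not the paper's data).
x_values = [0.1, 0.2, 0.4, 0.8]
y_values = [2.0 * v ** 1.31 for v in x_values]
exponent, scale = fit_power_law(x_values, y_values)
```

On noise-free synthetic data the fit recovers the generating exponent; on real measurements the r^2 of the log-log regression would indicate how well a power law describes them.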

60. Handling Feature Heterogeneity with Learnable Graph Patches

AI summary: Introduces learnable graph patches, which decompose a graph into semantic units; a patch encoder and a patch aggregator enable transferable pre-training on cross-domain graph data and improve downstream task performance.

Link: https://arxiv.org/abs/2606.17667

Institutions: Zhejiang University; Huazhong University of Science and Technology; Finvolution Group

Authors: Yifei Sun, Yang Yang, Xiao Feng, Zijun Wang, Haoyang Zhong, Chunping Wang, Lei Chen

Abstract: In recent years, the rapid development of foundation models and graph pre-training technologies has spurred increasing interest in constructing a universal pre-trained graph model or Graph Foundation Model (GFM). However, a significant challenge is that existing models are unable to address feature heterogeneity in graph data without textual information, which hinders the transferability of graph models across different datasets. To bridge this gap, we propose the concept of learnable graph patches, which we regard as the smallest semantic units of any graph data. We decompose the graph into learnable graph patches by unfolding the node features and constructing corresponding patch structures separately. We then design a framework that mines transferable information from graph data across domains. Specifically, after extracting graph patches, we propose a patch encoder to extract knowledge from each unit and a patch aggregator to learn how the units are combined into a whole. Due to its domain-agnostic nature, the model can be applied to downstream data across different domains. Furthermore, we analyze the connection between our method and existing graph models, as well as the transferability of the node embeddings it generates. Empirically, our method not only achieves the capability to use multi-domain graphs for pre-training, but also shows enhanced performance across various downstream datasets and tasks. Moreover, we observe consistent improvement in downstream performance as the volume of pre-training data increases.

61. Half a Link can Be Enough to Predict a Whole Link: Understanding Generalization in Knowledge Graph Foundation Models

AI summary: Analyzes the zero-shot generalization of knowledge graph foundation models on unseen graphs, finds that the models rely on partially seen "half-links" for prediction, and derives a four-scenario taxonomy that exposes the generalization mechanisms of existing models and where they can improve.

Link: https://arxiv.org/abs/2606.18001

Institutions: Institute for AI, University of Stuttgart; University of Southampton; University of Edinburgh

Authors: Cosimo Gregucci, Obaidah Theeb, Daniel Hernandez, Antonio Vergari, Steffen Staab

Abstract: Knowledge graph (KG) foundation models (KGFMs) are zero-shot generalizers: trained once, they can predict links on unseen graphs without retraining. However, understanding when and how they can robustly generalize across KGs is still an open question. In this paper, we shed some light on their generalization mechanisms, highlighting how their performance on unseen KGs is not uniform when it comes to partially seen links, which we call half-links. In fact, we show that to predict a test triple $(h,r,t)$ it might suffice in practice to have observed the half-link $(h,r)$ or $(r,t)$ in the inference graph. This yields a taxonomy of four scenarios when combinations of these half-links are observed or not. In a rigorous stratified analysis over these scenarios, we reveal that SoTA KGFMs use seen half-links for predictions, while unseen half-links pose different challenges. As such, our finer-grained taxonomy can be a diagnostic protocol for robust KGFM generalization and highlights where novel KGFMs can improve.

10. Transfer, Meta-Learning, and Continual Learning | 6 papers

62. A Risk Decomposition Framework for Pre-Hoc Fine-Tuning Prediction

AI summary: Proposes a risk decomposition framework that splits the risk of pre-hoc fine-tuning performance prediction into an intrinsic limit and a reducible optimization variance, proves a lower bound on the decay rate of the optimization variance, and derives a budget-optimal probing principle and a predictability phase diagram.

Link: https://arxiv.org/abs/2606.17649

Authors: Yuxiang Luo, Chen Wang, Nan Tang

Abstract: The high cost of fine-tuning LLMs poses a significant economic barrier; pre-hoc performance prediction offers a critical solution to substantially reduce this expense. However, the theoretical limits of pre-hoc performance prediction remain unexplored. We formulate it as a stochastic estimation problem under information constraints, decomposing prediction risk into two components: an intrinsic limit (static data-model compatibility) and a reducible optimization variance. We prove that optimization variance admits a necessary lower bound on its decay rate, implying fundamental constraints on how quickly uncertainty dissipates, regardless of the predictor used. Based on these dynamics, we derive a budget-optimal probing principle and introduce a predictability phase diagram that organizes tasks into three distinct regimes: Static-Sufficient, Dynamic-Critical, and Noise-Dominant. Extensive experiments on synthetic and real-world benchmarks validate these theoretical regimes and demonstrate the efficiency of our probing strategy.

63. TuneAhead: Predicting Fine-tuning Performance Before Full Training Begins

AI summary: Presents the TuneAhead framework, which predicts fine-tuning performance before training from meta-feature vectors, with SHAP attributions for diagnostics; on Qwen2.5-7B-Instruct it achieves an RMSE of 1.47 percentage points, with 95.1% of predictions within ±3 percentage points of the true score.

Link: https://arxiv.org/abs/2606.17660

Authors: Yuxiang Luo, Haonan Long, Chen Wang, Qiqi Duan, Xiaotian Lin, Yanwei Xu, Yuyu Luo, Weikai Yang, Nan Tang

Abstract: Fine-tuning large language models (LLMs) is compute-intensive and error-prone: model performance depends sensitively on data quality and hyperparameter choices, and naïve runs can even degrade model performance. This raises a practical question: can we predict fine-tuning performance before committing to a full training run? We present TUNEAHEAD, a lightweight framework for pre-hoc prediction of fine-tuning performance. TUNEAHEAD encodes each candidate run as a meta-feature vector that combines static dataset descriptors with dynamic probe features from a short standardized probe. A predictor maps these features to performance estimates, while SHAP-based attributions provide interpretable diagnostics that reveal which specific features drive the prediction. Across 1,300+ fine-tuning runs on Qwen2.5-7B-Instruct, TUNEAHEAD consistently outperforms strong baselines such as Early-Stop Extrapolation and ProxyLM. On a held-out test set of 370 runs, TUNEAHEAD achieves an RMSE of 1.47 percentage points and places 95.1% of predictions within ±3 percentage points of the true score. These accurate continuous predictions support practical go/no-go screening policies that can reduce unnecessary full fine-tuning while retaining most promising runs.
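
The two headline metrics in this abstract (RMSE in percentage points and the share of predictions inside a ±3-point band) are standard and can be computed as in the minimal sketch below. The score lists are made-up values for illustration, and `go_no_go` with its threshold is a hypothetical screening rule, not the policy used by the authors.

```python
import math


def rmse(pred, true):
    """Root-mean-square error, in the same units as the scores."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))


def within_band(pred, true, band=3.0):
    """Fraction of predictions whose absolute error is at most `band`."""
    hits = sum(1 for p, t in zip(pred, true) if abs(p - t) <= band)
    return hits / len(true)


def go_no_go(predicted_score, threshold):
    """Hypothetical screening rule: launch full fine-tuning only above a threshold."""
    return predicted_score >= threshold


# Made-up predicted and true benchmark scores, in percentage points.
predicted = [70.0, 65.0, 80.0, 50.0]
actual = [71.0, 64.0, 84.0, 50.0]
error = rmse(predicted, actual)
share = within_band(predicted, actual)
launch = [go_no_go(p, threshold=66.0) for p in predicted]
```

A screening policy of this kind trades skipped compute against the risk of discarding a run whose true score would have cleared the bar, which is why the abstract reports both an average error and a band-coverage figure.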

64. Confusion-Aware Transfer Teacher Curriculum Learning Framework: Disentangling Scoring and Pacing Effects

AI summary: Proposes a confusion-aware difficulty score and disentangles the scoring and pacing effects of curriculum learning using stage-wise test subsets and a random-order baseline. On CIFAR-10 the score yields interpretable difficulty rankings but no accuracy gain at full data; it improves data efficiency only in low-data regimes.

Link: https://arxiv.org/abs/2606.17706

Authors: Savini Kommalage, Sanka Mohottala, Asiri Gawesha, Dulara Madhusanka, Menan Velayuthan, Dharshana Kasthurirathna, Mahima Milinda Alwis Weerasinghe, Charith Abhayaratne

Abstract: Curriculum learning couples two design choices, how samples are scored by difficulty and how harder samples are paced into training, making it difficult to attribute observed gains to either component. We disentangle these factors with two evaluation protocols: stage-wise test subsets that validate scoring functions independently of curriculum training, and a baseline that applies the same pacing schedule to randomly ordered data. Within the Transfer Teacher framework (TTF), we use these protocols to evaluate a confusion-aware difficulty score that considers both correct-class confidence and the probability distribution over incorrect classes. On CIFAR-10 with ResNet-18 and VGG-16, the proposed score produces model-interpretable difficulty rankings that align with human intuition. However, at full data, neither curriculum nor anti-curriculum ordering improves accuracy over standard training, indicating that improving the scoring function alone is insufficient to overcome the known failure modes of curriculum learning in TTF. In contrast, we find that confusion-aware curriculum ordering results in consistent data-efficiency benefits, outperforming random ordering by up to 8.7 percentage points at the 20% data regime, suggesting the potential of TTF as a data-efficient training method.
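
The random-order baseline described in this abstract isolates pacing from scoring: every ordering sees the same number of samples at each step, and only the order differs. The sketch below illustrates that protocol under stated assumptions: the linear pacing schedule, the starting fraction of 0.2, and all function names are hypothetical, since the abstract does not specify the schedule.

```python
import random


def linear_pacing(step, total_steps, start_frac=0.2):
    """Fraction of the ordered training set exposed at `step` (assumed linear)."""
    frac = start_frac + (1.0 - start_frac) * step / max(1, total_steps - 1)
    return min(1.0, frac)


def paced_pool(order, step, total_steps):
    """Prefix of `order` that the pacing schedule exposes at `step`."""
    size = max(1, round(linear_pacing(step, total_steps) * len(order)))
    return order[:size]


def make_orders(difficulty, seed=0):
    """Curriculum, anti-curriculum, and random orderings of the sample indices."""
    indices = list(range(len(difficulty)))
    curriculum = sorted(indices, key=lambda i: difficulty[i])  # easy to hard
    shuffled = indices[:]
    random.Random(seed).shuffle(shuffled)
    return {"curriculum": curriculum, "anti": curriculum[::-1], "random": shuffled}


# Made-up difficulty scores for five samples.
orders = make_orders([0.9, 0.1, 0.5, 0.3, 0.7])
steps = 5
pool_sizes = {
    name: [len(paced_pool(order, s, steps)) for s in range(steps)]
    for name, order in orders.items()
}
```

Because `pool_sizes` is identical for all three orderings, any accuracy difference between them can be attributed to the scoring function rather than to the pacing schedule.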

65. Dimensionality Controls When Modularity Helps in Continual Learning

AI summary: Studies how modular architecture, task similarity, and representational dimensionality jointly shape compositional learning in continual learning; modular structure becomes decisive in the low-dimensional "rich" regime and has little impact in the high-dimensional "lazy" regime.

Link: https://arxiv.org/abs/2606.17889

Authors: Kathrin Korte, Christian Medeiros Adriano, Joachim Winther Pedersen, Eleni Nisioti, Sebastian Risi

Abstract: Compositional learning systems must balance plasticity, the ability to acquire new knowledge, with stability, the preservation of previously learned components, especially when tasks share structure and risk interference. We study how modular architecture, task similarity, and representational dimensionality jointly shape compositional continual learning in a sequential A-B-A paradigm, comparing a task-partitioned recurrent network to a single-network baseline while inducing high- and low-dimensional regimes via weight-scale manipulations. In a high-dimensional "lazy" regime, both architectures achieve similar performance and internal geometry, suggesting that explicit modular structure has little impact when representations are weakly constrained. In a lower-dimensional "rich" regime, modularity becomes decisive: the modular network develops graded task-specific subspaces that overlap for similar tasks, partially align for moderately dissimilar tasks, and separate for dissimilar tasks, yielding a more compositional and interpretable organization than the single network. These findings identify the representational regime induced by initialization scale, which co-varies with representational dimensionality, as a key factor governing when compositional, modular structure is functionally beneficial in continual learning, and support viewing safety and robustness as problems of adaptive allocation of representational subspaces rather than fixed separation versus sharing.

66. Catastrophic Forgetting is Low-Rank: A Function-Space Theory for Continual Adaptation

AI summary: Develops a function-space theory in the neural tangent kernel (NTK) regime, derives a closed-form expression for the old-task prediction drift caused by new-task training, shows that forgetting concentrates in a small number of old-task NTK eigenmodes, and gives a Kronecker scaling rule for the vulnerable rank.

Link: https://arxiv.org/abs/2606.18024

Authors: Ido Nitzan Hidekel, Dan Raviv

Abstract: Catastrophic forgetting in continual adaptation is usually studied through parameter drift, replay, or distillation, but these views do not identify which output-space directions are vulnerable. We give a function-space account in the NTK regime: new-task training induces old-task prediction drift through the cross-task kernel, yielding a closed-form predictor for the forgetting vector before any new-task gradient step. In frozen-backbone linear-head PEFT-CL, where the model is linear in the trainable parameters, the predictor is exact up to numerical precision; for nonlinear adapters/full fine-tuning, it is a local NTK approximation. The same expression reveals that forgetting concentrates in a small number of old-task NTK eigenmodes and under frozen linear heads gives a Kronecker scaling rule for the vulnerable rank. These results clarify the relation to prior NTK-overlap theory, explain why parameter-space regularizers can miss output-space interference, and motivate a targeted spectral regularizer.
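
For the frozen-backbone, linear-head case mentioned in this abstract, the role of the cross-task kernel can be checked by hand. The sketch below is a simplified instance, assuming a squared loss and a single gradient step (the paper's predictor covers more than one step): the drift in old-task predictions equals -lr * K_old,new * residual, where K_old,new holds inner products between old-task and new-task features. All names and the toy features are hypothetical.

```python
def matvec(matrix, vector):
    """Matrix-vector product for lists of rows."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def cross_kernel(phi_old, phi_new):
    """K[i][j] = inner product of old-task feature i and new-task feature j."""
    return [[sum(a * b for a, b in zip(fo, fn)) for fn in phi_new] for fo in phi_old]


def predicted_drift(phi_old, phi_new, w, y_new, lr):
    """Drift predicted from the cross-task kernel, without updating the weights."""
    residual = [p - y for p, y in zip(matvec(phi_new, w), y_new)]
    return [-lr * d for d in matvec(cross_kernel(phi_old, phi_new), residual)]


def actual_drift(phi_old, phi_new, w, y_new, lr):
    """Drift measured by taking one gradient step on 0.5 * ||phi_new w - y_new||^2."""
    residual = [p - y for p, y in zip(matvec(phi_new, w), y_new)]
    grad = matvec([list(col) for col in zip(*phi_new)], residual)
    w_next = [wi - lr * g for wi, g in zip(w, grad)]
    return [a - b for a, b in zip(matvec(phi_old, w_next), matvec(phi_old, w))]


# Toy frozen features: three old-task inputs, two new-task inputs, two dimensions.
phi_old = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
phi_new = [[1.0, 2.0], [0.5, -1.0]]
weights, targets, step_size = [0.3, -0.2], [1.0, 0.0], 0.1
predicted = predicted_drift(phi_old, phi_new, weights, targets, step_size)
measured = actual_drift(phi_old, phi_new, weights, targets, step_size)
```

The two vectors agree up to floating-point error because the model is linear in the trainable head, which is the sense in which the abstract calls the linear-head predictor exact.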

67. From Reasoning Traces to Reusable Modules: Understanding Compositional Generalization in Language Model Reasoning

AI summary: Formalizes compositional generalization through a hierarchical latent selection model, shows theoretically that SFT supplies the module material while RL decomposes traces to enable compositional generalization, and verifies experimentally that RL can extract atomic modules from compound traces and recombine them to solve new configurations.

Link: https://arxiv.org/abs/2606.18089

Authors: Lingjing Kong, Xin Liu, Guangyi Chen, Martin Q. Ma, Xiangchen Song, Yuekai Sun, Mikhail Yurochkin, Taylor W. Killian, Ruslan Salakhutdinov, Kun Zhang, Eric P. Xing, Zhengzhong Liu

Abstract: Post-training pipelines that combine supervised fine-tuning (SFT) with reinforcement learning (RL) have emerged as the key recipe for transforming large language models (LLMs) into robust reasoners. We argue that this combined success is driven by compositional generalization, which we formalize through a hierarchical latent selection model. In this framework, reasoning traces are generated by a cascade of discrete latent selection variables corresponding to reusable atomic modules, including both skills (local operations) and routing mechanisms (how intermediate information is selected, reused, and composed). Within this model, we theoretically show that SFT and RL play asymmetric, complementary roles: SFT supplies the raw module materials in compositional traces, and RL decomposes those traces to identify the latent atomic modules and enable compositional generalization. We design controlled experiments to validate this theory. Our results demonstrate that RL can extract atomic modules from compound traces supplied by SFT and recombine them to solve new configurations. Moreover, we find that training on compound traces yields stronger generalization than training on isolated atomic modules. Finally, we investigate the relationship between SFT and RL data and identify an effective protocol in which SFT ensures coverage of all atomic modules through compositional traces, while RL focuses on novel compositions outside the SFT support to drive exploration.

11. Datasets, Benchmarks, and Evaluation | 6 papers

68. ProCUA-SFT Technical Report

AI summary: Presents the ProCUA-SFT dataset: an automated pipeline distills 3.1M step-level SFT samples from synthetic trajectories spanning 2,484 application combinations; fine-tuning UI-TARS 7B on it reaches a 45.0% success rate on OSWorld, 18.7 percentage points above the base model.

Link: https://arxiv.org/abs/2606.17321

Institutions: NVIDIA; University of Washington; Allen Institute for AI

Authors: Jaehun Jung, Ximing Lu, Brandon Cui, Muhammad Khalifa, Shaokun Zhang, Hao Zhang, Jin Xu, Amala Sanjay Deshmukh, Karan Sapra, Andrew Tao, Yejin Choi, Jan Kautz, Mingjie Liu, Yi Dong

Abstract: Training computer-use agents (CUAs) -- models that interact with graphical desktops through screenshots and keyboard/mouse actions -- requires large-scale, diverse trajectory data collected in full desktop environments. The largest public resource, AgentNet (22.5K human trajectories), leads to negative transfer when used for supervised fine-tuning (SFT): continuing to train UI-TARS 7B on AgentNet causes OSWorld success rate to fall from 26.3% to 8-10%. We present ProCUA-SFT, a dataset of 3.1M step-level SFT samples distilled from 93K synthetic trajectories across 2,484 application combinations. The dataset is produced by a fully automated pipeline that (i) synthesizes grounded tasks on live desktops seeded with real-world content -- 912 spreadsheets from SpreadsheetBench, approximately 10K permissively-licensed presentations from Zenodo10K, and multi-application OSWorld configs -- and (ii) verifies each task's feasibility through binary precondition checking before rollout. A single VLM (Kimi-K2.5) serves as goal generator, precondition judge, and trajectory executor, eliminating planner-actor capability gaps. Each trajectory is expanded into step-prefix samples that exactly reproduce the context layout seen at inference time. Fine-tuning UI-TARS 7B on ProCUA-SFT for one epoch yields 45.0% on OSWorld -- an 18.7 percentage-point improvement over the base model and over 35% above AgentNet-trained counterparts. A subset of ProCUA was incorporated into the training data for the Nemotron 3 Nano Omni model, contributing to its computer-use capabilities.

69. CheckMIABench: Firm Foundations For Membership Inference Attacks on Language Models

AI summary: To address distribution shift in the evaluation of membership inference attacks, proposes a benchmark framework built on the observation that training data before and after a fixed point in training are drawn from the same distribution; evaluates several attacks on Pythia and OLMo models and open-sources a modular library.

Link: https://arxiv.org/abs/2606.17464

Authors: Jeffrey G. Wang, Jason Wang, Marvin Li, Seth Neel

Abstract: Membership inference attacks (MIAs) are a canonical way to assess a machine learning model's privacy properties. Although several attempts have been made to evaluate MIAs on language models, the extant literature has suffered numerous difficulties in constructing clean evaluations to test new techniques. In particular, subtle distribution shifts between member and non-member sets can undermine the statistical validity of MIAs; recent work has underscored this by showing that "blind" methods with no access to the underlying model can perform far better than published methods on the same benchmarks. This paper constructs a benchmark for principled evaluation of MIAs against LLMs, by leveraging the insight that training data before and after a fixed point during training are drawn from the same distribution. Therefore, all open-source models with intermediate checkpoints and public training data can be converted into MIA testbeds. We apply our framework to a half-dozen published attacks on the Pythia and OLMo family of models, from 70M to 7B parameters. To facilitate further privacy research, we open-source a modular library for designing and implementing attacks in this setting: this https URL.

70. When the Next Step Is Not One Step: Distribution-Aware Execution Modeling for Concurrent Go Programs

AI summary: Because nondeterministic scheduling makes single-label next-step prediction ill-suited to concurrent programs, proposes distribution-aware training: aggregate multiple runs into an empirical distribution and fine-tune a 7B model to match it. On predictions drawn from real Go bugs it reaches 36.2% accuracy and lowers Expected Calibration Error.

Link: https://arxiv.org/abs/2606.17508

Institutions: University of Colombo School of Computing

Authors: Kaviru Hapuarachchi

Abstract: Training a model to predict the next step in a concurrent program is harder than it looks: two runs of the same program from the same trace prefix can produce different next events, both valid, because the scheduler is nondeterministic. A model trained against a single label is learning to guess one outcome of a random process. We turn this around and use the nondeterminism as a training signal. We run each program many times, aggregate the observed next events into an empirical distribution, and fine-tune a 7B model to match that distribution with a KL objective. On 798 held-out predictions drawn from real production Go bugs (CockroachDB, Kubernetes, gRPC, etcd), fine-tuning on fewer than a thousand traces reaches 36.2% accuracy, ahead of Gemini 3.5 Flash used zero-shot (34.8%) and the same model without fine-tuning (28.6%). Distribution training matches cross-entropy on accuracy (35.8% vs. 36.2%) while reducing Expected Calibration Error from 0.205 to 0.169. We also derive a formal goroutine-leak signature for a class of select-blocked goroutines where P(GoUnblock)=0 holds by scheduler semantics, not by learning. We release the dataset, trained adapters, and all tooling.
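
The training target described in this abstract can be sketched in a few lines: repeated runs from the same trace prefix are pooled into an empirical next-event distribution, and a model's predicted distribution is scored against it with a KL term. The sketch below is illustrative only. The event labels other than GoUnblock, the direction KL(target || model), and the `eps` floor are assumptions, since the abstract does not state them.

```python
import math
from collections import Counter


def empirical_distribution(observed_events):
    """Relative frequency of each next event across repeated runs."""
    counts = Counter(observed_events)
    total = sum(counts.values())
    return {event: count / total for event, count in counts.items()}


def kl_divergence(target, model, eps=1e-12):
    """KL(target || model) over the support of `target`; `eps` avoids log(0)."""
    return sum(
        p * math.log(p / max(model.get(event, 0.0), eps))
        for event, p in target.items()
        if p > 0
    )


# Next events observed in four hypothetical runs from the same trace prefix.
runs = ["GoUnblock", "GoUnblock", "GoUnblock", "GoSched"]
target = empirical_distribution(runs)
uniform_guess = {"GoUnblock": 0.5, "GoSched": 0.5}
loss_uniform = kl_divergence(target, uniform_guess)
loss_perfect = kl_divergence(target, target)
```

A single-label objective would treat three of the four runs as "correct" and the fourth as noise, whereas the distributional target keeps the minority outcome as part of the signal.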

71. Offline Preference-Based Trajectory Evaluation

AI summary: Offline evaluation that uses only terminal success is statistically inefficient. This work proposes preference-based trajectory evaluation, which compares trajectories through temporal preferences, reducing ties and improving discriminative power, ranking stability, and data efficiency.

Link: https://arxiv.org/abs/2606.17541

Institutions: Carnegie Mellon University

Authors: Fernando Diaz

Abstract: Offline evaluation of agentic systems often collapses trajectories to terminal success, discarding information about partial progress and inducing widespread ties, creating substantial statistical inefficiency by reducing effective sample size and weakening the ability to distinguish systems. We propose preference-based trajectory evaluation, which compares trajectories directly through temporal preferences over progress and time-to-return profiles. We find that, across diverse agentic and interactive benchmarks, standard success-based metrics produce tied comparisons on roughly 75% of instances, whereas trajectory-aware preferences reduce ties to roughly 35%, improving discriminative power, ranking stability, and data efficiency. Our results suggest that benchmark saturation, often attributed to poor data collection or problem difficulty, may also be explained by the choice of evaluation measure.
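
The tie-rate argument in this abstract can be illustrated with a toy comparison of two systems on the same instances. In the sketch below, `success_pref` compares terminal success only, while `trajectory_pref` uses an assumed rule (higher final progress wins, and on equal progress the shorter trajectory wins). That rule, the record layout, and the data are hypothetical stand-ins, not the paper's preference definition.

```python
def success_pref(a, b):
    """Return +1, 0, or -1 by comparing terminal success only."""
    return (a["success"] > b["success"]) - (a["success"] < b["success"])


def trajectory_pref(a, b):
    """Assumed rule: higher final progress wins; ties broken by fewer steps."""
    key_a = (a["progress"][-1], -len(a["progress"]))
    key_b = (b["progress"][-1], -len(b["progress"]))
    return (key_a > key_b) - (key_a < key_b)


def tie_rate(pairs, pref):
    """Share of paired comparisons on which `pref` cannot separate the systems."""
    return sum(1 for a, b in pairs if pref(a, b) == 0) / len(pairs)


# Four made-up instances, each with one trajectory per system.
pairs = [
    ({"success": True, "progress": [0.5, 1.0]},
     {"success": True, "progress": [0.2, 0.6, 1.0]}),
    ({"success": False, "progress": [0.1, 0.4]},
     {"success": False, "progress": [0.1, 0.2]}),
    ({"success": True, "progress": [1.0]},
     {"success": False, "progress": [0.3]}),
    ({"success": False, "progress": [0.0, 0.5]},
     {"success": False, "progress": [0.2, 0.5]}),
]
success_ties = tie_rate(pairs, success_pref)
trajectory_ties = tie_rate(pairs, trajectory_pref)
```

Every tied instance contributes nothing to a paired comparison, so lowering the tie rate raises the effective sample size without collecting more data.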

72. Meta-classification of one-class classification models using ranking correlation and nearest neighbor

AI summary: Proposes meta-classification of one-class classification models using ranking correlation and nearest neighbors; experiments show high accuracy when the class labels are datasets, and the task is shown to be essentially dataset classification.

Link: https://arxiv.org/abs/2606.17858

Institutions: Faculty of Science, University of Hradec Kralove; Malaysia-Japan International Institute of Technology (MJIIT), Universiti Teknologi Malaysia; Regional Research Center, Iwate Prefectural University

Authors: Toshitaka Hayashi, Hamido Fujita, Dalibor Cimr, Richard Cimler, Jitka Kühnová

Abstract: Machine Learning (ML) techniques have been applied to various problems. However, applying ML to ML models is an unexplored direction. For this purpose, this paper considers a meta-classification of one-class classification (OCC) models, because all ML models could be approximated as OCC models. The proposal represents OCC models as normality rankings and classifies them using nearest-neighbor and ranking-correlation metrics. The experiment classifies OCC models, where classes correspond to training datasets, algorithms, and hyperparameters. The proposal achieves high accuracy when class labels are datasets. Moreover, it can classify algorithms when the training datasets contain the same class. In addition, the discussion highlights that the classification of OCC models is essentially the classification of datasets that treats multiple samples as a single input. The experiment demonstrates the classification of datasets using sleeping records. The proposed method can provide a unified solution for classifying OCC models, datasets, and rankings. Source code is uploaded to the public repository this https URL.
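
The idea of representing a model as a normality ranking and classifying it by nearest neighbor can be sketched as follows. Each model is reduced to the scores it assigns to a shared set of reference samples, and a query model receives the label of the stored model whose ranking correlates best with its own. Spearman correlation is used here as one possible ranking-correlation metric (the abstract does not name the metric), the formula assumes no tied scores, and all names and numbers are hypothetical.

```python
def ranks(scores):
    """Rank positions (0 = lowest score); assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for position, index in enumerate(order):
        result[index] = position
    return result


def spearman(a, b):
    """Spearman rank correlation for two score lists without ties."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def nearest_label(query, labelled):
    """Label of the stored ranking with the highest correlation to `query`."""
    return max(labelled, key=lambda item: spearman(query, item[0]))[1]


# Normality scores of two known models on four shared reference samples.
known_models = [
    ([0.9, 0.8, 0.2, 0.1], "dataset_A"),
    ([0.1, 0.3, 0.7, 0.9], "dataset_B"),
]
query_scores = [0.7, 0.95, 0.3, 0.2]
label = nearest_label(query_scores, known_models)
corr_a = spearman(query_scores, known_models[0][0])
corr_b = spearman(query_scores, known_models[1][0])
```

Working with ranks rather than raw scores makes the comparison insensitive to each model's score scale, which matters when the models come from different algorithms.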

73. Rethinking Dataset Distillation for Classification: Do Distilled Sets Outperform Coresets?

AI summary: Evaluates seven state-of-the-art dataset distillation methods in large-scale standardized experiments and finds that on large datasets they perform on par with or worse than coresets while costing more to construct; coresets offer better coverage of the data distribution and better computational efficiency.

Link: https://arxiv.org/abs/2606.18209

Institutions: Dolby Laboratory

Authors: Trisha Mittal, Akshay Mehra, Joshua Kimball

Abstract: Dataset distillation (DD) has emerged as a prominent approach in data-centric machine learning, aiming to synthesize compact training sets for efficient training by compressing the information in large datasets into a small number of synthetic samples. However, DD methods are often evaluated under inconsistent evaluation protocols, ranging from standard ERM to single/multi-teacher supervision, making it difficult to isolate the effectiveness of distilled data from evaluation. Moreover, many prior methods claim that DD outperforms data pruning approaches such as coreset selection (CS), based on the assumption that restricting condensed datasets to subsets of real samples fundamentally limits their expressiveness. In this work, we critically evaluate DD methods through large-scale experiments using standardized datasets and evaluation protocols to assess their intrinsic effectiveness. We benchmark seven state-of-the-art (SOTA) DD methods on ImageNet-1K, ImageNet100, and ImageNette, using three widely adopted training protocols against three CS strategies. Our results show that while some DD methods fail to outperform even simple random subsets, the SOTA DD approaches are comparable to or worse than coresets on large-scale datasets and incur a substantially higher cost for construction. Beyond accuracy, we also evaluate the representativeness, diversity, and quality of condensed sets, and find that coresets consistently achieve better coverage of the original data distribution. These findings highlight the limited practical advantages of current DD methods and show that coresets remain competitive and are often a more computationally efficient alternative for data-centric learning.

12. Machine Learning Applications | 20 papers

74. Correct When Paired, Wrong When Split: Decoupling and Editing Modality-Specific Neurons in MLLMs

AI summary: Targets editing decoupling failure in knowledge editing for multimodal large language models and proposes DECODE, which explicitly disentangles and localizes modality-specific neuron groups so that knowledge updates hold under different modality triggers.

Link: https://arxiv.org/abs/2606.17057

Institutions: School of Information Science and Engineering, Yunnan University; School of Software, Yunnan University; National University of Singapore; School of Engineering, Yunnan University

Authors: Tingchao Fu, Wenkai Wang, Fanxiao Li, Huadong Zhang, Jinhong Zhang, Dayang Li, Yunyun Dong, Renyang Liu, Wei Zhou

Abstract: Although Knowledge Editing provides an efficient mechanism for updating the knowledge of Multimodal Large Language Models (MLLMs), we find that current paradigms still suffer from an important yet underexplored issue: editing decoupling failure, where entity-related knowledge can be updated when the model is triggered by multimodal inputs (text-image query pairs); however, it often reverts to outdated pre-edit facts when the paired inputs are split into unimodal ones. Our in-depth empirical analysis reveals that the entity knowledge in MLLMs is not stored as a unified representation, but is instead distributed across disentangled modality-specific pathways. As a result, updates biased toward multimodal queries fail to propagate effectively to unimodal circuits. To bridge this gap, we propose DECODE, which explicitly disentangles and localizes modality-specific neuron groups for targeted knowledge. Extensive experiments demonstrate that DECODE consistently achieves effective knowledge updates under different modality triggers, thereby mitigating editing decoupling failures.

75. Diagnosing and Repairing Shape-Prior Shortcuts in Long-Range Single-Shot Fringe Projection Profilometry

AI summary: Uses mechanistic interpretability and conformal uncertainty quantification to diagnose that networks for long-range single-shot fringe projection profilometry rely on shape priors rather than fringe-phase decoding, and proposes the PhiCalNet architecture as a repair, reducing object mean absolute error by 3.3x.

Link: https://arxiv.org/abs/2606.17093

Institutions: Department of Mechanical Engineering, Iowa State University; College of Engineering, University of Georgia

Authors: Adam Haroon, Anush Lakshman, Cody Fleming, Beiwen Li

Abstract: Learning-based single-shot fringe projection profilometry (FPP) has been studied mostly at close range. The long-range regime (standoff beyond 1 m) remains largely unaddressed: inverse-square intensity falloff lowers fringe signal-to-noise ratio and degrades physical ground truth, the single-shot problem is ill-posed because fringe-order information is absent from one image, and these architectures have not been studied mechanistically. We present a diagnose-repair-verify study using mechanistic interpretability (MI) and conformal uncertainty quantification (UQ) as convergent diagnostics: they agree on one physical failure locus, driving and verifying an architectural repair. On a photorealistic synthetic benchmark (15,600 fringe images, 50 objects at 1.5-2.1 m), a best UNet baseline reaches 14.54 mm object mean absolute error (MAE). Three probes (linear probing, Grad-CAM, flat-plane out-of-distribution test) converge: the baseline solves the task via object-boundary shape priors rather than fringe-phase decoding. We repair this with PhiCalNet, which outputs wrapped phase rather than depth and applies a fixed differentiable calibration layer mapping phase to depth, removing the shape-prior solution from the hypothesis space architecturally rather than by a loss penalty. A physics-informed loss that enforces the same physics as a soft penalty on a depth-regressing network yields no measurable gain, isolating the architecture as the operative factor. PhiCalNet reduces object MAE 3.3x to 4.46 mm; the residual is carried by 0.103% of pixels at the +/-pi wrap discontinuity. Pixel-wise conformal UQ confirms the diagnosis: rejecting the top 5% of object pixels by snapshot disagreement cuts PhiCalNet RMSE by 64% (20.6->7.4 mm) versus 3.5% for the baseline. MI and UQ converge on the same failure locus.

76. The Critical Role of Model Selection in Causal Inference: A Comparative Analysis of Classification Models within the InferBERT Framework for Pharmacovigilance

AI summary: Within the InferBERT framework, this study compares four models (XGBoost, ALBERT, BioBERT, and Med-LLaMA) and finds that domain-specific pre-training (BioBERT) outperforms both the simple baseline and the large LLM for causal ADE detection in pharmacovigilance; calibration improves ECE but has mixed effects on accuracy and causal discovery.

Link: https://arxiv.org/abs/2606.17113

Institutions: Department of Stochastics, Institute of Mathematics, Budapest University of Technology and Economics; Institute of Biostatistics and Network Science, Semmelweis University; Department of Computer Science, University of Warwick

Authors: Csaba Kiss, Roland Molontay, Gabriele Pergola

Abstract: Distinguishing causal adverse drug events (ADEs) from spurious correlations remains a central challenge in pharmacovigilance. The InferBERT framework integrates transformer models with Do-calculus, but its success hinges on the underlying classification model. This study evaluates the impact of model choice in InferBERT, assessing whether simpler models suffice, whether domain-specific pre-training helps, whether scaling to LLMs improves causal detection, and what effect post-hoc calibration has. We performed a comparative study on two benchmarks: Analgesics-induced Acute Liver Failure (AILF) and Tramadol-related Mortalities (TRAM). Four models were evaluated, namely XGBoost (baseline), ALBERT (original InferBERT), BioBERT (biomedical transformer), and Med-LLaMA (medical LLM), using 5-fold cross-validation repeated over 20 runs. We measured accuracy, Expected Calibration Error (ECE) pre- and post-isotonic regression, and Jaccard concordance of causal terms with PRR, ROR, and EBGM; significance was tested with paired t-tests. BioBERT achieved the highest accuracy on both datasets, while Med-LLaMA underperformed despite its size and parameter-efficient fine-tuning. Domain-specific pre-training was decisive. Calibration improved ECE but had mixed effects on accuracy and causal discovery. BioBERT's superiority also yielded the strongest concordance with traditional pharmacovigilance signals. These results show that domain-specific pre-training provides a clear advantage over simpler baselines and larger LLMs. Investing in manageable, domain-aware models is more effective for computational pharmacovigilance than simply scaling model size.
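
The abstract reports Expected Calibration Error without saying how it was binned. As a rough illustration only, the sketch below computes ECE with equal-width confidence bins; the function name, the bin count, and the binning scheme are assumptions for this example and are not taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: sum over bins of (n_b / N) * |accuracy_b - confidence_b|.

    confidences: predicted probability of the chosen class, each in [0, 1].
    correct: 1 if the prediction was right, 0 otherwise.
    The equal-width binning is an assumption; the paper may bin differently.
    """
    if not confidences or len(confidences) != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")

    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, hit))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_confidence)
    return ece
```

A model that says 0.9 on four predictions and gets three right has a gap of 0.15 between stated confidence and observed accuracy, which is what the function returns for that input.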

77. Probing, Fusion, and Trustworthiness: A Systematic Evaluation of Foundation Model Representations for Multimodal Cancer Analysis

AI summary: A systematic evaluation of foundation model representations on computational pathology tasks finds that image and omics representations are complementary, that multimodal fusion helps when no single modality dominates, and, via conformal prediction, that uncertainty-aware inference has clinical value.

Link: https://arxiv.org/abs/2606.17115

Institutions: The Alan Turing Institute; University of Bristol; University of Manchester; The Institute of Cancer Research; Genentech

Authors: Jingyu Hu, Giuseppe Tripodi, Reed Naidoo, Sarah F. McGough, Tapabrata Chakraborti

Abstract: Foundation models (FMs) have emerged as powerful representation extractors for medical data, yet their generalizability to datasets under distribution shift remains underexplored. This work systematically evaluates FM-based representations on a suite of computational pathology tasks across two real-world commercial cohorts, IH-BC and IH-NSCLC, drawn from the licensed in-house (IH) oncology dataset. The analysis focuses on two modalities, whole-slide images and transcriptomic profiles, drawn from the IH multimodal data. We first benchmark unimodal probing performance across five FMs on eight downstream classification tasks, and find that image and omics representations carry complementary predictive signals. Then we investigate whether multimodal fusion can yield additional gains over unimodal baselines by comparing three image-omics fusion strategies built on paired representations. The trustworthiness of selected unimodal and multimodal pipelines is further assessed through conformal prediction. Our results show that FM representations achieve competitive performance on out-of-distribution data and that multimodal fusion helps mainly when no single modality dominates the signal. Conformal prediction reveals that in the majority of cases where a point prediction fails, the true diagnosis remains recoverable within the prediction set, reinforcing the value of uncertainty-aware inference for clinical support.
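
To make the "prediction set" idea concrete, here is a minimal sketch of split conformal prediction for classification. The abstract does not state which conformal variant or nonconformity score the authors used; the score (one minus the probability assigned to the true class), the function names, and the default `alpha` are assumptions chosen for illustration.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Calibrate a score threshold on held-out data (split conformal).

    cal_probs: list of per-class probability lists for calibration samples.
    cal_labels: index of the true class for each calibration sample.
    Nonconformity score (an assumption here): 1 - probability of the true class.
    """
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return 1.0  # too few calibration points: every class must be included
    return scores[rank - 1]


def prediction_set(probs, threshold):
    """Return every class whose nonconformity score is within the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= threshold]
```

With a set like this, a case where the top-scoring class is wrong can still contain the true class, which is the situation the abstract describes as the diagnosis remaining recoverable.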

78. Uncertainty Quantification of Engineering Structures by Polynomial Chaos Expansion and Multivariate Active Learning

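AI summary: For multi-output engineering problems in which a single experimental design cannot approximate all output quantities accurately at once, the paper proposes an adaptive sequential sampling method that balances exploration of the input space against variance information aggregated across outputs to build polynomial chaos expansion surrogates; numerical experiments show improved surrogate accuracy and stability.

Link: https://arxiv.org/abs/2606.17233

Institutions: Brno University of Technology; University of Rostock

Authors: Qitian Lu, Jafar Jafari-Asl, Panagiotis Spyridis, Lukas Novak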

Abstract: In many engineering applications, a single high-fidelity model produces multiple quantities of interest (QoIs) under the same input parameters, e.g. finite element models of complex physical systems. To alleviate the high computational cost of direct model evaluations, surrogate models are widely used to construct efficient approximations of model responses. Naturally, the accuracy of surrogates strongly depends on the quality of the experimental design (ED). However, a single ED may not provide an adequate representation for all outputs simultaneously, especially when different outputs exhibit varying sensitivities to the input variables. A straightforward solution is to perform separate sampling for each output, but this results in increased sampling complexity and computational cost. From a statistical perspective, such an approach also ignores potential correlations among all outputs and may compromise data consistency. To address this issue, an adaptive sequential sampling method for constructing polynomial chaos expansion surrogate models is generalized for vector-valued QoIs. The method sequentially selects new samples from a candidate pool based on their local contribution to the output variance, while balancing distance-based exploration of the input space and exploitation of aggregated variance information across all outputs. Its performance is compared with non-sequential Latin Hypercube Sampling through several numerical examples from engineering problems. Numerical results demonstrate that the proposed strategy improves both surrogate accuracy and stability, and provides a more reliable estimation of second-order statistics.

79. Counterfactual Optimization of Baseball Pitch Sequences and Estimation of Its Impact on Season-Level Statistics

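AI summary: Using a Transformer model and counterfactual analysis, the study optimizes the final pitch and the setup pitch in MLB pitch sequences and finds that this can substantially improve season-level performance (for example, raising K/9 by more than 1.0); it also offers practical insights such as effective locations by velocity band.

Link: https://arxiv.org/abs/2606.17345

Authors: Ryota Takamido, Hiroki Nakamoto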

Abstract: Although pitch sequencing is a central topic in baseball analytics, previous studies have primarily focused on optimizing the final pitch within a single plate appearance, leaving the role of preceding setup pitches and their impact on long-term season-level performance insufficiently examined. To address these issues, this study conducted counterfactual analyses using MLB Statcast data. A Transformer-based machine-learning model was trained to predict whether a target pitch would result in an in-play outcome or swing-out. Counterfactual pitch sequences were then generated by replacing either the final pitch or the preceding setup pitch with alternative pitch types and locations while keeping the surrounding contextual information fixed. Optimal counterfactual selections were defined as those that minimized the predicted in-play probability, and their expected effects on pitchers' seasonal statistics were estimated using regression models linking model outputs to season statistics. The results suggest that the optimization of both final and setup pitches may substantially influence season-level performance, including improvements of more than 1.0 in K/9. The analyses also provided several practical insights, including velocity-band-specific effective locations, the importance of pitch commands, and the expansion of pitch-selection options through middle-velocity pitches. These findings quantitatively support the strategic importance of pitch sequencing in baseball.

80. Amortized Probabilistic Retrieval of Atmospheric CO2 from OCO-2 Spectra Using Deep Learning with Laplace Approximations and Normalizing Flows

AI summary: Proposes a deep learning framework that uses Laplace approximations and normalizing flows to retrieve atmospheric CO2 concentrations from OCO-2 spectra quickly and accurately while quantifying uncertainty; it is orders of magnitude faster and more accurate than conventional methods.

Link: https://arxiv.org/abs/2606.17413

Institutions: University of Wisconsin–Madison; Jet Propulsion Laboratory, California Institute of Technology

Authors: Alejandro Calle-Saldarriaga, Felix Jimenez, Jack Grosskreuz, Jiazheng Wang, Jonathan Hobbs, Matthias Katzfuss

Abstract: Space-based monitoring of atmospheric carbon dioxide (CO2) is essential for constraining the global carbon budget. NASA's Orbiting Carbon Observatory-2 (OCO-2) estimates column-averaged dry-air mole fractions of CO2 (XCO2) using high-resolution spectra. However, current operational retrieval algorithms are computationally expensive and do not properly quantify uncertainties. We present a novel deep learning framework that addresses these challenges. Because ground-truth data are difficult to obtain for real satellite observations, we develop and validate our approach using a high-fidelity simulation dataset. This dataset, created to support OCO-2 uncertainty quantification (UQ), incorporates realistic forward model errors. Our architecture encodes spectral bands using a multi-branch neural network and estimates posteriors of the full CO2 column or desired summaries thereof using two scalable UQ methods: Laplace approximations and normalizing flows. Our approach has five key advantages relative to operational "full-physics" solvers: (1) Amortization: Inference is orders of magnitude faster, enabling real-time processing of massive data streams; (2) Model error robustness: By training on simulations that explicitly include model discrepancies, our method accounts for systematic errors often neglected by standard inversions; (3) Point estimate accuracy: We achieve superior predictive accuracy compared to baseline methods; (4) Improved UQ: The probabilistic outputs yield better-calibrated uncertainty estimates; and (5) Non-Gaussian posteriors: When utilizing normalizing flows, our framework successfully models complex, asymmetric posterior distributions, overcoming the limitations of the Gaussian assumption. These results suggest that simulation-based deep learning is a viable path toward next-generation operational processing systems.

81. Toward Controllable Catalyst Inverse Design via Large-Scale Autoregressive Pretraining

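AI summary: Proposes a conditional catalyst generation model based on a generative pretrained Transformer; large-scale pretraining followed by fine-tuning yields high structural validity and condition match rates and markedly improves screening efficiency.

Link: https://arxiv.org/abs/2606.17445

Authors: Dong Hyeon Mok, Jonggeol Na, Seoin Back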

Abstract: Inverse design of heterogeneous catalysts remains challenging because catalyst surfaces exhibit substantial structural complexity with coupled surface-adsorbate interactions across a vast chemical space that is difficult to explore efficiently through conventional screening alone. Although machine learning-based high-throughput screening has accelerated catalyst discovery, its efficiency inevitably declines as the search space grows, motivating the development of generative models that can directly construct catalysts with target properties. Here, we present a conditional catalyst generative model based on the Generative Pretrained Transformer architecture with a numerical embedding layer that enables the generation of catalyst structures conditioned on both categorical and continuous properties within a single autoregressive framework. The model was pretrained on 133 million catalyst structures and subsequently fine-tuned on approximately 460,000 optimized structures with associated categorical properties and binding energies for conditional generation. The resulting model achieved 98% structural validity, 95% optimization validity, and high categorical condition fidelity, with a 93% joint match rate for adsorbate type and composition. For binding energy conditioning, the match rate of approximately 20% represents a four-fold improvement over the baseline training distribution, and the generated distributions shift systematically toward the target values, enabling a 1.5 to 4-fold improvement in screening efficiency for reaction-targeted catalyst discovery without additional fine-tuning. These results show that large-scale autoregressive pre-training, combined with explicit property conditioning, provides a practical route toward controllable catalyst generation and accelerated catalyst discovery.

82. Credibility-Weighted Pricing of Autonomous Vehicle Liability Under Operational Design Domain Shift

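AI summary: To address sparse experience, ODD shift, and non-stationary risk in automated driving system deployments, the paper proposes a hierarchical Bayesian credibility framework that performs partial pooling through an ODD-similarity kernel and demonstrates it on Waymo data.

Link: https://arxiv.org/abs/2606.17451

Authors: Doyeon Jang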

Abstract: Automated Driving System deployments create a foundational ratemaking challenge: sparse experience, shifting operational design domains, and non-stationary risk across software releases. We propose a hierarchical Bayesian credibility framework pooling across cities, software versions, and territories via a learned ODD-similarity kernel, nesting Bühlmann-Straub as a limiting case. Demonstrated on 648 verified-engaged Waymo crashes across four U.S. metros from the NHTSA Standing General Order database against 116 million matched miles, city-aggregate credibility weights are moderate (0.12-0.46), partial pooling decisively outperforms no pooling, and a power analysis shows the learned kernel's advantage becomes detectable at approximately twelve deployed cities.
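
For readers unfamiliar with the classical model that the framework is said to nest, the sketch below shows the standard Bühlmann-Straub credibility factor Z = m / (m + k) and the resulting blended estimate. It illustrates only that limiting case, not the paper's hierarchical model or its learned kernel; the function names and the numbers in the usage note are hypothetical.

```python
def credibility_weight(exposure, k):
    """Bühlmann-Straub credibility factor Z = m / (m + k).

    exposure: the unit's own exposure volume m (for example, miles driven).
    k: ratio of expected process variance to variance of hypothetical means.
    """
    if exposure < 0 or k <= 0:
        raise ValueError("exposure must be non-negative and k must be positive")
    return exposure / (exposure + k)


def credibility_estimate(own_mean, collective_mean, exposure, k):
    """Blend a unit's own experience with the collective mean using weight Z."""
    z = credibility_weight(exposure, k)
    return z * own_mean + (1.0 - z) * collective_mean
```

With hypothetical values `exposure=10.0` and `k=40.0`, the weight is 0.2, so a unit whose own rate is 5.0 against a collective rate of 2.0 is assigned 2.6. Sparse experience pulls the estimate toward the pool, which is the behaviour partial pooling generalizes.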

83. ResAware: Cross-Environment Website Fingerprinting via Resource-Privileged Distillation

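AI summary: Proposes the ResAware framework, which trains a teacher model on resource-level features and guides a student model through heterogeneous knowledge distillation, improving cross-environment robustness with no added online overhead; on a large-scale dataset collected over five months it significantly improves baseline methods.

Link: https://arxiv.org/abs/2606.17462

Institutions: Beijing University of Posts and Telecommunications; Zhongguancun Laboratory

Authors: Chongru Fan, Wei Wang, Wentao Huang, Zhenquan Ding, Jinqiao Shi, Lei Cui, Zhiyu Hao, Xiaochun Yun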

Abstract: While Website Fingerprinting (WF) attacks achieve high accuracy in controlled laboratory settings, they often degrade substantially in real-world environments due to factors such as spatio-temporal drift, browser heterogeneity, and proxy obfuscation. This limitation stems from their sole reliance on low-level traffic features that are noisy and highly sensitive to environmental perturbations. To address this problem, we propose ResAware, a cross-environment resource-aware distillation framework under a training-rich/inference-poor asymmetric setting. Specifically, ResAware trains a teacher model on resource-level features, and then distills the resulting privileged knowledge into a student model through heterogeneous knowledge distillation. At deployment time, the student model performs inference using only encrypted traffic, incurring zero additional cost. We evaluate ResAware on a large-scale dataset collected over five months from six globally distributed vantage points, comprising more than 160,000 paired samples. The results show that ResAware significantly enhances the cross-environment robustness of diverse WF baselines. Under a 150-day temporal drift, for example, ResAware improves the F1-score of Var-CNN from 72.77% to 81.49% and the open-world TPR@1%FPR from 22.40% to 27.20%. Our results demonstrate that resource-level supervision improves WF robustness without expanding online observation capabilities.

84. Multi-Adapter PPO: A Cross-Attention Enhanced Wavelength Selection Framework for LIBS Quantitative Analysis

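AI summary: Proposes a Multi-Adapter PPO framework that casts wavelength selection as a reinforcement learning problem and uses cross-attention and multiple adapters to capture spectral relationships; on steel and coal datasets it improves the comprehensive score by 28.4% and prediction accuracy by 45.2% on average.

Link: https://arxiv.org/abs/2606.17476

Authors: Hao Li, Man Fung Zhuo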

Abstract: Laser-induced breakdown spectroscopy (LIBS) quantitative analysis faces critical challenges in wavelength selection due to high-dimensional spectral data and the fundamental trade-off between prediction accuracy and feature efficiency. This paper presents a novel Multi-Adapter PPO framework that transforms wavelength selection into a reinforcement learning problem, leveraging cross-attention mechanisms and multiple specialized adapters to capture complex spectral relationships. Our approach outperforms traditional Particle Swarm Optimization (PSO) by an average of 28.4% in comprehensive score and 45.2% in prediction accuracy across steel and coal datasets. The proposed method demonstrates superior performance in balancing prediction accuracy with feature efficiency, achieving state-of-the-art results in LIBS quantitative analysis while maintaining interpretability and computational efficiency. We released our code and dataset here: this https URL

85. SpatioTemporal Causal Network Diagnostics for Geographic Tipping Point Early Warning

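AI summary: Proposes the SpatioTemporal Causal Network Diagnostics (ST-CND) framework, which builds a data-driven directed causal network and combines local recovery-rate estimation with vulnerable-subnetwork identification to address spatial dilution, Euclidean assumptions, and correlated noise in early warning of geographic tipping points; it reaches an AUROC of 0.783 on the AMOC task.

Link: https://arxiv.org/abs/2606.17553

Institutions: Jiangsu Center for Collaborative Innovation in Geographical Information Resource Development and Application; National Center for Applied Mathematics, Tianjin University

Authors: Zhaoyuan Yu, Zhangyong Liang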

Abstract: Geographic tipping points in ecosystems, climate subsystems, or ice sheets pose severe challenges for localized early warning. Classical spatial indicators such as Moran's I summarize global spatial structure, but they struggle with three issues: spatial dilution, Euclidean assumptions, and correlated noise. This paper introduces SpatioTemporal Causal Network Diagnostics (ST-CND), a framework that addresses these three issues by representing the geographic field as a time-evolving directed causal network. The core workflow is as follows: (1) infer which spatial nodes help predict other nodes via transfer entropy, replacing fixed Euclidean neighborhoods with data-driven information-flow topology; (2) estimate local recovery rates within each candidate subnetwork via dynamic mode decomposition; and (3) identify the most vulnerable subnetwork by combining three signals, namely high internal fluctuation, high internal synchronization, and low external coupling, thereby suppressing false alarms from spatially correlated noise. Validated on synthetic bifurcations and two observational sea-surface temperature benchmarks, namely Indo-Pacific SST and North Atlantic AMOC, ST-CND delivers localized and interpretable warnings. On the AMOC task, it achieves an AUROC of 0.783 and a critical-subnetwork IoU of 0.378, outperforming recurrence-network and lambda-AR1 baselines. The framework provides an interpretable and scalable pipeline for spatial early warning in Earth system science.

86. Physics-Constrained Neural Networks for Improved Short-Term Weather Forecasting: A Case Study over the South Pacific

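AI summary: Proposes three improvements to physics-constrained neural networks (PCNNs): an upgraded numerical solver, a unified autoregressive hybrid block, and integration with two neural backbones. On the WeatherBench South Pacific subset, root mean squared error at 1-12 h lead times falls by 8-22% relative to purely neural models while physical consistency is preserved.

Link: https://arxiv.org/abs/2606.17659

Institutions: Faculty of Computer Science, Higher School of Economics

Authors: Egor Bugaev, Fedor Buzaev, Dmitry Efremenko, Denis Derkach, Fedor Ratnikov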

Abstract: This study introduces enhancements to physics-constrained neural networks (PCNNs) that improve the accuracy and stability of hybrid short-term weather forecasting models. Building on the WeatherGFT architecture, three innovations are proposed. First, an upgraded numerical solver, combining a fifth-order weighted essentially non-oscillatory scheme (WENO-5), a beta-plane approximation, and subgrid-scale viscosity, permits a fourfold increase in the integration time step to 1200 s while reducing the daily mean squared error by up to 26%. Second, a unified autoregressive hybrid block replaces the original chain of 24 specialised modules, eliminating overfitting to specific lead times. Third, the physical core is integrated with two state-of-the-art neural backbones, resulting in PI-PredFormer and PI-IAM4VP. Evaluation on the WeatherBench South Pacific subset from 2000 to 2004 shows that these hybrids reduce root mean squared error at 1-12 h lead times by 8-22% compared to purely neural counterparts, while better preserving physical consistency. These results demonstrate that incremental refinement of hybrid components offers a practical route toward more accurate and efficient short-range weather forecasting.

87. ASTEROID: A Spatiotemporal Information Transformer for Forecasting Multi-Step Time Series of Molecular Dynamics

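AI summary: Proposes the ASTEROID framework, which reformulates molecular dynamics trajectories as high-dimensional spatiotemporal sequences and integrates the spatiotemporal information transformation equation into a Transformer to predict multi-step atomic coordinates directly; on several quantum-mechanical molecular datasets it markedly improves prediction accuracy and reduces computational cost.

Link: https://arxiv.org/abs/2606.17668

Authors: Kexin Wu, Luonan Chen, Renxiao Wang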

Abstract: Molecular dynamics (MD) simulation is computationally demanding, particularly for large-scale systems requiring long-term analysis. Accurately forecasting the outcomes of an MD simulation is not only an attractive scientific challenge but also has substantial practical value. In this work, we developed a data-driven framework, termed ASTEROID (Advanced Spatiotemporal TransformER fOr Inferring Dynamics), that can directly predict multi-step atomic coordinates, avoiding conventional iterative integration. For this purpose, ASTEROID reformulates MD trajectories as high-dimensional spatiotemporal sequences and integrates the Spatiotemporal Information (STI) Transformation equation into a Transformer architecture. The core innovation of ASTEROID lies in its ability to model multiscale spatiotemporal dependencies. In particular, for spatial dependencies, a local-global self-attention mechanism captures both short- and long-range interactions. For temporal dependencies, an encoder-decoder structure integrates global context with autoregressive forecasting. ASTEROID was evaluated on several quantum-mechanics-derived molecular datasets. Our results indicate that ASTEROID not only achieved higher accuracy in multi-step prediction than existing methods on various benchmarks, but also significantly reduced the computational cost relative to conventional MD simulation. Moreover, the model supports iterative multi-step forecasting over an extended time scale. This work establishes a robust and generalizable data-driven paradigm for accelerating MD simulations.

88. Delta-Based Target Reformulation for Short-Term Electricity Load Forecasting Using LSTM and Transformer Models

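AI summary: To handle the non-stationarity of electricity load, the paper proposes a delta-based target reformulation in which LSTM and Transformer models predict the change in load rather than its absolute value; for hour-ahead forecasting, MAPE falls by more than 50%.

Link: https://arxiv.org/abs/2606.17692

Authors: Vansh Bansal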

Abstract: Accurate short-term electricity load forecasting is critical for the reliable and economic operation of modern power systems, under non-stationarity arising from weather variability, calendar effects, and evolving consumption patterns. While deep learning models such as LSTMs and Transformers show promising performance, most existing studies focus on direct absolute load prediction without explicitly addressing target non-stationarity. Motivated by classical time-series differencing techniques in ARIMA models, this paper investigates a delta-based target reformulation for short-term electricity load forecasting using deep learning. Instead of directly predicting absolute load values, the proposed formulation trains models to predict the change in load between consecutive time steps, with final forecasts reconstructed using the last observed load. This aims to stabilize the learning target and reduce forecasting difficulty. Using multi-year, hourly real-world electricity load data from India, augmented with meteorological variables from the NASA POWER project and calendar features, this study evaluates LSTM and Transformer models under both formulations, benchmarking them against LightGBM. Experiments are conducted for hour-ahead and day-ahead horizons, assessing performance via Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE). Results show that delta-based reformulation consistently improves forecasting accuracy for hour-ahead prediction across all evaluated models, yielding MAPE reductions of over 50% compared to absolute formulations. For day-ahead forecasting, delta targets specifically benefit deep sequence models (LSTM and Transformer), while LightGBM remains competitive under the absolute formulation. These findings indicate that while delta reformulation is a powerful inductive bias for neural networks, its efficacy is model- and horizon-dependent.
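
The reformulation described in the abstract can be sketched in a few lines: train on first differences, then add the predicted change back to the last observed load. The helper names below are hypothetical, and the sketch covers only the one-step (hour-ahead) case; how the paper builds day-ahead targets is not stated in the abstract.

```python
def to_delta_targets(load):
    """Turn an absolute load series into per-step changes (first differences)."""
    return [later - earlier for earlier, later in zip(load, load[1:])]


def reconstruct_forecast(last_observed_load, predicted_delta):
    """Recover the absolute forecast from a predicted change in load."""
    return last_observed_load + predicted_delta
```

For a series `[100.0, 105.0, 103.0]` the training targets become `[5.0, -2.0]`. If a model then predicts a change of `4.0`, the reconstructed forecast is `107.0`.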

89. QueryMarket: Cost-Aware Online Active Learning in Data Markets

AI summary: Proposes the QueryMarket framework and the OVBAL algorithm, which estimates marginal utility with a D-optimality criterion to perform cost-aware online active learning under rolling budget constraints, adapting to non-stationary streams and heterogeneous label costs.

Link: https://arxiv.org/abs/2606.17805

Institutions: Dyson School of Design Engineering, Imperial College London; Halfspace (part of Accenture); Technical University of Denmark (DTU Management); Aarhus University (CoRE)

Authors: Xiwen Huang, Pierre Pinson

Abstract: Data acquisition is a major bottleneck for learning in real-time streams: analysts must decide on the fly which labels to purchase while respecting a rolling budget. However, existing online active learning rarely unifies pricing, information gain, and rolling budget constraints under concept drift. We introduce QueryMarket, a market-inspired framework that queries each incoming data point based on its estimated utility to the model and its price. Within this framework, we propose OVBAL (online variance-based active learning), which integrates data pricing with information-driven selection by estimating each sample's marginal utility via a D-optimality criterion with exponential forgetting and executing cost-aware purchases under rolling budget constraints. OVBAL yields a simple, fully online decision rule that adapts to nonstationary streams and heterogeneous label costs. Experiments on synthetic data and a real-world solar power generation forecasting task show that OVBAL is particularly effective under seller-centric pricing and yields a more favorable long-run error-cost trade-off in the real-world task under both pricing schemes.

90. Predictive Analytics in E-Commerce for Customer Behavior Forecasting using hybrid Ret-DNN with XGBoost Model

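AI summary: Proposes a hybrid Ret-DNN and XGBoost model that predicts customer purchase probability through feature extraction followed by gradient boosting, reaching an MAE of 0.2193 on a UK retail dataset.

Link: https://arxiv.org/abs/2606.17931

Authors: Degala Pushpa Sri, Mayank Atreya, Lakshmi. H, Navin Chhibber, Mukesh Soni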

Abstract: In recent years, electronic (E) commerce services have rapidly become part of people's daily lives, helping them to purchase products online. However, retail platforms struggle to understand customer behavior, which makes it difficult to predict future purchases. To overcome these challenges, this study proposes a hybrid Retail Deep Neural Network (Ret-DNN) with an Extreme Gradient Boosting (XGBoost) model for capturing temporal features and tabular dynamics of retail data. First, data were sourced from a United Kingdom (UK)-based online retailer and contain almost 500,000 transaction records. Then, the collected data were pre-processed using a series of techniques, such as data cleaning, outlier handling, temporal feature extraction, feature encoding, and z-score normalization, to ensure that the data were ready for model training and testing. Subsequently, the preprocessed data were fed into the Ret-DNN model, which acts as a feature extractor to understand the complete context of customer transactions. Further, the extracted features were fed as input into the XGBoost model, which predicted the final output as the purchase probability of customers. Finally, the proposed Ret-DNN XGBoost model achieved better results, attaining a Mean Absolute Error (MAE) of 0.2193, when compared to the existing Ret-DNN model. Keywords: Customer behavior forecasting, extreme gradient boosting, electronic commerce, predictive analytics, retail deep neural networks.

91. Multiple cyclicity and Wavelet Decomposition with Channel Correlation for Long-term Time Series Forecasting

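AI summary: Proposes the McWC model, which constructs multi-layer cyclicity, extracts inter-channel correlations with a multi-layer perceptron, fuses high- and low-frequency information through multi-level wavelet decomposition, and decouples intra-channel autocorrelation in the frequency domain for efficient and accurate long-term forecasting.

Link: https://arxiv.org/abs/2606.17996

Institutions: School of Computer Science and Engineering, Central South University

Authors: Bin Wang, Heming Yang, Jinfang Sheng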

Abstract: Cyclicity and trend are important components of time series data, and many studies based on them have achieved good results in long-term time series forecasting. However, we believe that current work neglects the influence of real-world inter-channel correlations in time series data, which leads to suboptimal predictions. Furthermore, these models rely on complex designs to capture diverse information, resulting in low computational efficiency. To address these challenges, we propose McWC, a long-term time series forecasting model that separately models the cyclicity, trend, and inter-channel correlations. Specifically, McWC first decouples cyclical information from data using a multi-layer cyclicity construction module. Then, it extracts inter-channel correlations using a multi-layer perceptron. Next, it models and fuses the multi-layer high-frequency and low-frequency information from data using a multi-level wavelet decomposition module. Finally, it aggregates the results of different components to obtain the output. Simultaneously, we decouple intra-channel autocorrelations by calculating a loss function in the frequency domain. Experiments on six real-world datasets demonstrate that McWC achieves state-of-the-art performance, exhibiting excellent computational efficiency and historical information extraction capabilities.

92. ConTex: Reformulating Counterfactual Generation For Time Series Forecasting

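AI summary: To address the inconsistency and high computational cost of counterfactual explanations in time series forecasting, the paper proposes ConTex, which uses a globally consistent intervention strategy to generate sparse counterfactuals in a single forward pass, markedly reducing computational cost and supporting real-time applications.

Link: https://arxiv.org/abs/2606.18049

Institutions: Institute for Technologies and Management of Digital Transformation, University of Wuppertal

Authors: Jan Voets, Hasan Tercan, Tobias Meisen, Sebastian Baum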

Abstract: Decision-making with deep learning-based time series forecasting requires not only accurate predictions but also actionable insights. However, current architectures do not inherently provide such information. Specifically, guidance is needed on how current conditions must be modified to shift from a predicted outcome to a desired future scenario. Counterfactual explanations provide a natural framework for this task, as they represent minimal input changes that alter the model's prediction, indicating when and how intervention is required. Existing approaches rely on instance-wise optimization, leading to inconsistency across instances, high computational costs, and limited applicability in real-time settings. To address these limitations, we reformulate counterfactual generation for time series forecasting as the problem of learning a globally consistent intervention strategy, allowing counterfactuals to be generated through a single shared function. We propose Counterfactual Time Series Explanations (ConTex), a model-agnostic, decomposed architecture comprising a temporal context encoder and a conditional encoder, followed by two heads that capture interventions in terms of temporal relevance and modification strength. This structure overcomes the instability and inconsistency of instance-based approaches by producing targeted, interpretable interventions across time and feature dimensions in a single forward pass, making it suitable for real-time applications. Across multiple forecasting architectures and benchmark datasets, ConTex achieves state-of-the-art validity while generating sparse counterfactuals that minimize the number of necessary interventions. Additionally, our approach reduces computational cost by at least 12-36x compared to instance-wise generation and supports real-time inference at approximately 0.007 seconds.

93. Embedded Machine Learning for Microcontroller-Class Edge Devices: Data, Feature, Evaluation, and Deployment Pipelines

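AI summary: This paper systematically presents an embedded machine-learning workflow for microcontroller platforms. It focuses on engineering decisions such as sampling and buffering, feature extraction, validation under class imbalance, model/runtime co-design, and streaming deployment, and derives practical design rules from two running examples: inertial motion recognition and keyword spotting.

Link: https://arxiv.org/abs/2606.18122

Author: Mostafa Darvishi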

Abstract: Embedded machine learning moves inference from cloud services to resource-constrained devices that must acquire data, preprocess signals, run a model, and act within tight limits on memory, energy, and latency. This paper presents a systems-oriented synthesis of an embedded machine-learning workflow for microcontroller-class platforms. The emphasis is placed on engineering decisions that are often hidden in generic machine-learning introductions: sampling and buffering, feature extraction as dimensionality reduction, validation under class imbalance, model/runtime co-design, and streaming deployment. Two representative signal families are used throughout the paper. The first is inertial motion recognition, where a two-second, three-axis accelerometer window is transformed from raw samples into root-mean-square and spectral features before classification. The second is keyword spotting, where audio is sampled, anti-aliased, transformed into mel-frequency cepstral coefficients, and processed by a compact one-dimensional convolutional network. The paper concludes with practical design rules for robust on-device inference, including data curation, quantization, thresholding, scheduling, and field monitoring.
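The feature-extraction step described for inertial motion recognition can be illustrated with a small sketch. It is not the paper's pipeline: the 50 Hz sample rate, the choice of "dominant frequency" as the single spectral feature, and all function names are assumptions made here for illustration. The abstract only states that a two-second, three-axis window is reduced to root-mean-square and spectral features.

```python
import cmath
import math

SAMPLE_RATE_HZ = 50   # assumption: the abstract does not state a sample rate
WINDOW_SECONDS = 2    # from the abstract: two-second window


def rms(samples):
    """Root-mean-square amplitude of one axis."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def dominant_frequency(samples, sample_rate_hz):
    """Frequency (Hz) of the strongest non-DC bin of a naive DFT.

    Returns 0.0 when the mean-removed signal carries no measurable energy.
    """
    n = len(samples)
    mean = sum(samples) / n
    centered = [s - mean for s in samples]
    best_bin, best_power = 0, 1e-9  # small floor so flat signals map to 0 Hz
    for k in range(1, n // 2 + 1):
        coeff = sum(
            c * cmath.exp(-2j * math.pi * k * i / n)
            for i, c in enumerate(centered)
        )
        power = abs(coeff) ** 2
        if power > best_power:
            best_bin, best_power = k, power
    return best_bin * sample_rate_hz / n


def window_features(window):
    """Reduce a window {axis: samples} to [rms_x, f_x, rms_y, f_y, rms_z, f_z]."""
    features = []
    for axis in ("x", "y", "z"):
        features.append(rms(window[axis]))
        features.append(dominant_frequency(window[axis], SAMPLE_RATE_HZ))
    return features


def sine_axis(freq_hz, amplitude):
    """Hypothetical test signal: one axis of a pure sine over the window."""
    n = SAMPLE_RATE_HZ * WINDOW_SECONDS
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE_HZ)
        for i in range(n)
    ]
```

Under these assumptions a window holds 3 x 100 = 300 raw samples and the classifier sees only 6 values, which is the sense in which the abstract treats feature extraction as dimensionality reduction. On a real microcontroller an FFT routine would replace the naive DFT loop used here for readability.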

13. Other / General Machine Learning | 2 papers

94. Verified Detection and Prevention of Concurrency Anomalies in Multi-Agent Large Language Model Systems

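AI summary: For multi-agent LLM systems, the paper formalizes four concurrency anomalies and establishes a consistency hierarchy, verifies detector correctness with Verus, and implements prevention in Rust runtimes.

Link: https://arxiv.org/abs/2606.17182

Institution: Independent researcher

Author: Sajjad Khan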

Abstract: Multi-agent LLM systems share state through memory stores, vector indices, and tool registries. We model such sharing as long-running read-generate-write operations under deterministic-generation semantics -- the regime durable-execution engines enforce by deterministic replay -- and formalize four concurrency anomalies in TLA+: stale-generation, phantom-tool, causal-cascade, and tool-effect reordering, structural analogues of classical isolation anomalies, each with a TLC counter-example. The exclusion lattice over these anomalies is trivial; the contribution is the mechanically verified realizability and strict separation of one maximal chain within it, $L_0 \subsetneq \cdots \subsetneq L_4$, to our knowledge the first machine-checked consistency hierarchy for such runtimes. A development of 274 Verus obligations (zero assume, zero admit; trust base: two structural axioms and a mutex correspondence) proves the detectors sound and complete against the specifications and each runtime its avoidance set. Three deployed Rust runtimes realize L0-L1 (pessimistic locking, serializable snapshot isolation, default-SI), each verified against stale-generation and refined to its state machine; L2-L4 are exec-mode-verified with dependency-free prevention twins (A3, A6, A2: 0/1000 versus 1000/1000), and L2 is run live across three model families (A3 prevented in all 120 retracted sessions). We reproduce a silent lost update in ByteDance's deer-flow, formalizing its fix as a verified $L_0 \to L_1$ refinement, and exhibit tool-effect reordering in LangGraph's ToolNode on unmodified output, removed by an L3 commit-order sequencer. The verified detector, refinements, and realizability artifacts are the contribution; the phenomena and lattice are classical.
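The stale-generation anomaly can be pictured as a lost update in a read-generate-write cycle: an agent commits output that was generated from state another agent has since overwritten. The sketch below shows one generic way to detect it, a per-key version check at write time. It is an illustration only and is not the paper's Verus-verified detector or any of its Rust runtimes. `VersionedStore`, `StaleGenerationError`, and `generate` are hypothetical names introduced here.

```python
class StaleGenerationError(Exception):
    """Raised when a write is based on a version that is no longer current."""


class VersionedStore:
    """Toy shared memory: each key maps to (version, value)."""

    def __init__(self):
        self._data = {}

    def read(self, key):
        # Unknown keys read as version 0 with no value.
        return self._data.get(key, (0, None))

    def write_unchecked(self, key, value):
        # No freshness check: a concurrent update can be silently lost.
        version, _ = self.read(key)
        self._data[key] = (version + 1, value)

    def write_if_fresh(self, key, value, read_version):
        # Commit only if nobody wrote since this agent's read.
        current_version, _ = self.read(key)
        if current_version != read_version:
            raise StaleGenerationError(
                f"{key!r}: read v{read_version}, store is at v{current_version}"
            )
        self._data[key] = (current_version + 1, value)


def generate(seen_value, agent_name):
    """Hypothetical stand-in for a long-running, deterministic LLM call."""
    return f"{seen_value}+{agent_name}"


def interleaved_run(checked):
    """Agents A and B both read, then B commits, then A commits."""
    store = VersionedStore()
    store.write_unchecked("plan", "draft")
    version_a, seen_a = store.read("plan")
    version_b, seen_b = store.read("plan")
    store.write_if_fresh("plan", generate(seen_b, "B"), version_b)
    if checked:
        store.write_if_fresh("plan", generate(seen_a, "A"), version_a)
    else:
        store.write_unchecked("plan", generate(seen_a, "A"))
    return store.read("plan")
```

With `checked=False`, agent A's write replaces B's contribution without any error, which is the silent lost update. With `checked=True`, A's commit raises `StaleGenerationError`, so the caller can re-read and regenerate.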

95. Rift: A Conflict Signature for Deception in Language Models

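AI summary: By contrasting knowing deception with ignorant error, the paper finds that deceptive forward passes carry a conflict signature of higher residual rank. The signature identifies which of two responses is the lie with 100% accuracy and no labels, and it transfers across models, languages, and architectures.

Link: https://arxiv.org/abs/2606.17229

Institution: Harmonic Labs

Author: Petr Nyoma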

Abstract: A model that lies while knowing the truth is the central case ELK cannot handle with behavioral evaluation alone. We ask whether such deception leaves an internal signature distinguishing it from honest error. Our key move is a control for wrongness: we contrast a sleeper agent (knows the truth, lies on trigger) against a naive liar (fine-tuned to emit the same wrong answers with no honest training). Both produce identical wrong outputs; any difference is about knowledge conflict, not incorrectness. We find deceptive forward passes carry a conflict signature - 2.1-2.3x higher residual rank than naive-liar passes on the same wrong answer - strong enough to identify which of two responses is the lie with 100% accuracy and no labels, across GPT-2 small/medium (three seeds) and three instruct models. Across Qwen2.5-1.5B/7B and Phi-3-mini, instructed deception raises residual rank on every tested fact (18/18, 40/40, 34/34); on Phi-3, lies separate perfectly from both honest answers and hallucinations (AUC 1.0, Wilcoxon p~6e-11). The signature survives strategic self-constructed deception (model invents its own lie, AUC 1.0), active concealment attempts (AUC 1.0), and length-controlled replication (20/20, AUC 1.0, p~1e-6). Using basis-free relative representations, a probe trained on one model family detects deception in two other families zero-shot (mean AUC 0.933), surviving simultaneous architecture and format change (AUC 0.821), and transfers across five languages (AUC 1.000, length-controlled). The signature is read-only: detectable but not injectable (0/8 both directions). Honest limitations and six negative experiments are documented in full.