
Machine Learning Academic Digest [8.7]

Source: arXiv每日学术速递

Visit arxivdaily.com for the original digest, covering CS, Physics, Mathematics, Economics, Statistics, Finance, Biology, and Electrical Engineering, with search, bookmarking, and more!


cs.LG: 151 papers today


Large Models (24 papers)

【1】GeRe: Towards Efficient Anti-Forgetting in Continual Learning of LLM via General Samples Replay
链接:https://arxiv.org/abs/2508.04676

作者:ng, Shuoran Jiang, Mengchen Zhao, Yuefeng Li, Yang Fan, Xiangping Wu, Qingcai Chen
Abstract: The continual learning capability of large language models (LLMs) is crucial for advancing artificial general intelligence. However, continually fine-tuning LLMs across various domains often suffers from catastrophic forgetting, characterized by: 1) significant forgetting of their general capabilities, and 2) sharp performance declines in previously learned tasks. To simultaneously address both issues in a simple yet stable manner, we propose General Sample Replay (GeRe), a framework that uses usual pretraining texts for efficient anti-forgetting. Beyond revisiting the most prevalent replay-based practices under GeRe, we further leverage neural states to introduce an enhanced activation-state constrained optimization method using a threshold-based margin (TM) loss, which maintains activation-state consistency during replay learning. We are the first to validate that a small, fixed set of pre-collected general replay samples is sufficient to resolve both concerns -- retaining general capabilities while promoting overall performance across sequential tasks. Indeed, the former can inherently facilitate the latter. Through controlled experiments, we systematically compare TM with different replay strategies under the GeRe framework, including vanilla label fitting, logit imitation via KL divergence, and feature imitation via L1/L2 losses. Results demonstrate that TM consistently improves performance and exhibits better robustness. Our work paves the way for efficient replay of LLMs in the future. Our code and data are available at https://github.com/Qznan/GeRe.
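The abstract does not give the exact form of the TM loss, but a minimal sketch of a threshold-based margin penalty on activation states, in the spirit described above, might look as follows (PyTorch; the function name, margin value, and activation shapes are illustrative assumptions, not the paper's implementation):

```python
import torch
import torch.nn.functional as F

def tm_loss(current_acts: torch.Tensor, ref_acts: torch.Tensor, margin: float = 0.1) -> torch.Tensor:
    # Hypothetical threshold-based margin loss: tolerate activation drift on
    # replayed general samples up to `margin`, and penalize only the excess,
    # keeping activation states roughly consistent during replay learning.
    excess = F.relu((current_acts - ref_acts).abs() - margin)
    return excess.mean()

# Toy usage: activations for a batch of replayed general samples.
cur = torch.randn(4, 16, 768, requires_grad=True)  # current model activations
ref = torch.randn(4, 16, 768)                      # frozen, pre-collected reference activations
loss = tm_loss(cur, ref)
loss.backward()
```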


【2】Sculptor: Empowering LLMs with Cognitive Agency via Active Context Management
链接:https://arxiv.org/abs/2508.04664

作者:H. Xu, Qitai Tan, Ting Cao, Yunxin Liu
备注:Preprint. Work in progress
Abstract: Large Language Models (LLMs) suffer from significant performance degradation when processing long contexts due to proactive interference, where irrelevant information in earlier parts of the context disrupts reasoning and memory recall. While most research focuses on external memory systems to augment LLMs' capabilities, we propose a complementary approach: empowering LLMs with Active Context Management (ACM) tools to actively sculpt their internal working memory. We introduce Sculptor, a framework that equips LLMs with three categories of tools: (1) context fragmentation, (2) summary, hide, and restore, and (3) intelligent search. Our approach enables LLMs to proactively manage their attention and working memory, analogous to how humans selectively focus on relevant information while filtering out distractions. Experimental evaluation on information-sparse benchmarks (PI-LLM for proactive interference and NeedleBench Multi-Needle Reasoning) demonstrates that Sculptor significantly improves performance even without specific training, leveraging LLMs' inherent tool-calling generalization capabilities. By enabling Active Context Management, Sculptor not only mitigates proactive interference but also provides a cognitive foundation for more reliable reasoning across diverse long-context tasks, highlighting that explicit context-control strategies, rather than merely larger token windows, are key to robustness at scale.
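As a rough illustration of what the three tool categories could expose to a model, here is a hedged, self-contained sketch of an Active Context Management tool registry; all function names and the data layout are invented for illustration and are not Sculptor's actual API:

```python
# Hypothetical ACM tool set mirroring Sculptor's three categories.
context: list[str] = []      # visible working-memory fragments
hidden: dict[int, str] = {}  # fragments currently replaced by summaries

def fragment(text: str, sep: str = "\n\n") -> list[str]:
    """Category 1: split a long context into addressable fragments."""
    parts = [p for p in text.split(sep) if p.strip()]
    context.extend(parts)
    return parts

def hide(idx: int, summary: str) -> None:
    """Category 2: swap a fragment for a short summary, keeping it restorable."""
    hidden[idx] = context[idx]
    context[idx] = f"[summary] {summary}"

def restore(idx: int) -> None:
    """Category 2 (continued): bring a hidden fragment back verbatim."""
    context[idx] = hidden.pop(idx)

def search(query: str) -> list[int]:
    """Category 3: naive keyword search over the visible fragments."""
    return [i for i, frag in enumerate(context) if query.lower() in frag.lower()]
```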


【3】Causal Reflection with Language Models
链接:https://arxiv.org/abs/2508.04495

作者:, Zac Liu
Abstract: While LLMs exhibit impressive fluency and factual recall, they struggle with robust causal reasoning, often relying on spurious correlations and brittle patterns. Similarly, traditional Reinforcement Learning agents also lack causal understanding, optimizing for rewards without modeling why actions lead to outcomes. We introduce Causal Reflection, a framework that explicitly models causality as a dynamic function over state, action, time, and perturbation, enabling agents to reason about delayed and nonlinear effects. Additionally, we define a formal Reflect mechanism that identifies mismatches between predicted and observed outcomes and generates causal hypotheses to revise the agent's internal model. In this architecture, LLMs serve not as black-box reasoners, but as structured inference engines translating formal causal outputs into natural language explanations and counterfactuals. Our framework lays the theoretical groundwork for Causal Reflective agents that can adapt, self-correct, and communicate causal understanding in evolving environments.


【4】CARD: Cache-Assisted Parallel Speculative Decoding for Efficient Large Language Model Inference
链接:https://arxiv.org/abs/2508.04462

作者:, Kai Sheng, Hao Chen, Xin He
Abstract: Speculative decoding (SD), where an extra draft model first provides multiple draft tokens and the original target model then verifies these tokens in parallel, has shown great power for LLM inference acceleration. However, existing SD methods must adhere to the 'draft-then-verify' paradigm, which forces drafting and verification to execute sequentially during SD, resulting in inefficient inference performance and limiting the size of the draft model. Furthermore, once a single token in the candidate sequence is rejected during the drafting process, all subsequent candidate tokens must be discarded, leading to inefficient drafting. To address these challenges, we propose CARD, a cache-based parallel speculative decoding framework employing a 'query-and-correct' paradigm. Specifically, CARD decouples drafting and verification: the draft model generates candidate tokens to populate a shared cache, while the target model concurrently rectifies the draft model's generation direction. This effectively enables the target model to perform inference at a speed approaching that of the draft model. Our approach achieves up to a 4.83× speedup over vanilla decoding without requiring fine-tuning of either the draft or target models. Our code is available at https://github.com/hunzhizi/CARD.


【5】Automatic LLM Red Teaming
链接:https://arxiv.org/abs/2508.04451

作者:aire, Arunesh Sinha, Pradeep Varakantham
Abstract: Red teaming is critical for identifying vulnerabilities and building trust in current LLMs. However, current automated methods for Large Language Models (LLMs) rely on brittle prompt templates or single-turn attacks, failing to capture the complex, interactive nature of real-world adversarial dialogues. We propose a novel paradigm: training an AI to strategically 'break' another AI. By formalizing red teaming as a Markov Decision Process (MDP) and employing a hierarchical Reinforcement Learning (RL) framework, we effectively address the inherent sparse-reward and long-horizon challenges. Our generative agent learns coherent, multi-turn attack strategies through a fine-grained, token-level harm reward, enabling it to uncover subtle vulnerabilities missed by existing baselines. This approach sets a new state of the art, fundamentally reframing LLM red teaming as a dynamic, trajectory-based process (rather than a one-step test) essential for robust AI deployment.


【6】StepFun-Formalizer: Unlocking the Autoformalization Potential of LLMs through Knowledge-Reasoning Fusion
链接:https://arxiv.org/abs/2508.04440

作者:, Di Huang, Ruosi Wan, Yue Peng, Shijie Shang, Chenrui Cao, Lei Qi, Rui Zhang, Zidong Du, Jie Yan, Xing Hu
备注:24 pages, 17 figures, under review
Abstract: Autoformalization aims to translate natural-language mathematical statements into a formal language. While LLMs have accelerated progress in this area, existing methods still suffer from low accuracy. We identify two key abilities for effective autoformalization: comprehensive mastery of formal-language domain knowledge, and the reasoning capability needed for natural-language problem understanding and informal-formal alignment. Without the former, a model cannot identify the correct formal objects; without the latter, it struggles to interpret real-world contexts and map them precisely into formal expressions. To address these gaps, we introduce ThinkingF, a data synthesis and training pipeline that improves both abilities. First, we construct two datasets: one by distilling and selecting large-scale examples rich in formal knowledge, and another by generating informal-to-formal reasoning trajectories guided by expert-designed templates. We then apply SFT and RLVR with these datasets to further fuse and refine the two abilities. The resulting 7B and 32B models exhibit both comprehensive formal knowledge and strong informal-to-formal reasoning. Notably, StepFun-Formalizer-32B achieves SOTA BEq@1 scores of 40.5% on FormalMATH-Lite and 26.7% on ProverBench, surpassing all prior general-purpose and specialized models.


【7】FlexQ: Efficient Post-training INT6 Quantization for LLM Serving via Algorithm-System Co-Design
链接:https://arxiv.org/abs/2508.04405

作者:, Aining Jia, Weifeng Bu, Yushu Cai, Kai Sheng, Hao Chen, Xin He
Abstract: Large Language Models (LLMs) demonstrate exceptional performance but entail significant memory and computational costs, restricting their practical deployment. While existing INT4/INT8 quantization schemes reduce these costs, they often degrade accuracy or lack optimal efficiency. INT6 quantization offers a superior trade-off between model accuracy and inference efficiency, but lacks hardware support in modern GPUs, forcing emulation via higher-precision arithmetic units, which limits acceleration.

In this paper, we propose FlexQ, a novel post-training INT6 quantization framework combining algorithmic innovation with system-level optimizations. FlexQ employs uniform 6-bit weight quantization across all layers, with adaptive retention of 8-bit activations in layers identified through layer-wise sensitivity analysis. To maximize hardware efficiency, we develop a specialized high-performance GPU kernel supporting matrix multiplication for W6A6 and W6A8 representations via Binary Tensor Core (BTC) equivalents, effectively bypassing the lack of native INT6 tensor cores. Evaluations on LLaMA models show FlexQ maintains near-FP16 accuracy, with perplexity increases of no more than 0.05. The proposed kernel achieves an average 1.39× speedup over ABQ-LLM on LLaMA-2-70B linear layers. End-to-end, FlexQ delivers 1.33× inference acceleration and 1.21× memory savings over SmoothQuant. Code is released at https://github.com/FlyFoxPlayer/FlexQ.
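For intuition about the weight side of such a scheme, here is a minimal sketch of uniform symmetric 6-bit quantization (per-tensor scaling for brevity; FlexQ's actual kernels, per-channel scales, and activation handling are not reproduced here):

```python
import torch

def quantize_int6(w: torch.Tensor):
    # Uniform symmetric 6-bit quantization: integer range [-32, 31].
    qmax = 2 ** (6 - 1) - 1                  # 31
    scale = w.abs().max() / qmax             # per-tensor scale; real systems use per-channel
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale           # 6-bit values stored in int8 containers

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4096, 4096)
q, s = quantize_int6(w)
mean_err = (dequantize(q, s) - w).abs().mean()  # rough quantization error
```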


【8】Improving Crash Data Quality with Large Language Models: Evidence from Secondary Crash Narratives in Kentucky
链接:https://arxiv.org/abs/2508.04399

作者:Mei Chen
备注:19 pages, 2 figures
Abstract: This study evaluates advanced natural language processing (NLP) techniques to enhance crash data quality by mining crash narratives, using secondary crash identification in Kentucky as a case study. Drawing on 16,656 manually reviewed narratives from 2015-2022, with 3,803 confirmed secondary crashes, we compare three model classes: zero-shot open-source large language models (LLMs) (LLaMA3:70B, DeepSeek-R1:70B, Qwen3:32B, Gemma3:27B); fine-tuned transformers (BERT, DistilBERT, RoBERTa, XLNet, Longformer); and traditional logistic regression as a baseline. Models were calibrated on 2015-2021 data and tested on 1,771 narratives from 2022. Fine-tuned transformers achieved superior performance, with RoBERTa yielding the highest F1-score (0.90) and accuracy (95%). Zero-shot LLaMA3:70B reached a comparable F1 of 0.86 but required 139 minutes of inference; the logistic baseline lagged well behind (F1: 0.66). LLMs excelled in recall for some variants (e.g., Gemma3:27B at 0.94) but incurred high computational costs (up to 723 minutes for DeepSeek-R1:70B), while fine-tuned models processed the test set in seconds after brief training. Further analysis indicated that mid-sized LLMs (e.g., DeepSeek-R1:32B) can rival larger counterparts in performance while reducing runtime, suggesting opportunities for optimized deployments. Results highlight trade-offs between accuracy, efficiency, and data requirements, with fine-tuned transformer models balancing precision and recall effectively on Kentucky data. Practical deployment considerations emphasize privacy-preserving local deployment, ensemble approaches for improved accuracy, and incremental processing for scalability, providing a replicable scheme for enhancing crash-data quality with advanced NLP.


【9】Chain of Questions: Guiding Multimodal Curiosity in Language Models
链接:https://arxiv.org/abs/2508.04350

作者:Kia Dashtipour
Abstract: Reasoning capabilities in large language models (LLMs) have substantially advanced through methods such as chain-of-thought and explicit step-by-step explanations. However, these improvements have not yet fully transitioned to multimodal contexts, where models must proactively decide which sensory modalities, such as vision, audio, or spatial perception, to engage when interacting with complex real-world environments. In this paper, we introduce the Chain of Questions (CoQ) framework, a curiosity-driven reasoning approach that encourages multimodal language models to dynamically generate targeted questions regarding their surroundings. These generated questions guide the model to selectively activate relevant modalities, thereby gathering the critical information necessary for accurate reasoning and response generation. We evaluate our framework on a novel multimodal benchmark dataset, assembled by integrating the WebGPT, ScienceQA, AVSD, and ScanQA datasets. Experimental results demonstrate that our CoQ method improves a foundation model's ability to effectively identify and integrate pertinent sensory information. This leads to improved accuracy, interpretability, and alignment of the reasoning process with diverse multimodal tasks.


【10】Forgetting: A New Mechanism Towards Better Large Language Model Fine-tuning
链接:https://arxiv.org/abs/2508.04329

作者:i Ghahrizjani, Alireza Taban, Qizhou Wang, Shanshan Ye, Abdolreza Mirzaei, Tongliang Liu, Bo Han
Abstract: Supervised fine-tuning (SFT) plays a critical role for pretrained large language models (LLMs), notably enhancing their capacity to acquire domain-specific knowledge while preserving or potentially augmenting their general-purpose capabilities. However, the efficacy of SFT hinges on data quality as well as data volume; otherwise, it may result in limited performance gains or even degradation relative to the associated baselines. To mitigate such reliance, we suggest categorizing the tokens within each corpus into two parts -- positive and negative tokens -- based on whether they are useful for improving model performance. Positive tokens can be trained in common ways, whereas negative tokens, which may lack essential semantics or be misleading, should be explicitly forgotten. Overall, the token categorization helps the model learn less from uninformative messages, and the forgetting process shapes a knowledge boundary that guides the model on what information to learn more precisely. We conduct experiments on well-established benchmarks, finding that this forgetting mechanism not only improves overall model performance but also facilitates more diverse model responses.
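The abstract does not specify the forgetting objective; one plausible reading, sketched below under the assumption that "explicitly forgotten" means ascending the likelihood loss on negative tokens, is a masked per-token loss (the mask source and the `forget_weight` coefficient are illustrative assumptions, not the paper's settings):

```python
import torch
import torch.nn.functional as F

def sft_with_forgetting(logits, labels, positive_mask, forget_weight=0.1):
    # logits: (B, T, V); labels: (B, T); positive_mask: (B, T) boolean.
    ce = F.cross_entropy(logits.transpose(1, 2), labels, reduction="none")  # (B, T)
    pos = positive_mask.float()
    neg = 1.0 - pos
    learn = (ce * pos).sum() / pos.sum().clamp(min=1.0)    # fit positive tokens as usual
    forget = (ce * neg).sum() / neg.sum().clamp(min=1.0)   # push loss up on negative tokens
    return learn - forget_weight * forget

logits = torch.randn(2, 8, 100, requires_grad=True)
labels = torch.randint(0, 100, (2, 8))
mask = torch.rand(2, 8) > 0.2        # pretend ~80% of tokens were judged "positive"
sft_with_forgetting(logits, labels, mask).backward()
```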


【11】Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models
链接:https://arxiv.org/abs/2508.04325

作者:, Wenxuan Wang, Guo Yu, Yiu-Fai Cheung, Meidan Ding, Jie Liu, Wenting Chen, Linlin Shen
Abstract: Large language models (LLMs) show significant potential in healthcare, prompting numerous benchmarks to evaluate their capabilities. However, concerns persist regarding the reliability of these benchmarks, which often lack clinical fidelity, robust data management, and safety-oriented evaluation metrics. To address these shortcomings, we introduce MedCheck, the first lifecycle-oriented assessment framework specifically designed for medical benchmarks. Our framework deconstructs a benchmark's development into five continuous stages, from design to governance, and provides a comprehensive checklist of 46 medically tailored criteria. Using MedCheck, we conducted an in-depth empirical evaluation of 53 medical LLM benchmarks. Our analysis uncovers widespread, systemic issues, including a profound disconnect from clinical practice, a crisis of data integrity due to unmitigated contamination risks, and a systematic neglect of safety-critical evaluation dimensions like model robustness and uncertainty awareness. Based on these findings, MedCheck serves as both a diagnostic tool for existing benchmarks and an actionable guideline to foster a more standardized, reliable, and transparent approach to evaluating AI in healthcare.


【12】Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success
链接:https://arxiv.org/abs/2508.04280

作者:edis, Stanislav Dereka, Viacheslav Sinii, Ruslan Rakhimov, Daniil Gavrilov
Abstract: Interactive multimodal agents must convert raw visual observations into coherent sequences of language-conditioned actions -- a capability that current vision-language models (VLMs) still lack. Earlier reinforcement-learning (RL) efforts could, in principle, endow VLMs with such skills, but they have seldom tested whether the learned behaviours generalize beyond their training simulators, and they depend either on brittle hyperparameter tuning or on dense-reward environments with low state variability. We introduce Vision-Language Decoupled Actor-Critic (VL-DAC), a lightweight, hyperparameter-free RL algorithm. VL-DAC applies PPO updates to action tokens while learning value only at the environment-step level: an arrangement, to our knowledge, not previously explored for large VLMs or LLMs. This simple decoupling removes unstable weighting terms and yields faster, more reliable convergence. Training a single VLM with VL-DAC in one inexpensive simulator at a time (MiniWorld, Gym-Cards, ALFWorld, or WebShop) already produces policies that generalize widely: +50% relative on BALROG (game-centric agentic control), +5% relative on the hardest part of VSI-Bench (spatial planning), and +2% on VisualWebBench (web navigation), all without degrading general image-understanding accuracy. These results provide the first evidence that a simple RL algorithm can train VLMs entirely in cheap synthetic worlds while delivering measurable gains on real-image agentic, spatial-reasoning, and web-navigation benchmarks.


【13】Mockingbird: How does LLM perform in general machine learning tasks?
链接:https://arxiv.org/abs/2508.04279

作者:, Yoshiki Obinata, Kento Kawaharazuka, Kei Okada
Abstract: Large language models (LLMs) are now used with increasing frequency as chatbots, tasked with summarizing information or generating text and code in accordance with user instructions. The rapid increase in the reasoning capabilities and inference speed of LLMs has revealed their remarkable potential for applications extending beyond the domain of chatbots to general machine learning tasks. This work is conducted out of curiosity about that potential. In this work, we propose a framework, Mockingbird, to adapt LLMs to general machine learning tasks and evaluate its performance and scalability on several general machine learning tasks. The core concept of this framework is instructing LLMs to role-play functions and reflect on their mistakes to improve themselves. Our evaluation and analysis show that LLM-driven machine learning methods, such as Mockingbird, can achieve acceptable results on common machine learning tasks; however, reflecting on its own mistakes alone currently cannot outperform the effect of domain-specific documents and feedback from human experts.


【14】Empowering Time Series Forecasting with LLM-Agents
链接:https://arxiv.org/abs/2508.04231

作者:Michael Yeh, Vivian Lai, Uday Singh Saini, Xiran Fan, Yujie Fan, Junpeng Wang, Xin Dai, Yan Zheng
Abstract: Large Language Model (LLM) powered agents have emerged as effective planners for Automated Machine Learning (AutoML) systems. While most existing AutoML approaches focus on automating feature engineering and model-architecture search, recent studies in time series forecasting suggest that lightweight models can often achieve state-of-the-art performance. This observation led us to explore improving data quality, rather than model architecture, as a potentially fruitful direction for AutoML on time series data. We propose DCATS, a Data-Centric Agent for Time Series. DCATS leverages metadata accompanying time series to clean data while optimizing forecasting performance. We evaluated DCATS using four time series forecasting models on a large-scale traffic volume forecasting dataset. Results demonstrate that DCATS achieves an average 6% error reduction across all tested models and time horizons, highlighting the potential of data-centric approaches in AutoML for time series forecasting.


【15】Hierarchical Text Classification Using Black Box Large Language Models
链接:https://arxiv.org/abs/2508.04219

作者:shimura, Hisashi Kashima
备注:16 pages, 6 figures
Abstract: Hierarchical Text Classification (HTC) aims to assign texts to structured label hierarchies; however, it faces challenges due to data scarcity and model complexity. This study explores the feasibility of using black-box Large Language Models (LLMs) accessed via APIs for HTC, as an alternative to traditional machine learning methods that require extensive labeled data and computational resources. We evaluate three prompting strategies -- Direct Leaf Label Prediction (DL), Direct Hierarchical Label Prediction (DH), and Top-down Multi-step Hierarchical Label Prediction (TMH) -- in both zero-shot and few-shot settings, comparing the accuracy and cost-effectiveness of these strategies. Experiments on two datasets show that a few-shot setting consistently improves classification accuracy compared to a zero-shot setting. While a traditional machine learning model achieves high accuracy on a dataset with a shallow hierarchy, LLMs, especially with the DH strategy, tend to outperform the machine learning model on a dataset with a deeper hierarchy. API costs increase significantly with the DH strategy due to the higher input-token counts required for deeper label hierarchies. These results emphasize the trade-off between accuracy improvement and the computational cost of a prompting strategy. The findings highlight the potential of black-box LLMs for HTC while underscoring the need to carefully select a prompting strategy that balances performance and cost.
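To make the TMH strategy concrete, a minimal sketch of top-down multi-step prompting might look like the following; `call_llm` is a placeholder for whichever black-box API is in use, and the prompt wording and example hierarchy are invented for illustration:

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # stand-in for a black-box LLM API call

def tmh_classify(text: str, hierarchy: dict, root: str = "ROOT") -> list:
    """Top-down Multi-step Hierarchical prediction: one LLM call per level."""
    path, node = [], root
    while hierarchy.get(node):                   # descend until we reach a leaf
        children = hierarchy[node]
        prompt = (
            f"Text: {text}\n"
            f"Choose exactly one category from: {', '.join(children)}\n"
            "Answer with the category name only."
        )
        node = call_llm(prompt).strip()
        if node not in children:                 # guard against off-list answers
            break
        path.append(node)
    return path

# Example two-level hierarchy; leaf labels have no entry.
hierarchy = {"ROOT": ["Science", "Sports"], "Science": ["Physics", "Biology"]}
```

This is also where the cost trade-off reported above shows up: TMH spends several short calls, each carrying only one level's labels, whereas DH packs the whole hierarchy into a single, much longer prompt.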


【16】Model Inversion Attacks on Vision-Language Models: Do They Leak What They Learn?
链接:https://arxiv.org/abs/2508.04097

作者:Nguyen, Sy-Tuyen Ho, Koh Jun Hao, Ngai-Man Cheung
备注:Under review
Abstract: Model inversion (MI) attacks pose significant privacy risks by reconstructing private training data from trained neural networks. While prior works have focused on conventional unimodal DNNs, the vulnerability of vision-language models (VLMs) remains underexplored. In this paper, we conduct the first study to understand VLMs' vulnerability to leaking private visual training data. To tailor to VLMs' token-based generative nature, we propose a suite of novel token-based and sequence-based model inversion strategies. In particular, we propose Token-based Model Inversion (TMI), Convergent Token-based Model Inversion (TMI-C), Sequence-based Model Inversion (SMI), and Sequence-based Model Inversion with Adaptive Token Weighting (SMI-AW). Through extensive experiments and a user study on three state-of-the-art VLMs and multiple datasets, we demonstrate, for the first time, that VLMs are susceptible to training-data leakage. The experiments show that our proposed sequence-based methods, particularly SMI-AW combined with a logit-maximization loss based on vocabulary representation, can achieve competitive reconstruction and outperform token-based methods in attack accuracy and visual similarity. Importantly, human evaluation of the reconstructed images yields an attack accuracy of 75.31%, underscoring the severity of model inversion threats in VLMs. Notably, we also demonstrate inversion attacks on publicly released VLMs. Our study reveals the privacy vulnerability of VLMs as they become increasingly popular across many applications such as healthcare and finance.


【17】Efficient Strategy for Improving Large Language Model (LLM) Capabilities
链接:https://arxiv.org/abs/2508.04073

作者:milo Velandia Gutiérrez
备注:Based on master's thesis in Systems and Computer Engineering, Universidad Nacional de Colombia (2025)
Abstract: Large Language Models (LLMs) have become a milestone in the field of artificial intelligence and natural language processing. However, their large-scale deployment remains constrained by the need for significant computational resources. This work proposes starting from a base model and exploring and combining data processing and careful data selection techniques, training strategies, and architectural adjustments to improve the efficiency of LLMs in resource-constrained environments and within a delimited knowledge base. The methodological approach included defining criteria for building reliable datasets, conducting controlled experiments with different configurations, and systematically evaluating the resulting variants in terms of capability, versatility, response time, and safety. Finally, comparative tests were conducted to measure the performance of the developed variants and to validate the effectiveness of the proposed strategies. This work is based on the master's thesis in Systems and Computer Engineering titled "Efficient Strategy for Improving the Capabilities of Large Language Models (LLMs)".


【18】Hallucination to Truth: A Review of Fact-Checking and Factuality Evaluation in Large Language Models
链接:https://arxiv.org/abs/2508.03860

作者:di Rahman, Md. Adnanul Islam, Md. Mahbub Alam, Musarrat Zeba, Md. Abdur Rahman, Sadia Sultana Chowa, Mohaimenul Azam Khan Raiaan, Sami Azam
备注:30 pages, 11 figures, 6 tables. Submitted to Artificial Intelligence Review for peer review
Abstract: Large Language Models (LLMs) are trained on vast and diverse internet corpora that often include inaccurate or misleading content. Consequently, LLMs can generate misinformation, making robust fact-checking essential. This review systematically analyzes how LLM-generated content is evaluated for factual accuracy by exploring key challenges such as hallucinations, dataset limitations, and the reliability of evaluation metrics. The review emphasizes the need for strong fact-checking frameworks that integrate advanced prompting strategies, domain-specific fine-tuning, and retrieval-augmented generation (RAG) methods. It proposes five research questions that guide the analysis of the recent literature from 2020 to 2025, focusing on evaluation methods and mitigation techniques. The review also discusses the role of instruction tuning, multi-agent reasoning, and external knowledge access via RAG frameworks. Key findings highlight the limitations of current metrics, the value of grounding outputs in validated external evidence, and the importance of domain-specific customization for improving factual consistency. Overall, the review underlines the importance of building LLMs that are not only accurate and explainable but also tailored for domain-specific fact-checking. These insights contribute to the advancement of research toward more trustworthy and context-aware language models.


【19】GTPO: Trajectory-Based Policy Optimization in Large Language Models
链接:https://arxiv.org/abs/2508.03772

作者:oni, Aleksandar Fontana, Giulio Rossolini, Andrea Saracino
Abstract: Policy-based optimizations are widely adopted today for the training and alignment of language models, where one of the most recent and effective approaches is Group-relative Policy Optimization (GRPO). In this paper, we reveal and analyze two major limitations of GRPO: (i) tokens frequently appear in completions with both positive and negative rewards, leading to conflicting gradient updates that can reduce their output probability, even though they can be essential for maintaining proper structure; (ii) negatively rewarded completions may penalize confident responses and shift model decisions toward unlikely tokens, progressively flattening the output distribution and degrading learning. To address these issues and provide a more stable and effective policy optimization strategy, we introduce GTPO (Group-relative Trajectory-based Policy Optimization), which identifies conflict tokens (tokens appearing in the same position across completions with opposite rewards) and protects them by skipping negative updates while amplifying positive ones. To further prevent policy collapse, GTPO filters out completions whose entropy exceeds a provable threshold. Unlike GRPO, GTPO does not rely on KL-divergence regularization, eliminating the need for a reference model during training, while still ensuring greater training stability and improved performance, validated through multiple experiments on the GSM8K, MATH, and AIME 2024 benchmarks.
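A hedged sketch of the conflict-token idea, reduced to its combinatorial core (the reward-sign threshold, masking policy, and data layout are assumptions based only on the abstract's description):

```python
from collections import defaultdict

def conflict_token_masks(completions, rewards):
    """completions: list of token-id lists sampled for the same prompt;
    rewards: one scalar per completion. Returns per-completion masks that are
    False where the token at that position also occurs, at the same position,
    under the opposite reward sign and this completion's reward is negative,
    i.e. negative updates on conflict tokens are skipped."""
    signs = defaultdict(set)                      # (position, token) -> {reward signs}
    for toks, r in zip(completions, rewards):
        for pos, t in enumerate(toks):
            signs[(pos, t)].add(r > 0)
    masks = []
    for toks, r in zip(completions, rewards):
        masks.append([
            not (len(signs[(pos, t)]) == 2 and r <= 0)
            for pos, t in enumerate(toks)
        ])
    return masks

# Token 5 (position 0) and token 2 (position 2) appear under both reward
# signs, so the negatively rewarded completion's updates there are masked.
masks = conflict_token_masks([[5, 9, 2], [5, 7, 2]], [1.0, -1.0])
```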


【20】LLM-Prior: A Framework for Knowledge-Driven Prior Elicitation and Aggregation
链接:https://arxiv.org/abs/2508.03766

作者:Huang
Abstract: The specification of prior distributions is fundamental in Bayesian inference, yet it remains a significant bottleneck. The prior elicitation process is often a manual, subjective, and unscalable task. We propose a novel framework that leverages Large Language Models (LLMs) to automate and scale this process. We introduce LLMPrior, a principled operator that translates rich, unstructured contexts such as natural language descriptions, data, or figures into valid, tractable probability distributions. We formalize this operator by architecturally coupling an LLM with an explicit, tractable generative model, such as a Gaussian Mixture Model (forming an LLM-based Mixture Density Network), ensuring the resulting prior satisfies essential mathematical properties. We further extend this framework to multi-agent systems where Logarithmic Opinion Pooling is employed to aggregate prior distributions induced by decentralized knowledge. We present the federated prior aggregation algorithm, Fed-LLMPrior, for aggregating distributed, context-dependent priors in a manner robust to agent heterogeneity. This work provides the foundation for a new class of tools that can potentially lower the barrier to entry for sophisticated Bayesian modeling.
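Logarithmic Opinion Pooling itself is standard: the pooled log-density is a weighted sum of the agents' log-densities, renormalized. A minimal numerical sketch over a 1-D grid (the two agents' Gaussian priors and equal weights below are illustrative, not from the paper):

```python
import numpy as np
from scipy.stats import norm

def log_opinion_pool(logpdfs, weights, grid):
    pooled = sum(w * lp for w, lp in zip(weights, logpdfs))  # weighted log-densities
    pooled -= pooled.max()                                   # numerical stability
    dens = np.exp(pooled)
    dens /= dens.sum() * (grid[1] - grid[0])                 # renormalize on the grid
    return dens

grid = np.linspace(-10.0, 10.0, 2001)
# Two "agents" elicit different Gaussian priors for the same parameter.
lp1 = norm.logpdf(grid, loc=-1.0, scale=2.0)
lp2 = norm.logpdf(grid, loc=2.0, scale=1.0)
pooled = log_opinion_pool([lp1, lp2], weights=[0.5, 0.5], grid=grid)
```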


【21】Latent Knowledge Scalpel: Precise and Massive Knowledge Editing for Large Language Models
链接:https://arxiv.org/abs/2508.03741

作者:Qiyang Song, Shaowen Xu, Kerou Zhou, Wenbo Jiang, Xiaoqi Jia, Weijuan Zhang, Heqing Huang, Yakai Li
备注:Accepted by ECAI 2025 - 28th European Conference on Artificial Intelligence
Abstract: Large Language Models (LLMs) often retain inaccurate or outdated information from pre-training, leading to incorrect predictions or biased outputs during inference. While existing model editing methods can address this challenge, they struggle with editing large amounts of factual information simultaneously and may compromise the general capabilities of the models. In this paper, our empirical study demonstrates that it is feasible to edit the internal representations of LLMs and replace the entities in a manner similar to editing natural language inputs. Based on this insight, we introduce the Latent Knowledge Scalpel (LKS), an LLM editor that manipulates the latent knowledge of specific entities via a lightweight hypernetwork to enable precise and large-scale editing. Experiments conducted on Llama-2 and Mistral show that even with the number of simultaneous edits reaching 10,000, LKS effectively performs knowledge editing while preserving the general abilities of the edited LLMs. Code is available at: https://github.com/Linuxin-xxx/LKS.


【22】CX-Mind: A Pioneering Multimodal Large Language Model for Interleaved Reasoning in Chest X-ray via Curriculum-Guided Reinforcement Learning
链接:https://arxiv.org/abs/2508.03733

作者:, Yujie Zhang, Haoran Sun, Yueqi Li, Fanrui Zhang, Mengzhe Xu, Victoria Borja Clausich, Sade Mellin, Renhao Yang, Chenrun Wang, Jethro Zih-Shuo Wang, Shiyi Yao, Gen Li, Yidong Xu, Hanyu Wang, Yilin Huang, Angela Lin Wang, Chen Shi, Yin Zhang, Jianan Guo, Luqi Yang, Renxuan Li, Yang Xu, Jiawei Liu, Yao Zhang, Lei Liu, Carlos Gutiérrez SanRomán, Lei Wang
Abstract: Chest X-ray (CXR) imaging is one of the most widely used diagnostic modalities in clinical practice, encompassing a broad spectrum of diagnostic tasks. Recent advancements have seen the extensive application of reasoning-based multimodal large language models (MLLMs) in medical imaging to enhance diagnostic efficiency and interpretability. However, existing multimodal models predominantly rely on "one-time" diagnostic approaches, lacking verifiable supervision of the reasoning process. This leads to challenges in multi-task CXR diagnosis, including lengthy reasoning, sparse rewards, and frequent hallucinations. To address these issues, we propose CX-Mind, the first generative model to achieve interleaved "think-answer" reasoning for CXR tasks, driven by curriculum-based reinforcement learning and verifiable process rewards (CuRL-VPR). Specifically, we constructed an instruction-tuning dataset, CX-Set, comprising 708,473 images and 2,619,148 samples, and generated 42,828 high-quality interleaved reasoning data points supervised by clinical reports. Optimization was conducted in two stages under the Group Relative Policy Optimization framework: initially stabilizing basic reasoning with closed-domain tasks, followed by transfer to open-domain diagnostics, incorporating rule-based conditional process rewards to bypass the need for pretrained reward models. Extensive experimental results demonstrate that CX-Mind significantly outperforms existing medical and general-domain MLLMs in visual understanding, text generation, and spatiotemporal alignment, achieving an average performance improvement of 25.1% over comparable CXR-specific models. On a real-world clinical dataset (Rui-CXR), CX-Mind achieves a mean recall@1 across 14 diseases that substantially surpasses the second-best results, with multi-center expert evaluations further confirming its clinical utility across multiple dimensions.


【23】FeynTune: Large Language Models for High-Energy Theory
链接:https://arxiv.org/abs/2508.03716

作者:mond, Prarit Agarwal, Borun Chowdhury, Vasilis Niarchos, Constantinos Papageorgakis
备注:16 pages
Abstract: We present specialized Large Language Models for theoretical High-Energy Physics, obtained as 20 fine-tuned variants of the 8-billion-parameter Llama-3.1 model. Each variant was trained on arXiv abstracts (through August 2024) from different combinations of hep-th, hep-ph, and gr-qc. For a comparative study, we also trained models on datasets that contained abstracts from disparate fields such as the q-bio and cs categories. All models were fine-tuned using two distinct Low-Rank Adaptation fine-tuning approaches and varying dataset sizes, and outperformed the base model on hep-th abstract-completion tasks. We compare performance against leading commercial LLMs (ChatGPT, Claude, Gemini, DeepSeek) and derive insights for further developing specialized language models for High-Energy Theoretical Physics.
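The abstract names Low-Rank Adaptation but not its configuration; a minimal fine-tuning setup in the same spirit, using the Hugging Face peft library (the rank, target modules, and hyperparameters below are illustrative guesses, not the paper's settings), might look like:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3.1-8B"   # the 8-billion-parameter base model named above
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# One possible LoRA configuration; the paper's actual choices are not given.
config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only a small fraction of the 8B weights train
```

Training would then proceed with a standard causal-LM objective on the abstract corpus, which is what makes per-category variants cheap to produce.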


【24】MD-LLM-1: A Large Language Model for Molecular Dynamics
链接:https://arxiv.org/abs/2508.03709

作者:in Murtada, Z. Faidon Brotzakis, Michele Vendruscolo
Abstract: Molecular dynamics (MD) is a powerful approach for modelling molecular systems, but it remains computationally intensive on the spatial and time scales of many macromolecular systems of biological interest. To explore the opportunities offered by deep learning to address this problem, we introduce a Molecular Dynamics Large Language Model (MD-LLM) framework to illustrate how LLMs can be leveraged to learn protein dynamics and discover states not seen in training. By applying MD-LLM-1, the first implementation of this approach, obtained by fine-tuning Mistral 7B, to the T4 lysozyme and Mad2 protein systems, we show that training on one conformational state enables the prediction of other conformational states. These results indicate that MD-LLM-1 can learn the principles for the exploration of the conformational landscapes of proteins, although it does not yet explicitly model their thermodynamics and kinetics.


Graph-related (graph learning | graph neural networks | graph optimization, etc.) (2 papers)

【1】GraphProp: Training the Graph Foundation Models using Graph Properties
链接:https://arxiv.org/abs/2508.04594

作者:n, Qi Feng, Lehao Lin, Chris Ding, Jicong Fan
Abstract: This work focuses on training graph foundation models (GFMs) that have strong generalization ability in graph-level tasks such as graph classification. Effective GFM training requires capturing information consistent across different domains. We discover that graph structures provide more consistent cross-domain information compared to node features and graph labels. However, traditional GFMs primarily focus on transferring node features from various domains into a unified representation space but often lack structural cross-domain generalization. To address this, we introduce GraphProp, which emphasizes structural generalization. The training process of GraphProp consists of two main phases. First, we train a structural GFM by predicting graph invariants. Since graph invariants are properties of graphs that depend only on the abstract structure, not on particular labellings or drawings of the graph, this structural GFM has a strong ability to capture abstract structural information and provide discriminative graph representations comparable across diverse domains. In the second phase, we use the representations given by the structural GFM as positional encodings to train a comprehensive GFM. This phase utilizes domain-specific node attributes and graph labels to further improve cross-domain node-feature generalization. Our experiments demonstrate that GraphProp significantly outperforms the competitors in supervised learning and few-shot learning, especially in handling graphs without node attributes.
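Graph invariants of the kind the first phase predicts are cheap to compute as label-free training targets; a small sketch with networkx (the particular invariants chosen here are illustrative, not the paper's list):

```python
import networkx as nx

def graph_invariants(g: nx.Graph) -> list:
    """Label-free structural targets a structural GFM could be trained to predict."""
    return [
        g.number_of_nodes(),
        g.number_of_edges(),
        sum(nx.triangles(g).values()) / 3,               # triangle count
        nx.average_clustering(g),
        nx.diameter(g) if nx.is_connected(g) else -1.0,  # sentinel for disconnected graphs
    ]

targets = graph_invariants(nx.karate_club_graph())
```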


【2】U-PINet: End-to-End Hierarchical Physics-Informed Learning With Sparse Graph Coupling for 3D EM Scattering Modeling
链接:https://arxiv.org/abs/2508.03774

作者:Yuexing Peng, Peng Wang, George C. Alexandropoulos, Wenbo Wang, Wei Xiang
Abstract: Electromagnetic (EM) scattering modeling is critical for radar remote sensing; however, its inherent complexity introduces significant computational challenges. Traditional numerical solvers offer high accuracy but suffer from scalability issues and substantial computational costs. Pure data-driven deep learning approaches, while efficient, lack embedded physical constraints during training and require extensive labeled data, limiting their applicability and generalization. To overcome these limitations, we propose a U-shaped Physics-Informed Network (U-PINet), the first fully deep-learning-based, physics-informed hierarchical framework for computational EM, designed to ensure physical consistency while maximizing computational efficiency. Motivated by the hierarchical decomposition strategy in EM solvers and the inherent sparsity of local EM coupling, the U-PINet models the decomposition and coupling of near- and far-field interactions through a multiscale processing neural network architecture, while employing a physics-inspired sparse graph representation to efficiently model both self- and mutual-coupling among mesh elements of complex 3-Dimensional (3D) objects. This principled approach enables end-to-end multiscale EM scattering modeling with improved efficiency, generalization, and physical consistency. Experimental results showcase that the U-PINet accurately predicts surface current distributions, achieving close agreement with a traditional solver, while significantly reducing computational time and outperforming conventional deep learning baselines in both accuracy and robustness. Furthermore, our evaluations on radar cross-section prediction tasks confirm the feasibility of the U-PINet for downstream EM scattering applications.


Transformers (4 papers)

【1】Small transformer architectures for task switching
链接:https://arxiv.org/abs/2508.04461

作者:Gros
备注:ICANN 2025, in press
Abstract: The rapid progress seen in terms of large-scale generative AI is largely based on the attention mechanism. It is conversely non-trivial to conceive small-scale applications for which attention-based architectures outperform traditional approaches, such as multi-layer perceptrons or recurrent networks. We examine this problem in the context of 'task switching'. In this framework, models work on ongoing token sequences, with the current task determined by stochastically interspersed control tokens. We show that standard transformers cannot solve a basic task-switching reference model based on finite-domain arithmetic, which contains subtasks dedicated to increment / addition / reverse copy / context (IARC). We show that transformers, long short-term memory recurrent networks (LSTMs), and plain multi-layer perceptrons (MLPs) achieve similar, but only modest, prediction accuracies. We enlarge our comparative study by including an extension of the standard transformer architecture to its non-translation-invariant counterpart, the cisformer, and an alternative attention mechanism, extensive attention. A combination of the latter is found to be the only model able to achieve considerable performance levels, of around 95%. Our results indicate that the workings of attention can be understood better, and even improved, when comparing qualitatively different formulations in task-switching settings.


【2】Quantum Temporal Fusion Transformer
链接:https://arxiv.org/abs/2508.04048

作者:nta Barik, Goutam Paul
Abstract: The Temporal Fusion Transformer (TFT), proposed by Lim et al. [International Journal of Forecasting, 2021], is a state-of-the-art attention-based deep neural network architecture specifically designed for multi-horizon time series forecasting. It has demonstrated significant performance improvements over existing benchmarks. In this work, we propose a Quantum Temporal Fusion Transformer (QTFT), a quantum-enhanced hybrid quantum-classical architecture that extends the capabilities of the classical TFT framework. Our results demonstrate that QTFT is successfully trained on the forecasting datasets and is capable of accurately predicting future values. In particular, our experimental results display that in certain test cases, the model outperforms its classical counterpart in terms of both training and test loss, while in the remaining cases, it achieves comparable performance. A key advantage of our approach lies in its foundation on a variational quantum algorithm, enabling implementation on current noisy intermediate-scale quantum (NISQ) devices without strict requirements on the number of qubits or circuit depth.


【3】Delving Deeper Into Astromorphic Transformers
链接:https://arxiv.org/abs/2312.10925

作者:Ahmed Mia, Malyaban Bal, Abhronil Sengupta
Abstract: Preliminary attempts at incorporating the critical role of astrocytes, cells that constitute more than 50% of human brain cells, in brain-inspired neuromorphic computing remain in their infancy. This paper seeks to delve deeper into various key aspects of neuron-synapse-astrocyte interactions to mimic self-attention mechanisms in Transformers. The cross-layer perspective explored in this work involves bioplausible modeling of Hebbian and presynaptic plasticities in neuron-astrocyte networks, incorporating effects of non-linearities and feedback along with algorithmic formulations to map the neuron-astrocyte computations to the self-attention mechanism, and evaluating the impact of incorporating bio-realistic effects from the machine learning application side. Our analysis on sentiment and image classification tasks (IMDB and CIFAR10 datasets) highlights the advantages of Astromorphic Transformers, offering improved accuracy and learning speed. Furthermore, the model demonstrates strong natural language generation capabilities on the WikiText-2 dataset, achieving better perplexity compared to conventional models, thus showcasing enhanced generalization and stability across diverse machine learning tasks.


【4】Negative binomial regression and inference using a pre-trained transformer
标题:使用预先训练的Transformer进行负二项回归和推断
链接:https://arxiv.org/abs/2508.04111

作者: Svensson
备注:6 pages, 5 figures
摘要:负二项回归对于分析比较研究中的过度离散计数数据至关重要,但在需要进行数百万次比较的大规模筛选中,参数估计在计算上具有挑战性。我们研究使用预训练的Transformer从观测到的计数数据中估计负二项回归参数:通过合成数据生成进行训练,让模型学会反转"由参数生成计数"的过程。该Transformer方法在参数精度上优于最大似然优化,同时速度快20倍。然而,比较结果出人意料地显示,矩估计方法在精度上与最大似然优化相当,同时速度快1,000倍,且产生校准更好、功效更强的检验,使其成为该应用中最高效的解决方案。
摘要:Negative binomial regression is essential for analyzing over-dispersed count data in comparative studies, but parameter estimation becomes computationally challenging in large screens requiring millions of comparisons. We investigate using a pre-trained transformer to produce estimates of negative binomial regression parameters from observed count data, trained through synthetic data generation to learn to invert the process of generating counts from parameters. The transformer method achieved better parameter accuracy than maximum likelihood optimization while being 20 times faster. However, comparisons unexpectedly revealed that method-of-moments estimates performed as well as maximum likelihood optimization in accuracy, while being 1,000 times faster and producing better-calibrated and more powerful tests, making it the most efficient solution for this application.
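为直观起见,下面给出负二项参数矩估计的一个极简草图(标准统计做法的示意,并非论文代码;参数化方式 var = mu + mu^2/theta 为常见约定):

import numpy as np

def nb_method_of_moments(counts):
    # 参数化:mean = mu,var = mu + mu**2 / theta(theta 为离散度)
    counts = np.asarray(counts, dtype=float)
    m, v = counts.mean(), counts.var(ddof=1)
    if v <= m:
        raise ValueError("样本方差未超过均值,负二项矩估计不适用")
    return m, m**2 / (v - m)               # (mu_hat, theta_hat)

rng = np.random.default_rng(0)
theta, mu = 5.0, 3.0
lam = rng.gamma(shape=theta, scale=mu / theta, size=100_000)   # Gamma-Poisson 混合采样负二项数据
x = rng.poisson(lam)
print(nb_method_of_moments(x))             # 应接近 (3.0, 5.0)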


GAN|对抗|攻击|生成相关(9篇)

【1】Attack Pattern Mining to Discover Hidden Threats to Industrial Control Systems
标题:攻击模式挖掘以发现工业控制系统的隐藏威胁
链接:https://arxiv.org/abs/2508.04561

作者:Azmi Umer, Chuadhry Mujeeb Ahmed, Aditya Mathur, Muhammad Taha Jilani
摘要:本工作聚焦于工业控制系统(ICS)安全背景下攻击模式挖掘的验证。对ICS进行全面的安全评估需要生成大量且多样的攻击模式。为此,我们提出了一种数据驱动的技术来为ICS生成攻击模式。所提出的技术已被用于从一座运行中的水处理厂收集的数据中生成超过100,000个攻击模式。在这项工作中,我们给出了一个详细的案例研究来验证这些攻击模式。
摘要:This work focuses on validation of attack pattern mining in the context of Industrial Control System (ICS) security. A comprehensive security assessment of an ICS requires generating a large variety of attack patterns. For this purpose we have proposed a data-driven technique to generate attack patterns for an ICS. The proposed technique has been used to generate over 100,000 attack patterns from data gathered from an operational water treatment plant. In this work we present a detailed case study to validate the attack patterns.


【2】Emotion Detection Using Conditional Generative Adversarial Networks (cGAN): A Deep Learning Approach
标题:使用条件生成对抗网络(cGAN)的情绪检测:一种深度学习方法
链接:https://arxiv.org/abs/2508.04481

作者:rivastava
备注:3 pages, 2 tables, submitted for arXiv preprint
摘要:本文提出了一种基于深度学习的方法,使用条件生成对抗网络(cGAN)进行情感检测。与依赖于单一数据类型的传统单峰技术不同,我们探索了一个集成文本,音频和面部表情的多模态框架。所提出的cGAN架构经过训练,可以生成合成的情感丰富的数据,并提高多种模态的分类准确性。我们的实验结果表明,与基线模型相比,情感识别性能有显着提高。这项工作突出了cGAN通过实现更细致的情感理解来增强人机交互系统的潜力。
摘要:This paper presents a deep learning-based approach to emotion detection using Conditional Generative Adversarial Networks (cGANs). Unlike traditional unimodal techniques that rely on a single data type, we explore a multimodal framework integrating text, audio, and facial expressions. The proposed cGAN architecture is trained to generate synthetic emotion-rich data and improve classification accuracy across multiple modalities. Our experimental results demonstrate significant improvements in emotion recognition performance compared to baseline models. This work highlights the potential of cGANs in enhancing human-computer interaction systems by enabling more nuanced emotional understanding.


【3】LayerT2V: Interactive Multi-Object Trajectory Layering for Video Generation
标题:LayerT2V:用于视频生成的交互式多对象轨迹分层
链接:https://arxiv.org/abs/2508.04228

作者:en, Baixuan Zhao, Yi Xin, Siqi Luo, Guangtao Zhai, Xiaohong Liu
备注:Project webpage: this https URL
摘要:在文本到视频(T2V)生成中控制对象运动轨迹是一个具有挑战性且相对未被开发的领域,特别是在涉及多个移动对象的场景中。T2V域中的大多数社区模型和数据集都是为单对象运动设计的,这限制了当前生成模型在多对象任务中的性能。此外,T2V中现有的运动控制方法要么缺乏对多对象运动场景的支持,要么在对象轨迹相交时经历严重的性能下降,这主要是由于碰撞区域中的语义冲突。为了解决这些限制,我们引入了LayerT2V,这是通过逐层合成背景和前景对象来生成视频的第一种方法。这种分层生成使得能够灵活地集成视频内的多个独立元素,将每个元素定位在不同的“层”上,从而促进连贯的多对象合成,同时增强对生成过程的控制。大量实验证明了LayerT2V在生成复杂多对象场景方面的优越性,与最先进的(SOTA)方法相比,mIoU和AP50指标分别提高了1.4倍和4.5倍。项目页面和代码可以在https://kr-panghu.github.io/LayerT2V/上找到。
摘要:Controlling object motion trajectories in Text-to-Video (T2V) generation is a challenging and relatively under-explored area, particularly in scenarios involving multiple moving objects. Most community models and datasets in the T2V domain are designed for single-object motion, limiting the performance of current generative models in multi-object tasks. Additionally, existing motion control methods in T2V either lack support for multi-object motion scenes or experience severe performance degradation when object trajectories intersect, primarily due to the semantic conflicts in colliding regions. To address these limitations, we introduce LayerT2V, the first approach for generating video by compositing background and foreground objects layer by layer. This layered generation enables flexible integration of multiple independent elements within a video, positioning each element on a distinct "layer" and thus facilitating coherent multi-object synthesis while enhancing control over the generation process. Extensive experiments demonstrate the superiority of LayerT2V in generating complex multi-object scenarios, showcasing 1.4x and 4.5x improvements in mIoU and AP50 metrics over state-of-the-art (SOTA) methods. Project page and code are available at https://kr-panghu.github.io/LayerT2V/ .


【4】One Small Step with Fingerprints, One Giant Leap for De Novo Molecule Generation from Mass Spectra
标题:指纹的一小步,从质谱从头生成分子的一大步
链接:https://arxiv.org/abs/2508.04180

作者:Nigel Neo, Lim Jing, Ngoui Yong Zhau Preston, Koh Xue Ting Serene, Bingquan Shen
摘要:从质谱从头(de novo)生成分子的常见方法采用两阶段流水线:(1)将质谱编码为分子指纹,然后(2)将这些指纹解码为分子结构。在我们的工作中,我们采用MIST作为编码器、MolForge作为解码器,并利用预训练来提升性能。值得注意的是,预训练MolForge被证明特别有效,使其能够作为一个强大的指纹到结构解码器。此外,不是直接传递指纹中每个比特的概率,而是用阶跃函数对概率进行阈值化,这有助于让解码器专注于子结构的存在与否;即使MIST预测的指纹在Tanimoto相似度上与真实指纹只是中等相似,也能提高准确分子结构的恢复率。这一编码器与解码器的组合使性能比此前最先进的方法提升了十倍,从质谱正确生成分子结构的top-1准确率为28%、top-10准确率为36%。我们将该流水线定位为未来从质谱进行分子从头解析研究的强基线。
摘要:A common approach to the \emph{de novo} molecular generation problem from mass spectra involves a two-stage pipeline: (1) encoding mass spectra into molecular fingerprints, followed by (2) decoding these fingerprints into molecular structures. In our work, we adopt \textsc{MIST}~\citep{MISTgoldmanAnnotatingMetaboliteMass2023} as the encoder and \textsc{MolForge}~\citep{ucakReconstructionLosslessMolecular2023} as the decoder, leveraging pretraining to enhance performance. Notably, pretraining \textsc{MolForge} proves especially effective, enabling it to serve as a robust fingerprint-to-structure decoder. Additionally, instead of passing the probability of each bit in the fingerprint, thresholding the probabilities as a step function helps focus the decoder on the presence of substructures, improving recovery of accurate molecular structures even when the fingerprints predicted by \textsc{MIST} only moderately resemble the ground truth in terms of Tanimoto similarity. This combination of encoder and decoder results in a tenfold improvement over previous state-of-the-art methods, generating top-1 28\% / top-10 36\% of molecular structures correctly from mass spectra. We position this pipeline as a strong baseline for future research in \emph{de novo} molecule elucidation from mass spectra.
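摘要中"用阶跃函数对比特概率阈值化"以及 Tanimoto 相似度的含义可用几行代码说明(示意性实现,阈值 0.5 与示例数据均为假设):

import numpy as np

def binarize_fingerprint(probs, threshold=0.5):
    # 阶跃函数:只保留"子结构是否存在",丢弃置信度细节
    return (np.asarray(probs) >= threshold).astype(np.int8)

def tanimoto(a, b):
    # 两个二值指纹的 Tanimoto 相似度 = 交集 / 并集
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 1.0

pred = binarize_fingerprint([0.91, 0.12, 0.55, 0.03])
truth = np.array([1, 0, 1, 1], dtype=np.int8)
print(pred, tanimoto(pred, truth))   # [1 0 1 0] 0.666...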


【5】Evaluating Selective Encryption Against Gradient Inversion Attacks
标题:抗梯度反转攻击的选择性加密算法
链接:https://arxiv.org/abs/2508.04155

作者:, Yuhang Yao, Shuaiqi Wang, Carlee Joe-Wong
摘要:梯度反转攻击对联邦学习等分布式训练框架构成了重大的隐私威胁,使恶意方能够在聚合过程中根据客户端和聚合服务器之间的梯度通信重建敏感的本地训练数据。虽然传统的基于加密的防御(如同态加密)提供了强大的隐私保证,且不损害模型效用,但它们往往会带来令人望而却步的计算开销。为了缓解这一问题,选择性加密已经成为一种有前景的方法,它根据数据在某种度量下的重要性仅加密梯度数据的一个子集。然而,对于在实践中如何指定这一度量,目前还缺乏系统性研究。本文系统地评估了采用不同重要性度量的选择性加密方法对最先进攻击的防御效果。我们的研究结果证明了选择性加密在减少计算开销的同时保持对攻击的弹性的可行性。我们提出了一个基于距离的重要性分析框架,为选择需要加密的关键梯度元素提供了理论基础。通过对不同模型架构(LeNet、CNN、BERT、GPT-2)和攻击类型的广泛实验,我们发现梯度幅值是防御基于优化的梯度反转攻击的普遍有效的度量。然而,我们也观察到,没有任何单一的选择性加密策略在所有攻击场景下都是最优的,因此我们为不同的模型架构和隐私要求提供了选择合适策略的指导方针。
摘要 :Gradient inversion attacks pose significant privacy threats to distributed training frameworks such as federated learning, enabling malicious parties to reconstruct sensitive local training data from gradient communications between clients and an aggregation server during the aggregation process. While traditional encryption-based defenses, such as homomorphic encryption, offer strong privacy guarantees without compromising model utility, they often incur prohibitive computational overheads. To mitigate this, selective encryption has emerged as a promising approach, encrypting only a subset of gradient data based on the data's significance under a certain metric. However, there have been few systematic studies on how to specify this metric in practice. This paper systematically evaluates selective encryption methods with different significance metrics against state-of-the-art attacks. Our findings demonstrate the feasibility of selective encryption in reducing computational overhead while maintaining resilience against attacks. We propose a distance-based significance analysis framework that provides theoretical foundations for selecting critical gradient elements for encryption. Through extensive experiments on different model architectures (LeNet, CNN, BERT, GPT-2) and attack types, we identify gradient magnitude as a generally effective metric for protection against optimization-based gradient inversions. However, we also observe that no single selective encryption strategy is universally optimal across all attack scenarios, and we provide guidelines for choosing appropriate strategies for different model architectures and privacy requirements.
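下面用几行代码示意"按梯度幅值选择待加密元素"这一度量的用法(假设性草图:加密比例 ratio 与接口均为笔者示例,实际的同态加密调用此处省略):

import numpy as np

def select_for_encryption(grad, ratio=0.1):
    # 按梯度幅值取前 ratio 比例的元素索引,仅对这些位置做(同态)加密
    flat = np.abs(grad).ravel()
    k = max(1, int(ratio * flat.size))
    return np.argpartition(flat, -k)[-k:]

g = np.random.randn(4, 8)
enc_idx = select_for_encryption(g, ratio=0.25)
print(f"加密 {enc_idx.size}/{g.size} 个梯度元素")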


【6】Adversarial Fair Multi-View Clustering
标题:对抗性公平多视图聚集
链接:https://arxiv.org/abs/2508.04071

作者:g, Jiahui Zhou, Lianyu Hu, Xinying Liu, Zengyou He, Zhikui Chen
摘要:聚类分析是数据挖掘和机器学习中的一个基本问题。近年来,多视图聚类由于能够整合来自多个视图的互补信息而受到越来越多的关注。然而,现有方法主要关注聚类性能,而公平性(以人为中心应用中的一个关键问题)在很大程度上被忽视。虽然最近的研究已经探索了多视图聚类中的组公平性,但大多数方法对聚类分配施加显式正则化,依赖于敏感属性和底层聚类结构之间的对齐。然而,这一假设在实践中往往不成立,并且会降低聚类性能。在本文中,我们提出了一个对抗公平多视图聚类(AFMVC)框架,将公平性学习集成到表示学习过程中。具体来说,我们的方法采用对抗训练从根本上去除学习到的特征中的敏感属性信息,确保得到的聚类分配不受其影响。此外,我们从理论上证明,通过KL散度将特定于视图的聚类分配与公平不变的一致性分布对齐,可以在不显著损害公平性的前提下保持聚类一致性,从而为我们的框架提供了额外的理论保证。在具有公平性约束的数据集上进行的大量实验表明,与现有的多视图聚类和公平感知聚类方法相比,AFMVC实现了更优的公平性和有竞争力的聚类性能。
摘要:Cluster analysis is a fundamental problem in data mining and machine learning. In recent years, multi-view clustering has attracted increasing attention due to its ability to integrate complementary information from multiple views. However, existing methods primarily focus on clustering performance, while fairness-a critical concern in human-centered applications-has been largely overlooked. Although recent studies have explored group fairness in multi-view clustering, most methods impose explicit regularization on cluster assignments, relying on the alignment between sensitive attributes and the underlying cluster structure. However, this assumption often fails in practice and can degrade clustering performance. In this paper, we propose an adversarial fair multi-view clustering (AFMVC) framework that integrates fairness learning into the representation learning process. Specifically, our method employs adversarial training to fundamentally remove sensitive attribute information from learned features, ensuring that the resulting cluster assignments are unaffected by it. Furthermore, we theoretically prove that aligning view-specific clustering assignments with a fairness-invariant consensus distribution via KL divergence preserves clustering consistency without significantly compromising fairness, thereby providing additional theoretical guarantees for our framework. Extensive experiments on data sets with fairness constraints demonstrate that AFMVC achieves superior fairness and competitive clustering performance compared to existing multi-view clustering and fairness-aware clustering methods.
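摘要中"经 KL 散度将各视图的聚类分配与一致性分布对齐"可以写成如下损失项(极简 PyTorch 草图,省略对抗去除敏感属性的部分;张量形状均为示例):

import torch
import torch.nn.functional as F

def view_consensus_kl(view_logits, consensus_probs):
    # 将每个视图的聚类分配分布(log-softmax)向一致性分布做 KL 对齐
    loss = 0.0
    for logits in view_logits:
        logp = F.log_softmax(logits, dim=1)
        loss = loss + F.kl_div(logp, consensus_probs, reduction="batchmean")
    return loss / len(view_logits)

views = [torch.randn(32, 5) for _ in range(3)]      # 3 个视图、5 个簇(示例形状)
consensus = F.softmax(torch.randn(32, 5), dim=1)    # 公平不变的一致性分布(此处随机示意)
print(view_consensus_kl(views, consensus))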


【7】FLAT: Latent-Driven Arbitrary-Target Backdoor Attacks in Federated Learning
标题:FLAT:联邦学习中潜在驱动的任意目标后门攻击
链接:https://arxiv.org/abs/2508.04064

作者:en, Khoa D Doan, Kok-Seng Wong
摘要:联邦学习(FL)容易受到后门攻击,但大多数现有方法受限于固定模式或单一目标触发器,使其不灵活且更容易被检测。我们提出了FLAT(FL任意目标攻击),一种新的后门攻击,利用潜在驱动的条件自编码器按需生成多样化、特定于目标的触发器。通过引入潜在编码,FLAT能够创建视觉上自适应且高度可变的触发器,允许攻击者在不重新训练的情况下选择任意目标,并规避传统的检测机制。我们的方法在一个框架内统一了攻击成功率、隐蔽性和多样性,为FL中的后门攻击引入了新的灵活性和复杂性。大量实验表明,FLAT实现了高攻击成功率,并对先进的FL防御保持稳健。这些结果突出表明,迫切需要新的防御策略来应对联邦环境中潜在驱动的多目标后门威胁。
摘要:Federated learning (FL) is vulnerable to backdoor attacks, yet most existing methods are limited by fixed-pattern or single-target triggers, making them inflexible and easier to detect. We propose FLAT (FL Arbitrary-Target Attack), a novel backdoor attack that leverages a latent-driven conditional autoencoder to generate diverse, target-specific triggers as needed. By introducing a latent code, FLAT enables the creation of visually adaptive and highly variable triggers, allowing attackers to select arbitrary targets without retraining and to evade conventional detection mechanisms. Our approach unifies attack success, stealth, and diversity within a single framework, introducing a new level of flexibility and sophistication to backdoor attacks in FL. Extensive experiments show that FLAT achieves high attack success and remains robust against advanced FL defenses. These results highlight the urgent need for new defense strategies to address latent-driven, multi-target backdoor threats in federated settings.


【8】Next Generation Equation-Free Multiscale Modelling of Crowd Dynamics via Machine Learning
标题:通过机器学习进行下一代无方程多尺度人群动态建模
链接:https://arxiv.org/abs/2508.03926

作者:rgas Alvarez, Dimitrios G. Patsatzis, Lucia Russo, Ioannis Kevrekidis, Constantinos Siettos
备注:29 pages (9 pages of Appendix), 9 figures (3 in Appendix)
摘要:在人群动力学中,连接微观与宏观建模尺度,对系统的数值分析、优化和控制构成了一个重要的开放挑战。我们提出了一种结合流形学习和机器学习的方法,从高保真基于代理的模拟中学习潜在空间中涌现人群动态的离散演化算子。所提出的框架建立在我们此前关于下一代无方程算法的工作之上,该算法用于学习高维多尺度系统的代理模型。我们的方法分为四个阶段,并显式地保持高维空间中重建动力学的质量守恒。第一步,我们使用KDE从离散的微观数据(行人的位置)得到连续的宏观场(密度)。第二步,基于流形学习,我们构建从宏观环境空间到潜在空间的映射,该潜在空间由相应密度分布的POD给出的少数坐标参数化。第三步,使用机器学习技术(特别是LSTM网络和MVAR)在潜在空间中学习降阶代理模型(ROM)。最后,我们以宏观密度分布的形式在高维空间中重建人群动力学。我们证明,经由SVD的POD密度分布重建保持质量守恒。通过这一"嵌入->在潜在空间中学习->提升回环境空间"的流水线,我们为密度演化构造了不可获得的宏观PDE的有效求解算子。在示例中,我们使用社会力模型在带障碍物的走廊中生成数据,并施加周期性边界条件。数值结果表明该方法具有高精度、鲁棒性和泛化能力,从而允许从基于代理的模拟出发对人群动力学进行快速、准确的建模与仿真。
摘要:Bridging the microscopic and the macroscopic modelling scales in crowd dynamics constitutes an important, open challenge for systematic numerical analysis, optimization, and control. We propose a combined manifold and machine learning approach to learn the discrete evolution operator for the emergent crowd dynamics in latent spaces from high-fidelity agent-based simulations. The proposed framework builds upon our previous works on next-generation Equation-free algorithms on learning surrogate models for high-dimensional and multiscale systems. Our approach is a four-stage one, explicitly conserving the mass of the reconstructed dynamics in the high-dimensional space. In the first step, we derive continuous macroscopic fields (densities) from discrete microscopic data (pedestrians' positions) using KDE. In the second step, based on manifold learning, we construct a map from the macroscopic ambient space into the latent space parametrized by a few coordinates based on POD of the corresponding density distribution. The third step involves learning reduced-order surrogate ROMs in the latent space using machine learning techniques, particularly LSTM networks and MVARs. Finally, we reconstruct the crowd dynamics in the high-dimensional space in terms of macroscopic density profiles. We demonstrate that the POD reconstruction of the density distribution via SVD conserves the mass. With this "embed->learn in latent space->lift back to the ambient space" pipeline, we create an effective solution operator of the unavailable macroscopic PDE for the density evolution. For our illustrations, we use the Social Force Model to generate data in a corridor with an obstacle, imposing periodic boundary conditions. The numerical results demonstrate high accuracy, robustness, and generalizability, thus allowing for fast and accurate modelling/simulation of crowd dynamics from agent-based simulations.
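其中 POD/SVD 一步可用如下玩具代码示意(假设性数据:网格点数、快照数与潜在维数 r 均为示例;打印项用于检查截断重建的质量守恒程度):

import numpy as np

rng = np.random.default_rng(1)
X = rng.random((200, 50))                  # 200 个空间网格点 x 50 个时间快照的密度场(玩具数据)
X /= X.sum(axis=0, keepdims=True)          # 归一化:每个快照总质量为 1

U, S, Vt = np.linalg.svd(X, full_matrices=False)
r = 5                                      # 潜在维数(假设值)
X_rec = U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]

print("相对重建误差:", np.linalg.norm(X - X_rec) / np.linalg.norm(X))
print("最大质量偏差:", np.abs(X_rec.sum(axis=0) - 1.0).max())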


【9】Point-Based Shape Representation Generation with a Correspondence-Preserving Diffusion Model
标题:使用对应保持扩散模型的基于点的形状表示生成
链接:https://arxiv.org/abs/2508.03925

作者: Yinzhu Jin, Ifrah Zawar, P. Thomas Fletcher
摘要:我们提出了一个旨在生成带对应关系的基于点的形状表示的扩散模型。传统的统计形状模型广泛考虑了点对应关系,但当前的深度学习方法没有考虑它们,而是专注于无序点云。现有的点云深度生成模型无法在生成的形状之间建立点对应关系。这项工作旨在构建一个能够生成逼真的基于点的形状表示、并保留训练数据中已有点对应关系的扩散模型。使用来自开放获取成像研究系列3(OASIS-3)的带对应关系的形状表示数据,我们证明了我们的对应保持模型能有效生成基于点的海马形状表示,与现有方法相比高度逼真。我们还通过下游任务(例如健康与AD受试者的条件生成,以及通过反事实生成预测疾病进展的形态变化)展示了我们生成模型的应用。
摘要:We propose a diffusion model designed to generate point-based shape representations with correspondences. Traditional statistical shape models have considered point correspondences extensively, but current deep learning methods do not take them into account, focusing on unordered point clouds instead. Current deep generative models for point clouds do not address generating shapes with point correspondences between generated shapes. This work aims to formulate a diffusion model that is capable of generating realistic point-based shape representations, which preserve point correspondences that are present in the training data. Using shape representation data with correspondences derived from Open Access Series of Imaging Studies 3 (OASIS-3), we demonstrate that our correspondence-preserving model effectively generates point-based hippocampal shape representations that are highly realistic compared to existing methods. We further demonstrate the applications of our generative model by downstream tasks, such as conditional generation of healthy and AD subjects and predicting morphological changes of disease progression by counterfactual generation.


半/弱/无/有监督|不确定性|主动学习(8篇)

【1】SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience
标题:SEAgent:自进化的计算机使用代理,具有从经验中自主学习
链接:https://arxiv.org/abs/2508.04700

作者: Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Tong Wu, Dahua Lin, Jiaqi Wang
备注:Code at this https URL
摘要:将大型视觉语言模型(LVLM)重新用作计算机使用代理(CUA)已经取得了重大突破,这主要由人工标注的数据驱动。然而,这些模型往往难以应对新颖和专业的软件,特别是在缺乏人工注释的场景中。为了应对这一挑战,我们提出了SEAgent,一个智能体自我进化框架,使CUA能够通过与不熟悉的软件的交互自主进化。具体来说,SEAgent使计算机使用代理通过体验式学习自主掌握新的软件环境:代理探索新软件,通过迭代试错学习,并逐步解决由易到难自动生成的任务。为了实现这一目标,我们设计了一个用于逐步轨迹评估的世界状态模型,以及一个能生成日益多样化且更具挑战性任务的课程生成器。智能体的策略通过经验学习进行更新,包括对失败动作的对抗性模仿和对成功动作的组相对策略优化(GRPO)。此外,我们引入了从专家到通才的训练策略,整合各专家代理的个体经验见解,促进开发能够持续自主进化的更强通才CUA。这一统一代理最终在各专家代理专长的软件上取得了超越专家代理集成的性能。我们在OS-World的五个新软件环境中验证了SEAgent的有效性。我们的方法使成功率显著提升23.2%,从11.3%提高到34.5%,超过了有竞争力的开源CUA(即UI-TARS)。
摘要:Repurposing large vision-language models (LVLMs) as computer use agents (CUAs) has led to substantial breakthroughs, primarily driven by human-labeled data. However, these models often struggle with novel and specialized software, particularly in scenarios lacking human annotations. To address this challenge, we propose SEAgent, an agentic self-evolving framework enabling CUAs to autonomously evolve through interactions with unfamiliar software. Specifically, SEAgent empowers computer-use agents to autonomously master novel software environments via experiential learning, where agents explore new software, learn through iterative trial-and-error, and progressively tackle auto-generated tasks organized from simple to complex. To achieve this goal, we design a World State Model for step-wise trajectory assessment, along with a Curriculum Generator that generates increasingly diverse and challenging tasks. The agent's policy is updated through experiential learning, comprised of adversarial imitation of failure actions and Group Relative Policy Optimization (GRPO) on successful ones. Furthermore, we introduce a specialist-to-generalist training strategy that integrates individual experiential insights from specialist agents, facilitating the development of a stronger generalist CUA capable of continuous autonomous evolution. This unified agent ultimately achieves performance surpassing ensembles of individual specialist agents on their specialized software. We validate the effectiveness of SEAgent across five novel software environments within OS-World. Our approach achieves a significant improvement of 23.2% in success rate, from 11.3% to 34.5%, over a competitive open-source CUA, i.e., UI-TARS.


【2】Neuromorphic Cybersecurity with Semi-supervised Lifelong Learning
标题:半监督终身学习的神经形态网络安全
链接:https://arxiv.org/abs/2508.04610

作者:Ahmed Mia, Malyaban Bal, Sen Lu, George M. Nishibuchi, Suhas Chelian, Srini Vasan, Abhronil Sengupta
摘要:受大脑分层处理和能量效率的启发,本文提出了一种用于终身网络入侵检测系统(NIDS)的脉冲神经网络(SNN)架构。该系统首先采用一个高效的静态SNN来识别潜在的入侵,然后激活一个自适应动态SNN负责对具体攻击类型进行分类。动态分类器模仿生物适应机制,利用受"按需生长"(Grow When Required, GWR)启发的结构可塑性和一种新的自适应脉冲时序依赖可塑性(Ad-STDP)学习规则。这些具备生物合理性的机制使网络能够在保留已有知识的同时增量地学习新威胁。在持续学习设置下对UNSW-NB15基准进行测试,该架构表现出稳健的适应能力,减轻了灾难性遗忘,并实现了85.3%的整体准确率。此外,使用英特尔Lava框架的模拟证实了高操作稀疏性,突出了在神经形态硬件上低功耗部署的潜力。
摘要:Inspired by the brain's hierarchical processing and energy efficiency, this paper presents a Spiking Neural Network (SNN) architecture for lifelong Network Intrusion Detection System (NIDS). The proposed system first employs an efficient static SNN to identify potential intrusions, which then activates an adaptive dynamic SNN responsible for classifying the specific attack type. Mimicking biological adaptation, the dynamic classifier utilizes Grow When Required (GWR)-inspired structural plasticity and a novel Adaptive Spike-Timing-Dependent Plasticity (Ad-STDP) learning rule. These bio-plausible mechanisms enable the network to learn new threats incrementally while preserving existing knowledge. Tested on the UNSW-NB15 benchmark in a continual learning setting, the architecture demonstrates robust adaptation, reduced catastrophic forgetting, and achieves $85.3$\% overall accuracy. Furthermore, simulations using the Intel Lava framework confirm high operational sparsity, highlighting the potential for low-power deployment on neuromorphic hardware.
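作为背景说明,下面给出经典 STDP 权重更新的极简草图(标准教科书形式,并非论文中 Ad-STDP 的精确规则;时间常数与学习率均为假设值):

import numpy as np

def stdp_update(w, t_pre, t_post, a_plus=0.01, a_minus=0.012, tau=20.0):
    # 突触前脉冲先于突触后脉冲(dt>0)则增强,反之抑制;指数时间窗
    dt = t_post - t_pre
    dw = a_plus * np.exp(-dt / tau) if dt > 0 else -a_minus * np.exp(dt / tau)
    return float(np.clip(w + dw, 0.0, 1.0))

print(stdp_update(0.5, t_pre=10.0, t_post=15.0))   # 前于后:权重增大
print(stdp_update(0.5, t_pre=15.0, t_post=10.0))   # 后于前:权重减小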


【3】Semi-Supervised Deep Domain Adaptation for Predicting Solar Power Across Different Locations
标题:半监督深域自适应预测不同地点的太阳能发电量
链接:https://arxiv.org/abs/2508.04165

作者 : Islam, A S M Jahid Hasan, Md Saydur Rahman, Md Saiful Islam Sajol
摘要:准确的太阳能发电预测对于正确估计不同地理位置的可再生能源资源至关重要。然而,地理和天气特征因位置而异,这引入了域偏移,而这正是开发位置无关预测模型的主要瓶颈。因此,一个在某一位置能很好预测太阳能发电的机器学习模型,在另一位置可能表现欠佳。此外,缺乏正确标注的数据以及存储问题使任务更具挑战性。为了解决不同气象区域天气条件变化导致的域偏移,本文提出了一种半监督深度域自适应框架,只需目标位置极少量的标注数据即可进行准确预测。我们的方法先在源位置的数据上训练深度卷积神经网络,再使用无源的师生模型配置使其适应目标位置。师生模型利用一致性损失和交叉熵损失进行半监督学习,确保在预测时无需任何源数据即可有效适应。在目标域中仅标注20%的数据时,相对于非自适应方法,我们的方法在预测精度上对加利福尼亚州、佛罗里达州和纽约作为目标域分别取得了高达11.36%、6.65%和4.92%的提升。
摘要:Accurate solar generation prediction is essential for proper estimation of renewable energy resources across diverse geographic locations. However, geographical and weather features vary from location to location, which introduces domain shift, a major bottleneck to developing location-agnostic prediction models. As a result, a machine-learning model which can perform well to predict solar power in one location may exhibit subpar performance in another location. Moreover, the lack of properly labeled data and storage issues make the task even more challenging. In order to address domain shift due to varying weather conditions across different meteorological regions, this paper presents a semi-supervised deep domain adaptation framework, allowing accurate predictions with minimal labeled data from the target location. Our approach involves training a deep convolutional neural network on a source location's data and adapting it to the target location using a source-free, teacher-student model configuration. The teacher-student model leverages consistency and cross-entropy loss for semi-supervised learning, ensuring effective adaptation without any source data requirement for prediction. With annotation of only $20\%$ of the data in the target domain, our approach exhibits an improvement of up to $11.36\%$, $6.65\%$, and $4.92\%$ for California, Florida, and New York as target domains, respectively, in terms of prediction accuracy with respect to the non-adaptive approach.
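师生式半监督目标的大致形态可用如下 PyTorch 草图说明(假设性实现:此处以分类式输出示意摘要中的交叉熵加一致性损失,权重 lam 为示例;教师参数通常由学生参数按 EMA 更新,此处省略):

import torch
import torch.nn.functional as F

def teacher_student_loss(student_logits, teacher_logits, labels, labeled_mask, lam=1.0):
    # 监督项:只对有标注样本计算交叉熵
    if labeled_mask.any():
        ce = F.cross_entropy(student_logits[labeled_mask], labels[labeled_mask])
    else:
        ce = student_logits.sum() * 0.0
    # 一致性项:学生输出向(不回传梯度的)教师输出靠拢
    teacher_probs = F.softmax(teacher_logits, dim=1).detach()
    consistency = F.kl_div(F.log_softmax(student_logits, dim=1),
                           teacher_probs, reduction="batchmean")
    return ce + lam * consistency

logits_s, logits_t = torch.randn(16, 3), torch.randn(16, 3)
labels = torch.randint(0, 3, (16,))
mask = torch.zeros(16, dtype=torch.bool); mask[:3] = True   # 仅约 20% 样本有标签
print(teacher_student_loss(logits_s, logits_t, labels, mask))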


【4】Active Learning and Transfer Learning for Anomaly Detection in Time-Series Data
标题:用于时间序列数据异常检测的主动学习和迁移学习
链接:https://arxiv.org/abs/2508.03921

作者:elleher, Matthew Nicholson, Rahul Agrahari, Clare Conran
摘要:本文考察了主动学习与迁移学习相结合在跨域时间序列数据异常检测中的有效性。我们的结果表明,聚类与主动学习之间存在交互作用,并且通常在使用单个聚类(换句话说,不应用聚类)时取得最佳性能。此外,我们发现使用主动学习向训练集添加新样本确实可以提高模型性能,但总体而言,提升速度比文献中报告的结果要慢。我们将这种差异归因于改进的实验设计:采样池和测试池使用了不同的数据样本。最后,我们在多个数据集上评估了迁移学习与主动学习相结合的上限性能,发现性能最初确实有所提升,但随着更多目标点被选入训练,最终开始趋平。这种性能趋平可能表明主动学习过程很好地对候选数据点进行了排序,将不太有用的点推向选择过程的末尾,而当这些不太有用的点最终被加入时,趋平就出现了。总之,我们的结果表明主动学习是有效的,但模型性能的提升随所选择并标注的点数大致呈线性趋平。
摘要:This paper examines the effectiveness of combining active learning and transfer learning for anomaly detection in cross-domain time-series data. Our results indicate that there is an interaction between clustering and active learning and in general the best performance is achieved using a single cluster (in other words when clustering is not applied). Also, we find that adding new samples to the training set using active learning does improve model performance but that in general, the rate of improvement is slower than the results reported in the literature suggest. We attribute this difference to an improved experimental design where distinct data samples are used for the sampling and testing pools. Finally, we assess the ceiling performance of transfer learning in combination with active learning across several datasets and find that performance does initially improve but eventually begins to tail off as more target points are selected for inclusion in training. This tail-off in performance may indicate that the active learning process is doing a good job of sequencing data points for selection, pushing the less useful points towards the end of the selection process and that this tail-off occurs when these less useful points are eventually added. Taken together our results indicate that active learning is effective but that the improvement in model performance follows a flat, linear function of the number of points selected and labelled.


【5】A Comprehensive Framework for Uncertainty Quantification of Voxel-wise Supervised Models in IVIM MRI
标题:IVIM MRI中体素监督模型不确定性量化的综合框架
链接:https://arxiv.org/abs/2508.04588

作者:sali, Alessandro Brusaferri, Giuseppe Baselli, Stefano Fumagalli, Edoardo Micotti, Gianluigi Forloni, Riaz Hussein, Giovanna Rizzo, Alfonso Mastropietro
摘要:由于逆问题的不适定性和对噪声的高敏感性(尤其是在灌注室中),从扩散加权MRI中准确估计体素内非相干运动(IVIM)参数仍然具有挑战性。在这项工作中,我们提出了一个基于混合密度网络(MDN)深度集成(DE)的概率深度学习框架,能够估计总预测不确定性并将其分解为任意(AU)和认识(EU)两个分量。该方法与非概率神经网络、贝叶斯拟合方法以及采用单一高斯参数化的概率网络进行了基准比较。监督训练在合成数据上进行,并在模拟数据集和两个体内数据集上进行评估。使用校准曲线、输出分布锐度和连续排序概率评分(CRPS)评估量化不确定性的可靠性。MDN为D和f参数产生了校准更好、更尖锐的预测分布,尽管在D*上观察到轻微的过度自信。稳健变异系数(RCV)表明,与高斯模型相比,MDN的体内D*估计更平滑。尽管训练数据涵盖了预期的生理范围,但体内EU升高表明与真实采集条件存在不匹配,凸显了纳入EU的重要性,而这正是DE所支持的。总体而言,我们提出了一个带不确定性量化的IVIM拟合综合框架,使得识别和解释不可靠的估计成为可能。通过适当的架构和仿真调整,所提出的方法也可用于拟合其他物理模型。
摘要:Accurate estimation of intravoxel incoherent motion (IVIM) parameters from diffusion-weighted MRI remains challenging due to the ill-posed nature of the inverse problem and high sensitivity to noise, particularly in the perfusion compartment. In this work, we propose a probabilistic deep learning framework based on Deep Ensembles (DE) of Mixture Density Networks (MDNs), enabling estimation of total predictive uncertainty and decomposition into aleatoric (AU) and epistemic (EU) components. The method was benchmarked against non-probabilistic neural networks, a Bayesian fitting approach, and a probabilistic network with a single Gaussian parametrization. Supervised training was performed on synthetic data, and evaluation was conducted on both simulated and two in vivo datasets. The reliability of the quantified uncertainties was assessed using calibration curves, output distribution sharpness, and the Continuous Ranked Probability Score (CRPS). MDNs produced more calibrated and sharper predictive distributions for the D and f parameters, although slight overconfidence was observed in D*. The Robust Coefficient of Variation (RCV) indicated smoother in vivo estimates for D* with MDNs compared to the Gaussian model. Despite the training data covering the expected physiological range, elevated EU in vivo suggests a mismatch with real acquisition conditions, highlighting the importance of incorporating EU, which was allowed by DE. Overall, we present a comprehensive framework for IVIM fitting with uncertainty quantification, which enables the identification and interpretation of unreliable estimates. The proposed approach can also be adopted for fitting other physical models through appropriate architectural and simulation adjustments.
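深度集成中把总预测不确定性分解为 AU 与 EU 的标准做法(总方差定律)可用几行代码说明(示意草图:对 MDN 而言,各成员的均值与方差应取其混合分布的均值与方差,此处用随机数代替):

import numpy as np

def decompose_uncertainty(means, variances):
    # 总方差定律:Total = E[Var](AU,任意)+ Var[E](EU,认识)
    aleatoric = variances.mean(axis=0)
    epistemic = means.var(axis=0)
    return aleatoric + epistemic, aleatoric, epistemic

means = 1.0 + 0.1 * np.random.randn(5, 100)   # 5 个集成成员对 100 个体素的预测均值
variances = 0.05 * np.random.rand(5, 100)     # 各成员自报的预测方差
total, au, eu = decompose_uncertainty(means, variances)
print(total.mean(), au.mean(), eu.mean())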


【6】Quantum circuit complexity and unsupervised machine learning of topological order
标题:量子电路复杂性和无监督机器学习拓扑顺序
链接:https://arxiv.org/abs/2508.04486

作者:he, Clemens Gneiting, Xiaoguang Wang, Franco Nori
备注:17 pages, with appendix; 4 figures. Code is available upon reasonable request, and will be open-sourced along with the publication. Comments are welcome
摘要:受Kolmogorov复杂性与无监督机器学习之间密切关系的启发,我们探索量子电路复杂性(量子计算和量子信息科学中的一个重要概念),以此为支点来理解和构建针对量子多体系统拓扑序的可解释且高效的无监督机器学习。为了架起从概念能力到实际应用的桥梁,我们提出了两个定理,分别将任意两个量子多体态之间量子路径规划的Nielsen量子电路复杂性与保真度变化和纠缠产生联系起来。利用这些联系,我们构造了更便于实际实现的基于保真度和基于纠缠的相似性度量(核)。使用这两个所提出的核,我们针对键交替XXZ自旋链的量子相、Kitaev环面码的基态和随机乘积态进行了无监督聚类的数值实验,展示了其优越的性能。我们还讨论了与经典影子层析和影子核学习的关系,后者可以从我们的方法中自然推导和理解。我们的结果在量子电路计算、量子复杂性的关键概念与工具和拓扑量子序的机器学习之间建立了联系。
摘要:Inspired by the close relationship between Kolmogorov complexity and unsupervised machine learning, we explore quantum circuit complexity, an important concept in quantum computation and quantum information science, as a pivot to understand and to build interpretable and efficient unsupervised machine learning for topological order in quantum many-body systems. To span a bridge from conceptual power to practical applicability, we present two theorems that connect Nielsen's quantum circuit complexity for the quantum path planning between two arbitrary quantum many-body states with fidelity change and entanglement generation, respectively. Leveraging these connections, fidelity-based and entanglement-based similarity measures or kernels, which are more practical for implementation, are formulated. Using the two proposed kernels, numerical experiments targeting the unsupervised clustering of quantum phases of the bond-alternating XXZ spin chain, the ground state of Kitaev's toric code and random product states, are conducted, demonstrating their superior performance. Relations with classical shadow tomography and shadow kernel learning are also discussed, where the latter can be naturally derived and understood from our approach. Our results establish connections between key concepts and tools of quantum circuit computation, quantum complexity, and machine learning of topological quantum order.


【7】Benchmarking Uncertainty and its Disentanglement in multi-label Chest X-Ray Classification
标题:多标签胸部X线分类中的基准不确定性及其消除
链接:https://arxiv.org/abs/2508.04457

作者:r, Wojciech Samek, Jackie Ma
摘要:可靠的不确定性量化对于可靠的决策和在医学成像中部署AI模型至关重要。虽然先前的工作已经探索了神经网络在合成或定义良好的数据设置(如自然图像分类)中使用信息理论方法量化预测,认知和任意不确定性的能力,但其对现实生活中医疗诊断任务的适用性仍然未得到充分探索。在这项研究中,我们使用MIMIC-CXR-JPG数据集为多标签胸部X射线分类提供了一个广泛的不确定性量化基准。我们评估了13种用于卷积(ResNet)和基于变换器(Vision Transformer)架构的不确定性量化方法,这些方法适用于各种任务。此外,我们将证据深度学习,HetClass NN和深度确定性不确定性扩展到多标签设置。我们的分析提供了深入了解不确定性估计的有效性和能力,解开认识和任意的不确定性,揭示方法和架构的具体优势和局限性。
摘要:Reliable uncertainty quantification is crucial for trustworthy decision-making and the deployment of AI models in medical imaging. While prior work has explored the ability of neural networks to quantify predictive, epistemic, and aleatoric uncertainties using an information-theoretical approach in synthetic or well defined data settings like natural image classification, its applicability to real life medical diagnosis tasks remains underexplored. In this study, we provide an extensive uncertainty quantification benchmark for multi-label chest X-ray classification using the MIMIC-CXR-JPG dataset. We evaluate 13 uncertainty quantification methods for convolutional (ResNet) and transformer-based (Vision Transformer) architectures across a wide range of tasks. Additionally, we extend Evidential Deep Learning, HetClass NNs, and Deep Deterministic Uncertainty to the multi-label setting. Our analysis provides insights into uncertainty estimation effectiveness and the ability to disentangle epistemic and aleatoric uncertainties, revealing method- and architecture-specific strengths and limitations.


【8】Reliable Programmatic Weak Supervision with Confidence Intervals for Label Probabilities
标题:具有标签概率置信区间的可靠程序弱监督
链接:https://arxiv.org/abs/2508.03896

作者:Álvarez, Santiago Mazuelas, Steven An, Sanjoy Dasgupta
摘要:数据集的准确标记通常既昂贵又耗时。给定一个未标记的数据集,程序化弱监督通过利用多个弱标记函数(LF)来获得标签的概率预测,这些函数提供对标签的粗略猜测。弱LF通常提供具有各种类型和未知相互依赖性的猜测,这可能导致不可靠的预测。此外,现有的技术程序弱监督不能提供评估的标签的概率预测的可靠性。本文提出了一种程序化弱监督的方法,可以为标签概率提供置信区间,并获得更可靠的预测。特别是,所提出的方法使用不确定性分布集,其封装了具有不受限制的行为和类型的LF所提供的信息。在多个基准数据集上的实验表明,所提出的方法比现有的最先进的改进和实用的置信区间。
摘要:The accurate labeling of datasets is often both costly and time-consuming. Given an unlabeled dataset, programmatic weak supervision obtains probabilistic predictions for the labels by leveraging multiple weak labeling functions (LFs) that provide rough guesses for labels. Weak LFs commonly provide guesses with assorted types and unknown interdependences that can result in unreliable predictions. Furthermore, existing techniques for programmatic weak supervision cannot provide assessments for the reliability of the probabilistic predictions for labels. This paper presents a methodology for programmatic weak supervision that can provide confidence intervals for label probabilities and obtain more reliable predictions. In particular, the methods proposed use uncertainty sets of distributions that encapsulate the information provided by LFs with unrestricted behavior and typology. Experiments on multiple benchmark datasets show the improvement of the presented methods over the state-of-the-art and the practicality of the confidence intervals presented.
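需要强调的是,论文使用的是基于分布不确定性集的方法;下面仅以标准的 Wilson 区间示意"为标签概率给出置信区间"这一输出形式(非论文方法,投票数据为虚构示例):

import numpy as np

def wilson_interval(k, n, z=1.96):
    # 二项比例的 Wilson 置信区间(95% 对应 z≈1.96)
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

votes = np.array([1, 1, 0, 1, 1, 0, 1])     # 7 个弱标注函数对同一样本的猜测(虚构)
lo, hi = wilson_interval(votes.sum(), votes.size)
print(f"标签为 1 的概率区间约为 [{lo:.2f}, {hi:.2f}]")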


迁移|Zero/Few/One-Shot|自适应(8篇)

【1】A Scalable Pretraining Framework for Link Prediction with Efficient Adaptation
标题:具有高效自适应的可扩展链接预测预训练框架
链接:https://arxiv.org/abs/2508.04645

作者:Zhigang Hua, Harry Shomer, Yan Xie, Jingzhe Liu, Bo Long, Hui Liu
备注:Accepted by KDD 2025 Research Track
摘要:链接预测(LP)是图机器学习中的一项关键任务。虽然图神经网络(GNN)最近显著提升了LP性能,但现有方法面临关键挑战,包括稀疏连接带来的有限监督、对初始化的敏感性以及分布偏移下的泛化能力差。我们探索以预训练来应对这些挑战。与节点分类不同,LP本质上是一个成对任务,需要整合节点级和边级信息。在这项工作中,我们对这些不同模块的可迁移性进行了首次系统研究,并提出了一种后期融合策略,有效组合它们的输出以提升性能。为了处理预训练数据的多样性并避免负迁移,我们引入了一个混合专家(MoE)框架,在不同的专家中捕获不同的模式,从而便于将预训练模型无缝应用于多样的下游数据集。为了快速适应,我们开发了一种参数高效的调优策略,使预训练模型能以极小的计算开销适应未见过的数据集。在两个领域的16个数据集上的实验证明了我们方法的有效性:在低资源链接预测上取得了最先进的性能,同时与端到端训练方法相比获得了有竞争力的结果,且计算开销降低了10,000倍以上。
摘要 :Link Prediction (LP) is a critical task in graph machine learning. While Graph Neural Networks (GNNs) have significantly advanced LP performance recently, existing methods face key challenges including limited supervision from sparse connectivity, sensitivity to initialization, and poor generalization under distribution shifts. We explore pretraining as a solution to address these challenges. Unlike node classification, LP is inherently a pairwise task, which requires the integration of both node- and edge-level information. In this work, we present the first systematic study on the transferability of these distinct modules and propose a late fusion strategy to effectively combine their outputs for improved performance. To handle the diversity of pretraining data and avoid negative transfer, we introduce a Mixture-of-Experts (MoE) framework that captures distinct patterns in separate experts, facilitating seamless application of the pretrained model on diverse downstream datasets. For fast adaptation, we develop a parameter-efficient tuning strategy that allows the pretrained model to adapt to unseen datasets with minimal computational overhead. Experiments on 16 datasets across two domains demonstrate the effectiveness of our approach, achieving state-of-the-art performance on low-resource link prediction while obtaining competitive results compared to end-to-end trained methods, with over 10,000x lower computational overhead.


【2】T3Time: Tri-Modal Time Series Forecasting via Adaptive Multi-Head Alignment and Residual Fusion
标题:T3 Time:通过自适应多头对齐和剩余融合进行三模时间序列预测
链接:https://arxiv.org/abs/2508.04251

作者:af Chowdhury, Rabeya Akter, Safaeid Hossain Arib
摘要:多变量时间序列预测(MTSF)试图对变量之间的时间动态进行建模,以预测未来趋势。基于Transformer的模型和大型语言模型(LLM)由于能够捕获长程依赖和模式而显示出潜力。然而,目前的方法往往依赖于僵硬的归纳偏置,忽略变量间的相互作用,或采用静态融合策略,限制了跨预测范围的适应性。这些局限在捕捉时间序列数据中细微的、特定于预测范围的关系方面造成了瓶颈。为了解决这一问题,我们提出了T3Time,一个由时间、频谱和提示分支组成的新型三模态框架:其中专门的频率编码分支捕获周期性结构,并配有一个门控机制,根据预测范围学习时间特征与频谱特征之间的优先级。我们还提出了一种机制,基于特征动态加权各对齐头的重要性,从而自适应地聚合多个跨模态对齐头。在基准数据集上的大量实验表明,我们的模型始终优于最先进的基线,MSE平均降低3.28%,MAE平均降低2.29%。此外,它在Few-Shot学习设置中表现出很强的泛化能力:使用5%的训练数据时,MSE和MAE分别降低4.13%和1.91%;使用10%的数据时,平均降低3.62%和1.98%。代码:https://github.com/monaf-chowdhury/T3Time/
摘要:Multivariate time series forecasting (MTSF) seeks to model temporal dynamics among variables to predict future trends. Transformer-based models and large language models (LLMs) have shown promise due to their ability to capture long-range dependencies and patterns. However, current methods often rely on rigid inductive biases, ignore intervariable interactions, or apply static fusion strategies that limit adaptability across forecast horizons. These limitations create bottlenecks in capturing nuanced, horizon-specific relationships in time-series data. To solve this problem, we propose T3Time, a novel trimodal framework consisting of time, spectral, and prompt branches, where the dedicated frequency encoding branch captures the periodic structures along with a gating mechanism that learns prioritization between temporal and spectral features based on the prediction horizon. We also proposed a mechanism which adaptively aggregates multiple cross-modal alignment heads by dynamically weighting the importance of each head based on the features. Extensive experiments on benchmark datasets demonstrate that our model consistently outperforms state-of-the-art baselines, achieving an average reduction of 3.28% in MSE and 2.29% in MAE. Furthermore, it shows strong generalization in few-shot learning settings: with 5% training data, we see a reduction in MSE and MAE by 4.13% and 1.91%, respectively; and with 10% data, by 3.62% and 1.98% on average. Code - https://github.com/monaf-chowdhury/T3Time/
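摘要中"以预测范围为条件、在时间与频谱特征间学习优先级的门控"可以写成如下极简 PyTorch 模块(假设性实现:HorizonGate 的名称、维度与融合方式均为笔者示例,非原论文代码):

import torch
import torch.nn as nn

class HorizonGate(nn.Module):
    def __init__(self, d_model, n_horizons):
        super().__init__()
        self.h_embed = nn.Embedding(n_horizons, d_model)   # 预测范围的嵌入
        self.gate = nn.Sequential(nn.Linear(3 * d_model, d_model), nn.Sigmoid())

    def forward(self, temporal, spectral, horizon_idx):
        h = self.h_embed(horizon_idx).expand_as(temporal)
        g = self.gate(torch.cat([temporal, spectral, h], dim=-1))
        return g * temporal + (1 - g) * spectral           # 按预测范围动态加权融合

fuse = HorizonGate(d_model=64, n_horizons=4)
t, s = torch.randn(8, 64), torch.randn(8, 64)
out = fuse(t, s, torch.tensor([2]))
print(out.shape)   # torch.Size([8, 64])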


【3】RLGS: Reinforcement Learning-Based Adaptive Hyperparameter Tuning for Gaussian Splatting
标题:RLGS:基于强化学习的高斯溅射自适应超参数调整
链接:https://arxiv.org/abs/2508.04078

作者:Huangying Zhan, Changyang Li, Qingan Yan, Yi Xu
备注:14 pages, 9 figures
摘要:3D高斯溅射(3DGS)中的超参数调整是一个劳动密集型和专家驱动的过程,通常会导致不一致的重建和次优结果。我们提出了RLGS,即插即用强化学习框架,通过轻量级策略模块在3DGS中进行自适应超参数调整,动态调整关键超参数,如学习速率和致密化阈值。该框架与模型无关,可以无缝集成到现有的3DGS管道中,而无需修改架构。我们展示了它在多个最先进的3DGS变体(包括Taming-3DGS和3DGS-MCMC)中的泛化能力,并验证了它在不同数据集上的鲁棒性。RLGS始终如一地增强渲染质量。例如,在固定的高斯预算下,它将Tanks and Temple(TNT)数据集上的Taming-3DGS提高了0.7dB PSNR,并且即使在基线性能饱和时也会继续产生增益。我们的研究结果表明,RLGS为3DGS训练中的超参数自动调整提供了一种有效和通用的解决方案,弥合了将强化学习应用于3DGS的差距。
摘要:Hyperparameter tuning in 3D Gaussian Splatting (3DGS) is a labor-intensive and expert-driven process, often resulting in inconsistent reconstructions and suboptimal results. We propose RLGS, a plug-and-play reinforcement learning framework for adaptive hyperparameter tuning in 3DGS through lightweight policy modules, dynamically adjusting critical hyperparameters such as learning rates and densification thresholds. The framework is model-agnostic and seamlessly integrates into existing 3DGS pipelines without architectural modifications. We demonstrate its generalization ability across multiple state-of-the-art 3DGS variants, including Taming-3DGS and 3DGS-MCMC, and validate its robustness across diverse datasets. RLGS consistently enhances rendering quality. For example, it improves Taming-3DGS by 0.7dB PSNR on the Tanks and Temple (TNT) dataset, under a fixed Gaussian budget, and continues to yield gains even when baseline performance saturates. Our results suggest that RLGS provides an effective and general solution for automating hyperparameter tuning in 3DGS training, bridging a gap in applying reinforcement learning to 3DGS.


【4】Fine-tuning for Better Few Shot Prompting: An Empirical Comparison for Short Answer Grading
标题:微调以实现更好的少数镜头预算:简短答案评分的经验比较
链接:https://arxiv.org/abs/2508.04063

作者:h, Siddarth Mamidanna, Benjamin Nye, Mark Core, Daniel Auerbach
备注:Proceedings of the Second Workshop on Automated Evaluation of Learning and Assessment Content co-located with 26th International Conference on Artificial Intelligence in Education (AIED 2025)
摘要:改进自动简答评分的研究最近聚焦于大型语言模型(LLM),通过提示工程和零样本或Few-Shot提示来取得最佳结果。这与微调方法形成对比,后者历来需要大多数用户无法获得的大规模计算集群。新的封闭模型方法(如OpenAI的微调服务)声称只需100个示例即可获得效果,而使用开放权重的方法(如量化低秩自适应,QLORA)可用于在消费级GPU上微调模型。我们评估了这两种微调方法,度量它们与Few-Shot提示在结构化(JSON)输出的自动简答评分(ASAG)任务上的相互作用。我们的结果表明,对Llama开放权重模型而言,用少量数据微调的实用性有限;但对于OpenAI的封闭模型,微调方法可以优于Few-Shot提示的指令微调LLM基线。虽然我们的评估集有限,但我们发现一些证据表明,微调所观察到的收益可能受领域主题的影响。最后,我们观察到,通过用大量廉价生成的合成训练数据扩充初始训练示例,LLama 3.1 8B-Instruct开放权重模型获得了显著提升。
摘要:Research to improve Automated Short Answer Grading has recently focused on Large Language Models (LLMs) with prompt engineering and no- or few-shot prompting to achieve best results. This is in contrast to the fine-tuning approach, which has historically required large-scale compute clusters inaccessible to most users. New closed-model approaches such as OpenAI's fine-tuning service promise results with as few as 100 examples, while methods using open weights such as quantized low-rank adaptive (QLORA) can be used to fine-tune models on consumer GPUs. We evaluate both of these fine-tuning methods, measuring their interaction with few-shot prompting for automated short answer grading (ASAG) with structured (JSON) outputs. Our results show that finetuning with small amounts of data has limited utility for Llama open-weight models, but that fine-tuning methods can outperform few-shot baseline instruction-tuned LLMs for OpenAI's closed models. While our evaluation set is limited, we find some evidence that the observed benefits of finetuning may be impacted by the domain subject matter. Lastly, we observed dramatic improvement with the LLama 3.1 8B-Instruct open-weight model by seeding the initial training examples with a significant amount of cheaply generated synthetic training data.


【5】Dynamic User-controllable Privacy-preserving Few-shot Sensing Framework
标题:动态用户可控的隐私保护Few-Shot传感框架
链接:https://arxiv.org/abs/2508.03989

作者 :atan Chathoth, Shuhao Yu, Stephen Lee
摘要:用户可控的隐私在现代感测系统中是重要的,因为隐私偏好可以因人而异并且可以随时间而演变。这在配备惯性测量单元(IMU)传感器的设备中尤其重要,例如智能手机和可穿戴设备,这些设备不断收集丰富的时间序列数据,这些数据可能会无意中暴露敏感的用户行为。虽然先前的工作已经提出了传感器数据的隐私保护方法,但大多数依赖于静态的,预定义的隐私标签或需要大量的私人训练数据,限制了它们的适应性和用户代理。在这项工作中,我们介绍了PrivCLIP,一个动态的,用户可控的,Few-Shot隐私保护传感框架。PrivCLIP允许用户通过将活动分类为敏感(黑名单)、非敏感(白名单)或中性(灰名单)来指定和修改其隐私偏好。利用多模态对比学习方法,PrivCLIP将IMU传感器数据与共享嵌入空间中的自然语言活动描述对齐,从而实现敏感活动的Few-Shot检测。当识别出隐私敏感活动时,系统使用语言引导的活动消毒器和运动生成模块(IMU-GPT)将原始数据转换为语义上类似于非敏感活动的隐私兼容版本。我们在多个人类活动识别数据集上评估了PrivCLIP,并证明它在隐私保护和数据实用性方面都显着优于基线方法。
摘要:User-controllable privacy is important in modern sensing systems, as privacy preferences can vary significantly from person to person and may evolve over time. This is especially relevant in devices equipped with Inertial Measurement Unit (IMU) sensors, such as smartphones and wearables, which continuously collect rich time-series data that can inadvertently expose sensitive user behaviors. While prior work has proposed privacy-preserving methods for sensor data, most rely on static, predefined privacy labels or require large quantities of private training data, limiting their adaptability and user agency. In this work, we introduce PrivCLIP, a dynamic, user-controllable, few-shot privacy-preserving sensing framework. PrivCLIP allows users to specify and modify their privacy preferences by categorizing activities as sensitive (black-listed), non-sensitive (white-listed), or neutral (gray-listed). Leveraging a multimodal contrastive learning approach, PrivCLIP aligns IMU sensor data with natural language activity descriptions in a shared embedding space, enabling few-shot detection of sensitive activities. When a privacy-sensitive activity is identified, the system uses a language-guided activity sanitizer and a motion generation module (IMU-GPT) to transform the original data into a privacy-compliant version that semantically resembles a non-sensitive activity. We evaluate PrivCLIP on multiple human activity recognition datasets and demonstrate that it significantly outperforms baseline methods in terms of both privacy protection and data utility.


【6】Data-Driven Spectrum Demand Prediction: A Spatio-Temporal Framework with Transfer Learning
标题:数据驱动的频谱需求预测:具有迁移学习的时空框架
链接:https://arxiv.org/abs/2508.03863

作者:jzadeh, Hongzhao Zheng, Sarah Dumoulin, Trevor Ha, Halim Yanikomeroglu, Amir Ghasemi
备注:Accepted to be presented at IEEE PIMRC 2025
摘要:准确的频谱需求预测对于明智的频谱分配、有效的监管规划以及促进现代无线通信网络的可持续发展至关重要。它支持政府的努力,特别是国际电信联盟(ITU)领导的努力,以建立公平的频谱分配政策,改善拍卖机制,并满足先进的5G,即将推出的6G和物联网(IoT)等新兴技术的要求。本文提出了一种有效的时空预测框架,利用众包用户侧关键性能指标(KPI)和监管数据集来建模和预测频谱需求。所提出的方法通过结合先进的特征工程、综合相关分析和迁移学习技术,实现了卓越的预测精度和跨区域的泛化能力。与通常受到任意输入和不切实际的假设约束的传统ITU模型不同,这种方法利用粒度,数据驱动的见解来解释频谱利用的空间和时间变化。以国际电联的估计为基准进行的比较评估强调了我们的框架提供更现实和可操作的预测的能力。实验结果验证了我们的方法的有效性,突出了其作为决策者和监管机构加强频谱管理和规划的强大方法的潜力。
摘要:Accurate spectrum demand prediction is crucial for informed spectrum allocation, effective regulatory planning, and fostering sustainable growth in modern wireless communication networks. It supports governmental efforts, particularly those led by the international telecommunication union (ITU), to establish fair spectrum allocation policies, improve auction mechanisms, and meet the requirements of emerging technologies such as advanced 5G, forthcoming 6G, and the internet of things (IoT). This paper presents an effective spatio-temporal prediction framework that leverages crowdsourced user-side key performance indicators (KPIs) and regulatory datasets to model and forecast spectrum demand. The proposed methodology achieves superior prediction accuracy and cross-regional generalizability by incorporating advanced feature engineering, comprehensive correlation analysis, and transfer learning techniques. Unlike traditional ITU models, which are often constrained by arbitrary inputs and unrealistic assumptions, this approach exploits granular, data-driven insights to account for spatial and temporal variations in spectrum utilization. Comparative evaluations against ITU estimates, as the benchmark, underscore our framework's capability to deliver more realistic and actionable predictions. Experimental results validate the efficacy of our methodology, highlighting its potential as a robust approach for policymakers and regulatory bodies to enhance spectrum management and planning.


【7】Bernoulli-LoRA: A Theoretical Framework for Randomized Low-Rank Adaptation
标题:Bernoulli-LoRA:随机低等级适应的理论框架
链接:https://arxiv.org/abs/2508.03820

作者:lov, Abdurakhmon Sadiev, Yury Demidovich, Fawaz S Al-Qahtani, Peter Richtárik
备注:64 Pages, 9 Algorithms, 22 Theorems, 10 Lemmas, 2 Figures, 3 Tables
摘要:参数高效微调(PEFT)已成为使大型基础模型适应特定任务的关键方法,尤其是在模型规模持续呈指数级增长的情况下。在PEFT方法中,低秩自适应(LoRA)(arXiv:2106.09685)因其有效性和简单性而脱颖而出,它将自适应表示为两个低秩矩阵的乘积。虽然大量实证研究证明了LoRA的实用价值,但对这类方法的理论理解仍然有限。最近关于RAC-LoRA的工作(arXiv:2410.08305)朝严格分析迈出了第一步。在这项工作中,我们引入了Bernoulli-LoRA,一个统一并扩展现有LoRA方法的新理论框架。我们的方法引入了一个概率伯努利机制来选择更新哪个矩阵。这种方法涵盖并推广了多种现有的更新策略,同时保持理论上的可处理性。在非凸优化文献的标准假设下,我们分析了该框架的若干变体:Bernoulli-LoRA-GD、Bernoulli-LoRA-SGD、Bernoulli-LoRA-PAGE、Bernoulli-LoRA-MVR、Bernoulli-LoRA-QGD、Bernoulli-LoRA-MARINA和Bernoulli-LoRA-EF21,并为每个变体建立了收敛保证。此外,我们将分析扩展到凸非光滑函数,给出了常数步长和自适应(Polyak型)步长下的收敛速率。通过在各种任务上的大量实验,我们验证了理论结果,并证明了该方法的实际有效性。这项工作朝着开发既有理论依据又切实有效的PEFT方法迈出了一步。
摘要:Parameter-efficient fine-tuning (PEFT) has emerged as a crucial approach for adapting large foundational models to specific tasks, particularly as model sizes continue to grow exponentially. Among PEFT methods, Low-Rank Adaptation (LoRA) (arXiv:2106.09685) stands out for its effectiveness and simplicity, expressing adaptations as a product of two low-rank matrices. While extensive empirical studies demonstrate LoRA's practical utility, theoretical understanding of such methods remains limited. Recent work on RAC-LoRA (arXiv:2410.08305) took initial steps toward rigorous analysis. In this work, we introduce Bernoulli-LoRA, a novel theoretical framework that unifies and extends existing LoRA approaches. Our method introduces a probabilistic Bernoulli mechanism for selecting which matrix to update. This approach encompasses and generalizes various existing update strategies while maintaining theoretical tractability. Under standard assumptions from non-convex optimization literature, we analyze several variants of our framework: Bernoulli-LoRA-GD, Bernoulli-LoRA-SGD, Bernoulli-LoRA-PAGE, Bernoulli-LoRA-MVR, Bernoulli-LoRA-QGD, Bernoulli-LoRA-MARINA, and Bernoulli-LoRA-EF21, establishing convergence guarantees for each variant. Additionally, we extend our analysis to convex non-smooth functions, providing convergence rates for both constant and adaptive (Polyak-type) stepsizes. Through extensive experiments on various tasks, we validate our theoretical findings and demonstrate the practical efficacy of our approach. This work is a step toward developing theoretically grounded yet practically effective PEFT methods.
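摘要所述的伯努利机制(每步随机选择只更新低秩因子 A 或 B 之一)可用如下玩具梯度下降草图说明(基于摘要的假设性实现,目标函数与步长均为示例):

import numpy as np

def bernoulli_lora_gd_step(W0, A, B, grad_fn, lr=0.01, p=0.5, rng=None):
    # 有效权重 W = W0 + B @ A;每步以概率 p 只更新 A,否则只更新 B
    rng = rng or np.random.default_rng()
    G = grad_fn(W0 + B @ A)                # 损失对 W 的梯度
    if rng.random() < p:
        A = A - lr * B.T @ G               # dL/dA = B^T G
    else:
        B = B - lr * G @ A.T               # dL/dB = G A^T
    return A, B

# 玩具目标:min ||W0 + B A - W*||_F^2
d, r = 8, 2
rng = np.random.default_rng(0)
W0, W_star = rng.standard_normal((d, d)), rng.standard_normal((d, d))
A, B = 0.1 * rng.standard_normal((r, d)), 0.1 * rng.standard_normal((d, r))
for _ in range(200):
    A, B = bernoulli_lora_gd_step(W0, A, B, lambda W: 2 * (W - W_star), rng=rng)
print(np.linalg.norm(W0 + B @ A - W_star))  # 残差应下降(受秩 r 限制不为 0)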


【8】Deep Neural Network-Driven Adaptive Filtering
标题:深度神经网络驱动的自适应过滤
链接:https://arxiv.org/abs/2508.04258

作者:ng, Gang Wang, Ying-Chang Liang
摘要:本文提出了一种深度神经网络(DNN)驱动的框架,以解决自适应滤波(AF)中长期存在的泛化难题。与强调显式成本函数设计的传统AF框架不同,所提出的框架将范式转向直接获取梯度。DNN作为一种通用的非线性算子,在结构上嵌入到AF系统的核心架构中,在滤波残差与学习梯度之间建立直接映射。该方法采用最大似然作为隐式成本函数,使所得算法本质上是数据驱动的,从而具备出色的泛化能力;这一点通过覆盖各类非高斯场景的大量数值实验得到了验证。文中还详细给出了相应的均值和均方稳定性分析。
摘要:This paper proposes a deep neural network (DNN)-driven framework to address the longstanding generalization challenge in adaptive filtering (AF). In contrast to traditional AF frameworks that emphasize explicit cost function design, the proposed framework shifts the paradigm toward direct gradient acquisition. The DNN, functioning as a universal nonlinear operator, is structurally embedded into the core architecture of the AF system, establishing a direct mapping between filtering residuals and learning gradients. The maximum likelihood is adopted as the implicit cost function, rendering the derived algorithm inherently data-driven and thus endowed with exemplary generalization capability, which is validated by extensive numerical experiments across a spectrum of non-Gaussian scenarios. Corresponding mean value and mean square stability analyses are also conducted in detail.
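"残差到学习梯度的直接映射"这一思路可用如下极简 PyTorch 草图示意(假设性实现:网络结构、步长与更新形式均为笔者示例,且网络未经训练,仅展示数据流):

import torch
import torch.nn as nn

class GradientNet(nn.Module):
    # 残差 -> 学习梯度的非线性映射(通用非线性算子的占位)
    def __init__(self, hidden=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, residual):
        return self.net(residual)

gnet = GradientNet()
w = torch.zeros(4)                          # 自适应滤波器抽头
x = torch.randn(4)                          # 当前输入向量
d = torch.tensor([1.0])                     # 期望输出
e = d - w @ x                               # 滤波残差
# 类 LMS 更新:用网络输出代替显式成本函数的梯度;步长 0.1 为假设值
w = (w + 0.1 * gnet(e.view(1, 1)).squeeze() * x).detach()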


强化学习(4篇)

【1】Reinforcement Learning for Target Zone Blood Glucose Control
标题:目标区血糖控制的强化学习
链接:https://arxiv.org/abs/2508.03875

作者:Mguni, Jing Dong, Wanrong Yang, Ziquan Liu, Muhammad Salman Haleem, Baoxiang Wang
摘要:在临床安全目标区域内管理生理变量是医疗保健中的核心挑战,特别是对于1型糖尿病(T1DM)等慢性疾病。强化学习(RL)为个性化治疗带来了希望,但难以处理干预措施的延迟和异质性效应。我们提出了一个新的强化学习框架来研究和支持T1DM技术(例如自动胰岛素输送)中的决策。我们的方法通过统一两种控制模式来刻画治疗的复杂时间动态:用于离散、快速起效干预(如胰岛素推注)的脉冲控制,以及用于长效治疗和方案转换的切换控制。我们方法的核心是一个增加了生理状态特征的受约束马尔可夫决策过程,使得在临床和资源约束下进行安全的策略学习成为可能。该框架结合了生物学上现实的因素(包括胰岛素衰减),从而得到更好地反映现实治疗行为的策略。虽然本工作并非旨在临床部署,但它为未来医疗保健中安全且具备时间感知能力的RL奠定了基础。我们提供了收敛性的理论保证,并在一个风格化的T1DM控制任务中展示了经验上的改进,将血糖水平违规率从22.4%(最先进水平)降低到10.8%。
摘要:Managing physiological variables within clinically safe target zones is a central challenge in healthcare, particularly for chronic conditions such as Type 1 Diabetes Mellitus (T1DM). Reinforcement learning (RL) offers promise for personalising treatment, but struggles with the delayed and heterogeneous effects of interventions. We propose a novel RL framework to study and support decision-making in T1DM technologies, such as automated insulin delivery. Our approach captures the complex temporal dynamics of treatment by unifying two control modalities: \textit{impulse control} for discrete, fast-acting interventions (e.g., insulin boluses), and \textit{switching control} for longer-acting treatments and regime shifts. The core of our method is a constrained Markov decision process augmented with physiological state features, enabling safe policy learning under clinical and resource constraints. The framework incorporates biologically realistic factors, including insulin decay, leading to policies that better reflect real-world therapeutic behaviour. While not intended for clinical deployment, this work establishes a foundation for future safe and temporally-aware RL in healthcare. We provide theoretical guarantees of convergence and demonstrate empirical improvements in a stylised T1DM control task, reducing blood glucose level violations from 22.4\% (state-of-the-art) to as low as 10.8\%.


【2】Provably Near-Optimal Distributionally Robust Reinforcement Learning in Online Settings
标题:在线环境中可证明的近优分布鲁棒强化学习
链接:https://arxiv.org/abs/2508.03768

作者:Ghosh, George K. Atia, Yue Wang
备注:arXiv admin note: text overlap with arXiv:2404.03578 by other authors
摘要:由于模拟与现实之间的差距,强化学习(RL)在现实世界部署中面临重大挑战:由于训练和部署条件之间的不匹配,在模拟器中训练的策略在实践中往往表现不佳。分布鲁棒RL通过在环境不确定性集合上优化最坏情况性能来解决这一问题,并为部署性能提供经过优化的下界。然而,现有研究通常假设可以访问生成模型或广泛覆盖部署环境的离线数据集,这些假设限制了它们在没有先验知识的未知环境中的实用性。在这项工作中,我们研究更现实也更具挑战性的在线分布鲁棒强化学习设定:代理仅与单个未知的训练环境交互,同时旨在优化其最坏情况性能。我们关注一般的基于$f$-散度的不确定性集(包括卡方和KL散度球),并在最少的假设下提出了一个具有次线性遗憾保证的计算高效算法。此外,我们还建立了在线学习遗憾的极小极大下界,证明了我们方法的近似最优性。在不同环境下的大量实验进一步证实了我们算法的鲁棒性和效率,验证了我们的理论结果。
摘要:Reinforcement learning (RL) faces significant challenges in real-world deployments due to the sim-to-real gap, where policies trained in simulators often underperform in practice due to mismatches between training and deployment conditions. Distributionally robust RL addresses this issue by optimizing worst-case performance over an uncertainty set of environments and providing an optimized lower bound on deployment performance. However, existing studies typically assume access to either a generative model or offline datasets with broad coverage of the deployment environment -- assumptions that limit their practicality in unknown environments without prior knowledge. In this work, we study the more realistic and challenging setting of online distributionally robust RL, where the agent interacts only with a single unknown training environment while aiming to optimize its worst-case performance. We focus on general $f$-divergence-based uncertainty sets, including Chi-Square and KL divergence balls, and propose a computationally efficient algorithm with sublinear regret guarantees under minimal assumptions. Furthermore, we establish a minimax lower bound on regret of online learning, demonstrating the near-optimality of our approach. Extensive experiments across diverse environments further confirm the robustness and efficiency of our algorithm, validating our theoretical findings.


【3】Comparing Normalization Methods for Portfolio Optimization with Reinforcement Learning
标题:比较投资组合优化的规范化方法与强化学习
链接:https://arxiv.org/abs/2508.03910

作者:ouza Barbosa Costa, Anna Helena Reali Costa
摘要:最近,强化学习在机器人、游戏、自然语言处理和金融等各个领域都取得了显著的成果。在金融领域,这种方法已被应用于投资组合优化等任务,其中代理人不断调整金融投资组合中的资产分配以最大化利润。许多研究为此目的引入了新的仿真环境,神经网络架构和训练算法。其中,领域特定的策略梯度算法已经获得了显着的关注,在研究界的轻量级,快速,并优于其他方法。然而,最近的研究表明,这种算法可能会产生不一致的结果,表现不佳,特别是当投资组合不包含加密货币时。对这个问题的一个可能的解释是,通常使用的状态归一化方法可能会导致代理人丢失有关交易资产真实价值的关键信息。本文通过评估三个不同市场(IBOVESPA,NYSE和加密货币)中最广泛使用的两种归一化方法,并将其与训练前归一化数据的标准做法进行比较,来探索这一假设。结果表明,在这个特定的领域,状态规范化确实可以降低代理的性能。
摘要 :Recently, reinforcement learning has achieved remarkable results in various domains, including robotics, games, natural language processing, and finance. In the financial domain, this approach has been applied to tasks such as portfolio optimization, where an agent continuously adjusts the allocation of assets within a financial portfolio to maximize profit. Numerous studies have introduced new simulation environments, neural network architectures, and training algorithms for this purpose. Among these, a domain-specific policy gradient algorithm has gained significant attention in the research community for being lightweight, fast, and for outperforming other approaches. However, recent studies have shown that this algorithm can yield inconsistent results and underperform, especially when the portfolio does not consist of cryptocurrencies. One possible explanation for this issue is that the commonly used state normalization method may cause the agent to lose critical information about the true value of the assets being traded. This paper explores this hypothesis by evaluating two of the most widely used normalization methods across three different markets (IBOVESPA, NYSE, and cryptocurrencies) and comparing them with the standard practice of normalizing data before training. The results indicate that, in this specific domain, the state normalization can indeed degrade the agent's performance.


【4】Reinforcement Learning in MDPs with Information-Ordered Policies
标题:具有信息有序策略的MDP中的强化学习
链接:https://arxiv.org/abs/2508.03904

作者:Zhang, Shipra Agrawal, Ilan Lobel, Sean R. Sinclair, Christina Lee Yu
备注:57 pages, 2 figures
摘要:我们针对无限时域平均成本马尔可夫决策过程(MDP)提出了一种基于轮次(epoch)的强化学习算法,其利用策略类上的一个偏序。在该结构中,若在$\pi$下收集的数据可用于估计$\pi'$的性能,则记$\pi' \leq \pi$,从而无需额外的环境交互即可进行反事实推断。利用这一偏序,我们证明算法达到$O(\sqrt{w \log(|\Theta|) T})$的遗憾界,其中$w$是偏序的宽度。值得注意的是,该界与状态和动作空间的大小无关。我们展示了这类偏序在运筹学诸多领域(包括库存控制与排队系统)中的适用性。对于每一类问题,我们将框架应用于其上,在不施加额外假设(如库存模型中的凸性或排队模型中特定的到达率结构)的情况下,得到了新的理论保证和良好的实证结果。
摘要:We propose an epoch-based reinforcement learning algorithm for infinite-horizon average-cost Markov decision processes (MDPs) that leverages a partial order over a policy class. In this structure, $\pi' \leq \pi$ if data collected under $\pi$ can be used to estimate the performance of $\pi'$, enabling counterfactual inference without additional environment interaction. Leveraging this partial order, we show that our algorithm achieves a regret bound of $O(\sqrt{w \log(|\Theta|) T})$, where $w$ is the width of the partial order. Notably, the bound is independent of the state and action space sizes. We illustrate the applicability of these partial orders in many domains in operations research, including inventory control and queuing systems. For each, we apply our framework to that problem, yielding new theoretical guarantees and strong empirical results without imposing extra assumptions such as convexity in the inventory model or specialized arrival-rate structure in the queuing model.


元学习(3篇)

【1】Query Attribute Modeling: Improving search relevance with Semantic Search and Meta Data Filtering
标题:查询属性建模:通过语义搜索和元数据过滤提高搜索相关性
链接:https://arxiv.org/abs/2508.04683

作者:enon, Batool Arhamna Haider, Muhammad Arham, Kanwal Mehreen, Ram Mohan Rao Kadiyala, Hamza Farooq
摘要:本研究介绍了查询属性建模(QAM),这是一个通过将开放文本查询分解为结构化元数据标签和语义元素来提高搜索精度与相关性的混合框架。QAM从自由格式的文本查询中自动提取元数据过滤器,减少噪声并实现对相关条目的聚焦检索,从而解决了传统搜索的局限性。使用Amazon Toys Reviews数据集(10,000个独特商品、40,000多条评论及详细的产品属性)进行的实验评估证明了QAM的卓越性能,取得了52.99%的mAP@5。这显著优于传统方法,包括BM25关键词搜索、基于编码器的语义相似度搜索、交叉编码器重排序,以及通过倒数排名融合(RRF)组合BM25与语义结果的混合搜索。这些结果确立了QAM作为企业搜索应用(尤其是电子商务系统)的一个稳健解决方案。
摘要:This study introduces Query Attribute Modeling (QAM), a hybrid framework that enhances search precision and relevance by decomposing open text queries into structured metadata tags and semantic elements. QAM addresses traditional search limitations by automatically extracting metadata filters from free-form text queries, reducing noise and enabling focused retrieval of relevant items.   Experimental evaluation using the Amazon Toys Reviews dataset (10,000 unique items with 40,000+ reviews and detailed product attributes) demonstrated QAM's superior performance, achieving a mean average precision at 5 (mAP@5) of 52.99\%. This represents significant improvement over conventional methods, including BM25 keyword search, encoder-based semantic similarity search, cross-encoder re-ranking, and hybrid search combining BM25 and semantic results via Reciprocal Rank Fusion (RRF). The results establish QAM as a robust solution for Enterprise Search applications, particularly in e-commerce systems.
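QAM 的"元数据过滤 + 语义检索"流水线可用如下极简示意表达(假设性实现:用正则式代替原文的自动抽取器,物品字段 price、age 等均为举例):

```python
import re
import numpy as np

def extract_filters(query):
    """从自由文本查询中抽取结构化过滤条件 (仅示意两类属性)。"""
    filters = {}
    m = re.search(r"under \$?(\d+)", query)           # 价格上限, 如 "under $30"
    if m:
        filters["max_price"] = float(m.group(1))
    m = re.search(r"for (?:ages? )?(\d+)\+?", query)  # 适用年龄, 如 "for ages 5+"
    if m:
        filters["min_age"] = int(m.group(1))
    return filters

def qam_search(query_vec, query, items, k=5):
    """先按元数据过滤缩小候选集, 再在剩余候选上按语义相似度排序。
    items: [{'vec': np.ndarray, 'price': float, 'age': int, 'name': str}, ...]"""
    f = extract_filters(query)
    cand = [it for it in items
            if it["price"] <= f.get("max_price", np.inf)
            and it["age"] >= f.get("min_age", 0)]
    sims = [(float(query_vec @ it["vec"]) /
             (np.linalg.norm(query_vec) * np.linalg.norm(it["vec"]) + 1e-8),
             it["name"])
            for it in cand]
    return sorted(sims, reverse=True)[:k]
```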


【2】Algorithm Selection for Recommender Systems via Meta-Learning on Algorithm Characteristics
标题:基于算法特征元学习的推荐系统算法选择
链接:https://arxiv.org/abs/2508.04419

作者:hi Decker, Joeran Beel
摘要:推荐系统的算法选择问题——为给定用户或上下文选择最佳算法——仍然是一个重大挑战。传统的元学习方法通常将算法视为类别型选项,忽略了其内在属性。最近的工作表明,用特征显式刻画算法可以在其他领域提升模型性能。在此基础上,我们提出了一种针对推荐系统选择的按用户元学习方法,同时利用用户元特征与从源代码自动提取的算法特征。我们在六个不同数据集上取平均的初步结果显示,为元学习器加入算法特征后,其平均NDCG@10性能提升了8.83%,从0.135(仅用户特征)提高到0.147。该增强模型优于"单一最佳算法"基线(0.131),并成功缩小了与理论最优选择器之间10.5%的性能差距。这些发现表明,即使是静态的源代码指标也能提供有价值的预测信号,为构建更稳健、更智能的推荐系统指出了一个有前景的方向。
摘要:The Algorithm Selection Problem for recommender systems-choosing the best algorithm for a given user or context-remains a significant challenge. Traditional meta-learning approaches often treat algorithms as categorical choices, ignoring their intrinsic properties. Recent work has shown that explicitly characterizing algorithms with features can improve model performance in other domains. Building on this, we propose a per-user meta-learning approach for recommender system selection that leverages both user meta-features and automatically extracted algorithm features from source code. Our preliminary results, averaged over six diverse datasets, show that augmenting a meta-learner with algorithm features improves its average NDCG@10 performance by 8.83% from 0.135 (user features only) to 0.147. This enhanced model outperforms the Single Best Algorithm baseline (0.131) and successfully closes 10.5% of the performance gap to a theoretical oracle selector. These findings show that even static source code metrics provide a valuable predictive signal, presenting a promising direction for building more robust and intelligent recommender systems.
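该按用户元学习选择器的骨架可以概括如下(假设性示意:特征维度、回归器与合成数据均为举例,并非原文设置):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# 合成训练数据: 每行 = [用户元特征(8维) || 源码提取的算法特征(4维)], 标签 = NDCG@10
X, y = rng.random((600, 12)), rng.random(600)
meta = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

def select_algorithm(user_feats, algo_feats_list):
    """对每个候选算法拼接 (用户特征, 算法特征) 并预测性能, 返回得分最高者的下标。"""
    rows = np.asarray([np.concatenate([user_feats, a]) for a in algo_feats_list])
    return int(np.argmax(meta.predict(rows)))

best = select_algorithm(rng.random(8), [rng.random(4) for _ in range(5)])
```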


【3】Step More: Going Beyond Single Backpropagation in Meta Learning Based Model Editing
标题:多走几步:在基于元学习的模型编辑中超越单次反向传播
链接:https://arxiv.org/abs/2508.04012

作者:Li, Shasha Li, Xi Wang, Shezheng Song, Bin Ji, Shangwen Wang, Jun Ma, Xiaodong Liu, Mina Liu, Jie Yu
摘要:大型语言模型(LLM)支撑着许多AI应用程序,但它们的静态性质使得更新知识的成本高昂。模型编辑通过有针对性的参数修改注入新信息,提供了一种高效的替代方案。特别是,基于元学习的模型编辑(MLBME)方法在编辑效果和效率方面都表现出显著优势。尽管如此,我们发现MLBME在低数据场景下表现欠佳,且其训练效率受制于KL散度计算的瓶颈。为解决这些问题,我们提出了$\textbf{S}$tep $\textbf{M}$ore $\textbf{Edit}$($\textbf{SMEdit}$),一种新的MLBME方法,采用$\textbf{M}$ultiple $\textbf{B}$ackpro$\textbf{P}$agation $\textbf{S}$teps($\textbf{MBPS}$)来提升有限监督下的编辑性能,并对权重更新施加范数正则化以提高训练效率。在两个数据集和两个LLM上的实验结果表明,SMEdit优于先前的MLBME基线,且MBPS策略可以无缝集成到现有方法中以进一步提升其性能。我们的代码即将发布。
摘要 :Large Language Models (LLMs) underpin many AI applications, but their static nature makes updating knowledge costly. Model editing offers an efficient alternative by injecting new information through targeted parameter modifications. In particular, meta-learning-based model editing (MLBME) methods have demonstrated notable advantages in both editing effectiveness and efficiency. Despite this, we find that MLBME exhibits suboptimal performance in low-data scenarios, and its training efficiency is bottlenecked by the computation of KL divergence. To address these, we propose $\textbf{S}$tep $\textbf{M}$ore $\textbf{Edit}$ ($\textbf{SMEdit}$), a novel MLBME method that adopts $\textbf{M}$ultiple $\textbf{B}$ackpro$\textbf{P}$agation $\textbf{S}$teps ($\textbf{MBPS}$) to improve editing performance under limited supervision and a norm regularization on weight updates to improve training efficiency. Experimental results on two datasets and two LLMs demonstrate that SMEdit outperforms prior MLBME baselines and the MBPS strategy can be seamlessly integrated into existing methods to further boost their performance. Our code will be released soon.
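"多步反向传播 + 权重更新范数正则"的思路可以用如下 PyTorch 草图表达(假设性实现,与 SMEdit 的具体元学习目标并不等同):

```python
import torch

def multi_step_edit(model, loss_fn, edit_batch, steps=5, lr=1e-4, reg=1e-2):
    """对编辑样本做 K 步梯度更新 (MBPS), 并以 ||theta - theta_0||^2 正则抑制权重漂移。
    loss_fn(model, batch) 应返回编辑目标上的标量损失; 超参均为假设值。"""
    theta0 = [p.detach().clone() for p in model.parameters()]
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):                     # 多次反向传播, 而非单步
        opt.zero_grad()
        drift = sum(((p - p0) ** 2).sum()
                    for p, p0 in zip(model.parameters(), theta0))
        (loss_fn(model, edit_batch) + reg * drift).backward()
        opt.step()
    return model
```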


分层学习(1篇)

【1】Hierarchical Scoring for Machine Learning Classifier Error Impact Evaluation
标题:机器学习分类器错误影响评估的分层评分
链接:https://arxiv.org/abs/2508.04489

作者:s, Daniel Wolodkin, Laura J. Freeman
摘要:机器学习(ML)模型的一个常见用途是预测样本的类别。对象检测是分类的扩展,包括通过样本内的边界框定位对象。如果预测的标签与地面实况标签不匹配,则通常通过将预测计为错误来评估分类以及扩展的对象检测。这种通过/未通过式评分将所有错误分类视为等同。在许多情况下,类标签可以组织成具有层次结构的类分类法,以反映数据之间的关系或操作者对错误分类的评估。当存在这种层次结构时,分层评分度量可以依据预测与地面实况标签之间的距离来返回给定预测的模型性能。这样的指标可以被视为对预测给予部分分,而非简单的通过/失败,从而能够更细粒度地理解错误分类的影响。这项工作开发了复杂度各异的分层评分指标,利用评分树来编码类标签之间的关系,并产生反映评分树中距离的指标。我们在一个抽象用例上演示了这些评分指标,该用例的评分树代表三种加权策略,并依据其所抑制的错误类型进行评估。结果表明,这些指标能以更细的粒度刻画错误,且评分树支持按需调优。这项工作展示了一种评估ML性能的方法:不仅按错误数量,还按错误的种类或影响来对模型进行排名。在发布时,评分指标的Python实现将在开源存储库中提供。
摘要:A common use of machine learning (ML) models is predicting the class of a sample. Object detection is an extension of classification that includes localization of the object via a bounding box within the sample. Classification, and by extension object detection, is typically evaluated by counting a prediction as incorrect if the predicted label does not match the ground truth label. This pass/fail scoring treats all misclassifications as equivalent. In many cases, class labels can be organized into a class taxonomy with a hierarchical structure to either reflect relationships among the data or operator valuation of misclassifications. When such a hierarchical structure exists, hierarchical scoring metrics can return the model performance of a given prediction related to the distance between the prediction and the ground truth label. Such metrics can be viewed as giving partial credit to predictions instead of pass/fail, enabling a finer-grained understanding of the impact of misclassifications. This work develops hierarchical scoring metrics varying in complexity that utilize scoring trees to encode relationships between class labels and produce metrics that reflect distance in the scoring tree. The scoring metrics are demonstrated on an abstract use case with scoring trees that represent three weighting strategies and evaluated by the kind of errors discouraged. Results demonstrate that these metrics capture errors with finer granularity and the scoring trees enable tuning. This work demonstrates an approach to evaluating ML performance that ranks models not only by how many errors are made but by the kind or impact of errors. Python implementations of the scoring metrics will be available in an open-source repository at time of publication.
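其"按评分树距离给部分分"的核心逻辑可以写成如下假设性最小实现(parent 映射与标签层级均为举例):

```python
def ancestors(label, parent):
    """返回从 label 到根的路径 (含自身)。parent: 子->父 映射, 根的父为 None。"""
    path = [label]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path

def tree_score(pred, truth, parent, max_dist):
    """部分分 = 1 - 树上距离 / 最大距离; 完全命中得 1, 在树上越远得分越低。"""
    pa, ta = ancestors(pred, parent), ancestors(truth, parent)
    common = next(a for a in pa if a in ta)          # 最近公共祖先
    dist = pa.index(common) + ta.index(common)       # 经由公共祖先的路径长度
    return 1.0 - dist / max_dist

parent = {"vehicle": None, "car": "vehicle", "truck": "vehicle",
          "sedan": "car", "suv": "car"}
print(tree_score("suv", "sedan", parent, max_dist=4))   # 0.5: 同为 car 的子类
```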


医学相关(3篇)

【1】Continual Multiple Instance Learning for Hematologic Disease Diagnosis
标题:面向血液病诊断的持续多实例学习
链接:https://arxiv.org/abs/2508.04368

作者:ahimi, Raheleh Salehi, Nassir Navab, Carsten Marr, Ario Sadafi
备注:Accepted for publication at MICCAI 2024 workshop on Efficient Medical AI
摘要:实验室和诊所的动态环境中每天都有数据流到达,需要定期更新已训练的机器学习模型以保持性能稳定。持续学习理应帮助模型在训练中避免灾难性遗忘。然而,最先进的方法对多实例学习(MIL)并不有效,而MIL常用于基于单细胞的血液病诊断(例如白血病检测)。在此,我们提出了第一个专门为MIL定制的持续学习方法。我们的方法基于排练(rehearsal),从不同的包中挑选单个实例。我们结合实例注意力得分以及与包均值和类均值向量的距离,来仔细选择将哪些样本和实例存入先前任务的示例集,从而保留数据的多样性。使用来自白血病实验室一个月数据的真实输入,我们在类增量场景中研究了该方法的有效性,并与知名的持续学习方法进行比较。我们表明,我们的方法大幅优于最先进的方法,为MIL提供了第一个持续学习方案。这使得模型能够适应随时间变化的数据分布,例如由疾病发生率或潜在遗传变异变化所引起的分布偏移。
摘要:The dynamic environment of laboratories and clinics, with streams of data arriving on a daily basis, requires regular updates of trained machine learning models for consistent performance. Continual learning is supposed to help train models without catastrophic forgetting. However, state-of-the-art methods are ineffective for multiple instance learning (MIL), which is often used in single-cell-based hematologic disease diagnosis (e.g., leukemia detection). Here, we propose the first continual learning method tailored specifically to MIL. Our method is rehearsal-based over a selection of single instances from various bags. We use a combination of the instance attention score and distance from the bag mean and class mean vectors to carefully select which samples and instances to store in exemplary sets from previous tasks, preserving the diversity of the data. Using the real-world input of one month of data from a leukemia laboratory, we study the effectiveness of our approach in a class incremental scenario, comparing it to well-known continual learning methods. We show that our method considerably outperforms state-of-the-art methods, providing the first continual learning approach for MIL. This enables the adaptation of models to shifting data distributions over time, such as those caused by changes in disease occurrence or underlying genetic alterations.
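其示例集挑选准则可以概括为"注意力得分与到包均值/类均值距离的加权组合",如下为一个假设性示意(权重与符号约定均为举例):

```python
import numpy as np

def rehearsal_score(inst_feats, attn, bag_mean, class_mean, w=(1.0, 0.5, 0.5)):
    """对包内每个实例打分: 注意力高、且贴近包/类分布中心的实例更具代表性。
    inst_feats: (n, d) 实例特征; attn: (n,) 注意力得分; 权重 w 为假设值。
    也可对距离取反向或分桶挑选, 以保持示例集的多样性。"""
    d_bag = np.linalg.norm(inst_feats - bag_mean, axis=1)
    d_cls = np.linalg.norm(inst_feats - class_mean, axis=1)
    return w[0] * attn - w[1] * d_bag - w[2] * d_cls

feats, attn = np.random.randn(8, 16), np.random.rand(8)
scores = rehearsal_score(feats, attn, feats.mean(0), feats.mean(0))
keep = np.argsort(scores)[-3:]      # 每个包仅保留得分最高的 3 个实例进入示例集
```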


【2】Health Insurance Coverage Rule Interpretation Corpus: Law, Policy, and Medical Guidance for Health Insurance Coverage Understanding
标题:健康保险覆盖范围规则解释文集:了解健康保险覆盖范围的法律、政策和医疗指南
链接:https://arxiv.org/abs/2508.03718

作者:ner
备注:22 pages, 7 figures
摘要:美国的医疗保险是复杂的,对医疗保险的理解不足和诉诸司法的机会有限,对最弱势群体产生了可怕的影响。自然语言处理的进步为支持有效的、针对具体案例的理解以及改善获得司法和医疗保健的机会提供了机会。然而,现有的语料库缺乏必要的背景下,即使是简单的情况下进行评估。我们收集并发布与美国健康保险相关的有信誉的法律和医学文本语料库。我们还为健康保险上诉引入了一个结果预测任务,旨在支持监管和患者自助应用程序,并为我们的任务发布了一个标记的基准,以及在此基础上训练的模型。
摘要:U.S. health insurance is complex, and inadequate understanding and limited access to justice have dire implications for the most vulnerable. Advances in natural language processing present an opportunity to support efficient, case-specific understanding, and to improve access to justice and healthcare. Yet existing corpora lack context necessary for assessing even simple cases. We collect and release a corpus of reputable legal and medical text related to U.S. health insurance. We also introduce an outcome prediction task for health insurance appeals designed to support regulatory and patient self-help applications, and release a labeled benchmark for our task, and models trained on it.


【3】Boosting Vision Semantic Density with Anatomy Normality Modeling for Medical Vision-language Pre-training
标题:通过解剖正常性建模提升视觉语义密度以促进医学视觉语言预训练
链接:https://arxiv.org/abs/2508.03742

作者:o, Jianpeng Zhang, Zhongyi Shui, Sinuo Wang, Zeli Chen, Xi Li, Le Lu, Xianghua Ye, Tingbo Liang, Qi Zhang, Ling Zhang
摘要:视觉语言预训练(VLP)在开发多功能和通用医疗诊断能力方面具有巨大的潜力。然而,将具有低信噪比(SNR)的医学图像与具有高SNR的报告对齐呈现语义密度差距,从而导致视觉对齐偏差。在本文中,我们提出了提高视觉语义密度,以提高对齐效率。一方面,我们通过疾病级视觉对比学习增强视觉语义,这增强了模型区分每个解剖结构的正常和异常样本的能力。另一方面,我们引入了一种解剖正常性建模方法来模拟每个解剖结构的正常样本的分布,利用VQ-VAE重建潜在空间中的正常视觉嵌入。该过程通过利用异常样本中的分布变化来放大异常信号,增强模型对异常属性的感知和区分。增强的视觉表示有效地捕获诊断相关的语义,促进与诊断报告的更有效和准确的对齐。我们对两个胸部CT数据集CT-RATE和Rad-ChestCT以及腹部CT数据集MedVL-CT69 K进行了广泛的实验,并全面评估了胸部和腹部CT场景中多个任务的诊断性能,实现了最先进的zero-shot性能。值得注意的是,我们的方法在15个器官的54种疾病中实现了84.9%的平均AUC,显著超过了现有方法。此外,我们还展示了预训练模型的卓越迁移学习能力。代码可在https://github.com/alibaba-damo-academy/ViSD-Boost上获得。
摘要:Vision-language pre-training (VLP) has great potential for developing multifunctional and general medical diagnostic capabilities. However, aligning medical images with a low signal-to-noise ratio (SNR) to reports with a high SNR presents a semantic density gap, leading to visual alignment bias. In this paper, we propose boosting vision semantic density to improve alignment effectiveness. On one hand, we enhance visual semantics through disease-level vision contrastive learning, which strengthens the model's ability to differentiate between normal and abnormal samples for each anatomical structure. On the other hand, we introduce an anatomical normality modeling method to model the distribution of normal samples for each anatomy, leveraging VQ-VAE for reconstructing normal vision embeddings in the latent space. This process amplifies abnormal signals by leveraging distribution shifts in abnormal samples, enhancing the model's perception and discrimination of abnormal attributes. The enhanced visual representation effectively captures the diagnostic-relevant semantics, facilitating more efficient and accurate alignment with the diagnostic report. We conduct extensive experiments on two chest CT datasets, CT-RATE and Rad-ChestCT, and an abdominal CT dataset, MedVL-CT69K, and comprehensively evaluate the diagnosis performance across multiple tasks in the chest and abdominal CT scenarios, achieving state-of-the-art zero-shot performance. Notably, our method achieved an average AUC of 84.9% across 54 diseases in 15 organs, significantly surpassing existing methods. Additionally, we demonstrate the superior transfer learning capabilities of our pre-trained model. Code is available at https://github.com/alibaba-damo-academy/ViSD-Boost.


推荐(5篇)

【1】Do Recommender Systems Really Leverage Multimodal Content? A Comprehensive Analysis on Multimodal Representations for Recommendation
标题:推荐系统真的能利用多模态内容吗?多模态推荐表示方法的综合分析
链接:https://arxiv.org/abs/2508.04571

作者:omo, Matteo Attimonelli, Danilo Danese, Fedelucio Narducci, Tommaso Di Noia
备注:Accepted as Full Research Papers at CIKM 2025
摘要:多模态推荐系统旨在通过整合图像和文本元数据等异构内容来提高推荐准确性。这类系统虽然有效,但其收益究竟来自真正的多模态理解还是模型复杂度的增加,仍不清楚。本工作研究多模态物品嵌入的作用,强调表示的语义信息量。初步实验表明,标准提取器(如ResNet50、Sentence-Bert)得到的嵌入能提升性能,但依赖模态专用编码器和临时拼凑的融合策略,缺乏对跨模态对齐的控制。为克服这些限制,我们利用大型视觉语言模型(LVLM),通过结构化提示生成"天然多模态"的嵌入。该方法无需任何融合即可产生语义对齐的表示。多种设置下的实验显示出显著的性能提升。此外,LVLM嵌入具有独特优势:它们可以被解码为结构化的文本描述,从而可直接评估其多模态理解能力。当这些描述作为辅助内容纳入推荐系统时,它们会进一步提升推荐性能,实证验证了LVLM输出所编码的语义深度与对齐程度。我们的研究强调了语义丰富表示的重要性,并将LVLM定位为在推荐任务中构建稳健且有意义的多模态表示的坚实基础。
摘要:Multimodal Recommender Systems aim to improve recommendation accuracy by integrating heterogeneous content, such as images and textual metadata. While effective, it remains unclear whether their gains stem from true multimodal understanding or increased model complexity. This work investigates the role of multimodal item embeddings, emphasizing the semantic informativeness of the representations. Initial experiments reveal that embeddings from standard extractors (e.g., ResNet50, Sentence-Bert) enhance performance, but rely on modality-specific encoders and ad hoc fusion strategies that lack control over cross-modal alignment. To overcome these limitations, we leverage Large Vision-Language Models (LVLMs) to generate multimodal-by-design embeddings via structured prompts. This approach yields semantically aligned representations without requiring any fusion. Experiments across multiple settings show notable performance improvements. Furthermore, LVLMs embeddings offer a distinctive advantage: they can be decoded into structured textual descriptions, enabling direct assessment of their multimodal comprehension. When such descriptions are incorporated as side content into recommender systems, they improve recommendation performance, empirically validating the semantic depth and alignment encoded within LVLMs outputs. Our study highlights the importance of semantically rich representations and positions LVLMs as a compelling foundation for building robust and meaningful multimodal representations in recommendation tasks.


【2】Measuring the stability and plasticity of recommender systems
标题:衡量推荐系统的稳定性和可塑性
链接:https://arxiv.org/abs/2508.03941

作者:o Lavoura, Robert Jungnickel, João Vinagre
摘要:评估推荐算法的典型离线协议是:收集用户-物品交互数据集,用其中一部分训练模型,再用剩余数据衡量模型推荐与观察到的用户交互的匹配程度。该协议简单、有用且实际,但它只刻画了过去某个时间点训练出的特定模型的表现。然而我们知道,在线系统会随时间演化。一般而言,让模型反映这些变化是好事,因此模型经常用最新数据重新训练。但若如此,我们在多大程度上还能信任以往的评估?当不同的模式(重新)出现时,模型会如何表现?本文提出一种方法来研究推荐模型在重新训练时的行为。其思路是依据两种能力为算法画像:一方面保留过去的模式(稳定性),另一方面(快速)适应变化(可塑性)。我们设计了一个离线评估协议,它能细致呈现模型的长期行为,并且与数据集、算法和指标无关。为说明该框架的潜力,我们给出了三类不同算法在GoodReads数据集上的初步结果,显示出依赖算法技术的不同稳定性-可塑性画像,以及稳定性与可塑性之间可能存在的权衡。尽管仍需更多实验来证实这些观察,但它们已经说明了所提框架对洞察推荐模型长期动态的有用性。
摘要:The typical offline protocol to evaluate recommendation algorithms is to collect a dataset of user-item interactions and then use a part of this dataset to train a model, and the remaining data to measure how closely the model recommendations match the observed user interactions. This protocol is straightforward, useful and practical, but it only captures performance of a particular model trained at some point in the past. We know, however, that online systems evolve over time. In general, it is a good idea that models reflect such changes, so models are frequently retrained with recent data. But if this is the case, to what extent can we trust previous evaluations? How will a model perform when a different pattern (re)emerges? In this paper we propose a methodology to study how recommendation models behave when they are retrained. The idea is to profile algorithms according to their ability to, on the one hand, retain past patterns -- stability -- and, on the other hand, (quickly) adapt to changes -- plasticity. We devise an offline evaluation protocol that provides detail on the long-term behavior of models, and that is agnostic to datasets, algorithms and metrics. To illustrate the potential of this framework, we present preliminary results of three different types of algorithms on the GoodReads dataset that suggest different stability and plasticity profiles depending on the algorithmic technique, and a possible trade-off between stability and plasticity.Although additional experiments will be necessary to confirm these observations, they already illustrate the usefulness of the proposed framework to gain insights on the long term dynamics of recommendation models.
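该协议的两个画像指标可以直接写成如下假设性函数(归一化方式为举例):

```python
def stability(metric_old_before, metric_old_after):
    """稳定性: 重训后在旧模式测试集上保留的性能比例 (越接近 1 越稳定)。"""
    return metric_old_after / max(metric_old_before, 1e-12)

def plasticity(metric_new_after, metric_new_oracle):
    """可塑性: 重训后对新模式的适应程度, 以 '专训新数据的模型' 为上界归一化。"""
    return metric_new_after / max(metric_new_oracle, 1e-12)

# 用法示例: 假设重训前后在旧/新模式上的 NDCG 如下
print(stability(0.20, 0.18))    # 0.9: 保留了 90% 的旧模式性能
print(plasticity(0.15, 0.25))   # 0.6: 适应新模式的程度为上界的 60%
```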


【3】Two-dimensional Sparse Parallelism for Large Scale Deep Learning Recommendation Model Training
标题:大规模深度学习推荐模型训练的二维稀疏并行
链接:https://arxiv.org/abs/2508.03854

作者:, Quanyu Zhu, Liangbei Xu, Zain Huda, Wang Zhou, Jin Fang, Dennis van der Staay, Yuxi Hu, Jade Nie, Jiyan Yang, Chunzhi Yang
摘要:深度学习推荐模型(DLRM)日益增长的复杂性,催生了对能够高效训练海量数据的大规模分布式系统的需求。在DLRM中,稀疏嵌入表是管理稀疏类别特征的关键组件。工业级DLRM中的这些表通常包含数万亿参数,必须采用模型并行策略来应对内存限制。然而,随着训练系统扩展到大规模GPU集群,嵌入表的传统全分片并行策略面临严重的可扩展性挑战,包括负载不均衡与掉队者问题、密集的查表通信,以及沉重的嵌入激活内存开销。为克服这些限制,我们提出了一种新颖的二维稀疏并行方法。我们的方案不再将表完全分片到所有GPU上,而是在模型并行之上引入数据并行。这实现了高效的all-to-all通信,并降低了峰值内存消耗。此外,我们还提出了动量缩放的行式AdaGrad算法,以缓解训练范式转变带来的性能损失。大量实验表明,所提方法在保持模型效果持平的同时显著提升了训练效率,在多达4K张GPU上实现了接近线性的训练加速,为推荐模型训练树立了新的最先进基准。
摘要:The increasing complexity of deep learning recommendation models (DLRM) has led to a growing need for large-scale distributed systems that can efficiently train vast amounts of data. In DLRM, the sparse embedding table is a crucial component for managing sparse categorical features. Typically, these tables in industrial DLRMs contain trillions of parameters, necessitating model parallelism strategies to address memory constraints. However, as training systems expand with massive GPUs, the traditional fully parallelism strategies for embedding table post significant scalability challenges, including imbalance and straggler issues, intensive lookup communication, and heavy embedding activation memory. To overcome these limitations, we propose a novel two-dimensional sparse parallelism approach. Rather than fully sharding tables across all GPUs, our solution introduces data parallelism on top of model parallelism. This enables efficient all-to-all communication and reduces peak memory consumption. Additionally, we have developed the momentum-scaled row-wise AdaGrad algorithm to mitigate performance losses associated with the shift in training paradigms. Our extensive experiments demonstrate that the proposed approach significantly enhances training efficiency while maintaining model performance parity. It achieves nearly linear training speed scaling up to 4K GPUs, setting a new state-of-the-art benchmark for recommendation model training.
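"行式 AdaGrad"的要点是每个嵌入行只维护一个标量二阶累积量;下面给出一个结合动量缩放的假设性实现(与原文算法的具体缩放方式未必一致):

```python
import numpy as np

class RowWiseAdaGrad:
    """嵌入表的行式 AdaGrad: 每行一个累积量, 状态内存远小于逐元素 AdaGrad。"""

    def __init__(self, table, lr=0.01, beta=0.9, eps=1e-8):
        self.w, self.lr, self.beta, self.eps = table, lr, beta, eps
        self.acc = np.zeros(table.shape[0])     # 每个嵌入行仅一个标量累积量
        self.mom = np.zeros_like(table)

    def update(self, row_ids, grads):
        for r, g in zip(row_ids, grads):
            self.acc[r] += float((g * g).mean())          # 行内梯度平方的均值
            self.mom[r] = self.beta * self.mom[r] + g     # 动量
            scale = self.lr / (np.sqrt(self.acc[r]) + self.eps)
            self.w[r] -= scale * self.mom[r]              # 动量按行自适应缩放

table = np.random.randn(1000, 64) * 0.01
opt = RowWiseAdaGrad(table)
opt.update([3, 17], [np.random.randn(64), np.random.randn(64)])
```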


【4】Evaluating Generative AI Tools for Personalized Offline Recommendations: A Comparative Study
标题:评估用于个性化离线推荐的生成式人工智能工具:一项比较研究
链接:https://arxiv.org/abs/2508.03710

作者:linas-Buestan, Otto Parra, Nelly Condori-Fernandez, Maria Fernanda Granda
备注:ESEM Registered Report Track
摘要:背景:生成式人工智能工具在支持各领域的个性化推荐方面变得越来越重要。然而,它们在健康相关行为干预(尤其是旨在减少技术使用的干预)中的有效性仍未得到充分探索。目的:本研究评估五种最广泛使用的生成式人工智能工具在为有重复性劳损风险的个体推荐量身定制的非数字活动时的性能和用户满意度。方法:遵循目标/问题/指标(GQM)范式,本拟议实验涉及根据预定义用户画像和干预场景推荐离线活动的生成式AI工具。评估侧重于定量性能(精确率、召回率、F1分数和MCC分数)和定性方面(用户满意度与感知到的推荐相关性)。我们定义了两个研究问题:RQ1评估哪种工具给出最准确的建议,RQ2评估工具选择如何影响用户满意度。
摘要:Background: Generative AI tools have become increasingly relevant in supporting personalized recommendations across various domains. However, their effectiveness in health-related behavioral interventions, especially those aiming to reduce the use of technology, remains underexplored. Aims: This study evaluates the performance and user satisfaction of the five most widely used generative AI tools when recommending non-digital activities tailored to individuals at risk of repetitive strain injury. Method: Following the Goal/Question/Metric (GQM) paradigm, this proposed experiment involves generative AI tools that suggest offline activities based on predefined user profiles and intervention scenarios. The evaluation is focused on quantitative performance (precision, recall, F1-score and MCC-score) and qualitative aspects (user satisfaction and perceived recommendation relevance). Two research questions were defined: RQ1 assessed which tool delivers the most accurate recommendations, and RQ2 evaluated how tool choice influences user satisfaction.


【5】Suggest, Complement, Inspire: Story of Two Tower Recommendations at Allegro.com
标题:建议、补充、启发:Allegro.com双塔推荐的故事
链接:https://arxiv.org/abs/2508.03702

作者:a Osowska-Kurczab, Klaudia Nazarko, Mateusz Marzec, Lidia Wojciechowska, Eliška Kremeňová
备注:Recsys 2025 Industrial Track
摘要:构建大规模电子商务推荐系统需要解决三个关键技术挑战:(1)设计一个适用于数十个展示位的通用推荐架构,(2)降低过高的维护成本,(3)管理高度动态的商品目录。本文介绍了部署在Allegro.com(源自欧洲的最大电子商务平台)上的一个统一的基于内容的推荐系统。该系统建立在流行的双塔检索框架之上,用文本和结构化属性表示商品,从而能够通过近似最近邻搜索实现高效检索。我们演示了仅需修改模型或服务逻辑中的少数组件,同一模型架构即可适配三种不同的推荐任务:相似搜索、互补商品建议和灵感内容发现。两年多的A/B测试证实,桌面端和移动应用渠道的参与度与基于利润的指标均显著提升。我们的结果表明,一个灵活、可扩展的架构能够以最小的维护开销服务多样化的用户意图。
摘要:Building large-scale e-commerce recommendation systems requires addressing three key technical challenges: (1) designing a universal recommendation architecture across dozens of placements, (2) decreasing excessive maintenance costs, and (3) managing a highly dynamic product catalogue. This paper presents a unified content-based recommendation system deployed at Allegro.com, the largest e-commerce platform of European origin. The system is built on a prevalent Two Tower retrieval framework, representing products using textual and structured attributes, which enables efficient retrieval via Approximate Nearest Neighbour search. We demonstrate how the same model architecture can be adapted to serve three distinct recommendation tasks: similarity search, complementary product suggestions, and inspirational content discovery, by modifying only a handful of components in either the model or the serving logic. Extensive A/B testing over two years confirms significant gains in engagement and profit-based metrics across desktop and mobile app channels. Our results show that a flexible, scalable architecture can serve diverse user intents with minimal maintenance overhead.
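双塔检索的骨架非常紧凑,如下为假设性 PyTorch 示意(用简单 MLP 代替原文的文本/结构化属性编码器,温度与维度均为举例):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTower(nn.Module):
    def __init__(self, q_dim, i_dim, d=64):
        super().__init__()
        self.query_tower = nn.Sequential(nn.Linear(q_dim, 128), nn.ReLU(), nn.Linear(128, d))
        self.item_tower = nn.Sequential(nn.Linear(i_dim, 128), nn.ReLU(), nn.Linear(128, d))

    def forward(self, q, items):
        qe = F.normalize(self.query_tower(q), dim=-1)
        ie = F.normalize(self.item_tower(items), dim=-1)
        return qe @ ie.T                                   # 余弦相似度打分矩阵

model = TwoTower(q_dim=32, i_dim=48)
q, pos = torch.randn(8, 32), torch.randn(8, 48)            # 一个 batch 的 (查询, 正样本物品)
scores = model(q, pos)                                     # (8, 8), 对角线为正样本对
loss = F.cross_entropy(scores / 0.05, torch.arange(8))     # in-batch 负采样训练

with torch.no_grad():                                      # 线上: 物品向量可离线建 ANN 索引
    cat_emb = F.normalize(model.item_tower(torch.randn(10000, 48)), dim=-1)
    qe = F.normalize(model.query_tower(torch.randn(1, 32)), dim=-1)
    top10 = torch.topk(qe @ cat_emb.T, k=10).indices       # 此处以精确 top-k 代替 ANN
```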


聚类(1篇)

【1】Bootstrap Deep Spectral Clustering with Optimal Transport
标题:基于最优传输的自举深度谱聚类
链接:https://arxiv.org/abs/2508.04200

作者:uo, Wei Ye, Chunchun Chen, Xin Sun, Christian Böhm, Claudia Plant, Susanto Rahardja
摘要:谱聚类是一种重要的聚类方法,其两个主要缺点是割裂的(非联合)优化流程和有限的表示能力。为解决这些问题,我们提出了一个深度谱聚类模型(名为BootSC),它以端到端的方式用单个网络联合学习谱聚类的所有阶段——亲和矩阵构建、谱嵌入和$k$均值聚类。BootSC利用有效且高效的由最优传输导出的监督来自举亲和矩阵和聚类分配矩阵。此外,还引入了一种语义一致的正交重参数化技术来正交化谱嵌入,显著提升了判别能力。实验结果表明,BootSC实现了最先进的聚类性能。例如,在具有挑战性的ImageNet-Dogs数据集上,它比第二名方法取得了显著的16% NMI提升。我们的代码可在https://github.com/spdj2271/BootSC上获得。
摘要 :Spectral clustering is a leading clustering method. Two of its major shortcomings are the disjoint optimization process and the limited representation capacity. To address these issues, we propose a deep spectral clustering model (named BootSC), which jointly learns all stages of spectral clustering -- affinity matrix construction, spectral embedding, and $k$-means clustering -- using a single network in an end-to-end manner. BootSC leverages effective and efficient optimal-transport-derived supervision to bootstrap the affinity matrix and the cluster assignment matrix. Moreover, a semantically-consistent orthogonal re-parameterization technique is introduced to orthogonalize spectral embeddings, significantly enhancing the discrimination capability. Experimental results indicate that BootSC achieves state-of-the-art clustering performance. For example, it accomplishes a notable 16\% NMI improvement over the runner-up method on the challenging ImageNet-Dogs dataset. Our code is available at https://github.com/spdj2271/BootSC.
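其中"最优传输导出的监督"常见的一种实现方式,是用熵正则的 Sinkhorn 迭代把打分矩阵投影为各簇均衡的软分配并作为自举目标;下面是该构件的假设性示意(未必与 BootSC 的具体形式一致):

```python
import numpy as np

def sinkhorn_assign(logits, n_iter=50, eps=0.05):
    """把 (n, K) 打分矩阵投影为行和约为 1/n、列和约为 1/K 的均衡传输计划,
    常用作自举聚类分配矩阵的目标 (各簇样本数大致均衡)。eps 为熵正则温度。"""
    Q = np.exp(logits / eps)
    Q /= Q.sum()
    n, K = Q.shape
    for _ in range(n_iter):
        Q /= Q.sum(axis=1, keepdims=True); Q /= n   # 行归一化
        Q /= Q.sum(axis=0, keepdims=True); Q /= K   # 列归一化
    return Q * n                                    # 每行重新归一化为概率分布

targets = sinkhorn_assign(np.random.randn(256, 10))
print(targets.sum(axis=0))   # 各簇分配质量大致均衡 (约 25.6)
```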


超分辨率|去噪|去模糊|去雾(1篇)

【1】Assessing the Impact of Image Super Resolution on White Blood Cell Classification Accuracy
标题:评估图像超分辨率对白细胞分类准确性的影响
链接:https://arxiv.org/abs/2508.03759

作者:hi P. Nagarhalli, Shruti S. Pawar, Soham A. Dahanukar, Uday Aswalekar, Ashwini M. Save, Sanket D. Patil
备注:None
摘要:从显微图像中准确分类白细胞,对于医学诊断中识别多种疾病和病症至关重要。许多深度学习技术被用于快速、自动地对图像进行分类。然而,这些显微照片的分辨率往往很低,可能难以被正确分类。为了解决这一问题,图像超分辨率等图像增强技术正被用于提高照片的分辨率。所提出的研究通过大幅度的图像上采样,考察图像增强方法如何影响分类性能。该研究特别关注当使用前沿技术提高图像分辨率时,深度学习模型能否通过捕捉更细微的形态变化来理解更复杂的视觉信息。由于增强后的图像被纳入训练过程,模型可以同时从标准数据和增强数据中学习。这种双重方法旨在理解图像分辨率对模型性能的影响,并提高分类准确率。我们使用一个著名的图像分类模型进行了广泛的测试,以全面评估该方法的有效性。本研究旨在通过理解普通图像与增强图像之间的权衡,为特定的白细胞数据集定制更高效的图像识别算法。
摘要:Accurately classifying white blood cells from microscopic images is essential to identify several illnesses and conditions in medical diagnostics. Many deep learning technologies are being employed to quickly and automatically classify images. However, most of the time, the resolution of these microscopic pictures is quite low, which might make it difficult to classify them correctly. Some picture improvement techniques, such as image super-resolution, are being utilized to improve the resolution of the photos to get around this issue. The suggested study uses large image dimension upscaling to investigate how picture-enhancing approaches affect classification performance. The study specifically looks at how deep learning models may be able to understand more complex visual information by capturing subtler morphological changes when image resolution is increased using cutting-edge techniques. The model may learn from standard and augmented data since the improved images are incorporated into the training process. This dual method seeks to comprehend the impact of image resolution on model performance and enhance classification accuracy. A well-known model for picture categorization is used to conduct extensive testing and thoroughly evaluate the effectiveness of this approach. This research intends to create more efficient image identification algorithms customized to a particular dataset of white blood cells by understanding the trade-offs between ordinary and enhanced images.


自动驾驶|车辆|车道检测等(2篇)

【1】Channel-Independent Federated Traffic Prediction
标题:通道无关的联邦交通流量预测
链接:https://arxiv.org/abs/2508.04517

作者: Xiaoyu Li, Bin Xu, Meng Chen, Yongshun Gong
摘要:近年来,交通预测取得了显著成就,已成为智能交通系统的重要组成部分。然而,交通数据通常分散在多个数据所有者手中,隐私约束使得无法直接利用这些孤立的数据集进行交通预测。大多数现有的联邦交通预测方法侧重于设计通信机制,使模型能够利用来自其他客户端的信息以提高预测精度。遗憾的是,这类方法往往带来大量通信开销,由此产生的传输延迟显著拖慢训练过程。随着交通数据量持续增长,这一问题变得愈发关键,使得现有方法的资源消耗难以为继。为应对这一挑战,我们为联邦交通预测提出了一种新的变量关系建模范式,称为通道无关范式(Channel-Independent Paradigm, CIP)。与传统方法不同,CIP让每个节点仅使用本地信息即可进行高效且准确的预测,从而消除了客户端间通信的需要。在CIP的基础上,我们进一步开发了高效的联邦学习框架Fed-CI,允许每个客户端独立处理自己的数据,同时有效缓解因客户端之间缺乏直接数据共享而造成的信息损失。Fed-CI显著降低了通信开销、加速了训练过程,并在遵守隐私法规的同时取得了最先进的性能。在多个真实数据集上的大量实验表明,Fed-CI在所有数据集和联邦设置下均优于现有方法,在RMSE、MAE和MAPE上分别提升了8%、14%和16%,同时大幅降低了通信成本。
摘要:In recent years, traffic prediction has achieved remarkable success and has become an integral component of intelligent transportation systems. However, traffic data is typically distributed among multiple data owners, and privacy constraints prevent the direct utilization of these isolated datasets for traffic prediction. Most existing federated traffic prediction methods focus on designing communication mechanisms that allow models to leverage information from other clients in order to improve prediction accuracy. Unfortunately, such approaches often incur substantial communication overhead, and the resulting transmission delays significantly slow down the training process. As the volume of traffic data continues to grow, this issue becomes increasingly critical, making the resource consumption of current methods unsustainable. To address this challenge, we propose a novel variable relationship modeling paradigm for federated traffic prediction, termed the Channel-Independent Paradigm(CIP). Unlike traditional approaches, CIP eliminates the need for inter-client communication by enabling each node to perform efficient and accurate predictions using only local information. Based on the CIP, we further develop Fed-CI, an efficient federated learning framework, allowing each client to process its own data independently while effectively mitigating the information loss caused by the lack of direct data sharing among clients. Fed-CI significantly reduces communication overhead, accelerates the training process, and achieves state-of-the-art performance while complying with privacy regulations. Extensive experiments on multiple real-world datasets demonstrate that Fed-CI consistently outperforms existing methods across all datasets and federated settings. It achieves improvements of 8%, 14%, and 16% in RMSE, MAE, and MAPE, respectively, while also substantially reducing communication costs.


【2】Segment Any Vehicle: Semantic and Visual Context Driven SAM and A Benchmark
标题:分割任意车辆:语义与视觉上下文驱动的SAM及基准
链接:https://arxiv.org/abs/2508.04260

作者:, Ziwen Wang, Wentao Wu, Anjie Wang, Jiashu Wu, Yantao Pan, Chenglong Li
摘要:随着自动驾驶的快速发展,车辆感知,特别是检测和分割,对算法性能提出了越来越高的要求。预训练的大型分割模型,特别是Segment Anything Model(SAM),已经引起了人们的极大兴趣,并激发了人工智能的新研究方向。然而,SAM不能直接应用于车辆部件分割的细粒度任务,因为其文本提示分割功能并未公开,而其默认模式生成的掩码区域缺乏语义标签,限制了其在结构化的、特定类别的分割任务中的实用性。为了解决这些限制,我们提出了SAV,这是一种新型框架,包括三个核心组件:基于SAM的编码器-解码器、车辆部件知识图和上下文样本检索编码模块。知识图通过结构化本体显式地对车辆部件之间的空间和几何关系进行建模,有效编码先验结构知识。同时,上下文检索模块通过从训练数据中识别并利用视觉上相似的车辆实例来增强分割,为更好的泛化提供丰富的上下文先验。此外,我们还提出了一个新的车辆部件分割大规模基准数据集,名为VehicleSeg10K,其中包含11,665个跨不同场景和视点的高质量像素级标注。我们在该数据集和另外两个数据集上进行了全面的实验,对多个代表性基线进行了基准测试,为未来的研究和比较奠定了坚实基础。本文的数据集和源代码将在https://github.com/Event-AHU/SAV上发布。
摘要 :With the rapid advancement of autonomous driving, vehicle perception, particularly detection and segmentation, has placed increasingly higher demands on algorithmic performance. Pre-trained large segmentation models, especially Segment Anything Model (SAM), have sparked significant interest and inspired new research directions in artificial intelligence. However, SAM cannot be directly applied to the fine-grained task of vehicle part segmentation, as its text-prompted segmentation functionality is not publicly accessible, and the mask regions generated by its default mode lack semantic labels, limiting its utility in structured, category-specific segmentation tasks. To address these limitations, we propose SAV, a novel framework comprising three core components: a SAM-based encoder-decoder, a vehicle part knowledge graph, and a context sample retrieval encoding module. The knowledge graph explicitly models the spatial and geometric relationships among vehicle parts through a structured ontology, effectively encoding prior structural knowledge. Meanwhile, the context retrieval module enhances segmentation by identifying and leveraging visually similar vehicle instances from training data, providing rich contextual priors for improved generalization. Furthermore, we introduce a new large-scale benchmark dataset for vehicle part segmentation, named VehicleSeg10K, which contains 11,665 high-quality pixel-level annotations across diverse scenes and viewpoints. We conduct comprehensive experiments on this dataset and two other datasets, benchmarking multiple representative baselines to establish a solid foundation for future research and comparison. Both the dataset and source code of this paper will be released on https://github.com/Event-AHU/SAV


联邦学习|隐私保护|加密(3篇)

【1】FedHiP: Heterogeneity-Invariant Personalized Federated Learning Through Closed-Form Solutions
标题:FedHiP:通过闭式解实现异构不变的个性化联邦学习
链接:https://arxiv.org/abs/2508.04470

作者:Tang, Zhirui Yang, Jingchao Wang, Kejia Fan, Jinfeng Xu, Huiping Zhuang, Anfeng Liu, Houbing Herbert Song, Leye Wang, Yunhuai Liu
备注:11 pages, 5 figures, 3 tables
摘要:近来,个性化联邦学习(PFL)已成为一种流行范式,通过协作训练提供个性化模型,同时适配每个客户端的本地应用。由于客户端之间普遍存在的数据异构性(即非IID数据),现有PFL方法通常面临严重挑战,收敛受阻且性能下降。我们发现问题的根源在于长期依赖基于梯度的更新,而这种更新天然对非IID数据敏感。为从根本上解决这一问题并弥补研究空白,本文提出了一种异构不变的个性化联邦学习方案FedHiP,通过解析(即闭式)解来避免基于梯度的更新。具体而言,我们顺应自监督预训练的趋势,以基础模型为冻结骨干进行无梯度特征提取;在特征提取器之后,我们进一步开发了用于无梯度训练的解析分类器。为同时支持群体泛化与个体个性化,FedHiP方案包含三个阶段:解析本地训练、解析全局聚合和解析本地个性化。FedHiP方案的闭式解带来了理想的"异构不变性"性质:无论数据在所有其他客户端之间如何分布,每个个性化模型都保持不变。在基准数据集上的大量实验验证了FedHiP方案的优越性,其准确率至少超过最先进基线5.79%-20.97%。
摘要:Lately, Personalized Federated Learning (PFL) has emerged as a prevalent paradigm to deliver personalized models by collaboratively training while simultaneously adapting to each client's local applications. Existing PFL methods typically face a significant challenge due to the ubiquitous data heterogeneity (i.e., non-IID data) across clients, which severely hinders convergence and degrades performance. We identify that the root issue lies in the long-standing reliance on gradient-based updates, which are inherently sensitive to non-IID data. To fundamentally address this issue and bridge the research gap, in this paper, we propose a Heterogeneity-invariant Personalized Federated learning scheme, named FedHiP, through analytical (i.e., closed-form) solutions to avoid gradient-based updates. Specifically, we exploit the trend of self-supervised pre-training, leveraging a foundation model as a frozen backbone for gradient-free feature extraction. Following the feature extractor, we further develop an analytic classifier for gradient-free training. To support both collective generalization and individual personalization, our FedHiP scheme incorporates three phases: analytic local training, analytic global aggregation, and analytic local personalization. The closed-form solutions of our FedHiP scheme enable its ideal property of heterogeneity invariance, meaning that each personalized model remains identical regardless of how non-IID the data are distributed across all other clients. Extensive experiments on benchmark datasets validate the superiority of our FedHiP scheme, outperforming the state-of-the-art baselines by at least 5.79%-20.97% in accuracy.
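"解析(闭式)解"在线性分类头上对应岭回归,其联邦聚合只需对各客户端的 Gram 统计量求和,而求和与数据在客户端间的划分方式无关——这正是"异构不变性"的直观来源。下面是一个假设性示意(特征维度与数据均为举例):

```python
import numpy as np

def local_stats(X, Y):
    """客户端本地只需上传两个统计量 (无梯度, 一次通信)。
    X: (n, d) 冻结骨干提取的特征; Y: (n, C) one-hot 标签。"""
    return X.T @ X, X.T @ Y

def analytic_aggregate(stats, lam=1e-2):
    """服务端闭式聚合: W = (sum_k X_k^T X_k + lam*I)^(-1) * sum_k X_k^T Y_k。
    求和对客户端划分方式不变, 故结果不受 non-IID 划分影响。"""
    G = sum(s[0] for s in stats)
    B = sum(s[1] for s in stats)
    return np.linalg.solve(G + lam * np.eye(G.shape[0]), B)

clients = [local_stats(np.random.randn(50, 16),
                       np.eye(4)[np.random.randint(0, 4, 50)])
           for _ in range(5)]
W = analytic_aggregate(clients)          # (16, 4) 的全局线性分类器权重
```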


【2】FeDaL: Federated Dataset Learning for Time Series Foundation Models
标题:FeDaL:时间序列基础模型的联邦数据集学习
链接:https://arxiv.org/abs/2508.04045

作者: Chen, Guodong Long, Jing Jiang
备注:28 pages, scaling FL to time series foundation models
摘要:数据集层面的异质性引入了显著的领域偏差,从根本上削弱了时间序列基础模型(TSFM)的泛化能力,但这一挑战仍未得到充分探索。本文以联邦学习的范式重新思考TSFM的构建。我们提出了一种新颖的联邦数据集学习(FeDaL)方法,通过学习与数据集无关的时序表示来应对异构时间序列。具体而言,联邦学习的分布式架构天然适合将异构时间序列数据集分解为共享的泛化知识与保留的个性化知识。此外,基于TSFM架构,FeDaL通过添加两个互补机制——领域偏差消除(DBE)和全局偏差消除(GBE)——显式缓解局部与全局偏差。我们在覆盖8类任务(包括表示学习与下游时间序列分析)的真实数据集上,对照54个基线广泛评估了FeDaL的跨数据集泛化能力。我们还进一步分析了联邦扩展行为,展示了数据量、客户端数量和加入率在去中心化条件下如何影响模型性能。
摘要:Dataset-wise heterogeneity introduces significant domain biases that fundamentally degrade generalization on Time Series Foundation Models (TSFMs), yet this challenge remains underexplored. This paper rethink the development of TSFMs using the paradigm of federated learning. We propose a novel Federated Dataset Learning (FeDaL) approach to tackle heterogeneous time series by learning dataset-agnostic temporal representations. Specifically, the distributed architecture of federated learning is a nature solution to decompose heterogeneous TS datasets into shared generalized knowledge and preserved personalized knowledge. Moreover, based on the TSFM architecture, FeDaL explicitly mitigates both local and global biases by adding two complementary mechanisms: Domain Bias Elimination (DBE) and Global Bias Elimination (GBE). FeDaL`s cross-dataset generalization has been extensively evaluated in real-world datasets spanning eight tasks, including both representation learning and downstream time series analysis, against 54 baselines. We further analyze federated scaling behavior, showing how data volume, client count, and join rate affect model performance under decentralization.


【3】Decoupled Contrastive Learning for Federated Learning
标题:面向联邦学习的解耦对比学习
链接:https://arxiv.org/abs/2508.04005

作者:Kim, Incheol Baek, Yon Dohn Chung
摘要:联邦学习是一种分布式机器学习范式,允许多个参与方通过交换模型更新而非原始数据来训练共享模型。然而,由于客户端之间的数据异构性,其性能相比集中式方法有所下降。虽然对比学习已成为缓解这一问题的有前景的方法,但我们的理论分析揭示了一个根本冲突:其关于负样本数量趋于无穷的渐近假设,在联邦学习的有限样本场景中并不成立。为解决这一问题,我们提出了面向联邦学习的解耦对比学习(DCFL),一个将现有对比损失解耦为两个目标的新框架。将损失解耦为对齐项与均匀项后,即可在不依赖渐近假设的前提下独立校准吸引力与排斥力。这一策略提供了一种适用于每个客户端仅有少量数据的联邦学习环境的对比学习方法。实验结果显示,与现有对比学习方法相比,DCFL实现了更强的正样本间对齐和更好的负样本间均匀性。此外,在CIFAR-10、CIFAR-100和Tiny-ImageNet等标准基准上的实验结果表明,DCFL持续优于最先进的联邦学习方法。
摘要 :Federated learning is a distributed machine learning paradigm that allows multiple participants to train a shared model by exchanging model updates instead of their raw data. However, its performance is degraded compared to centralized approaches due to data heterogeneity across clients. While contrastive learning has emerged as a promising approach to mitigate this, our theoretical analysis reveals a fundamental conflict: its asymptotic assumptions of an infinite number of negative samples are violated in finite-sample regime of federated learning. To address this issue, we introduce Decoupled Contrastive Learning for Federated Learning (DCFL), a novel framework that decouples the existing contrastive loss into two objectives. Decoupling the loss into its alignment and uniformity components enables the independent calibration of the attraction and repulsion forces without relying on the asymptotic assumptions. This strategy provides a contrastive learning method suitable for federated learning environments where each client has a small amount of data. Our experimental results show that DCFL achieves stronger alignment between positive samples and greater uniformity between negative samples compared to existing contrastive learning methods. Furthermore, experimental results on standard benchmarks, including CIFAR-10, CIFAR-100, and Tiny-ImageNet, demonstrate that DCFL consistently outperforms state-of-the-art federated learning methods.
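解耦后的两个目标对应对比学习文献中标准的对齐项与均匀项,其常见形式如下(假设性示意,DCFL 的具体校准方式见原文):

```python
import torch
import torch.nn.functional as F

def align_loss(x, y, alpha=2):
    """对齐项: 拉近正样本对 (x_i, y_i) 的归一化表示。"""
    return (x - y).norm(p=2, dim=1).pow(alpha).mean()

def uniform_loss(x, t=2):
    """均匀项: 推开所有样本, 使表示在单位超球面上趋于均匀分布。"""
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()

x = F.normalize(torch.randn(128, 64), dim=1)          # 锚点表示
y = F.normalize(x + 0.1 * torch.randn_like(x), dim=1) # 对应正样本表示
loss = align_loss(x, y) + 0.5 * uniform_loss(x)       # 两项权重可独立校准
```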


推理|分析|理解|解释(8篇)

【1】Privacy Risk Predictions Based on Fundamental Understanding of Personal Data and an Evolving Threat Landscape
标题:基于对个人数据和不断变化的威胁格局的基本理解的隐私风险预测
链接:https://arxiv.org/abs/2508.04542

作者:u, K. Suzanne Barber
备注:8 pages, 9 figures, 1 table
摘要:个人和组织如果不对相对隐私风险有基本的了解,就很难保护个人信息。通过分析5,000多个经验性身份盗窃和欺诈案例,本研究确定了哪些类型的个人数据被暴露、暴露发生的频率以及这些暴露的后果。我们构建了一个身份生态系统图——一个基础的、基于图的模型,其中节点表示个人可识别信息(PII)属性,边表示它们之间的经验披露关系(例如,一个PII属性由于另一个PII属性的暴露而暴露的概率)。利用这种图结构,我们开发了一个隐私风险预测框架,该框架使用图论和图神经网络来估计某些PII属性受损时进一步披露的可能性。结果表明,我们的方法能够有效回答核心问题:某一身份属性的泄露是否可能进而导致另一属性的泄露?
摘要:It is difficult for individuals and organizations to protect personal information without a fundamental understanding of relative privacy risks. By analyzing over 5,000 empirical identity theft and fraud cases, this research identifies which types of personal data are exposed, how frequently exposures occur, and what the consequences of those exposures are. We construct an Identity Ecosystem graph--a foundational, graph-based model in which nodes represent personally identifiable information (PII) attributes and edges represent empirical disclosure relationships between them (e.g., the probability that one PII attribute is exposed due to the exposure of another). Leveraging this graph structure, we develop a privacy risk prediction framework that uses graph theory and graph neural networks to estimate the likelihood of further disclosures when certain PII attributes are compromised. The results show that our approach effectively answers the core question: Can the disclosure of a given identity attribute possibly lead to the disclosure of another attribute?
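在这样的"身份生态系统图"上,可以先用一个简单的 noisy-OR 概率传播基线来理解风险估计(假设性示意,边权与属性均为举例;原文还使用了图神经网络):

```python
def disclosure_risk(adj, exposed, n_rounds=3):
    """adj[u][v] = 已知属性 u 暴露时 v 随之暴露的经验概率 (图的边权)。
    以各来源独立的 noisy-OR 近似, 迭代传播暴露风险。"""
    risk = {u: (1.0 if u in exposed else 0.0) for u in adj}
    for _ in range(n_rounds):
        new = {}
        for v in adj:
            if v in exposed:
                new[v] = 1.0
                continue
            p_safe = 1.0
            for u in adj:
                p_safe *= 1.0 - risk[u] * adj[u].get(v, 0.0)
            new[v] = 1.0 - p_safe
        risk = new
    return risk

adj = {"email": {"phone": 0.4, "name": 0.7}, "phone": {"address": 0.3},
       "name": {"address": 0.2}, "address": {}}
print(disclosure_risk(adj, exposed={"email"}))   # email 泄露后各属性的暴露概率
```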


【2】From Split to Share: Private Inference with Distributed Feature Sharing
标题:从拆分到共享:基于分布式特征共享的私有推理
链接:https://arxiv.org/abs/2508.04346

作者:, Jiayi Wen, Shouhong Tan, Zhirun Zheng, Cheng Huang
摘要:基于云的机器学习即服务(MLaaS)在处理敏感客户数据时会引发严重的隐私担忧。现有的私有推理(PI)方法面临隐私与效率之间的根本权衡:密码学方法提供强保护但计算开销高昂,而拆分推理等高效替代方案则会将中间特征暴露于反演攻击之下。我们提出了PrivDFS,一种用分布式特征共享取代单一暴露表示的私有推理新范式。PrivDFS在客户端将输入特征划分为多个均衡的份额,并分发给互不合谋、互不通信的服务器进行独立的部分推理;客户端安全地聚合各服务器的输出以重构最终预测,确保任何单个服务器都观察不到足以危及输入隐私的信息。为进一步加强隐私,我们提出两个关键扩展:PrivDFS-AT利用基于扩散模型的代理攻击者进行对抗训练,以强制实现抗反演的特征划分;PrivDFS-KD利用用户专属密钥使划分策略多样化,防止基于查询的反演泛化。在CIFAR-10和CelebA上的实验表明,PrivDFS在不损失精度的情况下,实现了与深度拆分推理相当的隐私性,同时将客户端计算量最多降低100倍,且两个扩展对基于扩散的分布内攻击和自适应攻击均保持鲁棒。
摘要:Cloud-based Machine Learning as a Service (MLaaS) raises serious privacy concerns when handling sensitive client data. Existing Private Inference (PI) methods face a fundamental trade-off between privacy and efficiency: cryptographic approaches offer strong protection but incur high computational overhead, while efficient alternatives such as split inference expose intermediate features to inversion attacks. We propose PrivDFS, a new paradigm for private inference that replaces a single exposed representation with distributed feature sharing. PrivDFS partitions input features on the client into multiple balanced shares, which are distributed to non-colluding, non-communicating servers for independent partial inference. The client securely aggregates the servers' outputs to reconstruct the final prediction, ensuring that no single server observes sufficient information to compromise input privacy. To further strengthen privacy, we propose two key extensions: PrivDFS-AT, which uses adversarial training with a diffusion-based proxy attacker to enforce inversion-resistant feature partitioning, and PrivDFS-KD, which leverages user-specific keys to diversify partitioning policies and prevent query-based inversion generalization. Experiments on CIFAR-10 and CelebA demonstrate that PrivDFS achieves privacy comparable to deep split inference while cutting client computation by up to 100 times with no accuracy loss, and that the extensions remain robust against both diffusion-based in-distribution and adaptive attacks.
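对线性运算而言,"特征拆分为多份、各服务器独立计算部分结果、客户端求和复原"可以严格成立;下面的加性份额示意有助于理解这一范式(假设性示例:PrivDFS 对非线性网络采用的是均衡特征划分而非简单加性秘密分享):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((10, 32))          # 服务器侧的线性模型 (各服务器持有相同权重)
x = rng.standard_normal(32)                # 客户端的私有输入特征

# 客户端: 把 x 拆成 3 份加性份额, 任一单独份额都近似随机噪声
s1, s2 = rng.standard_normal(32), rng.standard_normal(32)
shares = [s1, s2, x - s1 - s2]

# 各服务器独立做部分推理, 互不通信、互不合谋
partials = [W @ s for s in shares]

# 客户端聚合, 得到与明文推理完全一致的结果
assert np.allclose(sum(partials), W @ x)
```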


【3】A Visual Tool for Interactive Model Explanation using Sensitivity Analysis
标题:使用敏感性分析进行交互式模型解释的可视化工具
链接:https://arxiv.org/abs/2508.04269

作者:chuler
备注:11 pages, 3 figures, This work is a preprint version of a paper currently in preparation with co-authors
摘要:我们提出了SAInT,一个基于Python的工具,用于通过集成的局部和全局敏感性分析来可视化地探索和理解机器学习(ML)模型的行为。我们的系统支持人在回路(HITL)工作流程,使用户(包括人工智能研究人员和领域专家)能够通过交互式图形界面配置、训练、评估和解释模型,而无需编程。该工具自动化模型训练和选择,使用基于方差的敏感性分析提供全局特征归因,并通过LIME和SHAP提供逐实例解释。我们在泰坦尼克号数据集的生存预测分类任务上演示了该系统,并展示了敏感性信息如何指导特征选择与数据精炼。
摘要:We present SAInT, a Python-based tool for visually exploring and understanding the behavior of Machine Learning (ML) models through integrated local and global sensitivity analysis. Our system supports Human-in-the-Loop (HITL) workflows by enabling users - both AI researchers and domain experts - to configure, train, evaluate, and explain models through an interactive graphical interface without programming. The tool automates model training and selection, provides global feature attribution using variance-based sensitivity analysis, and offers per-instance explanation via LIME and SHAP. We demonstrate the system on a classification task predicting survival on the Titanic dataset and show how sensitivity information can guide feature selection and data refinement.


【4】Causal Reward Adjustment: Mitigating Reward Hacking in External Reasoning via Backdoor Correction
标题:因果奖励调整:通过后门修正减轻外部推理中的奖励黑客行为
链接:https://arxiv.org/abs/2508.04216

作者:g, Zeen Song, Huijie Guo, Wenwen Qiang
摘要:外部推理系统将语言模型与过程奖励模型(PRM)相结合,为数学问题解决等复杂任务选择高质量的推理路径。然而,这些系统容易发生奖励黑客攻击,其中高分但逻辑上不正确的路径被PRM分配高分,导致错误的答案。从因果推理的角度来看,我们将这种现象主要归因于令人混淆的语义特征的存在。为了解决这个问题,我们提出了因果奖励调整(CRA),一种通过估计推理路径的真实奖励来减轻奖励黑客攻击的方法。CRA在PRM的内部激活上训练稀疏自编码器以恢复可解释的特征,然后通过使用后门调整来纠正混淆。数学求解数据集上的实验表明,CRA减轻了奖励黑客攻击,提高了最终的准确性,而无需修改策略模型或重新训练PRM。
摘要:External reasoning systems combine language models with process reward models (PRMs) to select high-quality reasoning paths for complex tasks such as mathematical problem solving. However, these systems are prone to reward hacking, where high-scoring but logically incorrect paths are assigned high scores by the PRMs, leading to incorrect answers. From a causal inference perspective, we attribute this phenomenon primarily to the presence of confounding semantic features. To address it, we propose Causal Reward Adjustment (CRA), a method that mitigates reward hacking by estimating the true reward of a reasoning path. CRA trains sparse autoencoders on the PRM's internal activations to recover interpretable features, then corrects confounding by using backdoor adjustment. Experiments on math solving datasets demonstrate that CRA mitigates reward hacking and improves final accuracy, without modifying the policy model or retraining PRM.
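离散情形下的后门调整公式非常简短,可直接写成代码(假设性示意,CRA 中的混杂变量 Z 来自稀疏自编码器恢复的特征):

```python
import numpy as np

def backdoor_adjust(p_y_given_xz, p_z):
    """后门调整: P(y | do(x)) = sum_z P(y | x, z) * P(z)。
    p_y_given_xz: shape (X, Z, Y) 的条件分布表; p_z: shape (Z,) 的混杂变量边缘分布。"""
    return np.einsum("xzy,z->xy", p_y_given_xz, p_z)

p_y_given_xz = np.random.dirichlet(np.ones(2), size=(2, 3))   # (X=2, Z=3, Y=2)
p_z = np.array([0.5, 0.3, 0.2])
print(backdoor_adjust(p_y_given_xz, p_z))   # 每行是 P(y|do(x)), 行和为 1
```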


【5】Fast and Accurate Explanations of Distance-Based Classifiers by Uncovering Latent Explanatory Structures
标题:通过揭示潜在解释结构快速准确地解释基于距离的分类器
链接:https://arxiv.org/abs/2508.03913

作者:ley, Jacob Kauffmann, Simon León Krug, Klaus-Robert Müller, Grégoire Montavon
摘要:基于距离的分类器,如k-最近邻和支持向量机,仍然是机器学习的主力,广泛用于科学和工业。在实践中,为了从这些模型中获得见解,确保其预测可解释也很重要。虽然可解释人工智能领域提供了原则上适用于任何模型的方法,但它也强调了潜在结构(例如神经网络中的层序列)在产生解释方面的有用性。在本文中,我们通过揭示基于距离的分类器中隐藏的神经网络结构(由线性检测单元与非线性池化层组成)做出贡献,在此基础上,逐层相关性传播(LRP)等可解释人工智能技术变得适用。通过定量评估,我们证明了新解释方法相对于多个基线的优势。我们还通过两个实际用例展示了解释基于距离的模型的整体有用性。
摘要:Distance-based classifiers, such as k-nearest neighbors and support vector machines, continue to be a workhorse of machine learning, widely used in science and industry. In practice, to derive insights from these models, it is also important to ensure that their predictions are explainable. While the field of Explainable AI has supplied methods that are in principle applicable to any model, it has also emphasized the usefulness of latent structures (e.g. the sequence of layers in a neural network) to produce explanations. In this paper, we contribute by uncovering a hidden neural network structure in distance-based classifiers (consisting of linear detection units combined with nonlinear pooling layers) upon which Explainable AI techniques such as layer-wise relevance propagation (LRP) become applicable. Through quantitative evaluations, we demonstrate the advantage of our novel explanation approach over several baselines. We also show the overall usefulness of explaining distance-based models through two practical use cases.


【6】Revisiting Heat Flux Analysis of Tungsten Monoblock Divertor on EAST using Physics-Informed Neural Network
标题:使用物理信息神经网络重新审视EAST钨整体式偏滤器的热通量分析
链接:https://arxiv.org/abs/2508.03776

作者:, Zikang Yan, Hao Si, Zhendong Yang, Qingquan Yang, Dengdi Sun, Wanli Lyu, Jin Tang
摘要:估算核聚变装置EAST中的热通量是一项至关重要的任务。传统的科学计算方法通常使用有限元方法(FEM)来模拟这个过程。然而,有限元法依赖于基于网格的采样计算,计算效率低,难以在实际实验中进行实时模拟。受人工智能驱动的科学计算的启发,本文提出了一种新的物理信息神经网络(PINN)来应对这一挑战,在保持高精度的同时显著加快了热传导估计过程。具体来说,针对不同材料的输入,我们首先将空间坐标和时间戳输入神经网络,并根据热传导方程计算边界损失、初始条件损失和物理损失。此外,我们以数据驱动的方式对少量数据点进行采样,以更好地拟合特定的热传导场景,进一步增强模型的预测能力。我们在顶面均匀和非均匀加热两种条件下进行了实验。实验结果表明,所提出的热传导物理信息神经网络达到了与有限元方法相当的精度,同时在计算效率上实现了40倍加速。数据集和源代码将在https://github.com/Event-AHU/OpenFusion上发布。
摘要:Estimating heat flux in the nuclear fusion device EAST is a critically important task. Traditional scientific computing methods typically model this process using the Finite Element Method (FEM). However, FEM relies on grid-based sampling for computation, which is computationally inefficient and hard to perform real-time simulations during actual experiments. Inspired by artificial intelligence-powered scientific computing, this paper proposes a novel Physics-Informed Neural Network (PINN) to address this challenge, significantly accelerating the heat conduction estimation process while maintaining high accuracy. Specifically, given inputs of different materials, we first feed spatial coordinates and time stamps into the neural network, and compute boundary loss, initial condition loss, and physical loss based on the heat conduction equation. Additionally, we sample a small number of data points in a data-driven manner to better fit the specific heat conduction scenario, further enhancing the model's predictive capability. We conduct experiments under both uniform and non-uniform heating conditions on the top surface. Experimental results show that the proposed thermal conduction physics-informed neural network achieves accuracy comparable to the finite element method, while achieving $\times$40 times acceleration in computational efficiency. The dataset and source code will be released on https://github.com/Event-AHU/OpenFusion.
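PINN 的核心在于把方程残差写进损失;以一维热传导方程 u_t = α·u_xx 为例,物理损失项可如下构造(假设性示意,原文处理的是多材料与更复杂的边界/初值条件):

```python
import torch

net = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.Tanh(),
                          torch.nn.Linear(64, 64), torch.nn.Tanh(),
                          torch.nn.Linear(64, 1))
alpha = 0.1    # 假设的热扩散系数

def physics_loss(n=256):
    xt = torch.rand(n, 2, requires_grad=True)          # 采样配点, 列为 (x, t)
    u = net(xt)
    grads = torch.autograd.grad(u, xt, torch.ones_like(u), create_graph=True)[0]
    u_x, u_t = grads[:, :1], grads[:, 1:]
    u_xx = torch.autograd.grad(u_x, xt, torch.ones_like(u_x),
                               create_graph=True)[0][:, :1]
    return ((u_t - alpha * u_xx) ** 2).mean()          # 热传导方程残差

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(100):     # 实际训练还需叠加边界损失与初始条件损失
    opt.zero_grad()
    loss = physics_loss()
    loss.backward()
    opt.step()
```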


【7】Data-Driven Discovery of Mobility Periodicity for Understanding Urban Transportation Systems
标题:数据驱动的移动周期性发现以了解城市交通系统
链接:https://arxiv.org/abs/2508.03747

作者:n, Qi Wang, Yunhan Zheng, Nina Cao, HanQin Cai, Jinhua Zhao
摘要:揭示人类流动的时间规律对于发现城市动态至关重要,并对各种决策过程和城市系统应用产生影响。本研究将复杂多维人类流动数据中的周期性量化问题表述为时间序列自回归中显性正自相关的稀疏识别,允许人们从数据驱动和可解释的机器学习角度发现和量化重要的周期性模式,例如每周周期性。我们将我们的框架应用于现实世界的人类移动数据,包括中国杭州的地铁客流以及纽约市和美国芝加哥的拼车旅行,揭示了过去几年不同空间位置的可解释的每周周期性。特别是,我们对2019年至2024年共乘数据的分析显示,COVID-19大流行对出行规律和随后的复苏趋势产生了破坏性影响,凸显了纽约市和芝加哥之间复苏模式百分比和速度的差异。我们发现,纽约市和芝加哥在2020年都经历了每周周期性的显著减少,纽约市的流动规律性恢复比芝加哥快。稀疏自回归的可解释性提供了对人类流动的潜在时间模式的见解,为理解城市系统提供了一个有价值的工具。我们的研究结果强调了可解释机器学习从现实世界的移动数据中解锁关键见解的潜力。
摘要 :Uncovering the temporal regularity of human mobility is crucial for discovering urban dynamics and has implications for various decision-making processes and urban system applications. This study formulates the periodicity quantification problem in complex and multidimensional human mobility data as a sparse identification of dominant positive auto-correlations in time series autoregression, allowing one to discover and quantify significant periodic patterns such as weekly periodicity from a data-driven and interpretable machine learning perspective. We apply our framework to real-world human mobility data, including metro passenger flow in Hangzhou, China and ridesharing trips in New York City (NYC) and Chicago, USA, revealing the interpretable weekly periodicity across different spatial locations over past several years. In particular, our analysis of ridesharing data from 2019 to 2024 demonstrates the disruptive impact of the COVID-19 pandemic on mobility regularity and the subsequent recovery trends, highlighting differences in the recovery pattern percentages and speeds between NYC and Chicago. We explore that both NYC and Chicago experienced a remarkable reduction of weekly periodicity in 2020, and the recovery of mobility regularity in NYC is faster than Chicago. The interpretability of sparse autoregression provides insights into the underlying temporal patterns of human mobility, offering a valuable tool for understanding urban systems. Our findings highlight the potential of interpretable machine learning to unlock crucial insights from real-world mobility data.
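"稀疏识别自回归中占主导的正自相关"可以用带非负约束的 Lasso 近似实现;下面的假设性示意在含周周期的合成序列上恢复滞后 7 的正系数:

```python
import numpy as np
from sklearn.linear_model import Lasso

def dominant_lags(series, max_lag=14, alpha=0.05):
    """拟合稀疏非负自回归, 返回非零系数对应的滞后阶 (如 7 => 周周期)。"""
    X = np.column_stack([series[max_lag - k:-k] for k in range(1, max_lag + 1)])
    y = series[max_lag:]
    model = Lasso(alpha=alpha, positive=True).fit(X, y)   # 只保留显著的正自相关
    return {k + 1: c for k, c in enumerate(model.coef_) if c > 1e-6}

t = np.arange(400, dtype=float)
series = 10 + 3 * np.sin(2 * np.pi * t / 7) + 0.3 * np.random.randn(400)
print(dominant_lags(series))   # 预期在滞后 7 (或其倍数) 处出现显著正系数
```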


【8】Understanding Human Daily Experience Through Continuous Sensing: ETRI Lifelog Dataset 2024
标题:通过连续感知了解人类日常体验:ETRI生命日志数据集2024
链接:https://arxiv.org/abs/2508.03698

作者:, Hyuntae Jeong, Seungeun Chung, Jeong Mook Lim, Kyoung Ju Noh, Sunkyung Lee, Gyuwon Jung
备注:This work is intended for submission to an IEEE conference. The content is also relevant to the cs.HC category
摘要:改善人类健康和福祉需要准确有效地了解个人在日常生活中的身心状态。为了支持这一目标,我们利用智能手机、智能手表和睡眠传感器每天24小时被动地连续收集数据,对参与者的日常行为干扰最小,使我们能够收集多天的日常行为和睡眠活动的定量数据。此外,我们还通过睡前和睡前立即进行的调查,收集了参与者对疲劳、压力和睡眠质量的主观自我报告。这个全面的生活日志数据集预计将为探索人类日常生活和生活方式模式的有意义的见解提供基础资源,并且部分数据已被匿名并公开以供进一步研究。在本文中,我们介绍了ETRI Lifelog Dataset 2024,详细介绍了它的结构并展示了潜在的应用,例如使用机器学习模型来预测睡眠质量和压力。
摘要:Improving human health and well-being requires an accurate and effective understanding of an individual's physical and mental state throughout daily life. To support this goal, we utilized smartphones, smartwatches, and sleep sensors to collect data passively and continuously for 24 hours a day, with minimal interference to participants' usual behavior, enabling us to gather quantitative data on daily behaviors and sleep activities across multiple days. Additionally, we gathered subjective self-reports of participants' fatigue, stress, and sleep quality through surveys conducted immediately before and after sleep. This comprehensive lifelog dataset is expected to provide a foundational resource for exploring meaningful insights into human daily life and lifestyle patterns, and a portion of the data has been anonymized and made publicly available for further research. In this paper, we introduce the ETRI Lifelog Dataset 2024, detailing its structure and presenting potential applications, such as using machine learning models to predict sleep quality and stress.


检测相关(4篇)

【1】CaPulse: Detecting Anomalies by Tuning in to the Causal Rhythms of Time Series
标题:CaPulse:通过契合时间序列的因果脉动来检测异常
链接:https://arxiv.org/abs/2508.04630

作者:a, Yingying Zhang, Yuxuan Liang, Lunting Fan, Qingsong Wen, Roger Zimmermann
摘要:时间序列异常检测在不同领域获得了广泛关注,但现有方法往往无法捕捉时间序列数据中异常产生背后的机制。此外,时间序列异常检测通常面临若干数据层面的固有挑战,即标签稀缺、数据不平衡和复杂的多周期性。在本文中,我们借助因果工具,引入了一个新的基于因果的框架CaPulse,它通过契合时间序列数据背后的因果脉动来有效检测异常。具体而言,我们首先建立一个结构因果模型来解析异常背后的生成过程。针对数据带来的挑战,我们提出了带有新颖掩码机制和精心设计的周期学习器的周期性规范化流(Periodical Normalizing Flows),构建了一种周期感知、基于密度的异常检测方法。在七个真实数据集上的大量实验表明,CaPulse持续优于现有方法,AUROC提升3%至17%,并具有更强的可解释性。
摘要:Time series anomaly detection has garnered considerable attention across diverse domains. While existing methods often fail to capture the underlying mechanisms behind anomaly generation in time series data. In addition, time series anomaly detection often faces several data-related inherent challenges, i.e., label scarcity, data imbalance, and complex multi-periodicity. In this paper, we leverage causal tools and introduce a new causality-based framework, CaPulse, which tunes in to the underlying causal pulse of time series data to effectively detect anomalies. Concretely, we begin by building a structural causal model to decipher the generation processes behind anomalies. To tackle the challenges posed by the data, we propose Periodical Normalizing Flows with a novel mask mechanism and carefully designed periodical learners, creating a periodicity-aware, density-based anomaly detection approach. Extensive experiments on seven real-world datasets demonstrate that CaPulse consistently outperforms existing methods, achieving AUROC improvements of 3% to 17%, with enhanced interpretability.


【2】Argumentative Debates for Transparent Bias Detection [Technical Report]
标题:用于透明偏见检测的论辩式辩论[技术报告]
链接:https://arxiv.org/abs/2508.04511

作者:obi, Nico Potyka, Anna Rapberger, Francesca Toni
摘要:随着人工智能系统在社会中的使用越来越多,解决数据中出现的或被模型学到的潜在偏见,对于防止特定群体遭受系统性劣势至关重要。文献中已经提出了若干(不)公平的概念,以及相应的检测和缓解不公平的算法方法,但除极少数例外,这些工作往往忽略了透明度。实际上,鉴于公平性以人为本的特质,可解释性与可说明性是算法公平性的核心要求,其重要性甚至超过其他算法问题。在本文中,我们提出了一种新的可解释、可说明的偏见检测方法:它依赖于关于"是否存在针对个体的偏见"的辩论,辩论基于个体及其邻域中其他个体的受保护特征取值。我们的方法建立在形式论证与计算论证技术之上,辩论产生于对邻域内部及跨邻域偏见的论辩。我们对该方法进行了形式化、定量和定性的评估,突出了其相对于基线的性能优势,以及其可解释性与可说明性。
摘要:As the use of AI systems in society grows, addressing potential biases that emerge from data or are learned by models is essential to prevent systematic disadvantages against specific groups. Several notions of (un)fairness have been proposed in the literature, alongside corresponding algorithmic methods for detecting and mitigating unfairness, but, with very few exceptions, these tend to ignore transparency. Instead, interpretability and explainability are core requirements for algorithmic fairness, even more so than for other algorithmic solutions, given the human-oriented nature of fairness. In this paper, we contribute a novel interpretable, explainable method for bias detection relying on debates about the presence of bias against individuals, based on the values of protected features for the individuals and others in their neighbourhoods. Our method builds upon techniques from formal and computational argumentation, whereby debates result from arguing about biases within and across neighbourhoods. We provide formal, quantitative, and qualitative evaluations of our method, highlighting its strengths in performance against baselines, as well as its interpretability and explainability.
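
作为对"基于个体及其邻域中其他个体的受保护特征取值进行偏见论辩"这一思想的直观补充,下面给出一个假设性的最小示意:对每个个体取其特征空间中的k个近邻,比较受保护属性取值不同的邻居之间的平均预测差,差值越大,"该邻域存在偏见"的论据越强。这只是论辩框架中"论据"的一种朴素量化,并非论文的形式化论证方法。

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def neighborhood_bias_score(X, protected, y_pred, k=10):
    """对每个个体: 在其 k 近邻中比较受保护属性两组的平均预测差."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    scores = np.full(len(X), np.nan)
    for i, neigh in enumerate(idx[:, 1:]):            # 去掉自身
        g = protected[neigh].astype(bool)
        if g.any() and (~g).any():                    # 两组都有样本才可比较
            scores[i] = y_pred[neigh][g].mean() - y_pred[neigh][~g].mean()
    return scores                                     # 绝对值越大, 偏见论据越强
```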


【3】Learning Using Privileged Information for Litter Detection
标题:使用特权信息进行垃圾检测学习
链接:https://arxiv.org/abs/2508.04124

作者:Bartolo, Konstantinos Makantasis, Dylan Seychell
备注:This paper was accepted at the 13th European Workshop on Visual Information Processing (EUVIP 2025)
摘要:随着垃圾污染在全球范围内持续上升,开发能够有效检测垃圾的自动化工具仍然是一项重大挑战。这项研究提出了一种新方法,首次将特权信息与深度学习对象检测相结合,以提高垃圾检测能力,同时保持模型效率。我们在五个广泛使用的对象检测模型中评估了我们的方法,解决了诸如检测小垃圾和被草或石头部分遮挡的对象等挑战。除此之外,我们工作的一个关键贡献也可以归因于制定一种将边界框信息编码为二进制掩码的方法,该方法可以馈送到检测模型以改进检测指导。通过对著名的SODA数据集的数据集内评估和对BDW和UAVVaste垃圾检测数据集的跨数据集评估的实验,我们证明了所有模型的一致性能改进。我们的方法不仅提高了训练集内的检测准确性,而且也很好地推广到其他垃圾检测环境。至关重要的是,这些改进是在不增加模型复杂性或添加额外层的情况下实现的,从而确保了计算效率和可扩展性。我们的研究结果表明,这种方法提供了一个实用的解决方案,在现实世界的应用中的垃圾检测,平衡的准确性和效率。
摘要 :As litter pollution continues to rise globally, developing automated tools capable of detecting litter effectively remains a significant challenge. This study presents a novel approach that combines, for the first time, privileged information with deep learning object detection to improve litter detection while maintaining model efficiency. We evaluate our method across five widely used object detection models, addressing challenges such as detecting small litter and objects partially obscured by grass or stones. In addition to this, a key contribution of our work can also be attributed to formulating a means of encoding bounding box information as a binary mask, which can be fed to the detection model to refine detection guidance. Through experiments on both within-dataset evaluation on the renowned SODA dataset and cross-dataset evaluation on the BDW and UAVVaste litter detection datasets, we demonstrate consistent performance improvements across all models. Our approach not only bolsters detection accuracy within the training sets but also generalises well to other litter detection contexts. Crucially, these improvements are achieved without increasing model complexity or adding extra layers, ensuring computational efficiency and scalability. Our results suggest that this methodology offers a practical solution for litter detection, balancing accuracy and efficiency in real-world applications.
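
论文提到"将边界框信息编码为二值掩码并馈入检测模型以改进检测指导"。下面是该编码步骤本身的一个直接示意(与具体检测器无关,函数名为假设):

```python
import numpy as np

def boxes_to_binary_mask(boxes, height, width):
    """将 [x1, y1, x2, y2] 形式的边界框列表栅格化为单通道二值掩码."""
    mask = np.zeros((height, width), dtype=np.uint8)
    for x1, y1, x2, y2 in boxes:
        mask[int(y1):int(y2), int(x1):int(x2)] = 1
    return mask

# 用法示意: 掩码可作为额外输入通道与 RGB 图像拼接后送入检测模型
mask = boxes_to_binary_mask([(10, 20, 50, 80), (100, 40, 160, 90)], 256, 256)
```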


【4】Detection of Autonomic Dysreflexia in Individuals With Spinal Cord Injury Using Multimodal Wearable Sensors
标题:使用多模式可穿戴传感器检测脊髓损伤患者的自主反射障碍
链接:https://arxiv.org/abs/2508.03715

作者:uchs, Mehdi Ejtehadi, Ana Cisnal, Jürgen Pannek, Anke Scheel-Sailer, Robert Riener, Inge Eriks-Hoogland, Diego Paez-Granados
摘要:自主神经反射障碍(AD)是一种潜在的危及生命的疾病,其特征是脊髓损伤(SCI)患者突然出现严重的血压(BP)峰值。早期、准确的检测对于预防心血管并发症至关重要,但目前的监测方法要么是侵入性的,要么依赖于主观症状报告,限制了其在日常生活中的适用性。这项研究提出了一种非侵入性、可解释的机器学习框架,用于使用多模式可穿戴传感器检测AD。在尿动力学研究期间从27名慢性SCI患者中收集数据,包括心电图(ECG)、光电容积描记(PPG)、生物阻抗(BioZ)、体温、呼吸率(RR)和心率(HR),使用三种商业器械。客观AD标签来自同步袖带血压测量。在信号预处理和特征提取之后,BorutaSHAP用于稳健的特征选择,SHAP值用于可解释性。我们训练了特定于模态和设备的弱学习器,并使用堆叠集成元模型将它们聚合起来。交叉验证按参与者分层,以确保普遍性。HR和ECG衍生特征被确定为信息量最大的特征,特别是那些捕获节律形态和变异性的特征。最近质心集成产生了最高的性能(Macro F1 = 0.77+/-0.03),显著优于基线模型。在各种模式中,HR的曲线下面积最高(AUC = 0.93),其次是ECG(0.88)和PPG(0.86)。RR和温度特征对总体准确性的贡献较小,与缺失数据和低特异性一致。该模型被证明对传感器脱落具有鲁棒性,并且与临床AD事件保持良好一致。这些结果代表了对SCI患者进行个性化实时监测的重要一步。
摘要:Autonomic Dysreflexia (AD) is a potentially life-threatening condition characterized by sudden, severe blood pressure (BP) spikes in individuals with spinal cord injury (SCI). Early, accurate detection is essential to prevent cardiovascular complications, yet current monitoring methods are either invasive or rely on subjective symptom reporting, limiting applicability in daily life. This study presents a non-invasive, explainable machine learning framework for detecting AD using multimodal wearable sensors. Data were collected from 27 individuals with chronic SCI during urodynamic studies, including electrocardiography (ECG), photoplethysmography (PPG), bioimpedance (BioZ), temperature, respiratory rate (RR), and heart rate (HR), across three commercial devices. Objective AD labels were derived from synchronized cuff-based BP measurements. Following signal preprocessing and feature extraction, BorutaSHAP was used for robust feature selection, and SHAP values for explainability. We trained modality- and device-specific weak learners and aggregated them using a stacked ensemble meta-model. Cross-validation was stratified by participants to ensure generalizability. HR- and ECG-derived features were identified as the most informative, particularly those capturing rhythm morphology and variability. The Nearest Centroid ensemble yielded the highest performance (Macro F1 = 0.77+/-0.03), significantly outperforming baseline models. Among modalities, HR achieved the highest area under the curve (AUC = 0.93), followed by ECG (0.88) and PPG (0.86). RR and temperature features contributed less to overall accuracy, consistent with missing data and low specificity. The model proved robust to sensor dropout and aligned well with clinical AD events. These results represent an important step toward personalized, real-time monitoring for individuals with SCI.
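
下面用scikit-learn勾勒"各模态弱学习器 + 堆叠元模型 + 按参与者分组的交叉验证"这一评估结构的最小示意;基学习器与特征均为占位假设,BorutaSHAP特征选择与SHAP可解释性分析从略:

```python
import numpy as np
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestCentroid
from sklearn.model_selection import GroupKFold
from sklearn.metrics import f1_score

def evaluate_stacked(X, y, groups):
    """groups 为参与者 ID, 确保同一人不会同时出现在训练集与测试集."""
    base = [("hr", LogisticRegression(max_iter=1000)),
            ("ecg", LogisticRegression(max_iter=1000))]   # 占位: 实际为各模态/设备的弱学习器
    scores = []
    for tr, te in GroupKFold(n_splits=5).split(X, y, groups):
        model = StackingClassifier(estimators=base, final_estimator=NearestCentroid())
        model.fit(X[tr], y[tr])
        scores.append(f1_score(y[te], model.predict(X[te]), average="macro"))
    return np.mean(scores), np.std(scores)
```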


分类|识别(3篇)

【1】PRISM: Lightweight Multivariate Time-Series Classification through Symmetric Multi-Resolution Convolutional Layers
标题:PRISM:通过对称多分辨率卷积层的轻量级多变量时间序列分类
链接:https://arxiv.org/abs/2508.04503

作者:Zucchi, Thomas Lampert
摘要:多变量时间序列分类在从可穿戴传感到生物医学监测的领域中是关键的。尽管最近取得了进展,但基于Transformer和CNN的模型通常仍然计算量很大,提供有限的频率多样性,并且需要大量的参数预算。我们提出了PRISM(每通道分辨率感知对称模块),一种基于卷积的特征提取器,它在多个时间尺度上对每个通道独立地应用对称有限脉冲响应(FIR)滤波器。这种多分辨率、每通道设计产生了高频率选择性嵌入,而没有任何通道间卷积,大大降低了模型大小和复杂性。在人类活动、睡眠阶段和生物医学基准测试中,PRISM与轻量级分类头配对,匹配或优于领先的CNN和Transformer基线,同时使用的参数和FLOP大约少一个数量级。通过将经典信号处理见解与现代深度学习相结合,PRISM为多变量时间序列分类提供了准确、资源高效的解决方案。
摘要:Multivariate time-series classification is pivotal in domains ranging from wearable sensing to biomedical monitoring. Despite recent advances, Transformer- and CNN-based models often remain computationally heavy, offer limited frequency diversity, and require extensive parameter budgets. We propose PRISM (Per-channel Resolution-Informed Symmetric Module), a convolutional-based feature extractor that applies symmetric finite-impulse-response (FIR) filters at multiple temporal scales, independently per channel. This multi-resolution, per-channel design yields highly frequency-selective embeddings without any inter-channel convolutions, greatly reducing model size and complexity. Across human-activity, sleep-stage and biomedical benchmarks, PRISM, paired with lightweight classification heads, matches or outperforms leading CNN and Transformer baselines, while using roughly an order of magnitude fewer parameters and FLOPs. By uniting classical signal processing insights with modern deep learning, PRISM offers an accurate, resource-efficient solution for multivariate time-series classification.
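
"每通道、多尺度的对称FIR滤波"可以用按通道分组的一维卷积实现,并通过镜像半核参数强制对称(对称FIR具有线性相位)。下面是按此思路的PyTorch示意,核长与通道数均为假设值:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SymmetricFIR(nn.Module):
    """每通道独立的对称 FIR 滤波: 只学习半个核, 镜像拼出完整对称核."""
    def __init__(self, channels, kernel_size=15):
        super().__init__()
        assert kernel_size % 2 == 1
        self.channels = channels
        self.half = nn.Parameter(torch.randn(channels, 1, kernel_size // 2 + 1) * 0.1)

    def forward(self, x):                                   # x: (B, C, T)
        w = torch.cat([self.half, self.half.flip(-1)[..., 1:]], dim=-1)  # 对称核
        return F.conv1d(x, w, padding=w.shape[-1] // 2, groups=self.channels)

# 多尺度: 并联若干不同核长的滤波器组, 再沿通道维拼接各尺度输出
banks = nn.ModuleList([SymmetricFIR(6, k) for k in (7, 15, 31)])
x = torch.randn(2, 6, 128)
feats = torch.cat([b(x) for b in banks], dim=1)             # (2, 18, 128)
```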


【2】WSS-CL: Weight Saliency Soft-Guided Contrastive Learning for Efficient Machine Unlearning Image Classification
标题:WSS-CL:权重显著性软引导对比学习,用于高效机器遗忘图像分类
链接:https://arxiv.org/abs/2508.04308

作者: Tran, Thai Hoang Le
备注:17th International Conference on Computational Collective Intelligence 2025
摘要:机器遗忘(machine unlearning),即高效删除特定数据对已训练模型的影响,仍然是一个具有挑战性的问题。当前主要关注以数据为中心或基于权重策略的机器遗忘方法,在实现精确遗忘、保持稳定性和确保跨领域适用性方面经常遇到困难。在这项工作中,我们引入了一种新的两阶段高效机器遗忘方法用于图像分类,利用权重显著性将遗忘过程集中在关键模型参数上。我们的方法称为用于高效机器遗忘图像分类的权重显著性软引导对比学习(WSS-CL),它显著缩小了与"精确"遗忘之间的性能差距。首先,遗忘阶段最大化输出logit与聚合伪标签之间的KL散度,以在logit空间中实现高效遗忘。接下来,对抗性微调阶段以自监督方式引入对比学习:通过使用缩放的特征表示,最大化特征空间中遗忘样本与保留样本之间的距离,其中遗忘样本与其配对增强样本构成正对,而保留样本在对比损失计算中作为负对。实验评估表明,与最先进的方法相比,我们提出的方法在性能损失可忽略的情况下显著提升了遗忘效果,表明其在有监督和自监督环境中均具可用性。
摘要 :Machine unlearning, the efficient deletion of the impact of specific data in a trained model, remains a challenging problem. Current machine unlearning approaches that focus primarily on data-centric or weight-based strategies frequently encounter challenges in achieving precise unlearning, maintaining stability, and ensuring applicability across diverse domains. In this work, we introduce a new two-phase efficient machine unlearning method for image classification, in terms of weight saliency, leveraging weight saliency to focus the unlearning process on critical model parameters. Our method is called weight saliency soft-guided contrastive learning for efficient machine unlearning image classification (WSS-CL), which significantly narrows the performance gap with "exact" unlearning. First, the forgetting stage maximizes kullback-leibler divergence between output logits and aggregated pseudo-labels for efficient forgetting in logit space. Next, the adversarial fine-tuning stage introduces contrastive learning in a self-supervised manner. By using scaled feature representations, it maximizes the distance between the forgotten and retained data samples in the feature space, with the forgotten and the paired augmented samples acting as positive pairs, while the retained samples act as negative pairs in the contrastive loss computation. Experimental evaluations reveal that our proposed method yields much-improved unlearning efficacy with negligible performance loss compared to state-of-the-art approaches, indicative of its usability in supervised and self-supervised settings.
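
第一阶段"最大化输出logit与聚合伪标签之间的KL散度"可示意如下(以负KL作为最小化目标);伪标签的聚合方式此处仅用均匀分布占位,KL的具体方向以原文为准:

```python
import torch
import torch.nn.functional as F

def forgetting_loss(logits, pseudo_labels):
    """以负 KL 作为最小化目标, 等价于最大化两分布间的 KL 散度."""
    log_p = F.log_softmax(logits, dim=-1)
    # F.kl_div(log_p, q) 计算 KL(q || p); 论文中 KL 的方向以原文为准
    kl = F.kl_div(log_p, pseudo_labels, reduction="batchmean")
    return -kl

# 用法示意: 对遗忘集样本, 伪标签取均匀分布(占位假设)
logits = torch.randn(8, 10, requires_grad=True)
pseudo = torch.full((8, 10), 1.0 / 10)
forgetting_loss(logits, pseudo).backward()
```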


【3】SoilNet: A Multimodal Multitask Model for Hierarchical Classification of Soil Horizons
标题:SoilNet:土壤层分层分类的多模式多任务模型
链接:https://arxiv.org/abs/2508.03785

作者:iaburu, Vipin Singh, Frank Haußer, Felix Bießmann
备注:24 pages, 7 figures, 6 tables
摘要:虽然基础模型的最新进展在许多领域提升了技术水平,但经验科学中的一些问题尚未从这一进展中受益。例如,土壤层位分类仍然具有挑战性,因为它具有多模态、多任务的特点,以及复杂的层次化标签分类体系。土壤层位的准确分类对于监测土壤健康至关重要,而土壤健康直接影响农业生产力、粮食安全、生态系统稳定性和气候韧性。在这项工作中,我们提出了SoilNet,一个通过结构化模块化管道解决该问题的多模态多任务模型。我们的方法集成图像数据与地理时间元数据,首先预测深度标记,将土壤剖面分割为层位候选段;每个分段由一组特定于层位的形态特征来刻画;最后,基于多模态拼接特征向量预测层位标签,并利用基于图的标签表示来刻画土壤层位之间复杂的层次关系。我们的方法旨在解决可能标签数量庞大、不平衡且结构复杂的层次化分类问题。我们在真实世界的土壤剖面数据集上证明了该方法的有效性。所有代码和实验都可以在我们的存储库中找到:https://github.com/calgo-lab/BGR/
摘要:While recent advances in foundation models have improved the state of the art in many domains, some problems in empirical sciences could not benefit from this progress yet. Soil horizon classification, for instance, remains challenging because of its multimodal and multitask characteristics and a complex hierarchically structured label taxonomy. Accurate classification of soil horizons is crucial for monitoring soil health, which directly impacts agricultural productivity, food security, ecosystem stability and climate resilience. In this work, we propose $\textit{SoilNet}$ - a multimodal multitask model to tackle this problem through a structured modularized pipeline. Our approach integrates image data and geotemporal metadata to first predict depth markers, segmenting the soil profile into horizon candidates. Each segment is characterized by a set of horizon-specific morphological features. Finally, horizon labels are predicted based on the multimodal concatenated feature vector, leveraging a graph-based label representation to account for the complex hierarchical relationships among soil horizons. Our method is designed to address complex hierarchical classification, where the number of possible labels is very large, imbalanced and non-trivially structured. We demonstrate the effectiveness of our approach on a real-world soil profile dataset. All code and experiments can be found in our repository: https://github.com/calgo-lab/BGR/


表征(1篇)

【1】LRTuckerRep: Low-rank Tucker Representation Model for Multi-dimensional Data Completion
标题:LRTuckerRep:用于多维数据补全的低秩Tucker表示模型
链接:https://arxiv.org/abs/2508.03755

作者:g, Lili Yang
摘要:多维数据补全是计算科学中的一个关键问题,特别是在计算机视觉、信号处理和科学计算等领域。现有的方法通常利用全局低秩近似或局部平滑正则化,但每种方法都有明显的局限性:低秩方法计算成本高,可能会破坏固有的数据结构,而基于平滑的方法通常需要大量的手动参数调整,泛化能力差。在本文中,我们提出了一种新的低秩Tucker表示(LRTuckerRep)模型,在Tucker分解中统一了全局与局部先验建模。具体来说,LRTuckerRep通过因子矩阵上的自适应加权核范数和稀疏Tucker核来编码低秩性,同时通过因子空间上的无参数拉普拉斯正则化来捕获平滑性。为了有效求解由此产生的非凸优化问题,我们开发了两个具有可证明收敛保证的迭代算法。在多维图像修复和交通数据插补上的大量实验表明,与基线方法相比,LRTuckerRep在高缺失率下实现了更优的补全精度和鲁棒性。
摘要:Multi-dimensional data completion is a critical problem in computational sciences, particularly in domains such as computer vision, signal processing, and scientific computing. Existing methods typically leverage either global low-rank approximations or local smoothness regularization, but each suffers from notable limitations: low-rank methods are computationally expensive and may disrupt intrinsic data structures, while smoothness-based approaches often require extensive manual parameter tuning and exhibit poor generalization. In this paper, we propose a novel Low-Rank Tucker Representation (LRTuckerRep) model that unifies global and local prior modeling within a Tucker decomposition. Specifically, LRTuckerRep encodes low rankness through a self-adaptive weighted nuclear norm on the factor matrices and a sparse Tucker core, while capturing smoothness via a parameter-free Laplacian-based regularization on the factor spaces. To efficiently solve the resulting nonconvex optimization problem, we develop two iterative algorithms with provable convergence guarantees. Extensive experiments on multi-dimensional image inpainting and traffic data imputation demonstrate that LRTuckerRep achieves superior completion accuracy and robustness under high missing rates compared to baselines.
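
"因子矩阵上的自适应加权核范数"的核心算子是加权奇异值软阈值(加权核范数在权重随奇异值递减而递增时的近端算子)。下面给出该通用构件的numpy示意,权重取常见的反比例形式,仅作假设:

```python
import numpy as np

def weighted_svt(M, tau, eps=1e-6):
    """加权奇异值软阈值: 小奇异值受更强收缩, 从而保护主要结构."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    w = 1.0 / (s + eps)                     # 自适应权重(假设形式): 奇异值越小权重越大
    s_new = np.maximum(s - tau * w, 0.0)    # 逐奇异值软阈值收缩
    return (U * s_new) @ Vt
```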


编码器(1篇)

【1】Cloud Model Characteristic Function Auto-Encoder: Integrating Cloud Model Theory with MMD Regularization for Enhanced Generative Modeling
标题:云模型特征函数自动编码器:将云模型理论与MMD正则化集成以增强生成建模
链接:https://arxiv.org/abs/2508.04447

作者:Guoyin Wang
摘要:我们介绍云模型特征函数自动编码器(CMCFAE),一种将云模型集成到Wasserstein自动编码器(WAE)框架中的新型生成模型。通过利用云模型的特征函数来正则化潜在空间,我们的方法可以更准确地建模复杂的数据分布。与依赖标准高斯先验和传统散度度量的传统方法不同,我们的方法采用云模型先验,为潜在空间提供了更灵活、更真实的表示,从而缓解了重建样本中观察到的同质化现象。我们推导出云模型的特征函数,并在WAE框架内提出了相应的正则化项。对MNIST、FashionMNIST、CIFAR-10和CelebA的广泛定量和定性评估表明,CMCFAE在重建质量、潜在空间结构和样本多样性方面优于现有模型。这项工作不仅建立了云模型理论与基于MMD的正则化的新颖结合,也为增强基于自动编码器的生成模型提供了一个有前景的新视角。
摘要:We introduce Cloud Model Characteristic Function Auto-Encoder (CMCFAE), a novel generative model that integrates the cloud model into the Wasserstein Auto-Encoder (WAE) framework. By leveraging the characteristic functions of the cloud model to regularize the latent space, our approach enables more accurate modeling of complex data distributions. Unlike conventional methods that rely on a standard Gaussian prior and traditional divergence measures, our method employs a cloud model prior, providing a more flexible and realistic representation of the latent space, thus mitigating the homogenization observed in reconstructed samples. We derive the characteristic function of the cloud model and propose a corresponding regularizer within the WAE framework. Extensive quantitative and qualitative evaluations on MNIST, FashionMNIST, CIFAR-10, and CelebA demonstrate that CMCFAE outperforms existing models in terms of reconstruction quality, latent space structuring, and sample diversity. This work not only establishes a novel integration of cloud model theory with MMD-based regularization but also offers a promising new perspective for enhancing autoencoder-based generative models.
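
"用特征函数正则化潜空间"的可运行近似,是在随机频率点上比较编码分布与先验分布的经验特征函数;下式中先验用标准高斯占位(论文实际使用云模型先验),仅展示WAE式潜空间正则项的一般形式:

```python
import torch

def cf_distance(z_enc, z_prior, n_freq=64, scale=1.0):
    """在随机频率 t 处比较两组样本的经验特征函数 E[exp(i t·z)]."""
    t = torch.randn(n_freq, z_enc.shape[1]) * scale
    def ecf(z):
        proj = z @ t.T                                 # (N, n_freq)
        return torch.cos(proj).mean(0), torch.sin(proj).mean(0)
    re_e, im_e = ecf(z_enc)
    re_p, im_p = ecf(z_prior)
    return ((re_e - re_p) ** 2 + (im_e - im_p) ** 2).mean()

# 用法示意: 将该距离加到重构损失上, 作为潜空间正则项
z_enc = torch.randn(128, 8, requires_grad=True)
cf_distance(z_enc, torch.randn(128, 8)).backward()
```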


优化|敛散性(4篇)

【1】COPO: Consistency-Aware Policy Optimization
标题:COPO:一致性意识的政策优化
链接:https://arxiv.org/abs/2508.04138

作者:Han, Jiawei Chen, Hang Shao, Hao Ma, Mingcheng Li, Xintian Shen, Lihao Zheng, Wei Chen, Tao Wei, Lihua Zhang
摘要:强化学习显著增强了大型语言模型(LLM)在复杂问题解决任务中的推理能力。最近,DeepSeek R1的推出激发了人们对利用基于规则的奖励作为计算优势函数和指导策略优化的低成本替代方案的兴趣。然而,在许多复制和推广工作中观察到的一个共同挑战是:当单一提示下的多个采样回答收敛到相同的结果(无论正确与否)时,基于组的优势会退化为零。这导致梯度消失,使相应样本对学习无效,最终限制了训练效率和下游性能。为了解决这个问题,我们提出了一个一致性感知的策略优化框架,该框架引入了基于结果一致性的结构化全局奖励;基于该奖励的全局损失确保即使模型输出表现出高度的组内一致性,训练过程仍能获得有意义的学习信号,从而从全局角度鼓励生成正确且自洽的推理路径。此外,我们采用了一种基于熵的软混合机制,自适应地平衡局部优势估计与全局优化,从而在整个训练过程中实现探索与收敛之间的动态转换。我们的方法在奖励设计和优化策略方面都引入了若干关键创新,并通过在多个数学推理基准上的大幅性能提升验证了其有效性,突出了所提框架的鲁棒性和普遍适用性。本工作的代码已在https://github.com/hijih/copo-code.git上发布。
摘要:Reinforcement learning has significantly enhanced the reasoning capabilities of Large Language Models (LLMs) in complex problem-solving tasks. Recently, the introduction of DeepSeek R1 has inspired a surge of interest in leveraging rule-based rewards as a low-cost alternative for computing advantage functions and guiding policy optimization. However, a common challenge observed across many replication and extension efforts is that when multiple sampled responses under a single prompt converge to identical outcomes, whether correct or incorrect, the group-based advantage degenerates to zero. This leads to vanishing gradients and renders the corresponding samples ineffective for learning, ultimately limiting training efficiency and downstream performance. To address this issue, we propose a consistency-aware policy optimization framework that introduces a structured global reward based on outcome consistency, the global loss based on it ensures that, even when model outputs show high intra-group consistency, the training process still receives meaningful learning signals, which encourages the generation of correct and self-consistent reasoning paths from a global perspective. Furthermore, we incorporate an entropy-based soft blending mechanism that adaptively balances local advantage estimation with global optimization, enabling dynamic transitions between exploration and convergence throughout training. Our method introduces several key innovations in both reward design and optimization strategy. We validate its effectiveness through substantial performance gains on multiple mathematical reasoning benchmarks, highlighting the proposed framework's robustness and general applicability. Code of this work has been released at https://github.com/hijih/copo-code.git.
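
下面用numpy示意摘要所述机制的一种可能形式:结果全同时组内优势退化为零,此时以基于结果一致性的全局信号补充,并按结果分布的熵自适应混合。全局奖励与混合函数的具体形式均为假设,仅用于说明思路:

```python
import numpy as np

def copo_advantage(rewards):
    """rewards: 同一 prompt 下 G 个采样回答的 0/1 规则奖励."""
    local = rewards - rewards.mean()              # 组相对优势: 全对/全错时恒为 0
    p = rewards.mean()
    eps = 1e-8
    H = -(p * np.log(p + eps) + (1 - p) * np.log(1 - p + eps)) / np.log(2)
    w = 1.0 - H                                   # 组内越一致(熵越低), 越依赖全局信号
    global_signal = p - 0.5                       # 一致性全局奖励的占位形式
    return (1 - w) * local + w * global_signal

print(copo_advantage(np.array([1., 1., 1., 1.])))  # 全对: 仍有非零学习信号
print(copo_advantage(np.array([1., 0., 1., 0.])))  # 结果分化: 以组内优势为主
```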


【2】FairPOT: Balancing AUC Performance and Fairness with Proportional Optimal Transport
标题:FairPOT:通过比例最优传输平衡AUC性能和公平性
链接:https://arxiv.org/abs/2508.03940

作者:u, Yi Shen, Matthew M. Engelhard, Benjamin A. Goldstein, Michael J. Pencina, Nicoleta J. Economou-Zavlanos, Michael M. Zavlanos
摘要:利用受试者工作特征曲线下面积(AUC)的公平性度量在医疗保健、金融和刑事司法等高风险领域受到越来越多的关注。在这些领域中,公平性通常基于风险评分而非二元结果来评估;一个常见的挑战是,强制执行严格的公平性可能会显著降低AUC性能。为应对这一挑战,我们提出了公平比例最优传输(FairPOT),一种新颖的、与模型无关的后处理框架:它使用最优传输有策略地对齐不同群体的风险评分分布,但只对可控比例的得分进行变换,即弱势群体中得分的前λ分位数。通过改变λ,我们的方法允许在缩小AUC差异和保持整体AUC性能之间进行可调的权衡。此外,我们将FairPOT扩展到部分AUC设置,使公平性干预能够集中于风险最高的区域。在合成、公开和临床数据集上的大量实验表明,FairPOT在全局和部分AUC场景中均持续优于现有后处理技术,通常在AUC仅略有下降、甚至效用为正增益的情况下实现更好的公平性。FairPOT的计算效率和实际适应性使其成为现实部署中有前景的解决方案。
摘要:Fairness metrics utilizing the area under the receiver operator characteristic curve (AUC) have gained increasing attention in high-stakes domains such as healthcare, finance, and criminal justice. In these domains, fairness is often evaluated over risk scores rather than binary outcomes, and a common challenge is that enforcing strict fairness can significantly degrade AUC performance. To address this challenge, we propose Fair Proportional Optimal Transport (FairPOT), a novel, model-agnostic post-processing framework that strategically aligns risk score distributions across different groups using optimal transport, but does so selectively by transforming a controllable proportion, i.e., the top-lambda quantile, of scores within the disadvantaged group. By varying lambda, our method allows for a tunable trade-off between reducing AUC disparities and maintaining overall AUC performance. Furthermore, we extend FairPOT to the partial AUC setting, enabling fairness interventions to concentrate on the highest-risk regions. Extensive experiments on synthetic, public, and clinical datasets show that FairPOT consistently outperforms existing post-processing techniques in both global and partial AUC scenarios, often achieving improved fairness with slight AUC degradation or even positive gains in utility. The computational efficiency and practical adaptability of FairPOT make it a promising solution for real-world deployment.
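
一维分布之间的最优传输等价于分位数匹配。下面按摘要描述给出"只变换弱势群体前λ分位数得分"的后处理示意(假设性实现,非论文原始代码):

```python
import numpy as np

def fairpot_transform(scores_dis, scores_adv, lam=0.5):
    """将弱势群体前 lam 分位数的得分经分位数匹配映射到另一群体的分布上."""
    s = scores_dis.astype(float).copy()
    thr = np.quantile(s, 1.0 - lam)
    sel = s >= thr                                   # 只变换前 lam 分位数的得分
    ranks = np.array([(s < v).mean() for v in s[sel]])
    s[sel] = np.quantile(scores_adv, ranks)          # 一维最优传输 = 逆 CDF 复合
    return s
```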


【3】Scalable Neural Network-based Blackbox Optimization
标题:基于可扩展神经网络的黑盒优化
链接:https://arxiv.org/abs/2508.03827

作者:r Koratikere, Leifur Leifsson
备注:This preprint has been submitted to Structural and Multidisciplinary Optimization for peer review. An open-source implementation of SNBO is available at: this https URL
摘要:贝叶斯优化(BO)是一种广泛使用的黑盒优化方法,它利用高斯过程(GP)模型和采集函数来指导未来的采样。虽然BO在低维环境中有效,但由于GP模型的计算复杂性,在高维空间以及函数评估次数较多时,BO面临可扩展性挑战。相比之下,神经网络(NN)提供了更好的可扩展性,并能对复杂函数建模,这促成了基于NN的BO方法的发展。然而,这些方法通常依赖于估计NN预测中的模型不确定性,这一过程往往计算密集且复杂,特别是在高维中。为了解决这些限制,本文提出了一种新方法,称为基于可扩展神经网络的黑盒优化(SNBO),它不依赖于模型不确定性估计。具体而言,SNBO使用独立的探索与利用准则添加新样本,同时自适应地控制采样区域以确保高效优化。SNBO在从10维到102维的一系列优化问题上进行了评估,并与四个最先进的基线算法进行比较。在大多数测试问题上,SNBO获得的函数值优于表现最好的基线算法,同时所需函数求值次数减少40%-60%,并将运行时间减少至少一个数量级。
摘要:Bayesian Optimization (BO) is a widely used approach for blackbox optimization that leverages a Gaussian process (GP) model and an acquisition function to guide future sampling. While effective in low-dimensional settings, BO faces scalability challenges in high-dimensional spaces and with large number of function evaluations due to the computational complexity of GP models. In contrast, neural networks (NNs) offer better scalability and can model complex functions, which led to the development of NN-based BO approaches. However, these methods typically rely on estimating model uncertainty in NN prediction -- a process that is often computationally intensive and complex, particularly in high dimensions. To address these limitations, a novel method, called scalable neural network-based blackbox optimization (SNBO), is proposed that does not rely on model uncertainty estimation. Specifically, SNBO adds new samples using separate criteria for exploration and exploitation, while adaptively controlling the sampling region to ensure efficient optimization. SNBO is evaluated on a range of optimization problems spanning from 10 to 102 dimensions and compared against four state-of-the-art baseline algorithms. Across the majority of test problems, SNBO attains function values better than the best-performing baseline algorithm, while requiring 40-60% fewer function evaluations and reducing the runtime by at least an order of magnitude.
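
摘要中"利用与探索两条准则分别加点、不估计模型不确定性"的思路可写成如下极简循环(候选点随机采样,探索准则取"离已有样本最远",自适应采样区域控制从略,均为示意性假设):

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.neural_network import MLPRegressor

def snbo_sketch(f, lo, hi, n_init=16, n_iter=30, n_cand=2000, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(lo, hi, size=(n_init, len(lo)))
    y = np.array([f(x) for x in X])
    for _ in range(n_iter):
        sur = MLPRegressor((64, 64), max_iter=400).fit(X, y)  # NN 代理, 无不确定性估计
        cand = rng.uniform(lo, hi, size=(n_cand, len(lo)))
        x_exp = cand[np.argmin(sur.predict(cand))]            # 利用: 代理预测最优处
        x_far = cand[np.argmax(cdist(cand, X).min(axis=1))]   # 探索: 离样本集最远处
        for x_new in (x_exp, x_far):
            X = np.vstack([X, x_new]); y = np.append(y, f(x_new))
    return X[np.argmin(y)], y.min()

best_x, best_y = snbo_sketch(lambda v: np.sum((v - 0.3) ** 2), np.zeros(5), np.ones(5))
```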


【4】PILOT-C: Physics-Informed Low-Distortion Optimal Trajectory Compression
标题:PILOT-C:基于物理信息的低失真最佳轨迹压缩
链接:https://arxiv.org/abs/2508.03730

作者: Baihua Zheng, Weiwei Sun
摘要:位置感知设备不断生成大量的轨迹数据,从而产生了对高效压缩的需求。线简化是一种常见的解决方案,但通常假设2D轨迹并忽略时间同步和运动连续性。我们提出了PILOT-C,一种新的轨迹压缩框架,集成了频域物理建模与误差有界优化。与现有的线简化方法不同,PILOT-C通过独立压缩每个空间轴来支持任意维度(包括3D)的轨迹。在四个真实世界的数据集上进行评估,PILOT-C在多个维度上实现了卓越的性能。在压缩比方面,PILOT-C比目前最先进的基于SED的线简化算法CISED-W平均高出19.2%。在轨迹保真度方面,PILOT-C与CISED-W相比,误差平均减少了32.6%。此外,PILOT-C可无缝扩展到三维轨迹并保持相同的计算复杂度,与3D数据集上最高效的线简化算法SQUISH-E相比,压缩比提高了49%。
摘要:Location-aware devices continuously generate massive volumes of trajectory data, creating demand for efficient compression. Line simplification is a common solution but typically assumes 2D trajectories and ignores time synchronization and motion continuity. We propose PILOT-C, a novel trajectory compression framework that integrates frequency-domain physics modeling with error-bounded optimization. Unlike existing line simplification methods, PILOT-C supports trajectories in arbitrary dimensions, including 3D, by compressing each spatial axis independently. Evaluated on four real-world datasets, PILOT-C achieves superior performance across multiple dimensions. In terms of compression ratio, PILOT-C outperforms CISED-W, the current state-of-the-art SED-based line simplification algorithm, by an average of 19.2%. For trajectory fidelity, PILOT-C achieves an average of 32.6% reduction in error compared to CISED-W. Additionally, PILOT-C seamlessly extends to three-dimensional trajectories while maintaining the same computational complexity, achieving a 49% improvement in compression ratios over SQUISH-E, the most efficient line simplification algorithm on 3D datasets.
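
"按空间轴独立的频域压缩 + 误差有界"可用如下贪心过程示意:按幅值从大到小保留DCT系数,直到最大重构误差满足给定界。这只是体现思路的朴素实现,并非论文的物理建模与优化算法:

```python
import numpy as np
from scipy.fft import dct, idct

def compress_axis(x, err_bound):
    """返回满足最大误差界的稀疏 DCT 系数及保留的系数个数."""
    c = dct(x, norm="ortho")
    kept = np.zeros_like(c)
    for k, i in enumerate(np.argsort(np.abs(c))[::-1], start=1):
        kept[i] = c[i]
        if np.max(np.abs(idct(kept, norm="ortho") - x)) <= err_bound:
            return kept, k
    return kept, len(c)

# 用法示意: 三维轨迹对 x/y/z 三个轴各自独立调用 compress_axis
t = np.linspace(0, 4 * np.pi, 512)
coeffs, k = compress_axis(np.sin(t) + 0.2 * np.sin(5 * t), err_bound=0.05)
```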


预测|估计(8篇)

【1】Who cuts emissions, who turns up the heat? causal machine learning estimates of energy efficiency interventions
标题:谁在减排,谁在调高暖气?能效干预措施的因果机器学习估计
链接:https://arxiv.org/abs/2508.04478

作者:o D'Amico, Francesco Pomponi, Jay H. Arehart, Lina Khaddour
摘要:减少家庭能源需求是减缓气候变化和应对燃料贫困战略的核心,但能效干预措施的影响是高度异质的。我们使用在英国住房存量的全国代表性数据上训练的因果机器学习模型,估计了墙体保温对燃气消耗的平均处理效应和条件处理效应,重点关注不同能源负担亚组上的分布效应。虽然干预措施平均降低了燃气需求(最多达19%),但低能源负担群体实现了可观的节省,而高能源负担群体几乎没有减少。这种模式反映了一种行为驱动的机制:受高成本收入比(例如超过0.1)约束的家庭将节省下来的部分重新用于改善热舒适,而不是降低消费。这种反应绝非浪费,而是在此前匮乏背景下的合理调整,并可能对健康和福祉带来协同效益。这些发现呼吁建立一个更广泛的评估框架,同时考虑家庭能源政策的气候影响与公平影响。
摘要:Reducing domestic energy demand is central to climate mitigation and fuel poverty strategies, yet the impact of energy efficiency interventions is highly heterogeneous. Using a causal machine learning model trained on nationally representative data of the English housing stock, we estimate average and conditional treatment effects of wall insulation on gas consumption, focusing on distributional effects across energy burden subgroups. While interventions reduce gas demand on average (by as much as 19 percent), low energy burden groups achieve substantial savings, whereas those experiencing high energy burdens see little to no reduction. This pattern reflects a behaviourally-driven mechanism: households constrained by high costs-to-income ratios (e.g. more than 0.1) reallocate savings toward improved thermal comfort rather than lowering consumption. Far from wasteful, such responses represent rational adjustments in contexts of prior deprivation, with potential co-benefits for health and well-being. These findings call for a broader evaluation framework that accounts for both climate impacts and the equity implications of domestic energy policy.
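
"平均处理效应与条件处理效应(ATE/CATE)"的估计可用最简单的T-learner示意:对处理组与对照组分别拟合结果模型,预测之差即个体层面的效应估计。论文实际使用的因果机器学习估计器以原文为准:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def t_learner_cate(X, treated, y):
    """treated: 0/1 是否安装墙体保温(假设变量); y: 燃气消耗. 返回每户 CATE."""
    m1 = GradientBoostingRegressor().fit(X[treated == 1], y[treated == 1])
    m0 = GradientBoostingRegressor().fit(X[treated == 0], y[treated == 0])
    cate = m1.predict(X) - m0.predict(X)
    return cate, cate.mean()                 # 第二项为 ATE 的朴素估计
```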


【2】Matrix-Free Two-to-Infinity and One-to-Two Norms Estimation
标题:无矩阵二到无限和一到二范估计
链接:https://arxiv.org/abs/2508.04444

作者:ganov, Evgeny Frolov, Sergey Samsonov, Maxim Rakhuba
摘要:在本文中,我们提出了新的随机算法,用于在无矩阵(matrix-free)设置下仅通过矩阵-向量乘法来估计二到无穷范数和一到二范数。我们的方法基于对Hutchinson对角估计器及其Hutch++版本的适当修改,并为这两种修改提供了oracle复杂度界。我们进一步展示了这些算法在图像分类任务的深度神经网络训练中用于基于雅可比矩阵的正则化的实际效用。我们还证明了我们的方法可用于减轻推荐系统领域中对抗性攻击的影响。
摘要:In this paper, we propose new randomized algorithms for estimating the two-to-infinity and one-to-two norms in a matrix-free setting, using only matrix-vector multiplications. Our methods are based on appropriate modifications of Hutchinson's diagonal estimator and its Hutch++ version. We provide oracle complexity bounds for both modifications. We further illustrate the practical utility of our algorithms for Jacobian-based regularization in deep neural network training on image classification tasks. We also demonstrate that our methodology can be applied to mitigate the effect of adversarial attacks in the domain of recommender systems.
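
二到无穷范数等于矩阵的最大行l2范数,而行范数平方恰为diag(AA^T),后者可用Hutchinson式估计仅靠矩阵-向量乘得到。下面示意这一基本关系(论文中的Hutch++改进与一到二范数估计从略):

```python
import numpy as np

def est_two_to_inf(matvec, n_rows, n_cols, n_samples=200, seed=0):
    """||A||_{2->inf} = max_i ||A_i||_2;  对 Rademacher 向量 z 有 E[(Az)_i^2] = (AA^T)_{ii}."""
    rng = np.random.default_rng(seed)
    acc = np.zeros(n_rows)
    for _ in range(n_samples):
        z = rng.choice([-1.0, 1.0], size=n_cols)
        acc += matvec(z) ** 2                   # 只用到 A 的矩阵-向量乘
    return float(np.sqrt(np.max(acc / n_samples)))

A = np.random.randn(50, 30)
print(est_two_to_inf(lambda v: A @ v, 50, 30), np.linalg.norm(A, axis=1).max())
```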


【3】A virtual sensor fusion approach for state of charge estimation of lithium-ion cells
标题:用于锂离子电池荷电状态估计的虚拟传感器融合方法
链接:https://arxiv.org/abs/2508.04268

作者:evitali, Daniele Masti, Mirko Mazzoleni, Fabio Previdi
摘要:本文通过组合两种广泛使用的范式来解决锂离子电池荷电状态(SOC)的估计问题:配备等效电路模型(ECM)的卡尔曼滤波器(KF)与机器学习方法。特别地,本文考虑一种新近的虚拟传感器(VS)综合技术,其流程如下:(i)直接从数据学习电池单体的仿射参数变化(APV)模型;(ii)由该APV模型导出一组线性观测器;(iii)利用从观测器提取的特征以及输入输出数据训练机器学习模型来预测SOC。VS给出的SOC预测与电池端电压一起作为输出测量提供给扩展卡尔曼滤波器(EKF),从而将两种范式结合起来。文中还提出了一种数据驱动的EKF噪声协方差矩阵标定策略。实验结果表明,所设计的方法在SOC估计的准确性和平滑性方面均有优势。
摘要:This paper addresses the estimation of the State Of Charge (SOC) of lithium-ion cells via the combination of two widely used paradigms: Kalman Filters (KFs) equipped with Equivalent Circuit Models (ECMs) and machine-learning approaches. In particular, a recent Virtual Sensor (VS) synthesis technique is considered, which operates as follows: (i) learn an Affine Parameter-Varying (APV) model of the cell directly from data, (ii) derive a bank of linear observers from the APV model, (iii) train a machine-learning technique from features extracted from the observers together with input and output data to predict the SOC. The SOC predictions returned by the VS are supplied to an Extended KF (EKF) as output measurements along with the cell terminal voltage, combining the two paradigms. A data-driven calibration strategy for the noise covariance matrices of the EKF is proposed. Experimental results show that the designed approach is beneficial w.r.t. SOC estimation accuracy and smoothness.


【4】Automated ultrasound doppler angle estimation using deep learning
标题:使用深度学习自动化超声多普勒角度估计
链接 :https://arxiv.org/abs/2508.04243

作者:til, Ajay Anand
备注:None
摘要:角度估计是多普勒超声临床工作流程中测量血流速度的重要步骤。人们普遍认识到,不正确的角度估计是基于多普勒的血流速度测量误差的主要来源。在本文中,我们提出了一种基于深度学习的自动多普勒角度估计方法。该方法使用2100张人体颈动脉超声图像(含图像增强)开发。我们使用五个预训练模型提取图像特征,并将这些特征传入自定义浅层网络进行多普勒角度估计。另外,由人类观察者审阅图像获得人工测量值以作比较。在所评估的模型中,自动与人工角度估计值之间的平均绝对误差(MAE)范围为3.9°至9.4°。此外,表现最佳模型的MAE小于可接受的临床多普勒角度误差阈值,从而避免将正常速度值误判为狭窄。结果表明,基于深度学习的技术有潜力用于自动超声多普勒角度估计,并有望在商用超声扫描仪的成像软件中实现。
摘要:Angle estimation is an important step in the Doppler ultrasound clinical workflow to measure blood velocity. It is widely recognized that incorrect angle estimation is a leading cause of error in Doppler-based blood velocity measurements. In this paper, we propose a deep learning-based approach for automated Doppler angle estimation. The approach was developed using 2100 human carotid ultrasound images including image augmentation. Five pre-trained models were used to extract image features, and these features were passed to a custom shallow network for Doppler angle estimation. Independently, measurements were obtained by a human observer reviewing the images for comparison. The mean absolute error (MAE) between the automated and manual angle estimates ranged from 3.9° to 9.4° for the models evaluated. Furthermore, the MAE for the best performing model was less than the acceptable clinical Doppler angle error threshold thus avoiding misclassification of normal velocity values as a stenosis. The results demonstrate potential for applying a deep-learning based technique for automated ultrasound Doppler angle estimation. Such a technique could potentially be implemented within the imaging software on commercial ultrasound scanners.


【5】Markov Chain Estimation with In-Context Learning
标题:基于上下文学习的马尔科夫链估计
链接:https://arxiv.org/abs/2508.03934

作者:age, Jeremie Mary, David Picard
备注:Accepted at Gretsi 2025
摘要:我们研究了仅通过下一个词元预测进行训练的Transformer学习涉及其上下文的算法的能力。我们用随机转移矩阵构建马尔可夫链,并训练Transformer预测下一个词元。训练和测试使用不同的矩阵;我们表明,在Transformer规模和训练集规模上存在一个阈值,高于该阈值,模型能够学会从上下文估计转移概率,而不是记忆训练模式。此外,我们还表明,对状态采用更复杂的编码,可以对结构不同于训练时所见的马尔可夫链做出更鲁棒的预测。
摘要:We investigate the capacity of transformers to learn algorithms involving their context while solely being trained using next token prediction. We set up Markov chains with random transition matrices and we train transformers to predict the next token. Matrices used during training and test are different and we show that there is a threshold in transformer size and in training set size above which the model is able to learn to estimate the transition probabilities from its context instead of memorizing the training patterns. Additionally, we show that more involved encoding of the states enables more robust prediction for Markov chains with structures different than those seen during training.
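
实验设置可按如下方式示意:用Dirichlet分布采样随机的行随机转移矩阵并生成序列;若模型真正"从上下文估计转移概率",其预测应接近由上下文转移计数得到的经验估计:

```python
import numpy as np

def make_chain(n_states=5, seq_len=256, seed=0):
    rng = np.random.default_rng(seed)
    P = rng.dirichlet(np.ones(n_states), size=n_states)   # 随机转移矩阵: 每行为一个分布
    s = [int(rng.integers(n_states))]
    for _ in range(seq_len - 1):
        s.append(int(rng.choice(n_states, p=P[s[-1]])))
    return np.array(s), P

def empirical_P(seq, n_states):
    """上下文内的经验转移估计: 即模型应当'学会计算'的量."""
    C = np.zeros((n_states, n_states))
    for a, b in zip(seq[:-1], seq[1:]):
        C[a, b] += 1
    return C / np.maximum(C.sum(axis=1, keepdims=True), 1)
```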


【6】Calibrating Biophysical Models for Grape Phenology Prediction via Multi-Task Learning
标题:通过多任务学习校准葡萄物候预测的生物物理模型
链接:https://arxiv.org/abs/2508.03898

作者:olow, Sandhya Saisubramanian
摘要:葡萄物候的准确预测对于及时的葡萄园管理决策至关重要,例如调度灌溉和施肥,以最大限度地提高作物产量和质量。虽然根据历史数据校准的传统生物物理模型可以用于整个季节的预测,但它们缺乏精细葡萄园管理所需的精度。深度学习方法是一种引人注目的替代方法,但它们的性能受到稀疏物候数据集的阻碍,特别是在品种层面。我们提出了一种混合建模方法,结合多任务学习与递归神经网络参数化可微生物物理模型。通过使用多任务学习来预测生物物理模型的参数,我们的方法可以在保持生物结构的同时实现跨品种的共享学习,从而提高预测的鲁棒性和准确性。使用真实世界和合成数据集进行的实证评估表明,我们的方法在预测物候期以及其他作物状态变量(如抗寒性和小麦产量)方面明显优于传统的生物物理模型和基线深度学习方法。
摘要:Accurate prediction of grape phenology is essential for timely vineyard management decisions, such as scheduling irrigation and fertilization, to maximize crop yield and quality. While traditional biophysical models calibrated on historical field data can be used for season-long predictions, they lack the precision required for fine-grained vineyard management. Deep learning methods are a compelling alternative but their performance is hindered by sparse phenology datasets, particularly at the cultivar level. We propose a hybrid modeling approach that combines multi-task learning with a recurrent neural network to parameterize a differentiable biophysical model. By using multi-task learning to predict the parameters of the biophysical model, our approach enables shared learning across cultivars while preserving biological structure, thereby improving the robustness and accuracy of predictions. Empirical evaluation using real-world and synthetic datasets demonstrates that our method significantly outperforms both conventional biophysical models and baseline deep learning approaches in predicting phenological stages, as well as other crop state variables such as cold-hardiness and wheat yield.


【7】Prediction-Oriented Subsampling from Data Streams
标题:数据流的面向预测的二次采样
链接:https://arxiv.org/abs/2508.03868

作者: Lavinia Mussati, Freddie Bickford Smith, Tom Rainforth, Stephen Roberts
备注:Published at CoLLAs 2025
摘要:数据通常以流的形式产生,新的观测随时间到达。从数据流中学习模型的一个关键挑战是在保持计算成本可控的同时捕获相关信息。我们探索了用于离线学习的智能数据子采样,并主张采用以降低感兴趣的下游预测的不确定性为中心的信息论方法。经验上,我们证明这种面向预测的方法在两个被广泛研究的问题上优于先前提出的信息论技术。同时,我们强调,要在实践中可靠地获得强性能,需要仔细的模型设计。
摘要:Data is often generated in streams, with new observations arriving over time. A key challenge for learning models from data streams is capturing relevant information while keeping computational costs manageable. We explore intelligent data subsampling for offline learning, and argue for an information-theoretic method centred on reducing uncertainty in downstream predictions of interest. Empirically, we demonstrate that this prediction-oriented approach performs better than a previously proposed information-theoretic technique on two widely studied problems. At the same time, we highlight that reliably achieving strong performance in practice requires careful model design.
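
这一"面向预测"的信息论准则(作者先前提出的EPIG)可用模型集成来近似:候选点标签与下游目标点预测之间的互信息。下面按该定义给出numpy示意(以K个集成成员近似后验):

```python
import numpy as np

def epig_score(P_cand, P_targ, eps=1e-12):
    """P_cand: (K, C) 各集成成员对候选点的类别分布;
    P_targ: (K, M, C) 对 M 个下游目标点的分布. 返回对目标点平均的互信息 I(y; y*)."""
    K = P_cand.shape[0]
    joint = np.einsum("kc,kmd->mcd", P_cand, P_targ) / K   # p(y, y* | x, x*)
    marg_c = P_cand.mean(0)                                # p(y | x)
    marg_t = P_targ.mean(0)                                # p(y* | x*), 形状 (M, C)
    outer = marg_c[None, :, None] * marg_t[:, None, :]
    return (joint * np.log((joint + eps) / (outer + eps))).sum(axis=(1, 2)).mean()
```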


【8】Predicting fall risk in older adults: A machine learning comparison of accelerometric and non-accelerometric factors
标题:预测老年人跌倒风险:加速度和非加速度因素的机器学习比较
链接:https://arxiv.org/abs/2508.03756

作者:lez-Castro, José Alberto Benítez-Andrades, Rubén González-González, Camino Prada-García, Raquel Leirós-Rodríguez
备注:None
摘要:这项研究使用各种机器学习模型对146名参与者的加速度、非加速度和组合数据进行训练,研究老年人的跌倒风险预测。结合两种数据类型的模型实现了卓越的性能,贝叶斯岭回归显示出最高的准确性(MSE = 0.6746,R2 = 0.9941)。非加速度变量,如年龄和合并症,被证明是预测的关键。研究结果支持使用综合数据和贝叶斯方法来加强跌倒风险评估并为预防策略提供信息。
摘要:This study investigates fall risk prediction in older adults using various machine learning models trained on accelerometric, non-accelerometric, and combined data from 146 participants. Models combining both data types achieved superior performance, with Bayesian Ridge Regression showing the highest accuracy (MSE = 0.6746, R2 = 0.9941). Non-accelerometric variables, such as age and comorbidities, proved critical for prediction. Results support the use of integrated data and Bayesian approaches to enhance fall risk assessment and inform prevention strategies.
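
摘要报告的建模流程可压缩为几行结构示意("加速度特征 + 非加速度特征拼接 + 贝叶斯岭回归");以下数据均为合成占位,仅示意组合两类数据的设置:

```python
import numpy as np
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_accel = rng.normal(size=(146, 20))           # 加速度特征(合成占位数据)
X_clin = rng.normal(size=(146, 5))             # 年龄/合并症等非加速度特征(合成占位)
y_risk = X_clin[:, 0] * 0.8 + rng.normal(scale=0.3, size=146)

X = np.hstack([X_accel, X_clin])               # 组合两类数据: 摘要中表现最好的设置
print(cross_val_score(BayesianRidge(), X, y_risk, cv=5, scoring="r2").mean())
```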


其他神经网络|深度学习|模型|建模(21篇)

【1】Robustly Learning Monotone Single-Index Models
标题:鲁棒学习单调单指标模型
链接:https://arxiv.org/abs/2508.04670

作者:ng, Nikos Zarifis, Ilias Diakonikolas, Jelena Diakonikolas
摘要:我们考虑在对抗性标签噪声存在下、高斯分布下关于平方损失学习单指标模型的基本问题。我们的主要贡献是该学习任务的第一个计算高效算法,它实现了常数因子近似,并适用于所有具有2+ζ(ζ>0)阶有界矩的单调激活函数类。该类尤其包括所有单调Lipschitz函数,甚至不连续函数,如(可能有偏的)半空间。对于未知激活的情况,以前的工作要么无法达到常数因子近似,要么只适用于小得多的激活函数族。我们方法的主要概念新颖性在于开发了一个超越常规梯度方法边界的优化框架,它通过直接利用问题结构、高斯空间的性质和单调函数的正则性来识别有用的向量场,以指导算法更新。
摘要:We consider the basic problem of learning Single-Index Models with respect to the square loss under the Gaussian distribution in the presence of adversarial label noise. Our main contribution is the first computationally efficient algorithm for this learning task, achieving a constant factor approximation, that succeeds for the class of all monotone activations with bounded moment of order $2 + \zeta$, for $\zeta > 0$. This class in particular includes all monotone Lipschitz functions and even discontinuous functions like (possibly biased) halfspaces. Prior work for the case of unknown activation either does not attain constant factor approximation or succeeds for a substantially smaller family of activations. The main conceptual novelty of our approach lies in developing an optimization framework that steps outside the boundaries of usual gradient methods and instead identifies a useful vector field to guide the algorithm updates by directly leveraging the problem structure, properties of Gaussian spaces, and regularity of monotone functions.


【2】Live Music Models
标题:现场音乐模特
链接:https://arxiv.org/abs/2508.04651

作者:m: Antoine Caillon, Brian McWilliams, Cassie Tarakajian, Ian Simon, Ilaria Manco, Jesse Engel, Noah Constant, Pen Li, Timo I. Denk, Alberto Lalama, Andrea Agostinelli, Anna Huang, Ethan Manilow, George Brower, Hakan Erdogan, Heidi Lei, Itai Rolnick, Ivan Grishchenko, Manu Orsini, Matej Kastelic, Mauricio Zuluaga, Mauro Verzetti, Michael Dooley, Ondrej Skopek, Rafael Ferrer, Zalán Borsos, Äaron van den Oord, Douglas Eck, Eli Collins, Jason Baldridge, Tom Hume, Chris Donahue, Kehang Han, Adam Roberts
摘要:我们介绍了一类新的生成模型的音乐称为现场音乐模型,产生一个连续的音乐流在实时同步用户控制。我们发布了Magenta RealTime,这是一个开放权重的现场音乐模型,可以使用文本或音频提示来控制声学风格。在音乐质量的自动度量方面,Magenta RealTime优于其他开放权重音乐生成模型,尽管使用的参数更少,并提供了首个实时生成功能。我们还发布了Lyria RealTime,这是一个基于API的模型,具有扩展控件,可以访问我们最强大的模型,并具有广泛的即时覆盖范围。这些模型展示了人工智能辅助音乐创作的新范式,强调现场音乐表演的人在回路中的互动。
摘要:We introduce a new class of generative models for music called live music models that produce a continuous stream of music in real-time with synchronized user control. We release Magenta RealTime, an open-weights live music model that can be steered using text or audio prompts to control acoustic style. On automatic metrics of music quality, Magenta RealTime outperforms other open-weights music generation models, despite using fewer parameters and offering first-of-its-kind live generation capabilities. We also release Lyria RealTime, an API-based model with extended controls, offering access to our most powerful model with wide prompt coverage. These models demonstrate a new paradigm for AI-assisted music creation that emphasizes human-in-the-loop interaction for live music performance.


【3】A Reproducible, Scalable Pipeline for Synthesizing Autoregressive Model Literature
标题:一个可复制、可扩展的自回归模型文献合成管道
链接:https://arxiv.org/abs/2508.04612

作者:ay, Bugra Kilictas, Hamdi Alakkad
备注:9 pages
摘要:自回归生成模型研究的加速已经产生了数千篇论文,使得人工文献调查和复现研究变得越来越不切实际。我们提出了一个完全开源、可复现的流水线,它自动从公共存储库检索候选论文、按相关性进行过滤、提取元数据/超参数和报告结果、对主题进行聚类、生成检索增强的摘要,并生成用于重新运行选定实验的容器化脚本。对50篇人工标注论文的定量评估表明,相关性分类、超参数提取和引文识别的F1得分均在0.85以上。在多达1000篇论文的语料库上的实验表明,使用8个CPU工作进程可实现接近线性的可扩展性。三个案例研究(WikiText-2上的AWD-LSTM、WikiText-103上的Transformer-XL和Lakh MIDI数据集上的自回归音乐模型)证实了所提取的设置支持忠实复现,测试困惑度与原始报告相差在1%-3%以内。
摘要:The accelerating pace of research on autoregressive generative models has produced thousands of papers, making manual literature surveys and reproduction studies increasingly impractical. We present a fully open-source, reproducible pipeline that automatically retrieves candidate documents from public repositories, filters them for relevance, extracts metadata, hyper-parameters and reported results, clusters topics, produces retrieval-augmented summaries and generates containerised scripts for re-running selected experiments. Quantitative evaluation on 50 manually-annotated papers shows F1 scores above 0.85 for relevance classification, hyper-parameter extraction and citation identification. Experiments on corpora of up to 1000 papers demonstrate near-linear scalability with eight CPU workers. Three case studies -- AWD-LSTM on WikiText-2, Transformer-XL on WikiText-103 and an autoregressive music model on the Lakh MIDI dataset -- confirm that the extracted settings support faithful reproduction, achieving test perplexities within 1--3% of the original reports.


【4】Multitask Learning with Stochastic Interpolants
标题:使用随机插值的多任务学习
链接:https://arxiv.org/abs/2508.04605

作者:el, Florentin Coeurdoux, Michael S. Albergo, Eric Vanden-Eijnden
摘要:我们提出了一个学习概率分布之间映射的框架,它广泛地推广了流模型和扩散模型的时间动力学。为此,我们通过将标量时间变量替换为向量、矩阵或线性算子来推广随机插值,使我们能够跨多个维度的空间桥接概率分布。这种方法能够构建通用的生成模型,无需针对特定任务训练即可完成多种任务。我们的基于算子的插值不仅为现有生成模型提供了统一的理论视角,还扩展了它们的能力。通过数值实验,我们证明了我们的方法在条件生成与图像修复、微调与后验采样以及多尺度建模上的zero-shot功效,表明它有潜力成为专用模型的通用、任务无关的替代方案。
摘要:We propose a framework for learning maps between probability distributions that broadly generalizes the time dynamics of flow and diffusion models. To enable this, we generalize stochastic interpolants by replacing the scalar time variable with vectors, matrices, or linear operators, allowing us to bridge probability distributions across multiple dimensional spaces. This approach enables the construction of versatile generative models capable of fulfilling multiple tasks without task-specific training. Our operator-based interpolants not only provide a unifying theoretical perspective for existing generative models but also extend their capabilities. Through numerical experiments, we demonstrate the zero-shot efficacy of our method on conditional generation and inpainting, fine-tuning and posterior sampling, and multiscale modeling, suggesting its potential as a generic task-agnostic alternative to specialized models.
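
用公式补充"将标量时间推广为算子"的含义:标准随机插值以标量系数桥接两个分布,推广后系数变为矩阵/线性算子(下式为按摘要描述整理的示意写法):

```latex
% 标准随机插值: 标量时间 t \in [0,1]
x_t = \alpha(t)\, x_0 + \beta(t)\, x_1, \qquad \alpha(0)=\beta(1)=1,\ \alpha(1)=\beta(0)=0
% 推广(示意): 以矩阵/线性算子 T 替代标量 t, 系数 A, B 为算子值
x_T = A(T)\, x_0 + B(T)\, x_1
```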


【5】Improved Training Strategies for Physics-Informed Neural Networks using Real Experimental Data in Aluminum Spot Welding
标题:使用铝点焊接中的真实实验数据改进物理信息神经网络的训练策略
链接:https://arxiv.org/abs/2508.04595

作者:k, Christian Weißenfels
摘要:电阻点焊是汽车行业白车身的主要连接工艺,其中焊核直径是关键的质量指标。焊核直径的测量需要破坏性测试,限制了高效质量控制的潜力。物理信息神经网络被视为一种有前途的工具,可从实验数据中重建内部过程状态,从而在铝点焊中实现基于模型的非侵入式质量评估。一个主要挑战在于,由于优化目标相互竞争,难以将真实实验数据整合进网络。为了解决这个问题,我们引入了两种新的训练策略。首先,通过一个淡入函数逐步引入动态位移和熔核直径的实验损失项,以防止过度的优化冲突;我们还实现了自定义学习率调度器和基于滚动窗口的提前停止,以抵消因损失幅度增大而导致的过早缩减。其次,我们通过查找表引入温度相关材料参数的条件更新,仅在损失达到阈值后激活,以确保温度在物理上有意义。为了在保证计算效率的前提下准确描述焊接过程,选择了轴对称二维模型。为了减少计算负担,先在一维中系统评估训练策略和模型组件,从而对损失设计和接触模型进行受控分析。二维网络对动态位移和熔核生长的预测落在实验置信区间内,支持将焊接阶段研究从钢迁移到铝,并展示了在工业应用中进行快速、基于模型的质量控制的强大潜力。
摘要:Resistance spot welding is the dominant joining process for the body-in-white in the automotive industry, where the weld nugget diameter is the key quality metric. Its measurement requires destructive testing, limiting the potential for efficient quality control. Physics-informed neural networks were investigated as a promising tool to reconstruct internal process states from experimental data, enabling model-based and non-invasive quality assessment in aluminum spot welding. A major challenge is the integration of real-world data into the network due to competing optimization objectives. To address this, we introduce two novel training strategies. First, experimental losses for dynamic displacement and nugget diameter are progressively included using a fading-in function to prevent excessive optimization conflicts. We also implement a custom learning rate scheduler and early stopping based on a rolling window to counteract premature reduction due to increased loss magnitudes. Second, we introduce a conditional update of temperature-dependent material parameters via a look-up table, activated only after a loss threshold is reached to ensure physically meaningful temperatures. An axially symmetric two-dimensional model was selected to represent the welding process accurately while maintaining computational efficiency. To reduce computational burden, the training strategies and model components were first systematically evaluated in one dimension, enabling controlled analysis of loss design and contact models. The two-dimensional network predicts dynamic displacement and nugget growth within the experimental confidence interval, supports transferring welding stages from steel to aluminum, and demonstrates strong potential for fast, model-based quality control in industrial applications.
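
训练策略一中的"淡入函数"可写成随训练步数从0平滑升到1的权重,乘在实验数据损失项上;具体函数形状与步数区间均为假设,下面用smoothstep示意:

```python
import numpy as np

def fade_in(step, start=2000, width=3000):
    """在 [start, start+width] 区间内将实验损失权重从 0 平滑提升到 1."""
    u = np.clip((step - start) / width, 0.0, 1.0)
    return u * u * (3.0 - 2.0 * u)               # smoothstep: 两端导数为 0, 过渡平滑

# 用法示意: total = loss_pde + fade_in(step) * (loss_disp + loss_nugget)
```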


【6】Algebraically Observable Physics-Informed Neural Network and its Application to Epidemiological Modelling
标题:代数可观察物理信息神经网络及其在流行病学建模中的应用
链接:https://arxiv.org/abs/2508.04590

作者:matsu
摘要:物理信息神经网络(PINN)是一种将数据背后的控制方程集成到损失函数中的深度学习框架。在这项研究中,我们考虑使用PINN估计由常微分方程描述的流行病模型的状态变量和参数的问题。在实践中,并非所有与模型所描述群体相对应的轨迹数据都可以被测量;仅利用部分测量来训练PINN以估计不可测状态变量和流行病学参数具有挑战性。为此,我们引入了状态变量代数可观性的概念,并提出基于代数可观性分析来增广未测数据。在流行病学建模的背景下,通过三种场景下的数值实验证明了所提方法的有效性。具体而言,在噪声和部分测量条件下,所提方法对未测状态和参数的估计精度被证明高于传统方法。所提方法在实际场景中也被证明是有效的,例如当某些变量对应的数据无法从测量中重建时。
摘要:Physics-Informed Neural Network (PINN) is a deep learning framework that integrates the governing equations underlying data into a loss function. In this study, we consider the problem of estimating state variables and parameters in epidemiological models governed by ordinary differential equations using PINNs. In practice, not all trajectory data corresponding to the population described by models can be measured. Learning PINNs to estimate the unmeasured state variables and epidemiological parameters using partial measurements is challenging.   Accordingly, we introduce the concept of algebraic observability of the state variables. Specifically, we propose augmenting the unmeasured data based on algebraic observability analysis. The validity of the proposed method is demonstrated through numerical experiments under three scenarios in the context of epidemiological modelling. Specifically, given noisy and partial measurements, the accuracy of unmeasured states and parameter estimation of the proposed method is shown to be higher than that of the conventional methods. The proposed method is also shown to be effective in practical scenarios, such as when the data corresponding to certain variables cannot be reconstructed from the measurements.


【7】Zero-Residual Concept Erasure via Progressive Alignment in Text-to-Image Model
标题:通过文本到图像模型中的渐进对齐实现零残差概念擦除
链接:https://arxiv.org/abs/2508.04472

作者:en, Zhen Wang, Taoran Mei, Lin Li, Bowei Zhu, Runshi Li, Long Chen
摘要:概念擦除旨在防止预训练的文本到图像模型生成与语义有害概念(即目标概念)相关联的内容,正日益受到关注。最先进的方法将此任务表述为优化问题:它们将所有目标概念与语义无害的锚概念对齐,并应用闭式解相应地更新模型。虽然这些闭式方法效率很高,但我们认为现有方法存在两个被忽视的局限:1)由于"非零对齐残差",它们往往导致不完全擦除,尤其是当文本提示相对复杂时;2)由于总是将参数更新集中在少数深层,它们可能遭受生成质量下降。为了解决这些问题,我们提出了一种新的闭式方法ErasePro:它专为更彻底的概念擦除和更好地保持整体生成质量而设计。具体而言,ErasePro首先在优化目标中引入严格的零残差约束,确保目标概念与锚概念特征之间的完美对齐,从而实现更彻底的擦除。其次,它采用渐进的逐层更新策略,从浅层到深层逐步将目标概念特征迁移到锚概念特征。随着深度增加,所需的参数变化减小,从而减少敏感深层中的偏差并保持生成质量。在不同概念擦除任务(包括实例、艺术风格和裸体擦除)上的实证结果证明了ErasePro的有效性。
摘要 :Concept Erasure, which aims to prevent pretrained text-to-image models from generating content associated with semantic-harmful concepts (i.e., target concepts), is getting increased attention. State-of-the-art methods formulate this task as an optimization problem: they align all target concepts with semantic-harmless anchor concepts, and apply closed-form solutions to update the model accordingly. While these closed-form methods are efficient, we argue that existing methods have two overlooked limitations: 1) They often result in incomplete erasure due to "non-zero alignment residual", especially when text prompts are relatively complex. 2) They may suffer from generation quality degradation as they always concentrate parameter updates in a few deep layers. To address these issues, we propose a novel closed-form method ErasePro: it is designed for more complete concept erasure and better preserving overall generative quality. Specifically, ErasePro first introduces a strict zero-residual constraint into the optimization objective, ensuring perfect alignment between target and anchor concept features and enabling more complete erasure. Secondly, it employs a progressive, layer-wise update strategy that gradually transfers target concept features to those of the anchor concept from shallow to deep layers. As the depth increases, the required parameter changes diminish, thereby reducing deviations in sensitive deep layers and preserving generative quality. Empirical results across different concept erasure tasks (including instance, art style, and nudity erasure) have demonstrated the effectiveness of our ErasePro.


【8】Decoding the Multimodal Maze: A Systematic Review on the Adoption of Explainability in Multimodal Attention-based Models
标题:解码多模式迷宫:对基于注意力的多模式模型中可解释性采用的系统回顾
链接:https://arxiv.org/abs/2508.04427

作者: Kibria, Sébastien Lafond, Janan Arslan
摘要:近年来,多模态学习取得了显著进步,特别是与基于注意力的模型相结合,在各类任务中带来了显著的性能增益。与此同时,对可解释人工智能(XAI)的需求也催生了越来越多旨在解释这些模型复杂决策过程的研究。这篇系统性文献综述分析了2020年1月至2024年初发表的、聚焦多模态模型可解释性的研究。在XAI更广泛目标的框架内,我们从多个维度考察文献,包括模型架构、所涉及的模态、解释算法和评估方法。我们的分析表明,大多数研究集中于视觉-语言模型和纯语言模型,基于注意力的技术是最常用的解释手段。然而,这些方法通常无法捕捉模态之间的全部交互,而跨领域的架构异构性进一步加剧了这一挑战。重要的是,我们发现多模态环境下XAI的评估方法在很大程度上缺乏系统性,在一致性、鲁棒性以及对特定模态的认知和上下文因素的考虑方面均有欠缺。基于这些发现,我们给出了一套全面的建议,旨在推动多模态XAI研究中严格、透明和标准化的评估与报告实践。我们的目标是支持未来以可解释性为核心、更可解释、更可问责和更负责任的多模态AI系统研究。
摘要:Multimodal learning has witnessed remarkable advancements in recent years, particularly with the integration of attention-based models, leading to significant performance gains across a variety of tasks. Parallel to this progress, the demand for explainable artificial intelligence (XAI) has spurred a growing body of research aimed at interpreting the complex decision-making processes of these models. This systematic literature review analyzes research published between January 2020 and early 2024 that focuses on the explainability of multimodal models. Framed within the broader goals of XAI, we examine the literature across multiple dimensions, including model architecture, modalities involved, explanation algorithms and evaluation methodologies. Our analysis reveals that the majority of studies are concentrated on vision-language and language-only models, with attention-based techniques being the most commonly employed for explanation. However, these methods often fall short in capturing the full spectrum of interactions between modalities, a challenge further compounded by the architectural heterogeneity across domains. Importantly, we find that evaluation methods for XAI in multimodal settings are largely non-systematic, lacking consistency, robustness, and consideration for modality-specific cognitive and contextual factors. Based on these findings, we provide a comprehensive set of recommendations aimed at promoting rigorous, transparent, and standardized evaluation and reporting practices in multimodal XAI research. Our goal is to support future research in more interpretable, accountable, and responsible multimodal AI systems, with explainability at their core.


【9】VisionTS++: Cross-Modal Time Series Foundation Model with Continual Pre-trained Visual Backbones
标题:VisionTS++:具有连续预训练视觉主干的跨模式时间序列基础模型
链接:https://arxiv.org/abs/2508.04379

作者:n, Mouxiang Chen, Xu Liu, Han Fu, Xiaoxue Ren, Jianling Sun, Zhuo Li, Chenghao Liu
备注:21 pages
摘要:最近的研究表明,在图像上预训练的视觉模型可以通过将预测重新定义为图像重建任务来在时间序列预测中表现良好,这表明它们具有作为通用时间序列基础模型的潜力。然而,由于三个关键差异,从视觉到时间序列的有效跨模态转换仍然具有挑战性:(1)结构化的有界图像数据和无界的异构时间序列之间的数据模态差距;(2)标准的基于RGB三通道的视觉模型与建模任意变量数时间序列的需求之间的多变量预测差距;以及(3)大多数视觉模型的确定性输出格式与不确定性感知概率预测需求之间的概率预测差距。为了弥补这些差距,我们提出了VisionTS++,这是一种基于视觉模型的TSFM,可以对大规模时间序列数据集进行持续的预训练,包括3项创新:(1)基于视觉模型的过滤机制,用于识别高质量的时间序列数据,从而减轻模态差距并提高预训练稳定性;(2)彩色多元转换方法,其将多元时间序列转换为多子图RGB图像,从而捕获复杂的变量间依赖性;以及(3)多分位数预测方法,其使用并行重构头来生成不同分位数水平的预测,从而更灵活地近似任意输出分布,而无需限制性的先验分布假设。在分布内和分布外TSF基准测试中,VisionTS++均达到SOTA结果,在MSE降低方面优于专门的TSFMs 6%-44%,并在12个概率预测设置中的9个中排名第一。我们的工作为跨模态知识转移建立了一个新的范式,推动了通用TSFMs的发展。
摘要:Recent studies have revealed that vision models pre-trained on images can perform well in time series forecasting by reformulating forecasting as an image reconstruction task, suggesting their potential as universal time series foundation models. However, effective cross-modal transfer from vision to time series remains challenging due to three key discrepancies: (1) data-modality gap between structured, bounded image data and unbounded, heterogeneous time series; (2) multivariate-forecasting gap between standard RGB three-channel-based vision models and the need to model time series with arbitrary numbers of variates; and (3) probabilistic-forecasting gap between the deterministic output formats of most vision models and the requirement for uncertainty-aware probabilistic predictions. To bridge these gaps, we propose VisionTS++, a vision-model-based TSFM that performs continual pre-training on large-scale time series datasets, including 3 innovations: (1) a vision-model-based filtering mechanism to identify high-quality time series data, thereby mitigating modality gap and improving pre-training stability, (2) a colorized multivariate conversion method that transforms multivariate time series into multi-subfigure RGB images, capturing complex inter-variate dependencies; and (3) a multi-quantile forecasting approach using parallel reconstruction heads to generate forecasts of different quantile levels, thus more flexibly approximating arbitrary output distributions without restrictive prior distributional assumptions. Evaluated on both in-distribution and out-of-distribution TSF benchmarks, VisionTS++ achieves SOTA results, outperforming specialized TSFMs by 6%-44% in MSE reduction and ranking first in 9 out of 12 probabilistic forecasting settings. Our work establishes a new paradigm for cross-modal knowledge transfer, advancing the development of universal TSFMs.


【10】Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting
标题:VLM的持续学习:超越忘记的调查和分类
链接:https://arxiv.org/abs/2508.04227

作者:u, Qiuhe Hong, Linlan Huang, Alexandra Gomez-Villa, Dipam Goswami, Xialei Liu, Joost van de Weijer, Yonghong Tian
摘要:视觉语言模型(VLM)通过利用大规模预训练,在各种多模态任务中取得了令人印象深刻的性能。然而,使它们能够从非平稳数据中持续学习仍是重大挑战,因为其跨模态对齐和泛化能力特别容易受到灾难性遗忘的影响。与传统的单模态持续学习(CL)不同,VLM面临着独特的挑战,如跨模态特征漂移、共享架构引起的参数干扰,以及zero-shot能力的侵蚀。这篇综述首次对VLM的持续学习(VLM-CL)进行了聚焦而系统的回顾。我们首先确定了降低VLM-CL性能的三种核心失效模式;在此基础上,我们提出了一个挑战驱动的分类法,将解决方案映射到其目标问题:(1)多模态重放策略通过显式或隐式记忆机制解决跨模态漂移;(2)跨模态正则化在更新过程中保持模态对齐;(3)参数高效适应以模块化或低秩更新减轻参数干扰。我们进一步分析了当前的评估协议、数据集和指标,强调需要能够捕捉VLM特有遗忘和组合泛化的更好基准。最后,我们概述了开放问题和未来方向,包括持续预训练和组合zero-shot学习。这篇综述旨在为开发终身视觉语言系统的研究人员提供全面的诊断性参考。所有资源可在https://github.com/YuyangSunshine/Awesome-Continual-learning-of-Vision-Language-Models上获得。
摘要:Vision-language models (VLMs) have achieved impressive performance across diverse multimodal tasks by leveraging large-scale pre-training. However, enabling them to learn continually from non-stationary data remains a major challenge, as their cross-modal alignment and generalization capabilities are particularly vulnerable to catastrophic forgetting. Unlike traditional unimodal continual learning (CL), VLMs face unique challenges such as cross-modal feature drift, parameter interference due to shared architectures, and zero-shot capability erosion. This survey offers the first focused and systematic review of continual learning for VLMs (VLM-CL). We begin by identifying the three core failure modes that degrade performance in VLM-CL. Based on these, we propose a challenge-driven taxonomy that maps solutions to their target problems: (1) Multi-Modal Replay Strategies address cross-modal drift through explicit or implicit memory mechanisms; (2) Cross-Modal Regularization preserves modality alignment during updates; and (3) Parameter-Efficient Adaptation mitigates parameter interference with modular or low-rank updates. We further analyze current evaluation protocols, datasets, and metrics, highlighting the need for better benchmarks that capture VLM-specific forgetting and compositional generalization. Finally, we outline open problems and future directions, including continual pre-training and compositional zero-shot learning. This survey aims to serve as a comprehensive and diagnostic reference for researchers developing lifelong vision-language systems. All resources are available at: https://github.com/YuyangSunshine/Awesome-Continual-learning-of-Vision-Language-Models.


【11】NVSpeech: An Integrated and Scalable Pipeline for Human-Like Speech Modeling with Paralinguistic Vocalizations
标题:NVSpeech:一个集成且可扩展的基于副语言发声的类人语音建模管道
链接:https://arxiv.org/abs/2508.04195

作者:, Qinke Ni, Yuancheng Wang, Yiheng Lu, Haoyue Zhan, Pengyuan Xie, Qiang Zhang, Zhizheng Wu
摘要:副语言发声,包括笑声和呼吸等非语言声音,以及"嗯"和"哦"等词汇化感叹词,是自然口语交流的组成部分。尽管它们在传达情感、意图和交互线索方面很重要,但在传统的自动语音识别(ASR)和文本到语音(TTS)系统中,这些线索在很大程度上被忽视。我们提出NVSpeech,一个集成且可扩展的管道,打通副语言发声的识别与合成,涵盖数据集构建、ASR建模和可控TTS。(1)我们引入一个人工标注的数据集,包含48,430条人类口语话语,覆盖18个词级副语言类别。(2)我们开发了副语言感知的ASR模型,将副语言线索视为内联可解码标记(例如,"你真有趣[笑声]"),实现词汇内容与非语言内容的联合转录。随后用该模型自动标注一个大型语料库,得到首个大规模中文数据集,含174,179条话语(573小时),带有词级对齐和副语言线索。(3)我们在人工和自动标注数据上微调零样本TTS模型,实现对副语言发声的显式控制,允许在任意标记位置进行上下文感知插入,以合成类人语音。通过统一副语言发声的识别与生成,NVSpeech提供了首个开放、大规模、词级标注的普通话表达性语音建模管道,以可扩展且可控的方式集成识别与合成。数据集和音频演示可在https://nvspeech170k.github.io/上获得。
摘要:Paralinguistic vocalizations-including non-verbal sounds like laughter and breathing, as well as lexicalized interjections such as "uhm" and "oh"-are integral to natural spoken communication. Despite their importance in conveying affect, intent, and interactional cues, such cues remain largely overlooked in conventional automatic speech recognition (ASR) and text-to-speech (TTS) systems. We present NVSpeech, an integrated and scalable pipeline that bridges the recognition and synthesis of paralinguistic vocalizations, encompassing dataset construction, ASR modeling, and controllable TTS. (1) We introduce a manually annotated dataset of 48,430 human-spoken utterances with 18 word-level paralinguistic categories. (2) We develop the paralinguistic-aware ASR model, which treats paralinguistic cues as inline decodable tokens (e.g., "You're so funny [Laughter]"), enabling joint lexical and non-verbal transcription. This model is then used to automatically annotate a large corpus, the first large-scale Chinese dataset of 174,179 utterances (573 hours) with word-level alignment and paralinguistic cues. (3) We finetune zero-shot TTS models on both human- and auto-labeled data to enable explicit control over paralinguistic vocalizations, allowing context-aware insertion at arbitrary token positions for human-like speech synthesis. By unifying the recognition and generation of paralinguistic vocalizations, NVSpeech offers the first open, large-scale, word-level annotated pipeline for expressive speech modeling in Mandarin, integrating recognition and synthesis in a scalable and controllable manner. Dataset and audio demos are available at https://nvspeech170k.github.io/.


【12】Neural Network Training via Stochastic Alternating Minimization with Trainable Step Sizes
标题:通过随机交替最小化和可训练步骤大小进行神经网络训练
链接:https://arxiv.org/abs/2508.04193

作者:g Yan, Jiawei Xu, Zheng Peng, Qingsong Wang
摘要:深度神经网络的训练本质上是一个非凸优化问题,而随机梯度下降(SGD)等标准方法需要同时更新所有参数,常常导致收敛不稳定和计算成本高。为了解决这些问题,我们提出一种新方法:可训练步长随机交替最小化(SAMT)。它将每一层的权重视为一个块,以交替方式更新网络参数。通过将整体优化分解为对应不同块的子问题,这种按块交替策略降低了每步的计算开销,并增强了非凸设定下的训练稳定性。为充分发挥这些优势,我们受元学习启发,提出一种融入交替更新子问题求解步骤的新型自适应步长策略。它支持多种类型的可训练步长,包括但不限于标量、逐元素、逐行和逐列步长,可通过元学习为每个块自适应地选择步长。我们进一步为所提算法给出理论收敛性保证,确立了其优化上的可靠性。在多个基准上的大量实验表明,与最先进的方法相比,SAMT以更少的参数更新获得了更好的泛化性能,凸显了其在神经网络优化中的有效性与潜力。
摘要:The training of deep neural networks is inherently a nonconvex optimization problem, yet standard approaches such as stochastic gradient descent (SGD) require simultaneous updates to all parameters, often leading to unstable convergence and high computational cost. To address these issues, we propose a novel method, Stochastic Alternating Minimization with Trainable Step Sizes (SAMT), which updates network parameters in an alternating manner by treating the weights of each layer as a block. By decomposing the overall optimization into sub-problems corresponding to different blocks, this block-wise alternating strategy reduces per-step computational overhead and enhances training stability in nonconvex settings. To fully leverage these benefits, inspired by meta-learning, we propose a novel adaptive step size strategy that is incorporated into the sub-problem solving steps of alternating updates. It supports different types of trainable step sizes, including but not limited to scalar, element-wise, row-wise, and column-wise, enabling adaptive step size selection tailored to each block via meta-learning. We further provide a theoretical convergence guarantee for the proposed algorithm, establishing its optimization soundness. Extensive experiments on multiple benchmarks demonstrate that SAMT achieves better generalization performance with fewer parameter updates compared to state-of-the-art methods, highlighting its effectiveness and potential in neural network optimization.
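代码示意:下面用纯NumPy在一个两块的最小二乘问题上演示"按块交替更新 + 可训练步长"的核心思想,其中每块的标量步长用超梯度规则在线调整。这只是带假设的玩具草图(问题、规模与步长更新规则均为示意),并非论文中SAMT的实际算法或其元学习设置。
```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=n)

blocks = [slice(0, 10), slice(10, 20)]  # two parameter "blocks" (like two layers)
w = np.zeros(d)
eta = [1e-3, 1e-3]                      # per-block trainable scalar step sizes
beta = 1e-3                             # meta learning rate for the step sizes
prev_g = [None, None]

def grad(w):  # gradient of the mean squared loss 0.5/n * ||Xw - y||^2
    return X.T @ (X @ w - y) / n

for it in range(300):
    for b, sl in enumerate(blocks):     # alternate over the blocks
        g = grad(w)[sl]
        if prev_g[b] is not None:
            # hypergradient rule: grow the step while successive gradients
            # agree, shrink it when they start to oscillate
            eta[b] = max(eta[b] + beta * float(g @ prev_g[b]), 1e-6)
        w[sl] -= eta[b] * g             # update only this block
        prev_g[b] = g

print("final loss:", 0.5 * np.mean((X @ w - y) ** 2), "step sizes:", eta)
```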


【13】A Comparative Survey of PyTorch vs TensorFlow for Deep Learning: Usability, Performance, and Deployment Trade-offs
标题:PyTorch与TensorFlow用于深度学习的比较调查:可用性、性能和部署权衡
链接:https://arxiv.org/abs/2508.04035

作者:Ba Alawi
备注:14 pages, 15 figures, 43 references
摘要:本文对TensorFlow和PyTorch这两个领先的深度学习框架进行了全面的比较调查,重点关注它们的可用性、性能和部署权衡。我们回顾了每个框架的编程范式和开发者体验,对比了TensorFlow基于图(现可选即时执行)的方式与PyTorch动态的Python风格。随后,我们借助近期的基准测试和研究,比较了多种任务和数据规模下的模型训练速度与推理性能。我们深入考察了部署灵活性:从TensorFlow成熟的生态系统(面向移动/嵌入式的TensorFlow Lite、TensorFlow Serving及JavaScript支持)到PyTorch较新的生产工具(TorchScript编译、ONNX导出和TorchServe)。我们还调查了生态系统和社区支持,包括库集成、行业采用和研究趋势(例如,PyTorch在近期研究论文中的主导地位,与TensorFlow在企业中更完善的工具链)。文中讨论了计算机视觉、自然语言处理及其他领域的应用,以说明每个框架在实践中的用法。最后,我们概述了深度学习框架设计的未来方向和开放挑战,例如统一即时与图执行、改进跨框架互操作性,以及集成编译器优化(XLA、JIT)以提升速度。我们的研究结果表明,虽然两个框架都完全能胜任最先进的深度学习,但它们表现出明显的权衡:PyTorch提供研究中青睐的简单性和灵活性,而TensorFlow提供更完整的生产就绪生态系统;理解这些权衡是从业者选择合适工具的关键。我们提供了图表、代码片段,以及20多篇学术论文和官方文档的参考文献来支持这一比较分析。
摘要 :This paper presents a comprehensive comparative survey of TensorFlow and PyTorch, the two leading deep learning frameworks, focusing on their usability, performance, and deployment trade-offs. We review each framework's programming paradigm and developer experience, contrasting TensorFlow's graph-based (now optionally eager) approach with PyTorch's dynamic, Pythonic style. We then compare model training speeds and inference performance across multiple tasks and data regimes, drawing on recent benchmarks and studies. Deployment flexibility is examined in depth - from TensorFlow's mature ecosystem (TensorFlow Lite for mobile/embedded, TensorFlow Serving, and JavaScript support) to PyTorch's newer production tools (TorchScript compilation, ONNX export, and TorchServe). We also survey ecosystem and community support, including library integrations, industry adoption, and research trends (e.g., PyTorch's dominance in recent research publications versus TensorFlow's broader tooling in enterprise). Applications in computer vision, natural language processing, and other domains are discussed to illustrate how each framework is used in practice. Finally, we outline future directions and open challenges in deep learning framework design, such as unifying eager and graph execution, improving cross-framework interoperability, and integrating compiler optimizations (XLA, JIT) for improved speed. Our findings indicate that while both frameworks are highly capable for state-of-the-art deep learning, they exhibit distinct trade-offs: PyTorch offers simplicity and flexibility favored in research, whereas TensorFlow provides a fuller production-ready ecosystem - understanding these trade-offs is key for practitioners selecting the appropriate tool. We include charts, code snippets, and more than 20 references to academic papers and official documentation to support this comparative analysis
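代码示意:为直观对比两种编程范式,下面分别给出PyTorch动态即时执行与TensorFlow经由tf.function的可选图编译的最小单步训练片段(仅为示意,非论文原文代码;需同时安装两个框架才能运行)。
```python
# PyTorch: dynamic (eager) execution -- Python control flow runs on every call.
import torch

model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 4), torch.randn(8, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
opt.step()

# TensorFlow: tf.function traces the Python function once into a static
# graph; subsequent calls execute the compiled graph.
import tensorflow as tf

dense = tf.keras.layers.Dense(1)
opt_tf = tf.keras.optimizers.SGD(learning_rate=0.1)

@tf.function
def train_step(xb, yb):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean((dense(xb) - yb) ** 2)
    grads = tape.gradient(loss, dense.trainable_variables)
    opt_tf.apply_gradients(zip(grads, dense.trainable_variables))
    return loss

train_step(tf.random.normal((8, 4)), tf.random.normal((8, 1)))
```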


【14】Intelligent Sampling of Extreme-Scale Turbulence Datasets for Accurate and Efficient Spatiotemporal Model Training
标题:极端尺度湍流数据集的智能采样,以实现准确有效的时空模型训练
链接:https://arxiv.org/abs/2508.03872

作者:ewer, Murali Meena Gopalakrishnan, Matthias Maiterth, Aditya Kashi, Jong Youl Choi, Pei Zhang, Stephen Nichols, Riccardo Balin, Miles Couchman, Stephen de Bruyn Kops, P.K. Yeung, Daniel Dotson, Rohini Uma-Vaideswaran, Sarp Oral, Feiyi Wang
备注:14 pages, 8 figures, 2 tables
摘要:随着摩尔定律和Dennard缩放的终结,有效的训练越来越需要重新考虑数据量。我们能否通过智能子采样,用更少的数据训练出更好的模型?为了探索这一点,我们开发了SICKLE,这是一个用于高效学习的稀疏智能策展框架,具有新颖的最大熵(MaxEnt)采样方法,可扩展的训练和能量基准测试。我们比较MaxEnt随机和相空间采样的大型直接数值模拟(DNS)湍流数据集。在Frontier上对SICKLE进行大规模评估,我们发现子采样作为预处理步骤可以提高模型精度并大幅降低能耗,在某些情况下可以观察到高达38倍的减少。
摘要:With the end of Moore's law and Dennard scaling, efficient training increasingly requires rethinking data volume. Can we train better models with significantly less data via intelligent subsampling? To explore this, we develop SICKLE, a sparse intelligent curation framework for efficient learning, featuring a novel maximum entropy (MaxEnt) sampling approach, scalable training, and energy benchmarking. We compare MaxEnt with random and phase-space sampling on large direct numerical simulation (DNS) datasets of turbulence. Evaluating SICKLE at scale on Frontier, we show that subsampling as a preprocessing step can improve model accuracy and substantially lower energy consumption, with reductions of up to 38x observed in certain cases.
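代码示意:下面用一维玩具数据演示"最大熵子采样"的一种常见实现思路:先对取值分箱,再按所在箱频数的倒数加权抽样,使子样本的分箱分布更平坦、熵更高。这只是概念演示;论文中SICKLE的MaxEnt采样作用于大规模湍流DNS数据,具体算法与此不同。
```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=100_000)           # stand-in for turbulence field values

# A maximum-entropy subsample flattens the value histogram, so weight each
# sample inversely to the population of its bin.
counts, edges = np.histogram(data, bins=64)
bin_idx = np.digitize(data, edges[1:-1])  # bin index in 0..63 for every sample
weights = 1.0 / counts[bin_idx]
weights /= weights.sum()

subset = rng.choice(len(data), size=5_000, replace=False, p=weights)
sub_counts, _ = np.histogram(data[subset], bins=edges)

def entropy(c):
    p = c / c.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

print(f"bin entropy: full={entropy(counts):.3f}, maxent subsample={entropy(sub_counts):.3f}")
```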


【15】VAE-DNN: Energy-Efficient Trainable-by-Parts Surrogate Model For Parametric Partial Differential Equations
标题:VAE-DNN:参数偏微方程的节能逐部分训练替代模型
链接:https://arxiv.org/abs/2508.03839

作者:g, Alexandre M. Tartakovsky
摘要:我们提出一个可分部训练(trainable-by-parts)的代理模型,用于求解正向与反向的参数化非线性偏微分方程。与其他一些代理模型和算子学习模型一样,所提方法采用编码器将高维输入$y(\bm{x})$压缩到低维潜空间$\bm\mu_{\bm\phi_y}$,然后使用全连接神经网络将$\bm\mu_{\bm\phi_y}$映射到PDE解$h(\bm{x},t)$的潜空间$\bm\mu_{\bm\phi_h}$,最后利用解码器重构$h(\bm{x},t)$。我们模型的创新之处在于其三个组成部分可以独立训练。与FNO和DeepONet等领先的算子学习模型相比,这种方法大幅减少了训练所需的时间和能耗。可分离训练的实现方式是:将编码器作为$y(\bm{x})$的变分自动编码器(VAE)的一部分训练,将解码器作为$h(\bm{x},t)$的VAE的一部分训练。我们将此模型称为VAE-DNN。我们在控制非承压含水层地下水流动的非线性扩散方程的正问题和逆问题求解上,将VAE-DNN与FNO和DeepONet模型进行了比较。结果表明,与FNO和DeepONet相比,VAE-DNN不仅效率更高,而且在正问题和逆问题求解中都具有更高的精度。
摘要:We propose a trainable-by-parts surrogate model for solving forward and inverse parameterized nonlinear partial differential equations. Like several other surrogate and operator learning models, the proposed approach employs an encoder to reduce the high-dimensional input $y(\bm{x})$ to a lower-dimensional latent space, $\bm\mu_{\bm\phi_y}$. Then, a fully connected neural network is used to map $\bm\mu_{\bm\phi_y}$ to the latent space, $\bm\mu_{\bm\phi_h}$, of the PDE solution $h(\bm{x},t)$. Finally, a decoder is utilized to reconstruct $h(\bm{x},t)$. The innovative aspect of our model is its ability to train its three components independently. This approach leads to a substantial decrease in both the time and energy required for training when compared to leading operator learning models such as FNO and DeepONet. The separable training is achieved by training the encoder as part of the variational autoencoder (VAE) for $y(\bm{x})$ and the decoder as part of the $h(\bm{x},t)$ VAE. We refer to this model as the VAE-DNN model. VAE-DNN is compared to the FNO and DeepONet models for obtaining forward and inverse solutions to the nonlinear diffusion equation governing groundwater flow in an unconfined aquifer. Our findings indicate that VAE-DNN not only demonstrates greater efficiency but also delivers superior accuracy in both forward and inverse solutions compared to the FNO and DeepONet models.
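代码示意:下面是按"三部分可分训练"思路组织的极简PyTorch草图:先分别训练$y$的VAE与$h$的VAE,再在冻结两者的前提下学习潜空间到潜空间的全连接映射。数据、网络规模与超参数均为示意假设,并非论文的实际设置。
```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, dim, latent):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 2 * latent))
        self.dec = nn.Sequential(nn.Linear(latent, 64), nn.ReLU(), nn.Linear(64, dim))
    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        return self.dec(z), mu, logvar

def train_vae(vae, data, steps=200):
    opt = torch.optim.Adam(vae.parameters(), lr=1e-3)
    for _ in range(steps):
        recon, mu, logvar = vae(data)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        loss = ((recon - data) ** 2).sum(-1).mean() + 1e-3 * kl
        opt.zero_grad(); loss.backward(); opt.step()

y = torch.randn(512, 100)                  # hypothetical parameter fields y(x)
h = torch.tanh(y @ torch.randn(100, 80))   # hypothetical PDE solutions h(x,t)

vae_y, vae_h = VAE(100, 8), VAE(80, 8)
train_vae(vae_y, y)   # part 1: the y-VAE supplies the encoder
train_vae(vae_h, h)   # part 2: the h-VAE supplies the decoder

# part 3: latent-to-latent map, with both VAEs frozen
with torch.no_grad():
    mu_y = vae_y.enc(y).chunk(2, -1)[0]
    mu_h = vae_h.enc(h).chunk(2, -1)[0]
mlp = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 8))
opt = torch.optim.Adam(mlp.parameters(), lr=1e-3)
for _ in range(200):
    loss = ((mlp(mu_y) - mu_h) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():
    h_pred = vae_h.dec(mlp(mu_y))          # surrogate forward solve
print(h_pred.shape)
```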


【16】Development of management systems using artificial intelligence systems and machine learning methods for boards of directors (preprint, unofficial translation)
标题:为董事会开发使用人工智能系统和机器学习方法的管理系统(预印本,非正式翻译)
链接:https://arxiv.org/abs/2508.03769

作者:nova
摘要:该研究解决了企业管理中的范式转变,人工智能正在从决策支持工具转变为自主决策者,一些人工智能系统已经被任命为公司的领导角色。我们发现的一个核心问题是,人工智能技术的发展速度远远超过了制定适当的法律和道德准则的速度。   该研究为企业管理中自主AI系统的开发和实施提出了一个“参考模型”。这一模式综合了若干关键要素,以确保决策的合法性和道德性。该模型引入了“计算法则”或“算法法则”的概念。这涉及到为人工智能系统创建一个单独的法律框架,将规则和法规翻译成机器可读的算法格式,以避免自然语言的模糊性。该论文强调了自主AI系统需要一个“专用操作环境”,类似于自动驾驶汽车的“操作设计领域”。这意味着创建一个特定的,明确定义的环境和一套规则,人工智能可以在其中安全有效地运行。该模型主张在受控的、综合生成的数据上训练人工智能系统,以确保从一开始就嵌入公平和道德考虑。博弈论也被提出作为一种计算人工智能在这些道德和法律约束下实现其目标的最佳策略的方法。所提供的分析强调了可解释人工智能(XAI)的重要性,以确保自治系统所做决策的透明度和问责制。这对于建立信任和遵守“解释权”至关重要。
摘要 :The study addresses the paradigm shift in corporate management, where AI is moving from a decision support tool to an autonomous decision-maker, with some AI systems already appointed to leadership roles in companies. A central problem identified is that the development of AI technologies is far outpacing the creation of adequate legal and ethical guidelines.   The research proposes a "reference model" for the development and implementation of autonomous AI systems in corporate management. This model is based on a synthesis of several key components to ensure legitimate and ethical decision-making. The model introduces the concept of "computational law" or "algorithmic law". This involves creating a separate legal framework for AI systems, with rules and regulations translated into a machine-readable, algorithmic format to avoid the ambiguity of natural language. The paper emphasises the need for a "dedicated operational context" for autonomous AI systems, analogous to the "operational design domain" for autonomous vehicles. This means creating a specific, clearly defined environment and set of rules within which the AI can operate safely and effectively. The model advocates for training AI systems on controlled, synthetically generated data to ensure fairness and ethical considerations are embedded from the start. Game theory is also proposed as a method for calculating the optimal strategy for the AI to achieve its goals within these ethical and legal constraints. The provided analysis highlights the importance of explainable AI (XAI) to ensure the transparency and accountability of decisions made by autonomous systems. This is crucial for building trust and for complying with the "right to explanation".


【17】Conditional Fetal Brain Atlas Learning for Automatic Tissue Segmentation
标题:自动组织分割的条件胎儿脑图谱学习
链接:https://arxiv.org/abs/2508.04522

作者:Tischer, Patric Kienast, Marlene Stümpflen, Gregor Kasprian, Georg Langs, Roxane Licandro
备注:12 pages, 4 figures, MICCAI Workshop on Perinatal Imaging, Placental and Preterm Image analysis
摘要:胎儿大脑的磁共振成像(MRI)已成为在体研究大脑发育的关键工具。然而,由于大脑成熟过程的个体差异、成像方案的可变性以及胎龄(GA)估计的不确定性,其评估仍然具有挑战性。为克服这些问题,脑图谱提供了一个标准化的参考框架,通过把图谱和受试者对齐到共同坐标系,便于对不同受试者进行客观评估和比较。在这项工作中,我们提出一个新的深度学习框架,用于生成连续的、特定年龄的胎儿脑图谱,以进行实时胎儿脑组织分割。该框架将一个直接配准模型与一个条件判别器相结合,在一个精心筛选的数据集上训练,该数据集包含219例孕21至37周的神经发育典型胎儿MRI。该方法实现了高配准精度,以清晰的结构细节捕捉动态解剖变化,并取得稳健的分割性能,在六种脑组织上的平均Dice相似系数(DSC)达86.3%。此外,对生成图谱的体积分析揭示了详细的神经典型生长轨迹,为胎儿大脑成熟提供了宝贵的见解。这种方法只需极少的预处理即可实现实时、个体化的发育评估,可同时支持研究和临床应用。模型代码可在https://github.com/cirmuw/fetal-brain-atlas上获得
摘要:Magnetic Resonance Imaging (MRI) of the fetal brain has become a key tool for studying brain development in vivo. Yet, its assessment remains challenging due to variability in brain maturation, imaging protocols, and uncertain estimates of Gestational Age (GA). To overcome these, brain atlases provide a standardized reference framework that facilitates objective evaluation and comparison across subjects by aligning the atlas and subjects in a common coordinate system. In this work, we introduce a novel deep-learning framework for generating continuous, age-specific fetal brain atlases for real-time fetal brain tissue segmentation. The framework combines a direct registration model with a conditional discriminator. Trained on a curated dataset of 219 neurotypical fetal MRIs spanning from 21 to 37 weeks of gestation. The method achieves high registration accuracy, captures dynamic anatomical changes with sharp structural detail, and robust segmentation performance with an average Dice Similarity Coefficient (DSC) of 86.3% across six brain tissues. Furthermore, volumetric analysis of the generated atlases reveals detailed neurotypical growth trajectories, providing valuable insights into the maturation of the fetal brain. This approach enables individualized developmental assessment with minimal pre-processing and real-time performance, supporting both research and clinical applications. The model code is available at https://github.com/cirmuw/fetal-brain-atlas


【18】Metric Learning in an RKHS
标题:RKHS中的度量学习
链接:https://arxiv.org/abs/2508.04476

作者:tli, Yi Chen, Blake Mason, Robert Nowak, Ramya Korlakai Vinayak
备注:Appeared in the 41st Conference on Uncertainty in Artificial Intelligence (UAI 2025)
摘要:从形如"你认为项目h与项目i还是项目j更相似?"的三元组比较(它们刻画了项目之间的相似与差异)中学习度量,在图像检索、推荐系统和认知心理学等多种应用中起着关键作用。目标是在RKHS中学习一个反映这些比较的度量。使用核方法和神经网络的非线性度量学习已经显示出巨大的经验前景。虽然以前的工作已经解决了该问题的某些方面,但对这类方法几乎没有理论上的理解。例外是特殊的(线性)情形,即RKHS是标准欧几里得空间$\mathbb{R}^d$;在$\mathbb{R}^d$中已有完善的度量学习理论。本文建立了度量学习的一般RKHS框架,并给出了新的泛化保证和样本复杂度界。我们通过真实数据集上的模拟和实验验证了我们的发现。我们的代码可在https://github.com/RamyaLab/metric-learning-RKHS上公开获取。
摘要:Metric learning from a set of triplet comparisons in the form of "Do you think item h is more similar to item i or item j?", indicating similarity and differences between items, plays a key role in various applications including image retrieval, recommendation systems, and cognitive psychology. The goal is to learn a metric in the RKHS that reflects the comparisons. Nonlinear metric learning using kernel methods and neural networks have shown great empirical promise. While previous works have addressed certain aspects of this problem, there is little or no theoretical understanding of such methods. The exception is the special (linear) case in which the RKHS is the standard Euclidean space $\mathbb{R}^d$; there is a comprehensive theory for metric learning in $\mathbb{R}^d$. This paper develops a general RKHS framework for metric learning and provides novel generalization guarantees and sample complexity bounds. We validate our findings through a set of simulations and experiments on real datasets. Our code is publicly available at https://github.com/RamyaLab/metric-learning-RKHS.
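代码示意:下面的NumPy草图展示"在有限维RKHS风格嵌入上从三元组比较学习度量"的问题形式:以地标点的核函数值作特征$\varphi(x)=(k(x,z_1),\dots,k(x,z_m))$,学习$d(x,x')=\|L^\top(\varphi(x)-\varphi(x'))\|$使其满足三元组约束。三元组的合成方式、模型与训练过程均为示意假设,并非论文的理论设定或算法。
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))

def rbf(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

landmarks = X[:30]
Phi = rbf(X, landmarks)                 # finite-dimensional RKHS-style features

# synthetic triplets (h, i, j), relabeled so that h is closer to i than to j
w_star = rng.normal(size=30)
score = Phi @ w_star                    # hidden 1-D "similarity" attribute
h, i, j = rng.integers(0, 300, size=(2000, 3)).T
swap = np.abs(score[h] - score[i]) > np.abs(score[h] - score[j])
i, j = np.where(swap, j, i), np.where(swap, i, j)

L = 0.1 * rng.normal(size=(30, 4))      # metric d(x,x') = ||L^T (phi(x)-phi(x'))||
lr = 0.05
for _ in range(300):
    di = (Phi[h] - Phi[i]) @ L
    dj = (Phi[h] - Phi[j]) @ L
    margin = (di ** 2).sum(1) - (dj ** 2).sum(1) + 1.0  # hinge with unit margin
    act = margin > 0
    g = 2 * ((Phi[h] - Phi[i])[act].T @ di[act]
             - (Phi[h] - Phi[j])[act].T @ dj[act]) / len(h)
    L -= lr * g

di2 = (((Phi[h] - Phi[i]) @ L) ** 2).sum(1)
dj2 = (((Phi[h] - Phi[j]) @ L) ** 2).sum(1)
print("triplet accuracy:", float(np.mean(di2 < dj2)))
```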


【19】The Relative Instability of Model Comparison with Cross-validation
标题:交叉验证模型比较的相对不稳定性
链接:https://arxiv.org/abs/2508.04409

作者: Bayle, Lucas Janson, Lester Mackey
备注:41 pages, 4 figures
摘要:已有工作表明,交叉验证(CV)可为稳定的机器学习算法的测试误差给出渐近置信区间,并且许多流行算法已有的稳定性结果可用来导出此类置信区间有效的正面实例。然而,在用CV比较两种算法这一常见场景中,需要考虑一种相对稳定性的概念,即使对简单算法,它也难以由现有稳定性结果直接推出。为了更好地理解相对稳定性,以及CV何时能为两种算法的测试误差之差提供有效置信区间,我们研究了软阈值最小二乘算法,它是Lasso的近亲。我们证明,尽管在评估该算法自身的测试误差时稳定性成立,但在比较两个此类算法的测试误差时,相对稳定性并不成立,即使是在稀疏低维线性模型设定下。此外,我们在实验上证实,无论使用软阈值还是Lasso,测试误差之差的CV置信区间都是无效的。简而言之,在量化两种机器学习算法性能差异的CV估计的不确定性时需要谨慎,即使两种算法各自都是稳定的。
摘要:Existing work has shown that cross-validation (CV) can be used to provide an asymptotic confidence interval for the test error of a stable machine learning algorithm, and existing stability results for many popular algorithms can be applied to derive positive instances where such confidence intervals will be valid. However, in the common setting where CV is used to compare two algorithms, it becomes necessary to consider a notion of relative stability which cannot easily be derived from existing stability results, even for simple algorithms. To better understand relative stability and when CV provides valid confidence intervals for the test error difference of two algorithms, we study the soft-thresholded least squares algorithm, a close cousin of the Lasso. We prove that while stability holds when assessing the individual test error of this algorithm, relative stability fails to hold when comparing the test error of two such algorithms, even in a sparse low-dimensional linear model setting. Additionally, we empirically confirm the invalidity of CV confidence intervals for the test error difference when either soft-thresholding or the Lasso is used. In short, caution is needed when quantifying the uncertainty of CV estimates of the performance difference of two machine learning algorithms, even when both algorithms are individually stable.
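代码示意:为具体说明文中研究的对象,下面用NumPy实现软阈值最小二乘估计,并给出比较两个阈值水平时测试误差之差的K折CV估计及其朴素正态置信区间。需要强调:论文的结论正是这类区间在比较两个算法时可能无效,此处仅演示其通常的计算方式;数据与阈值均为示意。
```python
import numpy as np

def soft_threshold_ls(X, y, lam):
    """Least squares followed by coordinate-wise soft-thresholding."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)

def cv_error_diff(X, y, lam1, lam2, k=10, rng=None):
    """K-fold CV estimate (and naive 95% CI half-width) of the test-error
    difference between two soft-threshold levels."""
    n = len(y)
    idx = (rng or np.random.default_rng()).permutation(n)
    diffs = np.empty(n)
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        b1 = soft_threshold_ls(X[train], y[train], lam1)
        b2 = soft_threshold_ls(X[train], y[train], lam2)
        diffs[fold] = (y[fold] - X[fold] @ b1) ** 2 - (y[fold] - X[fold] @ b2) ** 2
    return diffs.mean(), 1.96 * diffs.std(ddof=1) / np.sqrt(n)

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 20))
beta = np.zeros(20); beta[:3] = 1.0          # sparse low-dimensional linear model
y = X @ beta + rng.normal(size=400)
mean, half = cv_error_diff(X, y, lam1=0.05, lam2=0.10, rng=rng)
print(f"CV estimate of the error difference: {mean:.4f} +/- {half:.4f}")
```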


【20】Hybrid Quantum--Classical Machine Learning Potential with Variational Quantum Circuits
标题:基于变分量子电路的混合量子-经典机器学习势
链接:https://arxiv.org/abs/2508.04098

作者:Yoo Willow, D. ChangMo Yang, Chang Woo Myung
备注:26+6 pages, 6+4 figures
摘要:用于模拟大型复杂分子系统的量子算法仍处于起步阶段,超越最先进的经典技术仍是一个不断后移的目标。与此同时,一条有希望的研究路径是通过混合量子-经典算法寻求实际优势:将传统神经网络与运行在当今含噪中等规模量子(NISQ)硬件上的变分量子电路(VQC)相结合。这种混合方案非常适合NISQ硬件:经典处理器承担大部分计算,而量子处理器执行特定子任务,提供额外的非线性和表达能力。在这里,我们将纯经典的E(3)-等变消息传递机器学习势(MLP)与混合量子-经典MLP进行基准比较,以预测液态硅的密度泛函理论(DFT)性质。在我们的混合架构中,消息传递层中的每个读出模块都被VQC取代。由HQC-MLP驱动的分子动力学模拟表明,VQC能够准确再现高温下的结构与热力学性质。这些发现展示了一个具体场景:NISQ兼容的HQC算法能够相对于现有最佳经典替代方案带来可测量的收益,表明了在材料建模中实现近期量子优势的可行途径。
摘要:Quantum algorithms for simulating large and complex molecular systems are still in their infancy, and surpassing state-of-the-art classical techniques remains an ever-receding goal post. A promising avenue of inquiry in the meanwhile is to seek practical advantages through hybrid quantum-classical algorithms, which combine conventional neural networks with variational quantum circuits (VQCs) running on today's noisy intermediate-scale quantum (NISQ) hardware. Such hybrids are well suited to NISQ hardware. The classical processor performs the bulk of the computation, while the quantum processor executes targeted sub-tasks that supply additional non-linearity and expressivity. Here, we benchmark a purely classical E(3)-equivariant message-passing machine learning potential (MLP) against a hybrid quantum-classical MLP for predicting density functional theory (DFT) properties of liquid silicon. In our hybrid architecture, every readout in the message-passing layers is replaced by a VQC. Molecular dynamics simulations driven by the HQC-MLP reveal that an accurate reproduction of high-temperature structural and thermodynamic properties is achieved with VQCs. These findings demonstrate a concrete scenario in which NISQ-compatible HQC algorithm could deliver a measurable benefit over the best available classical alternative, suggesting a viable pathway toward near-term quantum advantage in materials modeling.


【21】Constraining the outputs of ReLU neural networks
标题:约束ReLU神经网络的输出
链接:https://arxiv.org/abs/2508.03867

作者:xandr, Guido Montúfar
备注:32 pages, 4 figures
摘要:我们引入了一类与ReLU神经网络自然相关的代数簇,它们源自网络输出在输入空间各激活区域上的分段线性结构,以及在参数空间中的分段多线性结构。通过分析每个激活区域内网络输出的秩约束,我们导出了刻画该网络可表示函数的多项式方程。我们进一步研究了这些簇达到其期望维数的条件,从而深入了解ReLU网络的表达与结构性质。
摘要:We introduce a class of algebraic varieties naturally associated with ReLU neural networks, arising from the piecewise linear structure of their outputs across activation regions in input space, and the piecewise multilinear structure in parameter space. By analyzing the rank constraints on the network outputs within each activation region, we derive polynomial equations that characterize the functions representable by the network. We further investigate conditions under which these varieties attain their expected dimension, providing insight into the expressive and structural properties of ReLU networks.


其他(23篇)

【1】Perch 2.0: The Bittern Lesson for Bioacoustics
标题:Perch 2.0:生物声学的苦涩教训(Bittern Lesson)
链接:https://arxiv.org/abs/2508.04665

作者:Merriënboer, Vincent Dumoulin, Jenny Hamer, Lauren Harrell, Andrea Burns, Tom Denton
摘要:Perch是一个高性能的生物声学预训练模型。它以有监督方式训练,既为数千种发声物种提供开箱即用的分类得分,也为迁移学习提供强大的嵌入。在新版本Perch 2.0中,我们从仅在鸟类物种上训练扩展到一个大型多类群数据集。该模型使用原型学习分类器以及新的声源预测训练准则进行自蒸馏训练。Perch 2.0在BirdSet和BEANS基准上取得了最先进的性能。尽管几乎没有海洋训练数据,它在海洋迁移学习任务上也优于专门的海洋模型。我们就细粒度物种分类为何是生物声学中尤为稳健的预训练任务提出了假设。
摘要:Perch is a performant pre-trained model for bioacoustics. It was trained in supervised fashion, providing both off-the-shelf classification scores for thousands of vocalizing species as well as strong embeddings for transfer learning. In this new release, Perch 2.0, we expand from training exclusively on avian species to a large multi-taxa dataset. The model is trained with self-distillation using a prototype-learning classifier as well as a new source-prediction training criterion. Perch 2.0 obtains state-of-the-art performance on the BirdSet and BEANS benchmarks. It also outperforms specialized marine models on marine transfer learning tasks, despite having almost no marine training data. We present hypotheses as to why fine-grained species classification is a particularly robust pre-training task for bioacoustics.


【2】Augmentation-based Domain Generalization and Joint Training from Multiple Source Domains for Whole Heart Segmentation
标题:基于增强的领域概括和多个源领域的联合训练用于整个心脏分割
链接:https://arxiv.org/abs/2508.04552

作者:ler, Darko Stern, Gernot Plank, Martin Urschler
备注:Accepted for the MICCAI Challenge on Comprehensive Analysis and Computing of Real-World Medical Images 2024, 12 pages
摘要:心血管疾病是全球首要死因,这推动了人们开发更复杂的方法,从计算机断层扫描(CT)和磁共振(MR)等医学图像中分析心脏及其子结构。对代表整个心脏的重要心脏结构进行语义分割,有助于评估患者特异性的心脏形态和病理。此外,准确的语义分割可用于生成心脏数字孪生模型,支持电生理模拟和个性化治疗规划等应用。尽管基于深度学习的医学图像分割方法在过去十年取得了长足进步,但在域偏移下(即训练数据与测试数据来自不同的数据分布时)保持良好性能仍然具有挑战性。为了在训练时已知的域上表现良好,我们采用(1)平衡联合训练方法,等量使用来自不同源域的CT和MR数据。此外,为缓解向仅在测试时遇到的域的偏移,我们依靠(2)强力的强度与空间增强技术来大幅丰富可用训练数据。我们提出的全心分割方法(包含上述贡献的5折集成)在MR数据上取得了总体最佳性能,在CT数据上的性能也与仅在CT上训练的模型的最佳性能相近。凭借CT数据上93.33%的DSC和0.8388毫米的ASSD,以及MR数据上89.30%的DSC和1.2411毫米的ASSD,我们的方法展示了高效获得准确语义分割、进而生成患者特异性心脏孪生模型的巨大潜力。
摘要 :As the leading cause of death worldwide, cardiovascular diseases motivate the development of more sophisticated methods to analyze the heart and its substructures from medical images like Computed Tomography (CT) and Magnetic Resonance (MR). Semantic segmentations of important cardiac structures that represent the whole heart are useful to assess patient-specific cardiac morphology and pathology. Furthermore, accurate semantic segmentations can be used to generate cardiac digital twin models which allows e.g. electrophysiological simulation and personalized therapy planning. Even though deep learning-based methods for medical image segmentation achieved great advancements over the last decade, retaining good performance under domain shift -- i.e. when training and test data are sampled from different data distributions -- remains challenging. In order to perform well on domains known at training-time, we employ a (1) balanced joint training approach that utilizes CT and MR data in equal amounts from different source domains. Further, aiming to alleviate domain shift towards domains only encountered at test-time, we rely on (2) strong intensity and spatial augmentation techniques to greatly diversify the available training data. Our proposed whole heart segmentation method, a 5-fold ensemble with our contributions, achieves the best performance for MR data overall and a performance similar to the best performance for CT data when compared to a model trained solely on CT. With 93.33% DSC and 0.8388 mm ASSD for CT and 89.30% DSC and 1.2411 mm ASSD for MR data, our method demonstrates great potential to efficiently obtain accurate semantic segmentations from which patient-specific cardiac twin models can be generated.


【3】OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use
标题:操作系统代理:针对通用计算设备使用的基于MLLM的代理的调查
链接:https://arxiv.org/abs/2508.04482

作者: Tao Xiong, Biao Yi, Zishu Wei, Ruixuan Xiao, Yurun Chen, Jiasheng Ye, Meiling Tao, Xiangxin Zhou, Ziyu Zhao, Yuhuai Li, Shengze Xu, Shenzhi Wang, Xinchen Xu, Shuofei Qiao, Zhaokai Wang, Kun Kuang, Tieyong Zeng, Liang Wang, Jiwei Li, Yuchen Eleanor Jiang, Wangchunshu Zhou, Guoyin Wang, Keting Yin, Zhou Zhao, Hongxia Yang, Fan Wu, Shengyu Zhang, Fei Wu
备注:ACL 2025 (Oral)
摘要:创造像《钢铁侠》中虚构的J.A.R.V.I.S一样能干且多才多艺的AI助手,是长期以来吸引人们想象力的梦想。随着(多模态)大型语言模型((M)LLM)的发展,这一梦想正走向现实:基于(M)LLM的智能体通过在操作系统(OS)提供的环境和接口(例如图形用户界面,GUI)中操作计算设备(例如计算机和手机)来自动完成任务,已经取得显著进展。本文对这类先进智能体(称为OS智能体)进行了全面综述。我们首先阐述OS智能体的基础知识,探讨其关键组成部分,包括环境、观测空间和动作空间,并概述理解、规划和接地等基本能力。随后我们考察构建OS智能体的方法,聚焦于领域特定基础模型和智能体框架。对评估协议和基准的详细回顾,说明了OS智能体如何在多样任务上被评估。最后,我们讨论当前挑战并指出有前景的未来研究方向,包括安全与隐私、个性化和自我进化。本综述旨在梳理OS智能体研究的现状,为学术研究和产业发展提供指引。我们维护了一个开源GitHub仓库作为动态资源,以促进该领域的进一步创新。我们还提供了被ACL 2025接收的9页版本,对该领域给出简明概述。
摘要:The dream to create AI assistants as capable and versatile as the fictional J.A.R.V.I.S from Iron Man has long captivated imaginations. With the evolution of (multi-modal) large language models ((M)LLMs), this dream is closer to reality, as (M)LLM-based Agents using computing devices (e.g., computers and mobile phones) by operating within the environments and interfaces (e.g., Graphical User Interface (GUI)) provided by operating systems (OS) to automate tasks have significantly advanced. This paper presents a comprehensive survey of these advanced agents, designated as OS Agents. We begin by elucidating the fundamentals of OS Agents, exploring their key components including the environment, observation space, and action space, and outlining essential capabilities such as understanding, planning, and grounding. We then examine methodologies for constructing OS Agents, focusing on domain-specific foundation models and agent frameworks. A detailed review of evaluation protocols and benchmarks highlights how OS Agents are assessed across diverse tasks. Finally, we discuss current challenges and identify promising directions for future research, including safety and privacy, personalization and self-evolution. This survey aims to consolidate the state of OS Agents research, providing insights to guide both academic inquiry and industrial development. An open-source GitHub repository is maintained as a dynamic resource to foster further innovation in this field. We present a 9-page version of our work, accepted by ACL 2025, to provide a concise overview to the domain.


【4】GFocal: A Global-Focal Neural Operator for Solving PDEs on Arbitrary Geometries
标题:GFocal:一种在任意几何上求解偏微分方程的全局-焦点神经算子
链接:https://arxiv.org/abs/2508.04463

作者:ei, Jiaxin Hu, Qiaofeng Li, Zhenyu Liu
摘要:基于Transformer的神经算子借助Transformer捕获长程依赖与全局相关性的能力(这一点已在语言建模中得到充分验证),已成为偏微分方程有前景的代理求解器。然而,现有方法忽视了对局部物理细节与全局特征之间相互依赖关系的协同学习,而这对于处理多尺度问题、在长期滚动预测中保持物理一致性和数值稳定性,以及准确捕捉过渡动力学至关重要。在这项工作中,我们提出GFocal,一种强制同时进行全局与局部特征学习和融合的基于Transformer的神经算子方法。全局相关性和局部特征分别通过基于Nyström注意力的全局(global)块和基于切片的焦点(focal)块来提取,生成物理感知的标记,随后经基于卷积的门控块进行调制和整合,从而实现多尺度信息的动态融合。GFocal能够在任意几何形状和初始条件下对物理特征进行精确建模和预测。实验表明,GFocal取得了最先进的性能,在六个基准中的五个上平均相对提升15.2%,并且在汽车和翼型的空气动力学仿真等工业规模仿真中表现出色。
摘要:Transformer-based neural operators have emerged as promising surrogate solvers for partial differential equations, by leveraging the effectiveness of Transformers for capturing long-range dependencies and global correlations, profoundly proven in language modeling. However, existing methodologies overlook the coordinated learning of interdependencies between local physical details and global features, which are essential for tackling multiscale problems, preserving physical consistency and numerical stability in long-term rollouts, and accurately capturing transitional dynamics. In this work, we propose GFocal, a Transformer-based neural operator method that enforces simultaneous global and local feature learning and fusion. Global correlations and local features are harnessed through Nyström attention-based global blocks and slices-based focal blocks to generate physics-aware tokens, subsequently modulated and integrated via convolution-based gating blocks, enabling dynamic fusion of multiscale information. GFocal achieves accurate modeling and prediction of physical features given arbitrary geometries and initial conditions. Experiments show that GFocal achieves state-of-the-art performance with an average 15.2% relative gain in five out of six benchmarks and also excels in industry-scale simulations such as aerodynamics simulation of automotives and airfoils.
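代码示意:下面的PyTorch草图示意"全局块 + 局部块 + 卷积门控融合"的基本结构。为保持简短,这里用标准多头注意力代替Nyström注意力、用一维卷积代替论文的切片式focal块,均为示意性替换,并非论文架构。
```python
import torch
import torch.nn as nn

class GlobalFocalBlock(nn.Module):
    """Toy global/local fusion block: full multi-head attention stands in for
    Nystrom attention, and a 1D conv stands in for the slice-based focal block."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.local_conv = nn.Conv1d(dim, dim, kernel_size=5, padding=2)
        self.gate = nn.Sequential(nn.Conv1d(2 * dim, dim, 1), nn.Sigmoid())

    def forward(self, x):                       # x: (batch, tokens, dim)
        g, _ = self.global_attn(x, x, x)        # global correlations
        l = self.local_conv(x.transpose(1, 2)).transpose(1, 2)  # local details
        z = torch.cat([g, l], dim=-1).transpose(1, 2)
        a = self.gate(z).transpose(1, 2)        # dynamic fusion weights in (0, 1)
        return a * g + (1 - a) * l

block = GlobalFocalBlock(dim=32)
tokens = torch.randn(2, 100, 32)                # "physics-aware" tokens on a mesh
print(block(tokens).shape)                      # torch.Size([2, 100, 32])
```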


【5】Multi-Marginal Stochastic Flow Matching for High-Dimensional Snapshot Data at Irregular Time Points
标题:不规则时间点高维快照数据的多边际随机流匹配
链接:https://arxiv.org/abs/2508.04351

作者:e, Behnaz Moradijamei, Heman Shakeri
备注:23 pages, 10 figures
摘要:基于在不规则时间点获得的有限快照观测来建模高维系统的演化,是定量生物学及相关领域的一项重大挑战。传统方法往往依赖降维技术,这可能会过度简化动力学,无法捕捉非平衡系统中关键的瞬态行为。我们提出多边际随机流匹配(MMSFM),它将免模拟的分数匹配与流匹配方法新颖地推广到多边际设定,使得无需降维即可对齐在非等距时间点测得的高维数据。测度值样条的使用增强了对不规则快照时刻的鲁棒性,而分数匹配防止了高维空间中的过拟合。我们在多个合成和基准数据集上验证了该框架,包括在不均匀时间点采集的基因表达数据和一个图像演进任务,证明了该方法的多功能性。
摘要:Modeling the evolution of high-dimensional systems from limited snapshot observations at irregular time points poses a significant challenge in quantitative biology and related fields. Traditional approaches often rely on dimensionality reduction techniques, which can oversimplify the dynamics and fail to capture critical transient behaviors in non-equilibrium systems. We present Multi-Marginal Stochastic Flow Matching (MMSFM), a novel extension of simulation-free score and flow matching methods to the multi-marginal setting, enabling the alignment of high-dimensional data measured at non-equidistant time points without reducing dimensionality. The use of measure-valued splines enhances robustness to irregular snapshot timing, and score matching prevents overfitting in high-dimensional spaces. We validate our framework on several synthetic and benchmark datasets, including gene expression data collected at uneven time points and an image progression task, demonstrating the method's versatility.
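代码示意:流匹配的训练目标可以在两个边际分布之间用几十行PyTorch说明:在随机配对样本的直线路径上回归恒定速度场,然后用欧拉法积分学到的ODE把源分布输运到目标分布。论文的MMSFM借助测度值样条把这一思想推广到多个不规则时刻的边际并结合分数匹配;此处只是两边际的玩具草图。
```python
import torch
import torch.nn as nn

x0 = torch.randn(1024, 2)                             # snapshot at t=0
x1 = torch.randn(1024, 2) + torch.tensor([3.0, 0.0])  # snapshot at t=1

v = nn.Sequential(nn.Linear(3, 64), nn.Tanh(), nn.Linear(64, 2))
opt = torch.optim.Adam(v.parameters(), lr=1e-3)

for step in range(2000):
    a = x0[torch.randint(0, 1024, (256,))]
    b = x1[torch.randint(0, 1024, (256,))]
    t = torch.rand(256, 1)
    xt = (1 - t) * a + t * b          # point on the straight path a -> b
    target = b - a                    # the path's constant velocity
    loss = ((v(torch.cat([xt, t], 1)) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# integrate the learned ODE from t=0 to t=1 with 100 Euler steps
x = x0.clone()
with torch.no_grad():
    for k in range(100):
        t = torch.full((len(x), 1), k / 100)
        x = x + 0.01 * v(torch.cat([x, t], 1))
print("transported mean:", x.mean(0))   # should move toward (3, 0)
```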


【6】Symmetric Behavior Regularization via Taylor Expansion of Symmetry
标题:基于对称性泰勒展开的对称行为正则化
链接:https://arxiv.org/abs/2508.04225

作者:hu, Zheng Chen, Han Wang, Yukie Nagai
摘要:本文将对称散度引入行为正则化策略优化(BRPO),建立了一个新的离线强化学习框架。现有方法聚焦于KL等非对称散度,以获得解析形式的正则化策略和实用的最小化目标。我们证明,对称散度作为正则项不允许解析策略,并且作为损失函数可能引发数值问题。我们借助$f$-散度的泰勒级数应对这些挑战。具体而言,我们证明了利用有限级数即可得到解析策略。对于损失,我们观察到对称散度可分解为一个非对称项和一个条件对称项,对后者进行泰勒展开可缓解数值问题。综上,我们提出对称$f$ Actor-Critic(S$f$-AC),这是第一个采用对称散度的实用BRPO算法。分布近似和MuJoCo上的实验结果验证了S$f$-AC具有竞争力的表现。
摘要:This paper introduces symmetric divergences to behavior regularization policy optimization (BRPO) to establish a novel offline RL framework. Existing methods focus on asymmetric divergences such as KL to obtain analytic regularized policies and a practical minimization objective. We show that symmetric divergences do not permit an analytic policy as regularization and can incur numerical issues as loss. We tackle these challenges by the Taylor series of $f$-divergence. Specifically, we prove that an analytic policy can be obtained with a finite series. For loss, we observe that symmetric divergences can be decomposed into an asymmetry and a conditional symmetry term, Taylor-expanding the latter alleviates numerical issues. Summing together, we propose Symmetric $f$ Actor-Critic (S$f$-AC), the first practical BRPO algorithm with symmetric divergences. Experimental results on distribution approximation and MuJoCo verify that S$f$-AC performs competitively.
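代码示意:论文的关键技巧是对$f$-散度做泰勒展开并取有限级数。下面以对称的Jeffreys散度($f(r)=(r-1)\log r$)做一个数值健全性检查:令$u=r-1$,有$f(1+u)=\sum_{k\ge2}(-1)^k u^k/(k-1)$,截断级数随阶数升高逼近精确值(收敛要求$p/q$接近1)。这只是对"泰勒展开$f$-散度"这一思想的演示,并非S$f$-AC算法本身。
```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.dirichlet(np.ones(10))
p = q * (1 + 0.3 * rng.uniform(-1, 1, 10))   # keep p/q close to 1 for convergence
p /= p.sum()

r = p / q
exact = float(np.sum(q * (r - 1) * np.log(r)))   # Jeffreys = KL(P||Q) + KL(Q||P)

u = r - 1
for K in (2, 4, 8, 16):
    series = sum((-1) ** k * u ** k / (k - 1) for k in range(2, K + 1))
    print(f"order {K:2d}: approx={float(np.sum(q * series)):.6f}  exact={exact:.6f}")
```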


【7】The State Of TTS: A Case Study with Human Fooling Rates
标题:TTS的现状:基于人类愚弄率的案例研究
链接:https://arxiv.org/abs/2508.04179

作者:rinivasa Varadhan, Sherry Thomas, Sai Teja M. S., Suvrat Bhooshan, Mitesh M. Khapra
备注:Accepted at InterSpeech 2025
摘要:尽管近年来的主观评估显示TTS进展迅速,但当前的TTS系统真的能在图灵式评估的人类欺骗测试中过关吗?我们引入人类愚弄率(HFR),一个直接衡量机器生成语音被误认为人类语音频率的指标。我们对开源和商业TTS模型的大规模评估揭示了重要洞见:(i)基于CMOS的"达到人类水平"的论断在欺骗测试下往往站不住脚;(ii)TTS进展应以人类语音能取得高HFR的数据集为基准,因为以单调或表现力不足的参考样本作评测会把标准定得过低;(iii)商业模型在零样本设定下已接近欺骗人类,而开源系统在自然对话语音上仍有困难;(iv)在高质量数据上微调可提升真实感,但不能完全弥合差距。我们的研究结果强调,除现有主观测试之外,还需要更贴近现实、以人为中心的评估。
摘要:While subjective evaluations in recent years indicate rapid progress in TTS, can current TTS systems truly pass a human deception test in a Turing-like evaluation? We introduce Human Fooling Rate (HFR), a metric that directly measures how often machine-generated speech is mistaken for human. Our large-scale evaluation of open-source and commercial TTS models reveals critical insights: (i) CMOS-based claims of human parity often fail under deception testing, (ii) TTS progress should be benchmarked on datasets where human speech achieves high HFRs, as evaluating against monotonous or less expressive reference samples sets a low bar, (iii) Commercial models approach human deception in zero-shot settings, while open-source systems still struggle with natural conversational speech; (iv) Fine-tuning on high-quality data improves realism but does not fully bridge the gap. Our findings underscore the need for more realistic, human-centric evaluations alongside existing subjective tests.
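代码示意:HFR的定义本身很直接:机器生成语音被听者判定为"人类"的判次占比。下面是一个最小的计算示例(数据与字段名均为示意):
```python
# One row per (clip, listener) judgment; "judged_human" is the listener's call.
judgments = [
    {"clip": "tts_001", "judged_human": True},
    {"clip": "tts_001", "judged_human": False},
    {"clip": "tts_002", "judged_human": True},
    {"clip": "tts_003", "judged_human": False},
]
hfr = sum(j["judged_human"] for j in judgments) / len(judgments)
print(f"Human Fooling Rate = {hfr:.1%}")  # 50.0% on this toy data
```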


【8】Agentic-AI based Mathematical Framework for Commercialization of Energy Resilience in Electrical Distribution System Planning and Operation
标题:基于Agentic AI的配电系统规划与运行中能源弹性商业化的数学框架
链接:https://arxiv.org/abs/2508.04170

作者:hri, Divyanshi Dwivedi, Mayukha Pal
摘要:配电系统越来越容易受到极端天气事件和网络威胁的影响,因此有必要建立经济上可行的弹性提升框架。现有方法主要侧重于技术性弹性指标和增强策略,在建立市场驱动机制方面仍存在显著空白:既要有效地将弹性特性商业化,又要通过智能决策优化其部署。此外,传统的配电网重构优化方法往往难以在正常和紧急两种情况下动态自适应。本文提出一个将双智能体近端策略优化(PPO)与基于市场的机制相结合的新框架,在10个测试回合中取得了0.85 ± 0.08的平均弹性得分。所提架构采用双智能体PPO方案:战略智能体选择最优的DER驱动开关配置,战术智能体则在预算和天气约束下微调各开关状态与电网偏好。这些智能体在一个定制的动态仿真环境中交互,该环境对随机灾害事件、预算限制和弹性-成本权衡进行建模。我们设计了一个综合奖励函数,在弹性提升目标与市场盈利能力之间取得平衡(奖励激励最高可达200倍,使得灾害时段85%的动作选择了带4个DER的配置),并纳入负荷恢复速度、系统鲁棒性和客户满意度等因素。在10个测试回合中,该框架实现了0.12 ± 0.01的效益成本比,证明市场激励能够可持续地支撑弹性投资。这一框架创造了可持续的市场激励机制
摘要:The increasing vulnerability of electrical distribution systems to extreme weather events and cyber threats necessitates the development of economically viable frameworks for resilience enhancement. While existing approaches focus primarily on technical resilience metrics and enhancement strategies, there remains a significant gap in establishing market-driven mechanisms that can effectively commercialize resilience features while optimizing their deployment through intelligent decision-making. Moreover, traditional optimization approaches for distribution network reconfiguration often fail to dynamically adapt to both normal and emergency conditions. This paper introduces a novel framework integrating dual-agent Proximal Policy Optimization (PPO) with market-based mechanisms, achieving an average resilience score of 0.85 ± 0.08 over 10 test episodes. The proposed architecture leverages a dual-agent PPO scheme, where a strategic agent selects optimal DER-driven switching configurations, while a tactical agent fine-tunes individual switch states and grid preferences under budget and weather constraints. These agents interact within a custom-built dynamic simulation environment that models stochastic calamity events, budget limits, and resilience-cost trade-offs. A comprehensive reward function is designed that balances resilience enhancement objectives with market profitability (with up to 200x reward incentives, resulting in 85% of actions during calamity steps selecting configurations with 4 DERs), incorporating factors such as load recovery speed, system robustness, and customer satisfaction. Over 10 test episodes, the framework achieved a benefit-cost ratio of 0.12 ± 0.01, demonstrating sustainable market incentives for resilience investment. This framework creates sustainable market incentives


【9】Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap
标题:基于DPO隐式奖励差距的难度导向偏好数据选择
链接:https://arxiv.org/abs/2508.04149

作者:Rongwu Xu, Zhijing Jin
备注:Our code and data are available at this https URL
摘要:将大型语言模型(LLM)与人类偏好对齐是人工智能研究中的一个关键挑战。虽然基于人类反馈的强化学习(RLHF)和直接偏好优化(DPO)等方法被广泛使用,但它们通常依赖于大规模、高成本的偏好数据集。现有工作缺乏专门针对偏好数据的高质量数据选择方法。在这项工作中,我们基于DPO的隐式奖励机制,为偏好数据集引入一种新的基于难度的数据选择策略。通过选择DPO隐式奖励差距较小的偏好数据样本(这类样本对应更具挑战性的案例),我们提高了数据效率和模型对齐效果。我们的方法在多个数据集和对齐任务上始终优于五个强基线,仅用原始数据的10%即可实现更优性能。这种有原则且高效的选择方法,为在资源有限的情况下扩展LLM对齐提供了有前景的解决方案。
摘要 :Aligning large language models (LLMs) with human preferences is a critical challenge in AI research. While methods like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are widely used, they often rely on large, costly preference datasets. The current work lacks methods for high-quality data selection specifically for preference data. In this work, we introduce a novel difficulty-based data selection strategy for preference datasets, grounded in the DPO implicit reward mechanism. By selecting preference data examples with smaller DPO implicit reward gaps, which are indicative of more challenging cases, we improve data efficiency and model alignment. Our approach consistently outperforms five strong baselines across multiple datasets and alignment tasks, achieving superior performance with only 10\% of the original data. This principled, efficient selection method offers a promising solution for scaling LLM alignment with limited resources.
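代码示意:该方法的核心量可直接写出:DPO隐式奖励为$r(x,y)=\beta[\log\pi_\theta(y|x)-\log\pi_{\mathrm{ref}}(y|x)]$,样本难度用被选与被拒回复的奖励差衡量,差距越小越难。下面的PyTorch片段用随机数代替真实的序列对数似然,演示"保留奖励差最小的10%"这一选择规则(比例为示意)。
```python
import torch

beta, n = 0.1, 1000
# random stand-ins for sequence log-likelihoods from policy and reference models
logp_theta_chosen, logp_theta_rejected = torch.randn(n), torch.randn(n)
logp_ref_chosen, logp_ref_rejected = torch.randn(n), torch.randn(n)

r_chosen = beta * (logp_theta_chosen - logp_ref_chosen)
r_rejected = beta * (logp_theta_rejected - logp_ref_rejected)
gap = r_chosen - r_rejected           # DPO implicit reward gap per pair

keep = torch.argsort(gap)[: n // 10]  # smallest gaps = hardest pairs
print(f"kept {len(keep)} of {n} preference pairs")
```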


【10】Convolutional autoencoders for the reconstruction of three-dimensional interfacial multiphase flows
标题:用于三维界面多相流重建的卷积自动编码器
链接:https://arxiv.org/abs/2508.04084

作者:tforth, Shahab Mirjalili
摘要:在这项工作中,我们对用于三维多相流降阶建模的自动编码器进行了全面研究。我们聚焦于使用标准卷积架构重建多相流体积/质量分数的精度,考察了不同界面表示选择(弥散、锐利、水平集)的优缺点。我们结合使用具有非平凡界面拓扑的合成数据与多相均匀各向同性湍流的高分辨率模拟数据进行训练和验证。这项研究阐明了通过自动编码器对多相流进行降维的最佳实践,从而为将自动编码器(用于精确重建)的训练,与神经算子(如FNO、DeepONet)等时间或输入/输出模型以及定义在自动编码器低维潜空间上的神经ODE的训练相解耦铺平了道路。因此,这项研究具有重要意义,对多相流乃至更广泛的研究社区都有参考价值。
摘要:In this work, we perform a comprehensive investigation of autoencoders for reduced-order modeling of three-dimensional multiphase flows. Focusing on the accuracy of reconstructing multiphase flow volume/mass fractions with a standard convolutional architecture, we examine the advantages and disadvantages of different interface representation choices (diffuse, sharp, level set). We use a combination of synthetic data with non-trivial interface topologies and high-resolution simulation data of multiphase homogeneous isotropic turbulence for training and validation. This study clarifies the best practices for reducing the dimensionality of multiphase flows via autoencoders. Consequently, this paves the path for uncoupling the training of autoencoders for accurate reconstruction and the training of temporal or input/output models such as neural operators (e.g., FNOs, DeepONets) and neural ODEs on the lower-dimensional latent space given by the autoencoders. As such, the implications of this study are significant and of interest to the multiphase flow community and beyond.
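代码示意:下面是作用于$32^3$体积分数场的极简三维卷积自动编码器(PyTorch)。层数、通道数与Sigmoid输出(利用体积分数取值于[0,1]这一事实)均为示意选择,并非论文的具体网络。
```python
import torch
import torch.nn as nn

class AE3D(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv3d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 32^3 -> 16^3
            nn.Conv3d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 16^3 -> 8^3
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose3d(32, 16, 4, stride=2, padding=1), nn.ReLU(),  # 8 -> 16
            nn.ConvTranspose3d(16, 1, 4, stride=2, padding=1),              # 16 -> 32
            nn.Sigmoid(),  # volume fractions live in [0, 1]
        )
    def forward(self, x):
        return self.dec(self.enc(x))

vol_frac = torch.rand(4, 1, 32, 32, 32)   # batch of diffuse-interface fields
model = AE3D()
recon = model(vol_frac)
loss = nn.functional.binary_cross_entropy(recon, vol_frac)
print(recon.shape, float(loss))
```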


【11】The Ubiquitous Sparse Matrix-Matrix Products
标题:无处不在的稀疏矩阵-矩阵乘积
链接:https://arxiv.org/abs/2508.04077

作者:uç
摘要:稀疏矩阵与另一个(稠密或稀疏)矩阵的乘法是一种基本运算,它刻画了众多数据科学应用的计算模式,包括但不限于图算法、稀疏连接神经网络、图神经网络、聚类,以及生物测序数据的多对多比较。在许多应用场景中,矩阵乘法发生在任意代数半环上,其中标量运算被具有特定性质的用户自定义函数重载;或发生在更一般的异构代数上,其中输入矩阵的定义域甚至可以不同。在这里,我们对稀疏矩阵-矩阵运算及其丰富的应用空间(涵盖机器学习、计算生物学与化学、图算法和科学计算)给出统一的阐述。
摘要:Multiplication of a sparse matrix with another (dense or sparse) matrix is a fundamental operation that captures the computational patterns of many data science applications, including but not limited to graph algorithms, sparsely connected neural networks, graph neural networks, clustering, and many-to-many comparisons of biological sequencing data.   In many application scenarios, the matrix multiplication takes places on an arbitrary algebraic semiring where the scalar operations are overloaded with user-defined functions with certain properties or a more general heterogenous algebra where even the domains of the input matrices can be different. Here, we provide a unifying treatment of the sparse matrix-matrix operation and its rich application space including machine learning, computational biology and chemistry, graph algorithms, and scientific computing.
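代码示意:"半环上的SpGEMM"可以用几行Python说清:把标量加法与乘法换成用户提供的运算即可。下面的草图采用字典式稀疏表示,默认在(min, +)半环上计算,此时A⊗A恰好松弛了图上的两跳最短路;这只是概念演示,并非论文所讨论的高性能实现。
```python
from collections import defaultdict

def spgemm(A, B, add=min, mul=lambda a, b: a + b, zero=float("inf")):
    """Sparse matrix-matrix product over a user-supplied semiring.
    A and B are dicts of dicts: A[i][k] = value. The default (min, +)
    semiring makes A (x) A relax 2-hop shortest paths on a graph."""
    C = defaultdict(dict)
    for i, row in A.items():
        acc = {}
        for k, a_ik in row.items():
            for j, b_kj in B.get(k, {}).items():
                acc[j] = add(acc.get(j, zero), mul(a_ik, b_kj))
        C[i] = acc
    return dict(C)

# a small weighted digraph as a sparse adjacency "matrix"
A = {0: {1: 2.0, 2: 7.0}, 1: {2: 3.0}, 2: {}}
print(spgemm(A, A))   # {0: {2: 5.0}, 1: {}, 2: {}} -- path 0->1->2 costs 5
```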


【12】Tensorized Clustered LoRA Merging for Multi-Task Interference
标题:面向多任务干扰的张量化聚类LoRA合并
链接:https://arxiv.org/abs/2508.03999

作者:Fengran Mo, Guojun Liang, Jinghan Zhang, Bingbing Wen, Prayag Tiwari, Jian-Yun Nie
摘要:Despite the success of the monolithic dense paradigm of large language models (LLMs), the LoRA adapters offer an efficient solution by fine-tuning small task-specific modules and merging them with the base model. However, in multi-task settings, merging LoRA adapters trained on heterogeneous sources frequently causes task interference, degrading downstream performance. To address this, we propose a tensorized clustered LoRA (TC-LoRA) library that targets task interference at both the text level and the parameter level. At the text level, we cluster the training samples in the embedding space to capture input-format similarities, then train a specialized LoRA adapter for each cluster. At the parameter level, we introduce a joint Canonical Polyadic (CP) decomposition that disentangles task-specific and shared factors across LoRA adapters. This joint factorization preserves essential knowledge while reducing cross-task interference. We conduct extensive experiments on out-of-domain zero-shot and skill-composition tasks, including reasoning, question answering, and coding. Compared to strong SVD-based baselines, TC-LoRA achieves +1.4% accuracy on Phi-3 and +2.3% on Mistral-7B, demonstrating the effectiveness of TC-LoRA in LLM adaptation.
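代码示意:下面的草图只演示TC-LoRA在"文本层面"的聚类路由:在嵌入空间聚类,为每个簇维护一个低秩适配器,推理时路由到最近簇。嵌入、权重维度与路由规则均为示意假设;论文在"参数层面"的联合CP分解此处未展示。
```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
emb = rng.normal(size=(5000, 128))       # stand-in for sentence embeddings
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(emb)

# hypothetical per-cluster LoRA factors for a (512 x 512) weight, rank 8
adapters = {c: (0.01 * rng.normal(size=(512, 8)), np.zeros((8, 512)))
            for c in range(8)}

def delta_w(x_emb):
    """Route an input to its cluster's adapter and return the low-rank update."""
    c = int(km.predict(x_emb[None])[0])
    A, B = adapters[c]
    return A @ B

print(delta_w(emb[0]).shape)             # (512, 512)
```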


【13】BubbleONet: A Physics-Informed Neural Operator for High-Frequency Bubble Dynamics
标题:BubbleONet:面向高频气泡动力学的物理信息神经算子
链接:https://arxiv.org/abs/2508.03965

作者:ang, Lin Cheng, Aswin Gnanaskandan, Ameya D. Jagtap
备注:35 pages, 25 figures
摘要:This paper introduces BubbleONet, an operator learning model designed to map pressure profiles from an input function space to corresponding bubble radius responses. BubbleONet is built upon the physics-informed deep operator network (PI-DeepONet) framework, leveraging DeepONet's powerful universal approximation capabilities for operator learning alongside the robust physical fidelity provided by the physics-informed neural networks. To mitigate the inherent spectral bias in deep learning, BubbleONet integrates the Rowdy adaptive activation function, enabling improved representation of high-frequency features. The model is evaluated across various scenarios, including: (1) Rayleigh-Plesset equation based bubble dynamics with a single initial radius, (2) Keller-Miksis equation based bubble dynamics with a single initial radius, and (3) Keller-Miksis equation based bubble dynamics with multiple initial radii. Moreover, the performance of single-step versus two-step training techniques for BubbleONet is investigated. The results demonstrate that BubbleONet serves as a promising surrogate model for simulating bubble dynamics, offering a computationally efficient alternative to traditional numerical solvers.
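代码示意:作为背景,下面用SciPy数值求解简化的Rayleigh-Plesset方程(忽略粘性与表面张力):$R\ddot R+\tfrac{3}{2}\dot R^2=(p_{\mathrm{gas}}-p_\infty(t))/\rho$,多方气体压强取$p_{\mathrm{gas}}=p_0(R_0/R)^{3k}$。BubbleONet学习的正是这类从压强轮廓$p_\infty(t)$到半径响应$R(t)$的映射;以下参数值仅为示意。
```python
import numpy as np
from scipy.integrate import solve_ivp

rho, p0, R0, k = 1000.0, 101325.0, 1e-4, 1.4   # illustrative parameter values

def p_inf(t):
    """Driving pressure profile: the 'input function' the operator maps from."""
    return p0 * (1.0 - 0.5 * np.sin(2 * np.pi * 2e4 * t))

def rhs(t, y):
    R, Rdot = y
    p_gas = p0 * (R0 / R) ** (3 * k)            # polytropic gas pressure
    Rddot = ((p_gas - p_inf(t)) / rho - 1.5 * Rdot ** 2) / R
    return [Rdot, Rddot]

sol = solve_ivp(rhs, (0.0, 2e-4), [R0, 0.0], rtol=1e-8, atol=1e-12, max_step=1e-7)
print("radius range:", sol.y[0].min(), sol.y[0].max())
```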


【14】ASTRA: Autonomous Spatial-Temporal Red-teaming for AI Software Assistants
标题:ASTRA:面向AI软件助手的自主时空红队测试
链接:https://arxiv.org/abs/2508.03936

作者:Xu, Guangyu Shen, Zian Su, Siyuan Cheng, Hanxi Guo, Lu Yan, Xuan Chen, Jiasheng Jiang, Xiaolong Jin, Chengpeng Wang, Zhuo Zhang, Xiangyu Zhang
备注:The first two authors (Xiangzhe Xu and Guangyu Shen) contributed equally to this work
摘要:AI coding assistants like GitHub Copilot are rapidly transforming software development, but their safety remains deeply uncertain-especially in high-stakes domains like cybersecurity. Current red-teaming tools often rely on fixed benchmarks or unrealistic prompts, missing many real-world vulnerabilities. We present ASTRA, an automated agent system designed to systematically uncover safety flaws in AI-driven code generation and security guidance systems. ASTRA works in three stages: (1) it builds structured domain-specific knowledge graphs that model complex software tasks and known weaknesses; (2) it performs online vulnerability exploration of each target model by adaptively probing both its input space, i.e., the spatial exploration, and its reasoning processes, i.e., the temporal exploration, guided by the knowledge graphs; and (3) it generates high-quality violation-inducing cases to improve model alignment. Unlike prior methods, ASTRA focuses on realistic inputs-requests that developers might actually ask-and uses both offline abstraction guided domain modeling and online domain knowledge graph adaptation to surface corner-case vulnerabilities. Across two major evaluation domains, ASTRA finds 11-66% more issues than existing techniques and produces test cases that lead to 17% more effective alignment training, showing its practical value for building safer AI systems.


【15】DP-NCB: Privacy Preserving Fair Bandits
标题:DP-NCB:隐私保护的公平老虎机算法
链接:https://arxiv.org/abs/2508.03836

作者:kar, Nishant Pandey, Sayak Ray Chowdhury
摘要:Multi-armed bandit algorithms are fundamental tools for sequential decision-making under uncertainty, with widespread applications across domains such as clinical trials and personalized decision-making. As bandit algorithms are increasingly deployed in these socially sensitive settings, it becomes critical to protect user data privacy and ensure fair treatment across decision rounds. While prior work has independently addressed privacy and fairness in bandit settings, the question of whether both objectives can be achieved simultaneously has remained largely open. Existing privacy-preserving bandit algorithms typically optimize average regret, a utilitarian measure, whereas fairness-aware approaches focus on minimizing Nash regret, which penalizes inequitable reward distributions, but often disregard privacy concerns.   To bridge this gap, we introduce Differentially Private Nash Confidence Bound (DP-NCB)-a novel and unified algorithmic framework that simultaneously ensures $\epsilon$-differential privacy and achieves order-optimal Nash regret, matching known lower bounds up to logarithmic factors. The framework is sufficiently general to operate under both global and local differential privacy models, and is anytime, requiring no prior knowledge of the time horizon. We support our theoretical guarantees with simulations on synthetic bandit instances, showing that DP-NCB incurs substantially lower Nash regret than state-of-the-art baselines. Our results offer a principled foundation for designing bandit algorithms that are both privacy-preserving and fair, making them suitable for high-stakes, socially impactful applications.
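代码示意:下面的NumPy玩具片段示意"在UCB型指数中使用经拉普拉斯噪声私有化的奖励统计量"这一基本思路。注意:每轮重新加噪只是为了演示机制,本身并不能给出整个序列上的$\epsilon$-差分隐私保证;DP-NCB的实际机制与其纳什遗憾分析远比此复杂。
```python
import numpy as np

rng = np.random.default_rng(0)
means, eps, T = np.array([0.3, 0.5, 0.7]), 1.0, 20_000   # Bernoulli arms
K = len(means)
counts, sums = np.zeros(K), np.zeros(K)

for t in range(T):
    noisy = sums + rng.laplace(scale=1.0 / eps, size=K)  # privatized reward sums
    with np.errstate(divide="ignore", invalid="ignore"):
        ucb = noisy / counts + np.sqrt(2 * np.log(t + 1) / counts)
    ucb[counts == 0] = np.inf                            # pull each arm once first
    a = int(np.argmax(ucb))
    counts[a] += 1
    sums[a] += float(rng.random() < means[a])            # Bernoulli reward

print("pulls per arm:", counts.astype(int))              # mostly the 0.7 arm
```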


【16】A Robust and Efficient Pipeline for Enterprise-Level Large-Scale Entity Resolution
标题:用于企业级大规模实体解析的稳健高效管道
链接:https://arxiv.org/abs/2508.03767

作者:Kannangara, Arman Abrahamyan, Daniel Elias, Thomas Kilby, Nadav Dar, Luiz Pizzato, Anna Leontjeva, Dan Jermyn
备注:10 pages, 5 figures
摘要:Entity resolution (ER) remains a significant challenge in data management, especially when dealing with large datasets. This paper introduces MERAI (Massive Entity Resolution using AI), a robust and efficient pipeline designed to address record deduplication and linkage issues in high-volume datasets at an enterprise level. The pipeline's resilience and accuracy have been validated through various large-scale record deduplication and linkage projects. To evaluate MERAI's performance, we compared it with two well-known entity resolution libraries, Dedupe and Splink. While Dedupe failed to scale beyond 2 million records due to memory constraints, MERAI successfully processed datasets of up to 15.7 million records and produced accurate results across all experiments. Experimental data demonstrates that MERAI outperforms both baseline systems in terms of matching accuracy, with consistently higher F1 scores in both deduplication and record linkage tasks. MERAI offers a scalable and reliable solution for enterprise-level large-scale entity resolution, ensuring data integrity and consistency in real-world applications.


【17】GlaBoost: A multimodal Structured Framework for Glaucoma Risk Stratification
标题:GlaBoost:用于青光眼风险分层的多模态结构化框架
链接:https://arxiv.org/abs/2508.03750

作者:ng, Weizheng Xie, Karanjit Kooner, Tsengdar Lee, Jui-Kai Wang, Jia Zhang
摘要:Early and accurate detection of glaucoma is critical to prevent irreversible vision loss. However, existing methods often rely on unimodal data and lack interpretability, limiting their clinical utility. In this paper, we present GlaBoost, a multimodal gradient boosting framework that integrates structured clinical features, fundus image embeddings, and expert-curated textual descriptions for glaucoma risk prediction. GlaBoost extracts high-level visual representations from retinal fundus photographs using a pretrained convolutional encoder and encodes free-text neuroretinal rim assessments using a transformer-based language model. These heterogeneous signals, combined with manually assessed risk scores and quantitative ophthalmic indicators, are fused into a unified feature space for classification via an enhanced XGBoost model. Experiments conducted on a real-world annotated dataset demonstrate that GlaBoost significantly outperforms baseline models, achieving a validation accuracy of 98.71%. Feature importance analysis reveals clinically consistent patterns, with cup-to-disc ratio, rim pallor, and specific textual embeddings contributing most to model decisions. GlaBoost offers a transparent and scalable solution for interpretable glaucoma diagnosis and can be extended to other ophthalmic disorders.
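代码示意:下面演示GlaBoost式的特征级融合:把图像嵌入、文本嵌入与结构化临床特征拼接后交给梯度提升分类器。各路特征均以随机数代替真实编码器输出,标签生成与超参数亦为示意。
```python
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
n = 500
img_emb = rng.normal(size=(n, 64))    # stand-in for CNN fundus-image embeddings
txt_emb = rng.normal(size=(n, 32))    # stand-in for transformer text embeddings
clinical = rng.normal(size=(n, 6))    # e.g., cup-to-disc ratio, rim scores, ...
y = (clinical[:, 0] + 0.5 * img_emb[:, 0] + rng.normal(size=n) > 0).astype(int)

X = np.hstack([img_emb, txt_emb, clinical])   # unified feature space
clf = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
clf.fit(X[:400], y[:400])
print("holdout accuracy:", float((clf.predict(X[400:]) == y[400:]).mean()))
```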


【18】Privileged Contrastive Pretraining for Multimodal Affect Modelling
标题:面向多模态情感建模的特权对比预训练
链接:https://arxiv.org/abs/2508.03729

作者:nitas, Konstantinos Makantasis, Georgios N. Yannakakis
摘要:Affective Computing (AC) has made significant progress with the advent of deep learning, yet a persistent challenge remains: the reliable transfer of affective models from controlled laboratory settings (in-vitro) to uncontrolled real-world environments (in-vivo). To address this challenge we introduce the Privileged Contrastive Pretraining (PriCon) framework according to which models are first pretrained via supervised contrastive learning (SCL) and then act as teacher models within a Learning Using Privileged Information (LUPI) framework. PriCon both leverages privileged information during training and enhances the robustness of derived affect models via SCL. Experiments conducted on two benchmark affective corpora, RECOLA and AGAIN, demonstrate that models trained using PriCon consistently outperform LUPI and end to end models. Remarkably, in many cases, PriCon models achieve performance comparable to models trained with access to all modalities during both training and testing. The findings underscore the potential of PriCon as a paradigm towards further bridging the gap between in-vitro and in-vivo affective modelling, offering a scalable and practical solution for real-world applications.
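代码示意:PriCon的第一阶段是监督对比学习(SCL)预训练。下面给出Khosla等人风格的监督对比损失的极简PyTorch实现(温度、批大小等均为示意假设),其中正样本为批内同标签的其他嵌入:
```python
import torch
import torch.nn.functional as F

def supcon_loss(z, labels, tau=0.1):
    """Supervised contrastive loss sketch: same-label samples are positives."""
    z = F.normalize(z, dim=1)
    sim = z @ z.T / tau
    n = len(z)
    eye = torch.eye(n, dtype=torch.bool)
    pos = (labels[:, None] == labels[None, :]) & ~eye
    logits = sim.masked_fill(eye, float("-inf"))      # exclude self-similarity
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    per_anchor = -(log_prob * pos).sum(1) / pos.sum(1).clamp(min=1)
    return per_anchor.mean()

z = torch.randn(16, 8, requires_grad=True)   # a batch of embeddings
labels = torch.randint(0, 3, (16,))
print(float(supcon_loss(z, labels)))
```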


【19】A Social Data-Driven System for Identifying Estate-related Events and Topics
标题:用于识别房地产相关事件和话题的社交数据驱动系统
链接:https://arxiv.org/abs/2508.03711

作者:Mu, Menglin Li, Kwan Hui Lim
备注:Accepted at ASONAM 2025
摘要:Social media platforms such as Twitter and Facebook have become deeply embedded in our everyday life, offering a dynamic stream of localized news and personal experiences. The ubiquity of these platforms position them as valuable resources for identifying estate-related issues, especially in the context of growing urban populations. In this work, we present a language model-based system for the detection and classification of estate-related events from social media content. Our system employs a hierarchical classification framework to first filter relevant posts and then categorize them into actionable estate-related topics. Additionally, for posts lacking explicit geotags, we apply a transformer-based geolocation module to infer posting locations at the point-of-interest level. This integrated approach supports timely, data-driven insights for urban management, operational response and situational awareness.


【20】Accept-Reject Lasso
标题:接受-拒绝套索
链接:https://arxiv.org/abs/2508.04646

作者:u, Yunqi Zhang
摘要:The Lasso method is known to exhibit instability in the presence of highly correlated features, often leading to an arbitrary selection of predictors. This issue manifests itself in two primary error types: the erroneous omission of features that lack a true substitutable relationship (falsely redundant features) and the inclusion of features with a true substitutable relationship (truly redundant features). Although most existing methods address only one of these challenges, we introduce the Accept-Reject Lasso (ARL), a novel approach that resolves this dilemma. ARL operationalizes an Accept-Reject framework through a fine-grained analysis of feature selection across data subsets. This framework is designed to partition the output of an ensemble method into beneficial and detrimental components through fine-grained analysis. The fundamental challenge for Lasso is that inter-variable correlation obscures the true sources of information. ARL tackles this by first using clustering to identify distinct subset structures within the data. It then analyzes Lasso's behavior across these subsets to differentiate between true and spurious correlations. For truly correlated features, which induce multicollinearity, ARL tends to select a single representative feature and reject the rest to ensure model stability. Conversely, for features linked by spurious correlations, which may vanish in certain subsets, ARL accepts those that Lasso might have incorrectly omitted. The distinct patterns arising from true versus spurious correlations create a divisible separation. By setting an appropriate threshold, our framework can effectively distinguish between these two phenomena, thereby maximizing the inclusion of informative variables while minimizing the introduction of detrimental ones. We illustrate the efficacy of the proposed method through extensive simulation and real-data experiments.
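代码示意:下面的草图演示ARL的基本直觉:先把数据聚成若干子集,在每个子集上分别运行Lasso,再统计各特征跨子集的被选频率;跨子集稳定共选的特征与时有时无的特征呈现可区分的模式。阈值与"接受/拒绝"规则在此均为示意,并非论文的具体判据。
```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 600, 12
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=n)   # features 0 and 1: true substitutes
y = 2 * X[:, 0] + 1.5 * X[:, 2] + rng.normal(size=n)

clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
freq = np.zeros(p)
for c in range(4):
    sub = clusters == c
    coef = Lasso(alpha=0.1).fit(X[sub], y[sub]).coef_
    freq += np.abs(coef) > 1e-8                  # which features got selected?
freq /= 4

print("per-feature selection frequency:", np.round(freq, 2))
print("stably selected (freq >= 0.75):", np.where(freq >= 0.75)[0])
```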


【21】LA-CaRe-CNN: Cascading Refinement CNN for Left Atrial Scar Segmentation
标题:LA-CaRe-CNN:用于左心房疤痕分割的级联细化CNN
链接:https://arxiv.org/abs/2508.04553

作者:ler, Darko Stern, Gernot Plank, Martin Urschler
备注:Accepted for the MICCAI Challenge on Comprehensive Analysis and Computing of Real-World Medical Images 2024, 12 pages
摘要:Atrial fibrillation (AF) represents the most prevalent type of cardiac arrhythmia for which treatment may require patients to undergo ablation therapy. In this surgery cardiac tissues are locally scarred on purpose to prevent electrical signals from causing arrhythmia. Patient-specific cardiac digital twin models show great potential for personalized ablation therapy, however, they demand accurate semantic segmentation of healthy and scarred tissue typically obtained from late gadolinium enhanced (LGE) magnetic resonance (MR) scans. In this work we propose the Left Atrial Cascading Refinement CNN (LA-CaRe-CNN), which aims to accurately segment the left atrium as well as left atrial scar tissue from LGE MR scans. LA-CaRe-CNN is a 2-stage CNN cascade that is trained end-to-end in 3D, where Stage 1 generates a prediction for the left atrium, which is then refined in Stage 2 in conjunction with the original image information to obtain a prediction for the left atrial scar tissue. To account for domain shift towards domains unknown during training, we employ strong intensity and spatial augmentation to increase the diversity of the training dataset. Our proposed method based on a 5-fold ensemble achieves great segmentation results, namely, 89.21% DSC and 1.6969 mm ASSD for the left atrium, as well as 64.59% DSC and 91.80% G-DSC for the more challenging left atrial scar tissue. Thus, segmentations obtained through LA-CaRe-CNN show great potential for the generation of patient-specific cardiac digital twin models and downstream tasks like personalized targeted ablation therapy to treat AF.


【22】Viability of perturbative expansion for quantum field theories on neurons
标题:神经元量子场论微扰展开的可行性
链接:https://arxiv.org/abs/2508.03810

作者:Sen, Varun Vaidya
备注:24 pages, 4 figures
摘要:Neural Network (NN) architectures that break statistical independence of parameters have been proposed as a new approach for simulating local quantum field theories (QFTs). In the infinite neuron number limit, single-layer NNs can exactly reproduce QFT results. This paper examines the viability of this architecture for perturbative calculations of local QFTs for finite neuron number $N$ using scalar $\phi^4$ theory in $d$ Euclidean dimensions as an example. We find that the renormalized $O(1/N)$ corrections to two- and four-point correlators yield perturbative series which are sensitive to the ultraviolet cut-off and therefore have a weak convergence. We propose a modification to the architecture to improve this convergence and discuss constraints on the parameters of the theory and the scaling of N which allow us to extract accurate field theory results.


【23】A semi-automatic approach to study population dynamics based on population pyramids
标题:基于人口金字塔研究人口动态的半自动方法
链接:https://arxiv.org/abs/2508.03788

作者:Klimroth, João Pedro Meireles, Laurie Bingaman Lackey, Nick van Eeuwijk, Mads F. Bertelsen, Paul W. Dierkes, Marcus Clauss
摘要:The depiction of populations - of humans or animals - as "population pyramids" is a useful tool for the assessment of various characteristics of populations at a glance. Although these visualisations are well-known objects in various communities, formalised and algorithmic approaches to gain information from these data are less present. Here, we present an algorithm-based classification of population data into "pyramids" of different shapes ([normal and inverted] pyramid / plunger / bell, [lower / middle / upper] diamond, column, hourglass) that are linked to specific characteristics of the population. To develop the algorithmic approach, we used data describing global zoo populations of mammals from 1970-2024. This algorithm-based approach delivers plausible classifications, in particular with respect to changes in population size linked to specific series of, and transitions between, different "pyramid" shapes. We believe this approach might become a useful tool for analysing and communicating historical population developments in multiple contexts and is of broad interest. Moreover, it might be useful for animal population management strategies.
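代码示意:形状分类的思路可以用一个粗糙的启发式说明:把年龄直方图分成下、中、上三段,按三段占比的大小关系给出形状标签。三段划分与判别规则均为示意,并非论文经校准的算法。
```python
import numpy as np

def pyramid_shape(counts):
    """Crude shape label from an age histogram (youngest bin first); the
    three-way split and the rules below are illustrative, not the paper's."""
    thirds = [c.sum() for c in np.array_split(np.asarray(counts, float), 3)]
    lo, mid, hi = [t / sum(thirds) for t in thirds]
    if lo > mid > hi:
        return "pyramid"
    if lo < mid < hi:
        return "inverted pyramid"
    if mid > lo and mid > hi:
        return "diamond"
    if mid < lo and mid < hi:
        return "hourglass"
    return "column"

print(pyramid_shape([50, 40, 30, 20, 10, 5]))   # pyramid
print(pyramid_shape([10, 30, 50, 50, 30, 10]))  # diamond
```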


机器翻译由腾讯交互翻译提供,仅供参考

点击“阅读原文”获取带摘要的学术速递
