cs.LG: 129 papers today
Large language models (17 papers)
【1】VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
Link: https://arxiv.org/abs/2507.13348
Authors: ang, Junyi Li, Xin Lai, Bei Yu, Hengshuang Zhao, Jiaya Jia
Comments: Code and models are available at this https URL
Abstract: Recent advancements in vision-language models (VLMs) have improved performance by increasing the number of visual tokens, which are often significantly longer than text tokens. However, we observe that most real-world scenarios do not require such an extensive number of visual tokens. While the performance drops significantly in a small subset of OCR-related tasks, models still perform accurately in most other general VQA tasks with only 1/4 resolution. Therefore, we propose to dynamically process distinct samples with different resolutions, and present a new paradigm for visual token compression, namely, VisionThink. It starts with a downsampled image and decides whether it is sufficient for solving the problem. If it is not, the model can output a special token to request the higher-resolution image. Compared to existing efficient VLM methods that compress tokens using fixed pruning ratios or thresholds, VisionThink autonomously decides whether to compress tokens case by case. As a result, it demonstrates strong fine-grained visual understanding capability on OCR-related tasks, while saving substantial visual tokens on simpler tasks. We adopt reinforcement learning and propose the LLM-as-Judge strategy to successfully apply RL to general VQA tasks. Moreover, we carefully design a reward function and penalty mechanism to achieve a stable and reasonable image resize call ratio. Extensive experiments demonstrate the superiority, efficiency, and effectiveness of our method. Our code is available at https://github.com/dvlab-research/VisionThink.
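Illustrative note (not from the paper): the "start low-resolution, escalate on request" control flow described in the abstract might be sketched roughly as follows. The token string `<request_high_res>`, the function name, and the `model` callable are all hypothetical placeholders; the actual token and interface used by VisionThink are not given in the abstract.

```python
from typing import Callable, Tuple

# Hypothetical special token; the paper's actual token is not stated in the abstract.
REQUEST_HIGH_RES = "<request_high_res>"


def answer_with_escalation(
    question: str,
    low_res_image: object,
    high_res_image: object,
    model: Callable[[str, object], str],
) -> Tuple[str, bool]:
    """Try the downsampled image first; escalate only if the model asks for it.

    Returns (answer, used_high_res).
    """
    first = model(question, low_res_image)
    if REQUEST_HIGH_RES not in first:
        # The low-resolution image was judged sufficient.
        return first, False
    # The model requested more detail, so re-run on the full-resolution image.
    return model(question, high_res_image), True
```

In this sketch the visual-token saving comes from the first branch: whenever the model answers directly, the high-resolution image is never encoded.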
【2】GeoReg: Weight-Constrained Few-Shot Regression for Socio-Economic Estimation using LLM
Link: https://arxiv.org/abs/2507.13323
Authors: Ahn, Sungwon Han, Seungeon Lee, Donghyun Ahn, Hyoshin Kim, Jungwon Kim, Jihee Kim, Sangyoon Park, Meeyoung Cha
Comments: 15 pages, 13 figures, 7 tables
Abstract: Socio-economic indicators like regional GDP, population, and education levels are crucial to shaping policy decisions and fostering sustainable development. This research introduces GeoReg, a regression model that integrates diverse data sources, including satellite imagery and web-based geospatial information, to estimate these indicators even for data-scarce regions such as developing countries. Our approach leverages the prior knowledge of a large language model (LLM) to address the scarcity of labeled data, with the LLM functioning as a data engineer by extracting informative features to enable effective estimation in few-shot settings. Specifically, our model obtains contextual relationships between data features and the target indicator, categorizing their correlations as positive, negative, mixed, or irrelevant. These features are then fed into a linear estimator with tailored weight constraints for each category. To capture nonlinear patterns, the model also identifies meaningful feature interactions and integrates them, along with nonlinear transformations. Experiments across three countries at different stages of development demonstrate that our model outperforms baselines in estimating socio-economic indicators, even for low-income countries with limited data availability.
【3】Automating Steering for Safe Multimodal Large Language Models
Link: https://arxiv.org/abs/2507.13255
Authors: Wu, Mengru Wang, Ziwen Xu, Tri Cao, Nay Oo, Bryan Hooi, Shumin Deng
Comments: Work in progress. 22 pages (8+ for main); 25 figures; 1 table
Abstract: Recent progress in Multimodal Large Language Models (MLLMs) has unlocked powerful cross-modal reasoning abilities, but also raised new safety concerns, particularly when faced with adversarial multimodal inputs. To improve the safety of MLLMs during inference, we introduce a modular and adaptive inference-time intervention technique, AutoSteer, which requires no fine-tuning of the underlying model. AutoSteer incorporates three core components: (1) a novel Safety Awareness Score (SAS) that automatically identifies the most safety-relevant distinctions among the model's internal layers; (2) an adaptive safety prober trained to estimate the likelihood of toxic outputs from intermediate representations; and (3) a lightweight Refusal Head that selectively intervenes to modulate generation when safety risks are detected. Experiments on LLaVA-OV and Chameleon across diverse safety-critical benchmarks demonstrate that AutoSteer significantly reduces the Attack Success Rate (ASR) for textual, visual, and cross-modal threats, while maintaining general abilities. These findings position AutoSteer as a practical, interpretable, and effective framework for safer deployment of multimodal AI systems.
【4】Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities
Link: https://arxiv.org/abs/2507.13158
Authors: Mihaela van der Schaar
Abstract: In the era of Large Language Models (LLMs), alignment has emerged as a fundamental yet challenging problem in the pursuit of more reliable, controllable, and capable machine intelligence. The recent success of reasoning models and conversational AI systems has underscored the critical role of reinforcement learning (RL) in enhancing these systems, driving increased research interest at the intersection of RL and LLM alignment. This paper provides a comprehensive review of recent advances in LLM alignment through the lens of inverse reinforcement learning (IRL), emphasizing the distinctions between RL techniques employed in LLM alignment and those in conventional RL tasks. In particular, we highlight the necessity of constructing neural reward models from human data and discuss the formal and practical implications of this paradigm shift. We begin by introducing fundamental concepts in RL to provide a foundation for readers unfamiliar with the field. We then examine recent advances in this research agenda, discussing key challenges and opportunities in conducting IRL for LLM alignment. Beyond methodological considerations, we explore practical aspects, including datasets, benchmarks, evaluation metrics, infrastructure, and computationally efficient training and inference techniques. Finally, we draw insights from the literature on sparse-reward RL to identify open questions and potential research directions. By synthesizing findings from diverse studies, we aim to provide a structured and critical overview of the field, highlight unresolved challenges, and outline promising future directions for improving LLM alignment through RL and IRL techniques.
【5】SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts
Link: https://arxiv.org/abs/2507.13105
Authors: ner, Sina Zarriess
Abstract: We introduce SemCSE, an unsupervised method for learning semantic embeddings of scientific texts. Building on recent advances in contrastive learning for text embeddings, our approach leverages LLM-generated summaries of scientific abstracts to train a model that positions semantically related summaries closer together in the embedding space. The resulting objective ensures that the model captures the true semantic content of a text, in contrast to traditional citation-based approaches that do not necessarily reflect semantic similarity. To validate this, we propose a novel benchmark designed to assess a model's ability to understand and encode the semantic content of scientific texts, demonstrating that our method enforces a stronger semantic separation within the embedding space. Additionally, we evaluate SemCSE on the comprehensive SciRepEval benchmark for scientific text embeddings, where it achieves state-of-the-art performance among models of its size, thus highlighting the benefits of a semantically focused training approach.
【6】Insights into a radiology-specialised multimodal large language model with sparse autoencoders
Link: https://arxiv.org/abs/2507.12950
Authors: zid, Shruthi Bannur, Daniel Coelho de Castro, Anton Schwaighofer, Javier Alvarez-Valle, Stephanie L. Hyland
Comments: Actionable Interpretability Workshop at ICML 2025. 24 pages, 7 figures, 5 tables
Abstract: Interpretability can improve the safety, transparency and trust of AI models, which is especially important in healthcare applications where decisions often carry significant consequences. Mechanistic interpretability, particularly through the use of sparse autoencoders (SAEs), offers a promising approach for uncovering human-interpretable features within large transformer-based models. In this study, we apply Matryoshka-SAE to the radiology-specialised multimodal large language model, MAIRA-2, to interpret its internal representations. Using large-scale automated interpretability of the SAE features, we identify a range of clinically relevant concepts, including medical devices (e.g., line and tube placements, pacemaker presence), pathologies such as pleural effusion and cardiomegaly, longitudinal changes, and textual features. We further examine the influence of these features on model behaviour through steering, demonstrating directional control over generations with mixed success. Our results reveal practical and methodological challenges, yet they offer initial insights into the internal concepts learned by MAIRA-2, marking a step toward deeper mechanistic understanding and interpretability of a radiology-adapted multimodal large language model and paving the way for improved model transparency. We release the trained SAEs and interpretations: https://huggingface.co/microsoft/maira-2-sae.
【7】Probabilistic Soundness Guarantees in LLM Reasoning Chains
Link: https://arxiv.org/abs/2507.12948
Authors: u, Anton Xue, Shreya Havaldar, Delip Rao, Helen Jin, Chris Callison-Burch, Eric Wong
Abstract: In reasoning chains generated by large language models (LLMs), initial errors often propagate and undermine the reliability of the final conclusion. Current LLM-based error detection methods often fail to detect propagated errors because they do not properly account for how earlier errors might corrupt judgments of downstream reasoning. To better detect such propagated errors, we introduce Autoregressive Reasoning Entailment Stability (ARES), a novel probabilistic framework that prevents error propagation by judging each claim based only on previously assessed sound premises. This inductive method yields a nuanced score for each step and provides certified statistical guarantees of its soundness, rather than a brittle binary label. ARES achieves state-of-the-art performance across four benchmarks (72.1% Macro-F1, +8.2 points) and demonstrates superior robustness on very long synthetic reasoning chains, where it excels at detecting propagated errors (90.3% F1, +27.6 points).
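Illustrative note (not from the paper): the idea of judging each claim only against premises already assessed as sound can be sketched as below. This is a deliberately simplified, deterministic version that uses a hard threshold; ARES itself is described as probabilistic with statistical guarantees, which this sketch does not reproduce. The `entail_prob` callable and the `threshold` parameter are hypothetical.

```python
from typing import Callable, List, Sequence


def score_chain(
    premises: Sequence[str],
    steps: Sequence[str],
    entail_prob: Callable[[Sequence[str], str], float],
    threshold: float = 0.5,
) -> List[float]:
    """Score each step using only the base premises and earlier steps judged sound."""
    sound = list(premises)
    scores = []
    for step in steps:
        # The judge sees only the context that has already been accepted.
        p = entail_prob(sound, step)
        scores.append(p)
        if p >= threshold:
            # Only steps judged sound may support later steps.
            sound.append(step)
    return scores
```

The design point is the `if` at the end: a step that fails is never added to the context, so a later step that depends on it cannot be scored as sound merely because the faulty step appears earlier in the chain.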
【8】Trace Reconstruction with Language Models
Link: https://arxiv.org/abs/2507.12927
Authors: Weindel, Michael Girsch, Reinhard Heckel
Abstract: The general trace reconstruction problem seeks to recover an original sequence from its noisy copies independently corrupted by deletions, insertions, and substitutions. This problem arises in applications such as DNA data storage, a promising storage medium due to its high information density and longevity. However, errors introduced during DNA synthesis, storage, and sequencing require correction through algorithms and codes, with trace reconstruction often used as part of the data retrieval process. In this work, we propose TReconLM, which leverages language models trained on next-token prediction for trace reconstruction. We pretrain language models on synthetic data and fine-tune on real-world data to adapt to technology-specific error patterns. TReconLM outperforms state-of-the-art trace reconstruction algorithms, including prior deep learning approaches, recovering a substantially higher fraction of sequences without error.
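Illustrative note (not from the paper): the problem setup, noisy copies of one sequence corrupted independently by deletions, insertions, and substitutions, can be simulated with a simple per-position channel like the one below. The DNA alphabet, the error rates, and the function names are assumptions chosen for illustration; they are not the synthetic-data procedure used for TReconLM.

```python
import random

ALPHABET = "ACGT"  # assumed DNA alphabet


def corrupt(seq, p_del, p_ins, p_sub, rng):
    """Return one noisy copy of seq after insertions, deletions, and substitutions."""
    out = []
    for base in seq:
        if rng.random() < p_ins:
            # Insertion: add a random base before the current one.
            out.append(rng.choice(ALPHABET))
        r = rng.random()
        if r < p_del:
            # Deletion: drop the current base.
            continue
        if r < p_del + p_sub:
            # Substitution: replace the base with a different one.
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)
    return "".join(out)


def make_traces(seq, n, p_del=0.05, p_ins=0.05, p_sub=0.05, seed=0):
    """Generate n independently corrupted traces of seq."""
    rng = random.Random(seed)
    return [corrupt(seq, p_del, p_ins, p_sub, rng) for _ in range(n)]
```

A reconstruction algorithm would take the list returned by `make_traces` as input and try to output the original `seq`.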
【9】VAR-MATH: Probing True Mathematical Reasoning in Large Language Models via Symbolic Multi-Instance Benchmarks
Link: https://arxiv.org/abs/2507.12885
Authors: Ran Cheng, Kay Chen Tan
Abstract: Recent advances in reinforcement learning (RL) have led to substantial improvements in the mathematical reasoning abilities of large language models (LLMs), as measured by standard benchmarks. However, these gains often persist even when models are trained with flawed signals, such as random or inverted rewards, raising a fundamental question: do such improvements reflect true reasoning, or are they merely artifacts of overfitting to benchmark-specific patterns? To address this question, we take an evaluation-centric perspective and identify two critical shortcomings in existing protocols. First, benchmark contamination arises from the public availability of test problems, increasing the risk of data leakage. Second, evaluation fragility stems from the reliance on single-instance assessments, which are highly sensitive to stochastic outputs and fail to capture reasoning consistency. To overcome these limitations, we introduce VAR-MATH, a symbolic evaluation framework designed to probe genuine reasoning ability. By converting fixed numerical problems into symbolic templates and requiring models to solve multiple instantiations of each, VAR-MATH enforces consistent reasoning across structurally equivalent variants, thereby mitigating contamination and improving evaluation robustness. We apply VAR-MATH to transform two popular benchmarks, AMC23 and AIME24, into their symbolic counterparts, VAR-AMC23 and VAR-AIME24. Experimental results reveal substantial performance drops for RL-trained models on the variabilized versions, especially for smaller models, with average declines of 48.0% on AMC23 and 58.3% on AIME24. These findings suggest that many existing RL methods rely on superficial heuristics and fail to generalize beyond specific numerical forms. Overall, VAR-MATH offers a principled, contamination-resistant evaluation paradigm for mathematical reasoning.
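Illustrative note (not from the paper): a symbolic template with multiple instantiations might look roughly like the sketch below. The template layout, the function name, and in particular the all-or-nothing scoring rule (a template counts as solved only if every sampled instance is answered correctly) are assumptions made for illustration; the abstract does not spell out the exact scoring.

```python
import random


def consistent_score(template, solver, k=5, seed=0):
    """Return True only if the solver answers all k sampled instances correctly.

    template: dict with a "question" format string, "vars" mapping each
    variable name to its allowed values, and an "answer" function.
    """
    rng = random.Random(seed)
    for _ in range(k):
        # Sample concrete values for every symbolic variable.
        values = {name: rng.choice(list(choices))
                  for name, choices in template["vars"].items()}
        question = template["question"].format(**values)
        if solver(question) != template["answer"](**values):
            return False
    return True


# Hypothetical toy template: the fixed problem "What is 3 + 4?" made symbolic.
ADD_TEMPLATE = {
    "question": "What is {a} + {b}?",
    "vars": {"a": range(1, 50), "b": range(1, 50)},
    "answer": lambda a, b: a + b,
}
```

Under this scoring, a solver that has memorized the answer to one fixed instance fails as soon as the numbers change, while a solver that actually computes the sum passes every instance.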
【10】Large Language Models' Internal Perception of Symbolic Music
Link: https://arxiv.org/abs/2507.12808
Authors: in, Kunitake Kaneko
Abstract: Large language models (LLMs) excel at modeling relationships between strings in natural language and have shown promise in extending to other symbolic domains like coding or mathematics. However, the extent to which they implicitly model symbolic music remains underexplored. This paper investigates how LLMs represent musical concepts by generating symbolic music data from textual prompts describing combinations of genres and styles, and evaluating their utility through recognition and generation tasks. We produce a dataset of LLM-generated MIDI files without relying on explicit musical training. We then train neural networks entirely on this LLM-generated MIDI dataset and perform genre and style classification as well as melody completion, benchmarking their performance against established models. Our results demonstrate that LLMs can infer rudimentary musical structures and temporal relationships from text, highlighting both their potential to implicitly encode musical patterns and their limitations due to a lack of explicit musical context, shedding light on their generative capabilities for symbolic music.
【11】A Comprehensive Survey of Electronic Health Record Modeling: From Deep Learning Approaches to Large Language Models
Link: https://arxiv.org/abs/2507.12774
Authors: g Ren, Jingxi Zhu, Zehao Liu, Tianxiang Zhao, Vasant Honavar
Abstract: Artificial intelligence (AI) has demonstrated significant potential in transforming healthcare through the analysis and modeling of electronic health records (EHRs). However, the inherent heterogeneity, temporal irregularity, and domain-specific nature of EHR data present unique challenges that differ fundamentally from those in vision and natural language tasks. This survey offers a comprehensive overview of recent advancements at the intersection of deep learning, large language models (LLMs), and EHR modeling. We introduce a unified taxonomy that spans five key design dimensions: data-centric approaches, neural architecture design, learning-focused strategies, multimodal learning, and LLM-based modeling systems. Within each dimension, we review representative methods addressing data quality enhancement, structural and temporal representation, self-supervised learning, and integration with clinical knowledge. We further highlight emerging trends such as foundation models, LLM-driven clinical agents, and EHR-to-text translation for downstream reasoning. Finally, we discuss open challenges in benchmarking, explainability, clinical alignment, and generalization across diverse clinical settings. This survey aims to provide a structured roadmap for advancing AI-driven EHR modeling and clinical decision support. For a comprehensive list of EHR-related methods, see https://survey-on-tabular-data.github.io/.
【12】VLMgineer: Vision Language Models as Robotic Toolsmiths
Link: https://arxiv.org/abs/2507.12644
Authors: ayuan Gao, Tianyu Li, Junyao Shi, Yihan Li, Zizhe Zhang, Nadia Figueroa, Dinesh Jayaraman
Comments: Project Website: this https URL
Abstract: Tool design and use reflect the ability to understand and manipulate the physical world through creativity, planning, and foresight. As such, these capabilities are often regarded as measurable indicators of intelligence across biological species. While much of today's research on robotic intelligence focuses on generating better controllers, inventing smarter tools offers a complementary form of physical intelligence: shifting the onus of problem-solving onto the tool's design. Given the vast and impressive common-sense, reasoning, and creative capabilities of today's foundation models, we investigate whether these models can provide useful priors to automatically design and effectively wield such tools. We present VLMgineer, a framework that harnesses the code generation abilities of vision language models (VLMs) together with evolutionary search to iteratively co-design physical tools and the action plans that operate them to perform a task. We evaluate VLMgineer on a diverse new benchmark of everyday manipulation scenarios that demand creative tool design and use. Across this suite, VLMgineer consistently discovers tools and policies that solve tasks more effectively and innovatively, transforming challenging robotics problems into straightforward executions. It also outperforms VLM-generated designs from human specifications and existing human-crafted tools for everyday tasks. To facilitate future research on automated tool invention, we will release our benchmark and code.
【13】BootSeer: Analyzing and Mitigating Initialization Bottlenecks in Large-Scale LLM Training
Link: https://arxiv.org/abs/2507.12619
Authors: iaoyun Zhi, Jinxin Chi, Menghan Yu, Lixin Huang, Jia Zhu, Weilun Zhang, Xing Ma, Wenjia Liu, Zhicheng Zhu, Daowen Luo, Zuquan Song, Xin Yin, Chao Xiang, Shuguang Wang, Wencong Xiao, Gene Cooperman
Comments: 18 pages, 14 figures
Abstract: Large Language Models (LLMs) have become a cornerstone of modern AI, driving breakthroughs in natural language processing and expanding into multimodal jobs involving images, audio, and video. As with most computational software, it is important to distinguish between ordinary runtime performance and startup overhead. Prior research has focused on runtime performance: improving training efficiency and stability. This work focuses instead on the increasingly critical issue of startup overhead in training: the delay before training jobs begin execution. Startup overhead is particularly important in large, industrial-scale LLMs, where failures occur more frequently and multiple teams operate in iterative update-debug cycles. In one of our training clusters, more than 3.5% of GPU time is wasted due to startup overhead alone. In this work, we present the first in-depth characterization of LLM training startup overhead based on real production data. We analyze the components of startup cost, quantify its direct impact, and examine how it scales with job size. These insights motivate the design of Bootseer, a system-level optimization framework that addresses three primary startup bottlenecks: (a) container image loading, (b) runtime dependency installation, and (c) model checkpoint resumption. To mitigate these bottlenecks, Bootseer introduces three techniques: (a) hot block record-and-prefetch, (b) dependency snapshotting, and (c) striped HDFS-FUSE. Bootseer has been deployed in a production environment and evaluated on real LLM training workloads, demonstrating a 50% reduction in startup overhead.
【14】Assay2Mol: large language model-based drug design using BioAssay context
Link: https://arxiv.org/abs/2507.12574
Authors: g, Spencer S. Ericksen, Anthony Gitter
Comments: 23 pages, 10 figures
Abstract: Scientific databases aggregate vast amounts of quantitative data alongside descriptive text. In biochemistry, molecule screening assays evaluate the functional responses of candidate molecules against disease targets. Unstructured text that describes the biological mechanisms through which these targets operate, experimental screening protocols, and other attributes of assays offers rich information for new drug discovery campaigns but has remained untapped because of its unstructured format. We present Assay2Mol, a large language model-based workflow that can capitalize on the vast existing biochemical screening assays for early-stage drug discovery. Assay2Mol retrieves existing assay records involving targets similar to the new target and generates candidate molecules using in-context learning with the retrieved assay screening data. Assay2Mol outperforms recent machine learning approaches that generate candidate ligand molecules for target protein structures, while also promoting more synthesizable molecule generation.
【15】Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training
Link: https://arxiv.org/abs/2507.12507
Authors: iu, Shizhe Diao, Jian Hu, Ximing Lu, Xin Dong, Hao Zhang, Alexander Bukharin, Shaokun Zhang, Jiaqi Zeng, Makesh Narsimhan Sreedhar, Gerald Shen, David Mosallanezhad, Di Zhang, Jonas Yang, June Yang, Oleksii Kuchaiev, Guilin Liu, Zhiding Yu, Pavlo Molchanov, Yejin Choi, Jan Kautz, Yi Dong
Comments: 14 pages, 7 figures
Abstract: Recent advancements in reasoning-focused language models such as OpenAI's O1 and DeepSeek-R1 have shown that scaling test-time computation, through chain-of-thought reasoning and iterative exploration, can yield substantial improvements on complex tasks like mathematics and code generation. These breakthroughs have been driven by large-scale reinforcement learning (RL), particularly when combined with verifiable reward signals that provide objective and grounded supervision. In this report, we investigate the effects of prolonged reinforcement learning on a small language model across a diverse set of reasoning domains. Our work identifies several key ingredients for effective training, including the use of verifiable reward tasks, enhancements to Group Relative Policy Optimization (GRPO), and practical techniques to improve training stability and generalization. We introduce controlled KL regularization, clipping ratio, and periodic reference policy resets as critical components for unlocking long-term performance gains. Our model achieves significant improvements over strong baselines, including +14.7% on math, +13.9% on coding, and +54.8% on logic puzzle tasks. To facilitate continued research, we release our model publicly.
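Illustrative note (not from the report): the "group relative" part of GRPO is commonly described as normalizing each sampled response's reward against the other responses drawn for the same prompt. A minimal sketch of that normalization is below. The use of the population standard deviation and the `eps` stabilizer are assumptions, and the report's own enhancements (controlled KL regularization, clipping ratio, periodic reference policy resets) are not shown.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation (an assumption)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With verifiable 0/1 rewards, responses that pass get a positive advantage and responses that fail get a negative one; if all responses in a group receive the same reward, every advantage is zero and the group contributes no learning signal.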
【16】Spatially Grounded Explanations in Vision Language Models for Document Visual Question Answering
Link: https://arxiv.org/abs/2507.12490
Authors: no Hormazábal Lagos, Héctor Cerezo-Costas, Dimosthenis Karatzas
Comments: This work has been accepted for presentation at the 16th Conference and Labs of the Evaluation Forum (CLEF 2025) and will be published in the proceedings by Springer in the Lecture Notes in Computer Science (LNCS) series. Please cite the published version when available
Abstract: We introduce EaGERS, a fully training-free and model-agnostic pipeline that (1) generates natural language rationales via a vision language model, (2) grounds these rationales to spatial sub-regions by computing multimodal embedding similarities over a configurable grid with majority voting, and (3) restricts response generation to the relevant regions selected in the masked image. Experiments on the DocVQA dataset demonstrate that our best configuration not only outperforms the base model on exact match accuracy and Average Normalized Levenshtein Similarity metrics but also enhances transparency and reproducibility in DocVQA without additional model fine-tuning.
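Illustrative note (not from the paper): Average Normalized Levenshtein Similarity (ANLS), one of the two metrics named in the abstract, is usually computed as sketched below: for each question, take the best score over the accepted answers, where the score is one minus the normalized edit distance, set to zero when that distance reaches a threshold (0.5 by convention). The lowercasing and whitespace stripping here are assumptions about preprocessing, not details taken from the paper.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def anls(predictions, references, tau=0.5):
    """predictions: list of str; references: list of lists of accepted answers."""
    total = 0.0
    for pred, refs in zip(predictions, references):
        best = 0.0
        for ref in refs:
            p, r = pred.strip().lower(), ref.strip().lower()
            denom = max(len(p), len(r))
            nl = levenshtein(p, r) / denom if denom else 0.0
            # Answers that are too far from the reference score zero.
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        total += best
    return total / len(predictions)
```

Unlike exact match, this metric gives partial credit to answers with small OCR-style character errors while still scoring unrelated answers as zero.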
【17】Kodezi Chronos: A Debugging-First Language Model for Repository-Scale, Memory-Driven Code Understanding
标题:Kodezi Chronos:用于存储库规模、内存驱动代码理解的初始语言模型
链接:https://arxiv.org/abs/2507.12482
作者:an, Assad Chowdary, Sharoz Haseeb, Urvish Patel
备注:10 pages, 10 figures, 7 tables, IEEE Conference format, Q4 2025 model release, Q1 2026 Kodezi OS deployment
摘要:大型语言模型(LLM)具有先进的代码生成和软件自动化,但从根本上受到有限的推理时间上下文和缺乏显式代码结构推理的限制。我们介绍了Kodezi Chronos,这是一种用于自主代码理解、调试和维护的下一代架构,旨在跨超长上下文运行,包括整个代码库、历史和文档,所有这些都没有固定的窗口限制。Kodezi Chronos利用多级嵌入式内存引擎,将基于矢量和图形的索引与持续的代码感知检索相结合。这可以对数百万行代码进行高效和准确的推理,支持存储库规模的理解,多文件重构和实时自我修复操作。我们的评估介绍了一种新的多随机检索基准,专门针对软件工程领域。与经典的检索基准测试不同,这种方法要求模型解决任意距离和混淆的代码工件之间的关联,模拟现实的任务,如变量跟踪,依赖迁移和语义错误定位。Chronos优于之前的LLM和代码模型,与传统的基于序列的方法相比,在现实世界的错误检测方面提高了23%,并将调试周期缩短了40%。通过与IDE和CI/CD工作流的原生接口,Chronos实现了无缝的自主软件维护,提高了代码可靠性和生产力,同时减少了手动工作。这些结果标志着迈向自我维持、持续优化的软件生态系统的关键进展。
摘要:Large Language Models (LLMs) have advanced code generation and software automation, but are fundamentally constrained by limited inference-time context and lack of explicit code structure reasoning. We introduce Kodezi Chronos, a next-generation architecture for autonomous code understanding, debugging, and maintenance, designed to operate across ultra-long contexts comprising entire codebases, histories, and documentation, all without fixed window limits. Kodezi Chronos leverages a multi-level embedding memory engine, combining vector and graph-based indexing with continuous code-aware retrieval. This enables efficient and accurate reasoning over millions of lines of code, supporting repository-scale comprehension, multi-file refactoring, and real-time self-healing actions. Our evaluation introduces a novel Multi Random Retrieval benchmark, specifically tailored to the software engineering domain. Unlike classical retrieval benchmarks, this method requires the model to resolve arbitrarily distant and obfuscated associations across code artifacts, simulating realistic tasks such as variable tracing, dependency migration, and semantic bug localization. Chronos outperforms prior LLMs and code models, demonstrating a 23% improvement in real-world bug detection and reducing debugging cycles by up to 40% compared to traditional sequence-based approaches. By natively interfacing with IDEs and CI/CD workflows, Chronos enables seamless, autonomous software maintenance, elevating code reliability and productivity while reducing manual effort. These results mark a critical advance toward self-sustaining, continuously optimized software ecosystems.
Graph相关(图学习|图神经网络|图优化等)(6篇)
【1】NGTM: Substructure-based Neural Graph Topic Model for Interpretable Graph Generation
标题:NGTM:用于可解释图生成的基于子结构的神经图主题模型
链接:https://arxiv.org/abs/2507.13133
作者:huang, Dazhong Shen, Ying Sun
摘要:图生成在许多领域中起着关键作用,包括分子设计和知识图构建。虽然现有的方法在生成逼真的图形方面取得了相当大的成功,但它们的可解释性仍然有限,往往模糊了结构决策背后的基本原理。为了应对这一挑战,我们提出了神经图主题模型(NGTM),这是一种受自然语言处理中主题建模启发的新型生成框架。NGTM将图表示为潜在主题的混合物,每个主题定义了语义上有意义的子结构的分布,这有助于在局部和全局尺度上的显式可解释性。生成过程透明地将这些主题分布与全局结构变量集成在一起,从而可以对每个生成的图进行清晰的语义跟踪。实验表明,NGTM实现了有竞争力的生成质量,同时独特地实现了细粒度控制和可解释性,允许用户通过主题级调整来调整结构特征或诱导生物特性。
摘要:Graph generation plays a pivotal role across numerous domains, including molecular design and knowledge graph construction. Although existing methods achieve considerable success in generating realistic graphs, their interpretability remains limited, often obscuring the rationale behind structural decisions. To address this challenge, we propose the Neural Graph Topic Model (NGTM), a novel generative framework inspired by topic modeling in natural language processing. NGTM represents graphs as mixtures of latent topics, each defining a distribution over semantically meaningful substructures, which facilitates explicit interpretability at both local and global scales. The generation process transparently integrates these topic distributions with a global structural variable, enabling clear semantic tracing of each generated graph. Experiments demonstrate that NGTM achieves competitive generation quality while uniquely enabling fine-grained control and interpretability, allowing users to tune structural features or induce biological properties through topic-level adjustments.
【2】On statistical learning of graphs
标题:论图的统计学习
链接:https://arxiv.org/abs/2507.13054
作者:Cipriani, Valentino Delle Rose, Luca San Mauro, Giovanni Solda
摘要:我们研究由可数无限图G的副本构成的假设类的PAC可学习性和在线可学习性,其中每个副本由对G的顶点作置换诱导而来。这相当于在已知图的结构和标签集的情况下学习图的标注。我们考虑置换只移动有限多个顶点的类。我们的主要结果表明,所有此类有限支撑副本的PAC可学习性蕴含G的完整同构型的在线可学习性,并且等价于自同构平凡性条件。我们还利用无限随机图扩张性质的一个松弛,刻画了由交换两个顶点所诱导的副本不可学习的图。最后,我们证明,对所有G和k>2,k顶点置换的可学习性等价于2顶点置换的可学习性,由此得到无限图的一个四类划分,并利用描述集合论和可计算性理论的工具确定了其复杂度。
摘要:We study PAC and online learnability of hypothesis classes formed by copies of a countably infinite graph G, where each copy is induced by permuting G's vertices. This corresponds to learning a graph's labeling, knowing its structure and label set. We consider classes where permutations move only finitely many vertices. Our main result shows that PAC learnability of all such finite-support copies implies online learnability of the full isomorphism type of G, and is equivalent to the condition of automorphic triviality. We also characterize graphs where copies induced by swapping two vertices are not learnable, using a relaxation of the extension property of the infinite random graph. Finally, we show that, for all G and k>2, learnability for k-vertex permutations is equivalent to that for 2-vertex permutations, yielding a four-class partition of infinite graphs, whose complexity we also determine using tools coming from both descriptive set theory and computability theory.
【3】SMART: Relation-Aware Learning of Geometric Representations for Knowledge Graphs
标题:SMART:知识图谱几何表示的关系感知学习
链接:https://arxiv.org/abs/2507.13001
作者:uzouvi, Bowen Song, Andrea Coletta, Luigi Bellomarini, Jens Lehmann, Sahar Vahdati
摘要:知识图谱表示学习方法提供了知识图谱(KG)中三元组形式的符号知识与其特征向量之间的映射。知识图谱嵌入(KGE)模型通常将KG中的关系表示为几何变换。大多数最先进的(SOTA)KGE模型都源自基本几何变换(EGT),例如平移、缩放、旋转和反射,或它们的组合。这些几何变换使模型能够有效地保留KG的特定结构和关系模式。然而,由于没有考虑特定于关系的变换,KGE目前对EGT的利用仍不充分。虽然最近的模型试图通过以不同方式集成SOTA基线模型来解决这个问题,但这些基线只使用单一或复合的几何变换来表示所有关系。在本文中,我们提出了一个框架,用于评估每个关系与不同几何变换的契合程度。基于这一排序,模型可以:(1)为每个关系分配最匹配的变换,或者(2)使用多数投票选择一种变换类型并应用于所有关系。也就是说,该模型通过注意力机制在低维向量空间中学习单一的、特定于关系的EGT。此外,我们将在低维空间中学到的关系与EGT之间的相关性用于高维向量空间中的关系嵌入。我们在三个基准KG和一个真实世界的金融KG上进行了综合评估,证明了模型的有效性,其性能与领先模型相当。
摘要:Knowledge graph representation learning approaches provide a mapping between symbolic knowledge in the form of triples in a knowledge graph (KG) and their feature vectors. Knowledge graph embedding (KGE) models often represent relations in a KG as geometric transformations. Most state-of-the-art (SOTA) KGE models are derived from elementary geometric transformations (EGTs), such as translation, scaling, rotation, and reflection, or their combinations. These geometric transformations enable the models to effectively preserve specific structural and relational patterns of the KG. However, the current use of EGTs by KGEs remains insufficient without considering relation-specific transformations. Although recent models attempted to address this problem by ensembling SOTA baseline models in different ways, such baselines use only a single or composite version of geometric transformations to represent all the relations. In this paper, we propose a framework that evaluates how well each relation fits with different geometric transformations. Based on this ranking, the model can: (1) assign the best-matching transformation to each relation, or (2) use majority voting to choose one transformation type to apply across all relations. That is, the model learns a single relation-specific EGT in a low-dimensional vector space through an attention mechanism. Furthermore, we use the correlation between relations and EGTs, which is learned in a low dimension, for relation embeddings in a high-dimensional vector space. The effectiveness of our models is demonstrated through comprehensive evaluations on three benchmark KGs as well as a real-world financial KG, showing performance comparable to leading models.
【4】A Spectral Interpretation of Redundancy in a Graph Reservoir
标题:图储备池中冗余的谱解释
链接:https://arxiv.org/abs/2507.12963
作者:n, Alessandro Sperduti
备注:This paper has been accepted for presentation at the 3rd International Workshop on Reservoir Computing (RC 2025) at ICANN 2025
摘要:储备池计算已成功应用于图,作为一种预处理方法来提高图神经网络(GNN)的训练效率。然而,在图上重复应用层算子时常见的一个问题是过度平滑,即图信号向图拉普拉斯算子的低频分量收敛。这项工作重新审视了多分辨率储备池图神经网络(MRGNN,一种谱储备池模型)中储备池的定义,并提出了一个基于Fairing(光顺)算法的变体,该算法最初出自计算机图形学的曲面设计领域。该算法提供了一个通带谱滤波器,可以在不收缩的情况下进行平滑,并且可以通过拉普拉斯算子适配到图的设定。鉴于其谱形式,该方法自然地与GNN架构相衔接,适用于平滑在得到适当控制时可能有益的任务,例如图分类。本文的核心贡献在于从随机游走的角度对该算法进行理论分析。特别是,它表明调整谱系数可以解释为调节冗余随机游走的贡献。基于MRGNN架构的探索性实验说明了这种方法的潜力,并为未来的研究指出了有前景的方向。
摘要:Reservoir computing has been successfully applied to graphs as a preprocessing method to improve the training efficiency of Graph Neural Networks (GNNs). However, a common issue that arises when repeatedly applying layer operators on graphs is over-smoothing, which consists in the convergence of graph signals toward low-frequency components of the graph Laplacian. This work revisits the definition of the reservoir in the Multiresolution Reservoir Graph Neural Network (MRGNN), a spectral reservoir model, and proposes a variant based on a Fairing algorithm originally introduced in the field of surface design in computer graphics. This algorithm provides a pass-band spectral filter that allows smoothing without shrinkage, and it can be adapted to the graph setting through the Laplacian operator. Given its spectral formulation, this method naturally connects to GNN architectures for tasks where smoothing, when properly controlled, can be beneficial, such as graph classification. The core contribution of the paper lies in the theoretical analysis of the algorithm from a random walks perspective. In particular, it shows how tuning the spectral coefficients can be interpreted as modulating the contribution of redundant random walks. Exploratory experiments based on the MRGNN architecture illustrate the potential of this approach and suggest promising directions for future research.
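Editor's sketch: the abstract does not name the Fairing algorithm, but its description (a pass-band filter that smooths without shrinkage) matches Taubin-style λ|μ smoothing, so the example below assumes that scheme. It applies the two-step filter (I − μL)(I − λL) with λ > 0 > μ, using the random-walk Laplacian L = I − D⁻¹A on a graph signal. The choice of Laplacian, the coefficient values, and all function names are assumptions for illustration; the MRGNN reservoir itself is not reproduced.

```python
import math


def laplacian_apply(adj, x):
    """Return L x for the random-walk Laplacian L = I - D^{-1} A.

    `adj` is an adjacency list; every node is assumed to have a neighbor.
    """
    return [x[i] - sum(x[j] for j in nbrs) / len(nbrs)
            for i, nbrs in enumerate(adj)]


def taubin_smooth(adj, x, lam=0.33, mu=-0.34, steps=10):
    """Apply `steps` rounds of the two-step filter (I - mu L)(I - lam L).

    With 0 < lam < -mu the per-round transfer function on a Laplacian
    eigenvalue k is (1 - lam*k)(1 - mu*k): close to 1 for small k and
    well below 1 for large k.
    """
    x = list(x)
    for _ in range(steps):
        lx = laplacian_apply(adj, x)
        x = [xi - lam * li for xi, li in zip(x, lx)]  # shrinking step
        lx = laplacian_apply(adj, x)
        x = [xi - mu * li for xi, li in zip(x, lx)]   # inflating step (mu < 0)
    return x


def cycle_graph(n):
    """Adjacency list of an n-node cycle (hypothetical test graph)."""
    return [[(i - 1) % n, (i + 1) % n] for i in range(n)]


adj = cycle_graph(12)
low = [math.cos(2 * math.pi * i / 12) for i in range(12)]  # low-frequency signal
high = [(-1.0) ** i for i in range(12)]                    # highest-frequency signal
low_out = taubin_smooth(adj, low)
high_out = taubin_smooth(adj, high)
```

On this toy cycle the low-frequency signal keeps almost all of its amplitude while the alternating signal is strongly attenuated, which is the "smoothing without shrinkage" behavior the abstract refers to.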
【5】Multi-Channel Graph Neural Network for Financial Risk Prediction of NEEQ Enterprises
标题:用于NEEQ企业财务风险预测的多通道图神经网络
链接:https://arxiv.org/abs/2507.12787
作者:u
备注:10 pages, 4 figures. Submitted for conference review
摘要:随着中国多层次资本市场的不断发展,全国中小企业股份转让系统(简称“新三板”)已成为中小企业重要的融资平台。然而,由于其规模和财务弹性有限,许多NEEQ上市公司面临财务困境的风险增加。为了解决这个问题,我们提出了一个多通道深度学习框架,该框架集成了结构化财务指标、文本披露和企业关系数据,以进行全面的财务风险预测。具体来说,我们设计了一个三通道图同构网络(GIN),分别处理数字,文本和基于图形的输入。这些模态特定的表示融合使用基于注意力的机制,然后通过门控单元,以提高鲁棒性和预测精度。对7,731家新三板公司数据的实验结果表明,我们的模型在AUC、精度、召回率和F1得分方面明显优于传统机器学习方法和单一模态基线。这项工作为中小企业的风险建模提供了理论和实践见解,并提供了一个数据驱动的工具,以支持金融监管机构和投资者。
摘要:With the continuous evolution of China's multi-level capital market, the National Equities Exchange and Quotations (NEEQ), also known as the "New Third Board," has become a critical financing platform for small and medium-sized enterprises (SMEs). However, due to their limited scale and financial resilience, many NEEQ-listed companies face elevated risks of financial distress. To address this issue, we propose a multi-channel deep learning framework that integrates structured financial indicators, textual disclosures, and enterprise relationship data for comprehensive financial risk prediction. Specifically, we design a Triple-Channel Graph Isomorphism Network (GIN) that processes numeric, textual, and graph-based inputs separately. These modality-specific representations are fused using an attention-based mechanism followed by a gating unit to enhance robustness and prediction accuracy. Experimental results on data from 7,731 real-world NEEQ companies demonstrate that our model significantly outperforms traditional machine learning methods and single-modality baselines in terms of AUC, Precision, Recall, and F1 Score. This work provides theoretical and practical insights into risk modeling for SMEs and offers a data-driven tool to support financial regulators and investors.
【6】Complex non-backtracking matrix for directed graphs
标题:有向图的复非回溯矩阵
链接:https://arxiv.org/abs/2507.12503
作者:ndo, Hideitsu Hino
摘要:图表示矩阵是图数据分析的重要工具。最近,埃尔米特邻接矩阵被提出用于研究有向图结构。以往的研究表明,这些矩阵可以提取对聚类有价值的信息。在本文中,我们提出了复非回溯矩阵,它融合了埃尔米特邻接矩阵和非回溯矩阵的性质。该矩阵与无向图的非回溯矩阵具有相似的性质。我们揭示了复非回溯矩阵与埃尔米特邻接矩阵之间的关系。此外,我们还给出了有趣的见解:这种矩阵表示包含聚类信息,对稀疏有向图尤其如此。
摘要:Graph representation matrices are essential tools in graph data analysis. Recently, Hermitian adjacency matrices have been proposed to investigate directed graph structures. Previous studies have demonstrated that these matrices can extract valuable information for clustering. In this paper, we propose the complex non-backtracking matrix that integrates the properties of the Hermitian adjacency matrix and the non-backtracking matrix. The proposed matrix has properties similar to those of the non-backtracking matrix of undirected graphs. We reveal relationships between the complex non-backtracking matrix and the Hermitian adjacency matrix. Also, we provide intriguing insights that this matrix representation holds cluster information, particularly for sparse directed graphs.
Transformer(6篇)
【1】Training Transformers with Enforced Lipschitz Constants
标题:用强制Lipschitz常数训练Transformer
链接:https://arxiv.org/abs/2507.13338
作者:house, R. Preston Hess, Franz Cesista, Andrii Zahorodnii, Jeremy Bernstein, Phillip Isola
摘要:神经网络通常对输入和权重扰动高度敏感。这种敏感性与多种病态现象有关,例如易受对抗样本攻击、训练发散和过拟合。为了解决这些问题,过去的研究尝试完全用Lipschitz组件构建神经网络。然而,这些技术尚未成熟到能让研究人员训练出在初始化之后仍强制保持Lipschitz证书的现代架构(例如Transformer)。为了探索这一空白,我们首先开发了新的、计算高效的工具来维护范数约束的权重矩阵,并对其进行基准测试。应用这些工具,我们能够训练在整个训练过程中都强制执行Lipschitz界的Transformer模型。我们发现优化器动态很重要:从AdamW切换到Muon可以改进标准方法(权重衰减和谱归一化),使模型在更低的Lipschitz界下达到相同的性能。受Muon的更新具有固定谱范数的启发,我们协同设计了一种权重约束方法,改善了MLP和2M参数Transformer上Lipschitz界与性能之间的权衡。我们的2-Lipschitz Transformer在莎士比亚文本上的验证准确率达到60%。扩展到145M参数后,我们的10-Lipschitz Transformer在互联网文本上达到21%的准确率。然而,为了匹配NanoGPT基线39.4%的验证准确率,我们的Lipschitz上界增加到10^264。尽管如此,我们的Lipschitz Transformer在训练时无需层归一化、QK归一化和logit tanh软上限等稳定性措施。
摘要:Neural networks are often highly sensitive to input and weight perturbations. This sensitivity has been linked to pathologies such as vulnerability to adversarial examples, divergent training, and overfitting. To combat these problems, past research has looked at building neural networks entirely from Lipschitz components. However, these techniques have not matured to the point where researchers have trained a modern architecture such as a transformer with a Lipschitz certificate enforced beyond initialization. To explore this gap, we begin by developing and benchmarking novel, computationally-efficient tools for maintaining norm-constrained weight matrices. Applying these tools, we are able to train transformer models with Lipschitz bounds enforced throughout training. We find that optimizer dynamics matter: switching from AdamW to Muon improves standard methods -- weight decay and spectral normalization -- allowing models to reach equal performance with a lower Lipschitz bound. Inspired by Muon's update having a fixed spectral norm, we co-design a weight constraint method that improves the Lipschitz vs. performance tradeoff on MLPs and 2M parameter transformers. Our 2-Lipschitz transformer on Shakespeare text reaches validation accuracy 60%. Scaling to 145M parameters, our 10-Lipschitz transformer reaches 21% accuracy on internet text. However, to match the NanoGPT baseline validation accuracy of 39.4%, our Lipschitz upper bound increases to 10^264. Nonetheless, our Lipschitz transformers train without stability measures such as layer norm, QK norm, and logit tanh softcapping.
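Editor's sketch: spectral normalization, one of the standard methods the abstract names, rescales a weight matrix so that its largest singular value stays under a target. The example below estimates that value by power iteration and rescales when it exceeds `max_norm`. This is the textbook technique, not the paper's own weight-constraint method, and the function names, iteration count, and seeded random start vector are assumptions.

```python
import math
import random


def matvec(W, v):
    """Compute W v for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def mat_t_vec(W, u):
    """Compute W^T u."""
    return [sum(W[i][j] * u[i] for i in range(len(W)))
            for j in range(len(W[0]))]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def spectral_norm(W, iters=100, seed=0):
    """Estimate the largest singular value of W by power iteration on W^T W.

    Power iteration approaches the true value from below, so a certified
    Lipschitz bound would need a safety margin or an exact method.
    """
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(len(W[0]))]
    for _ in range(iters):
        v = mat_t_vec(W, matvec(W, v))
        n = norm(v)
        if n == 0.0:
            return 0.0
        v = [x / n for x in v]
    return norm(matvec(W, v))


def spectral_normalize(W, max_norm=1.0):
    """Rescale W so that its estimated spectral norm is at most `max_norm`."""
    s = spectral_norm(W)
    if s <= max_norm:
        return [row[:] for row in W]
    scale = max_norm / s
    return [[w * scale for w in row] for row in W]


W = [[3.0, 0.0], [0.0, 1.0]]
W_hat = spectral_normalize(W, max_norm=1.0)
```

For a stack of linear maps with 1-Lipschitz activations, the product of the per-layer spectral norms upper-bounds the network's Lipschitz constant, which is why constraining each weight matrix yields a bound like the "2-Lipschitz" and "10-Lipschitz" figures quoted in the abstract.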
【2】DASViT: Differentiable Architecture Search for Vision Transformer
标题:DASViT:面向视觉Transformer的可微架构搜索
链接:https://arxiv.org/abs/2507.13079
作者:u, Ferrante Neri, Zhenhua Feng
备注:Accepted to the International Joint Conference on Neural Networks (IJCNN) 2025
摘要:设计有效的神经网络是深度学习的基石,神经架构搜索(NAS)已成为自动化这一过程的强大工具。在现有的NAS方法中,可微架构搜索(DARTS)因其效率和易用性而备受关注,并催生了许多后续进展。自Vision Transformer(ViT)兴起以来,研究人员已将NAS用于探索ViT架构,通常侧重于宏观层面的搜索空间,并依赖进化算法等离散方法。虽然这些方法保证了可靠性,但它们难以发现创新的架构设计,需要大量计算资源,并且十分耗时。为了解决这些局限,我们提出了面向Vision Transformer的可微架构搜索(DASViT),它填补了ViT可微搜索方面的空白,并发现了新颖的设计。实验表明,DASViT得到的架构突破了传统的Transformer编码器设计,在多个数据集上优于ViT-B/16,并以更少的参数和FLOPs实现了更高的效率。
摘要:Designing effective neural networks is a cornerstone of deep learning, and Neural Architecture Search (NAS) has emerged as a powerful tool for automating this process. Among the existing NAS approaches, Differentiable Architecture Search (DARTS) has gained prominence for its efficiency and ease of use, inspiring numerous advancements. Since the rise of Vision Transformers (ViT), researchers have applied NAS to explore ViT architectures, often focusing on macro-level search spaces and relying on discrete methods like evolutionary algorithms. While these methods ensure reliability, they face challenges in discovering innovative architectural designs, demand extensive computational resources, and are time-intensive. To address these limitations, we introduce Differentiable Architecture Search for Vision Transformer (DASViT), which bridges the gap in differentiable search for ViTs and uncovers novel designs. Experiments show that DASViT delivers architectures that break traditional Transformer encoder designs, outperform ViT-B/16 on multiple datasets, and achieve superior efficiency with fewer parameters and FLOPs.
【3】The Power of Architecture: Deep Dive into Transformer Architectures for Long-Term Time Series Forecasting
标题:架构的力量:深入研究长期时间序列预测的Transformer架构
链接:https://arxiv.org/abs/2507.13043
作者:n, Mouxiang Chen, Han Fu, Xiaoxue Ren, Xiaoyun Joy Wang, Jianling Sun, Zhuo Li, Chenghao Liu
备注:15 pages, 6 figures
摘要:基于transformer的模型最近在长期时间序列预测(LTSF)中占据主导地位,但其架构的变化,如仅编码器,编码器-解码器和仅解码器设计,提出了一个关键问题:什么Transformer架构最适合LTSF任务?然而,现有的模型通常与各种特定于时间序列的设计紧密耦合,这使得很难隔离架构本身的影响。为了解决这个问题,我们提出了一个新的分类,解开这些设计,使更清晰和更统一的比较Transformer架构。我们的分类法考虑的关键方面,如注意力机制,预测聚合,预测范式和规范化层。通过大量的实验,我们发现了几个关键的见解:双向注意力与联合注意力是最有效的;更完整的预测聚合提高了性能;直接映射范式优于自回归方法。此外,我们的组合模型,利用最佳的架构选择,始终优于几个现有的模型,加强了我们的结论的有效性。我们希望这些研究结果提供有价值的指导未来的研究Transformer架构设计在LTSF。我们的代码可在https://github.com/HALF111/TSF_architecture上获得。
摘要:Transformer-based models have recently become dominant in Long-term Time Series Forecasting (LTSF), yet the variations in their architecture, such as encoder-only, encoder-decoder, and decoder-only designs, raise a crucial question: What Transformer architecture works best for LTSF tasks? However, existing models are often tightly coupled with various time-series-specific designs, making it difficult to isolate the impact of the architecture itself. To address this, we propose a novel taxonomy that disentangles these designs, enabling clearer and more unified comparisons of Transformer architectures. Our taxonomy considers key aspects such as attention mechanisms, forecasting aggregations, forecasting paradigms, and normalization layers. Through extensive experiments, we uncover several key insights: bi-directional attention with joint-attention is most effective; more complete forecasting aggregation improves performance; and the direct-mapping paradigm outperforms autoregressive approaches. Furthermore, our combined model, utilizing optimal architectural choices, consistently outperforms several existing models, reinforcing the validity of our conclusions. We hope these findings offer valuable guidance for future research on Transformer architectural designs in LTSF. Our code is available at https://github.com/HALF111/TSF_architecture.
【4】Fremer: Lightweight and Effective Frequency Transformer for Workload Forecasting in Cloud Services
标题:Fremer:用于云服务工作负载预测的轻量级且有效的频率Transformer
链接:https://arxiv.org/abs/2507.12908
作者:hen, Hengyu Ye, Fuxin Jiang, Xiao He, Tieying Zhang, Jianjun Chen, Xiaofeng Gao
备注:12 pages, 11 figures
摘要:工作负载预测在自动扩展和调度等云服务应用中至关重要,对运营效率有着深远的影响。尽管基于Transformer的预测模型在一般任务中取得了显著的成功,但其计算效率往往无法满足大规模云环境的严格要求。考虑到大多数工作负载序列表现出复杂的周期性模式,在频域中应对这些挑战具有很大的优势。为此,我们提出了Fremer,一个高效且有效的深度预测模型。Fremer满足三个关键要求:它表现出卓越的效率,优于大多数基于Transformer的预测模型;它实现了卓越的准确性,超过了所有最先进的(SOTA)工作负载预测模型;它在多周期序列上表现出稳健的性能。此外,我们收集并开源了四个源自字节跳动云服务的高质量工作负载数据集,其中包括来自数千个计算实例的工作负载数据。在我们的专有数据集和公共基准上进行的大量实验表明,Fremer始终优于基线模型,相对于SOTA模型在MSE上平均改善5.5%,在MAE上平均改善4.7%,在SMAPE上平均改善8.6%,同时降低了参数规模和计算成本。此外,在基于Kubernetes的主动自动扩展测试中,Fremer将平均延迟改善了18.78%,并将资源消耗降低了2.35%,突出了其在现实世界应用中的实际功效。
摘要:Workload forecasting is pivotal in cloud service applications, such as auto-scaling and scheduling, with profound implications for operational efficiency. Although Transformer-based forecasting models have demonstrated remarkable success in general tasks, their computational efficiency often falls short of the stringent requirements in large-scale cloud environments. Given that most workload series exhibit complicated periodic patterns, addressing these challenges in the frequency domain offers substantial advantages. To this end, we propose Fremer, an efficient and effective deep forecasting model. Fremer fulfills three critical requirements: it demonstrates superior efficiency, outperforming most Transformer-based forecasting models; it achieves exceptional accuracy, surpassing all state-of-the-art (SOTA) models in workload forecasting; and it exhibits robust performance for multi-period series. Furthermore, we collect and open-source four high-quality workload datasets derived from ByteDance's cloud services, encompassing workload data from thousands of computing instances. Extensive experiments on both our proprietary datasets and public benchmarks demonstrate that Fremer consistently outperforms baseline models, achieving average improvements of 5.5% in MSE, 4.7% in MAE, and 8.6% in SMAPE over SOTA models, while simultaneously reducing parameter scale and computational costs. Additionally, in a proactive auto-scaling test based on Kubernetes, Fremer improves average latency by 18.78% and reduces resource consumption by 2.35%, underscoring its practical efficacy in real-world applications.
【5】Transformer-Based Person Identification via Wi-Fi CSI Amplitude and Phase Perturbations
标题:通过Wi-Fi CSI幅度和相位扰动进行基于Transformer的人员识别
链接:https://arxiv.org/abs/2507.12854
作者:ola, Andrea Bernardini, Francesco Danese, Mario Lezoche, Maurizio Mancini, Daniele Pannone, Amedeo Ranaldi
摘要:Wi-Fi传感作为一种非侵入性和隐私保护的替代方案,正在获得越来越多的发展势头,以取代基于视觉的人类身份识别系统。然而,通过无线信号的个人识别,特别是在没有用户运动的情况下,仍然在很大程度上未被探索。大多数现有的基于无线的方法依赖于运动模式,如步行步态,以提取生物特征线索。相比之下,我们提出了一种基于Transformer的方法,该方法利用受试者保持静止时记录的信道状态信息(CSI)来识别个体。CSI捕获由人体与无线电信号之间的独特相互作用引起的细粒度幅度和相位失真。为了支持评估,我们介绍了在受控室内环境中使用ESP32设备获取的数据集,其中包括在多个方向上观察到的六名参与者。量身定制的预处理管道,包括离群值去除、平滑和相位校准,可提高信号质量。我们的双分支Transformer架构分别处理幅度和相位模态,并实现了99.82%的分类精度,优于卷积和多层感知器基线。这些结果证明了CSI扰动的辨别潜力,突出了它们以一致的方式编码生物特征的能力。这些结果进一步证实了在现实世界中使用低成本商品Wi-Fi硬件进行被动、无设备人员识别的可行性。
摘要:Wi-Fi sensing is gaining momentum as a non-intrusive and privacy-preserving alternative to vision-based systems for human identification. However, person identification through wireless signals, particularly without user motion, remains largely unexplored. Most prior wireless-based approaches rely on movement patterns, such as walking gait, to extract biometric cues. In contrast, we propose a transformer-based method that identifies individuals from Channel State Information (CSI) recorded while the subject remains stationary. CSI captures fine-grained amplitude and phase distortions induced by the unique interaction between the human body and the radio signal. To support evaluation, we introduce a dataset acquired with ESP32 devices in a controlled indoor environment, featuring six participants observed across multiple orientations. A tailored preprocessing pipeline, including outlier removal, smoothing, and phase calibration, enhances signal quality. Our dual-branch transformer architecture processes amplitude and phase modalities separately and achieves 99.82% classification accuracy, outperforming convolutional and multilayer perceptron baselines. These results demonstrate the discriminative potential of CSI perturbations, highlighting their capacity to encode biometric traits in a consistent manner. They further confirm the viability of passive, device-free person identification using low-cost commodity Wi-Fi hardware in real-world settings.
【6】Compact Vision Transformer by Reduction of Kernel Complexity
标题:通过降低核复杂度实现紧凑的视觉Transformer
链接:https://arxiv.org/abs/2507.12780
作者:Wang, Yingzhen Yang
摘要:自注意力和Transformer架构已经成为现代深度学习的基础组件。最近的工作已经将Transformer块集成到计算机视觉的紧凑神经架构中,从而产生了各种高效的Vision Transformer。在这项工作中,我们提出了核复杂度降低的Transformer,即KCR-Transformer,这是一个配备可微通道选择的紧凑Transformer块,并以一个新颖且紧致的理论泛化界为指导。KCR-Transformer在Transformer块的MLP层中执行输入/输出通道选择,以降低计算成本。此外,我们提供了严格的理论分析,为配备KCR-Transformer块的网络建立了紧的泛化界。利用这些强有力的理论结果,KCR-Transformer的通道剪枝以泛化感知的方式进行,确保所得网络保留可证明的小泛化误差。我们的KCR-Transformer与许多流行的紧凑型Transformer网络兼容,如ViT和Swin,它降低了Vision Transformer的FLOPs,同时保持甚至提高了预测精度。在实验中,我们用KCR-Transformer块替换了Vision Transformer中的所有Transformer块,从而得到了具有不同骨干的KCR-Transformer网络。由此产生的KCR-Transformer在各种计算机视觉任务上实现了卓越的性能,以更少的FLOPs和参数取得了比原始模型更好的性能。
摘要:Self-attention and transformer architectures have become foundational components in modern deep learning. Recent efforts have integrated transformer blocks into compact neural architectures for computer vision, giving rise to various efficient vision transformers. In this work, we introduce Transformer with Kernel Complexity Reduction, or KCR-Transformer, a compact transformer block equipped with differentiable channel selection, guided by a novel and sharp theoretical generalization bound. KCR-Transformer performs input/output channel selection in the MLP layers of transformer blocks to reduce the computational cost. Furthermore, we provide a rigorous theoretical analysis establishing a tight generalization bound for networks equipped with KCR-Transformer blocks. Leveraging such strong theoretical results, the channel pruning by KCR-Transformer is conducted in a generalization-aware manner, ensuring that the resulting network retains a provably small generalization error. Our KCR-Transformer is compatible with many popular and compact transformer networks, such as ViT and Swin, and it reduces the FLOPs of the vision transformers while maintaining or even improving the prediction accuracy. In the experiments, we replace all the transformer blocks in the vision transformers with KCR-Transformer blocks, leading to KCR-Transformer networks with different backbones. The resulting KCR-Transformers achieve superior performance on various computer vision tasks, outperforming the original models with even fewer FLOPs and parameters.
GAN|对抗|攻击|生成相关(3篇)
【1】SHIELD: A Secure and Highly Enhanced Integrated Learning for Robust Deepfake Detection against Adversarial Attacks
标题:SHIELD:一种安全且高度增强的集成学习,用于针对对抗性攻击的稳健Deepfake检测
链接:https://arxiv.org/abs/2507.13170
作者:in, Awais Khan, Muhammad Umar Farooq, Khalid Malik
摘要:音频在说话人验证、支持语音的智能设备和音频会议等应用中起着至关重要的作用。然而,音频操纵,如deepfakes,通过传播错误信息带来了重大风险。我们的实证分析表明,现有的检测deepfake音频的方法往往容易受到反取证(AF)攻击,特别是那些使用生成对抗网络的攻击。在本文中,我们提出了一种新的协作学习方法SHIELD来防御生成式AF攻击。为了暴露AF签名,我们集成了一个辅助生成模型,称为防御(DF)生成模型,它通过结合输入和输出来促进协作学习。此外,我们设计了一个三元组模型来捕获真实和AF攻击音频与真实生成和攻击生成音频的相关性,使用辅助生成模型。所提出的SHIELD增强了对生成AF攻击的防御,并在各种生成模型中实现了鲁棒性能。对于三种不同的生成模型,所提出的AF将ASVspoof 2019的平均检测准确率从95.49%降至59.77%,将In-the-Wild的平均检测准确率从99.44%降至38.45%,将HalfTruth的平均检测准确率从98.41%降至51.18%。所提出的SHIELD机制对AF攻击具有鲁棒性,并且在ASVspoof 2019,In-the-Wild和HalfTruth数据集的匹配设置中分别实现了98.13%,98.58%和99.57%的平均准确率,以及98.78%,98.62%和98.85%的不匹配设置。
摘要:Audio plays a crucial role in applications like speaker verification, voice-enabled smart devices, and audio conferencing. However, audio manipulations, such as deepfakes, pose significant risks by enabling the spread of misinformation. Our empirical analysis reveals that existing methods for detecting deepfake audio are often vulnerable to anti-forensic (AF) attacks, particularly those attacked using generative adversarial networks. In this article, we propose a novel collaborative learning method called SHIELD to defend against generative AF attacks. To expose AF signatures, we integrate an auxiliary generative model, called the defense (DF) generative model, which facilitates collaborative learning by combining input and output. Furthermore, we design a triplet model to capture correlations for real and AF attacked audios with real-generated and attacked-generated audios using auxiliary generative models. The proposed SHIELD strengthens the defense against generative AF attacks and achieves robust performance across various generative models. The proposed AF significantly reduces the average detection accuracy from 95.49% to 59.77% for ASVspoof2019, from 99.44% to 38.45% for In-the-Wild, and from 98.41% to 51.18% for HalfTruth for three different generative models. The proposed SHIELD mechanism is robust against AF attacks and achieves an average accuracy of 98.13%, 98.58%, and 99.57% in match, and 98.78%, 98.62%, and 98.85% in mismatch settings for the ASVspoof2019, In-the-Wild, and HalfTruth datasets, respectively.
【2】Bridging the Gap: Leveraging Retrieval-Augmented Generation to Better Understand Public Concerns about Vaccines
标题:弥合差距:利用检索增强生成更好地了解公众对疫苗的担忧
链接:https://arxiv.org/abs/2507.12840
作者:Javed, Sedigh Khademi Habibabadi, Christopher Palmer, Hazel Clothier, Jim Buttery, Gerardo Luis Dimaguila
摘要:疫苗犹豫威胁着公共健康,导致疫苗接种被推迟或拒绝。社交媒体是了解公众关注的重要来源,而主题建模等传统方法往往难以捕捉细微的意见差别。大型语言模型(LLM)虽然经过问答训练,却经常遗漏时事和社区关注的问题。此外,LLM中的幻觉可能会损害公共卫生沟通。为了解决这些限制,我们使用检索增强生成技术开发了一个工具(VaxPulse Query Corner)。它可以回答有关各种在线平台上公众疫苗担忧的复杂查询,帮助公共卫生管理人员和利益相关者了解公众关注的问题,并实施有针对性的干预措施,以提高疫苗信心。通过分析35,103个Shingrix社交媒体帖子,它实现了回答的忠实度(0.96)和相关性(0.94)。
摘要:Vaccine hesitancy threatens public health, leading to delayed or rejected vaccines. Social media is a vital source for understanding public concerns, and traditional methods like topic modelling often struggle to capture nuanced opinions. Though trained for query answering, Large Language Models (LLMs) often miss current events and community concerns. Additionally, hallucinations in LLMs can compromise public health communication. To address these limitations, we developed a tool (VaxPulse Query Corner) using the Retrieval Augmented Generation technique. It addresses complex queries about public vaccine concerns on various online platforms, aiding public health administrators and stakeholders in understanding public concerns and implementing targeted interventions to boost vaccine confidence. Analysing 35,103 Shingrix social media posts, it achieved answer faithfulness (0.96) and relevance (0.94).
【3】World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving
标题:基于世界模型的端到端场景生成,用于自动驾驶中的事故预测
链接:https://arxiv.org/abs/2507.12762
作者:uan, Haicheng Liao, Chengyue Wang, Xingcheng Liu, Jiaxun Zhang, Zhenning Li
摘要:可靠地预测交通事故对于推进自动驾驶系统至关重要。然而,这一目标受到两个基本挑战的限制:缺乏多样化的高质量训练数据,以及由于环境干扰或传感器缺陷而经常缺乏关键的对象级线索。为了解决这些问题,我们提出了一个综合的框架结合生成场景增强自适应时间推理。具体来说,我们开发了一个视频生成管道,该管道利用由域信息提示引导的世界模型来创建高分辨率,统计一致的驾驶场景,特别是丰富了边缘情况和复杂交互的覆盖范围。与此同时,我们构建了一个动态预测模型,通过加强图卷积和扩张时间算子编码时空关系,有效地解决了数据不完整性和瞬态视觉噪声。此外,我们还发布了一个新的基准数据集,旨在更好地捕捉各种现实驾驶风险。在公开和新发布的数据集上进行的大量实验证实,我们的框架提高了事故预测的准确性和提前期,为安全关键型自动驾驶应用中的当前数据和建模限制提供了一个强大的解决方案。
摘要:Reliable anticipation of traffic accidents is essential for advancing autonomous driving systems. However, this objective is limited by two fundamental challenges: the scarcity of diverse, high-quality training data and the frequent absence of crucial object-level cues due to environmental disruptions or sensor deficiencies. To tackle these issues, we propose a comprehensive framework combining generative scene augmentation with adaptive temporal reasoning. Specifically, we develop a video generation pipeline that utilizes a world model guided by domain-informed prompts to create high-resolution, statistically consistent driving scenarios, particularly enriching the coverage of edge cases and complex interactions. In parallel, we construct a dynamic prediction model that encodes spatio-temporal relationships through strengthened graph convolutions and dilated temporal operators, effectively addressing data incompleteness and transient visual noise. Furthermore, we release a new benchmark dataset designed to better capture diverse real-world driving risks. Extensive experiments on public and newly released datasets confirm that our framework enhances both the accuracy and lead time of accident anticipation, offering a robust solution to current data and modeling limitations in safety-critical autonomous driving applications.
半/弱/无/有监督|不确定性|主动学习(6篇)
【1】Uncertainty-Aware Cross-Modal Knowledge Distillation with Prototype Learning for Multimodal Brain-Computer Interfaces
标题:多模式脑机接口的不确定性感知跨模式知识提炼和原型学习
链接:https://arxiv.org/abs/2507.13092
作者: Jang, Hye-Bin Shin, Seong-Whan Lee
摘要:脑电图(EEG)是脑机接口(BCI)中用于认知状态监测的基本模式。然而,它对固有信号误差和人为标记误差非常敏感,这会导致标记噪声并最终降低模型性能。为了增强EEG学习,已经探索了多模态知识蒸馏(KD)以将知识从具有丰富表示的视觉模型转移到基于EEG的模型。然而,KD面临两个关键挑战:模态差距和软标签错位。前者源于EEG和视觉特征空间的异质性,而后者源于标签不一致性,这些标签不一致性在地面真值标签和蒸馏目标之间产生差异。本文讨论了语义不确定性所造成的模糊的功能和弱定义的标签。我们提出了一个新的跨模态知识蒸馏框架,减轻模态和标签的不一致性。它通过一个基于原型的相似性模块对齐特征语义,并引入了一个特定于任务的蒸馏头来解决标签引起的监督不一致。实验结果表明,我们的方法提高了基于EEG的情感回归和分类性能,优于公共多模态数据集上的单峰和多模态基线。这些发现突出了我们的BCI应用框架的潜力。
摘要:Electroencephalography (EEG) is a fundamental modality for cognitive state monitoring in brain-computer interfaces (BCIs). However, it is highly susceptible to intrinsic signal errors and human-induced labeling errors, which lead to label noise and ultimately degrade model performance. To enhance EEG learning, multimodal knowledge distillation (KD) has been explored to transfer knowledge from visual models with rich representations to EEG-based models. Nevertheless, KD faces two key challenges: modality gap and soft label misalignment. The former arises from the heterogeneous nature of EEG and visual feature spaces, while the latter stems from label inconsistencies that create discrepancies between ground truth labels and distillation targets. This paper addresses semantic uncertainty caused by ambiguous features and weakly defined labels. We propose a novel cross-modal knowledge distillation framework that mitigates both modality and label inconsistencies. It aligns feature semantics through a prototype-based similarity module and introduces a task-specific distillation head to resolve label-induced inconsistency in supervision. Experimental results demonstrate that our approach improves EEG-based emotion regression and classification performance, outperforming both unimodal and multimodal baselines on a public multimodal dataset. These findings highlight the potential of our framework for BCI applications.
【2】Confidence-Filtered Relevance (CFR): An Interpretable and Uncertainty-Aware Machine Learning Framework for Naturalness Assessment in Satellite Imagery
标题:置信度过滤相关性(CFR):用于卫星图像自然度评估的可解释和不确定性感知机器学习框架
链接:https://arxiv.org/abs/2507.13034
作者:m, Ribana Roscher
摘要:自然保护区在生态平衡和生态系统服务方面发挥着至关重要的作用。使用卫星图像和机器学习对这些地区进行大规模监测是有希望的,但目前的方法往往缺乏可解释性和不确定性意识,并且没有解决不确定性如何影响自然性评估。相比之下,我们提出了置信度过滤相关性(CFR),这是一个以数据为中心的框架,它将LRP注意力推出与深度确定性不确定性(DDU)估计相结合,以分析模型不确定性如何影响相关性热图的可解释性。CFR基于不确定性阈值将数据集划分为子集,从而能够系统地分析不确定性如何塑造卫星图像中自然性的解释。应用于AnthroProtect数据集,CFR将更高的相关性分配给灌木丛,森林和湿地,与其他关于自然性评估的研究保持一致。此外,我们的分析表明,随着不确定性的增加,这些相关性热图的可解释性下降,它们的熵增加,表明选择性更低,属性更模糊。CFR提供了一种以数据为中心的方法,根据其相关的确定性评估卫星图像中模式与自然度的相关性。
摘要:Protected natural areas play a vital role in ecological balance and ecosystem services. Monitoring these regions at scale using satellite imagery and machine learning is promising, but current methods often lack interpretability and uncertainty-awareness, and do not address how uncertainty affects naturalness assessment. In contrast, we propose Confidence-Filtered Relevance (CFR), a data-centric framework that combines LRP Attention Rollout with Deep Deterministic Uncertainty (DDU) estimation to analyze how model uncertainty influences the interpretability of relevance heatmaps. CFR partitions the dataset into subsets based on uncertainty thresholds, enabling systematic analysis of how uncertainty shapes the explanations of naturalness in satellite imagery. Applied to the AnthroProtect dataset, CFR assigned higher relevance to shrublands, forests, and wetlands, aligning with other research on naturalness assessment. Moreover, our analysis shows that as uncertainty increases, the interpretability of these relevance heatmaps declines and their entropy grows, indicating less selective and more ambiguous attributions. CFR provides a data-centric approach to assess the relevance of patterns to naturalness in satellite imagery based on their associated certainty.
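The two analysis steps named in the abstract above (partitioning by uncertainty thresholds and measuring heatmap entropy) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function names, the cumulative "uncertainty <= threshold" partitioning rule, and the use of absolute relevance values before normalisation are not taken from the paper.

```python
import math

def heatmap_entropy(heatmap):
    """Shannon entropy (in nats) of a relevance heatmap.

    Assumption: absolute relevance values are normalised to sum to 1.
    """
    flat = [abs(v) for row in heatmap for v in row]
    total = sum(flat)
    if total == 0:
        return 0.0
    return -sum((v / total) * math.log(v / total) for v in flat if v > 0)

def partition_by_uncertainty(samples, thresholds):
    """Map each threshold to the samples whose uncertainty is at or below it.

    Assumption: subsets are cumulative; disjoint bins would work equally well.
    """
    return {t: [s for s in samples if s["uncertainty"] <= t] for t in thresholds}

# A sharply focused heatmap has low entropy, a uniform one has maximal entropy.
focused = [[1.0, 0.0], [0.0, 0.0]]
uniform = [[1.0, 1.0], [1.0, 1.0]]
samples = [{"id": i, "uncertainty": u} for i, u in enumerate([0.1, 0.5, 0.9])]
subsets = partition_by_uncertainty(samples, [0.2, 0.6])
```

Higher entropy corresponds to the "less selective and more ambiguous attributions" the abstract describes for uncertain samples.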
【3】Robust Explanations Through Uncertainty Decomposition: A Path to Trustworthier AI
标题:通过不确定性分解进行稳健的解释:通往更值得信赖的人工智能的道路
链接:https://arxiv.org/abs/2507.12913
作者:hu, Louenas Bounia, Vu Linh Nguyen, Sébastien Destercke, Arthur Hoarau
摘要:机器学习的最新进展强调了模型预测的透明度,特别是当使用越来越复杂的架构时,可解释性会降低。在本文中,我们提出利用预测的不确定性作为一种补充方法,以经典的可解释性方法。具体来说,我们区分任意(数据相关)和认知(模型相关)的不确定性,以指导选择适当的解释。认识不确定性作为不可靠解释的拒绝标准,本身就提供了对训练不足(一种新的解释形式)的洞察。偶然的不确定性告知特征重要性解释和反事实解释之间的选择。这利用了由不确定性量化和解纠缠驱动的可解释性方法框架。我们的实验证明了这种不确定性感知方法对传统机器学习和深度学习场景中解释的鲁棒性和可达性的影响。
摘要:Recent advancements in machine learning have emphasized the need for transparency in model predictions, particularly as interpretability diminishes when using increasingly complex architectures. In this paper, we propose leveraging prediction uncertainty as a complementary approach to classical explainability methods. Specifically, we distinguish between aleatoric (data-related) and epistemic (model-related) uncertainty to guide the selection of appropriate explanations. Epistemic uncertainty serves as a rejection criterion for unreliable explanations and, in itself, provides insight into insufficient training (a new form of explanation). Aleatoric uncertainty informs the choice between feature-importance explanations and counterfactual explanations. This leverages a framework of explainability methods driven by uncertainty quantification and disentanglement. Our experiments demonstrate the impact of this uncertainty-aware approach on the robustness and attainability of explanations in both traditional machine learning and deep learning scenarios.
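One common way to separate aleatoric from epistemic uncertainty is the entropy-based decomposition over an ensemble, sketched below. This is a standard technique and an assumption here: the abstract does not say which decomposition the authors use, and the function names are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

def decompose_uncertainty(member_probs):
    """Split predictive uncertainty for one input.

    member_probs: one class-probability vector per ensemble member.
    total     = entropy of the averaged prediction
    aleatoric = average entropy of the individual predictions
    epistemic = total - aleatoric (disagreement between members)
    """
    n = len(member_probs)
    k = len(member_probs[0])
    mean_p = [sum(p[c] for p in member_probs) / n for c in range(k)]
    total = entropy(mean_p)
    aleatoric = sum(entropy(p) for p in member_probs) / n
    return total, aleatoric, total - aleatoric

# Members agree on a coin flip: all uncertainty is aleatoric.
agree = decompose_uncertainty([[0.5, 0.5], [0.5, 0.5]])
# Members are confident but contradict each other: all uncertainty is epistemic.
disagree = decompose_uncertainty([[1.0, 0.0], [0.0, 1.0]])
```

Under the scheme in the abstract, a large epistemic term would flag an explanation as unreliable, while the aleatoric term would inform which explanation type to show.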
【4】Supervised Fine Tuning on Curated Data is Reinforcement Learning (and can be improved)
标题:对精选数据进行监督微调是强化学习(并且可以改进)
链接:https://arxiv.org/abs/2507.12856
作者:in, Jost Tobias Springenberg
备注:See project website for details and code at: this https URL
摘要:基于策展(或过滤)数据的行为克隆(BC)是大型语言模型的监督微调(SFT)的主要范例;以及控制策略的模仿学习。在这里,我们借鉴了这一成功策略与通过强化学习(RL)寻找最优策略的理论和实践之间的联系。在现有文献的基础上,我们澄清了SFT可以理解为在稀疏奖励设置中最大化RL目标的下限。支持其经常观察到的良好表现。从这个角度来看,我们意识到,对SFT的一个小修改会导致一个重要性加权的变体,它的行为更接近于使用RL进行训练,因为它:i)优化了RL目标的更严格约束,ii)与SFT相比,可以提高对策划数据的性能。我们将这种变体称为重要性加权监督微调(iw-SFT)。我们证明了它很容易实现,并且可以进一步推广到使用高质量评分数据的训练。由此产生的SFT变体与用于大型语言模型和连续控制任务中的训练策略的更先进的RL算法相比具有竞争力。例如,在AIME 2024数据集上达到66.7%。
摘要:Behavior Cloning (BC) on curated (or filtered) data is the predominant paradigm for supervised fine-tuning (SFT) of large language models, as well as for imitation learning of control policies. Here, we draw on a connection between this successful strategy and the theory and practice of finding optimal policies via Reinforcement Learning (RL). Building on existing literature, we clarify that SFT can be understood as maximizing a lower bound on the RL objective in a sparse reward setting, which supports its often observed good performance. From this viewpoint, we realize that a small modification to SFT leads to an importance weighted variant that behaves closer to training with RL as it: i) optimizes a tighter bound to the RL objective and, ii) can improve performance compared to SFT on curated data. We refer to this variant as importance weighted supervised fine-tuning (iw-SFT). We show that it is easy to implement and can be further generalized to training with quality scored data. The resulting SFT variants are competitive with more advanced RL algorithms for large language models and for training policies in continuous control tasks, for example achieving 66.7% on the AIME 2024 dataset.
【5】From Novelty to Imitation: Self-Distilled Rewards for Offline Reinforcement Learning
标题:从新颖到模仿:离线强化学习的自我提炼奖励
链接:https://arxiv.org/abs/2507.12815
作者:audhary, Laxmidhar Behera
摘要:离线强化学习(RL)旨在从静态数据集中学习有效的策略,而不需要进一步的代理-环境交互。然而,它的实际采用往往受到需要显式奖励注释的阻碍,这可能是昂贵的工程或难以追溯获得。为了解决这个问题,我们提出了ReLOAD(Reinforcement Learning with Offline Reward Annotation via Distillation),这是一种用于离线RL的新型奖励注释框架。与依赖于复杂对齐过程的现有方法不同,我们的方法采用随机网络蒸馏(RND),使用简单而有效的嵌入差异度量从专家演示中生成内在奖励。首先,我们训练一个预测器网络,以模仿一个固定的目标网络的嵌入基于专家的状态转换。之后,这些网络之间的预测误差将作为静态数据集中每次转换的奖励信号。该机制提供了结构化的奖励信号,而不需要手工制作的奖励注释。我们提供了一个正式的理论结构,提供了深入了解如何RND预测错误有效地作为内在的奖励区分专家一样的过渡。在D4RL基准测试上的实验表明,ReLOAD实现了强大的离线策略学习,并实现了与传统奖励注释方法相竞争的性能。
摘要:Offline Reinforcement Learning (RL) aims to learn effective policies from a static dataset without requiring further agent-environment interactions. However, its practical adoption is often hindered by the need for explicit reward annotations, which can be costly to engineer or difficult to obtain retrospectively. To address this, we propose ReLOAD (Reinforcement Learning with Offline Reward Annotation via Distillation), a novel reward annotation framework for offline RL. Unlike existing methods that depend on complex alignment procedures, our approach adapts Random Network Distillation (RND) to generate intrinsic rewards from expert demonstrations using a simple yet effective embedding discrepancy measure. First, we train a predictor network to mimic a fixed target network's embeddings based on expert state transitions. Later, the prediction error between these networks serves as a reward signal for each transition in the static dataset. This mechanism provides a structured reward signal without requiring handcrafted reward annotations. We provide a formal theoretical construct that offers insights into how RND prediction errors effectively serve as intrinsic rewards by distinguishing expert-like transitions. Experiments on the D4RL benchmark demonstrate that ReLOAD enables robust offline policy learning and achieves performance competitive with traditional reward-annotated methods.
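The two-stage procedure described above (fit a predictor to a fixed random target on expert transitions, then score every transition by prediction error) can be sketched in plain Python. This is a toy under stated assumptions, not the ReLOAD implementation: the single tanh layer, the linear predictor, the learning rate, and especially the sign convention (reward = negative error, so expert-like transitions score higher) are all illustrative choices.

```python
import math
import random

def make_target(in_dim, out_dim, rng):
    """Fixed, randomly initialised target network: a single tanh layer."""
    W = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]
    return lambda x: [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]

class LinearPredictor:
    """Trainable predictor; the last weight column acts as a bias."""
    def __init__(self, in_dim, out_dim):
        self.W = [[0.0] * (in_dim + 1) for _ in range(out_dim)]

    def __call__(self, x):
        xb = list(x) + [1.0]
        return [sum(w * xi for w, xi in zip(row, xb)) for row in self.W]

    def sgd_step(self, x, y, lr):
        """One stochastic gradient step on the squared error for (x, y)."""
        xb = list(x) + [1.0]
        pred = self(x)
        for row, p, t in zip(self.W, pred, y):
            err = p - t
            for j, xj in enumerate(xb):
                row[j] -= lr * err * xj

def rnd_error(target, predictor, x):
    """Squared embedding discrepancy between predictor and target."""
    return sum((p - t) ** 2 for p, t in zip(predictor(x), target(x)))

def annotate_rewards(target, predictor, transitions):
    # Assumption: reward is the negative prediction error.
    return [-rnd_error(target, predictor, x) for x in transitions]

rng = random.Random(0)
target = make_target(4, 3, rng)
predictor = LinearPredictor(4, 3)
# Hypothetical expert transitions, each a concatenated (state, next_state) vector.
expert = [[rng.uniform(0, 0.5) for _ in range(4)] for _ in range(50)]

error_before = sum(rnd_error(target, predictor, x) for x in expert) / len(expert)
for _ in range(200):
    for x in expert:
        predictor.sgd_step(x, target(x), lr=0.05)
error_after = sum(rnd_error(target, predictor, x) for x in expert) / len(expert)
rewards = annotate_rewards(target, predictor, expert)
```

After training, the same `annotate_rewards` call would be applied to every transition in the static dataset, not only to the expert ones.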
【6】Unsupervised Ground Metric Learning
标题:无监督地面指标学习
链接:https://arxiv.org/abs/2507.13094
作者:fenberg, Jonas Bresch, Oleh Melnyk, Gabriele Steidl
备注:10 figures, 1 table
摘要:在没有标记样本的情况下进行数据分类仍然是一个具有挑战性的问题。它通常取决于适当选择的特征之间的距离,这是度量学习中的一个主题。最近,Huizing,Cantini和Peyré提出了同时学习样本和数据集特征之间的最优传输(OT)成本矩阵。这导致了寻找将成本矩阵映射到OT距离的特定非线性函数的正特征向量的任务。考虑到这个基本思想,我们考虑了无监督度量学习的算法和建模部分。首先,我们研究适当的算法和它们的收敛性。特别是,我们建议使用随机函数迭代算法,并证明它在我们的设置下线性收敛,尽管我们的算子并非paracontractive(而此前的收敛性结果要求这一性质)。第二,我们问一个自然的问题,OT距离是否可以被其他距离取代。我们展示了类马氏距离如何适合我们的考虑。此外,我们研究了一种通过图拉普拉斯算子的方法。与前面的设置相反,这里我们只需要处理关于所求矩阵的线性函数,因此可以应用线性代数的简单算法。
摘要:Data classification without access to labeled samples remains a challenging problem. It usually depends on an appropriately chosen distance between features, a topic addressed in metric learning. Recently, Huizing, Cantini and Peyré proposed to simultaneously learn optimal transport (OT) cost matrices between samples and features of the dataset. This leads to the task of finding positive eigenvectors of a certain nonlinear function that maps cost matrices to OT distances. Having this basic idea in mind, we consider both the algorithmic and the modeling part of unsupervised metric learning. First, we examine appropriate algorithms and their convergence. In particular, we propose to use the stochastic random function iteration algorithm and prove that it converges linearly for our setting, although our operators are not paracontractive, as was previously required for convergence. Second, we ask the natural question of whether the OT distance can be replaced by other distances. We show how Mahalanobis-like distances fit into our considerations. Further, we examine an approach via graph Laplacians. In contrast to the previous settings, here we only have to deal with functions that are linear in the sought matrices, so that simple algorithms from linear algebra can be applied.
Transfer | Zero/Few/One-Shot | Adaptation (5 papers)
【1】Assessing adaptive world models in machines with novel games
标题:使用新颖游戏在机器中评估自适应世界模型
链接:https://arxiv.org/abs/2507.12821
作者:g, Katherine M. Collins, Prafull Sharma, Cedric Colas, Kaiya Ivy Zhao, Adrian Weller, Zenna Tavares, Phillip Isola, Samuel J. Gershman, Jacob D. Andreas, Thomas L. Griffiths, Francois Chollet, Kelsey R. Allen, Joshua B. Tenenbaum
备注:17 pages, 4 figures
摘要:人类的智力在新的和不熟悉的环境中表现出快速适应和有效解决问题的能力。我们认为,这种深刻的适应性是从根本上联系到有效的建设和完善的内部表示的环境,通常被称为世界模型,我们把这种适应机制作为世界模型的感应。然而,目前对人工智能(AI)世界模型的理解和评估仍然很狭隘,通常集中在从大量数据库的训练中学习到的静态表示,而不是模型通过在新环境中的交互和探索来学习这些表示的效率和功效。在这个视角中,我们提供了一个世界模型归纳的观点,借鉴了几十年来认知科学关于人类如何有效学习和适应的研究;然后我们呼吁建立一个新的评估框架来评估人工智能中的适应性世界模型。具体地说,我们提出了一个新的基准测试模式的基础上套件的精心设计的游戏与真正的,深刻的和不断刷新的新颖性,在底层的游戏结构-我们把这种游戏作为小说游戏。我们详细说明了关键的必要条件,构建这些游戏,并提出适当的指标,明确挑战和评估代理的能力,快速世界模型的感应。我们希望这个新的评估框架将激发未来对人工智能世界模型的评估工作,并为开发能够像人类一样快速适应和强大泛化的人工智能系统迈出关键一步-这是人工智能的关键组成部分。
摘要:Human intelligence exhibits a remarkable capacity for rapid adaptation and effective problem-solving in novel and unfamiliar contexts. We argue that this profound adaptability is fundamentally linked to the efficient construction and refinement of internal representations of the environment, commonly referred to as world models, and we refer to this adaptation mechanism as world model induction. However, current understanding and evaluation of world models in artificial intelligence (AI) remains narrow, often focusing on static representations learned from training on massive corpora of data, instead of the efficiency and efficacy of models in learning these representations through interaction and exploration within a novel environment. In this Perspective, we provide a view of world model induction drawing on decades of research in cognitive science on how humans learn and adapt so efficiently; we then call for a new evaluation framework for assessing adaptive world models in AI. Concretely, we propose a new benchmarking paradigm based on suites of carefully designed games with genuine, deep and continually refreshing novelty in the underlying game structures; we refer to games of this kind as novel games. We detail key desiderata for constructing these games and propose appropriate metrics to explicitly challenge and evaluate the agent's ability for rapid world model induction. We hope that this new evaluation framework will inspire future evaluation efforts on world models in AI and provide a crucial step towards developing AI systems capable of human-like rapid adaptation and robust generalization, a critical component of artificial general intelligence.
【2】Improving physics-informed neural network extrapolation via transfer learning and adaptive activation functions
标题:通过迁移学习和自适应激活函数改进基于物理的神经网络外推
链接:https://arxiv.org/abs/2507.12659
作者:s Papastathopoulos-Katsaros, Alexandra Stavrianidi, Zhandong Liu
备注:18 pages, 16 figures, 7 tables. Accepted to ICANN 2025
摘要:物理信息神经网络(PINN)是一种深度学习模型,它将系统的物理定律纳入学习过程,使其非常适合解决复杂的科学和工程问题。最近,PINN作为将物理原理与数据驱动建模相结合以提高预测精度的强大框架而受到广泛关注。尽管他们的成功,但是,PINN往往表现出差的外推性能的训练域之外,是高度敏感的激活函数(AF)的选择。在本文中,我们介绍了一种迁移学习(TL)方法,以提高PINN的外推能力。我们的方法在扩展的训练域中应用迁移学习(TL),只使用少量精心选择的搭配点。此外,我们提出了一种自适应AF,它采用标准AF的线性组合的形式,这提高了模型的鲁棒性和准确性。通过一系列的实验,我们证明,我们的方法实现了平均减少40%的相对L2误差和平均减少50%的平均绝对误差的外推域,所有没有显着增加的计算成本。该代码可在https://github.com/LiuzLab/PINN-extrapolation上获得。
摘要:Physics-Informed Neural Networks (PINNs) are deep learning models that incorporate the governing physical laws of a system into the learning process, making them well-suited for solving complex scientific and engineering problems. Recently, PINNs have gained widespread attention as a powerful framework for combining physical principles with data-driven modeling to improve prediction accuracy. Despite their successes, however, PINNs often exhibit poor extrapolation performance outside the training domain and are highly sensitive to the choice of activation functions (AFs). In this paper, we introduce a transfer learning (TL) method to improve the extrapolation capability of PINNs. Our approach applies transfer learning (TL) within an extended training domain, using only a small number of carefully selected collocation points. Additionally, we propose an adaptive AF that takes the form of a linear combination of standard AFs, which improves both the robustness and accuracy of the model. Through a series of experiments, we demonstrate that our method achieves an average of 40% reduction in relative L2 error and an average of 50% reduction in mean absolute error in the extrapolation domain, all without a significant increase in computational cost. The code is available at https://github.com/LiuzLab/PINN-extrapolation .
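The adaptive activation described above is a linear combination of standard activation functions. A minimal sketch follows, together with the relative L2 error used to report extrapolation quality. The particular set of base functions (tanh, sin, ReLU) and the coefficient names are assumptions for illustration; the abstract does not list which functions are combined, and in a real PINN the coefficients would be trained alongside the network weights.

```python
import math

# Assumed set of base activation functions; the paper's choice may differ.
STANDARD_AFS = {
    "tanh": math.tanh,
    "sin": math.sin,
    "relu": lambda x: max(0.0, x),
}

def adaptive_activation(x, weights):
    """Linear combination of standard activations with per-function coefficients."""
    return sum(weights[name] * f(x) for name, f in STANDARD_AFS.items())

def relative_l2_error(pred, true):
    """||pred - true||_2 / ||true||_2 over a set of evaluation points."""
    num = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)))
    den = math.sqrt(sum(t ** 2 for t in true))
    return num / den

# With all weight on tanh the adaptive function reduces to plain tanh.
only_tanh = {"tanh": 1.0, "sin": 0.0, "relu": 0.0}
mixed = {"tanh": 0.5, "sin": 0.3, "relu": 0.2}
y_plain = adaptive_activation(0.5, only_tanh)
y_mixed = adaptive_activation(0.5, mixed)
```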
【3】Best Practices for Large-Scale, Pixel-Wise Crop Mapping and Transfer Learning Workflows
标题:大规模、逐像素作物制图和迁移学习工作流的最佳实践
链接:https://arxiv.org/abs/2507.12590
作者:, Tao Liu, Sean Alexander Woznicki, Miljana Marković, Oskar Marko, Molly Sears
备注:A review article. 41 pages, 22 figures. Preprint
摘要:作物制图涉及利用主要来自遥感图像的空间数据对作物类型进行识别和分类。这项研究首次全面回顾了大规模的像素作物映射工作流程,包括传统的监督方法和新兴的迁移学习方法。为了确定最佳的监督作物映射工作流程,我们进行了系统的实验,比较了六种广泛采用的基于卫星图像的预处理方法,以及十一种监督像素分类模型。此外,我们评估了不同训练样本量和变量组合的协同影响。此外,我们确定了不同幅度的域转移的最佳迁移学习技术。在五个不同的农业地点进行了最佳方法的评估。大地卫星8号是主要的卫星数据来源。标签来自CDL可信像素和实地调查。 我们的研究结果揭示了三个关键见解。首先,细尺度区间预处理与Transformer模型相结合,始终为监督和可转移工作流提供最佳性能。RF在传统的监督学习和直接转移到类似领域中提供了快速的培训和竞争力。其次,迁移学习技术增强了工作流程的适应性,UDA对同质作物类有效,而微调在不同的场景中仍然很强大。最后,工作流程的选择在很大程度上取决于标记样本的可用性。有了足够的样本量,监督训练通常会提供更准确和更普遍的结果。在一定阈值以下,匹配域转移水平的迁移学习是实现作物映射的可行替代方案。存储库:大规模像素明智裁剪映射和传输学习工作流的最佳实践
摘要:Crop mapping involves identifying and classifying crop types using spatial data, primarily derived from remote sensing imagery. This study presents the first comprehensive review of large-scale, pixel-wise crop mapping workflows, encompassing both conventional supervised methods and emerging transfer learning approaches. To identify the optimal supervised crop mapping workflows, we conducted systematic experiments, comparing six widely adopted satellite image-based preprocessing methods, alongside eleven supervised pixel-wise classification models. Additionally, we assessed the synergistic impact of varied training sample sizes and variable combinations. Moreover, we identified optimal transfer learning techniques for different magnitudes of domain shift. The evaluation of best methods was conducted across five diverse agricultural sites. Landsat 8 served as the primary satellite data source. Labels come from CDL trusted pixels and field surveys. Our findings reveal three key insights. First, fine-scale interval preprocessing paired with Transformer models consistently delivered optimal performance for both supervised and transferable workflows. RF offered rapid training and competitive performance in conventional supervised learning and direct transfer to similar domains. Second, transfer learning techniques enhanced workflow adaptability, with UDA being effective for homogeneous crop classes while fine-tuning remains robust across diverse scenarios. Finally, workflow choice depends heavily on the availability of labeled samples. With a sufficient sample size, supervised training typically delivers more accurate and generalizable results. Below a certain threshold, transfer learning that matches the level of domain shift is a viable alternative to achieve crop mapping. Repository: Best-Practices-for-Large-Scale-Pixel-Wise-Crop-Mapping-and-Transfer-Learning-Workflows
【4】IncA-DES: An incremental and adaptive dynamic ensemble selection approach using online K-d tree neighborhood search for data streams with concept drift
标题:IncA-DES:使用在线K-d树邻居搜索来搜索具有概念漂移的数据流的增量和自适应动态集成选择方法
链接:https://arxiv.org/abs/2507.12573
作者:. L. Barboza, Paulo R. Lisboa de Almeida, Alceu de Souza Britto Jr., Robert Sabourin, Rafael M. O. Cruz
备注:Preprint of article published to Information Fusion
摘要:数据流带来了在基于批处理的ML中通常不会遇到的挑战。其中之一是概念漂移,其特征是数据分布随时间的变化。在众多的分类器融合方法中,分类器的融合方法已经取得了很好的效果,并受到越来越多的关注。DS方法,由于系综是基于实例的,似乎是一个有效的选择漂移的情况下。然而,必须注意使这些方法适应概念漂移。必须进行培训才能培养本地专家,并且随着数据的不断到达,常用的社区搜索DS可能会变得令人望而却步。在这项工作中,我们提出了IncA-DES,它采用了一种训练策略,促进了本地专家的生成,假设特征空间的不同区域随着时间的推移变得可用。此外,概念漂移检测器的融合支持信息的维护和对新概念的适应。还采用了一个基于分类的过滤器,以避免使用DS方法时,有一个共识,在附近,我们认为每个DS方法应该采用的策略,因为它被证明使它们更适用,更快。此外,为了减少kNN的处理时间,我们提出了一种在线K-d树算法,它可以快速删除实例,而不会变得不一致,并处理可能发生在数据流中的不平衡问题。实验结果表明,该框架得到了最好的平均准确度相比,考虑到不同级别的标签可用性的七个国家的最先进的方法,并提出了最准确的方法之间的处理时间较小。此外,与在线K-d树的融合改善了处理时间,精度损失可以忽略不计。我们已经在一个在线存储库中提供了我们的框架。
摘要:Data streams pose challenges not usually encountered in batch-based ML. One of them is concept drift, which is characterized by the change in data distribution over time. Among many approaches explored in the literature, the fusion of classifiers has been showing good results and is getting growing attention. DS methods, due to the ensemble being instance-based, seem to be an efficient choice under drifting scenarios. However, some attention must be paid to adapting such methods for concept drift. The training must be done in order to create local experts, and the commonly used neighborhood-search DS may become prohibitive with the continuous arrival of data. In this work, we propose IncA-DES, which employs a training strategy that promotes the generation of local experts with the assumption that different regions of the feature space become available with time. Additionally, the fusion of a concept drift detector supports the maintenance of information and adaptation to a new concept. An overlap-based classification filter is also employed in order to avoid using the DS method when there is a consensus in the neighborhood, a strategy that we argue every DS method should employ, as it was shown to make them more applicable and quicker. Moreover, aiming to reduce the processing time of the kNN, we propose an Online K-d tree algorithm, which can quickly remove instances without becoming inconsistent and deals with unbalancing concerns that may occur in data streams. Experimental results showed that the proposed framework achieved the best average accuracy compared to seven state-of-the-art methods considering different levels of label availability and had the smallest processing time among the most accurate methods. Additionally, the fusion with the Online K-d tree has improved processing time with a negligible loss in accuracy. We have made our framework available in an online repository.
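The neighborhood consensus filter mentioned above can be sketched as follows: if all k nearest neighbours share a label, return it directly and skip dynamic selection (DS). This sketch swaps in a brute-force Euclidean search for the paper's Online K-d tree, and the function names and the `dynamic_selection` callback are hypothetical.

```python
def k_nearest(query, data, k):
    """Brute-force Euclidean kNN over (features, label) pairs.

    Note: the paper uses an Online K-d tree; this linear scan is a stand-in.
    """
    def sq_dist(item):
        return sum((a - b) ** 2 for a, b in zip(query, item[0]))
    return sorted(data, key=sq_dist)[:k]

def classify_with_consensus_filter(query, data, k, dynamic_selection):
    """Return (label, used_ds). DS runs only when the neighbours disagree."""
    neighbours = k_nearest(query, data, k)
    labels = {label for _, label in neighbours}
    if len(labels) == 1:
        return labels.pop(), False  # consensus: skip the costly DS step
    return dynamic_selection(query, neighbours), True

data = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
        ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
ds_stub = lambda query, neighbours: "from_ds"  # placeholder for a real DS method

easy = classify_with_consensus_filter((0.2, 0.2), data, 3, ds_stub)
hard = classify_with_consensus_filter((3, 3), data, 4, ds_stub)
```

The first query sits inside a pure region and never touches the ensemble, which is where the speed-up claimed in the abstract would come from.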
【5】Quantum Transfer Learning to Boost Dementia Detection
标题:量子转移学习促进痴呆症检测
链接:https://arxiv.org/abs/2507.12485
作者:owmik, Talita Perciano, Himanshu Thapliyal
摘要:痴呆症是一种毁灭性的疾病,对个人、家庭和医疗保健系统都有深远的影响。早期和准确地检测痴呆症对于及时干预和改善患者预后至关重要。虽然经典的机器学习和深度学习方法已经被广泛地用于痴呆症预测,但这些解决方案往往难以处理高维生物医学数据和大规模数据集,很快就会达到计算和性能限制。为了应对这一挑战,量子机器学习(QML)已成为一种有前途的范式,提供更快的训练和先进的模式识别功能。这项工作旨在证明量子转移学习(QTL)的潜力,以提高应用于痴呆症检测的二进制分类任务的弱经典深度学习模型的性能。此外,我们展示了噪声对基于QTL的方法的影响,研究了该方法的可靠性和鲁棒性。使用OASIS 2数据集,我们展示了量子技术如何将次优的经典模型转化为更有效的生物医学图像分类解决方案,突出了它们对推进医疗保健技术的潜在影响。
摘要:Dementia is a devastating condition with profound implications for individuals, families, and healthcare systems. Early and accurate detection of dementia is critical for timely intervention and improved patient outcomes. While classical machine learning and deep learning approaches have been explored extensively for dementia prediction, these solutions often struggle with high-dimensional biomedical data and large-scale datasets, quickly reaching computational and performance limitations. To address this challenge, quantum machine learning (QML) has emerged as a promising paradigm, offering faster training and advanced pattern recognition capabilities. This work aims to demonstrate the potential of quantum transfer learning (QTL) to enhance the performance of a weak classical deep learning model applied to a binary classification task for dementia detection. Besides, we show the effect of noise on the QTL-based approach, investigating the reliability and robustness of this method. Using the OASIS 2 dataset, we show how quantum techniques can transform a suboptimal classical model into a more effective solution for biomedical image classification, highlighting their potential impact on advancing healthcare technology.
Reinforcement Learning (5 papers)
【1】Evaluating Reinforcement Learning Algorithms for Navigation in Simulated Robotic Quadrupeds: A Comparative Study Inspired by Guide Dog Behaviour
标题:评估模拟机器人四足动物导航的强化学习算法:受导盲犬行为启发的比较研究
链接:https://arxiv.org/abs/2507.13277
作者:. Harrison
摘要:机器人越来越多地融入各个行业,特别是在医疗保健领域。然而,四足机器人的许多有价值的应用仍然被忽视。本研究探讨三种强化学习演算法在训练模拟四足机器人自主导航与避障的有效性。我们的目标是开发一个机器人导盲犬模拟能够路径跟踪和避障,具有长期的现实世界援助导盲犬和视障人士的潜力。它还寻求扩大对医疗“宠物”的研究,包括机器人向导和警报犬。 对13篇相关研究论文的比较分析形成了关键的评估标准,包括碰撞检测、寻路算法、传感器使用、机器人类型和仿真平台。该研究的重点是传感器输入,碰撞频率,奖励信号和学习进度,以确定哪种算法最适合在复杂环境中支持机器人导航。 定制的环境用于确保在受控条件下对所有三种算法进行公平评价,从而允许一致的数据收集。结果表明,最近策略优化(PPO)在所有指标上都优于深度Q网络(DQN)和Q学习,特别是在每集达到目标的平均和中值步骤方面。 通过分析这些结果,这项研究有助于机器人导航,人工智能和医疗机器人技术,为人工智能驱动的四足动物移动的可行性及其在辅助机器人技术中的作用提供了见解。
摘要:Robots are increasingly integrated across industries, particularly in healthcare. However, many valuable applications for quadrupedal robots remain overlooked. This research explores the effectiveness of three reinforcement learning algorithms in training a simulated quadruped robot for autonomous navigation and obstacle avoidance. The goal is to develop a robotic guide dog simulation capable of path following and obstacle avoidance, with long-term potential for real-world assistance to guide dogs and visually impaired individuals. It also seeks to expand research into medical 'pets', including robotic guide and alert dogs. A comparative analysis of thirteen related research papers shaped key evaluation criteria, including collision detection, pathfinding algorithms, sensor usage, robot type, and simulation platforms. The study focuses on sensor inputs, collision frequency, reward signals, and learning progression to determine which algorithm best supports robotic navigation in complex environments. Custom-made environments were used to ensure fair evaluation of all three algorithms under controlled conditions, allowing consistent data collection. Results show that Proximal Policy Optimization (PPO) outperformed Deep Q-Network (DQN) and Q-learning across all metrics, particularly in average and median steps to goal per episode. By analysing these results, this study contributes to robotic navigation, AI and medical robotics, offering insights into the feasibility of AI-driven quadruped mobility and its role in assistive robotics.
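Of the three algorithms compared above, tabular Q-learning is the simplest to write down. The sketch below shows only the standard one-step update rule; the state and action names and the hyperparameter values are hypothetical and unrelated to the study's quadruped environment.

```python
def q_learning_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning step and return the updated value.

    q maps state -> {action: value}.
    """
    best_next = max(q[next_state].values())
    td_target = reward + gamma * best_next
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]

# Hypothetical two-state table for illustration.
q = {
    "s0": {"left": 0.0, "right": 0.0},
    "s1": {"left": 1.0, "right": 2.0},
}
updated = q_learning_update(q, "s0", "right", reward=1.0, next_state="s1",
                            alpha=0.5, gamma=0.5)
```

DQN replaces the table with a neural network, and PPO optimises a policy directly, so neither reduces to a one-line update like this.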
【2】Autonomous Resource Management in Microservice Systems via Reinforcement Learning
标题:通过强化学习在微服务系统中进行自主资源管理
链接:https://arxiv.org/abs/2507.12879
作者:, Nia Qi, Yingnan Deng, Zhihao Xue, Ming Gong, Wuyang Zhang
摘要:针对传统微服务架构中资源分配不均、延迟高、吞吐量不足等问题,提出了一种基于强化学习的微服务资源调度和优化方法。在微服务系统中,随着服务数量和负载的增加,有效地调度和分配计算能力、内存和存储等资源成为一个关键的研究挑战。为了解决这个问题,本文采用了一种基于强化学习的智能调度算法。通过智能体与环境的交互,不断优化资源分配策略。在实验中,本文考虑了不同的资源条件和负载情况,从多个维度评估所提出的方法,包括响应时间,吞吐量,资源利用率和成本效率。实验结果表明,基于强化学习的调度方法在低负载、高并发条件下显著提高了系统响应速度和吞吐量,同时还优化了资源利用率,降低了能耗。在多维资源条件下,该方法可以考虑多个目标,实现资源的优化调度。与传统的静态资源分配方法相比,强化学习模型具有更强的适应性和优化能力。它可以实时调整资源分配策略,从而在动态变化的负载和资源环境中保持良好的系统性能。
摘要:This paper proposes a reinforcement learning-based method for microservice resource scheduling and optimization, aiming to address issues such as uneven resource allocation, high latency, and insufficient throughput in traditional microservice architectures. In microservice systems, as the number of services and the load increase, efficiently scheduling and allocating resources such as computing power, memory, and storage becomes a critical research challenge. To address this, the paper employs an intelligent scheduling algorithm based on reinforcement learning. Through the interaction between the agent and the environment, the resource allocation strategy is continuously optimized. In the experiments, the paper considers different resource conditions and load scenarios, evaluating the proposed method across multiple dimensions, including response time, throughput, resource utilization, and cost efficiency. The experimental results show that the reinforcement learning-based scheduling method significantly improves system response speed and throughput under low load and high concurrency conditions, while also optimizing resource utilization and reducing energy consumption. Under multi-dimensional resource conditions, the proposed method can consider multiple objectives and achieve optimized resource scheduling. Compared to traditional static resource allocation methods, the reinforcement learning model demonstrates stronger adaptability and optimization capability. It can adjust resource allocation strategies in real time, thereby maintaining good system performance in dynamically changing load and resource environments.
【3】Fly, Fail, Fix: Iterative Game Repair with Reinforcement Learning and Large Multimodal Models
标题:飞行、失败、修复:利用强化学习和大型多峰模型进行迭代游戏修复
链接:https://arxiv.org/abs/2507.12666
作者:, Josef Spjut, Jonathan Tremblay
备注:Published at Reinforcement Learning and Video Games workshop this https URL
摘要:游戏设计取决于理解静态规则和内容如何转化为动态玩家行为-这是只检查游戏代码或资产的现代生成系统难以捕获的。我们提出了一个自动化设计迭代框架,通过配对强化学习(RL)代理(它对游戏进行测试)与大型多模态模型(LMM)(它根据代理的行为修改游戏)来缩小这一差距。在每个循环中,RL播放器完成几集,产生(i)数字播放度量和/或(ii)概括最近视频帧的紧凑图像带。LMM设计者接收游戏目标和当前游戏配置,分析游戏轨迹,并编辑配置以引导未来行为朝向目标。我们证明了LMM可以通过RL代理提供的行为轨迹进行推理,以迭代地改进游戏机制,指向AI辅助游戏设计的实用,可扩展的工具。
摘要:Game design hinges on understanding how static rules and content translate into dynamic player behavior, something that modern generative systems which inspect only a game's code or assets struggle to capture. We present an automated design iteration framework that closes this gap by pairing a reinforcement learning (RL) agent, which playtests the game, with a large multimodal model (LMM), which revises the game based on what the agent does. In each loop the RL player completes several episodes, producing (i) numerical play metrics and/or (ii) a compact image strip summarising recent video frames. The LMM designer receives a gameplay goal and the current game configuration, analyses the play traces, and edits the configuration to steer future behaviour toward the goal. We demonstrate that LMMs can reason over behavioral traces supplied by RL agents to iteratively refine game mechanics, pointing toward practical, scalable tools for AI-assisted game design.
【4】A Survey of Explainable Reinforcement Learning: Targets, Methods and Needs
标题:可解释强化学习综述:目标、方法与需求
链接:https://arxiv.org/abs/2507.12599
作者:ères
备注:69 pages, 19 figures
摘要:最近人工智能(AI)模型的成功伴随着其内部机制的不透明性,特别是由于使用了深度神经网络。为了理解这些内部机制并解释这些AI模型的输出,已经提出了一组方法,这些方法被归类为可解释AI(XAI)。本文重点关注XAI的一个子域,称为可解释强化学习(XRL),其目的是解释通过强化学习学习的代理的行为。我们提出了一个直观的分类基于两个问题“什么”和“如何”。第一个问题集中在方法解释的目标上,而第二个问题涉及提供解释的方式。我们使用这种分类法来提供超过250篇论文的最新评论。此外,我们还介绍了一组接近XRL的域,我们认为这些域应该得到社区的关注。最后,我们确定了XRL领域的一些需求。
摘要:The success of recent Artificial Intelligence (AI) models has been accompanied by the opacity of their internal mechanisms, due notably to the use of deep neural networks. In order to understand these internal mechanisms and explain the output of these AI models, a set of methods have been proposed, grouped under the domain of eXplainable AI (XAI). This paper focuses on a sub-domain of XAI, called eXplainable Reinforcement Learning (XRL), which aims to explain the actions of an agent that has learned by reinforcement learning. We propose an intuitive taxonomy based on two questions "What" and "How". The first question focuses on the target that the method explains, while the second relates to the way the explanation is provided. We use this taxonomy to provide a state-of-the-art review of over 250 papers. In addition, we present a set of domains close to XRL, which we believe should get attention from the community. Finally, we identify some needs for the field of XRL.
【5】Distributional Reinforcement Learning on Path-dependent Options
标题:路径依赖期权的分布式强化学习
链接:https://arxiv.org/abs/2507.12657
作者:r Özsoy
摘要:我们重新解释并提出了一个框架,通过使用分布强化学习(DistRL)估计收益的全分布来为路径依赖的金融衍生品定价。与传统的方法,专注于预期的期权价值,我们的方法模型的整个条件分布的回报,允许风险意识的定价,尾部风险估计,并增强不确定性量化。我们证明了这种方法对亚式期权的有效性,使用基于分位数的价值函数逼近。
摘要:We reinterpret and propose a framework for pricing path-dependent financial derivatives by estimating the full distribution of payoffs using Distributional Reinforcement Learning (DistRL). Unlike traditional methods that focus on expected option value, our approach models the entire conditional distribution of payoffs, allowing for risk-aware pricing, tail-risk estimation, and enhanced uncertainty quantification. We demonstrate the efficacy of this method on Asian options, using quantile-based value function approximators.
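To make "the full distribution of payoffs" concrete, the sketch below simulates arithmetic-average Asian call payoffs and reads off empirical quantiles, alongside the pinball loss that quantile regression minimises. It is plain Monte Carlo, not DistRL: the geometric Brownian motion dynamics, all parameter values, and the omission of discounting are assumptions for illustration only.

```python
import math
import random

def simulate_path(s0, r, sigma, n_steps, dt, rng):
    """Geometric Brownian motion path (assumed dynamics), excluding s0 itself."""
    prices, s = [], s0
    for _ in range(n_steps):
        z = rng.gauss(0.0, 1.0)
        s *= math.exp((r - 0.5 * sigma ** 2) * dt + sigma * math.sqrt(dt) * z)
        prices.append(s)
    return prices

def asian_call_payoff(prices, strike):
    """Payoff of an arithmetic-average Asian call (undiscounted)."""
    return max(sum(prices) / len(prices) - strike, 0.0)

def empirical_quantile(values, tau):
    """Simple order-statistic quantile estimate."""
    ordered = sorted(values)
    return ordered[int(tau * (len(ordered) - 1))]

def pinball_loss(q, y, tau):
    """Quantile-regression loss for estimate q, sample y, and level tau."""
    diff = y - q
    return tau * diff if diff >= 0 else (tau - 1.0) * diff

rng = random.Random(0)
payoffs = [
    asian_call_payoff(simulate_path(100.0, 0.02, 0.2, 50, 1.0 / 50, rng), 100.0)
    for _ in range(2000)
]
quantiles = {tau: empirical_quantile(payoffs, tau) for tau in (0.1, 0.5, 0.9)}
```

A quantile-based value function approximator would learn such quantiles conditionally on the state by minimising `pinball_loss`, instead of sorting simulated payoffs.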
Symbolic | Symbolic Learning (1 paper)
【1】(Exhaustive) Symbolic Regression and model selection by minimum description length
Link: https://arxiv.org/abs/2507.13033
Authors: mond
Comments: 15 pages, 4 figures; Invited review for the Royal Society Philosophical Transactions A special issue "Symbolic regression in the physical sciences"
Abstract: Symbolic regression is the machine learning method for learning functions from data. After a brief overview of the symbolic regression landscape, I will describe the two main challenges that traditional algorithms face: they have an unknown (and likely significant) probability of failing to find any given good function, and they suffer from ambiguity and poorly-justified assumptions in their function-selection procedure. To address these I propose an exhaustive search and model selection by the minimum description length principle, which allows accuracy and complexity to be directly traded off by measuring each in units of information. I showcase the resulting publicly available Exhaustive Symbolic Regression algorithm on three open problems in astrophysics: the expansion history of the universe, the effective behaviour of gravity in galaxies and the potential of the inflaton field. In each case the algorithm identifies many functions superior to the literature standards. This general purpose methodology should find widespread utility in science and beyond.
Medical (4 papers)
【1】Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction
Link: https://arxiv.org/abs/2507.13106
Authors: iao, Katharine Brudkiewicz, Zhen Yuan, Rosalind Aughwane, Magdalena Sokolska, Joanna Chappell, Trevor Gaunt, Anna L. David, Andrew P. King, Andrew Melbourne
Abstract: Fetal lung maturity is a critical indicator for predicting neonatal outcomes and the need for post-natal intervention, especially for pregnancies affected by fetal growth restriction. Intra-voxel incoherent motion analysis has shown promising results for non-invasive assessment of fetal lung development, but its reliance on manual segmentation is time-consuming, thus limiting its clinical applicability. In this work, we present an automated lung maturity evaluation pipeline for diffusion-weighted magnetic resonance images that consists of a deep learning-based fetal lung segmentation model and a model-fitting lung maturity assessment. A 3D nnU-Net model was trained on manually segmented images selected from the baseline frames of 4D diffusion-weighted MRI scans. The segmentation model demonstrated robust performance, yielding a mean Dice coefficient of 82.14%. Next, voxel-wise model fitting was performed based on both the nnU-Net-predicted and manual lung segmentations to quantify IVIM parameters reflecting tissue microstructure and perfusion. The results suggested no differences between the two. Our work shows that a fully automated pipeline is possible for supporting fetal lung maturity assessment and clinical decision-making.
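The Dice coefficient quoted in this abstract measures the overlap between a predicted mask and a manual mask. A minimal sketch, assuming both masks are flattened to equal-length sequences of 0/1 values; the function name `dice_coefficient` and the convention of returning 1.0 for two empty masks are illustrative choices, not details from the paper.

```python
def dice_coefficient(pred, truth):
    """Dice = 2 * |A and B| / (|A| + |B|) for two flat binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must contain the same number of voxels")
    # Voxels labelled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    # Foreground voxel counts of each mask, added together.
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total
```

A mean Dice score such as the one reported would be the average of this quantity over the test cases.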
【2】Demographic-aware fine-grained classification of pediatric wrist fractures
Link: https://arxiv.org/abs/2507.12964
Authors: ed, Ali Shariq Imran, Zenun Kastrati, Sher Muhammad Daudpota
Abstract: Wrist pathologies are frequently observed, particularly among children who constitute the majority of fracture cases. However, diagnosing these conditions is time-consuming and requires specialized expertise. Computer vision presents a promising avenue, contingent upon the availability of extensive datasets, a notable challenge in medical imaging. Therefore, reliance solely on one modality, such as images, proves inadequate, especially in an era of diverse and plentiful data types. In this study, we employ a multifaceted approach to address the challenge of recognizing wrist pathologies using an extremely limited dataset. Initially, we approach the problem as a fine-grained recognition task, aiming to identify subtle X-ray pathologies that conventional CNNs overlook. Secondly, we enhance network performance by fusing patient metadata with X-ray images. Thirdly, rather than pre-training on a coarse-grained dataset like ImageNet, we utilize weights trained on a fine-grained dataset. While metadata integration has been used in other medical domains, this is a novel application for wrist pathologies. Our results show that a fine-grained strategy and metadata integration improve diagnostic accuracy by 2% with a limited dataset and by over 10% with a larger fracture-focused dataset.
【3】Investigating Forecasting Models for Pandemic Infections Using Heterogeneous Data Sources: A 2-year Study with COVID-19
Link: https://arxiv.org/abs/2507.12966
Authors: Komodromos, Kleanthis Malialis, Panayiotis Kolios
Comments: Keywords: epidemiology, pandemic forecasting, COVID-19, infections, machine learning. Accepted: IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB) 2025
Abstract: Emerging in December 2019, the COVID-19 pandemic caused widespread health, economic, and social disruptions. Rapid global transmission overwhelmed healthcare systems, resulting in high infection rates, hospitalisations, and fatalities. To minimise the spread, governments implemented several non-pharmaceutical interventions like lockdowns and travel restrictions. While effective in controlling transmission, these measures also posed significant economic and societal challenges. Although the WHO declared COVID-19 no longer a global health emergency in May 2023, its impact persists, shaping public health strategies. The vast amount of data collected during the pandemic offers valuable insights into disease dynamics, transmission, and intervention effectiveness. Leveraging these insights can improve forecasting models, enhancing preparedness and response to future outbreaks while mitigating their social and economic impact. This paper presents a large-scale case study on COVID-19 forecasting in Cyprus, utilising a two-year dataset that integrates epidemiological data, vaccination records, policy measures, and weather conditions. We analyse infection trends, assess forecasting performance, and examine the influence of external factors on disease dynamics. The insights gained contribute to improved pandemic preparedness and response strategies.
【4】A Novel Data Augmentation Strategy for Robust Deep Learning Classification of Biomedical Time-Series Data: Application to ECG and EEG Analysis
Link: https://arxiv.org/abs/2507.12645
Authors: Guhdar, Ramadhan J. Mstafa, Abdulhakeem O. Mohammed
Abstract: The increasing need for accurate and unified analysis of diverse biological signals, such as ECG and EEG, is paramount for comprehensive patient assessment, especially in synchronous monitoring. Despite advances in multi-sensor fusion, a critical gap remains in developing unified architectures that effectively process and extract features from fundamentally different physiological signals. Another challenge is the inherent class imbalance in many biomedical datasets, often causing biased performance in traditional methods. This study addresses these issues by proposing a novel and unified deep learning framework that achieves state-of-the-art performance across different signal types. Our method integrates a ResNet-based CNN with an attention mechanism, enhanced by a novel data augmentation strategy: time-domain concatenation of multiple augmented variants of each signal to generate richer representations. Unlike prior work, we scientifically increase signal complexity to achieve future-reaching capabilities, which resulted in the best predictions compared to the state of the art. Preprocessing steps included wavelet denoising, baseline removal, and standardization. Class imbalance was effectively managed through the combined use of this advanced data augmentation and the Focal Loss function. Regularization techniques were applied during training to ensure generalization. We rigorously evaluated the proposed architecture on three benchmark datasets: UCI Seizure EEG, MIT-BIH Arrhythmia, and PTB Diagnostic ECG. It achieved accuracies of 99.96%, 99.78%, and 100%, respectively, demonstrating robustness across diverse signal types and clinical contexts. Finally, the architecture requires ~130 MB of memory and processes each sample in ~10 ms, suggesting suitability for deployment on low-end or wearable devices.
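The Focal Loss named in this abstract down-weights well-classified samples so that rare classes contribute more to training. A minimal per-sample sketch follows; the defaults `gamma=2.0` and `alpha=1.0` are assumptions taken from common practice, since the abstract does not state the values used, and `focal_loss` is a hypothetical helper name.

```python
import math


def focal_loss(probs, label, gamma=2.0, alpha=1.0):
    """Focal loss for one sample: -alpha * (1 - p_t)**gamma * log(p_t).

    probs  -- predicted class probabilities (should sum to 1)
    label  -- index of the true class
    gamma  -- focusing parameter; gamma = 0 recovers cross-entropy
    alpha  -- class-balancing weight (assumed constant here)
    """
    p_t = probs[label]  # probability assigned to the true class
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `gamma > 0`, a confident correct prediction (for example `p_t = 0.9`) is penalized far less than under plain cross-entropy, while a misclassified sample keeps most of its loss.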
Clustering (1 paper)
【1】Ranking Vectors Clustering: Theory and Applications
Link: https://arxiv.org/abs/2507.12583
Authors: hi, Ali Eshragh, Babak Aslani, Meysam Rabiee
Abstract: We study the problem of clustering ranking vectors, where each vector represents preferences as an ordered list of distinct integers. Specifically, we focus on the k-centroids ranking vectors clustering problem (KRC), which aims to partition a set of ranking vectors into k clusters and identify the centroid of each cluster. Unlike classical k-means clustering (KMC), KRC constrains both the observations and centroids to be ranking vectors. We establish the NP-hardness of KRC and characterize its feasible set. For the single-cluster case, we derive a closed-form analytical solution for the optimal centroid, which can be computed in linear time. To address the computational challenges of KRC, we develop an efficient approximation algorithm, KRCA, which iteratively refines initial solutions from KMC, referred to as the baseline solution. Additionally, we introduce a branch-and-bound (BnB) algorithm for efficient cluster reconstruction within KRCA, leveraging a decision tree framework to reduce computational time while incorporating a controlling parameter to balance solution quality and efficiency. We establish theoretical error bounds for KRCA and BnB. Through extensive numerical experiments on synthetic and real-world datasets, we demonstrate that KRCA consistently outperforms baseline solutions, delivering significant improvements in solution quality with fast computational times. This work highlights the practical significance of KRC for personalization and large-scale decision making, offering methodological advancements and insights that can be built upon in future studies.
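To make the single-cluster case concrete, here is a sketch under the assumption that distance is squared Euclidean. Every permutation of 1..m has the same norm, so minimizing the total distance reduces to maximizing the inner product with the column sums, and by the rearrangement inequality this is achieved by ranking those sums. This may differ from the paper's exact closed form; the version below sorts and is therefore O(m log m) rather than linear, and the names `ranking_centroid` and `total_cost` are hypothetical.

```python
def total_cost(centroid, rankings):
    """Sum of squared Euclidean distances from the centroid to each ranking."""
    return sum(
        sum((c - r) ** 2 for c, r in zip(centroid, ranking))
        for ranking in rankings
    )


def ranking_centroid(rankings):
    """Return a permutation of 1..m minimizing total_cost (assumed metric)."""
    m = len(rankings[0])
    # Column sums: how large each position's rank is across all observations.
    sums = [sum(r[j] for r in rankings) for j in range(m)]
    # Give rank 1 to the smallest sum, rank m to the largest; ties by index.
    order = sorted(range(m), key=lambda j: (sums[j], j))
    centroid = [0] * m
    for rank, j in enumerate(order, start=1):
        centroid[j] = rank
    return centroid
```

For example, the rankings `[1, 2, 3]`, `[1, 3, 2]`, and `[2, 1, 3]` have column sums 4, 6, and 8, so the centroid under this assumption is `[1, 2, 3]`.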
Autonomous Driving | Vehicles | Lane Detection, etc. (4 papers)
【1】Orbis: Overcoming Challenges of Long-Horizon Prediction in Driving World Models
Link: https://arxiv.org/abs/2507.13162
Authors: sakhan, Sudhanshu Mittal, Silvio Galesso, Karim Farid, Thomas Brox
Comments: Project page: this https URL
Abstract: Existing world models for autonomous driving struggle with long-horizon generation and generalization to challenging scenarios. In this work, we develop a model using simple design choices, and without additional supervision or sensors, such as maps, depth, or multiple cameras. We show that our model yields state-of-the-art performance, despite having only 469M parameters and being trained on 280h of video data. It particularly stands out in difficult scenarios like turning maneuvers and urban traffic. We test whether discrete token models possibly have advantages over continuous models based on flow matching. To this end, we set up a hybrid tokenizer that is compatible with both approaches and allows for a side-by-side comparison. Our study concludes in favor of the continuous autoregressive model, which is less brittle on individual design choices and more powerful than the model built on discrete tokens. Code, models and qualitative results are publicly available at https://lmb-freiburg.github.io/orbis.github.io/.
【2】LaViPlan : Language-Guided Visual Path Planning with RLVR
Link: https://arxiv.org/abs/2507.12911
Authors:
Comments: 11 pages, 6 figures
Abstract: Out-of-distribution (OOD) scenarios in autonomous driving refer to situations that deviate from the training domain, often leading to unexpected and potentially hazardous behavior from planners that lack prior exposure to such cases. Recently, Vision-Language Models (VLMs) have been introduced into autonomous driving research for their promising generalization capabilities in OOD settings. Early studies demonstrated that VLMs could recognize OOD scenarios and generate user-level decisions such as "go straight" or "turn right." However, a new challenge has emerged due to the misalignment between the VLM's high-level decisions or visual reasoning expressed in language, and the low-level predicted trajectories interpreted as actions. In this paper, we propose LaViPlan, a framework that leverages Reinforcement Learning with Verifiable Rewards (RLVR) to optimize VLMs using planning-oriented metrics. This approach addresses the vision-language-action misalignment observed in existing VLMs fine-tuned via supervised learning, which can recognize driving scenarios but often produce context-unaware decisions. Experimental results demonstrate that our method improves situational awareness and decision-making under OOD conditions, highlighting its potential to mitigate the misalignment issue. This work introduces a promising post-training paradigm for VLM agents in the context of autonomous driving.
【3】Deep Bilinear Koopman Model for Real-Time Vehicle Control in Frenet Frame
Link: https://arxiv.org/abs/2507.12578
Authors: Abtahi, Farhang Motallebi Araghi, Navid Mojahed, Shima Nazari
Comments: 14 pages, 8 figures. This manuscript is under review with IEEE Transactions on Intelligent Vehicles
Abstract: Accurate modeling and control of autonomous vehicles remain a fundamental challenge due to the nonlinear and coupled nature of vehicle dynamics. While Koopman operator theory offers a framework for deploying powerful linear control techniques, learning a finite-dimensional invariant subspace for high-fidelity modeling continues to be an open problem. This paper presents a deep Koopman approach for modeling and control of vehicle dynamics within the curvilinear Frenet frame. The proposed framework uses a deep neural network architecture to simultaneously learn the Koopman operator and its associated invariant subspace from the data. Input-state bilinear interactions are captured by the algorithm while preserving convexity, which makes it suitable for real-time model predictive control (MPC) application. A multi-step prediction loss is utilized during training to ensure long-horizon prediction capability. To further enhance real-time trajectory tracking performance, the model is integrated with a cumulative error regulator (CER) module, which compensates for model mismatch by mitigating accumulated prediction errors. Closed-loop performance is evaluated through hardware-in-the-loop (HIL) experiments using a CarSim RT model as the target plant, with real-time validation conducted on a dSPACE SCALEXIO system. The proposed controller achieved significant reductions in tracking error relative to baseline controllers, confirming its suitability for real-time implementation in embedded autonomous vehicle systems.
【4】ReAL-AD: Towards Human-Like Reasoning in End-to-End Autonomous Driving
Link: https://arxiv.org/abs/2507.12499
Authors: , Jiadong Tu, Yuexin Ma, Xinge Zhu
Comments: Accepted by ICCV2025
Abstract: End-to-end autonomous driving has emerged as a promising approach to unify perception, prediction, and planning within a single framework, reducing information loss and improving adaptability. However, existing methods often rely on fixed and sparse trajectory supervision, limiting their ability to capture the hierarchical reasoning process that human drivers naturally employ. To bridge this gap, we propose ReAL-AD, a Reasoning-Augmented Learning framework that structures decision-making in autonomous driving based on the three-tier human cognitive model: Driving Strategy, Driving Decision, and Driving Operation, where Vision-Language Models (VLMs) are incorporated to enhance situational awareness and structured reasoning across these levels. Specifically, we introduce: (1) the Strategic Reasoning Injector, which formulates high-level driving strategies by interpreting complex traffic contexts from VLM-generated insights; (2) the Tactical Reasoning Integrator, which refines strategic intent into interpretable tactical choices such as lane changes, overtaking, and speed adjustments; and (3) the Hierarchical Trajectory Decoder, which progressively translates tactical decisions into precise control actions for smooth and human-like trajectory execution. Extensive evaluations show that integrating our framework improves planning accuracy and safety by over 30%, making end-to-end autonomous driving more interpretable and aligned with human-like hierarchical reasoning. The project page can be found at: https://4dvlab.github.io/project_page/realad
Federated Learning | Privacy Protection | Encryption (3 papers)
【1】FedGA: A Fair Federated Learning Framework Based on the Gini Coefficient
Link: https://arxiv.org/abs/2507.12983
Authors: iu
Abstract: Fairness has emerged as one of the key challenges in federated learning. In horizontal federated settings, data heterogeneity often leads to substantial performance disparities across clients, raising concerns about equitable model behavior. To address this issue, we propose FedGA, a fairness-aware federated learning algorithm. We first employ the Gini coefficient to measure the performance disparity among clients. Based on this, we establish a relationship between the Gini coefficient $G$ and the update scale of the global model $U_s$, and use this relationship to adaptively determine the timing of fairness intervention. Subsequently, we dynamically adjust the aggregation weights according to the system's real-time fairness status, enabling the global model to better incorporate information from clients with relatively poor performance. We conduct extensive experiments on the Office-Caltech-10, CIFAR-10, and Synthetic datasets. The results show that FedGA effectively improves fairness metrics such as variance and the Gini coefficient, while maintaining strong overall performance, demonstrating the effectiveness of our approach.
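The Gini coefficient used here as a disparity measure can be computed directly from per-client performance values. The sketch below uses the standard sorted-values formula; applying it to client accuracies is an assumption about how the metric would be fed, and `gini_coefficient` is a hypothetical helper, not code from FedGA.

```python
def gini_coefficient(values):
    """Gini coefficient of non-negative values: 0 means perfectly equal.

    Uses G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    with x sorted in ascending order and i starting at 1.
    """
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0  # assumed convention for empty or all-zero input
    ordered = sorted(values)
    weighted = sum((i + 1) * x for i, x in enumerate(ordered))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n
```

Identical client accuracies give a value of 0, and the value grows as performance concentrates in fewer clients, which is the signal a fairness intervention could be triggered on.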
【2】Federated Learning in Open- and Closed-Loop EMG Decoding: A Privacy and Performance Perspective
Link: https://arxiv.org/abs/2507.12652
Authors: lm, César Uribe, Momona Yamagami
Comments: 23 pages, 7 figures
Abstract: Invasive and non-invasive neural interfaces hold promise as high-bandwidth input devices for next-generation technologies. However, neural signals inherently encode sensitive information about an individual's identity and health, making data sharing for decoder training a critical privacy challenge. Federated learning (FL), a distributed, privacy-preserving learning framework, presents a promising solution, but it remains unexplored in closed-loop adaptive neural interfaces. Here, we introduce FL-based neural decoding and systematically evaluate its performance and privacy using high-dimensional electromyography signals in both open- and closed-loop scenarios. In open-loop simulations, FL significantly outperformed local learning baselines, demonstrating its potential for high-performance, privacy-conscious neural decoding. In contrast, closed-loop user studies required adapting FL methods to accommodate single-user, real-time interactions, a scenario not supported by standard FL. This modification resulted in local learning decoders surpassing the adapted FL approach in closed-loop performance, yet local learning still carried higher privacy risks. Our findings highlight a critical performance-privacy tradeoff in real-time adaptive applications and indicate the need for FL methods specifically designed for co-adaptive, single-user applications.
【3】Sporadic Federated Learning Approach in Quantum Environment to Tackle Quantum Noise
Link: https://arxiv.org/abs/2507.12492
Authors: man, Atit Pokharel, Dinh C. Nguyen
Abstract: Quantum Federated Learning (QFL) is an emerging paradigm that combines quantum computing and federated learning (FL) to enable decentralized model training while maintaining data privacy over quantum networks. However, quantum noise remains a significant barrier in QFL, since modern quantum devices experience heterogeneous noise levels due to variances in hardware quality and sensitivity to quantum decoherence, resulting in inadequate training performance. To address this issue, we propose SpoQFL, a novel QFL framework that leverages sporadic learning to mitigate quantum noise heterogeneity in distributed quantum systems. SpoQFL dynamically adjusts training strategies based on noise fluctuations, enhancing model robustness, convergence stability, and overall learning efficiency. Extensive experiments on real-world datasets demonstrate that SpoQFL significantly outperforms conventional QFL approaches, achieving superior training performance and more stable convergence.
Reasoning | Analysis | Understanding | Explanation (5 papers)
【1】MUPAX: Multidimensional Problem Agnostic eXplainable AI
Link: https://arxiv.org/abs/2507.13090
Authors: Dentamaro, Felice Franchini, Giuseppe Pirlo, Irina Voiculescu
Abstract: Robust XAI techniques should ideally be simultaneously deterministic, model agnostic, and guaranteed to converge. We propose Multidimensional Problem Agnostic eXplainable AI (MUPAX), a deterministic, model agnostic explainability technique with guaranteed convergence. MUPAX's measure-theoretic formulation gives principled feature importance attribution through structured perturbation analysis that discovers inherent input patterns and eliminates spurious relationships. We evaluate MUPAX on an extensive range of data modalities and tasks: audio classification (1D), image classification (2D), volumetric medical image analysis (3D), and anatomical landmark detection, demonstrating dimension agnostic effectiveness. The rigorous convergence guarantees extend to any loss function and arbitrary dimensions, making MUPAX applicable to virtually any problem context for AI. By contrast with other XAI methods that typically decrease performance when masking, MUPAX not only preserves but actually enhances model accuracy by capturing only the most important patterns of the original data. Extensive benchmarking against the state of the XAI art demonstrates MUPAX's ability to generate precise, consistent and understandable explanations, a crucial step towards explainable and trustworthy AI systems. The source code will be released upon publication.
【2】Learning to Reject Low-Quality Explanations via User Feedback
Link: https://arxiv.org/abs/2507.12900
Authors: diotti, Dario Pesenti, Stefano Teso, Jesse Davis
Abstract: Machine Learning predictors are increasingly being employed in high-stakes applications such as credit scoring. Explanations help users unpack the reasons behind their predictions, but are not always "high quality". That is, end-users may have difficulty interpreting or believing them, which can complicate trust assessment and downstream decision-making. We argue that classifiers should have the option to refuse handling inputs whose predictions cannot be explained properly and introduce a framework for learning to reject low-quality explanations (LtX) in which predictors are equipped with a rejector that evaluates the quality of explanations. In this problem setting, the key challenges are how to properly define and assess explanation quality and how to design a suitable rejector. Focusing on popular attribution techniques, we introduce ULER (User-centric Low-quality Explanation Rejector), which learns a simple rejector from human ratings and per-feature relevance judgments to mirror human judgments of explanation quality. Our experiments show that ULER outperforms both state-of-the-art and explanation-aware learning-to-reject strategies at LtX on eight classification and regression benchmarks and on a new human-annotated dataset, which we will publicly release to support future research.
【3】Understanding the Evolution of the Neural Tangent Kernel at the Edge of Stability
Link: https://arxiv.org/abs/2507.12837
Authors: ng, Jeremy Cohen, Yuanzhi Li
Abstract: The study of Neural Tangent Kernels (NTKs) in deep learning has drawn increasing attention in recent years. NTKs typically actively change during training and are related to feature learning. In parallel, recent work on Gradient Descent (GD) has found a phenomenon called Edge of Stability (EoS), in which the largest eigenvalue of the NTK oscillates around a value inversely proportional to the step size. However, although follow-up works have explored the underlying mechanism of such eigenvalue behavior in depth, the understanding of the behavior of the NTK eigenvectors during EoS is still missing. This paper examines the dynamics of NTK eigenvectors during EoS in detail. Across different architectures, we observe that larger learning rates cause the leading eigenvectors of the final NTK, as well as the full NTK matrix, to have greater alignment with the training target. We then study the underlying mechanism of this phenomenon and provide a theoretical analysis for a two-layer linear network. Our study enhances the understanding of GD training dynamics in deep learning.
【4】Reasoning-Finetuning Repurposes Latent Representations in Base Models
Link: https://arxiv.org/abs/2507.12638
Authors: , Chuqiao Lin, Constantin Venhoff, Neel Nanda
Comments: 6 pages, 6 figures. ICML 2025 Workshop on Actionable Interpretability
Abstract: Backtracking, an emergent behavior elicited by reasoning fine-tuning, has been shown to be a key mechanism in reasoning models' enhanced capabilities. Prior work has succeeded in manipulating this behavior via steering vectors, but the underlying mechanism remains poorly understood. In this work, we show that the emergence of backtracking in DeepSeek-R1-Distill-Llama-8B is in part driven by a repurposed direction already present in base model activations. Specifically, we identify a direction in base Llama-3.1-8B's residual stream which systematically induces backtracking when used to steer the distilled reasoning model, and find that the effects of steering with this direction cannot be trivially explained by token-level attributes. We further find that this direction does not induce backtracking in the base model, suggesting that the reasoning finetuning process repurposes pre-existing representations to form new behavioral circuits. Additionally, we hypothesize that this direction is one of several which may work together to mediate backtracking. Our findings offer a compelling picture that reasoning-finetuned models repurpose pre-existing base model representations, rather than learn new capabilities from scratch.
【5】Implementation and Analysis of GPU Algorithms for Vecchia Approximation
标题:Vecchia逼近的图形处理算法的实现与分析
链接:https://arxiv.org/abs/2407.02740
作者:ames, Joseph Guinness
摘要:高斯过程已经成为空间统计学家工具箱中不可或缺的一部分,但不适合分析大型数据集,因为精确拟合相关模型需要大量的时间和内存。Vecchia近似被广泛用于降低计算复杂度,并且可以用并行算法来计算。虽然已经为Vecchia Approximation开发了多核软件,例如GpGp R包,但缺乏设计用于在图形处理单元(GPU)上运行的软件,尽管GPU在统计和机器学习方面取得了巨大成功。我们比较了在GPU上实现Vecchia近似的三种不同方法:其中两种与用于其他高斯过程近似的方法相似,另一种是新的。研究了存储器类型对性能的影响,并对最终方法进行了相应的优化。我们表明,我们的新方法优于其他两个,然后将其在GpGpU R包。我们通过在各种数据集上拟合高斯过程模型,将GpGpU与现有的多核和GPU加速软件进行比较,包括从地球观测卫星收集的$n>10^6$点的大型时空数据集。我们的研究结果表明,GpGpU实现了更快的运行时间和更好的预测精度。
摘要:Gaussian Processes have become an indispensable part of the spatial statistician's toolbox but are unsuitable for analyzing large datasets because of the significant time and memory needed to fit the associated model exactly. Vecchia Approximation is widely used to reduce the computational complexity and can be calculated with embarrassingly parallel algorithms. While multi-core software has been developed for Vecchia Approximation, such as the GpGp R package, software designed to run on graphics processing units (GPUs) is lacking, despite the tremendous success GPUs have had in statistics and machine learning. We compare three different ways to implement Vecchia Approximation on a GPU: two of which are similar to methods used for other Gaussian Process approximations and one that is new. The impact of memory type on performance is investigated and the final method is optimized accordingly. We show that our new method outperforms the other two and then present it in the GpGpU R package. We compare GpGpU to existing multi-core and GPU-accelerated software by fitting Gaussian Process models on various datasets, including a large spatial-temporal dataset of $n>10^6$ points collected from an earth-observing satellite. Our results show that GpGpU achieves faster runtimes and better predictive accuracy.
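To make the idea of Vecchia Approximation concrete, here is a minimal CPU sketch in Python (not the GpGp or GpGpU implementation). It replaces the joint Gaussian log-likelihood with a sum of conditional log-densities, each conditioned on at most `m` nearest previously ordered points. The exponential covariance, its parameters, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def exp_cov(X, variance=1.0, lengthscale=0.5, nugget=1e-6):
    """Exponential covariance with a small nugget (an illustrative choice)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return variance * np.exp(-d / lengthscale) + nugget * np.eye(len(X))

def vecchia_loglik(X, y, m, cov=exp_cov):
    """Sum of log p(y_i | y_neighbors(i)) for a zero-mean GP."""
    total = 0.0
    for i in range(len(y)):
        # Conditioning set: the m nearest points among those ordered before i.
        dist = np.linalg.norm(X[:i] - X[i], axis=1)
        nbrs = np.argsort(dist)[:m]
        K = cov(X[np.append(nbrs, i)])
        if len(nbrs) == 0:
            mean, var = 0.0, K[0, 0]
        else:
            w = np.linalg.solve(K[:-1, :-1], K[:-1, -1])
            mean = w @ y[nbrs]
            var = K[-1, -1] - w @ K[:-1, -1]
        total += -0.5 * (np.log(2 * np.pi * var) + (y[i] - mean) ** 2 / var)
    return total

def exact_loglik(X, y, cov=exp_cov):
    """Exact zero-mean GP log-likelihood, for comparison on small n."""
    K = cov(X)
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * (y @ np.linalg.solve(K, y) + logdet + len(y) * np.log(2 * np.pi))
```

Each loop iteration depends only on its own small neighbor set, which is what makes the computation embarrassingly parallel. With `m = n - 1` every point conditions on all earlier points and the result matches the exact likelihood.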
检测相关(2篇)
【1】RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images
标题:RS-TinyNet:用于检测遥感图像中微小物体的分阶段特征融合网络
链接:https://arxiv.org/abs/2507.13120
作者: Jiang, Wei Zhang, Xuerui Mao
摘要:遥感图像中微小目标的检测一直是一个长期存在的挑战,因为它们的空间信息极其有限,特征表示能力弱,并且在复杂背景中分布密集。尽管投入了大量的努力,主流检测器在这种情况下仍然表现不佳。为了弥合这一差距,我们引入了RS-TinyNet,这是一种多阶段特征融合和增强模型,专门针对各种RS场景中的RS微小目标检测。RS-TinyNet有两个新颖的设计:微小对象显著性建模和特征完整性重建。在这些原则的指导下,我们设计了三个逐步的功能增强模块。其中,多维协作注意(MDCA)模块采用多维注意来增强微小物体的显著性。此外,辅助可逆分支(ARB)和渐进融合检测头(PFDH)模块的引入,以保持信息流和融合多层次的功能,以弥合语义差距,并保留结构细节。在公共RS数据集AI-TOD上的综合实验表明,我们的RS-TinyNet超过现有的最先进的(SOTA)检测器4.0%AP和6.5%AP75。在DIOR基准数据集上的测试进一步验证了该算法在不同RS场景下的检测性能。这些结果表明,提出的多阶段特征融合策略为复杂遥感环境下的微小目标检测提供了一种有效和实用的解决方案。
摘要:Detecting tiny objects in remote sensing (RS) imagery has been a long-standing challenge due to their extremely limited spatial information, weak feature representations, and dense distributions across complex backgrounds. Despite considerable effort, mainstream detectors still underperform in such scenarios. To bridge this gap, we introduce RS-TinyNet, a multi-stage feature fusion and enhancement model explicitly tailored for tiny object detection in various RS scenarios. RS-TinyNet comes with two novel designs: tiny object saliency modeling and feature integrity reconstruction. Guided by these principles, we design three step-wise feature enhancement modules. Among them, the multi-dimensional collaborative attention (MDCA) module employs multi-dimensional attention to enhance the saliency of tiny objects. Additionally, the auxiliary reversible branch (ARB) and the progressive fusion detection head (PFDH) module are introduced to preserve information flow and fuse multi-level features, bridging semantic gaps and retaining structural detail. Comprehensive experiments on the public RS dataset AI-TOD show that our RS-TinyNet surpasses existing state-of-the-art (SOTA) detectors by 4.0% AP and 6.5% AP75. Evaluations on the DIOR benchmark dataset further validate its superior detection performance in diverse RS scenarios. These results demonstrate that the proposed multi-stage feature fusion strategy offers an effective and practical solution for tiny object detection in complex RS environments.
【2】Fault detection and diagnosis for the engine electrical system of a space launcher based on a temporal convolutional autoencoder and calibrated classifiers
标题:基于时间卷积自动编码器和校准分类器的空间发射器发动机电气系统故障检测和诊断
链接:https://arxiv.org/abs/2507.13022
作者:ra, Louison Bocquet-Nouaille, Elinirina Robinson, Serge Le Gonidec
备注:53 pages, 16 figures
摘要:在下一代可重复使用的航天发射器的健康监测的背景下,我们概述了第一步发展机载故障检测和诊断能力的电气系统,控制发动机阀门。与文献中现有的方法不同,我们的解决方案旨在满足更广泛的关键要求。这包括估计预测的置信水平,检测分布外(OOD)情况,以及控制假警报。所提出的解决方案是基于时间卷积自动编码器,从原始传感器数据中自动提取低维特征。故障检测和诊断分别进行使用二进制和多类分类器训练的autoencoder的潜在和残留空间。分类器是基于直方图的梯度提升模型,其被校准为输出概率,该概率可以被解释为置信水平。一个相对简单的技术,基于归纳共形异常检测,用于识别OOD数据。我们利用其他简单而有效的技术,如累积和控制图(CSCUM),以限制误报,和阈值移动,以解决类不平衡的故障检测。所提出的框架是高度可配置的,并已进行了评估模拟数据,包括标称和异常的操作场景。结果表明,我们的解决方案是一个有前途的第一步,虽然测试与真实的数据将是必要的,以确保它达到所需的成熟度水平的操作使用。
摘要:In the context of the health monitoring for the next generation of reusable space launchers, we outline a first step toward developing an onboard fault detection and diagnostic capability for the electrical system that controls the engine valves. Unlike existing approaches in the literature, our solution is designed to meet a broader range of key requirements. This includes estimating confidence levels for predictions, detecting out-of-distribution (OOD) cases, and controlling false alarms. The proposed solution is based on a temporal convolutional autoencoder to automatically extract low-dimensional features from raw sensor data. Fault detection and diagnosis are respectively carried out using a binary and a multiclass classifier trained on the autoencoder latent and residual spaces. The classifiers are histogram-based gradient boosting models calibrated to output probabilities that can be interpreted as confidence levels. A relatively simple technique, based on inductive conformal anomaly detection, is used to identify OOD data. We leverage other simple yet effective techniques, such as cumulative sum control chart (CUSUM) to limit the false alarms, and threshold moving to address class imbalance in fault detection. The proposed framework is highly configurable and has been evaluated on simulated data, covering both nominal and anomalous operational scenarios. The results indicate that our solution is a promising first step, though testing with real data will be necessary to ensure that it achieves the required maturity level for operational use.
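The CUSUM idea mentioned in the abstract can be sketched as a one-sided cumulative sum over a stream of anomaly scores. This is a generic textbook form, not the authors' configuration: the function name, the `target`, `slack`, and `threshold` parameters, and the reset-after-alarm behavior are all assumptions.

```python
def cusum_alarms(scores, target=0.0, slack=0.5, threshold=5.0):
    """One-sided CUSUM: accumulate evidence of an upward shift in `scores`.

    An alarm is raised only when the cumulative excess over target + slack
    passes `threshold`, so isolated spikes do not trigger it.
    """
    s = 0.0
    alarms = []
    for t, x in enumerate(scores):
        s = max(0.0, s + (x - target - slack))
        if s > threshold:
            alarms.append(t)
            s = 0.0  # restart accumulation after raising an alarm
    return alarms
```

A single large score is absorbed, while a sustained shift accumulates and eventually raises an alarm, which is how this kind of chart limits false alarms.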
分类|识别(1篇)
【1】WhoFi: Deep Person Re-Identification via Wi-Fi Channel Signal Encoding
标题:WhoFi:通过Wi-Fi信道信号编码进行深度人员重新识别
链接:https://arxiv.org/abs/2507.12869
作者:ola, Daniele Pannone, Dario Montagnini, Emad Emam
摘要:在视频监控中,人员身份识别是一个关键而又具有挑战性的任务。虽然传统方法依赖于视觉数据,但照明不良、遮挡和次优角度等问题通常会阻碍性能。为了应对这些挑战,我们引入了WhoFi,这是一种利用Wi-Fi信号进行人员重新识别的新型管道。从信道状态信息(CSI)中提取生物特征,并通过模块化深度神经网络(DNN)进行处理,该网络具有基于变换器的编码器。该网络使用批量负损失函数进行训练,以学习鲁棒和可推广的生物特征签名。在NTU-Fi数据集上的实验表明,与最先进的方法相比,我们的方法取得了有竞争力的结果,证实了其通过Wi-Fi信号识别个人的有效性。
摘要:Person Re-Identification is a key and challenging task in video surveillance. While traditional methods rely on visual data, issues like poor lighting, occlusion, and suboptimal angles often hinder performance. To address these challenges, we introduce WhoFi, a novel pipeline that utilizes Wi-Fi signals for person re-identification. Biometric features are extracted from Channel State Information (CSI) and processed through a modular Deep Neural Network (DNN) featuring a Transformer-based encoder. The network is trained using an in-batch negative loss function to learn robust and generalizable biometric signatures. Experiments on the NTU-Fi dataset show that our approach achieves competitive results compared to state-of-the-art methods, confirming its effectiveness in identifying individuals via Wi-Fi signals.
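For readers unfamiliar with in-batch negative losses, here is a minimal NumPy sketch of the general technique: each row of `queries` should match the same row of `keys`, and every other row in the batch serves as a negative. The cosine similarity, the temperature value, and the function name are assumptions; the abstract does not specify WhoFi's exact formulation.

```python
import numpy as np

def in_batch_negative_loss(queries, keys, temperature=0.1):
    """Cross-entropy over a batch similarity matrix with diagonal positives."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature                 # (batch, batch) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))     # positives sit on the diagonal
```

The loss is small when each signature is most similar to its own pair and large when it is closer to another sample in the batch.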
表征(3篇)
【1】Boosting Team Modeling through Tempo-Relational Representation Learning
标题:通过时态关系表示学习促进团队建模
链接:https://arxiv.org/abs/2507.13305
作者:Marco De Luca, Giovanna Varni, Andrea Passerini
摘要:团队建模仍然是人工智能和社会科学交叉的一个基本挑战。社会科学研究强调需要联合建模动态和关系,而实际应用需要能够同时推断多个团队结构的统一模型,提供可解释的见解和可操作的建议,以提高团队绩效。然而,现有的工程并不能满足这些实际需求。为了弥合这一差距,我们提出了TRENN,一种新型的时间关系架构,集成了:(i)自动时间图提取器,(ii)时间关系编码器,(iii)团队结构预测解码器,以及(iv)两个互补的可解释性模块。TRENN共同捕捉关系和时间团队动态,为MT-TRENN提供了坚实的基础,MT-TRENN通过将解码器替换为多任务头来扩展TReNN,使模型能够学习共享的社会嵌入并同时预测多个团队结构,包括紧急领导力,领导风格和团队合作组件。实验结果表明,我们的方法显着优于完全依赖于时间或关系信息的方法。此外,实验评估表明,MT-TRENN中集成的可解释性模块产生可解释的见解和可操作的建议,以支持团队改进。这些功能使我们的方法特别适合以人为中心的人工智能应用,例如高风险协作环境中的智能决策支持系统。
摘要:Team modeling remains a fundamental challenge at the intersection of Artificial Intelligence and the Social Sciences. Social Science research emphasizes the need to jointly model dynamics and relations, while practical applications demand unified models capable of inferring multiple team constructs simultaneously, providing interpretable insights and actionable recommendations to enhance team performance. However, existing works do not meet these practical demands. To bridge this gap, we present TRENN, a novel tempo-relational architecture that integrates: (i) an automatic temporal graph extractor, (ii) a tempo-relational encoder, (iii) a decoder for team construct prediction, and (iv) two complementary explainability modules. TRENN jointly captures relational and temporal team dynamics, providing a solid foundation for MT-TRENN, which extends TReNN by replacing the decoder with a multi-task head, enabling the model to learn shared Social Embeddings and simultaneously predict multiple team constructs, including Emergent Leadership, Leadership Style, and Teamwork components. Experimental results demonstrate that our approach significantly outperforms approaches that rely exclusively on temporal or relational information. Additionally, experimental evaluation has shown that the explainability modules integrated in MT-TRENN yield interpretable insights and actionable suggestions to support team improvement. These capabilities make our approach particularly well-suited for Human-Centered AI applications, such as intelligent decision-support systems in high-stakes collaborative environments.
【2】Spectral Bellman Method: Unifying Representation and Exploration in RL
标题:谱Bellman方法:RL中统一的表示和探索
链接:https://arxiv.org/abs/2507.13181
作者:ti, Bo Dai, Shie Mannor, Guy Tennenholtz
摘要:表征的效果已经在强化学习中得到了证明,无论是理论上还是经验上的成功。然而,现有的表示学习主要是从模型学习方面诱导的,与我们的RL任务不一致。这项工作介绍了频谱贝尔曼表示,这是一种源自固有贝尔曼误差(IBE)条件的新框架,它与贝尔曼更新的基本结构在可能的值函数空间中保持一致,因此,直接面向基于值的RL。我们的关键见解是发现一个基本的谱关系:在零IBE条件下,由贝尔曼算子的值函数分布的变换本质上与特征协方差结构有关。这种光谱连接产生了一个新的,理论上接地学习状态动作功能,固有的贝尔曼对齐协方差的目标。我们的方法需要一个简单的修改现有的算法。我们证明,我们学习的表示,使结构化的探索,通过调整特征协方差与贝尔曼动态,并提高整体性能,特别是在具有挑战性的硬探索和长期信用分配任务。我们的框架自然扩展到强大的多步Bellman算子,进一步扩大了其影响。光谱贝尔曼表示为基于值的强化学习提供了一条原则性和有效的路径,可以学习更强大和结构合理的表示。
摘要:The effect of representation has been demonstrated in reinforcement learning, through both theoretical and empirical successes. However, existing representation learning is mainly induced from model-learning aspects and is therefore misaligned with the RL task. This work introduces Spectral Bellman Representation, a novel framework derived from the Inherent Bellman Error (IBE) condition, which aligns with the fundamental structure of Bellman updates across a space of possible value functions and is therefore directly oriented towards value-based RL. Our key insight is the discovery of a fundamental spectral relationship: under the zero-IBE condition, the transformation of a distribution of value functions by the Bellman operator is intrinsically linked to the feature covariance structure. This spectral connection yields a new, theoretically-grounded objective for learning state-action features that inherently capture this Bellman-aligned covariance. Our method requires a simple modification to existing algorithms. We demonstrate that our learned representations enable structured exploration, by aligning feature covariance with Bellman dynamics, and improve overall performance, particularly in challenging hard-exploration and long-horizon credit assignment tasks. Our framework naturally extends to powerful multi-step Bellman operators, further broadening its impact. Spectral Bellman Representation offers a principled and effective path toward learning more powerful and structurally sound representations for value-based reinforcement learning.
【3】cIDIR: Conditioned Implicit Neural Representation for Regularized Deformable Image Registration
标题:cIDIR:用于正规化可变形图像配准的条件隐式神经表示
链接:https://arxiv.org/abs/2507.12953
作者: Hadramy, Oumeymah Cherkaoui, Philippe C. Cattin
摘要:正则化在可变形图像配准(DIR)中是必不可少的,以确保估计的变形向量场(DVF)保持平滑,物理上合理,解剖学上一致。然而,在基于学习的DIR框架中微调正则化参数在计算上是昂贵的,通常需要多次训练迭代。为了解决这个问题,我们提出了cIDIR,这是一种基于隐式神经表示(INR)的新型DIR框架,它以正则化超参数为条件进行配准。与需要对每个正则化超参数设置进行再训练的传统方法不同,cIDIR在这些超参数的先验分布上进行训练,然后通过使用分割掩码作为观察来优化正则化超参数。此外,cIDIR对连续和可微分的DVF进行建模,通过自动微分实现高级正则化技术的无缝集成。在DIR-LAB数据集上进行评估,cIDIR在整个数据集上实现了高准确性和鲁棒性。
摘要:Regularization is essential in deformable image registration (DIR) to ensure that the estimated Deformation Vector Field (DVF) remains smooth, physically plausible, and anatomically consistent. However, fine-tuning regularization parameters in learning-based DIR frameworks is computationally expensive, often requiring multiple training iterations. To address this, we propose cIDIR, a novel DIR framework based on Implicit Neural Representations (INRs) that conditions the registration process on regularization hyperparameters. Unlike conventional methods that require retraining for each regularization hyperparameter setting, cIDIR is trained over a prior distribution of these hyperparameters, then optimized over the regularization hyperparameters by using the segmentation masks as an observation. Additionally, cIDIR models a continuous and differentiable DVF, enabling seamless integration of advanced regularization techniques via automatic differentiation. Evaluated on the DIR-LAB dataset, cIDIR achieves high accuracy and robustness across the dataset.
3D|3D重建等相关(1篇)
【1】Predicting 3D Rigid Body Dynamics with Deep Residual Network
标题:利用深度残差网络预测3D刚体动力学
链接:https://arxiv.org/abs/2407.18798
作者:inbarrs Oketunji
摘要:本研究探讨应用深度残差网络预测相互作用的三维刚体的动力学。我们提出了一个框架,将用C++实现的3D物理模拟器与使用PyTorch构建的深度学习模型相结合。模拟器生成的训练数据包括线性和角运动、弹性碰撞、流体摩擦、重力效应和阻尼。我们的深度残差网络由一个输入层、多个残差块和一个输出层组成,旨在处理3D动态的复杂性。我们使用10,000个模拟场景来评估网络的性能,每个场景涉及3-5个相互作用的刚体。该模型实现了0.015的位置预测和0.022的方向预测的均方误差,比基线方法提高了25%。我们的研究结果证明了网络捕捉复杂物理相互作用的能力,特别是在预测弹性碰撞和旋转动力学方面取得了成功。这项工作通过展示深度残差网络在复杂3D物理系统建模中的巨大潜力,为物理信息机器学习做出了重大贡献。我们讨论了我们的方法的局限性,并提出了未来的方向,以提高更多样化的对象形状和材料的泛化。
摘要:This study investigates the application of deep residual networks for predicting the dynamics of interacting three-dimensional rigid bodies. We present a framework combining a 3D physics simulator implemented in C++ with a deep learning model constructed using PyTorch. The simulator generates training data encompassing linear and angular motion, elastic collisions, fluid friction, gravitational effects, and damping. Our deep residual network, consisting of an input layer, multiple residual blocks, and an output layer, is designed to handle the complexities of 3D dynamics. We evaluate the network's performance using a dataset of 10,000 simulated scenarios, each involving 3-5 interacting rigid bodies. The model achieves a mean squared error of 0.015 for position predictions and 0.022 for orientation predictions, representing a 25% improvement over baseline methods. Our results demonstrate the network's ability to capture intricate physical interactions, with particular success in predicting elastic collisions and rotational dynamics. This work significantly contributes to physics-informed machine learning by showcasing the immense potential of deep residual networks in modeling complex 3D physical systems. We discuss our approach's limitations and propose future directions for improving generalization to more diverse object shapes and materials.
优化|敛散性(8篇)
【1】Merge Kernel for Bayesian Optimization on Permutation Space
标题:排列空间上Bayesian优化的合并核
链接:https://arxiv.org/abs/2507.13263
作者:, Linjiang Chen
备注:8 pages, submitted to AAAI-26
摘要:贝叶斯优化(BO)算法是解决黑箱优化问题的标准工具。目前最先进的置换空间BO方法依赖于Mallow核-一种显式枚举每个成对比较的$\Omega(n^2)$表示。受Mallows核与两两比较之间的密切关系的启发,我们提出了一种基于排序算法的置换空间上的核函数生成框架。在这个框架中,Mallow内核可以被看作是从冒泡排序派生出来的一个特殊实例。此外,我们引入了由归并排序构造的\textbf{Merge Kernel},它用$\Theta(n\log n)$取代了二次复杂度,以实现尽可能低的复杂度。得到的特征向量明显更短,可以在线性时间内计算,但仍然有效地捕获有意义的排列距离。为了在不牺牲紧凑性的情况下提高鲁棒性和右不变性,我们进一步引入了三个轻量级的任务不可知描述符:(1)移位直方图,它聚集绝对元素位移并提供全局错位信号;(2)分裂对线,它通过在整个排列的两半中对齐元素来编码选定的长距离比较;以及(3)滑动窗口图案,其总结了影响邻近目标的局部顺序模式。我们的实证评估表明,所提出的内核始终优于国家的最先进的马洛内核在各种置换优化基准。结果表明,合并核提供了一个更紧凑,但更有效的解决方案,贝叶斯优化排列空间。
摘要:Bayesian Optimization (BO) is a standard tool for black-box optimization problems. The current state-of-the-art BO approach for permutation spaces relies on the Mallows kernel, an $\Omega(n^2)$ representation that explicitly enumerates every pairwise comparison. Inspired by the close relationship between the Mallows kernel and pairwise comparison, we propose a novel framework for generating kernel functions on permutation space based on sorting algorithms. Within this framework, the Mallows kernel can be viewed as a special instance derived from bubble sort. Further, we introduce the Merge Kernel constructed from merge sort, which replaces the quadratic complexity with $\Theta(n\log n)$ to achieve the lowest possible complexity. The resulting feature vector is significantly shorter and can be computed in linearithmic time, yet still efficiently captures meaningful permutation distances. To boost robustness and right-invariance without sacrificing compactness, we further incorporate three lightweight, task-agnostic descriptors: (1) a shift histogram, which aggregates absolute element displacements and supplies a global misplacement signal; (2) a split-pair line, which encodes selected long-range comparisons by aligning elements across the two halves of the whole permutation; and (3) sliding-window motifs, which summarize local order patterns that influence near-neighbor objectives. Our empirical evaluation demonstrates that the proposed kernel consistently outperforms the state-of-the-art Mallows kernel across various permutation optimization benchmarks. Results confirm that the Merge Kernel provides a more compact yet more effective solution for Bayesian optimization in permutation space.
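The quadratic pairwise representation behind the Mallows kernel can be sketched as follows. This shows only the baseline the abstract compares against, in its standard form k(p, q) = exp(-lam * d(p, q)) with d the Kendall tau distance; it is not the Merge Kernel, whose construction the abstract does not detail. Function names and the default `lam` are assumptions.

```python
from itertools import combinations
import math

def pairwise_features(perm):
    """One +/-1 feature per item pair: n(n-1)/2 entries, hence quadratic size."""
    pos = {item: rank for rank, item in enumerate(perm)}
    items = sorted(perm)
    return [1 if pos[a] < pos[b] else -1 for a, b in combinations(items, 2)]

def kendall_tau_distance(p, q):
    """Number of item pairs that p and q order differently."""
    return sum(1 for a, b in zip(pairwise_features(p), pairwise_features(q)) if a != b)

def mallows_kernel(p, q, lam=0.5):
    """Mallows kernel: exp(-lam * Kendall tau distance)."""
    return math.exp(-lam * kendall_tau_distance(p, q))
```

The Kendall tau distance to the identity equals the number of adjacent swaps bubble sort performs, which is one way to read the abstract's remark that the Mallows kernel corresponds to bubble sort.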
【2】GradNetOT: Learning Optimal Transport Maps with GradNets
标题:GradNetOT:使用GradNets学习最优运输地图
链接:https://arxiv.org/abs/2507.13191
作者:haudhari, Srinivasa Pranav, José M. F. Moura
摘要:单调梯度函数在解决最优传输问题的Monge公式中起着核心作用,该问题出现在从流体动力学到机器人群控制的现代应用中。当传输成本为平方欧几里德距离时,Brenier定理保证了唯一的最优映射是凸函数的梯度,即单调梯度映射,并且满足Monge-Ampère方程。在[arXiv:2301.10862] [arXiv:2404.07361]中,我们提出了单调梯度网络(mGradNets),这是一种直接参数化单调梯度映射空间的神经网络。在这项工作中,我们利用mGradNets,通过最小化使用Monge-Ampère方程定义的训练损失函数来直接学习最优传输映射。我们通过实验表明,mGradNets的结构偏差有利于学习最优传输映射,并将我们的方法应用于机器人群控制问题。
摘要:Monotone gradient functions play a central role in solving the Monge formulation of the optimal transport problem, which arises in modern applications ranging from fluid dynamics to robot swarm control. When the transport cost is the squared Euclidean distance, Brenier's theorem guarantees that the unique optimal map is the gradient of a convex function, namely a monotone gradient map, and it satisfies a Monge-Ampère equation. In [arXiv:2301.10862] [arXiv:2404.07361], we proposed Monotone Gradient Networks (mGradNets), neural networks that directly parameterize the space of monotone gradient maps. In this work, we leverage mGradNets to directly learn the optimal transport mapping by minimizing a training loss function defined using the Monge-Ampère equation. We empirically show that the structural bias of mGradNets facilitates the learning of optimal transport maps and employ our method for a robot swarm control problem.
【3】From a Mixed-Policy Perspective: Improving Differentiable Automatic Post-editing Optimization
标题:从混合策略的角度:改进可微自动后期编辑优化
链接:https://arxiv.org/abs/2507.12931
作者:n
摘要:本文从混合策略的角度,对可微自动后期编辑优化(DAPO)算法提出了两种新的改进。标准的策略梯度方法可能会受到不稳定和样本效率低下的影响,特别是在稀疏奖励设置中。为了解决这个问题,我们首先提出了一种方法,它结合了一个预先训练的、稳定的引导策略($\pi_{\phi}$)来提供离策略经验,从而对目标策略($\pi_{\mathrm{on}}$)的训练进行正则化。该方法通过自适应调整学习步长,提高了训练的稳定性和收敛速度。其次,我们扩展了这个想法,以重新利用零奖励样本,这些样本往往被DAPO等动态采样策略丢弃。通过将这些样本作为一个由专家策略指导的独立批次,我们进一步提高了样本效率。我们为这两种方法提供了理论分析,证明了它们的目标函数在强化学习的既有理论框架内收敛到最优解。所提出的混合策略框架有效地平衡了探索和利用,有望实现更稳定和高效的策略优化。
摘要:This paper introduces two novel modifications to the Differentiable Automatic Post-editing Optimization (DAPO) algorithm, approached from a mixed-policy perspective. Standard policy gradient methods can suffer from instability and sample inefficiency, particularly in sparse reward settings. To address this, we first propose a method that incorporates a pre-trained, stable guiding policy ($\pi_{\phi}$) to provide off-policy experience, thereby regularizing the training of the target policy ($\pi_{\mathrm{on}}$). This approach improves training stability and convergence speed by adaptively adjusting the learning step size. Secondly, we extend this idea to re-utilize zero-reward samples, which are often discarded by dynamic sampling strategies like DAPO's. By treating these samples as a distinct batch guided by the expert policy, we further enhance sample efficiency. We provide a theoretical analysis for both methods, demonstrating that their objective functions converge to the optimal solution within the established theoretical framework of reinforcement learning. The proposed mixed-policy framework effectively balances exploration and exploitation, promising more stable and efficient policy optimization.
【4】Sample-Constrained Black Box Optimization for Audio Personalization
标题:用于音频个性化的样本约束黑匣子优化
链接:https://arxiv.org/abs/2507.12773
作者: Rajagopalan, Yu-Lin Wei, Romit Roy Choudhury
备注:Published in AAAI 2024
摘要:我们考虑个性化音频以最大化用户体验的问题。简而言之,我们的目标是找到一个过滤器$h^*$,适用于任何音乐或语音,将最大限度地提高用户的满意度。这是一个黑盒优化问题,因为用户的满意度函数是未知的。在这个主题上已经做了大量的工作,其中关键思想是向用户播放音频样本,每个音频样本由不同的滤波器$h_i$整形,并向用户查询他们的满意度分数$f(h_i)$。一个家庭的"代理”功能,然后设计,以适应这些分数和优化方法逐渐完善这些功能,以达到过滤器$\hat{h}^*$,最大限度地提高满意度。在某些应用中,我们观察到第二种类型的查询是可能的,其中用户可以告诉我们最佳过滤器$h^*$的各个元素$h ^*[j]$。考虑一个与烹饪的类比,目标是烹饪一个最大化用户满意度的食谱。可以要求用户对各种烹饪食谱(例如,豆腐炒饭)或评分的个别成分(说,盐,糖,大米,鸡肉等)。给定$B$查询的预算,其中查询可以是任何一种类型,我们的目标是找到最大化用户满意度的配方。我们的建议建立在稀疏高斯过程回归(GPR)的基础上,并展示了混合方法如何优于任何一种类型的查询。我们的结果通过模拟和现实世界的实验,志愿者给音乐/语音音频的反馈,并能够实现高满意度进行验证。我们相信这种混合查询的想法在黑盒优化中打开了新的问题,解决方案可以使音频个性化之外的其他应用受益。
摘要:We consider the problem of personalizing audio to maximize user experience. Briefly, we aim to find a filter $h^*$ which, applied to any music or speech, will maximize the user's satisfaction. This is a black-box optimization problem since the user's satisfaction function is unknown. Substantive work has been done on this topic where the key idea is to play audio samples to the user, each shaped by a different filter $h_i$, and query the user for their satisfaction scores $f(h_i)$. A family of "surrogate" functions is then designed to fit these scores and the optimization method gradually refines these functions to arrive at the filter $\hat{h}^*$ that maximizes satisfaction. In certain applications, we observe that a second type of querying is possible where users can tell us the individual elements $h^*[j]$ of the optimal filter $h^*$. Consider an analogy from cooking where the goal is to cook a recipe that maximizes user satisfaction. A user can be asked to score various cooked recipes (e.g., tofu fried rice) or to score individual ingredients (say, salt, sugar, rice, chicken, etc.). Given a budget of $B$ queries, where a query can be of either type, our goal is to find the recipe that will maximize this user's satisfaction. Our proposal builds on Sparse Gaussian Process Regression (GPR) and shows how a hybrid approach can outperform any one type of querying. Our results are validated through simulations and real-world experiments, where volunteers gave feedback on music/speech audio and were able to achieve high satisfaction levels. We believe this idea of hybrid querying opens new problems in black-box optimization and solutions can benefit other applications beyond audio personalization.
【5】Are encoders able to learn landmarkers for warm-starting of Hyperparameter Optimization?
标题:编码器是否能够学习地标(landmarkers)以实现超参数优化的热启动?
链接:https://arxiv.org/abs/2507.12604
作者:jko, Katarzyna Woźnica
摘要:有效地表示异构的表格数据集的元学习的目的仍然是一个悬而未决的问题。以前的方法依赖于旨在通用的表示。本文提出了两种针对特定元任务的表格表示学习新方法--热启动贝叶斯超参数优化。两者都遵循我们制定的特定要求,即强制表示以捕捉地标的属性。第一种方法涉及深度度量学习,而第二种方法基于地标重建。我们以两种方式评估所提出的编码器。除了目标元任务的增益之外,我们还使用所提出的需求的满足程度作为评估度量。实验表明,虽然所提出的编码器可以有效地学习与地标对齐的表示,但它们可能不会直接转化为HPO热启动元任务中的显着性能增益。
摘要:Effectively representing heterogeneous tabular datasets for meta-learning purposes is still an open problem. Previous approaches rely on representations that are intended to be universal. This paper proposes two novel methods for tabular representation learning tailored to a specific meta-task - warm-starting Bayesian Hyperparameter Optimization. Both follow the specific requirement formulated by ourselves that enforces representations to capture the properties of landmarkers. The first approach involves deep metric learning, while the second one is based on landmarkers reconstruction. We evaluate the proposed encoders in two ways. Next to the gain in the target meta-task, we also use the degree of fulfillment of the proposed requirement as the evaluation metric. Experiments demonstrate that while the proposed encoders can effectively learn representations aligned with landmarkers, they may not directly translate to significant performance gains in the meta-task of HPO warm-starting.
【6】Optimal Empirical Risk Minimization under Temporal Distribution Shifts
标题:时间分布变化下的最优经验风险最小化
链接:https://arxiv.org/abs/2507.13287
作者:ng, Ramesh Johari, Dominik Rothenhäusler, Emily Fox
摘要:时间分布变化对在动态变化的环境中训练和部署的机器学习模型构成了关键挑战。本文介绍了RIDER(RISk minimization under Dynamically Evolving Regimes),它导出了时间分布变化下的最优加权经验风险最小化方法。我们的方法在理论上是基于随机分布移位模型,其中随机移位是数据生成过程中众多不可预测的变化的叠加。我们表明,常见的加权方案,如汇集所有数据,指数加权数据,并只使用最近的数据,自然出现在我们的框架中的特殊情况。我们证明了RIDER在Wild-Time中的一系列基准方法中作为年鉴数据集的微调步骤时,始终提高了样本外预测性能。此外,我们表明,RIDER优于标准的加权策略在其他两个现实世界的任务:预测股市波动和预测乘坐纽约市出租车数据的持续时间。
摘要:Temporal distribution shifts pose a key challenge for machine learning models trained and deployed in dynamically evolving environments. This paper introduces RIDER (RIsk minimization under Dynamically Evolving Regimes) which derives optimally-weighted empirical risk minimization procedures under temporal distribution shifts. Our approach is theoretically grounded in the random distribution shift model, where random shifts arise as a superposition of numerous unpredictable changes in the data-generating process. We show that common weighting schemes, such as pooling all data, exponentially weighting data, and using only the most recent data, emerge naturally as special cases in our framework. We demonstrate that RIDER consistently improves out-of-sample predictive performance when applied as a fine-tuning step on the Yearbook dataset, across a range of benchmark methods in Wild-Time. Moreover, we show that RIDER outperforms standard weighting strategies in two other real-world tasks: predicting stock market volatility and forecasting ride durations in NYC taxi data.
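The weighting schemes the abstract lists as special cases can be illustrated with a small weighted least-squares sketch. It assumes one observation per time step, ordered oldest to newest, and uses hypothetical helper names; it does not reproduce RIDER's derived optimal weights, which the abstract does not give.

```python
import numpy as np

def exponential_weights(n_periods, decay):
    """Weights proportional to decay**age, where the newest period has age 0.

    decay = 1.0 pools all data equally; decay = 0.0 keeps only the newest period.
    """
    ages = np.arange(n_periods - 1, -1, -1)
    w = float(decay) ** ages
    return w / w.sum()

def weighted_least_squares(X, y, sample_weights):
    """Minimize sum_i w_i * (y_i - x_i @ beta)**2 via rescaled least squares."""
    sw = np.sqrt(sample_weights)
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coef
```

Intermediate values of `decay` interpolate between pooling everything and trusting only the latest data, which is the trade-off a weighted empirical risk minimizer has to resolve under temporal shift.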
【7】Stochastic Weakly Convex Optimization Under Heavy-Tailed Noises
标题:重尾噪声下的随机弱凸优化
链接:https://arxiv.org/abs/2507.13283
作者:u, Yi Xu, Xiangyang Ji
摘要:越来越多的研究集中在重尾梯度噪声下的随机一阶方法(SFOMs),这种噪声已在实际深度学习模型的训练中被观察到。本文主要研究两类梯度噪声:一类是亚威布尔(sub-Weibull)噪声,另一类是假设具有有界$p$阶中心矩($p$-BCM)且$p\in (1, 2]$的噪声。后者更具挑战性,因为当$p\in (1, 2)$时方差可能为无穷大。在这两种梯度噪声假设下,SFOMs的期望收敛性和高概率收敛性在凸优化和标准光滑优化的背景下已得到广泛研究。然而,对于弱凸目标(这一类包括所有Lipschitz连续的凸目标和光滑目标),我们对SFOMs在这两类噪声下的期望收敛性和高概率收敛性的理解仍不完整。我们研究了亚威布尔噪声下朴素随机次梯度下降(SsGD)方法的高概率收敛性,以及$p$-BCM噪声下截断(clipped)SsGD的高概率收敛性和期望收敛性。这两项分析均在弱凸优化的背景下进行。对于可能非凸且非光滑的弱凸目标,我们的结果表明,与光滑目标的情况相比,在亚威布尔噪声下,朴素SsGD对失败概率和迭代次数的理论依赖性不会变差。在$p$-BCM噪声下,我们的结果表明,弱凸目标的非光滑性和非凸性并不影响截断SGD相对于光滑情形对失败概率的理论依赖性;然而,我们推导出的样本复杂度比一个众所周知的光滑优化下界更差。
摘要:An increasing number of studies have focused on stochastic first-order methods (SFOMs) under heavy-tailed gradient noises, which have been observed in the training of practical deep learning models. In this paper, we focus on two types of gradient noises: one is sub-Weibull noise, and the other is noise under the assumption that it has a bounded $p$-th central moment ($p$-BCM) with $p\in (1, 2]$. The latter is more challenging due to the occurrence of infinite variance when $p\in (1, 2)$. Under these two gradient noise assumptions, the in-expectation and high-probability convergence of SFOMs have been extensively studied in the contexts of convex optimization and standard smooth optimization. However, for weakly convex objectives (a class that includes all Lipschitz-continuous convex objectives and smooth objectives), our understanding of the in-expectation and high-probability convergence of SFOMs under these two types of noises remains incomplete. We investigate the high-probability convergence of the vanilla stochastic subgradient descent (SsGD) method under sub-Weibull noises, as well as the high-probability and in-expectation convergence of clipped SsGD under the $p$-BCM noises. Both analyses are conducted in the context of weakly convex optimization. For weakly convex objectives that may be non-convex and non-smooth, our results demonstrate that the theoretical dependence of vanilla SsGD on the failure probability and number of iterations under sub-Weibull noises does not degrade compared to the case of smooth objectives. Under $p$-BCM noises, our findings indicate that the non-smoothness and non-convexity of weakly convex objectives do not impact the theoretical dependence of clipped SGD on the failure probability relative to the smooth case; however, the sample complexity we derived is worse than a well-known lower bound for smooth optimization.
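As a reminder of what clipping does in a stochastic subgradient step, here is a minimal sketch of the common norm-clipping rule. The abstract does not state the paper's exact clipping operator or parameter schedule, so the rule, the function names, and the parameters `tau` and `step_size` are assumptions.

```python
import numpy as np

def clip_by_norm(g, tau):
    """Rescale g so its Euclidean norm is at most tau; leave small vectors unchanged."""
    norm = np.linalg.norm(g)
    return g if norm <= tau else g * (tau / norm)

def clipped_subgradient_step(x, subgrad, step_size, tau):
    """One clipped stochastic subgradient descent update."""
    return x - step_size * clip_by_norm(subgrad, tau)
```

Clipping bounds the size of every update, so a single heavy-tailed noise sample cannot move the iterate arbitrarily far even when the noise has infinite variance.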
【8】Refining Coarse-Grained Molecular Topologies: A Bayesian Optimization Approach
标题:细化粗粒度分子布局:一种Bayesian优化方法
链接:https://arxiv.org/abs/2501.02707
作者:y, Adam P. Generale, Nikhith Vankireddy, Yuichiro Asoma, Masataka Nakauchi, Haein Lee, Katsuhisa Yoshida, Yoshishige Okuno, Surya R. Kalidindi
摘要:分子动力学(MD)模拟对于准确预测各种压力和温度系综下大分子系统的物理和化学性质至关重要。然而,与全原子(AA)MD模拟相关的高计算成本导致了粗粒分子动力学(CGMD)的发展,提供了AA结构到代表性CG珠的低维压缩,以降低预测精度为代价降低了计算费用。现有的CGMD方法,如CG-Martini(根据实验数据校准),旨在生成一个嵌入的拓扑结构,充分概括了一系列的结构。不利的是,在试图指定跨分子类别的适用性的参数化时,它不能专门用于特定领域的应用,其中足够的准确性和计算速度是至关重要的。这项工作提出了一种新的方法,通过细化通用Martini 3拓扑结构(特别是给定粗粒度映射内的键合交互参数)来优化CGMD模拟的衍生结果,以使用贝叶斯优化方法针对特定于领域的应用程序。我们已经开发并验证了适用于任何聚合度的CG潜力,代表了该领域的重大进步。我们优化的CG潜力,基于Martini 3框架,旨在实现与AAMD相当的精度,同时保持CGMD的计算效率。这种方法弥合了多尺度分子模拟中效率和准确性之间的差距,可能使各种科学和技术领域的分子发现更加快速和具有成本效益。
摘要:Molecular Dynamics (MD) simulations are essential for accurately predicting the physical and chemical properties of large molecular systems across various pressure and temperature ensembles. However, the high computational costs associated with All-Atom (AA) MD simulations have led to the development of Coarse-Grained Molecular Dynamics (CGMD), providing a lower-dimensional compression of the AA structure into representative CG beads, offering reduced computational expense at the cost of predictive accuracy. Existing CGMD methods, such as CG-Martini (calibrated against experimental data), aim to generate an embedding of a topology that sufficiently generalizes across a range of structures. Detrimentally, in attempting to specify parameterization with applicability across molecular classes, it is unable to specialize to domain-specific applications, where sufficient accuracy and computational speed are critical. This work presents a novel approach to optimize derived results from CGMD simulations by refining the general-purpose Martini3 topologies (specifically, the bonded interaction parameters within a given coarse-grained mapping) for domain-specific applications using Bayesian Optimization methodologies. We have developed and validated a CG potential applicable to any degree of polymerization, representing a significant advancement in the field. Our optimized CG potential, based on the Martini3 framework, aims to achieve accuracy comparable to AAMD while maintaining the computational efficiency of CGMD. This approach bridges the gap between efficiency and accuracy in multiscale molecular simulations, potentially enabling more rapid and cost-effective molecular discovery across various scientific and technological domains.
预测|估计(4篇)
【1】Leveraging Asynchronous Cross-border Market Data for Improved Day-Ahead Electricity Price Forecasting in European Markets
标题:利用异步跨境市场数据改进欧洲市场的日前电价预测
链接:https://arxiv.org/abs/2507.13250
作者:garida Mascarenhas, Jilles De Blauwe, Mikael Amelin, Hussain Kazmi
备注:Both Maria Margarida Mascarenhas and Jilles De Blauwe contributed equally to the paper
摘要:准确的短期电价预测对于在日前市场中战略性地安排需求和发电报价至关重要。虽然近年来数据驱动技术在实现高预测准确性方面表现出相当大的实力,但它们在很大程度上依赖于输入协变量的质量。在本文中,我们调查是否异步公布的价格,由于不同的关门时间(GCT)在一些投标区可以提高预测精度,在其他市场与后来的GCT。使用最先进的模型集成,我们在比利时(BE)和瑞典投标区(SE 3)的预测准确性分别提高了22%和9%,包括来自早期GCT(德国-卢森堡,奥地利和瑞士)的互联市场的价格数据。这种改善适用于一般和极端市场条件。我们的分析还产生了进一步的重要见解:频繁的模型重新校准对于最大的准确性是必要的,但会带来大量的额外计算成本,并且使用来自更多市场的数据并不总是会带来更好的性能-这是我们在预测模型的可解释性分析中深入研究的事实。总体而言,这些研究结果为市场参与者和决策者提供了宝贵的指导,旨在优化日益相互关联和波动的欧洲能源市场的投标策略。
摘要:Accurate short-term electricity price forecasting is crucial for strategically scheduling demand and generation bids in day-ahead markets. While data-driven techniques have shown considerable prowess in achieving high forecast accuracy in recent years, they rely heavily on the quality of input covariates. In this paper, we investigate whether asynchronously published prices as a result of differing gate closure times (GCTs) in some bidding zones can improve forecasting accuracy in other markets with later GCTs. Using a state-of-the-art ensemble of models, we show significant improvements of 22% and 9% in forecast accuracy in the Belgian (BE) and Swedish bidding zones (SE3) respectively, when including price data from interconnected markets with earlier GCT (Germany-Luxembourg, Austria, and Switzerland). This improvement holds for both general as well as extreme market conditions. Our analysis also yields further important insights: frequent model recalibration is necessary for maximum accuracy but comes at substantial additional computational costs, and using data from more markets does not always lead to better performance - a fact we delve deeper into with interpretability analysis of the forecast models. Overall, these findings provide valuable guidance for market participants and decision-makers aiming to optimize bidding strategies within increasingly interconnected and volatile European energy markets.
【2】FLDmamba: Integrating Fourier and Laplace Transform Decomposition with Mamba for Enhanced Time Series Prediction
标题:FLDmamba:将傅立叶和拉普拉斯变换分解与Mamba集成,用于增强时间序列预测
链接:https://arxiv.org/abs/2507.12803
作者:ang, Chenglei Yu, Haixin Wang, Yudong Yan, Yuansheng Cao, Siu-Ming Yiu, Tailin Wu, Hongzhi Yin
备注:12 pages
摘要:时间序列预测是各个领域的关键任务,由于时间序列数据的固有复杂性,包括非平稳性,多尺度周期性和瞬态动态,特别是在处理长期预测时,面临着重大挑战。虽然基于transformer的架构已经显示出了希望,但其与序列长度的二次复杂性阻碍了其长期预测的效率。状态空间模型的最新进展,如Mamba,为长期建模提供了更有效的替代方案,但它们不能有效地捕获多尺度周期性和瞬态动态。同时,它们容易受到时间序列中的数据噪声问题的影响。本文提出了一种新的框架,FLDmamba(傅立叶和拉普拉斯变换分解Mamba),解决这些限制。FLDmamba利用傅立叶和拉普拉斯变换的优势,有效地捕获时间序列数据中的多尺度周期性和瞬态动态,并提高模型对数据噪声问题的鲁棒性。我们广泛的实验表明,FLDmamba在时间序列预测基准测试中实现了卓越的性能,优于基于transformer和其他基于Mamba的架构。为了提高我们方法的可重复性,我们通过以下URL提供代码和数据:https://github.com/AI4Science-WestlakeU/FLDmamba。
摘要:Time series prediction, a crucial task across various domains, faces significant challenges due to the inherent complexities of time series data, including non-stationarity, multi-scale periodicity, and transient dynamics, particularly when tackling long-term predictions. While Transformer-based architectures have shown promise, their quadratic complexity with sequence length hinders their efficiency for long-term predictions. Recent advancements in State-Space Models, such as Mamba, offer a more efficient alternative for long-term modeling, but they cannot capture multi-scale periodicity and transient dynamics effectively. They are also susceptible to data noise in time series. This paper proposes a novel framework, FLDmamba (Fourier and Laplace Transform Decomposition Mamba), to address these limitations. FLDmamba leverages the strengths of both Fourier and Laplace transforms to effectively capture multi-scale periodicity and transient dynamics within time series data, and to improve the robustness of the model to data noise. Our extensive experiments demonstrate that FLDmamba achieves superior performance on time series prediction benchmarks, outperforming both Transformer-based and other Mamba-based architectures. To promote the reproducibility of our method, we have made both the code and data accessible via the following URL: https://github.com/AI4Science-WestlakeU/FLDmamba.
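As a rough illustration of the Fourier side of the decomposition idea in this abstract, the sketch below splits a series into its strongest periodic components and a residual using NumPy's real FFT. It is not the FLDmamba architecture: the function name `fourier_decompose`, the top-`k` selection rule, and the synthetic signal are all assumptions made for illustration, and the Laplace-transform branch for transient dynamics is not covered.

```python
import numpy as np

def fourier_decompose(x, k=2):
    """Split a 1-D series into its k strongest frequency components and a residual."""
    spectrum = np.fft.rfft(x)
    magnitudes = np.abs(spectrum)
    magnitudes[0] = 0.0                      # rank only non-DC bins
    keep = np.argsort(magnitudes)[-k:]       # indices of the k largest bins
    filtered = np.zeros_like(spectrum)
    filtered[keep] = spectrum[keep]
    filtered[0] = spectrum[0]                # always keep the mean (DC term)
    periodic = np.fft.irfft(filtered, n=len(x))
    return periodic, x - periodic

# Synthetic series (assumed for illustration): two periodicities plus noise.
t = np.arange(256)
clean = np.sin(2 * np.pi * t / 32) + 0.5 * np.sin(2 * np.pi * t / 8)
noisy = clean + 0.1 * np.random.default_rng(0).standard_normal(t.size)
periodic, residual = fourier_decompose(noisy, k=2)
```

Keeping only a few dominant bins acts as a simple denoiser here, which is one intuition for why frequency-domain decomposition can help with noisy, multi-periodic series.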
【3】Bayesian Modeling and Estimation of Linear Time-Variant Systems using Neural Networks and Gaussian Processes
标题:使用神经网络和高斯过程的线性时变系统的Bayesian建模和估计
链接:https://arxiv.org/abs/2507.12878
作者:lman
摘要:从输入输出数据辨识线性时变系统是一个基本而又具有挑战性的不适定逆问题。这项工作介绍了一个统一的贝叶斯框架,模型系统的脉冲响应,$h(t,\tau)$,作为一个随机过程。我们将响应分解为后验均值和随机波动项,该公式提供了一种用于量化不确定性的原则方法,并自然定义了一种新的有用的系统类,我们称之为线性时不变期望(LTIE)。为了进行推理,我们利用现代机器学习技术,包括贝叶斯神经网络和高斯过程,使用可扩展的变分推理。我们通过一系列的实验证明,我们的框架可以鲁棒地推断从一个单一的噪声观测的LTI系统的属性,显示优越的数据效率相比,在模拟的环境噪声断层扫描问题的经典方法,并成功地跟踪一个连续变化的LTV脉冲响应通过使用结构化高斯过程之前。这项工作提供了一个灵活的和强大的方法,在动态环境中的不确定性感知系统识别。
摘要:The identification of Linear Time-Variant (LTV) systems from input-output data is a fundamental yet challenging ill-posed inverse problem. This work introduces a unified Bayesian framework that models the system's impulse response, $h(t, \tau)$, as a stochastic process. We decompose the response into a posterior mean and a random fluctuation term, a formulation that provides a principled approach for quantifying uncertainty and naturally defines a new, useful system class we term Linear Time-Invariant in Expectation (LTIE). To perform inference, we leverage modern machine learning techniques, including Bayesian neural networks and Gaussian Processes, using scalable variational inference. We demonstrate through a series of experiments that our framework can robustly infer the properties of an LTI system from a single noisy observation, show superior data efficiency compared to classical methods in a simulated ambient noise tomography problem, and successfully track a continuously varying LTV impulse response by using a structured Gaussian Process prior. This work provides a flexible and robust methodology for uncertainty-aware system identification in dynamic environments.
【4】Differentially Private Conformal Prediction via Quantile Binary Search
标题:通过分位数二分搜索实现差分隐私共形预测
链接:https://arxiv.org/abs/2507.12497
作者:M. Romanus, Roberto Molinari
摘要:大多数差分隐私(DP)方法专注于基于训练数据限制来自学习者的隐私泄漏,当程序涉及校准数据集时,考虑泄漏的方法较少,这在不确定性量化方法中很常见,如共形预测(CP)。由于在这个方向上的方法是有限的,在这项工作中,我们提供了一个通用的DP CP方法,我们称之为通过分位数搜索(P-COQS)的私有一致性。所提出的方法适应现有的随机二进制搜索算法计算DP分位数在CP的校准阶段,从而保证隐私的后果预测集。然而,当使用有限样本校准集时,这是以相对于期望的$(1 - \alpha)$水平略微覆盖不足为代价的(尽管广泛的经验结果表明,在所考虑的情况下,P-COQS通常以所需水平为目标)。适应算法的adaptation属性和量化的近似覆盖保证随之而来的CP,我们进行了广泛的实验,研究隐私噪声,样本大小和显着性水平的影响,我们的方法相比,现有的替代品的性能。此外,我们在几个基准数据集上对我们的方法进行了经验评估,包括CIFAR-10,ImageNet和CoronaHack。我们的研究结果表明,所提出的方法是强大的隐私噪声,并表现出良好的经验覆盖率,效率和信息量方面与目前的DP替代。具体而言,结果表明,P-COQS产生较小的共形预测集,同时在所有这些实验设置中针对所需的覆盖和隐私保证。
摘要:While most Differentially Private (DP) approaches focus on limiting privacy leakage from learners based on the data that they are trained on, fewer approaches consider leakage when procedures involve a calibration dataset, which is common in uncertainty quantification methods such as Conformal Prediction (CP). Since there is a limited number of approaches in this direction, in this work we deliver a general DP approach for CP that we call Private Conformity via Quantile Search (P-COQS). The proposed approach adapts an existing randomized binary search algorithm for computing DP quantiles in the calibration phase of CP, thereby guaranteeing privacy of the consequent prediction sets. This, however, comes at the price of slight under-coverage with respect to the desired $(1 - \alpha)$-level when using finite-sample calibration sets (although broad empirical results show that P-COQS generally targets the required level in the considered cases). Confirming properties of the adapted algorithm and quantifying the approximate coverage guarantees of the consequent CP, we conduct extensive experiments to examine the effects of privacy noise, sample size and significance level on the performance of our approach compared to existing alternatives. In addition, we empirically evaluate our approach on several benchmark datasets, including CIFAR-10, ImageNet and CoronaHack. Our results suggest that the proposed method is robust to privacy noise and performs favorably with respect to the current DP alternative in terms of empirical coverage, efficiency, and informativeness. Specifically, the results indicate that P-COQS produces smaller conformal prediction sets while simultaneously targeting the desired coverage and privacy guarantees in all these experimental settings.
其他神经网络|深度学习|模型|建模(25篇)
【1】Latent Policy Steering with Embodiment-Agnostic Pretrained World Models
标题:基于具身无关预训练世界模型的潜在策略引导
链接:https://arxiv.org/abs/2507.13340
作者:, Mrinal Verghese, Jeff Schneider
摘要:通过模仿学习视觉策略已被证明在广泛的机器人领域是有效的。然而,这些策略的性能在很大程度上取决于培训演示的数量,这需要在现实世界中收集昂贵的数据。在这项工作中,我们的目标是通过利用来自广泛实施例的现有或具有成本效益的数据,例如公共机器人数据集和人类玩物体的数据集(来自玩耍的人类数据),来减少学习可视化机器人策略时的数据收集工作。我们的方法利用了两个关键见解。首先,我们使用光流作为一个不可知的动作表示训练世界模型(WM)跨多个实施例的数据集,并微调它的少量机器人数据从目标实施例。其次,我们开发了一种方法,潜在的政策导向(LPS),以提高输出的行为克隆政策,通过搜索潜在空间的WM更好的动作序列。在现实世界的实验中,我们观察到用少量数据训练的策略的性能有了显着改善(30次演示的相对改善超过50%,50次演示的相对改善超过20%),方法是将策略与WM相结合,WM在从不同机器人的现有Open X-embodiment数据集或具有成本效益的人类数据集中采样的2000集上进行预训练。
摘要:Learning visuomotor policies via imitation has proven effective across a wide range of robotic domains. However, the performance of these policies is heavily dependent on the number of training demonstrations, which requires expensive data collection in the real world. In this work, we aim to reduce data collection efforts when learning visuomotor robot policies by leveraging existing or cost-effective data from a wide range of embodiments, such as public robot datasets and the datasets of humans playing with objects (human data from play). Our approach leverages two key insights. First, we use optic flow as an embodiment-agnostic action representation to train a World Model (WM) across multi-embodiment datasets, and finetune it on a small amount of robot data from the target embodiment. Second, we develop a method, Latent Policy Steering (LPS), to improve the output of a behavior-cloned policy by searching in the latent space of the WM for better action sequences. In real world experiments, we observe significant improvements in the performance of policies trained with a small amount of data (over 50% relative improvement with 30 demonstrations and over 20% relative improvement with 50 demonstrations) by combining the policy with a WM pretrained on two thousand episodes sampled from the existing Open X-embodiment dataset across different robots or a cost-effective human dataset from play.
【2】MoTM: Towards a Foundation Model for Time Series Imputation based on Continuous Modeling
标题:MoTM:基于连续建模的时间序列插补基础模型
链接:https://arxiv.org/abs/2507.13207
作者:e Naour, Tahar Nabil, Ghislain Agoua
备注:10th Workshop on Advanced Analytics and Learning on Temporal Data (AALTD), ECML 2025
摘要:近年来,人们对时间序列基础模型越来越感兴趣,特别强调预测任务。然而,关键任务的域外填补缺失值仍然在很大程度上未得到充分探索。我们提出通过利用隐式神经表示(INR)来填补这一空白的第一步。INR将时间序列建模为连续函数,并自然处理各种缺失数据场景和采样率。虽然它们在特定的分布中表现出强劲的表现,但它们在分布变化下挣扎。为了解决这个问题,我们引入了MoTM(混合时间流模型),这是迈向时间序列插补基础模型的一步。MoTM基于新的时间序列是以前看到的模式的混合物的想法,将INR的基础(每个INR都是在不同的时间序列家族上独立训练的)与在推理时适应观察到的上下文的岭回归器相结合。我们证明了在不同的插补场景(例如,块和逐点缺失,可变采样率),为适应性基础插补模型铺平了道路。
摘要:Recent years have witnessed a growing interest in time series foundation models, with a strong emphasis on the forecasting task. Yet, the crucial task of out-of-domain imputation of missing values remains largely underexplored. We propose a first step to fill this gap by leveraging implicit neural representations (INRs). INRs model time series as continuous functions and naturally handle various missing data scenarios and sampling rates. While they have shown strong performance within specific distributions, they struggle under distribution shifts. To address this, we introduce MoTM (Mixture of Timeflow Models), a step toward a foundation model for time series imputation. Building on the idea that a new time series is a mixture of previously seen patterns, MoTM combines a basis of INRs, each trained independently on a distinct family of time series, with a ridge regressor that adapts to the observed context at inference. We demonstrate robust in-domain and out-of-domain generalization across diverse imputation scenarios (e.g., block and pointwise missingness, variable sampling rates), paving the way for adaptable foundation imputation models.
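The "basis of continuous functions plus ridge regressor" idea can be sketched in a few lines. In the sketch below, simple sinusoids stand in for the pretrained INRs (an assumption; the real basis would be learned networks), and `ridge_impute` is a hypothetical helper that fits ridge weights on the observed timestamps and evaluates the mixture at the missing ones.

```python
import numpy as np

def ridge_impute(t_obs, y_obs, t_query, basis, lam=1e-6):
    """Fit ridge weights over fixed basis functions on observed points, then evaluate at query times."""
    Phi = np.stack([b(t_obs) for b in basis], axis=1)          # (n_obs, n_basis)
    gram = Phi.T @ Phi + lam * np.eye(len(basis))
    w = np.linalg.solve(gram, Phi.T @ y_obs)                   # closed-form ridge solution
    return np.stack([b(t_query) for b in basis], axis=1) @ w

# Hypothetical stand-ins for pretrained INRs: continuous functions of time.
basis = [
    np.ones_like,
    lambda t: np.sin(2 * np.pi * t),
    lambda t: np.cos(2 * np.pi * t),
    lambda t: np.sin(4 * np.pi * t),
]

t = np.linspace(0.0, 1.0, 50)
truth = 1.0 + 2.0 * np.sin(2 * np.pi * t) - 0.5 * np.sin(4 * np.pi * t)
missing = np.zeros(t.size, dtype=bool)
missing[20:30] = True                                          # a block of missing values
imputed = ridge_impute(t[~missing], truth[~missing], t[missing], basis)
```

Because the basis functions are continuous in time, the same fitted weights can be queried at any timestamp, which is what lets this style of model handle irregular sampling and block missingness uniformly.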
【3】Differential-informed Sample Selection Accelerates Multimodal Contrastive Learning
标题:差分信息样本选择加速多模态对比学习
链接:https://arxiv.org/abs/2507.12998
作者:o, Feng Hong, Mengxi Chen, Pengyi Chen, Benyuan Liu, Jiangchao Yao, Ya Zhang, Yanfeng Wang
摘要:基于对比学习的多模态模型的显著成功在很大程度上是由在越来越大的数据集上进行训练所驱动的,这些数据集具有昂贵的计算消耗。样本选择作为一种替代的有效范式,对加快训练过程具有重要意义。然而,最近的进展,样本选择要么主要依赖于一个预言模型离线选择一个高质量的coreset,这是有限的冷启动的情况下,或专注于在线选择的基础上,实时模型预测,没有充分或有效地考虑噪声对应。为了解决这一难题,我们提出了一种新的差分信息样本选择(DISSect)方法,该方法可以准确有效地区分噪声对应以加速训练。具体来说,我们重新思考了噪声对应对对比学习的影响,并提出当前模型的预测相关性与历史模型的预测相关性之间的差异对于表征样本质量更具信息性。在此基础上,我们构建了一个鲁棒的基于差分的样本选择,并分析其理论见解。在三个基准数据集和各种下游任务上进行的大量实验表明,DISSect与当前最先进的方法相比具有一致的优越性。源代码可从以下网址获得:https://github.com/MediaBrain-SJTU/DISSect。
摘要:The remarkable success of contrastive-learning-based multimodal models has been greatly driven by training on ever-larger datasets with expensive compute consumption. Sample selection, as an alternative efficient paradigm, is an important direction for accelerating the training process. However, recent advances in sample selection either rely mostly on an oracle model to select a high-quality coreset offline, which is limited in cold-start scenarios, or focus on online selection based on real-time model predictions, which does not sufficiently or efficiently account for noisy correspondence. To address this dilemma, we propose a novel Differential-Informed Sample Selection (DISSect) method, which accurately and efficiently discriminates the noisy correspondence for training acceleration. Specifically, we rethink the impact of noisy correspondence on contrastive learning and propose that the differential between the predicted correlation of the current model and that of a historical model is more informative to characterize sample quality. Based on this, we construct a robust differential-based sample selection and analyze its theoretical insights. Extensive experiments on three benchmark datasets and various downstream tasks demonstrate the consistent superiority of DISSect over current state-of-the-art methods. Source code is available at: https://github.com/MediaBrain-SJTU/DISSect.
【4】Variance-Based Pruning for Accelerating and Compressing Trained Networks
标题:基于方差的修剪用于加速和压缩训练网络
链接:https://arxiv.org/abs/2507.12988
作者:risha, Jens Mehnert, Alexandru Paul Condurache
备注:Accepted at IEEE/CVF International Conference on Computer Vision (ICCV) 2025
摘要:训练越来越大的模型(如视觉Transformer)的成本日益高昂,这促使人们重用已训练好的大量最先进网络。然而,它们的延迟、高计算成本和内存需求给部署带来了重大挑战,特别是在资源受限的硬件上。虽然结构化修剪方法可以减少这些因素,但它们通常需要昂贵的再训练,有时多达数百个epoch,甚至从头开始训练,以恢复因结构修改而丢失的准确性。在结构化修剪之后保持已训练模型的性能,从而避免大量的再训练,仍然是一个挑战。为了解决这个问题,我们引入了基于方差的修剪,这是一种简单的结构化一次性修剪技术,可以有效地压缩网络,并且只需最少的微调。我们的方法首先收集激活统计数据,用于选择要修剪的神经元。同时,平均激活被集成回模型中,以保持高度的性能。在ImageNet-1k识别任务中,我们证明了DeiT-Base在修剪之后立即保留了超过70%的原始性能,并且只需要10个epoch的微调就可以恢复99%的原始准确率,同时将MACs减少了35%,模型大小减少了36%,从而将模型速度提高了1.44倍。
摘要:Increasingly expensive training of ever larger models such as Vision Transformers motivates reusing the vast library of already trained state-of-the-art networks. However, their latency, high computational costs and memory demands pose significant challenges for deployment, especially on resource-constrained hardware. While structured pruning methods can reduce these factors, they often require costly retraining, sometimes for up to hundreds of epochs, or even training from scratch to recover the lost accuracy resulting from the structural modifications. Maintaining the provided performance of trained models after structured pruning and thereby avoiding extensive retraining remains a challenge. To solve this, we introduce Variance-Based Pruning, a simple and structured one-shot pruning technique for efficiently compressing networks, with minimal finetuning. Our approach first gathers activation statistics, which are used to select neurons for pruning. Simultaneously, the mean activations are integrated back into the model to preserve a high degree of performance. On ImageNet-1k recognition tasks, we demonstrate that directly after pruning DeiT-Base retains over 70% of its original performance and requires only 10 epochs of fine-tuning to regain 99% of the original accuracy while simultaneously reducing MACs by 35% and model size by 36%, thus speeding up the model by 1.44x.
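The two steps described in this abstract (select neurons from activation statistics, then fold their mean activations back into the model) can be sketched on a toy two-layer ReLU MLP. This is an illustrative reading, not the authors' implementation: the helper `variance_prune`, the lowest-variance selection rule, and the toy network are assumptions. Removing a hidden unit and adding its mean activation times its outgoing weights to the next layer's bias keeps the output unchanged on average.

```python
import numpy as np

def variance_prune(W1, b1, W2, b2, X, keep):
    """Drop the lowest-variance hidden units of a 2-layer ReLU MLP (keep >= 1)
    and fold their mean activation into the next layer's bias."""
    H = np.maximum(X @ W1 + b1, 0.0)             # hidden activations on calibration data
    order = np.argsort(H.var(axis=0))            # units sorted by ascending variance
    pruned, kept = order[:-keep], np.sort(order[-keep:])
    b2_new = b2 + H.mean(axis=0)[pruned] @ W2[pruned]
    return W1[:, kept], b1[kept], W2[kept], b2_new

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))
W1, b1 = rng.standard_normal((4, 6)), np.zeros(6)
W1[:, :2], b1[:2] = 0.0, 1.5                     # two units with constant activation
W2, b2 = rng.standard_normal((6, 3)), np.zeros(3)

dense_out = np.maximum(X @ W1 + b1, 0.0) @ W2 + b2
W1p, b1p, W2p, b2p = variance_prune(W1, b1, W2, b2, X, keep=4)
pruned_out = np.maximum(X @ W1p + b1p, 0.0) @ W2p + b2p
```

In this toy case the two pruned units are exactly constant, so the pruned network reproduces the dense output; with real activations the replacement is only exact in the mean, and the error grows with the variance of the removed units, which motivates pruning low-variance units first.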
【5】WaveletInception Networks for Drive-by Vibration-Based Infrastructure Health Monitoring
标题:WaveletInception网络用于基于行车振动的基础设施健康监测
链接:https://arxiv.org/abs/2507.12969
作者:i Samani, Alfredo Nunez, Bart De Schutter
摘要:本文提出了一种新的基于深度学习的框架,用于使用驱动振动响应信号进行基础设施健康监测。认识到光谱和时间信息的重要性,我们引入了WaveletInception-BiLSTM网络。WaveletInception特征提取器利用可学习小波包变换(LWPT)作为提取振动信号特征的主干,在早期网络层中结合频谱信息。其次是1D Inception网络,它可以在更深的层次上提取多尺度的高级特征。然后,通过长短期记忆(LSTM)层将提取的振动信号特征与操作条件集成。由此产生的特征提取网络有效地分析了各种测量速度下的驱动振动信号,而无需预处理,并使用LSTM捕获不同信息模式之间相互关联的时间依赖性,并创建用于健康状况估计的特征向量。估计器头设计有使用双向LSTM(BiLSTM)网络的顺序建模架构,从驾驶测量中捕获双向时间关系。该架构允许对基础设施健康状况进行高分辨率的波束级评估。一个案例研究的重点是铁路轨道刚度估计与模拟驾驶振动信号表明,该模型显着优于国家的最先进的方法,估计铁路道碴和轨垫刚度参数。结果强调了这种方法在准确、本地化和全自动驾驶基础设施健康监测方面的潜力。
摘要:This paper presents a novel deep learning-based framework for infrastructure health monitoring using drive-by vibration response signals. Recognizing the importance of spectral and temporal information, we introduce the WaveletInception-BiLSTM network. The WaveletInception feature extractor utilizes a Learnable Wavelet Packet Transform (LWPT) as the stem for extracting vibration signal features, incorporating spectral information in the early network layers. This is followed by 1D Inception networks that extract multi-scale, high-level features at deeper layers. The extracted vibration signal features are then integrated with operational conditions via a Long Short-term Memory (LSTM) layer. The resulting feature extraction network effectively analyzes drive-by vibration signals across various measurement speeds without preprocessing and uses LSTM to capture interrelated temporal dependencies among different modes of information and to create feature vectors for health condition estimation. The estimator head is designed with a sequential modeling architecture using bidirectional LSTM (BiLSTM) networks, capturing bi-directional temporal relationships from drive-by measurements. This architecture allows for a high-resolution, beam-level assessment of infrastructure health conditions. A case study focusing on railway track stiffness estimation with simulated drive-by vibration signals shows that the model significantly outperforms state-of-the-art methods in estimating railway ballast and railpad stiffness parameters. Results underscore the potential of this approach for accurate, localized, and fully automated drive-by infrastructure health monitoring.
【6】DMQ: Dissecting Outliers of Diffusion Models for Post-Training Quantization
标题:DMQ:剖析扩散模型的离群值以进行训练后量化
链接:https://arxiv.org/abs/2507.12933
作者:Lee, Jiwan Hur, Hyounguk Shon, Jae Young Lee, Junmo Kim
备注:Accepted by ICCV 2025
摘要:扩散模型在图像生成方面取得了显着的成功,但具有显着的计算成本,在资源受限的环境中部署带来了挑战。最近的训练后量化(PTQ)方法试图通过关注扩散模型的迭代性质来缓解这个问题。然而,这些方法往往忽略了离群值,导致低位宽的性能下降。在本文中,我们提出了一个DMQ,它结合了学习等效缩放(LES)和通道的2的幂缩放(PTS),以有效地解决这些挑战。学习等效缩放优化通道缩放因子,以在权重和激活之间重新分配量化难度,从而降低总体量化误差。认识到早期的去噪步骤,尽管有很小的量化误差,但由于误差积累,对最终输出产生了至关重要的影响,我们采用了自适应时间步加权方案,以在学习过程中优先考虑这些关键步骤。此外,识别层,如跳过连接表现出高通道间的差异,我们引入通道的2的幂缩放激活。为了确保即使在小的校准集的PTS因素的鲁棒选择,我们引入了一个投票算法,提高了可靠性。大量的实验表明,我们的方法显着优于现有的作品,特别是在低位宽,如W4A6(4位权重,6位激活)和W4A8,保持高图像生成质量和模型稳定性。该代码可在https://github.com/LeeDongYeun/dmq上获得。
摘要:Diffusion models have achieved remarkable success in image generation but come with significant computational costs, posing challenges for deployment in resource-constrained environments. Recent post-training quantization (PTQ) methods have attempted to mitigate this issue by focusing on the iterative nature of diffusion models. However, these approaches often overlook outliers, leading to degraded performance at low bit-widths. In this paper, we propose DMQ, which combines Learned Equivalent Scaling (LES) and channel-wise Power-of-Two Scaling (PTS) to effectively address these challenges. Learned Equivalent Scaling optimizes channel-wise scaling factors to redistribute quantization difficulty between weights and activations, reducing overall quantization error. Recognizing that early denoising steps, despite having small quantization errors, crucially impact the final output due to error accumulation, we incorporate an adaptive timestep weighting scheme to prioritize these critical steps during learning. Furthermore, identifying that layers such as skip connections exhibit high inter-channel variance, we introduce channel-wise Power-of-Two Scaling for activations. To ensure robust selection of PTS factors even with a small calibration set, we introduce a voting algorithm that enhances reliability. Extensive experiments demonstrate that our method significantly outperforms existing works, especially at low bit-widths such as W4A6 (4-bit weight, 6-bit activation) and W4A8, maintaining high image generation quality and model stability. The code is available at https://github.com/LeeDongYeun/dmq.
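The identity behind equivalent scaling is that dividing each activation channel by a factor and multiplying the matching weight row by the same factor leaves the layer output unchanged, while shifting dynamic range from activations to weights. The sketch below shows that identity and a per-channel power-of-two scale. It is not the DMQ method: the scaling factors here come from a fixed square-root heuristic rather than being learned, and the function names are hypothetical.

```python
import numpy as np

def equivalent_scaling(X, W, s):
    """Rescale input channels so that (X / s) @ (s[:, None] * W) equals X @ W."""
    return X / s, W * s[:, None]

def power_of_two_scales(X):
    """Per-channel activation scale rounded to the nearest power of two."""
    return 2.0 ** np.round(np.log2(np.abs(X).max(axis=0)))

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 8))         # activations: (samples, channels)
X[:, 3] *= 50.0                          # one outlier channel
W = rng.standard_normal((8, 4))          # weights: (channels, outputs)

# Heuristic stand-in for learned factors (assumption): square root of channel max.
s = np.sqrt(np.abs(X).max(axis=0))
Xs, Ws = equivalent_scaling(X, W, s)
pts = power_of_two_scales(X)
```

After scaling, the outlier channel no longer dominates the activation range, so a low-bit activation quantizer wastes fewer levels on it; power-of-two factors are attractive because they can be applied with bit shifts rather than multiplications.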
【7】Generalist Bimanual Manipulation via Foundation Video Diffusion Models
标题:通过基础视频扩散模型实现通用双臂操作
链接:https://arxiv.org/abs/2507.12898
作者: Hengkai Tan, Xinyi Mao, Guodong Liu, Shuhe Huang, Chendong Xiang, Hang Su, Jun Zhu
摘要:双手机器人操作涉及两个机器人手臂的协调控制,是解决具有挑战性的任务的基础。尽管最近在通用操作方面取得了进展,但数据稀缺和体现异质性仍然是进一步扩大双手设置的严重障碍。在本文中,我们介绍了VIDeo Diffusion for Action Reasoning(VIDAR),这是一个两阶段框架,利用大规模的基于扩散的视频预训练和一种新的用于动作预测的掩蔽逆动力学模型。我们在来自三个真实世界双手机器人平台的750K多视图视频上预训练视频扩散模型,利用统一的观察空间对机器人,相机,任务和场景上下文进行编码。我们的掩码逆动力学模型学习掩码,从生成的轨迹中提取动作相关信息,而不需要像素级标签,并且掩码可以有效地推广到看不见的背景。我们的实验表明,在看不见的机器人平台上只有20分钟的人类演示(只有典型数据需求的1%),VIDAR可以通过强大的语义理解推广到看不见的任务和背景,超越了最先进的方法。我们的研究结果突出了视频基础模型的潜力,再加上掩蔽的动作预测,使可扩展和可推广的机器人操作在不同的现实世界的设置。
摘要:Bimanual robotic manipulation, which involves the coordinated control of two robotic arms, is foundational for solving challenging tasks. Despite recent progress in general-purpose manipulation, data scarcity and embodiment heterogeneity remain serious obstacles to further scaling up in bimanual settings. In this paper, we introduce VIdeo Diffusion for Action Reasoning (VIDAR), a two-stage framework that leverages large-scale, diffusion-based video pre-training and a novel masked inverse dynamics model for action prediction. We pre-train the video diffusion model on 750K multi-view videos from three real-world bimanual robot platforms, utilizing a unified observation space that encodes robot, camera, task, and scene contexts. Our masked inverse dynamics model learns masks to extract action-relevant information from generated trajectories without requiring pixel-level labels, and the masks can effectively generalize to unseen backgrounds. Our experiments demonstrate that with only 20 minutes of human demonstrations on an unseen robot platform (only 1% of typical data requirements), VIDAR generalizes to unseen tasks and backgrounds with strong semantic understanding, surpassing state-of-the-art methods. Our findings highlight the potential of video foundation models, coupled with masked action prediction, to enable scalable and generalizable robotic manipulation in diverse real-world settings.
【8】Topology-Aware Activation Functions in Neural Networks
标题:神经网络中的拓扑感知激活函数
链接:https://arxiv.org/abs/2507.12874
作者:pov, Oleg R. Musin
备注:Accepted to ESANN 2025. Published in the ESANN 2025 proceedings
摘要:这项研究探索了新的激活函数,增强了神经网络在训练过程中操纵数据拓扑的能力。基于传统激活函数(如$\mathrm{ReLU}$)的局限性,我们提出了$\mathrm{SmoothSplit}$和$\mathrm{ParametricSplit}$,它们引入了拓扑“切割”功能。这些功能使网络能够有效地转换复杂的数据流形,提高低维层场景的性能。通过对合成和真实世界数据集的实验,我们证明$\mathrm{ParametricSplit}$在低维设置中的性能优于传统激活,同时在高维设置中保持竞争性性能。我们的研究结果突出了拓扑感知激活功能在推进神经网络架构中的潜力。该代码可通过https://github.com/Snopoff/Topology-Aware-Activations获得。
摘要:This study explores novel activation functions that enhance the ability of neural networks to manipulate data topology during training. Building on the limitations of traditional activation functions like $\mathrm{ReLU}$, we propose $\mathrm{SmoothSplit}$ and $\mathrm{ParametricSplit}$, which introduce topology "cutting" capabilities. These functions enable networks to transform complex data manifolds effectively, improving performance in scenarios with low-dimensional layers. Through experiments on synthetic and real-world datasets, we demonstrate that $\mathrm{ParametricSplit}$ outperforms traditional activations in low-dimensional settings while maintaining competitive performance in higher-dimensional ones. Our findings highlight the potential of topology-aware activation functions in advancing neural network architectures. The code is available via https://github.com/Snopoff/Topology-Aware-Activations.
【9】RONOM: Reduced-Order Neural Operator Modeling
标题:RONOM:降阶神经运算符建模
链接:https://arxiv.org/abs/2507.12814
作者:er, Dongwei Ye, Christoph Brune
摘要:时变偏微分方程在基于物理的建模中无处不在,但在许多查询场景中,如实时预测,最优控制和不确定性量化,它们仍然是计算密集型的。降阶建模(ROM)通过构建低维代理模型来解决这些挑战,但依赖于固定的离散化,这限制了评估期间跨不同网格的灵活性。算子学习方法,如神经算子,通过参数化无限维函数空间之间的映射提供了一种替代方案,使其能够适应不同分辨率的数据。ROM提供了严格的数值误差估计,而神经算子学习主要集中在离散化收敛和不变性上,而没有量化无限维和离散化算子之间的误差。这项工作介绍了降阶神经算子建模(RONOM)框架,它连接了ROM和算子学习的概念。我们建立了一个类似于ROM中的离散误差界,并深入了解RONOM的离散收敛性和离散鲁棒性。此外,两个数值例子,比较RONOM现有的神经运营商解决偏微分方程。结果表明,使用标准向量到向量神经网络的RONOM在输入泛化方面具有相当的性能,在空间超分辨率和离散化鲁棒性方面具有优异的性能,同时还提供了对时间超分辨率场景的新见解。
摘要:Time-dependent partial differential equations are ubiquitous in physics-based modeling, but they remain computationally intensive in many-query scenarios, such as real-time forecasting, optimal control, and uncertainty quantification. Reduced-order modeling (ROM) addresses these challenges by constructing a low-dimensional surrogate model but relies on a fixed discretization, which limits flexibility across varying meshes during evaluation. Operator learning approaches, such as neural operators, offer an alternative by parameterizing mappings between infinite-dimensional function spaces, enabling adaptation to data across different resolutions. Whereas ROM provides rigorous numerical error estimates, neural operator learning largely focuses on discretization convergence and invariance without quantifying the error between the infinite-dimensional and the discretized operators. This work introduces the reduced-order neural operator modeling (RONOM) framework, which bridges concepts from ROM and operator learning. We establish a discretization error bound analogous to those in ROM, and get insights into RONOM's discretization convergence and discretization robustness. Moreover, two numerical examples are presented that compare RONOM to existing neural operators for solving partial differential equations. The results demonstrate that RONOM using standard vector-to-vector neural networks achieves comparable performance in input generalization and superior performance in both spatial super-resolution and discretization robustness, while also offering novel insights into temporal super-resolution scenarios.
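For readers unfamiliar with the ROM side of this abstract, the sketch below shows classical proper orthogonal decomposition (POD): snapshots of a solution on a fixed grid are compressed onto a few SVD modes. This illustrates standard ROM only, not RONOM; the synthetic solution, grid sizes, and the helper name `pod_basis` are assumptions.

```python
import numpy as np

def pod_basis(snapshots, r):
    """Leading r left singular vectors of a (n_dof, n_snapshots) snapshot matrix."""
    U, S, _ = np.linalg.svd(snapshots, full_matrices=False)
    return U[:, :r], S

x = np.linspace(0.0, 1.0, 200)           # fixed spatial grid
times = np.linspace(0.0, 1.0, 50)
# Synthetic "solution" built from two spatial modes (assumed for illustration).
snapshots = np.stack(
    [np.exp(-t) * np.sin(np.pi * x) + 0.3 * np.cos(3 * t) * np.sin(2 * np.pi * x) for t in times],
    axis=1,
)

basis, singular_values = pod_basis(snapshots, r=2)
coeffs = basis.T @ snapshots             # reduced coordinates: (2, n_snapshots)
reconstruction = basis @ coeffs
```

Note that each basis vector is tied to the 200-point grid it was computed on, which is the fixed-discretization limitation the abstract refers to: evaluating on a different mesh requires interpolating or recomputing the basis.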
【10】PMKLC: Parallel Multi-Knowledge Learning-based Lossless Compression for Large-Scale Genomics Database
标题:PMKLC:基于并行多知识学习的大规模基因组数据库无损压缩
链接:https://arxiv.org/abs/2507.12805
作者:Yanfeng Ding, Liping Yi, Huidong Ma, Gang Wang, Xiaoguang Liu, Cheng Zhong, Wentong Cai
备注:Accepted via KDD-25
摘要:基于学习的无损压缩器在大规模基因组数据库的备份、存储、传输和管理中起着至关重要的作用。然而,它们的1)压缩比不足,2)压缩和解压缩吞吐量低,以及3)压缩鲁棒性差,限制了它们在工业界和学术界的广泛采用和应用。为了解决这些问题,本文提出了一种新的基于并行多知识学习的压缩器(Parallel Multi-Knowledge Learning-based Compressor,PMKLC),该压缩器主要有四个关键设计:1)提出了一种基于多知识学习的自动化压缩框架作为压缩器的主干,以提高压缩比和鲁棒性;2)设计了一个GPU加速的($s$,$k$)-mer编码器来优化压缩吞吐量和计算资源的使用;3)引入数据块划分和逐步模型传递(SMP)机制来实现并行加速;4)针对复杂的应用场景,设计了PMKLC-S和PMKLC-M两种压缩模式,前者在资源受限的单GPU上运行,后者在多GPU上加速。我们在15个具有不同物种和数据大小的真实世界数据集上对PMKLC-S/M和14个基线(7个传统基线和7个基于学习的基线)进行基准测试。在测试数据集上,PMKLC-S/M与基线相比,平均压缩比分别最多提高了73.609%和73.480%,平均吞吐量分别最多提高了3.036$\times$和10.710$\times$。此外,PMKLC-S/M还实现了最佳的鲁棒性和有竞争力的内存成本,表明其对具有不同概率分布扰动的数据集具有更高的稳定性,并且在内存受限设备上运行的能力很强。
摘要:Learning-based lossless compressors play a crucial role in large-scale genomic database backup, storage, transmission, and management. However, their 1) inadequate compression ratio, 2) low compression and decompression throughput, and 3) poor compression robustness limit their widespread adoption and application in both industry and academia. To solve those challenges, we propose a novel Parallel Multi-Knowledge Learning-based Compressor (PMKLC) with four crucial designs: 1) we propose an automated multi-knowledge learning-based compression framework as the compressor's backbone to enhance compression ratio and robustness; 2) we design a GPU-accelerated ($s$,$k$)-mer encoder to optimize compression throughput and computing resource usage; 3) we introduce data block partitioning and Step-wise Model Passing (SMP) mechanisms for parallel acceleration; 4) we design two compression modes, PMKLC-S and PMKLC-M, to meet complex application scenarios, where the former runs on a resource-constrained single GPU and the latter is multi-GPU accelerated. We benchmark PMKLC-S/M and 14 baselines (7 traditional and 7 learning-based) on 15 real-world datasets with different species and data sizes. Compared to baselines on the testing datasets, PMKLC-S/M achieve average compression ratio improvements of up to 73.609% and 73.480% and average throughput improvements of up to 3.036$\times$ and 10.710$\times$, respectively. In addition, PMKLC-S/M achieve the best robustness and competitive memory cost, indicating greater stability against datasets with different probability distribution perturbations and a strong ability to run on memory-constrained devices.
【11】Layer Separation Deep Learning Model with Auxiliary Variables for Partial Differential Equations
标题:用于偏微分方程的带辅助变量的层分离深度学习模型
链接:https://arxiv.org/abs/2507.12766
作者: Yiqi Gu
摘要:在本文中,我们提出了一种新的优化框架,即层分离(LySep)模型,以改进基于深度学习的偏微分方程求解方法。由于深度学习中损失函数的高度非凸性,现有的优化算法往往会收敛到次优的局部极小值,或者遭受梯度爆炸或消失,导致性能不佳。为了解决这些问题,我们引入辅助变量来分离深度神经网络的层。具体来说,每一层的输出及其衍生物由辅助变量表示,有效地将深层架构分解为一系列浅层架构。建立了新的带有辅助变量的损失函数,其中只有相邻两层的变量是耦合的。相应的算法交替方向的基础上开发,其中许多变量可以更新最佳的封闭形式。此外,我们提供了理论分析,证明了LySep模型和原始深度模型之间的一致性。高维数值结果验证了我们的理论,并展示了LySep在最小化损失和减少解的误差的优势。
摘要:In this paper, we propose a new optimization framework, the layer separation (LySep) model, to improve the deep learning-based methods in solving partial differential equations. Due to the highly non-convex nature of the loss function in deep learning, existing optimization algorithms often converge to suboptimal local minima or suffer from gradient explosion or vanishing, resulting in poor performance. To address these issues, we introduce auxiliary variables to separate the layers of deep neural networks. Specifically, the output and its derivatives of each layer are represented by auxiliary variables, effectively decomposing the deep architecture into a series of shallow architectures. New loss functions with auxiliary variables are established, in which only variables from two neighboring layers are coupled. Corresponding algorithms based on alternating directions are developed, where many variables can be updated optimally in closed forms. Moreover, we provide theoretical analyses demonstrating the consistency between the LySep model and the original deep model. High-dimensional numerical results validate our theory and demonstrate the advantages of LySep in minimizing loss and reducing solution error.
【12】Domain-Enhanced Dual-Branch Model for Efficient and Interpretable Accident Anticipation
标题:用于高效且可解释的事故预测的域增强双分支模型
链接:https://arxiv.org/abs/2507.12755
作者:uan, Haicheng Liao, Chengyue Wang, Bonan Wang, Jiaxun Zhang, Jia Hu, Zhenning Li
摘要:开发精确且计算效率高的交通事故预测系统对于现代自动驾驶技术至关重要,可以及时干预和预防损失。在本文中,我们提出了一个事故预测框架,采用双分支架构,有效地将行车记录仪(dashcam)视频中的视觉信息与来自事故报告的结构化文本数据相结合。此外,我们引入了一种特征聚合方法,通过大型模型(GPT-4o,Long-CLIP)促进多模态输入的无缝集成,并辅以有针对性的提示工程策略,以产生可操作的反馈和标准化的事故档案。对基准数据集(DAD,CCD和A3D)进行的综合评估验证了我们的方法具有卓越的预测准确性,增强的响应能力,降低的计算开销和提高的可解释性,从而为交通事故预测的最先进性能建立了一个新的基准。
摘要:Developing precise and computationally efficient traffic accident anticipation systems is crucial for contemporary autonomous driving technologies, enabling timely intervention and loss prevention. In this paper, we propose an accident anticipation framework employing a dual-branch architecture that effectively integrates visual information from dashcam videos with structured textual data derived from accident reports. Furthermore, we introduce a feature aggregation method that facilitates seamless integration of multimodal inputs through large models (GPT-4o, Long-CLIP), complemented by targeted prompt engineering strategies to produce actionable feedback and standardized accident archives. Comprehensive evaluations conducted on benchmark datasets (DAD, CCD, and A3D) validate the superior predictive accuracy, enhanced responsiveness, reduced computational overhead, and improved interpretability of our approach, thus establishing a new benchmark for state-of-the-art performance in traffic accident anticipation.
【13】Multimodal-Guided Dynamic Dataset Pruning for Robust and Efficient Data-Centric Learning
标题:多模态引导的动态数据集修剪,用于鲁棒和高效的数据中心学习
链接:https://arxiv.org/abs/2507.12750
作者:ang, Peijia Li, Yujie Liu, Zhiming Xu, Peng Ye, Wanli Ouyang, Furao Shen, Dongzhan Zhou
摘要:现代深度模型是在大型真实世界数据集上训练的,其中数据质量各不相同,冗余是常见的。以数据为中心的方法,如数据集修剪,在提高训练效率和模型性能方面表现出了希望。然而,大多数现有的方法依赖于静态统计或特定于任务的指标,限制了它们的鲁棒性和跨域的通用性。在这项工作中,我们引入了一个动态的数据集修剪框架,自适应地选择训练样本的基础上,任务驱动的难度和跨模态语义一致性。通过结合来自预训练的多模态基础模型的监督,我们的方法捕获了训练动态,同时有效地过滤掉了无信息的样本。我们的工作突出了集成跨模态对齐以实现强大样本选择的潜力,推动以数据为中心的学习在应用领域中实现更高效和更强大的实践。
摘要:Modern deep models are trained on large real-world datasets, where data quality varies and redundancy is common. Data-centric approaches such as dataset pruning have shown promise in improving training efficiency and model performance. However, most existing methods rely on static heuristics or task-specific metrics, limiting their robustness and generalizability across domains. In this work, we introduce a dynamic dataset pruning framework that adaptively selects training samples based on both task-driven difficulty and cross-modality semantic consistency. By incorporating supervision from pretrained multimodal foundation models, our approach captures training dynamics while effectively filtering out uninformative samples. Our work highlights the potential of integrating cross-modality alignment for robust sample selection, advancing data-centric learning toward more efficient and robust practices across application domains.
【14】From SGD to Spectra: A Theory of Neural Network Weight Dynamics
标题:从SGD到光谱:神经网络权重动力学理论
链接:https://arxiv.org/abs/2507.12709
作者:hard Olsen, Sam Fatehmanesh, Frank Xiao, Adarsh Kumarappan, Anirudh Gajula
摘要:深度神经网络已经彻底改变了机器学习,但它们的训练动力学在理论上仍然不清楚。我们开发了一个连续时间、矩阵值随机微分方程(SDE)框架,该框架将SGD的微观动力学与权重矩阵中奇异值谱的宏观演化严格联系起来。我们推导出精确的SDE,表明平方奇异值遵循具有特征值排斥的戴森布朗运动,并将平稳分布描述为具有幂律尾部的伽马型密度,为训练网络中经验观察到的“主体+尾部”(bulk+tail)谱结构提供了第一个理论解释。通过对Transformer和MLP架构的受控实验,我们验证了我们的理论预测,并证明了基于SDE的预测与观察到的谱演变之间的定量一致性,为理解深度学习为何有效提供了严格的基础。
摘要:Deep neural networks have revolutionized machine learning, yet their training dynamics remain theoretically unclear. We develop a continuous-time, matrix-valued stochastic differential equation (SDE) framework that rigorously connects the microscopic dynamics of SGD to the macroscopic evolution of singular-value spectra in weight matrices. We derive exact SDEs showing that squared singular values follow Dyson Brownian motion with eigenvalue repulsion, and characterize stationary distributions as gamma-type densities with power-law tails, providing the first theoretical explanation for the empirically observed 'bulk+tail' spectral structure in trained networks. Through controlled experiments on transformer and MLP architectures, we validate our theoretical predictions and demonstrate quantitative agreement between SDE-based forecasts and observed spectral evolution, providing a rigorous foundation for understanding why deep learning works.
【15】PinFM: Foundation Model for User Activity Sequences at a Billion-scale Visual Discovery Platform
标题:PinFM:亿级视觉发现平台用户活动序列的基础模型
链接:https://arxiv.org/abs/2507.12704
作者:hen, Kousik Rajesh, Matthew Lawhon, Zelun Wang, Hanyu Li, Haomiao Li, Saurabh Vishwas Joshi, Pong Eksombatchai, Jaewon Yang, Yi-Ping Hsu, Jiajing Xu, Charles Rosenberg
备注:RecSys 2025
摘要:用户活动序列已成为推荐系统中最重要的信号之一。我们提出了一个基础模型,PinFM,用于在十亿级的视觉发现平台上了解多个应用程序的用户活动序列。我们使用大量用户活动数据预训练具有20 B+参数的Transformer模型,然后针对特定应用对其进行微调,从而有效地将其与现有模型耦合。虽然这种预训练和微调方法在其他领域(如视觉和NLP)中很受欢迎,但其在工业推荐系统中的应用面临着许多挑战。基础模型必须具有足够的可扩展性,以每秒对数百万个项目进行评分,同时满足这些系统所施加的严格成本和延迟限制。此外,它应该捕获用户活动和其他功能之间的交互,并处理在预训练阶段不存在的新项目。 我们开发了创新技术来应对这些挑战。我们的基础设施和算法优化,例如去重交叉注意力Transformer(DCAT),将Pinterest内部数据的吞吐量提高了600%。我们证明了PinFM可以通过改变输入序列来学习用户序列和候选项之间的交互,从而使新项的参与增加20%。PinFM现已部署,以帮助改善超过5亿用户在各种应用程序中的体验。
摘要:User activity sequences have emerged as one of the most important signals in recommender systems. We present a foundational model, PinFM, for understanding user activity sequences across multiple applications at a billion-scale visual discovery platform. We pretrain a transformer model with 20B+ parameters using extensive user activity data, then fine-tune it for specific applications, efficiently coupling it with existing models. While this pretraining-and-fine-tuning approach has been popular in other domains, such as Vision and NLP, its application in industrial recommender systems presents numerous challenges. The foundational model must be scalable enough to score millions of items every second while meeting tight cost and latency constraints imposed by these systems. Additionally, it should capture the interactions between user activities and other features and handle new items that were not present during the pretraining stage. We developed innovative techniques to address these challenges. Our infrastructure and algorithmic optimizations, such as the Deduplicated Cross-Attention Transformer (DCAT), improved our throughput by 600% on Pinterest internal data. We demonstrate that PinFM can learn interactions between user sequences and candidate items by altering input sequences, leading to a 20% increase in engagement with new items. PinFM is now deployed to help improve the experience of more than a half billion users across various applications.
【16】Learning What Matters: Probabilistic Task Selection via Mutual Information for Model Finetuning
标题:学习重要的事情:通过互信息进行模型微调的概率任务选择
链接:https://arxiv.org/abs/2507.12612
作者:handa, Saral Sureka, Parth Pratim Chatterjee, Krishnateja Killamsetty, Nikhil Shivakumar Nayak, Ganesh Ramakrishnan
备注:9, 8 tables, 7 figures
摘要:微调的大型语言模型(LLM)的性能关键取决于训练混合物的组成。然而,选择任务数据集的最佳混合仍然是一个主要手动的,启发式驱动的过程,从业者往往依赖于统一的或基于大小的采样策略。我们介绍了TASKPGM,一个原则性和可扩展的混合优化框架,通过最小化马尔可夫随机场(MRF)的能量函数来选择连续的任务比例。任务关系的建模使用行为的分歧,如詹森香农分歧和逐点互信息计算从单任务微调模型的预测分布。我们的方法在单纯形约束下产生了一个封闭形式的解决方案,并可证明平衡了任务之间的代表性和多样性。我们提供了理论保证,包括弱子模块化预算的变种,并证明了一致的经验改进Llama 2和Mistral跨评估套件,如MMLU和BIGBench。除了性能之外,TASKPGM还提供了对任务影响和混合物组成的可解释的见解,使其成为高效和强大的LLM微调的强大工具。
摘要:The performance of finetuned large language models (LLMs) hinges critically on the composition of the training mixture. However, selecting an optimal blend of task datasets remains a largely manual, heuristic driven process, with practitioners often relying on uniform or size based sampling strategies. We introduce TASKPGM, a principled and scalable framework for mixture optimization that selects continuous task proportions by minimizing an energy function over a Markov Random Field (MRF). Task relationships are modeled using behavioral divergences such as Jensen Shannon Divergence and Pointwise Mutual Information computed from the predictive distributions of single task finetuned models. Our method yields a closed form solution under simplex constraints and provably balances representativeness and diversity among tasks. We provide theoretical guarantees, including weak submodularity for budgeted variants, and demonstrate consistent empirical improvements on Llama 2 and Mistral across evaluation suites such as MMLU and BIGBench. Beyond performance, TASKPGM offers interpretable insights into task influence and mixture composition, making it a powerful tool for efficient and robust LLM finetuning.
【17】Rel-HNN: Split Parallel Hypergraph Neural Network for Learning on Relational Databases
标题:Rel-HNN:用于关系数据库学习的分裂并行超图神经网络
链接:https://arxiv.org/abs/2507.12562
作者:r Alam, Md. Ahasanul Alam, Md Mahmudur Rahman, Md. Mosaddek Khan
摘要:关系数据库(RDB)在企业和现实世界的应用中无处不在。使数据库扁平化对深度学习模型提出了挑战,这些模型依赖固定大小的输入表示来从关系数据的结构化本质中捕获关系语义。图神经网络(GNN)已经被提出来解决这个问题,但它们经常通过将所有元组建模为单体节点并忽略元组内关联来过度简化关系结构。在这项工作中,我们提出了一种新的基于超图的框架,我们称之为rel-HNN,它将每个唯一的属性-值对建模为节点,将每个元组建模为超边,从而能够捕获细粒度的元组内关系。我们的方法学习跨属性值,元组和表级别的显式多级表示。为了解决大型RDB带来的可扩展性挑战,我们进一步引入了一种分裂并行训练算法,该算法利用多GPU执行来实现高效的超图学习。在真实世界和基准数据集上进行的大量实验表明,rel-HNN在分类和回归任务方面都显着优于现有方法。此外,与传统的单GPU执行相比,我们的分裂并行训练实现了大幅加速-关系数据学习高达3.18倍,超图学习高达2.94倍。
摘要:Relational databases (RDBs) are ubiquitous in enterprise and real-world applications. Flattening the database poses challenges for deep learning models that rely on fixed-size input representations to capture relational semantics from the structured nature of relational data. Graph neural networks (GNNs) have been proposed to address this, but they often oversimplify relational structures by modeling all the tuples as monolithic nodes and ignoring intra-tuple associations. In this work, we propose a novel hypergraph-based framework, that we call rel-HNN, which models each unique attribute-value pair as a node and each tuple as a hyperedge, enabling the capture of fine-grained intra-tuple relationships. Our approach learns explicit multi-level representations across attribute-value, tuple, and table levels. To address the scalability challenges posed by large RDBs, we further introduce a split-parallel training algorithm that leverages multi-GPU execution for efficient hypergraph learning. Extensive experiments on real-world and benchmark datasets demonstrate that rel-HNN significantly outperforms existing methods in both classification and regression tasks. Moreover, our split-parallel training achieves substantial speedups -- up to 3.18x for learning on relational data and up to 2.94x for hypergraph learning -- compared to conventional single-GPU execution.
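The modeling step described in this abstract (each unique attribute-value pair becomes a node, each tuple becomes a hyperedge) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `build_hypergraph`, the toy rows, and the choice to skip missing values are all assumptions made for illustration only.

```python
def build_hypergraph(rows):
    """Build a toy hypergraph from a list of dict-shaped tuples.

    Hypothetical helper: every unique (attribute, value) pair gets one
    node id, and every row becomes one hyperedge listing its node ids.
    """
    node_ids = {}    # (attribute, value) -> integer node id
    hyperedges = []  # one list of node ids per row (tuple)
    for row in rows:
        edge = []
        for attribute, value in row.items():
            if value is None:
                continue  # assumption: a missing value contributes no node
            key = (attribute, value)
            if key not in node_ids:
                node_ids[key] = len(node_ids)  # assign the next free id
            edge.append(node_ids[key])
        hyperedges.append(edge)
    return node_ids, hyperedges


# Toy table, invented for this example.
rows = [
    {"city": "Dhaka", "plan": "basic"},
    {"city": "Dhaka", "plan": "pro"},
    {"city": "Khulna", "plan": "basic"},
]
node_ids, hyperedges = build_hypergraph(rows)
```

Rows that share a value (here the two "Dhaka" rows) share a node, so their hyperedges overlap. That overlap is what lets a hypergraph model relate tuples through common attribute values rather than treating each tuple as a monolithic node.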
【18】FOUNDER: Grounding Foundation Models in World Models for Open-Ended Embodied Decision Making
标题:FOUNDER:将基础模型接地于世界模型,以实现开放式具身决策
链接:https://arxiv.org/abs/2507.12496
作者:g, Rui Yu, Shenghua Wan, Le Gan, De-Chuan Zhan
备注:Accepted by Forty-Second International Conference on Machine Learning (ICML 2025)
摘要:基础模型(FM)和世界模型(WM)在不同层次上提供了任务泛化的互补优势。在这项工作中,我们提出了FOUNDER,这是一个将FM中蕴含的可泛化知识与WM的动态建模能力相结合的框架,能够以无奖励的方式在具身环境中解决开放式任务。我们学习了一个映射函数,将FM表示接地到WM状态空间中,有效地从外部观察推断世界模拟器中智能体的物理状态。这种映射使得在行为学习过程中能够通过想象来学习目标条件策略,映射后的任务作为目标状态。我们的方法利用到目标状态的预测时间距离作为信息丰富的奖励信号。FOUNDER在各种多任务离线视觉控制基准测试中表现出卓越的性能,擅长捕捉文本或视频指定的任务的深层语义,特别是在涉及复杂观察或领域差距、现有方法难以应对的场景中。我们学习到的奖励函数与真实奖励的一致性也得到了经验验证。我们的项目网站是https://sites.google.com/view/founder-rl。
摘要:Foundation Models (FMs) and World Models (WMs) offer complementary strengths in task generalization at different levels. In this work, we propose FOUNDER, a framework that integrates the generalizable knowledge embedded in FMs with the dynamic modeling capabilities of WMs to enable open-ended task solving in embodied environments in a reward-free manner. We learn a mapping function that grounds FM representations in the WM state space, effectively inferring the agent's physical states in the world simulator from external observations. This mapping enables the learning of a goal-conditioned policy through imagination during behavior learning, with the mapped task serving as the goal state. Our method leverages the predicted temporal distance to the goal state as an informative reward signal. FOUNDER demonstrates superior performance on various multi-task offline visual control benchmarks, excelling in capturing the deep-level semantics of tasks specified by text or videos, particularly in scenarios involving complex observations or domain gaps where prior methods struggle. The consistency of our learned reward function with the ground-truth reward is also empirically validated. Our project website is https://sites.google.com/view/founder-rl.
【19】The carbon cost of materials discovery: Can machine learning really accelerate the discovery of new photovoltaics?
标题:材料发现的碳成本:机器学习真的能加速新型光伏材料的发现吗?
链接:https://arxiv.org/abs/2507.13246
作者:alker, Keith T. Butler
摘要:计算筛选已经成为发现高性能光伏(PV)材料的实验努力的有力补充。大多数工作流程依赖于密度泛函理论(DFT)来估计与太阳能转换相关的电子和光学性质。虽然DFT计算比基于实验室的方法更有效,但仍然需要大量的计算和环境成本。机器学习(ML)模型最近作为DFT的替代品受到了关注,它以具有竞争力的预测性能大幅减少了资源使用。在这项研究中,我们重现了一个规范的基于DFT的工作流程,以估计最大效率限制,并逐步用ML代理替换其组件。通过量化与每个计算策略相关的CO2排放量,我们评估了预测效率和环境成本之间的权衡。我们的研究结果揭示了多个混合ML/DFT策略,优化不同的点沿精度-排放前沿。我们发现,直接预测的标量,如最大效率,是显着更易于处理比使用预测的吸收光谱作为中间步骤。有趣的是,在DFT数据上训练的ML模型可以在筛选应用程序中使用替代的交换-相关函数来超越DFT工作流,突出了数据驱动方法的一致性和实用性。我们还评估了通过扩展数据集和针对PV相关特征量身定制的改进模型架构来改善ML驱动筛选的策略。这项工作为建立低排放、高通量的发现管道提供了一个定量框架。
摘要:Computational screening has become a powerful complement to experimental efforts in the discovery of high-performance photovoltaic (PV) materials. Most workflows rely on density functional theory (DFT) to estimate electronic and optical properties relevant to solar energy conversion. Although more efficient than laboratory-based methods, DFT calculations still entail substantial computational and environmental costs. Machine learning (ML) models have recently gained attention as surrogates for DFT, offering drastic reductions in resource use with competitive predictive performance. In this study, we reproduce a canonical DFT-based workflow to estimate the maximum efficiency limit and progressively replace its components with ML surrogates. By quantifying the CO$_2$ emissions associated with each computational strategy, we evaluate the trade-offs between predictive efficacy and environmental cost. Our results reveal multiple hybrid ML/DFT strategies that optimize different points along the accuracy--emissions front. We find that direct prediction of scalar quantities, such as maximum efficiency, is significantly more tractable than using predicted absorption spectra as an intermediate step. Interestingly, ML models trained on DFT data can outperform DFT workflows using alternative exchange--correlation functionals in screening applications, highlighting the consistency and utility of data-driven approaches. We also assess strategies to improve ML-driven screening through expanded datasets and improved model architectures tailored to PV-relevant features. This work provides a quantitative framework for building low-emission, high-throughput discovery pipelines.
【20】Search for Z/2 eigenfunctions on the sphere using machine learning
标题:使用机器学习在球体上搜索Z/2本征函数
链接:https://arxiv.org/abs/2507.13122
作者:ydys, Willem Adriaan Salm
备注:14 pages, 12 pictures
摘要:我们使用机器学习来搜索2-球面上的Z/2本征函数的例子。为此,我们创建了前馈深度神经网络的多值版本,并使用JAX库实现了它。我们在三种情况下找到了Z/2本征函数:在前两种情况下,我们分别将分支点固定在四面体和立方体的顶点上。在第三种情况下,我们允许AI移动分支点,最后,它将分支点定位在一个被压扁的四面体的顶点上。
摘要:We use machine learning to search for examples of Z/2 eigenfunctions on the 2-sphere. For this we created a multivalued version of a feedforward deep neural network, and we implemented it using the JAX library. We found Z/2 eigenfunctions for three cases: In the first two cases we fixed the branch points at the vertices of a tetrahedron and at a cube respectively. In a third case, we allowed the AI to move the branch points around and, in the end, it positioned the branch points at the vertices of a squashed tetrahedron.
【21】When Pattern-by-Pattern Works: Theoretical and Empirical Insights for Logistic Models with Missing Values
标题:当逐个模式起作用时:缺失值的逻辑模型的理论和经验见解
链接:https://arxiv.org/abs/2507.13024
作者:e Muller (PREMEDICAL), Erwan Scornet (LPSM), Julie Josse (PREMEDICAL)
摘要:即使在参数模型中,预测部分缺失输入的响应仍然是一项具有挑战性的任务,因为参数估计本身不足以对部分观察到的输入进行预测。已有若干工作研究了线性模型中的预测。在本文中,我们专注于逻辑模型,它有其自身的困难。从理论的角度来看,我们证明了逐模式策略(Pattern-by-Pattern,PbP),即为每个缺失模式学习一个逻辑模型,在各种缺失数据场景(MCAR,MAR和MNAR)下能够准确地近似贝叶斯概率。从经验上讲,我们在分类,概率估计,校准和参数推断方面全面比较了各种方法(常数插补和迭代插补,完整案例分析,PbP和EM算法)。我们的分析为带缺失值的逻辑回归提供了一个全面的视角。它揭示了均值插补可以用作低样本量下的基线,并且通过带标签的非线性多重迭代插补技术(MICE.RF.Y)可以获得更好的性能。对于大样本量,PbP是高斯混合的最佳方法,我们建议在存在非线性特征的情况下使用MICE.RF.Y。
摘要:Predicting a response with partially missing inputs remains a challenging task even in parametric models, since parameter estimation in itself is not sufficient to predict on partially observed inputs. Several works study prediction in linear models. In this paper, we focus on logistic models, which present their own difficulties. From a theoretical perspective, we prove that a Pattern-by-Pattern strategy (PbP), which learns one logistic model per missingness pattern, accurately approximates Bayes probabilities in various missing data scenarios (MCAR, MAR and MNAR). Empirically, we thoroughly compare various methods (constant and iterative imputations, complete case analysis, PbP, and an EM algorithm) across classification, probability estimation, calibration, and parameter inference. Our analysis provides a comprehensive view on the logistic regression with missing values. It reveals that mean imputation can be used as baseline for low sample sizes, and improved performance is obtained via nonlinear multiple iterative imputation techniques with the labels (MICE.RF.Y). For large sample sizes, PbP is the best method for Gaussian mixtures, and we recommend MICE.RF.Y in presence of nonlinear features.
【22】Self Balancing Neural Network: A Novel Method to Estimate Average Treatment Effect
标题:自平衡神经网络:一种估计平均治疗效果的新方法
链接:https://arxiv.org/abs/2507.12818
作者:mechu Abdisa, Yingchun Zhou, Yuqi Qiu
摘要:在观察性研究中,混杂变量同时影响治疗和结局。此外,工具变量也影响治疗分配机制。这种情况使这类研究不同于治疗分配是随机的标准随机对照试验。由于这种情况,估计的平均治疗效果会产生偏差。为了解决这个问题,标准方法是在估计平均治疗效果时纳入估计的倾向评分。然而,这些方法存在倾向评分模型设定错误的风险。为了解决这个问题,本研究提出了一种称为“自平衡神经网络”(Sbnet)的新方法,它让模型本身从平衡网获得其伪倾向得分。所提出的方法将平衡网作为前馈神经网络的关键部分来估计平均治疗效果。该公式一步即可解决平均治疗效果的估计。此外,还提出了多伪倾向得分框架,该框架由多样化的平衡网估计得到,并用于估计平均治疗效果。最后,在三个模拟设置和真实世界数据集上将所提出的方法与最先进的方法进行了比较。结果表明,所提出的自平衡神经网络比最先进的方法表现出更好的性能。
摘要:In observational studies, confounding variables affect both treatment and outcome. Moreover, instrumental variables also influence the treatment assignment mechanism. This situation sets the study apart from a standard randomized controlled trial, where the treatment assignment is random. Due to this situation, the estimated average treatment effect becomes biased. To address this issue, a standard approach is to incorporate the estimated propensity score when estimating the average treatment effect. However, these methods incur the risk of misspecification in propensity score models. To solve this issue, a novel method called the "Self balancing neural network" (Sbnet), which lets the model itself obtain its pseudo propensity score from the balancing net, is proposed in this study. The proposed method estimates the average treatment effect by using the balancing net as a key part of the feedforward neural network. This formulation resolves the estimation of the average treatment effect in one step. Moreover, the multi-pseudo propensity score framework, which is estimated from the diversified balancing net and used for the estimation of the average treatment effect, is presented. Finally, the proposed methods are compared with state-of-the-art methods on three simulation setups and real-world datasets. It has been shown that the proposed self-balancing neural network shows better performance than state-of-the-art methods.
【23】Finite-Dimensional Gaussian Approximation for Deep Neural Networks: Universality in Random Weights
标题:深度神经网络的有限维高斯逼近:随机权重的普适性
链接:https://arxiv.org/abs/2507.12686
作者:mar Balasubramanian, Nathan Ross
摘要:我们研究了权重随机初始化且具有有限阶矩的深度神经网络的有限维分布(FDD)。具体来说,在假设激活函数为Lipschitz并允许各层宽度以任意相对速率趋于无穷的条件下,我们建立了FDD与其高斯极限之间在Wasserstein-$1$范数下的高斯逼近界。在所有宽度都与一个公共尺度参数$n$成比例且有$L-1$个隐藏层的特殊情况下,对于任意$\epsilon > 0$,我们得到了阶为$n^{-({1}/{6})^{L-1} + \epsilon}$的收敛速度。
摘要:We study the Finite-Dimensional Distributions (FDDs) of deep neural networks with randomly initialized weights that have finite-order moments. Specifically, we establish Gaussian approximation bounds in the Wasserstein-$1$ norm between the FDDs and their Gaussian limit assuming a Lipschitz activation function and allowing the layer widths to grow to infinity at arbitrary relative rates. In the special case where all widths are proportional to a common scale parameter $n$ and there are $L-1$ hidden layers, we obtain convergence rates of order $n^{-({1}/{6})^{L-1} + \epsilon}$, for any $\epsilon > 0$.
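As a quick reading aid for the rate quoted above, the exponent can be evaluated for small depths. This is plain arithmetic on the stated formula $n^{-({1}/{6})^{L-1} + \epsilon}$ with $L-1$ hidden layers, not an additional result from the paper.

```latex
\begin{align*}
L = 2 \ (\text{one hidden layer}):    &\quad n^{-(1/6)^{1} + \epsilon} = n^{-1/6 + \epsilon},\\
L = 3 \ (\text{two hidden layers}):   &\quad n^{-(1/6)^{2} + \epsilon} = n^{-1/36 + \epsilon},\\
L = 4 \ (\text{three hidden layers}): &\quad n^{-(1/6)^{3} + \epsilon} = n^{-1/216 + \epsilon}.
\end{align*}
```

The exponent shrinks by a factor of 6 with each added hidden layer, so the guaranteed rate weakens quickly as depth grows.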
【24】Physics constrained learning of stochastic characteristics
标题:随机特征的物理约束学习
链接:https://arxiv.org/abs/2507.12661
作者:i Krishna Ala, Ameya Salvi, Venkat Krovi, Matthias Schmid
备注:6 pages, 6 figures
摘要:准确的状态估计需要仔细考虑过程和测量模型的不确定性;这些特性通常不是众所周知的,需要有经验的设计人员来选择协方差矩阵。协方差矩阵的选择中的误差可能影响估计算法的准确性,并且有时可能导致滤波器发散。由于噪声源的不确定性和系统噪声建模的困难,识别噪声特性一直是一个具有挑战性的问题。大多数现有的方法试图通过涉及新息序列的优化算法来识别未知的协方差矩阵。近年来,学习方法已被用来确定过程和测量模型的随机特性。我们提出了一种基于学习的方法,使用不同的损失函数来识别噪声特性,并测试这些方法的实时车辆状态估计性能。
摘要:Accurate state estimation requires careful consideration of uncertainty surrounding the process and measurement models; these characteristics are usually not well-known and need an experienced designer to select the covariance matrices. An error in the selection of covariance matrices could impact the accuracy of the estimation algorithm and may sometimes cause the filter to diverge. Identifying noise characteristics has long been a challenging problem due to uncertainty surrounding noise sources and difficulties in systematic noise modeling. Most existing approaches try identifying unknown covariance matrices through an optimization algorithm involving innovation sequences. In recent years, learning approaches have been utilized to determine the stochastic characteristics of process and measurement models. We present a learning-based methodology with different loss functions to identify noise characteristics and test these approaches' performance for real-time vehicle state estimation.
【25】The Generalist Brain Module: Module Repetition in Neural Networks in Light of the Minicolumn Hypothesis
标题:通才大脑模块:根据微型柱假设的神经网络中的模块重复
链接:https://arxiv.org/abs/2507.12473
作者:n Kvalsund, Mikkel Elle Lepperød
摘要:虽然现代人工智能不断发展,但生物大脑在其鲁棒性、适应性和效率方面仍然是神经网络的巅峰。这篇评论探讨了受大脑结构启发的人工智能架构路径,特别是微柱假设,该假设将新皮层视为重复模块的分布式系统-我们将其连接到集体智能(CI)。尽管现有的工作,有一个连接的皮质柱重复的神经模块的架构缺乏全面的评论。本文旨在通过综合神经模块重复的历史,理论和方法论观点来填补这一空白。我们区分架构重复-重用结构-和参数共享模块重复,其中相同的功能单元在网络中重复。后者表现出关键的CI属性,如鲁棒性,适应性和泛化。有证据表明,重复的模块往往会收敛到一个通才模块:简单,灵活的问题解决者能够处理许多角色的合奏。这种通才倾向可能为现代人工智能的长期挑战提供解决方案:通过简单性和可扩展性提高训练过程中的能源效率,以及通过泛化实现强大的体现控制。虽然实证结果表明,这样的系统可以推广到分布外的问题,理论结果仍然缺乏。总的来说,以模块重复为特征的架构仍然是一种新兴的、未经探索的架构策略,在效率、健壮性和适应性方面具有巨大的潜力。我们相信,一个采用CI优势、同时遵守迷你柱的架构和功能原则的系统可以挑战现代人工智能的可扩展性、能源消耗和民主化问题。
摘要:While modern AI continues to advance, the biological brain remains the pinnacle of neural networks in its robustness, adaptability, and efficiency. This review explores an AI architectural path inspired by the brain's structure, particularly the minicolumn hypothesis, which views the neocortex as a distributed system of repeated modules - a structure we connect to collective intelligence (CI). Despite existing work, there is a lack of comprehensive reviews connecting the cortical column to the architectures of repeated neural modules. This review aims to fill that gap by synthesizing historical, theoretical, and methodological perspectives on neural module repetition. We distinguish between architectural repetition - reusing structure - and parameter-shared module repetition, where the same functional unit is repeated across a network. The latter exhibits key CI properties such as robustness, adaptability, and generalization. Evidence suggests that the repeated module tends to converge toward a generalist module: simple, flexible problem solvers capable of handling many roles in the ensemble. This generalist tendency may offer solutions to longstanding challenges in modern AI: improved energy efficiency during training through simplicity and scalability, and robust embodied control via generalization. While empirical results suggest such systems can generalize to out-of-distribution problems, theoretical results are still lacking. Overall, architectures featuring module repetition remain an emerging and unexplored architectural strategy, with significant untapped potential for both efficiency, robustness, and adaptiveness. We believe that a system that adopts the benefits of CI, while adhering to architectural and functional principles of the minicolumns, could challenge the modern AI problems of scalability, energy consumption, and democratization.
其他(19篇)
【1】Hierarchical Rectified Flow Matching with Mini-Batch Couplings
标题:采用小批量耦合的分层纠正流匹配
链接:https://arxiv.org/abs/2507.13350
作者:ng, Yici Yan, Alex Schwing, Zhizhen Zhao
备注:Project Page: this https URL
摘要:流匹配已经成为一种引人注目的生成式建模方法,广泛应用于各个领域。为了通过流动匹配模型生成数据,通过对建模的速度场进行前向积分来数值求解常微分方程(ODE)。为了更好地捕捉典型速度场中固有的多模态,最近引入了分层流匹配。它使用一个ODE的层次结构,在生成数据时进行数值积分。这种ODE的层次结构捕获多模态速度分布,就像香草流匹配能够对多模态数据分布建模一样。虽然该层级能够对多模态速度分布进行建模,但是建模的分布的复杂性在层级的各个级别上保持相同。在本文中,我们研究了如何通过小批量耦合逐渐调整分布的复杂性在不同层次的层次结构。我们通过对合成和成像数据的令人信服的结果显示了小批量耦合在分层整流流匹配中的好处。代码可在https://riccizz.github.io/HRF_coupling上获得。
摘要:Flow matching has emerged as a compelling generative modeling approach that is widely used across domains. To generate data via a flow matching model, an ordinary differential equation (ODE) is numerically solved via forward integration of the modeled velocity field. To better capture the multi-modality that is inherent in typical velocity fields, hierarchical flow matching was recently introduced. It uses a hierarchy of ODEs that are numerically integrated when generating data. This hierarchy of ODEs captures the multi-modal velocity distribution just like vanilla flow matching is capable of modeling a multi-modal data distribution. While this hierarchy makes it possible to model multi-modal velocity distributions, the complexity of the modeled distribution remains identical across levels of the hierarchy. In this paper, we study how to gradually adjust the complexity of the distributions across different levels of the hierarchy via mini-batch couplings. We show the benefits of mini-batch couplings in hierarchical rectified flow matching via compelling results on synthetic and imaging data. Code is available at https://riccizz.github.io/HRF_coupling.
【2】Computational-Statistical Tradeoffs from NP-hardness
标题:基于NP难性的计算-统计权衡
链接:https://arxiv.org/abs/2507.13222
作者:, Caleb Koch, Carmen Strassle, Li-Yang Tan
备注:To appear at FOCS 2025
摘要:计算机科学和统计学的一个中心问题是,高效算法是否能够达到统计问题的信息论极限。在平均情况假设下已经证明了许多计算-统计权衡,但由于统计问题本质上是平均情况的,因此将它们建立在标准的最坏情况假设之上一直是一个挑战。在最早研究这种权衡的PAC学习中,问题是计算效率是否会以使用比信息论所需更多的样本为代价。我们将这种权衡建立在$\mathsf{NP}$难性之上,并得到:$\circ$ 在假设$\mathsf{NP}$需要指数时间的条件下的尖锐计算-统计权衡:对于每个多项式$p(n)$,存在一个VC维为$1$的$n$变量类$C$,使得时间高效地学习$C$的样本复杂度为$\Theta(p(n))$。$\circ$ 从学习角度对$\mathsf{RP}$与$\mathsf{NP}$的刻画:$\mathsf{RP} = \mathsf{NP}$当且仅当每个$\mathsf{NP}$-可枚举类都可以用$O(\mathrm{VCdim}(C))$个样本在多项式时间内学习。正向蕴涵自(Pitt and Valiant,1988)以来就已为人所知;我们证明了反向蕴涵。值得注意的是,我们所有的下界都对非恰当学习器成立。这些是关于非恰当地学习多项式大小电路的某个子类的第一批$\mathsf{NP}$难性结果,绕过了Applebaum,Barak和Xiao(2008)的形式障碍。
摘要:A central question in computer science and statistics is whether efficient algorithms can achieve the information-theoretic limits of statistical problems. Many computational-statistical tradeoffs have been shown under average-case assumptions, but since statistical problems are average-case in nature, it has been a challenge to base them on standard worst-case assumptions. In PAC learning where such tradeoffs were first studied, the question is whether computational efficiency can come at the cost of using more samples than information-theoretically necessary. We base such tradeoffs on $\mathsf{NP}$-hardness and obtain: $\circ$ Sharp computational-statistical tradeoffs assuming $\mathsf{NP}$ requires exponential time: For every polynomial $p(n)$, there is an $n$-variate class $C$ with VC dimension $1$ such that the sample complexity of time-efficiently learning $C$ is $\Theta(p(n))$. $\circ$ A characterization of $\mathsf{RP}$ vs. $\mathsf{NP}$ in terms of learning: $\mathsf{RP} = \mathsf{NP}$ iff every $\mathsf{NP}$-enumerable class is learnable with $O(\mathrm{VCdim}(C))$ samples in polynomial time. The forward implication has been known since (Pitt and Valiant, 1988); we prove the reverse implication. Notably, all our lower bounds hold against improper learners. These are the first $\mathsf{NP}$-hardness results for improperly learning a subclass of polynomial-size circuits, circumventing formal barriers of Applebaum, Barak, and Xiao (2008).
【3】NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech
标题:NonverbalTTS:带情感标注的文本对齐非言语发声公开英语语料库,用于文本转语音
链接:https://arxiv.org/abs/2507.13155
作者:risov, Egor Spirin, Daria Diatlova
摘要:目前的表达性语音合成模型受限于包含多样非言语发声(NV)的开源数据集的稀缺。在这项工作中,我们介绍了NonverbalTTS(NVTTS),这是一个17小时的开放获取数据集,标注了10种类型的NV(例如,笑声,咳嗽)和8个情绪类别。该数据集来自流行的来源VoxCeleb和Expresso,使用自动检测,然后进行人工验证。我们提出了一个综合的流水线,集成了自动语音识别(ASR),NV标记,情感分类,以及用于合并来自多个标注器的转录文本的融合算法。在NVTTS数据集上微调开源文本到语音(TTS)模型,实现了与闭源系统(如CosyVoice2)相当的水平,这是通过人工评估和自动度量(包括说话人相似性和NV保真度)来衡量的。通过发布NVTTS及其配套的标注指南,我们解决了表达性TTS研究中的一个关键瓶颈。该数据集可在https://huggingface.co/datasets/deepvk/NonverbalTTS上获得。
摘要:Current expressive speech synthesis models are constrained by the limited availability of open-source datasets containing diverse nonverbal vocalizations (NVs). In this work, we introduce NonverbalTTS (NVTTS), a 17-hour open-access dataset annotated with 10 types of NVs (e.g., laughter, coughs) and 8 emotional categories. The dataset is derived from popular sources, VoxCeleb and Expresso, using automated detection followed by human validation. We propose a comprehensive pipeline that integrates automatic speech recognition (ASR), NV tagging, emotion classification, and a fusion algorithm to merge transcriptions from multiple annotators. Fine-tuning open-source text-to-speech (TTS) models on the NVTTS dataset achieves parity with closed-source systems such as CosyVoice2, as measured by both human evaluation and automatic metrics, including speaker similarity and NV fidelity. By releasing NVTTS and its accompanying annotation guidelines, we address a key bottleneck in expressive TTS research. The dataset is available at https://huggingface.co/datasets/deepvk/NonverbalTTS.
【4】Teach Old SAEs New Domain Tricks with Boosting
标题:通过Boosting教授旧SAE新领域技巧
链接:https://arxiv.org/abs/2507.12990
作者:riagin, Yaroslav Aksenov, Daniil Laptev, Gleb Gerasimov, Nikita Balagansky, Daniil Gavrilov
摘要:稀疏自动编码器已经成为解释大型语言模型内部表示的强大工具,但它们往往无法捕获在其训练语料库中不普遍的特定领域特征。本文介绍了一种残差学习方法,在不需要完全重新训练的情况下解决这种特征盲区。我们建议专门训练一个辅助SAE,以对预训练SAE在特定领域文本上的重建误差进行建模,有效地捕获主模型遗漏的特征。通过在推理过程中对两个模型的输出进行求和,我们在多个专业领域的LLM交叉熵和解释方差度量方面都取得了显著的改进。我们的实验表明,这种方法可以高效地将新的领域知识融入现有的SAE,同时保持其在一般任务上的性能。这种方法使研究人员能够选择性地增强特定领域的SAE可解释性,为LLM的针对性机制可解释性开辟了新的可能性。
摘要:Sparse Autoencoders have emerged as powerful tools for interpreting the internal representations of Large Language Models, yet they often fail to capture domain-specific features not prevalent in their training corpora. This paper introduces a residual learning approach that addresses this feature blindness without requiring complete retraining. We propose training a secondary SAE specifically to model the reconstruction error of a pretrained SAE on domain-specific texts, effectively capturing features missed by the primary model. By summing the outputs of both models during inference, we demonstrate significant improvements in both LLM cross-entropy and explained variance metrics across multiple specialized domains. Our experiments show that this method efficiently incorporates new domain knowledge into existing SAEs while maintaining their performance on general tasks. This approach enables researchers to selectively enhance SAE interpretability for specific domains of interest, opening new possibilities for targeted mechanistic interpretability of LLMs.
【5】A Distributed Generative AI Approach for Heterogeneous Multi-Domain Environments under Data Sharing constraints
标题:数据共享约束下的异类多领域环境分布式生成人工智能方法
链接:https://arxiv.org/abs/2507.12979
作者:awfilis, Hossam Amer, Minar El-Aasser, Tallal Elshabrawy
摘要:联邦学习因其能够使多个节点在不共享原始数据的情况下协作训练机器学习模型而受到越来越多的关注。与此同时,生成人工智能-特别是生成对抗网络(GANs)-在医疗保健,安全和图像生成等广泛领域取得了显着的成功。然而,训练生成模型通常需要大量的数据集和大量的计算资源,这在现实世界中通常是不可用的。获取这些资源可能成本高昂且效率低下,特别是当许多未充分利用的设备(如物联网设备和边缘设备)具有不同的功能仍然处于空闲状态时。此外,由于隐私问题和版权限制,获取大型数据集具有挑战性,因为大多数设备不愿意共享其数据。为了应对这些挑战,我们提出了一种新的分散式GAN训练方法,该方法可以利用分布式数据和未充分利用的低性能设备,同时不以原始形式共享数据。我们的方法旨在解决去中心化环境中的关键挑战,结合KLD加权的分布式联合学习来解决数据异构性和多域数据集的问题,并结合异构U形分割学习来解决严格数据共享约束下的设备异构性挑战-确保节点之间没有标签或原始数据,无论是真实的还是合成的。实验结果表明,我们的方法在关键性能指标上表现出一致和显著的改进,与几个基准测试相比,它实现了1.1倍-2.2倍的图像生成分数,分类指标平均提高10%(在多域非IID设置中高达50%),延迟低得多。在https://github.com/youssefga28/HuSCF-GAN上找到我们的代码。
摘要:Federated Learning has gained increasing attention for its ability to enable multiple nodes to collaboratively train machine learning models without sharing their raw data. At the same time, Generative AI -- particularly Generative Adversarial Networks (GANs) -- has achieved remarkable success across a wide range of domains, such as healthcare, security, and image generation. However, training generative models typically requires large datasets and significant computational resources, which are often unavailable in real-world settings. Acquiring such resources can be costly and inefficient, especially when many underutilized devices -- such as IoT devices and edge devices -- with varying capabilities remain idle. Moreover, obtaining large datasets is challenging due to privacy concerns and copyright restrictions, as most devices are unwilling to share their data. To address these challenges, we propose a novel approach for decentralized GAN training that enables the utilization of distributed data and underutilized, low-capability devices while not sharing data in its raw form. Our approach is designed to tackle key challenges in decentralized environments, combining KLD-weighted Clustered Federated Learning to address the issues of data heterogeneity and multi-domain datasets, with Heterogeneous U-Shaped split learning to tackle the challenge of device heterogeneity under strict data sharing constraints -- ensuring that no labels or raw data, whether real or synthetic, are ever shared between nodes. Experimental results show that our approach delivers consistent and significant improvements across key performance metrics: it achieves 1.1x -- 2.2x higher image generation scores, an average 10% boost in classification metrics (up to 50% in multi-domain non-IID settings), and much lower latency compared to several benchmarks. Find our code at https://github.com/youssefga28/HuSCF-GAN.
【6】MC$^2$A: Enabling Algorithm-Hardware Co-Design for Efficient Markov Chain Monte Carlo Acceleration
标题:MC$^2$A:实现算法-硬件协同设计,以高效加速马尔可夫链蒙特卡罗
链接:https://arxiv.org/abs/2507.12935
作者:ao, Jun Yin, Lingyun Yao, Martin Andraud, Wannes Meert, Marian Verhelst
备注:14 pages, 15 figures, IEEE journal paper
摘要:越来越多的应用程序正在利用基于采样的算法进行规划,优化和推理。马尔可夫链蒙特卡罗(MCMC)算法形成了机器学习这一新兴分支的计算骨干。不幸的是,高计算成本限制了它们在大规模问题和实际应用中的可行性,并且现有的MCMC加速解决方案要么在硬件灵活性方面受到限制,要么无法在各种端到端应用中保持系统级的效率。本文介绍了\textbf{MC$^2$A},一个算法-硬件协同设计框架,实现了MCMC加速的高效和灵活优化。首先,\textbf{MC$^2$A}通过对处理器性能屋顶模型的第三维扩展来分析MCMC工作负载多样性,以获得计算、采样和内存参数之间的最佳平衡。其次,\textbf{MC$^2$A}提出了一种参数化的硬件加速器架构,该架构具有灵活有效的MCMC内核支持,具有ISA可编程树结构处理单元的流水线,可重新配置的采样器和交叉互连以支持不规则访问。第三,\textbf{MC$^2$A}的核心是由一个新颖的Gumbel采样器提供动力,它消除了指数和归一化运算。在端到端案例研究中,与CPU、GPU、TPU和最先进的MCMC加速器相比,\textbf{MC$^2$A}实现了{$307.6\times$,$1.4\times$,$2.0\times$,$84.2\times$}的总体加速提升。在各种有代表性的MCMC工作负载上进行评估,这项工作证明并利用通用硬件加速的可行性,以推广基于MCMC的解决方案在不同的应用领域。
摘要:An increasing number of applications are exploiting sampling-based algorithms for planning, optimization, and inference. The Markov Chain Monte Carlo (MCMC) algorithms form the computational backbone of this emerging branch of machine learning. Unfortunately, the high computational cost limits their feasibility for large-scale problems and real-world applications, and the existing MCMC acceleration solutions are either limited in hardware flexibility or fail to maintain efficiency at the system level across a variety of end-to-end applications. This paper introduces \textbf{MC$^2$A}, an algorithm-hardware co-design framework, enabling efficient and flexible optimization for MCMC acceleration. Firstly, \textbf{MC$^2$A} analyzes the MCMC workload diversity through an extension of the processor performance roofline model with a 3rd dimension to derive the optimal balance between the compute, sampling and memory parameters. Secondly, \textbf{MC$^2$A} proposes a parametrized hardware accelerator architecture with flexible and efficient support of MCMC kernels with a pipeline of ISA-programmable tree-structured processing units, reconfigurable samplers and a crossbar interconnect to support irregular access. Thirdly, the core of \textbf{MC$^2$A} is powered by a novel Gumbel sampler that eliminates exponential and normalization operations. In the end-to-end case study, \textbf{MC$^2$A} achieves an overall {$307.6\times$, $1.4\times$, $2.0\times$, $84.2\times$} speedup compared to the CPU, GPU, TPU and state-of-the-art MCMC accelerator. Evaluated on various representative MCMC workloads, this work demonstrates and exploits the feasibility of general hardware acceleration to popularize MCMC-based solutions in diverse application domains.
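For readers unfamiliar with why a Gumbel-based sampler can avoid exponentials and normalization, the textbook Gumbel-max trick is sketched below. It draws from a categorical distribution given only unnormalized log-probabilities, by adding independent Gumbel noise to each logit and taking the argmax, so no softmax is computed. This is the generic software version, assumed here for illustration only; the abstract does not describe the accelerator's actual sampler design, and `gumbel_max_sample` is a hypothetical name.

```python
import math
import random

def gumbel_max_sample(logits, rng):
    """Sample an index with probability proportional to exp(logits[i]),
    without evaluating exp() or normalizing the weights."""
    best_index, best_value = -1, -math.inf
    for i, logit in enumerate(logits):
        u = rng.random()
        while u == 0.0:                   # guard against log(0)
            u = rng.random()
        gumbel = -math.log(-math.log(u))  # standard Gumbel(0, 1) noise
        if logit + gumbel > best_value:
            best_index, best_value = i, logit + gumbel
    return best_index

# Empirical check against target probabilities 0.1, 0.2, 0.7.
rng = random.Random(0)
logits = [math.log(1.0), math.log(2.0), math.log(7.0)]
num_draws = 20000
counts = [0, 0, 0]
for _ in range(num_draws):
    counts[gumbel_max_sample(logits, rng)] += 1
freqs = [c / num_draws for c in counts]
```

The logits could be shifted by any constant without changing the result, which is why the normalizing sum is never needed.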
【7】An Investigation of Ear-EEG Signals for a Novel Biometric Authentication System
标题:新型生物识别系统的耳脑电信号研究
链接:https://arxiv.org/abs/2507.12873
作者:ola, Giancarlo Crocetti, Gian Luca Foresti, Daniele Pannone, Claudio Piciarelli, Amedeo Ranaldi
摘要:这项工作探讨了使用通过入耳式设备采集的EEG信号(通常称为耳EEG)进行生物特征认证的可行性。传统的基于EEG的生物识别系统虽然安全,但由于基于头皮的电极设置繁琐,可用性往往较低。在这项研究中,我们提出了一个新颖而实用的框架,利用耳EEG信号作为日常生物特征认证的用户友好替代方案。该系统从耳EEG信号中提取时间和频谱特征的原创组合,并将其输入全连接深度神经网络进行受试者识别。在目前唯一可用的、适用于包括生物特征认证在内的多种用途的耳EEG数据集上,实验结果显示出有前景的性能,在受试者识别场景中平均准确率为82%。这些发现证实了耳EEG作为下一代真实世界生物识别系统的可行且可部署方向的潜力。
摘要:This work explores the feasibility of biometric authentication using EEG signals acquired through in-ear devices, commonly referred to as ear-EEG. Traditional EEG-based biometric systems, while secure, often suffer from low usability due to cumbersome scalp-based electrode setups. In this study, we propose a novel and practical framework leveraging ear-EEG signals as a user-friendly alternative for everyday biometric authentication. The system extracts an original combination of temporal and spectral features from ear-EEG signals and feeds them into a fully connected deep neural network for subject identification. Experimental results on the only currently available ear-EEG dataset suitable for different purposes, including biometric authentication, demonstrate promising performance, with an average accuracy of 82% in a subject identification scenario. These findings confirm the potential of ear-EEG as a viable and deployable direction for next-generation real-world biometric systems.
【8】A Kernel Distribution Closeness Testing
标题:一种核分布贴近度检验方法
链接:https://arxiv.org/abs/2507.12843
作者:hou, Liuhua Peng, Xunye Tian, Feng Liu
摘要:分布接近性检验(DCT)评估一对分布之间是否至少相距$\epsilon$。现有的DCT方法主要度量定义在离散一维空间上的分布对之间的差异(例如,使用全变分),这限制了它们在复杂数据(例如,图像)上的应用。为了将DCT扩展到更多类型的数据,一个自然的想法是将最大均值差异(MMD)引入DCT场景,MMD是两个复杂分布之间分布差异的强大度量。然而,我们发现,对于在同一再生核希尔伯特空间(RKHS)中具有不同范数的许多分布对,MMD的值可能相同,这使得MMD在评估多个分布对的接近程度时信息量较少。为了缓解这个问题,我们设计了一种新的分布差异度量,即范数自适应MMD(NAMMD),它使用分布的RKHS范数来缩放MMD的值。基于NAMMD的渐近分布,我们最终提出了基于NAMMD的DCT来评估分布对的接近程度。理论上,我们证明了基于NAMMD的DCT比基于MMD的DCT具有更高的检验功效,且第一类错误有界,这也通过在多种类型数据(例如,合成噪声、真实图像)上的大量实验得到了验证。此外,我们还将NAMMD应用于双样本检验问题,发现基于NAMMD的双样本检验在理论和实验上都比基于MMD的双样本检验具有更高的检验功效。
摘要:The distribution closeness testing (DCT) assesses whether the distance between a distribution pair is at least $\epsilon$-far. Existing DCT methods mainly measure discrepancies between a distribution pair defined on discrete one-dimensional spaces (e.g., using total variation), which limits their applications to complex data (e.g., images). To extend DCT to more types of data, a natural idea is to introduce maximum mean discrepancy (MMD), a powerful measurement of the distributional discrepancy between two complex distributions, into DCT scenarios. However, we find that MMD's value can be the same for many pairs of distributions that have different norms in the same reproducing kernel Hilbert space (RKHS), making MMD less informative when assessing the closeness levels for multiple distribution pairs. To mitigate the issue, we design a new measurement of distributional discrepancy, norm-adaptive MMD (NAMMD), which scales MMD's value using the RKHS norms of distributions. Based on the asymptotic distribution of NAMMD, we finally propose the NAMMD-based DCT to assess the closeness levels of a distribution pair. Theoretically, we prove that NAMMD-based DCT has higher test power compared to MMD-based DCT, with bounded type-I error, which is also validated by extensive experiments on many types of data (e.g., synthetic noise, real images). Furthermore, we also apply the proposed NAMMD for addressing the two-sample testing problem and find NAMMD-based two-sample test has higher test power than the MMD-based two-sample test in both theory and experiments.
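As background for the quantities named in the abstract above, here is a minimal sketch of the standard unbiased estimator of squared MMD with a Gaussian kernel, which also returns the estimated squared RKHS norms of the two mean embeddings. The abstract says NAMMD rescales MMD using these norms but does not give the formula, so the rescaling itself is deliberately left out. The one-dimensional data, the bandwidth, and all function names are illustrative assumptions.

```python
import math
import random

def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel on scalars; the bandwidth is an arbitrary choice."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))

def mean_kernel(xs, ys, skip_diagonal):
    """Average kernel value over pairs; skip i == j for within-sample averages."""
    total, count = 0.0, 0
    for i, a in enumerate(xs):
        for j, b in enumerate(ys):
            if skip_diagonal and i == j:
                continue
            total += gaussian_kernel(a, b)
            count += 1
    return total / count

def mmd_squared(xs, ys):
    """Unbiased estimate of MMD^2, plus estimates of ||mu_P||^2 and ||mu_Q||^2."""
    norm_p = mean_kernel(xs, xs, skip_diagonal=True)
    norm_q = mean_kernel(ys, ys, skip_diagonal=True)
    cross = mean_kernel(xs, ys, skip_diagonal=False)
    return norm_p + norm_q - 2.0 * cross, norm_p, norm_q

rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(200)]
ys_same = [rng.gauss(0.0, 1.0) for _ in range(200)]  # same distribution as xs
ys_far = [rng.gauss(2.0, 1.0) for _ in range(200)]   # mean shifted by 2

mmd_near, norm_p, norm_q = mmd_squared(xs, ys_same)
mmd_far, _, _ = mmd_squared(xs, ys_far)
```

Because the estimator is unbiased, `mmd_near` can come out slightly negative when the two samples share a distribution; that is expected rather than a bug.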
【9】MVA 2025 Small Multi-Object Tracking for Spotting Birds Challenge: Dataset, Methods, and Results
标题:MVA 2025观鸟小型多目标跟踪挑战:数据集、方法和结果
链接:https://arxiv.org/abs/2507.12832
作者:o, Norimichi Ukita, Riku Kanayama, Yuki Yoshida, Takayuki Yamaguchi, Xiang Yu, Guang Liang, Xinyao Liu, Guan-Zhang Wang, Wei-Ta Chu, Bing-Cheng Chuang, Jia-Hua Lee, Pin-Tseng Kuo, I-Hsuan Chu, Yi-Shein Hsiao, Cheng-Han Wu, Po-Yi Wu, Jui-Chien Tsou, Hsuan-Chi Liu, Chun-Yi Lee, Yuan-Fu Yang, Kosuke Shigematsu, Asuka Shin, Ba Tran
备注:This paper is the official challenge report for SMOT4SB and is published in the proceedings of MVA 2025 (19th International Conference on Machine Vision and Applications). Official challenge page: this https URL
摘要:当目标仅占据几十个像素时,小型多目标跟踪(SMOT)尤其具有挑战性,这使得检测和基于外观的关联变得不可靠。基于MVA2023 SOD4SB挑战赛的成功,本文介绍了SMOT4SB挑战赛,该挑战赛利用时间信息来解决单帧检测的局限性。我们的三个主要贡献是:(1)SMOT4SB数据集,由211个无人机视频序列组成,包含不同真实世界条件下的108,192个标注帧,旨在捕捉相机和目标都在3D中自由移动时的运动纠缠;(2)SO-HOTA,一种将点距离(Dot Distance)与HOTA相结合的新度量,以减轻基于IoU的度量对小位移的敏感性;以及(3)竞争激烈的MVA2025挑战赛,共有78名参与者和308份提交,获胜方法比基线提高了5.1倍。这项工作为推进无人机场景中的SMOT奠定了基础,可应用于避免鸟击、农业、渔业和生态监测。
摘要:Small Multi-Object Tracking (SMOT) is particularly challenging when targets occupy only a few dozen pixels, rendering detection and appearance-based association unreliable. Building on the success of the MVA2023 SOD4SB challenge, this paper introduces the SMOT4SB challenge, which leverages temporal information to address limitations of single-frame detection. Our three main contributions are: (1) the SMOT4SB dataset, consisting of 211 UAV video sequences with 108,192 annotated frames under diverse real-world conditions, designed to capture motion entanglement where both camera and targets move freely in 3D; (2) SO-HOTA, a novel metric combining Dot Distance with HOTA to mitigate the sensitivity of IoU-based metrics to small displacements; and (3) a competitive MVA2025 challenge with 78 participants and 308 submissions, where the winning method achieved a 5.1x improvement over the baseline. This work lays a foundation for advancing SMOT in UAV scenarios with applications in bird strike avoidance, agriculture, fisheries, and ecological monitoring.
【10】Autoregressive Speech Enhancement via Acoustic Tokens
标题:通过声学令牌的自回归语音增强
链接:https://arxiv.org/abs/2507.12825
作者:a Libera, Cem Subakan, Mirco Ravanelli
备注:5 pages, 2 figures
摘要:在语音处理流水线中,提高真实世界录音的质量和清晰度至关重要。虽然监督回归是语音增强的主要方法,但音频标记化正在成为一种有前途的替代方案,便于与其他模态平滑集成。然而,利用离散表示进行语音增强的研究仍然有限。以前的工作主要集中在语义令牌上,而语义令牌往往会丢弃关键的声学细节,如说话人身份。此外,这些研究通常采用非自回归模型,假设输出条件独立,忽略了自回归建模可能带来的改进。为了弥补这些差距,我们:1)对用于语音增强的声学令牌的性能进行全面研究,包括比特率和噪声强度的影响;2)引入专门为此任务设计的基于transducer的新型自回归架构。在VoiceBank和Libri1Mix数据集上的实验表明,声学令牌在保留说话人身份方面优于语义令牌,并且我们的自回归方法可以进一步提高性能。尽管如此,我们观察到离散表示仍然不及连续表示,这凸显了在这一领域进行进一步研究的必要性。
摘要:In speech processing pipelines, improving the quality and intelligibility of real-world recordings is crucial. While supervised regression is the primary method for speech enhancement, audio tokenization is emerging as a promising alternative for a smooth integration with other modalities. However, research on speech enhancement using discrete representations is still limited. Previous work has mainly focused on semantic tokens, which tend to discard key acoustic details such as speaker identity. Additionally, these studies typically employ non-autoregressive models, assuming conditional independence of outputs and overlooking the potential improvements offered by autoregressive modeling. To address these gaps we: 1) conduct a comprehensive study of the performance of acoustic tokens for speech enhancement, including the effect of bitrate and noise strength; 2) introduce a novel transducer-based autoregressive architecture specifically designed for this task. Experiments on VoiceBank and Libri1Mix datasets show that acoustic tokens outperform semantic tokens in terms of preserving speaker identity, and that our autoregressive approach can further improve performance. Nevertheless, we observe that discrete representations still fall short compared to continuous ones, highlighting the need for further research in this area.
【11】AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation
标题:AnyPos:双手操纵的自动化任务不可知动作
链接:https://arxiv.org/abs/2507.12768
作者:an, Yao Feng, Xinyi Mao, Shuhe Huang, Guodong Liu, Zhongkai Hao, Hang Su, Jun Zhu
摘要:视觉-语言-动作(VLA)模型在复杂环境(如双手操作)中的任务条件控制方面展现出了潜力。然而,对特定任务的人类演示的严重依赖限制了它们的泛化能力,并导致高昂的数据采集成本。在这项工作中,我们提出了任务不可知动作范式这一新概念,它将动作执行与任务特定的条件解耦,从而提高可扩展性、效率和成本效益。为了解决这种范式带来的数据收集挑战(例如低覆盖密度、行为冗余和安全风险),我们引入了ATARA(自动化任务不可知随机动作),这是一个可扩展的自监督框架,与人类遥操作相比,它可以将数据收集加速30倍以上。为了进一步从任务不可知的数据中进行有效学习(这些数据通常受到分布失配和不相关轨迹的影响),我们提出了AnyPos,这是一种配备了手臂解耦估计和方向感知解码器(DAD)的逆动力学模型。我们还集成了一个视频条件动作验证模块,以验证所学策略在不同操作任务中的可行性。大量实验表明,AnyPos-ATARA流水线将测试准确率提高了51%,并通过基于回放的视频验证,在提升、拾取和放置、点击等下游任务中将成功率提高了30-40%。项目页面:https://embodiedfoundation.github.io/vidar_anypos
摘要:Vision-language-action (VLA) models have shown promise on task-conditioned control in complex settings such as bimanual manipulation. However, the heavy reliance on task-specific human demonstrations limits their generalization and incurs high data acquisition costs. In this work, we present a new notion of task-agnostic action paradigm that decouples action execution from task-specific conditioning, enhancing scalability, efficiency, and cost-effectiveness. To address the data collection challenges posed by this paradigm -- such as low coverage density, behavioral redundancy, and safety risks -- we introduce ATARA (Automated Task-Agnostic Random Actions), a scalable self-supervised framework that accelerates collection by over $ 30\times $ compared to human teleoperation. To further enable effective learning from task-agnostic data, which often suffers from distribution mismatch and irrelevant trajectories, we propose AnyPos, an inverse dynamics model equipped with Arm-Decoupled Estimation and a Direction-Aware Decoder (DAD). We additionally integrate a video-conditioned action validation module to verify the feasibility of learned policies across diverse manipulation tasks. Extensive experiments show that the AnyPos-ATARA pipeline yields a 51% improvement in test accuracy and achieves 30-40% higher success rates in downstream tasks such as lifting, pick-and-place, and clicking, using replay-based video validation. Project Page: https://embodiedfoundation.github.io/vidar_anypos
【12】Benchmarking Deception Probes via Black-to-White Performance Boosts
标题:通过黑白性能提升对欺骗探针进行基准测试
链接:https://arxiv.org/abs/2507.12691
作者:ck, Carlo Leonardo Attubato, Stefan Heimersheim
备注:Preprint. 37 pages, 10 figures, 7 tables
摘要:AI助手偶尔会对用户的查询做出欺骗性的回应。最近,人们训练了线性分类器(称为“欺骗探针”)来区分语言模型在欺骗性回应与诚实回应期间的内部激活。然而,目前还不清楚这些探针在实际中检测欺骗行为的有效性,也不清楚这些探针是否能抵抗希望逃避检测的欺骗性助手的简单反制策略。在本文中,我们比较了白盒监控(监控器可以访问令牌级探针激活)和黑盒监控(没有这种访问)。我们以白盒监控器优于黑盒监控器的程度(即黑盒到白盒的性能提升)作为欺骗探针的基准。我们发现,现有的欺骗探针带来了较弱但令人鼓舞的黑盒到白盒性能提升。
摘要:AI assistants will occasionally respond deceptively to user queries. Recently, linear classifiers (called "deception probes") have been trained to distinguish the internal activations of a language model during deceptive versus honest responses. However, it is unclear how effective these probes are at detecting deception in practice, or whether such probes are resistant to simple counter strategies from a deceptive assistant who wishes to evade detection. In this paper, we compare white-box monitoring (where the monitor has access to token-level probe activations) to black-box monitoring (without such access). We benchmark deception probes by the extent to which the white-box monitor outperforms the black-box monitor, i.e. the black-to-white performance boost. We find weak but encouraging black-to-white performance boosts from existing deception probes.
【13】Data Transformation Strategies to Remove Heterogeneity
标题:消除异构的数据转换策略
链接:https://arxiv.org/abs/2507.12677
作者:Yoo, Jaeyoung Lee, Chanyoung Yoon, Geonyeong Son, Hyein Hong, Seongbum Seo, Soobin Yim, Chanyoung Jung, Jungsoo Park, Misuk Kim, Yun Jang
摘要:数据异构性是一个普遍存在的问题,它源于各种相互冲突的因素,使其利用变得复杂。这种不确定性,特别是由数据格式差异引起的不确定性,经常需要专家参与寻找解决方案。当前的方法主要解决与数据结构和模式相关的冲突,通常忽略了数据转换所扮演的关键角色。随着人工智能(AI)的使用不断扩大,对更精简的数据准备流程的需求不断增长,数据转换变得至关重要。它可以自定义训练数据以提高AI学习效率,并调整输入格式以适应不同的AI模型。选择适当的转换技术对于保留关键数据细节至关重要。尽管人工智能在各个行业中广泛集成,但有关当代数据转换方法的全面评论却很少。本调查探讨了数据异质性及其潜在来源的复杂性。它系统地分类并提出了解决数据格式差异引起的异质性的战略,阐明了与每一战略相关的固有挑战。
摘要:Data heterogeneity is a prevalent issue, stemming from various conflicting factors, making its utilization complex. This uncertainty, particularly resulting from disparities in data formats, frequently necessitates the involvement of experts to find resolutions. Current methodologies primarily address conflicts related to data structures and schemas, often overlooking the pivotal role played by data transformation. As the utilization of artificial intelligence (AI) continues to expand, there is a growing demand for a more streamlined data preparation process, and data transformation becomes paramount. It customizes training data to enhance AI learning efficiency and adapts input formats to suit diverse AI models. Selecting an appropriate transformation technique is paramount in preserving crucial data details. Despite the widespread integration of AI across various industries, comprehensive reviews concerning contemporary data transformation approaches are scarce. This survey explores the intricacies of data heterogeneity and its underlying sources. It systematically categorizes and presents strategies to address heterogeneity stemming from differences in data formats, shedding light on the inherent challenges associated with each strategy.
【14】Second-Order Bounds for [0,1]-Valued Regression via Betting Loss
标题:通过投注损失的[0,1]值回归的二阶界
链接:https://arxiv.org/abs/2507.12584
作者: Kwang-Sung Jun
摘要:我们考虑独立同分布(i.i.d.)设置下的$[0,1]$值回归问题。在一个称为成本敏感分类的相关问题中,\citet{foster21efficient}已经表明,与平方损失最小化器相比,对数损失最小化器实现了改进的泛化界,因为该界与最佳分类器的成本成比例,而这一成本可以根据手头的问题任意小。这样的结果通常被称为一阶界。对于$[0,1]$值回归,我们首先证明对数损失最小化器可以得到类似的一阶界。然后,我们提出一个问题:是否存在一个损失函数,能够达到依赖方差的界(也称为二阶界),即对一阶界的严格改进?我们通过提出一个称为投注损失的新损失函数,对这个问题给出了肯定的回答。我们的结果是“方差自适应的”,即该界是在不了解方差的任何信息的情况下获得的,这与将标签(或奖励)方差或标签分布本身显式建模为函数类的一部分的做法(如分布式强化学习)形成对比。
摘要:We consider the $[0,1]$-valued regression problem in the i.i.d. setting. In a related problem called cost-sensitive classification, \citet{foster21efficient} have shown that the log loss minimizer achieves an improved generalization bound compared to that of the squared loss minimizer in the sense that the bound scales with the cost of the best classifier, which can be arbitrarily small depending on the problem at hand. Such a result is often called a first-order bound. For $[0,1]$-valued regression, we first show that the log loss minimizer leads to a similar first-order bound. We then ask if there exists a loss function that achieves a variance-dependent bound (also known as a second order bound), which is a strict improvement upon first-order bounds. We answer this question in the affirmative by proposing a novel loss function called the betting loss. Our result is ``variance-adaptive'' in the sense that the bound is attained \textit{without any knowledge about the variance}, which is in contrast to modeling label (or reward) variance or the label distribution itself explicitly as part of the function class such as distributional reinforcement learning.
【15】Evaluation of Neural Surrogates for Physical Modelling Synthesis of Nonlinear Elastic Plates
标题:非线性弹性板物理模型综合的神经代理评价
链接:https://arxiv.org/abs/2507.12563
作者: La Vega Martin, Rodrigo Diaz Fernandez, Mark Sandler
摘要:物理建模合成旨在从振动结构的物理模拟生成音频。薄弹性板是鼓式膜的常见模型。传统的数值方法,如有限差分和有限元提供了高精度,但计算要求高,限制了它们在实时音频应用中的使用。本文提出了一种基于神经网络的方法来解决非线性弹性板的振动的比较分析。我们评估了几种最先进的模型,在短序列上训练,以自回归方式预测长序列。我们展示了这些模型的一些局限性,以及为什么不足以查看时域中的预测误差。我们讨论了实时音频合成的影响,并提出了未来的方向,以改善神经方法来模拟非线性振动。
摘要:Physical modelling synthesis aims to generate audio from physical simulations of vibrating structures. Thin elastic plates are a common model for drum membranes. Traditional numerical methods like finite differences and finite elements offer high accuracy but are computationally demanding, limiting their use in real-time audio applications. This paper presents a comparative analysis of neural network-based approaches for solving the vibration of nonlinear elastic plates. We evaluate several state-of-the-art models, trained on short sequences, for prediction of long sequences in an autoregressive fashion. We show some of the limitations of these models, and why it is not enough to look at the prediction error in the time domain. We discuss the implications for real-time audio synthesis and propose future directions for improving neural approaches to model nonlinear vibration.
【16】Can Mental Imagery Improve the Thinking Capabilities of AI Systems?
标题:心理意象能否提高人工智能系统的思维能力?
链接:https://arxiv.org/abs/2507.12555
作者:arabi
备注:15 pages, 8 figures
摘要:虽然现有的模型可以与人类互动并提供令人满意的响应,但它们缺乏自主行动或独立推理的能力。此外,这些模型中的输入数据通常作为显式查询提供,即使已经获取了一些传感数据。 此外,人工智能代理是一种计算实体,旨在根据其编程,数据输入和学习知识自主执行任务和做出决策,已经取得了重大进展。然而,与人类不同的是,它们难以整合多个领域的知识。 心理意象在大脑的思维过程中发挥着重要作用,这涉及到基于内部多感官数据,计划行动,需求和推理能力执行任务。在本文中,我们研究如何将心理意象整合到机器思维框架中,以及这如何有利于启动思维过程。我们提出的机器思维框架集成了一个认知思维单元,由三个辅助单元支持:输入数据单元,需求单元和心理意象单元。在这个框架内,数据被表示为自然语言句子或绘制的草图,用于信息和决策目的。我们对这个框架进行了验证测试,并对结果进行了介绍和讨论。
摘要:Although existing models can interact with humans and provide satisfactory responses, they lack the ability to act autonomously or engage in independent reasoning. Furthermore, input data in these models is typically provided as explicit queries, even when some sensory data is already acquired. In addition, AI agents, which are computational entities designed to perform tasks and make decisions autonomously based on their programming, data inputs, and learned knowledge, have shown significant progress. However, they struggle with integrating knowledge across multiple domains, unlike humans. Mental imagery plays a fundamental role in the brain's thinking process, which involves performing tasks based on internal multisensory data, planned actions, needs, and reasoning capabilities. In this paper, we investigate how to integrate mental imagery into a machine thinking framework and how this could be beneficial in initiating the thinking process. Our proposed machine thinking framework integrates a Cognitive thinking unit supported by three auxiliary units: the Input Data Unit, the Needs Unit, and the Mental Imagery Unit. Within this framework, data is represented as natural language sentences or drawn sketches, serving both informative and decision-making purposes. We conducted validation tests for this framework, and the results are presented and discussed.
【17】The Serial Scaling Hypothesis
标题:序列尺度假说
链接:https://arxiv.org/abs/2507.12549
作者: Konpat Preechakul, Kananart Kuwaranancharoen, Yutong Bai
备注:28 pages (13 pages main text + appendices & references), 8 figures, equal-contribution first authors
摘要:虽然机器学习已经通过大规模并行化取得了进展,但我们发现了一个关键的盲点:一些问题从根本上是连续的。这些“固有的串行”问题--从数学推理到物理模拟再到顺序决策--需要无法并行化的相关计算步骤。从复杂性理论中,我们正式区分,并表明,目前的并行为中心的架构面临着根本的限制,这样的任务。我们认为,认识到计算的串行性对机器学习,模型设计,硬件开发具有深远的影响。随着人工智能处理越来越复杂的推理,有意扩展串行计算-而不仅仅是并行计算-对于持续进步至关重要。
摘要:While machine learning has advanced through massive parallelization, we identify a critical blind spot: some problems are fundamentally sequential. These "inherently serial" problems -- from mathematical reasoning to physical simulations to sequential decision-making -- require dependent computational steps that cannot be parallelized. Drawing from complexity theory, we formalize this distinction and demonstrate that current parallel-centric architectures face fundamental limitations on such tasks. We argue that recognizing the serial nature of computation holds profound implications for machine learning, model design, and hardware development. As AI tackles increasingly complex reasoning, deliberately scaling serial computation -- not just parallel computation -- is essential for continued progress.
【18】Perfect diffusion is $\mathsf{TC}^0$ -- Bad diffusion is Turing-complete
链接:https://arxiv.org/abs/2507.12469
作者:
备注:7 pages
摘要:本文探讨了基于扩散的语言建模的计算复杂性。我们证明了一个二分法的基础上的质量的分数匹配网络的扩散模型。在一个方向上,精确计算某些初始分布的得分函数的网络只能在$\mathsf{TC}^0$复杂性类内执行语言建模,这反映了与快速收敛相关的限制。在另一个方向上,我们证明了如果不要求网络匹配任何得分函数,那么扩散模型在某种意义上可以模拟任何图灵机。这种二分法提供了一个理论镜头的扩散模型的能力和局限性,特别是关于需要顺序计算的任务。我们猜想我们的理论结果的扩展,包括扩散模型是不完美的情况下,但仅仅是好的。我们还讨论了更广泛的背景和实际意义,并假设可以在顺序和并行操作模式之间进行插值的机器学习架构将优于Transformers和扩散模型。
摘要:This paper explores the computational complexity of diffusion-based language modeling. We prove a dichotomy based on the quality of the score-matching network in a diffusion model. In one direction, a network that exactly computes the score function of some initial distribution can only perform language modeling within the $\mathsf{TC}^0$ complexity class, reflecting limitations tied to rapid convergence. In the other direction, we show that if there is no requirement for the network to match any score function, then diffusion modeling can simulate any Turing machine in a certain sense. This dichotomy provides a theoretical lens on the capabilities and limitations of diffusion models, particularly concerning tasks requiring sequential computation. We conjecture extensions of our theoretical results, including for the case where the diffusion model is not perfect, but merely good. We also discuss the wider context and practical implications, and hypothesize that a machine learning architecture that can interpolate between sequential and parallel modes of operation would be superior to both Transformers and diffusion models.
【19】Relation-Aware Slicing in Cross-Domain Alignment
标题:跨域对齐中的关系感知切片
链接:https://arxiv.org/abs/2507.13194
作者:kar, Aprameyo Chakrabartty, Anish Chakrabarty, Swagatam Das
摘要:切片Gromov-Wasserstein(SGW)距离旨在减轻求解非凸二次规划(即Gromov-Wasserstein距离)的计算成本,它利用从单位超球面均匀采样的投影方向。这种切片机制会因无信息的方向而产生不必要的计算成本,这也影响了该距离的表示能力。然而,在投影方向上找到更合适的分布(切片分布)本身通常就是一个优化问题,有其自身的计算成本。此外,对于更复杂的分布,采样本身可能代价高昂。作为补救措施,我们提出了一种无需优化的切片分布,可为蒙特卡罗近似提供快速采样。为此,我们引入了关系感知投影方向(RAPD),它能有效捕捉两对随机向量中每一对的成对关联,其中每个向量都服从各自的环境分布律。这使我们能够推导出关系感知切片分布(RASD),即与采样的RAPD相对应的位置-尺度分布律。最后,我们介绍了RASGW距离及其变体,例如IWRASGW(重要性加权RASGW),它们克服了SGW的缺点。我们从理论上分析了它的性质,并通过在各种对齐任务上的大量实验证实了它的实证优势。
摘要:The Sliced Gromov-Wasserstein (SGW) distance, aiming to relieve the computational cost of solving a non-convex quadratic program that is the Gromov-Wasserstein distance, utilizes projecting directions sampled uniformly from unit hyperspheres. This slicing mechanism incurs unnecessary computational costs due to uninformative directions, which also affects the representative power of the distance. However, finding a more appropriate distribution over the projecting directions (slicing distribution) is often an optimization problem in itself that comes with its own computational cost. In addition, with more intricate distributions, the sampling itself may be expensive. As a remedy, we propose an optimization-free slicing distribution that provides fast sampling for the Monte Carlo approximation. We do so by introducing the Relation-Aware Projecting Direction (RAPD), effectively capturing the pairwise association of each of two pairs of random vectors, each following their ambient law. This enables us to derive the Relation-Aware Slicing Distribution (RASD), a location-scale law corresponding to sampled RAPDs. Finally, we introduce the RASGW distance and its variants, e.g., IWRASGW (Importance Weighted RASGW), which overcome the shortcomings experienced by SGW. We theoretically analyze its properties and substantiate its empirical prowess using extensive experiments on various alignment tasks.
机器翻译由腾讯交互翻译提供,仅供参考