Click "Read the original" to visit arxivdaily.com, covering CS, Physics, Math, Economics, Statistics, Finance, Biology, and Electrical Engineering, with search, bookmarking, and more!
cs.LG: 144 papers today
Large models / LLMs (19 papers)
【1】TokenPowerBench: Benchmarking the Power Consumption of LLM Inference
链接: https://arxiv.org/abs/2512.03024
作者: Chenxu Niu, Wei Zhang, Jie Li, Yongjian Zhao, Tongyang Wang, Xi Wang, Yong Chen
Note: Accepted by the AAAI'26 Conference Main Track
Abstract: Large language model (LLM) services now answer billions of queries per day, and industry reports show that inference, not training, accounts for more than 90% of total power consumption. However, existing benchmarks focus on either training/fine-tuning or inference performance, and provide little support for measuring and analyzing inference power consumption. We introduce TokenPowerBench, the first lightweight and extensible benchmark designed for LLM-inference power consumption studies. The benchmark combines (i) a declarative configuration interface covering model choice, prompt set, and inference engine, (ii) a measurement layer that captures GPU-, node-, and system-level power without specialized power meters, and (iii) a phase-aligned metrics pipeline that attributes energy to the prefill and decode stages of every request. These elements make it straightforward to explore the power consumed by an LLM inference run; furthermore, by varying batch size, context length, parallelism strategy, and quantization, users can quickly assess how each setting affects joules per token and other energy-efficiency metrics. We evaluate TokenPowerBench on four of the most widely used model series (Llama, Falcon, Qwen, and Mistral). Our experiments cover models from 1 billion parameters up to the frontier-scale Llama3-405B. Furthermore, we release TokenPowerBench as open source to help users measure power consumption, forecast operating expenses, and meet sustainability targets when deploying LLM services.
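As an illustration of the benchmark's headline metric, joules per token can be estimated from timestamped power samples by trapezoidal integration. The helper below is a minimal sketch with hypothetical inputs, not TokenPowerBench's actual code:

```python
# Minimal sketch (hypothetical helper, not TokenPowerBench's code):
# estimate joules per token from timestamped GPU power readings.
def joules_per_token(samples, tokens_generated):
    """samples: list of (t_seconds, watts) pairs, sorted by time.
    Integrates power over time with the trapezoidal rule."""
    energy_j = sum(
        0.5 * (p0 + p1) * (t1 - t0)
        for (t0, p0), (t1, p1) in zip(samples, samples[1:])
    )
    return energy_j / tokens_generated

# 10 s at a steady 300 W while decoding 500 tokens:
# 3000 J total, i.e. 6 J per token.
samples = [(0.0, 300.0), (5.0, 300.0), (10.0, 300.0)]
print(joules_per_token(samples, 500))  # 6.0
```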
【2】Distribution-Calibrated Inference time compute for Thinking LLM-as-a-Judge
链接: https://arxiv.org/abs/2512.03019
作者: Hamid Dadkhahi, Firas Trabelsi, Parker Riley, Juraj Juraska, Mehdi Mirzazadeh
Abstract: Thinking Large Language Models (LLMs) used as judges for pairwise preferences remain noisy at the single-sample level, and common aggregation rules (majority vote, soft self-consistency, or instruction-based self-aggregation) are inconsistent when ties are allowed. We study inference-time compute (ITC) for evaluators that generate n independent thinking-rating samples per item, and propose a principled, distribution-calibrated aggregation scheme. Our method models three-way preferences with a Bradley-Terry-Davidson formulation on rating counts, leveraging both polarity (margin among non-ties) and decisiveness (non-tie rate) to distinguish narrow margins from strong consensus. Across various evaluation benchmarks, our approach consistently reduces MAE and increases pairwise accuracy versus standard baselines, and, when evaluated against human-consensus meta-labels, matches or exceeds individual human raters. These results show that carefully allocating ITC and aggregating with distribution-aware methods turns noisy individual model judgments into reliable ratings for evaluation.
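For context, Davidson's standard extension of Bradley-Terry to win/tie/loss preferences (shown here as background; the paper's exact parameterization over rating counts may differ) assigns, for items $i$ and $j$ with positive strengths $\pi_i, \pi_j$ and tie parameter $\nu \ge 0$:

```latex
P(i \succ j) = \frac{\pi_i}{\pi_i + \pi_j + \nu\sqrt{\pi_i \pi_j}},
\qquad
P(i \text{ ties } j) = \frac{\nu\sqrt{\pi_i \pi_j}}{\pi_i + \pi_j + \nu\sqrt{\pi_i \pi_j}}
```

A larger $\nu$ raises the tie probability between closely matched items, which is what lets an aggregator separate decisiveness (the non-tie rate) from polarity (the margin among non-ties).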
【3】Martingale Score: An Unsupervised Metric for Bayesian Rationality in LLM Reasoning
链接: https://arxiv.org/abs/2512.02914
作者: Zhonghao He, Tianyi Qiu, Hirokazu Shirado, Maarten Sap
Note: NeurIPS 2025
Abstract: Recent advances in reasoning techniques have substantially improved the performance of large language models (LLMs), raising expectations for their ability to provide accurate, truthful, and reliable information. However, emerging evidence suggests that iterative reasoning may foster belief entrenchment and confirmation bias, rather than enhancing truth-seeking behavior. In this study, we propose a systematic evaluation framework for belief entrenchment in LLM reasoning by leveraging the Martingale property from Bayesian statistics. This property implies that, under rational belief updating, the expected value of future beliefs should remain equal to the current belief, i.e., belief updates are unpredictable from the current belief. We propose the unsupervised, regression-based Martingale Score to measure violations of this property, which signal a departure from Bayesian updating on new evidence. In open-ended problem domains including event forecasting, value-laden questions, and academic paper review, we find such violations to be widespread across models and setups, with the current belief positively predicting future belief updates, a phenomenon we term belief entrenchment. We identify the models, reasoning techniques, and domains more prone to belief entrenchment. Finally, we validate the Martingale Score by showing that it predicts ground-truth accuracy on problem domains where ground-truth labels are available. This indicates that, while designed as an unsupervised metric that operates even in domains without access to ground truth, the Martingale Score is a useful proxy for the truth-seeking ability of a reasoning process.
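To make the regression-based test concrete, a minimal sketch (an assumed form; the paper's actual Martingale Score computation may differ) regresses belief updates on current beliefs; under the martingale property the slope should be near zero, while a positive slope indicates entrenchment:

```python
# Illustrative sketch (assumed form, not the paper's exact estimator):
# under rational Bayesian updating E[b_next | b] = b, so the OLS slope
# of the update (b_next - b) on the current belief b should be ~0.
def martingale_score(belief_pairs):
    """belief_pairs: [(b_t, b_{t+1}), ...] pooled across questions."""
    xs = [b for b, _ in belief_pairs]
    ys = [b_next - b for b, b_next in belief_pairs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# Updates grow with the current belief: positive slope, entrenchment.
pairs = [(0.2, 0.26), (0.5, 0.65), (0.8, 1.04)]
print(round(martingale_score(pairs), 3))  # 0.3
```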
【4】Cross-Lingual Prompt Steerability: Towards Accurate and Robust LLM Behavior across Languages
链接: https://arxiv.org/abs/2512.02841
作者: Lechen Zhang, Yusheng Zhou, Tolga Ergen, Lajanugen Logeswaran, Moontae Lee, David Jurgens
Abstract: System prompts provide a lightweight yet powerful mechanism for conditioning large language models (LLMs) at inference time. While prior work has focused on English-only settings, real-world deployments benefit from a single prompt that operates reliably across languages. This paper presents a comprehensive study of how different system prompts steer models toward accurate and robust cross-lingual behavior. We propose a unified four-dimensional evaluation framework to assess system prompts in multilingual environments. Through large-scale experiments on five languages, three LLMs, and three benchmarks, we find that certain prompt components, such as CoT, emotion, and scenario, correlate with robust multilingual behavior. We develop a prompt optimization framework for multilingual settings and show that it can automatically discover prompts that improve all metrics by 5-10%. Finally, we analyze over 10 million reasoning units and find that more performant system prompts induce more structured and consistent reasoning patterns while reducing unnecessary language switching. Together, these results highlight system prompt optimization as a scalable path to accurate and robust multilingual LLM behavior.
【5】Phase-Adaptive LLM Framework with Multi-Stage Validation for Construction Robot Task Allocation: A Systematic Benchmark Against Traditional Optimization Algorithms
链接: https://arxiv.org/abs/2512.02810
作者: Shyam prasad reddy Kaitha, Hongrui Yu
Abstract: Multi-robot task allocation in construction automation has traditionally relied on optimization methods such as Dynamic Programming and Reinforcement Learning. This research introduces the LangGraph-based Task Allocation Agent (LTAA), an LLM-driven framework that integrates phase-adaptive allocation strategies, multi-stage validation with hierarchical retries, and dynamic prompting for efficient robot coordination. Although recent LLM approaches show potential for construction robotics, they largely lack rigorous validation and benchmarking against established algorithms. This paper presents the first systematic comparison of LLM-based task allocation with traditional methods in construction scenarios. The study validates LLM feasibility through SMART-LLM replication and addresses implementation challenges using a Self-Corrective Agent Architecture. LTAA leverages natural-language reasoning combined with structured validation mechanisms, achieving major computational gains: reducing token usage by 94.6% and allocation time by 86% through dynamic prompting. The framework adjusts its strategy across phases, emphasizing execution feasibility early and workload balance in later allocations. The authors evaluate LTAA against Dynamic Programming, Q-learning, and Deep Q-Network (DQN) baselines using construction operations from the TEACh human-robot collaboration dataset. In the Heavy Excels setting, where robots have strong task specializations, LTAA achieves 77% task completion with superior workload balance, outperforming all traditional methods. These findings show that LLM-based reasoning with structured validation can match established optimization algorithms while offering additional advantages such as interpretability, adaptability, and the ability to update task logic without retraining.
【6】LumiX: Structured and Coherent Text-to-Intrinsic Generation
链接: https://arxiv.org/abs/2512.02781
作者: Xu Han, Biao Zhang, Xiangjun Tang, Xianzhi Li, Peter Wonka
Note: The code will be available at https://github.com/xhanxu/LumiX
Abstract: We present LumiX, a structured diffusion framework for coherent text-to-intrinsic generation. Conditioned on text prompts, LumiX jointly generates a comprehensive set of intrinsic maps (e.g., albedo, irradiance, normal, depth, and final color), providing a structured and physically consistent description of an underlying scene. This is enabled by two key contributions: 1) Query-Broadcast Attention, a mechanism that ensures structural consistency by sharing queries across all maps in each self-attention block; and 2) Tensor LoRA, a tensor-based adaptation that parameter-efficiently models cross-map relations for efficient joint training. Together, these designs enable stable joint diffusion training and unified generation of multiple intrinsic properties. Experiments show that LumiX produces coherent and physically meaningful results, achieving 23% higher alignment and a better preference score (0.19 vs. -0.41) compared to the state of the art, and it can also perform image-conditioned intrinsic decomposition within the same framework.
【7】Emergent Bayesian Behaviour and Optimal Cue Combination in LLMs
链接: https://arxiv.org/abs/2512.02719
作者: Julian Ma, Jun Wang, Zafeirios Fountas
Abstract: Large language models (LLMs) excel at explicit reasoning, but their implicit computational strategies remain underexplored. Decades of psychophysics research show that humans intuitively process and integrate noisy signals using near-optimal Bayesian strategies in perceptual tasks. We ask whether LLMs exhibit similar behaviour and perform optimal multimodal integration without explicit training or instruction. Adopting the psychophysics paradigm, we infer computational principles of LLMs from systematic behavioural studies. We introduce a behavioural benchmark, BayesBench: four magnitude estimation tasks (length, location, distance, and duration) over text and image, inspired by classic psychophysics, and evaluate a diverse set of nine LLMs alongside human judgments for calibration. Through controlled ablations of noise, context, and instruction prompts, we measure performance, behaviour, and efficiency in multimodal cue combination. Beyond accuracy and efficiency metrics, we introduce a Bayesian Consistency Score that detects Bayes-consistent behavioural shifts even when accuracy saturates. Our results show that while capable models often adapt in Bayes-consistent ways, accuracy does not guarantee robustness. Notably, GPT-5 Mini achieves perfect text accuracy but fails to integrate visual cues efficiently. This reveals a critical dissociation between capability and strategy, suggesting that accuracy-centric benchmarks may over-index on performance while missing brittle uncertainty handling. These findings reveal emergent principled handling of uncertainty and highlight the correlation between accuracy and Bayesian tendencies. We release our psychophysics benchmark and consistency metric (https://bayes-bench.github.io) as evaluation tools and to inform future multimodal architecture designs.
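The "near-optimal Bayesian strategies" referenced here follow the classic cue-combination result from psychophysics (standard background, not a formula introduced by this paper): given two unbiased, independent cue estimates $\hat{s}_1, \hat{s}_2$ with noise variances $\sigma_1^2, \sigma_2^2$, the optimal combined estimate weights each cue by its inverse variance:

```latex
\hat{s} = w_1 \hat{s}_1 + w_2 \hat{s}_2,
\qquad
w_i = \frac{1/\sigma_i^2}{1/\sigma_1^2 + 1/\sigma_2^2},
\qquad
\sigma_{\hat{s}}^2 = \frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 + \sigma_2^2}
```

The combined variance is never larger than that of either cue alone, which is the behavioural signature of optimal integration that such a benchmark probes for.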
【8】Hear What Matters! Text-conditioned Selective Video-to-Audio Generation
链接: https://arxiv.org/abs/2512.02650
作者: Junwon Lee, Juhan Nam, Jiyoung Lee
Abstract: This work introduces a new task, text-conditioned selective video-to-audio (V2A) generation, which produces only the user-intended sound from a multi-object video. This capability is especially crucial in multimedia production, where audio tracks are handled individually for each sound source for precise editing, mixing, and creative control. However, current approaches generate single source-mixed sounds at once, largely because visual features are entangled, and region cues or prompts often fail to specify the source. We propose SelVA, a novel text-conditioned V2A model that treats the text prompt as an explicit selector of the target source and modulates the video encoder to distinctly extract prompt-relevant video features. The proposed supplementary tokens promote cross-attention by suppressing text-irrelevant activations with efficient parameter tuning, yielding robust semantic and temporal grounding. SelVA further employs a self-augmentation scheme to overcome the lack of mono audio track supervision. We evaluate SelVA on VGG-MONOAUDIO, a curated benchmark of clean single-source videos for such a task. Extensive experiments and ablations consistently verify its effectiveness across audio quality, semantic alignment, and temporal synchronization. Code and demo are available at https://jnwnlee.github.io/selva-demo/.
【9】In-Context Distillation with Self-Consistency Cascades: A Simple, Training-Free Way to Reduce LLM Agent Costs
链接: https://arxiv.org/abs/2512.02543
作者: Vishnu Sarukkai, Asanshay Gupta, James Hong, Michaël Gharbi, Kayvon Fatahalian
Note: 16 pages, 4 figures
Abstract: The world currently has an abundance of ideas for how to use new LLM agents, and developers seek to rapidly prototype and test new agentic designs. However, executing agents at scale using high-capacity LLMs incurs high inference costs. We propose a simple method for reducing LLM agent inference costs without incurring the development friction costs associated with LLM fine-tuning (long training cycles, optimization hyperparameter tweaking loops) or manual prompt engineering (laborious trial and error). Most importantly, we introduce $\textit{in-context distillation}$, which adapts the idea of knowledge distillation (training a low-cost student model to mimic a high-cost teacher) to an in-context learning setting. Our approach retrieves relevant teacher demonstrations at each agent step and provides them to the student as in-context examples, enabling the student to imitate teacher behavior on-the-fly. We combine in-context distillation with the established idea of $\textit{self-consistency cascades}$ to know when to trust the student. This adaptive strategy realizes the cost benefits of model specialization while preserving the productivity of working with frozen models. On the multi-step embodied reasoning benchmark ALFWorld, our method matches teacher-level accuracy at $\textbf{2.5$\times$ lower cost}$, reducing per-episode costs from \$0.059 to \$0.024. The upfront demonstration cost amortizes after just 843 episodes, yielding cumulative savings exceeding \$34,900 at deployment scale (1M episodes). On AppWorld, a complex agent benchmark requiring multi-step API workflows, we shift the Pareto frontier by achieving a $\textbf{2$\times$ cost reduction}$ at iso-accuracy. By reducing operational costs while maintaining rapid experimentation cycles with frozen models, our approach makes advanced agentic systems economically viable for a broader range of applications.
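The cascade's decision rule can be sketched as follows (assumed logic with a hypothetical agreement threshold, not the authors' released code): sample several answers from the cheap student and escalate to the teacher only when the student's self-consistency is low:

```python
# Minimal sketch of a self-consistency cascade (assumed logic with a
# hypothetical agreement threshold, not the authors' released code).
from collections import Counter

def cascade(student_samples, teacher_fn, threshold=0.6):
    """Accept the student's majority answer if agreement is high enough;
    otherwise fall back to the expensive teacher model."""
    answer, count = Counter(student_samples).most_common(1)[0]
    if count / len(student_samples) >= threshold:
        return answer       # confident student consensus: cheap path
    return teacher_fn()     # low self-consistency: escalate to teacher

print(cascade(["A", "A", "A", "B"], lambda: "T"))  # A  (0.75 >= 0.6)
print(cascade(["A", "B", "C", "D"], lambda: "T"))  # T  (0.25 <  0.6)
```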
【10】A Concise Review of Hallucinations in LLMs and their Mitigation
链接: https://arxiv.org/abs/2512.02527
作者: Parth Pulkundwar, Vivek Dhanawade, Rohit Yadav, Minal Sonkar, Medha Asurlekar, Sarita Rathod
Note: 7 pages
Abstract: Traditional language models face a challenge from hallucinations. Their very presence casts a large, dangerous shadow over the promising realm of natural language processing. It becomes crucial to understand the various kinds of hallucinations that occur nowadays, their origins, and ways of reducing them. This document provides a concise and straightforward summary of that. It serves as a one-stop resource for a general understanding of hallucinations and how to mitigate them.
【11】Guided Self-Evolving LLMs with Minimal Human Supervision
链接: https://arxiv.org/abs/2512.02472
作者: Wenhao Yu, Zhenwen Liang, Chengsong Huang, Kishan Panaganti, Tianqing Fang, Haitao Mi, Dong Yu
Abstract: AI self-evolution has long been envisioned as a path toward superintelligence, where models autonomously acquire, refine, and internalize knowledge from their own learning experiences. Yet in practice, unguided self-evolving systems often plateau quickly or even degrade as training progresses. These failures arise from issues such as concept drift, diversity collapse, and mis-evolution, as models reinforce their own biases and converge toward low-entropy behaviors. To enable models to self-evolve in a stable and controllable manner while minimizing reliance on human supervision, we introduce R-Few, a guided Self-Play Challenger-Solver framework that incorporates lightweight human oversight through in-context grounding and mixed training. At each iteration, the Challenger samples a small set of human-labeled examples to guide synthetic question generation, while the Solver jointly trains on human and synthetic examples under an online, difficulty-based curriculum. Across math and general reasoning benchmarks, R-Few achieves consistent and iterative improvements. For example, Qwen3-8B-Base improves by +3.0 points over R-Zero on math tasks and achieves performance on par with General-Reasoner, despite the latter being trained on 20 times more human data. Ablation studies confirm the complementary contributions of grounded challenger training and curriculum-based solver training, and further analysis shows that R-Few mitigates drift, yielding more stable and controllable co-evolutionary dynamics.
【12】When Refusals Fail: Unstable Safety Mechanisms in Long-Context LLM Agents
链接: https://arxiv.org/abs/2512.02445
作者: Tsimur Hadeliya, Mohammad Ali Jauhar, Nidhi Sakpal, Diogo Cruz
Note: 12 pages, 11 figures. Accepted at AAAI 2026 TrustAgent Workshop
Abstract: Solving complex or long-horizon problems often requires large language models (LLMs) to use external tools and operate over a significantly longer context window. New LLMs enable longer context windows and support tool-calling capabilities. Prior work has focused mainly on evaluating LLMs on long-context prompts, leaving the agentic setup relatively unexplored from both capability and safety perspectives. Our work addresses this gap. We find that LLM agents can be sensitive to the length, type, and placement of context, exhibiting unexpected and inconsistent shifts in task performance and in refusals to execute harmful requests. Models with 1M-2M token context windows show severe degradation already at 100K tokens, with performance drops exceeding 50\% for both benign and harmful tasks. Refusal rates shift unpredictably: GPT-4.1-nano increases from $\sim$5\% to $\sim$40\% while Grok 4 Fast decreases from $\sim$80\% to $\sim$10\% at 200K tokens. Our work shows potential safety issues with agents operating on longer contexts and opens additional questions about the current metrics and paradigms for evaluating LLM agent safety on long multi-step tasks. In particular, our results on LLM agents reveal a notable divergence in both capability and safety performance compared to prior evaluations of LLMs on similar criteria.
【13】Leveraging Large Language Models to Bridge On-chain and Off-chain Transparency in Stablecoins
链接: https://arxiv.org/abs/2512.02418
作者: Yuexin Xiang, Yuchen Lei, SM Mahir Shazeed Rish, Yuanzhe Zhang, Qin Wang, Tsz Hon Yuen, Jiangshan Yu
Abstract: Stablecoins such as USDT and USDC aspire to peg stability by coupling issuance controls with reserve attestations. In practice, however, transparency is split across two worlds: verifiable on-chain traces and off-chain disclosures locked in unstructured text, which remain unconnected. We introduce a large language model (LLM)-based automated framework that bridges these two dimensions by aligning on-chain issuance data with off-chain disclosure statements. First, we propose an integrative framework using LLMs to capture and analyze on- and off-chain data through document parsing and semantic alignment, extracting key financial indicators from issuer attestations and mapping them to corresponding on-chain metrics. Second, we integrate multi-chain issuance records and disclosure documents within a model context protocol (MCP) framework that standardizes LLM access to both quantitative market data and qualitative disclosure narratives. This framework enables unified retrieval and contextual alignment across heterogeneous stablecoin information sources and facilitates consistent analysis. Third, we demonstrate the capability of LLMs to operate across heterogeneous data modalities in blockchain analytics, quantifying discrepancies between reported and observed circulation and examining their implications for cross-chain transparency and price dynamics. Our findings reveal systematic gaps between disclosed and verifiable data, showing that LLM-assisted analysis enhances cross-modal transparency and supports automated, data-driven auditing in decentralized finance (DeFi).
【14】Synthetic Error Injection Fails to Elicit Self-Correction In Language Models
链接: https://arxiv.org/abs/2512.02389
作者: David X. Wu, Shreyas Kapur, Anant Sahai, Stuart Russell
Note: 13 pages, 12 figures
Abstract: Reinforcement learning has become the dominant paradigm for eliciting reasoning and self-correction capabilities in large language models, but its computational expense motivates exploration of alternatives. Inspired by techniques from autonomous driving and robotics, we investigate whether supervised learning with synthetic error injection can induce self-correction abilities in language models. Our approach inserts artificial errors into reasoning chains, masks them, and supervises the model to recognize and correct these mistakes. Despite the intuitive appeal of this method, we find that it fails to significantly improve performance even on simple synthetic tasks across multiple models. Moreover, even when the model catches its own error, it often parrots the original mistake. We find that the distribution shift from synthetic errors to on-policy errors significantly degrades the error-correction capabilities of the fine-tuned model, even with good synthetic coverage of on-policy errors. Our results help explain why on-policy reinforcement learning methods have proven uniquely effective for eliciting self-correction.
【15】See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
链接: https://arxiv.org/abs/2512.02231
作者: Le Thien Phuc Nguyen, Zhuoran Yu, Samuel Low Yu Hang, Subin An, Jeongik Lee, Yohan Ban, SeungEun Chung, Thanh-Huy Nguyen, JuWan Maeng, Soochahn Lee, Yong Jae Lee
Note: preprint
Abstract: Multimodal large language models (MLLMs) are expected to jointly interpret vision, audio, and language, yet existing video benchmarks rarely assess fine-grained reasoning about human speech. Many tasks remain visually solvable or only coarsely evaluate speech, offering limited insight into whether models can align who speaks, what is said, and when it occurs. We introduce AV-SpeakerBench, a curated benchmark of 3,212 multiple-choice questions focused on speaker-centric audiovisual reasoning in real-world videos. It features: (1) a speaker-centered formulation that treats speakers, not scenes, as the core reasoning unit; (2) fusion-grounded question design that embeds audiovisual dependencies into question semantics; and (3) expert-curated annotations ensuring temporal precision and cross-modal validity. Comprehensive evaluations show that the Gemini family consistently outperforms open-source systems, with Gemini 2.5 Pro achieving the best results. Among open models, Qwen3-Omni-30B approaches Gemini 2.0 Flash but remains far behind Gemini 2.5 Pro, primarily due to weaker audiovisual fusion rather than visual perception. We believe AV-SpeakerBench establishes a rigorous foundation for advancing fine-grained audiovisual reasoning in future multimodal systems.
【16】STRIDE: A Systematic Framework for Selecting AI Modalities - Agentic AI, AI Assistants, or LLM Calls
链接: https://arxiv.org/abs/2512.02228
作者: Shubhi Asthana, Bing Zhang, Chad DeLuca, Ruchi Mahindru, Hima Patel
Note: 10 pages, 4 figures, 5 tables. Paper presented at NeurIPS 2025 LAW workshop: Bridging Language, Agent, and World Models
Abstract: The rapid shift from stateless large language models (LLMs) to autonomous, goal-driven agents raises a central question: when is agentic AI truly necessary? While agents enable multi-step reasoning, persistent memory, and tool orchestration, deploying them indiscriminately leads to higher cost, complexity, and risk. We present STRIDE (Systematic Task Reasoning Intelligence Deployment Evaluator), a framework that provides principled recommendations for selecting between three modalities: (i) direct LLM calls, (ii) guided AI assistants, and (iii) fully autonomous agentic AI. STRIDE integrates structured task decomposition, dynamism attribution, and self-reflection requirement analysis to produce an Agentic Suitability Score, ensuring that full agentic autonomy is reserved for tasks with inherent dynamism or evolving context. Evaluated across 30 real-world tasks spanning SRE, compliance, and enterprise automation, STRIDE achieved 92% accuracy in modality selection, reduced unnecessary agent deployments by 45%, and cut resource costs by 37%. Expert validation over six months in SRE and compliance domains confirmed its practical utility, with domain specialists agreeing that STRIDE effectively distinguishes between tasks requiring simple LLM calls, guided assistants, or full agentic autonomy. This work reframes agent adoption as a necessity-driven design decision, ensuring autonomy is applied only when its benefits justify the costs.
【17】Think Before You Prune: Self-Reflective Structured Pruning for Reasoning Language Models
标题:修剪前三思:推理语言模型的自我反思结构化修剪
链接:https://arxiv.org/abs/2512.02185
作者:Ziyan Wang,Enmao Diao,Qi Le,Pu Wang,Guanchu Wang,Minwoo Lee,Shu-ping Yeh,Li Yang
备注:7 pages, 3 figures
摘要:推理LLM(RLM),如OpenAI o1、DeepSeek-R1和Qwen3,通过思维链生成提供强大的多步推理能力,但其庞大的模型规模和冗长的解码输出使其部署成本高昂,不适合资源受限的场景。为了降低计算和内存成本,剪枝通过删除不重要的参数提供了一个有前景的解决方案。然而,尽管在标准LLM上取得了成功,现有的剪枝方法会严重损害RLM:即使是中等稀疏度(例如20%)也可能使准确率崩溃,并完全破坏模型的推理连贯性。我们首先分析了现有剪枝流水线为何在推理LLM上失效,发现其脆弱性主要源于校准数据、剪枝目标与模型解码时推理行为之间的不匹配。我们的研究进一步表明,最可靠的校准信号并非来自人工撰写的标签,而是来自模型自己生成的推理轨迹,它们更准确地反映了模型的推理分布。在这些见解的指导下,我们引入了RESP,一个自反思结构化剪枝框架,它通过自生成校准、仅解码的基于梯度的重要性估计,以及随稀疏度增加保持校准保真度的渐进式再生,使剪枝决策与模型的推理动态保持一致。在Qwen3-8B上的实验表明,RESP在GSM8K和MathQA上明显优于现有的结构化剪枝方法,在20-30%稀疏度下保持接近稠密模型的准确率,并在更高稀疏度下大幅缓解性能崩溃。在40%稀疏度下,RESP在GSM8K上达到81.3%的准确率,在MathQA上达到59.6%,分别超过最强基线66.87%和47%。
摘要:Reasoning LLMs (RLMs) such as OpenAI o1, DeepSeek-R1, and Qwen3 deliver strong multi-step reasoning through chain-of-thought generation, but their large model sizes and lengthy decode-time outputs make them costly to deploy and unsuitable for resource-constrained settings. To reduce computing and memory cost, pruning offers a promising solution by removing unimportant parameters. However, despite their success on standard LLMs, existing pruning methods severely damage RLMs, as even moderate sparsity (e.g., 20%) can collapse accuracy and completely disrupt the model's reasoning coherence. We begin by analyzing why existing pruning pipelines fail on reasoning LLMs and find that their brittleness largely stems from a mismatch between the calibration data, the pruning objective, and the model's decode-time reasoning behavior. Our study further shows that the most reliable calibration signal comes not from human-written labels but from the model's own self-generated reasoning traces, which more accurately reflect its inference distribution. Guided by these insights, we introduce RESP, a self-reflective structured pruning framework that aligns pruning decisions with the model's reasoning dynamics through self-generated calibration, decode-only gradient-based importance estimation, and progressive regeneration that maintains calibration fidelity as sparsity increases. Experiments on Qwen3-8B demonstrate that RESP markedly outperforms existing structured pruning methods on both GSM8K and MathQA, preserving near-dense accuracy at 20-30% sparsity and substantially mitigating performance collapse at higher sparsity levels. At 40% sparsity, RESP attains 81.3% accuracy on GSM8K and 59.6% on MathQA, surpassing the strongest baselines by 66.87% and 47%, respectively.
【18】The 4/$δ$ Bound: Designing Predictable LLM-Verifier Systems for Formal Method Guarantee
标题:4/$δ$界限:为形式化方法保证设计可预测的LLM-验证器系统
链接:https://arxiv.org/abs/2512.02080
作者:Pierre Dantas,Lucas Cordeiro,Youcheng Sun,Waldir Junior
备注:32 pages, 9 figures
摘要:将形式化验证工具与大型语言模型(LLM)结合使用的想法,使软件验证得以超越人工工作流程。然而,目前的方法仍然不可靠。如果没有坚实的理论基础,细化过程可能会漂移:有时它会收敛,有时会循环回退,有时则脱离任何稳定的轨道。这项工作通过建立一个LLM-验证器收敛定理来弥合这一关键差距,提供了首个对终止性和收敛性给出可证明保证的形式化框架。我们将LLM与验证器之间的交互建模为离散时间马尔可夫链,其状态转移由一个关键参数决定:错误消减概率($δ$)。该过程几乎必然到达Verified状态,这表明对于任何$δ > 0$,程序都会终止,且预期迭代次数以$\mathbb{E}[n] \leq 4/δ$为界。随后,我们在一个包含90,000多次试验的大规模实证研究中对这一预测进行了压力测试。实证结果与理论高度一致:每一次运行都达到了验证,收敛因子紧密聚集在$C_f \approx 1.0$附近。因此,该界反映了系统的实际行为。这些证据足以支持将工作流程划分为三个不同的运行区间:边缘区、实用区和高性能区。因此,我们能够有把握地确立设计阈值。理论保证与实验证据共同为LLM辅助验证提供了更清晰的架构基础。系统不再必须进行启发式调优。工程师获得了一个支持可预测资源规划和性能预算的框架,而这正是将这些流水线部署到安全关键软件环境之前所需要的。
摘要:The idea of using Formal Verification tools with large language models (LLMs) has enabled scaling software verification beyond manual workflows. However, current methods remain unreliable. Without a solid theoretical footing, the refinement process can wander; sometimes it settles, sometimes it loops back, and sometimes it breaks away from any stable trajectory. This work bridges this critical gap by developing an LLM-Verifier Convergence Theorem, providing the first formal framework with provable guarantees for termination and convergence. We model the interaction between the LLM and the verifier as a discrete-time Markov Chain, with state transitions determined by a key parameter: the error-reduction probability ($δ$). The procedure reaching the Verified state almost surely demonstrates that the program terminates for any $δ > 0$, with an expected iteration count bounded by $\mathbb{E}[n] \leq 4/δ$. We then stress-tested this prediction in an extensive empirical campaign comprising more than 90,000 trials. The empirical results match the theory with striking consistency. Every single run reached verification, and the convergence factor clustered tightly around $C_f \approx 1.0$. Consequently, the bound mirrors the system's actual behavior. The evidence is sufficiently robust to support dividing the workflow into three distinct operating zones: marginal, practical, and high-performance. Consequently, we establish the design thresholds with absolute confidence. Together, the theoretical guarantee and the experimental evidence provide a clearer architectural foundation for LLM-assisted verification. Heuristic tuning no longer has to be carried out by the system. Engineers gain a framework that supports predictable resource planning and performance budgeting, precisely what is needed before deploying these pipelines into safety-critical software environments.
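As a quick plausibility check on the bound, the toy Monte-Carlo sketch below assumes the simplest instantiation of such a chain: each refinement round independently reaches the Verified state with probability δ (a geometric model, so E[n] = 1/δ, which sits well under the 4/δ bound). The function and parameters are illustrative, not the authors' code:

```python
import random

def rounds_until_verified(delta, rng):
    """Count refinement rounds until the toy chain hits Verified,
    where each round succeeds independently with probability delta."""
    n = 0
    while True:
        n += 1
        if rng.random() < delta:
            return n

rng = random.Random(0)
for delta in (0.1, 0.3, 0.5):
    trials = [rounds_until_verified(delta, rng) for _ in range(20000)]
    mean_n = sum(trials) / len(trials)
    # empirical mean stays under the paper's 4/delta bound
    print(f"delta={delta}: E[n] ~ {mean_n:.2f} <= {4 / delta:.1f}")
    assert mean_n <= 4 / delta
```

The paper's actual Markov chain allows richer state transitions; this sketch only shows why a per-round error-reduction probability translates into a finite expected iteration count.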
【19】Do Large Language Models Walk Their Talk? Measuring the Gap Between Implicit Associations, Self-Report, and Behavioral Altruism
标题:大型语言模型言行一致吗?衡量内隐联想、自我报告和行为利他主义之间的差距
链接:https://arxiv.org/abs/2512.01568
作者:Sandro Andric
备注:14 pages, 7 figures, 7 tables. Code and data available at https://github.com/sandroandric/LLMs_Altruism_Study_Code
摘要:我们研究大型语言模型(LLM)是否表现出利他主义的倾向,关键是,他们的隐式关联和自我报告是否预测实际的利他主义行为。使用受人类社会心理学启发的多方法方法,我们在三种范式下测试了24个前沿LLM:(1)测量内隐利他主义偏见的内隐联想测试(IAT),(2)测量行为利他主义的强制二元选择任务,(3)测量外显利他主义信念的自我评估量表。我们的主要发现是:(1)所有模型都表现出强烈的内隐亲利他偏见(平均IAT = 0.87,p < .0001),证实模型“知道”利他主义是好的。(2)模型的行为比偶然行为更利他(65.6% vs. 50%,p < .0001),但差异很大(48-85%)。(3)内隐关联不能预测行为(r = 0.22,p = 0.29)。(4)最关键的是,模型系统性地高估了自己的利他主义,声称77.5%的利他主义,而实际上只有65.6%(p < .0001,Cohen's d = 1.08)。这种“美德信号缺口”影响了75%的测试模型。基于这些发现,我们推荐校准差距(自我报告值和行为值之间的差异)作为标准化的对齐度量。校准良好的模型更可预测,行为更一致;只有12.5%的模型实现了高度亲社会行为和准确自我认知的理想组合。
摘要:We investigate whether Large Language Models (LLMs) exhibit altruistic tendencies, and critically, whether their implicit associations and self-reports predict actual altruistic behavior. Using a multi-method approach inspired by human social psychology, we tested 24 frontier LLMs across three paradigms: (1) an Implicit Association Test (IAT) measuring implicit altruism bias, (2) a forced binary choice task measuring behavioral altruism, and (3) a self-assessment scale measuring explicit altruism beliefs. Our key findings are: (1) All models show strong implicit pro-altruism bias (mean IAT = 0.87, p < .0001), confirming models "know" altruism is good. (2) Models behave more altruistically than chance (65.6% vs. 50%, p < .0001), but with substantial variation (48-85%). (3) Implicit associations do not predict behavior (r = .22, p = .29). (4) Most critically, models systematically overestimate their own altruism, claiming 77.5% altruism while acting at 65.6% (p < .0001, Cohen's d = 1.08). This "virtue signaling gap" affects 75% of models tested. Based on these findings, we recommend the Calibration Gap (the discrepancy between self-reported and behavioral values) as a standardized alignment metric. Well-calibrated models are more predictable and behaviorally consistent; only 12.5% of models achieve the ideal combination of high prosocial behavior and accurate self-knowledge.
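The recommended Calibration Gap metric is simply the paired difference between self-reported and behavioral altruism, with Cohen's d as the paired effect size. A minimal sketch (the per-model scores below are hypothetical; the study's real data covers 24 models):

```python
# Hypothetical per-model altruism rates (fraction of altruistic choices);
# the study's real data spans 24 frontier models.
self_report = [0.80, 0.75, 0.82, 0.70, 0.78, 0.76]
behavior    = [0.66, 0.60, 0.72, 0.55, 0.68, 0.64]

gaps = [s - b for s, b in zip(self_report, behavior)]
mean_gap = sum(gaps) / len(gaps)
# sample standard deviation of the paired differences
sd_gap = (sum((g - mean_gap) ** 2 for g in gaps) / (len(gaps) - 1)) ** 0.5
cohens_d = mean_gap / sd_gap  # paired-samples effect size
print(f"calibration gap = {mean_gap:.3f}, Cohen's d = {cohens_d:.2f}")
```

A positive mean gap indicates models that claim more altruism than they enact, the "virtue signaling gap" described above.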
Graph相关(图学习|图神经网络|图优化等)(8篇)
【1】GraphMatch: Fusing Language and Graph Representations in a Dynamic Two-Sided Work Marketplace
标题:GraphMatch:在动态双边工作市场中融合语言和图形表示
链接:https://arxiv.org/abs/2512.02849
作者:Mikołaj Sacha,Hammad Jafri,Mattie Terzolo,Ayan Sinha,Andrew Rabinovich
摘要:由于不断发展的内容和交互图,在文本丰富、动态的双边市场中推荐匹配呈现出独特的挑战。我们引入了GraphMatch,这是一个新的大规模推荐框架,它将预训练的语言模型与图神经网络融合在一起,以克服这些挑战。与之前以独立模型为中心的方法不同,GraphMatch是一个基于强大的文本编码器和GNN协同工作的综合方法。它采用对抗性负采样以及时间点子图训练来学习表示,这些表示既捕获了不断变化的文本的细粒度语义,又捕获了图的时间敏感结构。我们广泛地评估了来自Upwork(一个领先的劳动力市场)的大规模交互数据,并讨论了适合实时使用的低延迟推理方法。在我们的实验中,GraphMatch在匹配任务上优于仅语言和仅图形基线,同时在运行时高效。这些结果表明,统一的语言和图形表示为文本丰富的动态双边推荐提供了一个高效的解决方案,在实践中弥合了强大的预训练LM和大规模图形之间的差距。
摘要:Recommending matches in a text-rich, dynamic two-sided marketplace presents unique challenges due to evolving content and interaction graphs. We introduce GraphMatch, a new large-scale recommendation framework that fuses pre-trained language models with graph neural networks to overcome these challenges. Unlike prior approaches centered on standalone models, GraphMatch is a comprehensive recipe built on powerful text encoders and GNNs working in tandem. It employs adversarial negative sampling alongside point-in-time subgraph training to learn representations that capture both the fine-grained semantics of evolving text and the time-sensitive structure of the graph. We evaluated extensively on interaction data from Upwork, a leading labor marketplace, at large scale, and discuss our approach towards low-latency inference suitable for real-time use. In our experiments, GraphMatch outperforms language-only and graph-only baselines on matching tasks while being efficient at runtime. These results demonstrate that unifying language and graph representations yields a highly effective solution to text-rich, dynamic two-sided recommendations, bridging the gap between powerful pretrained LMs and large-scale graphs in practice.
【2】Credal Graph Neural Networks
标题:Credal图神经网络
链接:https://arxiv.org/abs/2512.02722
作者:Matteo Tolloso,Davide Bacciu
摘要:不确定性量化对于部署可靠的图神经网络(GNN)至关重要,而现有方法主要依赖于贝叶斯推理或集成。在本文中,我们介绍了首个credal图神经网络(CGNN),它通过训练GNN以credal集的形式输出集值预测,将credal学习扩展到图域。为了顾及GNN中消息传递的独特性质,我们开发了一种利用逐层信息传播不同方面的互补credal学习方法。我们在分布外条件下的节点分类不确定性量化任务上评估了我们的方法。我们的分析强调了图同质性(homophily)假设在决定不确定性估计有效性方面的关键作用。大量实验表明,CGNN提供了更可靠的认知不确定性表示,并在异质性(heterophilic)图上的分布偏移下实现了最先进的性能。
摘要:Uncertainty quantification is essential for deploying reliable Graph Neural Networks (GNNs), where existing approaches primarily rely on Bayesian inference or ensembles. In this paper, we introduce the first credal graph neural networks (CGNNs), which extend credal learning to the graph domain by training GNNs to output set-valued predictions in the form of credal sets. To account for the distinctive nature of message passing in GNNs, we develop a complementary approach to credal learning that leverages different aspects of layer-wise information propagation. We assess our approach on uncertainty quantification in node classification under out-of-distribution conditions. Our analysis highlights the critical role of the graph homophily assumption in shaping the effectiveness of uncertainty estimates. Extensive experiments demonstrate that CGNNs deliver more reliable representations of epistemic uncertainty and achieve state-of-the-art performance under distributional shift on heterophilic graphs.
【3】FGC-Comp: Adaptive Neighbor-Grouped Attribute Completion for Graph-based Anomaly Detection
标题:FGC-Comp:用于基于图的异常检测的自适应邻居分组属性补全
链接:https://arxiv.org/abs/2512.02705
作者:Junpeng Wu,Pinheng Zong
备注:6 pages, 2 figures
摘要:近年来,基于图的异常检测模型得到了广泛的采用,通过聚合邻居信息来识别可疑节点。然而,大多数现有研究忽略了节点属性缺失和被对抗性掩盖这一普遍问题,而它们可能破坏聚合的稳定性和预测的可靠性。为了缓解这一问题,我们提出了FGC-Comp:一个轻量级、与分类器无关且易于部署的属性补全模块,旨在增强不完整属性下的邻域聚合。我们将每个节点的邻居划分为三个基于标签的组,对有标签的组应用组特定的变换,由节点条件门处理标签未知的邻居,通过残差连接融合消息,并以二分类目标进行端到端训练,以提高缺失属性下的聚合稳定性和预测可靠性。在两个真实的欺诈数据集上的实验验证了该方法的有效性,且计算开销可以忽略不计。
摘要:Graph-based Anomaly Detection models have gained widespread adoption in recent years, identifying suspicious nodes by aggregating neighborhood information. However, most existing studies overlook the pervasive issues of missing and adversarially obscured node attributes, which can undermine aggregation stability and prediction reliability. To mitigate this, we propose FGC-Comp, a lightweight, classifier-agnostic, and deployment-friendly attribute completion module-designed to enhance neighborhood aggregation under incomplete attributes. We partition each node's neighbors into three label-based groups, apply group-specific transforms to the labeled groups while a node-conditioned gate handles unknowns, fuse messages via residual connections, and train end-to-end with a binary classification objective to improve aggregation stability and prediction reliability under missing attributes. Experiments on two real-world fraud datasets validate the effectiveness of the approach with negligible computational overhead.
【4】Graph VQ-Transformer (GVT): Fast and Accurate Molecular Generation via High-Fidelity Discrete Latents
标题:图VQ-Transformer(GVT):通过高保真离散潜变量快速准确地生成分子
链接:https://arxiv.org/abs/2512.02667
作者:Haozhuo Zheng,Cheng Wang,Yang Liu
摘要:从头生成具有所需性质的分子是一个关键挑战:扩散模型计算密集,而自回归模型则受误差传播困扰。在这项工作中,我们介绍了图VQ-Transformer(GVT),一个兼具高精度和高效率的两阶段生成框架。我们方法的核心是一种新的图矢量量化变分自动编码器(VQ-VAE),它将分子图压缩成高保真的离散潜在序列。通过将图Transformer与规范的反向Cuthill-McKee(RCM)节点排序和旋转位置嵌入(RoPE)协同组合,我们的VQ-VAE实现了近乎完美的重建率。然后,在这些离散潜变量上训练自回归Transformer,有效地将图生成转换为结构良好的序列建模问题。至关重要的是,这种从复杂图到高保真离散序列的映射将分子设计与大规模序列建模的强大范式联系起来,从而释放出与大型语言模型(LLM)的潜在协同作用。大量实验表明,GVT在ZINC250k、MOSES和GuacaMol等主要基准测试中达到了最先进或极具竞争力的性能,并且在FCD和KL散度等关键分布相似性度量上明显优于领先的扩散模型。凭借其卓越的性能、效率和架构新颖性,GVT不仅为扩散模型提供了一个引人注目的替代方案,而且为该领域建立了一个强大的新基线,为未来离散潜空间分子生成的研究铺平了道路。
摘要:The de novo generation of molecules with desirable properties is a critical challenge, where diffusion models are computationally intensive and autoregressive models struggle with error propagation. In this work, we introduce the Graph VQ-Transformer (GVT), a two-stage generative framework that achieves both high accuracy and efficiency. The core of our approach is a novel Graph Vector Quantized Variational Autoencoder (VQ-VAE) that compresses molecular graphs into high-fidelity discrete latent sequences. By synergistically combining a Graph Transformer with canonical Reverse Cuthill-McKee (RCM) node ordering and Rotary Positional Embeddings (RoPE), our VQ-VAE achieves near-perfect reconstruction rates. An autoregressive Transformer is then trained on these discrete latents, effectively converting graph generation into a well-structured sequence modeling problem. Crucially, this mapping of complex graphs to high-fidelity discrete sequences bridges molecular design with the powerful paradigm of large-scale sequence modeling, unlocking potential synergies with Large Language Models (LLMs). Extensive experiments show that GVT achieves state-of-the-art or highly competitive performance across major benchmarks like ZINC250k, MOSES, and GuacaMol, and notably outperforms leading diffusion models on key distribution similarity metrics such as FCD and KL Divergence. With its superior performance, efficiency, and architectural novelty, GVT not only presents a compelling alternative to diffusion models but also establishes a strong new baseline for the field, paving the way for future research in discrete latent-space molecular generation.
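The canonical RCM ordering that GVT applies before serializing a graph can be sketched in plain Python (an illustrative BFS-based version, not the authors' implementation; in practice one would typically call `scipy.sparse.csgraph.reverse_cuthill_mckee`):

```python
from collections import deque

def reverse_cuthill_mckee(adj):
    """BFS-based Cuthill-McKee ordering, reversed: start each component
    at a minimum-degree node and visit neighbors in increasing degree.
    `adj` maps each node to the set of its neighbors."""
    visited, order = set(), []
    for start in sorted(adj, key=lambda v: (len(adj[v]), v)):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in sorted(adj[v] - visited, key=lambda u: (len(adj[u]), u)):
                visited.add(w)
                queue.append(w)
    return order[::-1]  # reverse of the Cuthill-McKee order

# Toy "molecule": a 6-ring (nodes 0..5) with one substituent (node 6).
ring = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
ring[6] = {0}
ring[0] = ring[0] | {6}
perm = reverse_cuthill_mckee(ring)
print(perm)
```

Because the ordering depends only on graph structure, two isomorphic molecular graphs with matching tie-breaking yield the same token sequence, which is what makes the downstream sequence modeling well-posed.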
【5】Sampling on Metric Graphs
标题:度量图上的采样
链接:https://arxiv.org/abs/2512.02175
作者:Rajat Vadiraj Dwaraknath,Lexing Ying
摘要:度量图是将标准图中的边与实线段相关联、并在图的顶点处将这些线段粘合而得到的结构。所得结构具有自然的度量,使得可以在图上研究微分算子和随机过程。这些区域中的布朗运动已借助其生成元得到了广泛的理论研究,但针对模拟这些过程的实用算法的工作较少。我们通过对相应随机微分方程进行基于时间步分裂的欧拉-丸山离散化,提出了首个在度量图上模拟布朗运动的算法。将该方案应用于度量图上的Langevin扩散,我们还得到了首个在度量图上采样的算法。我们就算法收敛到底层随机过程所需的时间步分裂数量给出了理论保证,并证明当时间步趋于零时,模拟粒子的退出概率收敛到底层随机微分方程的顶点-边跳跃概率。最后,由于该方法高度可并行,我们以自定义CUDA内核的形式提供了快速且内存友好的实现,在简单的星形度量图上比使用PyTorch的GPU实现快约8000倍。在简单星形图之外,我们还在从DuMuX组织灌注模型中提取的真实皮质血管网络上针对示踪剂运输对算法进行了基准测试。我们的算法能够以明显大于DuMuX所用有限体积法稳定极限的时间步运行稳定的模拟,同时实现高达约1500倍的加速。
摘要:Metric graphs are structures obtained by associating edges in a standard graph with segments of the real line and gluing these segments at the vertices of the graph. The resulting structure has a natural metric that allows for the study of differential operators and stochastic processes on the graph. Brownian motions in these domains have been extensively studied theoretically using their generators. However, less work has been done on practical algorithms for simulating these processes. We introduce the first algorithm for simulating Brownian motions on metric graphs through a timestep splitting Euler-Maruyama-based discretization of their corresponding stochastic differential equation. By applying this scheme to Langevin diffusions on metric graphs, we also obtain the first algorithm for sampling on metric graphs. We provide theoretical guarantees on the number of timestep splittings required for the algorithm to converge to the underlying stochastic process. We also show that the exit probabilities of the simulated particle converge to the vertex-edge jump probabilities of the underlying stochastic differential equation as the timestep goes to zero. Finally, since this method is highly parallelizable, we provide fast, memory-aware implementations of our algorithm in the form of custom CUDA kernels that are up to ~8000x faster than a GPU implementation using PyTorch on simple star metric graphs. Beyond simple star graphs, we benchmark our algorithm on a real cortical vascular network extracted from a DuMuX tissue-perfusion model for tracer transport. Our algorithm is able to run stable simulations with timesteps significantly larger than the stable limit of the finite volume method used in DuMuX while also achieving speedups of up to ~1500x.
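To make the simulation scheme concrete, here is a minimal pure-Python Euler-Maruyama walk on a 3-edge star metric graph. It is an illustrative sketch only (the paper's implementation uses custom CUDA kernels and a timestep-splitting analysis): the hub re-emits the particle onto a uniformly chosen edge, mirroring the equal vertex-edge jump probabilities recovered in the small-timestep limit.

```python
import random

def star_brownian(num_edges=3, dt=1e-4, steps=20000, seed=0):
    """Toy Euler-Maruyama walk on a star metric graph. Position is
    (edge index, distance from the hub); at the hub the walk picks an
    edge uniformly at random, and outer endpoints reflect."""
    rng = random.Random(seed)
    edge, x = 0, 0.5          # start mid-edge
    L = 1.0                   # edge length
    for _ in range(steps):
        x += rng.gauss(0.0, dt ** 0.5)   # Brownian increment
        if x < 0.0:           # crossed the hub: re-emit on a random edge
            edge = rng.randrange(num_edges)
            x = -x
        elif x > L:           # reflect at the outer endpoint
            x = 2 * L - x
    return edge, x

edge, x = star_brownian()
print(edge, round(x, 3))
```

Averaging the terminal `edge` over many seeds would empirically recover the uniform 1/3 jump probability per edge that the paper proves in the limit dt → 0.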
【6】Cross-View Topology-Aware Graph Representation Learning
标题:跨视图拓扑感知图表示学习
链接:https://arxiv.org/abs/2512.02130
作者:Ahmet Sami Korkmaz,Selim Coskunuzer,Md Joshem Uddin
摘要:图分类由于其在化学、社交网络和生物信息学中的应用而获得了极大的关注。虽然图神经网络(GNN)可以有效地捕捉局部结构模式,但它们往往忽略了全局拓扑特征,这些特征对于鲁棒的表示学习至关重要。在这项工作中,我们提出了GraphTCL,这是一个双视图对比学习框架,它将GNN的结构嵌入与来自持久同源性的拓扑嵌入集成在一起。通过对齐这些互补的意见,通过跨视图对比损失,我们的方法提高了表示质量,提高了分类性能。在包括TU和OGB分子图在内的基准数据集上进行的广泛实验表明,GraphTCL始终优于最先进的基线。这项研究强调了拓扑感知的对比学习对于推进图表示方法的重要性。
摘要:Graph classification has gained significant attention due to its applications in chemistry, social networks, and bioinformatics. While Graph Neural Networks (GNNs) effectively capture local structural patterns, they often overlook global topological features that are critical for robust representation learning. In this work, we propose GraphTCL, a dual-view contrastive learning framework that integrates structural embeddings from GNNs with topological embeddings derived from persistent homology. By aligning these complementary views through a cross-view contrastive loss, our method enhances representation quality and improves classification performance. Extensive experiments on benchmark datasets, including TU and OGB molecular graphs, demonstrate that GraphTCL consistently outperforms state-of-the-art baselines. This study highlights the importance of topology-aware contrastive learning for advancing graph representation methods.
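The cross-view alignment described above can be written as a one-directional InfoNCE loss over paired embeddings: row i of the structural view should match row i of the topological view against all other rows. The pure-Python sketch below uses toy vectors, cosine similarity, and temperature tau = 0.5; it is a generic formulation of such a loss, not GraphTCL's exact objective:

```python
import math

def cross_view_infonce(z1, z2, tau=0.5):
    """InfoNCE over paired views: z1[i] (e.g. GNN embedding) is pulled
    toward z2[i] (e.g. persistent-homology embedding) and pushed away
    from z2[j], j != i. Pure-Python, cosine similarity."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)
    n = len(z1)
    loss = 0.0
    for i in range(n):
        sims = [math.exp(cos(z1[i], z2[j]) / tau) for j in range(n)]
        loss += -math.log(sims[i] / sum(sims))
    return loss / n

# Toy aligned views: z2 is a noisy copy of z1.
z1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
z2 = [[0.9, 0.1], [0.1, 0.8], [1.1, 0.9]]
loss = cross_view_infonce(z1, z2)
print(round(loss, 4))
```

For well-aligned views the loss falls below log(n), the value obtained when positives and negatives are indistinguishable.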
【7】HTG-GCL: Leveraging Hierarchical Topological Granularity from Cellular Complexes for Graph Contrastive Learning
标题:HTG-GCL:利用细胞复合体的分层拓扑粒度进行图对比学习
链接:https://arxiv.org/abs/2512.02073
作者:Qirui Ji,Bin Qin,Yifan Jin,Yunze Zhao,Chuxiong Sun,Changwen Zheng,Jianwen Cao,Jiangmeng Li
摘要:图对比学习(GCL)的目的是通过对比共享关键拓扑模式的同一图的不同视图来学习区分性语义不变性。然而,现有的GCL方法与结构增强往往难以识别任务相关的拓扑结构,更不用说适应不同的下游任务所需的不同的粗到细的拓扑粒度。为了解决这个问题,我们引入了分层拓扑粒度图对比学习(HTG-GCL),这是一种新的框架,它利用同一个图的变换来生成多尺度的基于环的细胞复合体,体现了拓扑粒度的概念,从而生成不同的拓扑视图。认识到一定的粒度可能包含误导语义,我们提出了一个多粒度解耦对比,并应用基于不确定性估计的粒度特定的加权机制。在各种基准上的综合实验证明了HTG-GCL的有效性,突出了其在通过层次拓扑信息捕获有意义的图形表示方面的优越性能。
摘要:Graph contrastive learning (GCL) aims to learn discriminative semantic invariance by contrasting different views of the same graph that share critical topological patterns. However, existing GCL approaches with structural augmentations often struggle to identify task-relevant topological structures, let alone adapt to the varying coarse-to-fine topological granularities required across different downstream tasks. To remedy this issue, we introduce Hierarchical Topological Granularity Graph Contrastive Learning (HTG-GCL), a novel framework that leverages transformations of the same graph to generate multi-scale ring-based cellular complexes, embodying the concept of topological granularity, thereby generating diverse topological views. Recognizing that a certain granularity may contain misleading semantics, we propose a multi-granularity decoupled contrast and apply a granularity-specific weighting mechanism based on uncertainty estimation. Comprehensive experiments on various benchmarks demonstrate the effectiveness of HTG-GCL, highlighting its superior performance in capturing meaningful graph representations through hierarchical topological information.
【8】Seizure-NGCLNet: Representation Learning of SEEG Spatial Pathological Patterns for Epileptic Seizure Detection via Node-Graph Dual Contrastive Learning
标题:Seizure-NGCLNet:通过节点图双对比学习进行癫痫发作检测的SEEG空间病理模式表示学习
链接:https://arxiv.org/abs/2512.02028
作者:Yiping Wang,Peiren Wang,Zhenye Li,Fang Liu,Jinguo Huang
摘要:复杂的空间连接模式(如发作间期抑制和发作期传播)使得利用立体定向脑电图(SEEG)和传统机器学习方法精确检测耐药癫痫(DRE)发作十分困难。目前仍存在两个关键挑战:(1)功能连接估计的信噪比低,难以学习与癫痫发作相关的相互作用;(2)空间病理连接模式的专家标签难以获得,同时缺乏对这些模式的表示以改进癫痫发作检测。为了解决这些问题,我们提出了一种新的节点-图双对比学习框架Seizure-NGCLNet,通过学习SEEG发作间期抑制和发作期传播模式来高精度检测DRE癫痫发作。首先,开发了一种由中心性指标引导的自适应图增强策略,以生成与癫痫发作相关的脑网络。其次,集成了一种双对比学习方法,将全局图级对比与局部节点-图对比相结合,以编码空间结构特征和语义致痫特征。第三,通过top-k局部图注意力网络对预训练的嵌入进行微调,以执行最终分类。在来自33名DRE患者的大规模公共SEEG数据集上进行的广泛实验表明,Seizure-NGCLNet达到了最先进的性能,平均准确率为95.93%,灵敏度为96.25%,特异性为94.12%。可视化结果证实,学习到的嵌入清楚地区分了发作期与发作间期状态,反映出与临床机制相对应的抑制和传播模式。这些结果突出了Seizure-NGCLNet学习可解释空间病理模式的能力,同时增强了癫痫发作检测和发作起始区定位。
摘要:Complex spatial connectivity patterns, such as interictal suppression and ictal propagation, complicate accurate drug-resistant epilepsy (DRE) seizure detection using stereotactic electroencephalography (SEEG) and traditional machine learning methods. Two critical challenges remain:(1)a low signal-to-noise ratio in functional connectivity estimates, making it difficult to learn seizure-related interactions; and (2)expert labels for spatial pathological connectivity patterns are difficult to obtain, meanwhile lacking the patterns' representation to improve seizure detection. To address these issues, we propose a novel node-graph dual contrastive learning framework, Seizure-NGCLNet, to learn SEEG interictal suppression and ictal propagation patterns for detecting DRE seizures with high precision. First, an adaptive graph augmentation strategy guided by centrality metrics is developed to generate seizure-related brain networks. Second, a dual-contrastive learning approach is integrated, combining global graph-level contrast with local node-graph contrast, to encode both spatial structural and semantic epileptogenic features. Third, the pretrained embeddings are fine-tuned via a top-k localized graph attention network to perform the final classification. Extensive experiments on a large-scale public SEEG dataset from 33 DRE patients demonstrate that Seizure-NGCLNet achieves state-of-the-art performance, with an average accuracy of 95.93%, sensitivity of 96.25%, and specificity of 94.12%. Visualizations confirm that the learned embeddings clearly separate ictal from interictal states, reflecting suppression and propagation patterns that correspond to the clinical mechanisms. These results highlight Seizure-NGCLNet's ability to learn interpretable spatial pathological patterns, enhancing both seizure detection and seizure onset zone localization.
Transformer(3篇)
【1】ESACT: An End-to-End Sparse Accelerator for Compute-Intensive Transformers via Local Similarity
标题:ESACT:通过局部相似性实现计算密集型Transformer的端到端稀疏加速器
链接:https://arxiv.org/abs/2512.02403
作者:Hongxiang Liu,Zhifang Deng,Tong Pu,Shengli Lu
摘要:Transformers由QKV生成、注意力计算和FFNs组成,由于其出色的性能,已成为各个领域的主导模型。然而,它们的高计算成本阻碍了高效的硬件部署。稀疏性提供了一个有前途的解决方案,然而大多数现有的加速器仅利用注意力的行内稀疏性,而很少考虑行间稀疏性。利用行间稀疏性的方法通常依赖于昂贵的全局相似性估计,这削弱了稀疏性带来的加速收益,并且通常仅对一个或两个Transformer组件应用稀疏性。通过对注意力分布和计算流程的仔细分析,我们观察到局部相似性允许以较低的计算开销实现端到端稀疏加速。基于这一观察,我们提出了ESACT,一个用于计算密集型Transformer的端到端稀疏加速器。ESACT以局部相似性稀疏预测(SPLS)机制为核心,该机制利用HLog量化在QK生成之前准确预测局部注意力稀疏性,从而在所有Transformer组件上实现有效的稀疏性。为了支持高效的硬件实现,我们引入了三项架构创新。在26个基准上的实验结果表明,SPLS在精度损失小于1%的情况下减少了52.03%的总计算量。ESACT实现了3.29 TOPS/W的端到端能效,并将注意力级别的能效分别比SOTA注意力加速器SpAtten和Sanger提高了2.95倍和2.26倍。
摘要:Transformers, composed of QKV generation, attention computation, and FFNs, have become the dominant model across various domains due to their outstanding performance. However, their high computational cost hinders efficient hardware deployment. Sparsity offers a promising solution, yet most existing accelerators exploit only intra-row sparsity in attention, while few consider inter-row sparsity. Approaches leveraging inter-row sparsity often rely on costly global similarity estimation, which diminishes the acceleration benefits of sparsity, and typically apply sparsity to only one or two transformer components. Through careful analysis of the attention distribution and computation flow, we observe that local similarity allows end-to-end sparse acceleration with lower computational overhead. Motivated by this observation, we propose ESACT, an end-to-end sparse accelerator for compute-intensive Transformers. ESACT centers on the Sparsity Prediction with Local Similarity (SPLS) mechanism, which leverages HLog quantization to accurately predict local attention sparsity prior to QK generation, achieving efficient sparsity across all transformer components. To support efficient hardware realization, we introduce three architectural innovations. Experimental results on 26 benchmarks demonstrate that SPLS reduces total computation by 52.03% with less than 1% accuracy loss. ESACT achieves an end-to-end energy efficiency of 3.29 TOPS/W, and improves attention-level energy efficiency by 2.95x and 2.26x over SOTA attention accelerators SpAtten and Sanger, respectively.
【2】Contextual Gating within the Transformer Stack: Synergistic Feature Modulation for Enhanced Lyrical Classification and Calibration
标题:Transformer堆栈内的上下文门控:协同特征调制以增强歌词分类和校准
链接:https://arxiv.org/abs/2512.02053
作者:M. A. Gameiro
摘要:本研究通过将辅助结构特征直接集成到预训练Transformer的自注意力机制中,在用于歌词内容分类的特征融合方面引入了显著的架构进步。我提出了SFL Transformer,这是一种新型的深度学习模型,它利用上下文门控机制(中间SFL)来调制BERT编码器堆栈中的隐藏状态序列,而不是在最终输出层融合特征。这种方法使用低维结构线索(Fstruct)来调节深层的上下文语义特征(Hseq)。该模型被应用于一个源自UMAP降维歌词嵌入的具有挑战性的二分类任务。SFL Transformer实现了0.9910的准确度和0.9910的宏F1分数,显著提高了由先前发布的SFL模型建立的最新技术水平(准确度0.9894)。至关重要的是,这种上下文门控策略保持了出色的可靠性,预期校准误差(ECE = 0.0081)和对数损失(0.0489)较低。这项工作验证了这样一个假设,即在堆栈中间注入辅助上下文是协同结合结构和语义信息的最有效手段,创建了一个具有卓越区分能力和高保真概率估计的模型。
摘要:This study introduces a significant architectural advancement in feature fusion for lyrical content classification by integrating auxiliary structural features directly into the self-attention mechanism of a pre-trained Transformer. I propose the SFL Transformer, a novel deep learning model that utilizes a Contextual Gating mechanism (an Intermediate SFL) to modulate the sequence of hidden states within the BERT encoder stack, rather than fusing features at the final output layer. This approach modulates the deep, contextualized semantic features (Hseq) using low-dimensional structural cues (Fstruct). The model is applied to a challenging binary classification task derived from UMAP-reduced lyrical embeddings. The SFL Transformer achieved an Accuracy of 0.9910 and a Macro F1 score of 0.9910, significantly improving the state-of-the-art established by the previously published SFL model (Accuracy 0.9894). Crucially, this Contextual Gating strategy maintained exceptional reliability, with a low Expected Calibration Error (ECE = 0.0081) and Log Loss (0.0489). This work validates the hypothesis that injecting auxiliary context mid-stack is the most effective means of synergistically combining structural and semantic information, creating a model with both superior discriminative power and high-fidelity probability estimates.
【3】Flexible Gravitational-Wave Parameter Estimation with Transformers
标题:使用Transformer进行灵活的引力波参数估计
链接:https://arxiv.org/abs/2512.02968
作者:Annalena Kofler,Maximilian Dax,Stephen R. Green,Jonas Wildberger,Nihar Gupte,Jakob H. Macke,Jonathan Gair,Alessandra Buonanno,Bernhard Schölkopf
备注:8+11 pages, 3+7 figures
摘要:引力波数据分析依赖于从含噪探测器信号中提取物理信息的准确而高效的方法,但不断提高的观测速率和复杂性构成了日益严峻的挑战。深度学习为传统推断提供了强大的替代方案,但现有的神经模型通常缺乏应对数据分析设置变化的灵活性。这些变化用于适应不完美的观测或专门的检验,可能包括探测器配置、整体频率范围或局部频段切除的变化。我们引入了一种灵活的基于Transformer的架构,并配以一种训练策略,使模型能够在推理时适应不同的分析设置。应用于参数估计,我们证明了单个灵活模型(称为Dingo-T1)可以:(i)在广泛的分析配置下分析来自第三次LIGO-Virgo-KAGRA观测运行的48个引力波事件;(ii)支持对探测器和频率配置如何影响推断后验进行系统研究;(iii)执行探测广义相对论的inspiral-merger-ringdown一致性检验。Dingo-T1还将真实事件的中位样本效率从基线的1.4%提高到4.2%。因此,我们的方法展示了灵活且可扩展的推断,并为处理缺失或不完整数据提供了有原则的框架,这是当前和下一代观测台站的关键能力。
摘要:Gravitational-wave data analysis relies on accurate and efficient methods to extract physical information from noisy detector signals, yet the increasing rate and complexity of observations represent a growing challenge. Deep learning provides a powerful alternative to traditional inference, but existing neural models typically lack the flexibility to handle variations in data analysis settings. Such variations accommodate imperfect observations or are required for specialized tests, and could include changes in detector configurations, overall frequency ranges, or localized cuts. We introduce a flexible transformer-based architecture paired with a training strategy that enables adaptation to diverse analysis settings at inference time. Applied to parameter estimation, we demonstrate that a single flexible model -- called Dingo-T1 -- can (i) analyze 48 gravitational-wave events from the third LIGO-Virgo-KAGRA Observing Run under a wide range of analysis configurations, (ii) enable systematic studies of how detector and frequency configurations impact inferred posteriors, and (iii) perform inspiral-merger-ringdown consistency tests probing general relativity. Dingo-T1 also improves median sample efficiency on real events from a baseline of 1.4% to 4.2%. Our approach thus demonstrates flexible and scalable inference with a principled framework for handling missing or incomplete data -- key capabilities for current and next-generation observatories.
GAN|对抗|攻击|生成相关(5篇)
【1】VLM as Strategist: Adaptive Generation of Safety-critical Testing Scenarios via Guided Diffusion
标题:VLM作为策略师:通过引导扩散自适应生成安全关键测试场景
链接:https://arxiv.org/abs/2512.02844
作者:Xinzheng Wu,Junyi Chen,Naiting Zhong,Yong Shen
备注:25 pages, 9 figures
摘要:自动驾驶系统(ADS)的安全部署依赖于全面的测试和评估。然而,在现实世界中,能够有效暴露系统漏洞的安全关键场景非常稀少。现有的场景生成方法面临的挑战,有效地构建长尾场景,确保保真度,关键性和交互性,而特别是缺乏实时动态响应能力的车辆测试(VUT)。为了解决这些挑战,本文提出了一个安全关键的测试场景生成框架,集成了高层次的语义理解能力的视觉语言模型(VLM)与自适应引导扩散模型的细粒度生成能力。该框架建立了一个三层的层次结构,包括一个战略层的VLM导向的场景生成目标的确定,一个战术层的指导功能制定,和一个操作层的指导扩散执行。我们首先建立一个高质量的基本扩散模型,学习真实驾驶场景的数据分布。接下来,我们设计了一种自适应引导扩散方法,可以在闭环仿真中实时,精确地控制背景车辆(BV)。然后,通过深入的场景理解和风险推理,结合VLM自主生成场景生成目标和指导功能,最终指导扩散模型实现VLM指导的场景生成。实验结果表明,该方法可以有效地生成逼真的,多样的,高度互动的安全关键测试场景。最后,通过实例验证了该方法的适应性和面向VLM-Directed的生成性能。
摘要:The safe deployment of autonomous driving systems (ADSs) relies on comprehensive testing and evaluation. However, safety-critical scenarios that can effectively expose system vulnerabilities are extremely sparse in the real world. Existing scenario generation methods face challenges in efficiently constructing long-tail scenarios that ensure fidelity, criticality, and interactivity, while particularly lacking real-time dynamic response capabilities to the vehicle under test (VUT). To address these challenges, this paper proposes a safety-critical testing scenario generation framework that integrates the high-level semantic understanding capabilities of Vision Language Models (VLMs) with the fine-grained generation capabilities of adaptive guided diffusion models. The framework establishes a three-layer hierarchical architecture comprising a strategic layer for VLM-directed scenario generation objective determination, a tactical layer for guidance function formulation, and an operational layer for guided diffusion execution. We first establish a high-quality fundamental diffusion model that learns the data distribution of real driving scenarios. Next, we design an adaptive guided diffusion method that enables real-time, precise control of background vehicles (BVs) in closed-loop simulation. The VLM is then incorporated to autonomously generate scenario generation objectives and guidance functions through deep scenario understanding and risk reasoning, ultimately guiding the diffusion model to achieve VLM-directed scenario generation. Experimental results demonstrate that the proposed method can efficiently generate realistic, diverse, and highly interactive safety-critical testing scenarios. Furthermore, case studies validate the adaptability and VLM-directed generation performance of the proposed method.
【2】Adversarial Jamming for Autoencoder Distribution Matching
标题:自动编码器分布匹配的对抗干扰
链接:https://arxiv.org/abs/2512.02740
作者:Waleed El-Geresy,Deniz Gündüz
备注:Presented at ICASSP 2024. 5 pages, 3 figures
摘要:我们建议使用对抗性无线干扰来正则化自动编码器的潜在空间,使其匹配对角高斯分布。我们考虑均方误差失真的最小化问题,其中干扰机试图破坏经对抗信道编码并传输的高斯源的恢复。现有理论结果的一个直接推论是:由这样的编码器、其对应的解码器和对抗性干扰机构成的极小极大博弈,其鞍点对应于干扰机输出对角高斯噪声。受此结果启发,我们提出了一种新的潜在空间分布匹配方法,将干扰作为辅助目标,鼓励聚合的潜在后验匹配对角高斯分布。使用这种新技术,我们实现了与标准变分自动编码器和Wasserstein自动编码器相当的分布匹配效果。该方法也可以推广到其他潜在分布。
摘要:We propose the use of adversarial wireless jamming to regularise the latent space of an autoencoder to match a diagonal Gaussian distribution. We consider the minimisation of a mean squared error distortion, where a jammer attempts to disrupt the recovery of a Gaussian source encoded and transmitted over the adversarial channel. A straightforward consequence of existing theoretical results is the fact that the saddle point of a minimax game - involving such an encoder, its corresponding decoder, and an adversarial jammer - consists of diagonal Gaussian noise output by the jammer. We use this result as inspiration for a novel approach to distribution matching in the latent space, utilising jamming as an auxiliary objective to encourage the aggregated latent posterior to match a diagonal Gaussian distribution. Using this new technique, we achieve distribution matching comparable to standard variational autoencoders and to Wasserstein autoencoders. This approach can also be generalised to other latent distributions.
【3】SpecPV: Improving Self-Speculative Decoding for Long-Context Generation via Partial Verification
标题:SpecPV:通过部分验证改进长上下文生成的自推测解码
链接:https://arxiv.org/abs/2512.02337
作者:Zhendong Tan,Xingjun Zhang,Chaoyi Hu,Junjie Peng,Kun Xia
摘要:代码生成、深度推理和长文档理解等任务的需求不断增长,使得长上下文生成成为大型语言模型(LLM)的关键能力。推测解码是加速生成最直接、最有效的方法之一。它遵循draft-verify范式:由轻量级草稿模型提出若干候选令牌,再由目标模型进行验证。然而,我们发现,随着上下文长度的增长,验证成为主要的瓶颈。为了进一步加速长上下文生成中的推测解码,我们引入了SpecPV,这是一种自推测解码方法,它使用部分键值状态(KV)执行快速验证,并定期应用完全验证以消除累积误差。我们在多个长上下文基准和模型上验证SpecPV,包括LLaMA-3.1-8B-Instruct和Qwen3系列。实验结果表明,SpecPV相比标准自回归解码可实现高达6倍的解码加速,且仅有轻微的质量下降。
摘要:Growing demands from tasks like code generation, deep reasoning, and long-document understanding have made long-context generation a crucial capability for large language models (LLMs). Speculative decoding is one of the most direct and effective approaches for accelerating generation. It follows a draft-verify paradigm, where a lightweight draft model proposes several candidate tokens and the target model verifies them. However, we find that as the context length grows, verification becomes the dominant bottleneck. To further accelerate speculative decoding in long-context generation, we introduce SpecPV, a self-speculative decoding approach that performs fast verification using partial key-value states (KV) and periodically applies full verification to eliminate accumulated errors. We validate SpecPV across multiple long-context benchmarks and models, including LLaMA-3.1-8B-Instruct and Qwen3-series. Experimental results show that SpecPV achieves up to 6x decoding speedup over standard autoregressive decoding with minor degradation.
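下面用一个纯Python玩具示例示意SpecPV所基于的draft-verify控制流:草稿模型提出k个候选令牌,便宜的“部分KV”验证器逐个核对,并每隔若干轮用完整验证器复查以消除累积误差。模型调用均为假设的简化桩函数,并非论文实现。

```python
# Hypothetical sketch of SpecPV-style control flow (not the authors' code):
# a draft model proposes tokens, a cheap "partial-KV" verifier checks them,
# and every `full_every` rounds a full verifier re-checks to remove
# accumulated error. Model calls are stubbed with simple callables.

def speculative_generate(draft_step, partial_verify, full_verify,
                         prompt, max_len, k=4, full_every=3):
    """Greedy draft-verify loop with periodic full verification."""
    seq = list(prompt)
    round_idx = 0
    while len(seq) < max_len:
        proposal = draft_step(seq, k)            # k candidate tokens
        verifier = full_verify if round_idx % full_every == 0 else partial_verify
        accepted = []
        for tok in proposal:
            target_tok = verifier(seq + accepted)
            if tok == target_tok:
                accepted.append(tok)             # draft token confirmed
            else:
                accepted.append(target_tok)      # fall back to target token
                break
        seq.extend(accepted)
        round_idx += 1
    return seq[:max_len]

# Toy models: the "target" always continues with seq[-1] + 1; the draft
# agrees except at every 5th position, forcing occasional rejections.
# Both verifiers are the same stub here, purely for illustration.
def target(seq):
    return seq[-1] + 1

def draft(seq, k):
    out, cur = [], list(seq)
    for _ in range(k):
        nxt = cur[-1] + 1 if len(cur) % 5 else cur[-1] + 2
        out.append(nxt)
        cur.append(nxt)
    return out

result = speculative_generate(draft, target, target, [0], max_len=12)
```

无论草稿命中与否,输出始终与目标模型的贪心序列一致;加速来自一次验证可接受多个草稿令牌。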
【4】Adversarial Robustness of Traffic Classification under Resource Constraints: Input Structure Matters
标题:资源约束下流量分类的对抗鲁棒性:输入结构很重要
链接:https://arxiv.org/abs/2512.02276
作者:Adel Chehade,Edoardo Ragusa,Paolo Gastaldo,Rodolfo Zunino
备注:Accepted at the 2025 IEEE International Symposium on Networks, Computers and Communications (ISNCC)
摘要:流量分类(TC)在网络安全中发挥着关键作用,特别是在物联网和嵌入式环境中,检查通常必须在严格的硬件限制下在本地进行。我们使用硬件感知神经架构搜索(HW-NAS)来获得轻量级TC模型,这些模型准确,高效,可在边缘平台上部署。两种输入格式被认为是:一个扁平化的字节序列和一个二维数据包的时间序列,我们研究输入结构如何影响对抗脆弱性时,使用资源受限的模型。针对白盒攻击,特别是快速梯度符号法(FGSM)和投影梯度下降(PGD),评估了鲁棒性。在USTC-TFC2016上,两个HW-NAS模型都实现了超过99%的干净数据准确性,同时保持在65k参数和2M FLOPs范围内。然而,在强度为0.1的扰动下,它们的鲁棒性有所不同:平坦模型保持了85%以上的准确性,而时间序列变量则下降到35%以下。对抗性微调提供了强大的增益,平坦输入准确率超过96%,时间序列变量的鲁棒性恢复超过60个百分点,所有这些都不会影响效率。结果强调了输入结构如何影响对抗脆弱性,并表明即使是紧凑的,资源高效的模型也可以获得强大的鲁棒性,支持它们在安全的基于边缘的TC中的实际部署。
摘要:Traffic classification (TC) plays a critical role in cybersecurity, particularly in IoT and embedded contexts, where inspection must often occur locally under tight hardware constraints. We use hardware-aware neural architecture search (HW-NAS) to derive lightweight TC models that are accurate, efficient, and deployable on edge platforms. Two input formats are considered: a flattened byte sequence and a 2D packet-wise time series; we examine how input structure affects adversarial vulnerability when using resource-constrained models. Robustness is assessed against white-box attacks, specifically Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD). On USTC-TFC2016, both HW-NAS models achieve over 99% clean-data accuracy while remaining within 65k parameters and 2M FLOPs. Yet under perturbations of strength 0.1, their robustness diverges: the flat model retains over 85% accuracy, while the time-series variant drops below 35%. Adversarial fine-tuning delivers robust gains, with flat-input accuracy exceeding 96% and the time-series variant recovering over 60 percentage points in robustness, all without compromising efficiency. The results underscore how input structure influences adversarial vulnerability, and show that even compact, resource-efficient models can attain strong robustness, supporting their practical deployment in secure edge-based TC.
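作为摘要中提到的FGSM攻击的示意,下面给出一个在二特征逻辑回归玩具模型上的纯Python草图(x_adv = x + eps * sign(grad_x loss));模型与数值均为假设,并非论文中针对HW-NAS流量分类器的实验。

```python
# Minimal FGSM sketch on a 2-feature logistic model (illustrative only).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fgsm(x, y, w, b, eps):
    """x_adv = x + eps * sign(d loss / d x) for binary cross-entropy."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    # Analytic gradient: d(loss)/d(x_i) = (p - y) * w_i
    grad = [(p - y) * wi for wi in w]
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]

w, b = [2.0, -1.0], 0.0
x, y = [0.5, 0.2], 1          # correctly classified positive example
x_adv = fgsm(x, y, w, b, eps=0.3)

def predict(x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
```

在该玩具设置下,强度0.3的单步扰动已足以翻转分类结果,对应摘要中“扰动强度0.1下鲁棒性分化”的现象。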
【5】Leveraging generative adversarial networks with spatially adaptive denormalization for multivariate stochastic seismic data inversion
标题:利用具有空间自适应反归一化的生成对抗网络进行多元随机地震数据反演
链接:https://arxiv.org/abs/2512.02863
作者:Roberto Miele,Leonardo Azevedo
摘要:概率地震反演建模通常需要同时预测空间相关的地质非均匀性(例如相)和连续参数(例如岩石与弹性性质)。生成对抗网络(GAN)提供了一种高效的基于训练图像的模拟框架,能够以高精度和相对较低的生成成本再现复杂的地质模型。然而,它们在多变量属性预测的随机地球物理反演中的应用有限,因为表示多个耦合属性需要庞大且不稳定的网络,带来很高的内存与训练开销。GAN的较新变体,即具有空间自适应反归一化的SPADE-GAN,能够将相的空间分布直接条件化于局部概率图。利用这些特性,本文提出了一种迭代地质统计反演算法SPADE-GANInv,将预训练的SPADE-GAN与地质统计模拟相结合,用于从地震数据预测相和多个相关的连续属性。SPADE-GAN被训练以再现真实的相几何形态,而顺序随机联合模拟预测依赖于相的连续属性的空间变异性。在每次迭代中,生成一组地下实现并用于计算合成地震数据;与观测数据相似系数最高的实现被用于更新下一次迭代的地下概率模型。该方法在二维合成场景和实际现场数据上得到验证,目标是从全叠加地震数据预测相、孔隙度和声阻抗。结果表明,该算法能够实现准确的多变量预测,减轻有偏先验数据的影响,并可纳入测井等额外的局部条件约束。
摘要:Probabilistic seismic inverse modeling often requires the prediction of both spatially correlated geological heterogeneities (e.g., facies) and continuous parameters (e.g., rock and elastic properties). Generative adversarial networks (GANs) provide an efficient training-image-based simulation framework capable of reproducing complex geological models with high accuracy and comparably low generative cost. However, their application in stochastic geophysical inversion for multivariate property prediction is limited, as representing multiple coupled properties requires large and unstable networks with high memory and training demands. A more recent variant of GANs with spatially adaptive denormalization (SPADE-GAN) enables the direct conditioning of facies spatial distributions on local probability maps. Leveraging on such features, an iterative geostatistical inversion algorithm is proposed, SPADE-GANInv, integrating a pre-trained SPADE-GAN with geostatistical simulation, for the prediction of facies and multiple correlated continuous properties from seismic data. The SPADE-GAN is trained to reproduce realistic facies geometries, while sequential stochastic co-simulation predicts the spatial variability of the facies-dependent continuous properties. At each iteration, a set of subsurface realizations is generated and used to compute synthetic seismic data. The realizations providing the highest similarity coefficient to the observed data are used to update the subsurface probability models in the next iteration. The method is demonstrated on both 2-D synthetic scenarios and field data, targeting the prediction of facies, porosity, and acoustic impedance from full-stack seismic data. Results show that the algorithm enables accurate multivariate prediction, mitigates the impact of biased prior data, and accommodates additional local conditioning such as well logs.
半/弱/无/有监督|不确定性|主动学习(2篇)
【1】Beyond Paired Data: Self-Supervised UAV Geo-Localization from Reference Imagery Alone
标题:超越配对数据:仅基于参考图像的自监督无人机地理定位
链接:https://arxiv.org/abs/2512.02737
作者:Tristan Amadei,Enric Meinhardt-Llopis,Benedicte Bascle,Corentin Abgrall,Gabriele Facciolo
备注:Accepted at WACV 2026
摘要:在GNSS拒止环境中,基于图像的定位对无人机自主性至关重要。现有的最先进方法依赖于将无人机图像与地理参考卫星图像进行匹配;然而,它们通常需要大规模的无人机-卫星配对数据集进行训练。这类数据获取成本高昂且往往难以获得,限制了其适用性。为应对这一挑战,我们采用一种训练范式,通过直接从卫星视角参考图像中学习,消除训练过程中对无人机图像的需求。这是通过一种专门的增强策略实现的,该策略模拟卫星视图与真实无人机视图之间的视觉域偏移。我们提出了CAEVL,一个旨在利用该范式的高效模型,并在ViLD上对其进行验证;ViLD是我们向社区发布的一个新的、具有挑战性的真实无人机图像数据集。与使用配对数据训练的方法相比,我们的方法取得了具有竞争力的性能,证明了其有效性和强大的泛化能力。
摘要:Image-based localization in GNSS-denied environments is critical for UAV autonomy. Existing state-of-the-art approaches rely on matching UAV images to geo-referenced satellite images; however, they typically require large-scale, paired UAV-satellite datasets for training. Such data are costly to acquire and often unavailable, limiting their applicability. To address this challenge, we adopt a training paradigm that removes the need for UAV imagery during training by learning directly from satellite-view reference images. This is achieved through a dedicated augmentation strategy that simulates the visual domain shift between satellite and real-world UAV views. We introduce CAEVL, an efficient model designed to exploit this paradigm, and validate it on ViLD, a new and challenging dataset of real-world UAV images that we release to the community. Our method achieves competitive performance compared to approaches trained with paired data, demonstrating its effectiveness and strong generalization capabilities.
【2】Uncertainty Reasoning with Photonic Bayesian Machines
标题:利用光子贝叶斯机进行不确定性推理
链接:https://arxiv.org/abs/2512.02217
作者:F. Brückerhoff-Plückelmann,H. Borras,S. U. Hulyal,L. Meyer,X. Ji,J. Hu,J. Sun,B. Klein,F. Ebert,J. Dijkstra,L. McRae,P. Schmidt,T. J. Kippenberg,H. Fröning,W. Pernice
摘要:人工智能(AI)系统越来越多地影响社会的安全关键方面,从医疗诊断到自主移动,使不确定性感知成为可信AI的核心要求。我们提出了一种光子贝叶斯机,利用混沌光源的固有随机性,在贝叶斯神经网络的框架内实现不确定性推理。该模拟处理器具有与PyTorch兼容的1.28 Tbit/s数字接口,可在每次卷积37.5 ps内完成概率卷积处理。我们将该系统用于血细胞显微图像的同时分类与域外检测,并展示了偶然(aleatoric)不确定性与认知(epistemic)不确定性之间的推理。光子贝叶斯机消除了数字系统中伪随机数生成的瓶颈,最大限度地降低了概率模型的采样成本,从而实现高速、可信的AI系统。
摘要:Artificial intelligence (AI) systems increasingly influence safety-critical aspects of society, from medical diagnosis to autonomous mobility, making uncertainty awareness a central requirement for trustworthy AI. We present a photonic Bayesian machine that leverages the inherent randomness of chaotic light sources to enable uncertainty reasoning within the framework of Bayesian Neural Networks. The analog processor features a 1.28 Tbit/s digital interface compatible with PyTorch, enabling probabilistic convolutions processing within 37.5 ps per convolution. We use the system for simultaneous classification and out-of-domain detection of blood cell microscope images and demonstrate reasoning between aleatoric and epistemic uncertainties. The photonic Bayesian machine removes the bottleneck of pseudo random number generation in digital systems, minimizes the cost of sampling for probabilistic models, and thus enables high-speed trustworthy AI systems.
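摘要中区分的偶然(aleatoric)与认知(epistemic)不确定性,常用的一种分解方式是:总预测熵 = 平均逐样本熵(偶然)+ 互信息(认知)。下面的纯Python草图演示这一分解;数据为玩具数值,与论文的光子硬件无关。

```python
# Standard entropy-based uncertainty decomposition over Monte Carlo
# samples of class probabilities (illustrative values, not the paper's).
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def uncertainty_decomposition(prob_samples):
    """Total = aleatoric (mean per-sample entropy) + epistemic (mutual info)."""
    n = len(prob_samples)
    k = len(prob_samples[0])
    mean_p = [sum(s[j] for s in prob_samples) / n for j in range(k)]
    total = entropy(mean_p)
    aleatoric = sum(entropy(s) for s in prob_samples) / n
    epistemic = total - aleatoric
    return total, aleatoric, epistemic

# Agreeing samples -> low epistemic; disagreeing samples -> high epistemic.
agree = [[0.9, 0.1], [0.88, 0.12], [0.92, 0.08]]
disagree = [[0.95, 0.05], [0.05, 0.95]]
_, _, ep_agree = uncertainty_decomposition(agree)
_, _, ep_disagree = uncertainty_decomposition(disagree)
```

样本间相互矛盾的预测(域外输入的典型表现)使认知不确定性升高,而一致但含糊的预测只贡献偶然不确定性。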
迁移|Zero/Few/One-Shot|自适应(10篇)
【1】Adaptive Decentralized Federated Learning for Robust Optimization
标题:用于鲁棒优化的自适应分散联邦学习
链接:https://arxiv.org/abs/2512.02852
作者:Shuyuan Wu,Feifei Wang,Yuan Gao,Hansheng Wang
摘要:在分散式联邦学习(DFL)中,异常客户端(通常由噪声或中毒数据引起)的存在会严重破坏学习过程,并降低模型的整体鲁棒性。以往针对该问题的方法往往需要足够多的正常邻居客户端,或需要关于可靠客户端的先验知识,这降低了DFL的实用性。为解决这些局限,我们提出了一种新的自适应DFL(aDFL)鲁棒估计方法。其核心思想是自适应地调整各客户端的学习率:为可疑客户端分配较小的学习率,为正常客户端分配较大的学习率,aDFL以完全自适应的方式减轻异常客户端对全局模型的负面影响。我们的理论不对邻居节点施加任何苛刻条件,也不需要先验知识。严格的收敛性分析保证了aDFL的oracle性质。大量数值实验证明了aDFL方法的优越性能。
摘要:In decentralized federated learning (DFL), the presence of abnormal clients, often caused by noisy or poisoned data, can significantly disrupt the learning process and degrade the overall robustness of the model. Previous methods on this issue often require a sufficiently large number of normal neighboring clients or prior knowledge of reliable clients, which reduces the practical applicability of DFL. To address these limitations, we develop here a novel adaptive DFL (aDFL) approach for robust estimation. The key idea is to adaptively adjust the learning rates of clients. By assigning smaller rates to suspicious clients and larger rates to normal clients, aDFL mitigates the negative impact of abnormal clients on the global model in a fully adaptive way. Our theory does not put any stringent conditions on neighboring nodes and requires no prior knowledge. A rigorous convergence analysis is provided to guarantee the oracle property of aDFL. Extensive numerical experiments demonstrate the superior performance of the aDFL method.
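aDFL的核心思想是按客户端的可疑程度缩放学习率,可以用如下玩具启发式示意:更新向量偏离邻居逐坐标中位数越远,学习率越小。打分方式纯属假设,并非论文算法。

```python
# Toy heuristic in the spirit of aDFL: shrink a client's learning rate
# the further its update deviates from the coordinate-wise median of
# its neighbours. The scoring rule here is an illustrative assumption.

def adaptive_rates(updates, base_lr=0.1):
    d = len(updates[0])
    med = []
    for j in range(d):
        col = sorted(u[j] for u in updates)
        m = len(col)
        med.append(col[m // 2] if m % 2 else 0.5 * (col[m // 2 - 1] + col[m // 2]))
    rates = []
    for u in updates:
        dev = sum(abs(u[j] - med[j]) for j in range(d))   # L1 deviation
        rates.append(base_lr / (1.0 + dev))               # suspicious -> small rate
    return rates

normal = [[0.1, -0.2], [0.12, -0.18], [0.09, -0.21]]
poisoned = [[5.0, 4.0]]                                    # outlier client
rates = adaptive_rates(normal + poisoned)
```

中毒客户端得到远小于正常客户端的学习率,其更新对全局模型的影响被自动抑制。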
【2】A Comparative Study on How Data Normalization Affects Zero-Shot Generalization in Time Series Foundation Models
标题:时间序列基础模型中数据归一化如何影响零样本泛化的比较研究
链接:https://arxiv.org/abs/2512.02833
作者:Ihab Ahmed,Denis Krompaß,Cheng Feng,Volker Tresp
摘要:我们研究时间序列基础模型(TSFM)的输入归一化方法。虽然归一化在特定数据集的时间序列模型中已得到充分研究,但在泛化至关重要的TSFM中仍被忽视。与文本或图像不同,时间序列数据在不同领域和通道之间表现出显著的尺度变化,加之非平稳性,无论架构多么复杂,都可能损害TSFM的性能。通过对四种架构各异的TSFM进行系统评估,我们通过实验确定REVIN是最有效的方法:相对于未归一化的基线,零样本MASE降低了89%,相比其他归一化方法降低了44%,同时在无需任何数据集级预处理的情况下达到最佳域内精度(0.84 MASE),从而获得最高的精度-效率权衡。然而,其效果的发挥取决于架构设计选择和优化目标,特别是训练损失的尺度敏感性和模型类型(概率型、点预测型或基于LLM的模型)。
摘要:We investigate input normalization methods for Time-Series Foundation Models (TSFMs). While normalization is well-studied in dataset-specific time-series models, it remains overlooked in TSFMs where generalization is critical. Time-series data, unlike text or images, exhibits significant scale variation across domains and channels, coupled with non-stationarity, can undermine TSFM performance regardless of architectural complexity. Through systematic evaluation across four architecturally diverse TSFMs, we empirically establish REVIN as the most efficient approach, reducing zero-shot MASE by 89\% relative to an un-normalized baseline and by 44\% versus other normalization methods, while matching the best in-domain accuracy (0.84 MASE) without any dataset-level preprocessing -- yielding the highest accuracy-efficiency trade-off. Yet its effect utilization depends on architectural design choices and optimization objective, particularly with respect to training loss scale sensitivity and model type (probabilistic, point-forecast, or LLM-based models).
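摘要中评估的REVIN(可逆实例归一化)的基本思想是:用输入窗口自身的均值与标准差归一化、预测、再反归一化。下面是一个纯Python的示意性封装,其中的朴素预测器为假设的占位模型。

```python
# Sketch of reversible instance normalization (REVIN-style) around an
# arbitrary forecaster; the naive forecaster below is a stand-in.
import math

def revin_forecast(window, horizon, forecaster, eps=1e-8):
    """Normalize with the window's own statistics, forecast, de-normalize."""
    mu = sum(window) / len(window)
    var = sum((x - mu) ** 2 for x in window) / len(window)
    sd = math.sqrt(var + eps)
    normed = [(x - mu) / sd for x in window]
    pred = forecaster(normed, horizon)
    return [p * sd + mu for p in pred]

# Naive forecaster: repeat the last normalized value.
naive = lambda xs, h: [xs[-1]] * h

out = revin_forecast([100.0, 102.0, 104.0, 106.0], horizon=2, forecaster=naive)
```

由于统计量取自实例本身,模型只需在近似零均值、单位方差的空间中学习,从而缓解跨域尺度漂移。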
【3】Adaptive Weighted LSSVM for Multi-View Classification
标题:用于多视图分类的自适应加权LSSVM
链接:https://arxiv.org/abs/2512.02653
作者:Farnaz Faramarzi Lighvan,Mehrdad Asadi,Lynn Houthuys
摘要:多视图学习集成了相同实例的不同表示以提高性能。大多数现有的基于核的多视图学习方法使用融合技术,而不强制执行显式的跨视图协作类型或限制全局协作的共正则化。我们提出了AW-LSSVM,这是一种自适应加权LS-SVM,它通过迭代全局耦合来促进互补学习,使每个视图关注先前迭代中其他视图的硬样本。实验表明,AW-LSSVM在大多数数据集上优于现有的基于内核的多视图方法,同时保持原始特征隔离,使其也适用于隐私保护场景。
摘要:Multi-view learning integrates diverse representations of the same instances to improve performance. Most existing kernel-based multi-view learning methods use fusion techniques without enforcing an explicit collaboration type across views or co-regularization which limits global collaboration. We propose AW-LSSVM, an adaptive weighted LS-SVM that promotes complementary learning by an iterative global coupling to make each view focus on hard samples of others from previous iterations. Experiments demonstrate that AW-LSSVM outperforms existing kernel-based multi-view methods on most datasets, while keeping raw features isolated, making it also suitable for privacy-preserving scenarios.
【4】Zero-Shot Instruction Following in RL via Structured LTL Representations
标题:通过结构化LTL表示实现RL中的零样本指令遵循
链接:https://arxiv.org/abs/2512.02633
作者:Mattia Giuri,Mathias Jackermeier,Alessandro Abate
备注:ICML 2025 Workshop on Programmatic Representations for Agent Learning
摘要:线性时序逻辑(LTL)是为强化学习(RL)智能体指定复杂结构化任务的一个有力框架。最近的工作表明,将LTL指令解释为有限自动机(可视为监控任务进度的高级程序),可以学习一个能在测试时执行任意指令的通用策略。然而,在多个高级事件(即原子命题)可能同时为真并以复杂方式相互作用的环境中,现有方法表现不佳。在这项工作中,我们提出了一种学习遵循任意LTL指令的多任务策略的新方法,以解决这一缺陷。我们的方法将策略条件化于简单布尔公式的序列,这些公式与自动机中的转移直接对齐,并通过图神经网络(GNN)编码以产生结构化的任务表示。在一个复杂的基于国际象棋的环境中的实验证明了我们方法的优势。
摘要:Linear temporal logic (LTL) is a compelling framework for specifying complex, structured tasks for reinforcement learning (RL) agents. Recent work has shown that interpreting LTL instructions as finite automata, which can be seen as high-level programs monitoring task progress, enables learning a single generalist policy capable of executing arbitrary instructions at test time. However, existing approaches fall short in environments where multiple high-level events (i.e., atomic propositions) can be true at the same time and potentially interact in complicated ways. In this work, we propose a novel approach to learning a multi-task policy for following arbitrary LTL instructions that addresses this shortcoming. Our method conditions the policy on sequences of simple Boolean formulae, which directly align with transitions in the automaton, and are encoded via a graph neural network (GNN) to yield structured task representations. Experiments in a complex chess-based environment demonstrate the advantages of our approach.
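将LTL指令编译为有限自动机、并以命题集合上的布尔守卫驱动转移的思想,可用如下玩具自动机示意;自动机与任务均为假设,并非论文环境。

```python
# Toy reach-avoid automaton for "eventually a, then eventually b, never c".
# Transitions are guarded by Boolean functions over the set of currently
# true propositions, mirroring how an LTL formula compiles to an automaton.

TRANSITIONS = {
    "start": [(lambda ap: "c" in ap, "fail"),
              (lambda ap: "a" in ap, "got_a")],
    "got_a": [(lambda ap: "c" in ap, "fail"),
              (lambda ap: "b" in ap, "accept")],
}

def run(trace):
    """Advance the automaton over a trace of proposition sets."""
    state = "start"
    for ap in trace:                      # ap: set of true propositions
        for guard, nxt in TRANSITIONS.get(state, []):
            if guard(ap):
                state = nxt
                break                     # first matching guard wins
    return state

ok = run([set(), {"a"}, {"a", "b"}])      # satisfies the task
bad = run([{"a"}, {"c"}])                 # violates "never c"
```

注意守卫是对整个命题集合求值的布尔公式,因此“a与b同时为真”这类并发事件可被自然处理,这正是摘要强调的场景。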
【5】Adapting Tensor Kernel Machines to Enable Efficient Transfer Learning for Seizure Detection
标题:调整张量核机器以实现癫痫发作检测的高效迁移学习
链接:https://arxiv.org/abs/2512.02626
作者:Seline J. S. de Rooij,Borbála Hunyadi
备注:This work has been submitted to the IEEE for possible publication
摘要:迁移学习旨在通过从相关的源问题中学习来优化目标任务的性能。在这项工作中,我们提出了一种有效的迁移学习方法,使用张量核机器。我们的方法从自适应SVM中获得灵感,因此通过正则化将“知识”从源转移到“适应”模型。使用张量核机器的主要优点是它们利用低秩张量网络来学习原始域中的紧凑非线性模型。这允许在不向模型添加更多参数的情况下进行更有效的自适应。为了证明我们的方法的有效性,我们应用自适应张量核机器(Adapt-TKM)的癫痫发作检测耳后EEG。通过使用少量患者特异性数据对患者无关模型进行个性化,与患者无关和完全患者特异性模型相比,患者适应模型(利用Adapt-TKM)实现了更好的性能。值得注意的是,它能够做到这一点,同时需要比自适应SVM模型少100倍的参数,从而导致相应更快的推理速度。这使得Adapt-TKM对于资源受限的可穿戴设备特别有用。
摘要:Transfer learning aims to optimize performance in a target task by learning from a related source problem. In this work, we propose an efficient transfer learning method using a tensor kernel machine. Our method takes inspiration from the adaptive SVM and hence transfers 'knowledge' from the source to the 'adapted' model via regularization. The main advantage of using tensor kernel machines is that they leverage low-rank tensor networks to learn a compact non-linear model in the primal domain. This allows for a more efficient adaptation without adding more parameters to the model. To demonstrate the effectiveness of our approach, we apply the adaptive tensor kernel machine (Adapt-TKM) to seizure detection on behind-the-ear EEG. By personalizing patient-independent models with a small amount of patient-specific data, the patient-adapted model (which utilizes the Adapt-TKM), achieves better performance compared to the patient-independent and fully patient-specific models. Notably, it is able to do so while requiring around 100 times fewer parameters than the adaptive SVM model, leading to a correspondingly faster inference speed. This makes the Adapt-TKM especially useful for resource-constrained wearable devices.
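Adapt-TKM沿用自适应SVM的思路,通过正则项将解拉向源模型权重。下面用一维岭回归的闭式解示意这一“向源正则化”机制,即最小化 sum (y - w*x)^2 + lam * (w - w_src)^2;这只是玩具示例,并非张量核机器本身。

```python
# Illustrative "regularize toward the source model" transfer, in the
# spirit of the adaptive SVM idea the abstract describes (toy 1-D case).

def adapt_ridge_1d(xs, ys, w_src, lam):
    """Closed form for: minimize sum (y - w*x)^2 + lam * (w - w_src)^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return (sxy + lam * w_src) / (sxx + lam)

xs, ys = [1.0, 2.0], [2.0, 4.0]      # target (patient-specific) data: w = 2
w_src = 0.0                           # source (patient-independent) weight
w_small_lam = adapt_ridge_1d(xs, ys, w_src, lam=0.01)   # trust target data
w_big_lam = adapt_ridge_1d(xs, ys, w_src, lam=100.0)    # stay near source
```

正则系数lam控制“留在源模型附近”与“拟合少量患者数据”之间的权衡;患者数据极少时取较大的lam更稳妥。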
【6】Basis-Oriented Low-rank Transfer for Few-Shot and Test-Time Adaptation
标题:面向基的低秩迁移,用于少样本与测试时自适应
链接:https://arxiv.org/abs/2512.02441
作者:Junghwan Park,Woojin Cho,Junhyuk Heo,Darongsae Kwon,Kookjin Lee
摘要:在数据和计算预算紧张的情况下,使大型预训练模型适应未见任务仍然具有挑战性。元学习方法显式地学习良好的初始化,但需要在大量任务上进行额外的元训练阶段,训练成本高且可能不稳定。与此同时,特定任务的预训练模型数量持续增长,但如何以最少的额外训练将它们迁移到新任务,这一问题仍相对缺乏研究。我们提出了BOLT(Basis-Oriented Low-rank Transfer),该框架重用现有微调模型时不是合并权重,而是提取正交的、任务感知的谱基,并在该子空间内进行自适应。在离线阶段,BOLT从多个任务向量中收集主导奇异方向,并逐层正交化以形成可重用的基。在在线阶段,我们冻结这些基,针对新任务每层只训练一小组对角系数,得到秩受控且可训练参数极少的更新。这一设计提供了:(i)通过池化源任务系数并辅以轻量级重缩放步骤,在共享正交基之上为未见任务提供强大的免训练初始化;(ii)一条参数高效微调(PEFT)路径,在我们的实验中,相比常见PEFT基线以及代表性的元学习初始化取得了稳健的性能。我们的结果表明,将自适应约束在任务感知的正交子空间内,为未见任务迁移提供了一种有效的替代方案。
摘要:Adapting large pre-trained models to unseen tasks under tight data and compute budgets remains challenging. Meta-learning approaches explicitly learn good initializations, but they require an additional meta-training phase over many tasks, incur high training cost, and can be unstable. At the same time, the number of task-specific pre-trained models continues to grow, yet the question of how to transfer them to new tasks with minimal additional training remains relatively underexplored. We propose BOLT (Basis-Oriented Low-rank Transfer), a framework that reuses existing fine-tuned models not by merging weights, but instead by extracting an orthogonal, task-informed spectral basis and adapting within that subspace. In the offline phase, BOLT collects dominant singular directions from multiple task vectors and orthogonalizes them per layer to form reusable bases. In the online phase, we freeze these bases and train only a small set of diagonal coefficients per layer for the new task, yielding a rank-controlled update with very few trainable parameters. This design provides (i) a strong, training-free initialization for unseen tasks, obtained by pooling source-task coefficients, along with a lightweight rescaling step while leveraging the shared orthogonal bases, and (ii) a parameter-efficient fine-tuning (PEFT) path that, in our experiments, achieves robust performance compared to common PEFT baselines as well as a representative meta-learned initialization. Our results show that constraining adaptation to a task-informed orthogonal subspace provides an effective alternative for unseen-task transfer.
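BOLT离线阶段“从任务向量构建正交基、在线只学系数”的骨架,可用Gram-Schmidt正交化在纯Python中示意(论文实际使用逐层SVD的主导奇异方向,此处以Gram-Schmidt代替以便自包含):

```python
# Sketch of the "orthogonal basis from task vectors + per-task
# coefficients" idea. Gram-Schmidt stands in for the per-layer SVD
# directions used in the paper; vectors are toy 3-D examples.
import math

def gram_schmidt(vectors, tol=1e-10):
    """Orthonormalize task-vector directions into a reusable basis."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            dot = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - dot * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > tol:                       # drop linearly dependent vectors
            basis.append([wi / norm for wi in w])
    return basis

def project(v, basis):
    """Coefficients of v in the frozen basis (the only trained parameters)."""
    return [sum(vi * bi for vi, bi in zip(v, b)) for b in basis]

source_tasks = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
basis = gram_schmidt(source_tasks)
coeffs = project([3.0, 2.0, 0.0], basis)     # hypothetical new-task vector
recon = [sum(c * b[j] for c, b in zip(coeffs, basis)) for j in range(3)]
```

新任务若落在源任务张成的子空间内,仅凭少量系数即可精确重构其更新方向;训练时只需优化这些系数,基保持冻结。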
【7】Cross-Domain Offline Policy Adaptation with Dynamics- and Value-Aligned Data Filtering
标题:基于动态与价值对齐数据过滤的跨域离线策略自适应
链接:https://arxiv.org/abs/2512.02435
作者:Zhongjian Qiao,Rui Yang,Jiafei Lyu,Chenjia Bai,Xiu Li,Zhuoran Yang,Siyang Gao,Shuang Qiu
摘要:跨域离线强化学习旨在利用有限的目标域数据集和(可能)具有充分数据覆盖的源域数据集,训练部署在目标环境中的智能体。由于源域和目标域之间潜在的动态不一致,简单合并两个数据集的数据可能导致较差的性能。最近的进展通过选择性地共享与目标域动态对齐的源域样本来解决这一问题。然而,这些方法仅关注动态对齐,而忽略了值对齐(value alignment),即从源域中选择高质量、高价值的样本。在本文中,我们首先检查了跨域RL现有理论框架的局限性,并为在源域上训练、在目标域上评估的策略建立了具体的次优性差距,从而证明动态对齐和值对齐对策略学习都是必不可少的。受这些理论见解的启发,我们建议选择性地共享同时具有高动态对齐和高值对齐的源域样本,并提出了动态与价值对齐的数据过滤(Dynamics- and Value-aligned Data Filtering, DVDF)方法。我们设计了一系列动态偏移设置,包括运动学和形态学偏移,并在各种任务和数据集上评估DVDF,包括目标域数据集仅包含5,000个转移的极低数据量设置。大量实验表明,DVDF始终优于此前的强基线,并在多个任务和数据集上取得出色的性能。
摘要:Cross-Domain Offline Reinforcement Learning aims to train an agent deployed in the target environment, leveraging both a limited target domain dataset and a source domain dataset with (possibly) sufficient data coverage. Due to the underlying dynamics misalignment between the source and target domain, simply merging the data from two datasets may incur inferior performance. Recent advances address this issue by selectively sharing source domain samples that exhibit dynamics alignment with the target domain. However, these approaches focus solely on dynamics alignment and overlook \textit{value alignment}, i.e., selecting high-quality, high-value samples from the source domain. In this paper, we first demonstrate that both dynamics alignment and value alignment are essential for policy learning, by examining the limitations of the current theoretical framework for cross-domain RL and establishing a concrete sub-optimality gap of a policy trained on the source domain and evaluated on the target domain. Motivated by the theoretical insights, we propose to selectively share those source domain samples with both high dynamics and value alignment and present our \textbf{\underline{D}}ynamics- and \textbf{\underline{V}}alue-aligned \textbf{\underline{D}}ata \textbf{\underline{F}}iltering (DVDF) method. We design a range of dynamics shift settings, including kinematic and morphology shifts, and evaluate DVDF on various tasks and datasets, as well as in challenging extremely low-data settings where the target domain dataset contains only 5,000 transitions. Extensive experiments demonstrate that DVDF consistently outperforms prior strong baselines and delivers exceptional performance across multiple tasks and datasets.
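DVDF的两步筛选逻辑(先按与目标域动态模型的预测差距过滤,再保留高价值样本)可以用如下玩具草图示意;动态模型、价值函数与阈值均为假设,并非论文的具体实现。

```python
# Toy two-stage filter in the spirit of DVDF: keep source transitions
# that are dynamics-aligned AND high-value. 1-D states for simplicity.

def dvdf_filter(source_batch, target_dynamics, value_fn,
                dyn_tol=0.5, value_quantile=0.5):
    # Stage 1: dynamics alignment - small gap to the target dynamics model.
    aligned = []
    for (s, a, s_next) in source_batch:
        gap = abs(target_dynamics(s, a) - s_next)
        if gap <= dyn_tol:
            aligned.append((s, a, s_next))
    if not aligned:
        return []
    # Stage 2: value alignment - keep the upper value quantile.
    vals = sorted(value_fn(s, a) for (s, a, _) in aligned)
    cutoff = vals[int(value_quantile * (len(vals) - 1))]
    return [t for t in aligned if value_fn(t[0], t[1]) >= cutoff]

target_dyn = lambda s, a: s + a            # toy target-domain dynamics
value = lambda s, a: s + a                 # toy value estimate
batch = [(0.0, 1.0, 1.0),    # aligned, low value
         (1.0, 1.0, 2.0),    # aligned, mid value
         (2.0, 1.0, 3.0),    # aligned, high value
         (0.0, 1.0, 9.0)]    # dynamics-misaligned
kept = dvdf_filter(batch, target_dyn, value)
```

两个条件缺一不可:仅看动态会保留低价值样本,仅看价值会引入动态失配样本。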
【8】Ada-MoGE: Adaptive Mixture of Gaussian Expert Model for Time Series Forecasting
标题:Ada-MoGE:用于时间序列预测的自适应高斯混合专家模型
链接:https://arxiv.org/abs/2512.02061
作者:Zhenliang Ni,Xiaowen Ma,Zhenkai Wu,Shuai Xiao,Han Shu,Xinghao Chen
摘要:多变量时间序列预测被广泛应用,如工业、交通和金融预测。然而,时间序列中的主频率可能会随着数据频谱分布的变化而发生变化。传统的专家混合(MoE)模型采用固定数量的专家,难以适应这些变化,导致频率覆盖不平衡问题。具体来说,专家太少会导致关键信息被忽视,而太多则会带来噪音。为此,我们提出了Ada-MoGE,一种自适应高斯混合专家模型。Ada-MoGE集成了光谱强度和频率响应,以自适应地确定专家的数量,确保与输入数据的频率分布保持一致。这种方法既可以防止由于专家数量不足而造成的信息损失,也可以防止由于专家数量过多而造成的噪声污染。此外,为了防止直接频带截断引入噪声,我们采用高斯带通滤波来平滑分解频域特征,进一步优化特征表示。实验结果表明,我们的模型在六个公共基准上达到了最先进的性能,只有20万个参数。
摘要:Multivariate time series forecasts are widely used, such as industrial, transportation and financial forecasts. However, the dominant frequencies in time series may shift with the evolving spectral distribution of the data. Traditional Mixture of Experts (MoE) models, which employ a fixed number of experts, struggle to adapt to these changes, resulting in frequency coverage imbalance issue. Specifically, too few experts can lead to the overlooking of critical information, while too many can introduce noise. To this end, we propose Ada-MoGE, an adaptive Gaussian Mixture of Experts model. Ada-MoGE integrates spectral intensity and frequency response to adaptively determine the number of experts, ensuring alignment with the input data's frequency distribution. This approach prevents both information loss due to an insufficient number of experts and noise contamination from an excess of experts. Additionally, to prevent noise introduction from direct band truncation, we employ Gaussian band-pass filtering to smoothly decompose the frequency domain features, further optimizing the feature representation. The experimental results show that our model achieves state-of-the-art performance on six public benchmarks with only 0.2 million parameters.
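摘要中“按频谱强度自适应确定专家数、用高斯带通平滑分解”的思路,可用朴素DFT在纯Python中示意;O(n^2)实现仅供说明,阈值与带宽均为假设参数。

```python
# Sketch of Ada-MoGE's two ingredients: (1) count dominant frequencies
# to set the number of experts; (2) Gaussian band-pass weights instead
# of hard truncation. Naive O(n^2) DFT keeps the sketch dependency-free.
import math

def spectrum(x):
    """Magnitudes of the real DFT bins 0..n//2."""
    n = len(x)
    mags = []
    for k in range(n // 2 + 1):
        re = sum(x[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = -sum(x[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        mags.append(math.hypot(re, im))
    return mags

def num_experts(x, rel_threshold=0.2):
    """One expert per frequency whose intensity exceeds a fraction of the peak."""
    mags = spectrum(x)
    peak = max(mags[1:])               # ignore the DC component
    return sum(1 for m in mags[1:] if m >= rel_threshold * peak)

def gaussian_band(k, center, width):
    """Smooth band-pass weight instead of hard band truncation."""
    return math.exp(-0.5 * ((k - center) / width) ** 2)

n = 64
two_tone = [math.sin(2 * math.pi * 3 * t / n) + 0.8 * math.sin(2 * math.pi * 9 * t / n)
            for t in range(n)]
k_experts = num_experts(two_tone)      # two dominant tones -> two experts
```

当数据主频随时间漂移时,重新估计频谱即可自动增减专家数,避免固定专家数导致的频率覆盖失衡。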
【9】Opening the Black Box: An Explainable, Few-shot AI4E Framework Informed by Physics and Expert Knowledge for Materials Engineering
标题:打开黑匣子:一个面向材料工程、由物理与专家知识引导的可解释少样本AI4E框架
链接:https://arxiv.org/abs/2512.02057
作者:Haoxiang Zhang,Ruihao Yuan,Lihui Zhang,Yushi Luo,Qiang Zhang,Pan Ding,Xiaodong Ren,Weijie Xing,Niu Gao,Jishan Chen,Chubo Zhang
摘要:工程人工智能(AI4E)的工业应用面临两个根本瓶颈:高质量数据稀缺,以及黑箱模型缺乏可解释性,这在航空航天等安全敏感领域尤为关键。我们提出了一个可解释的少样本AI4E框架,其整个架构系统性地融入了物理与专家知识。在航空K439B高温合金铸件补焊的案例中,我们从仅有的32个实验样本出发,首先通过三阶段协议增广物理上合理的合成数据:按工艺变异性校准的差异化噪声注入、硬物理约束的强制执行,以及参数间关系的保持。随后,我们采用嵌套优化策略进行本构模型发现:符号回归探索方程结构,差分进化优化参数,再通过全局-局部混合优化进行密集的参数细化。得到的可解释本构方程在预测热裂倾向方面达到88%的准确率。该方程不仅提供定量预测,还提供明确的物理洞见,揭示热、几何和冶金机制如何耦合驱动开裂,从而加深工程师对该工艺的认知理解。此外,该本构方程可作为工艺优化和高保真虚拟数据生成的多功能工具,进而提升其他数据驱动模型的精度。我们的方法为开发将工程领域知识直接嵌入其架构的可信AI系统提供了通用蓝图,使其能够在数据有限但具备物理理解的高风险工业应用中可靠落地。
摘要:The industrial adoption of Artificial Intelligence for Engineering (AI4E) faces two fundamental bottlenecks: scarce high-quality data and the lack of interpretability in black-box models-particularly critical in safety-sensitive sectors like aerospace. We present an explainable, few-shot AI4E framework that is systematically informed by physics and expert knowledge throughout its architecture. Starting from only 32 experimental samples in an aerial K439B superalloy castings repair welding case, we first augment physically plausible synthetic data through a three-stage protocol: differentiated noise injection calibrated to process variabilities, enforcement of hard physical constraints, and preservation of inter-parameter relationships. We then employ a nested optimization strategy for constitutive model discovery, where symbolic regression explores equation structures while differential evolution optimizes parameters, followed by intensive parameter refinement using hybrid global-local optimization. The resulting interpretable constitutive equation achieves 88% accuracy in predicting hot-cracking tendency. This equation not only provides quantitative predictions but also delivers explicit physical insight, revealing how thermal, geometric, and metallurgical mechanisms couple to drive cracking-thereby advancing engineers' cognitive understanding of the process. Furthermore, the constitutive equation serves as a multi-functional tool for process optimization and high-fidelity virtual data generation, enabling accuracy improvements in other data-driven models. Our approach provides a general blueprint for developing trustworthy AI systems that embed engineering domain knowledge directly into their architecture, enabling reliable adoption in high-stakes industrial applications where data is limited but physical understanding is available.
【10】Few-shot Protein Fitness Prediction via In-context Learning and Test-time Training
标题:通过上下文学习和测试时训练进行少样本蛋白质适合度预测
链接:https://arxiv.org/abs/2512.02315
作者:Felix Teufel,Aaron W. Kollasch,Yining Huang,Ole Winther,Kevin K. Yang,Pascal Notin,Debora S. Marks
备注:AI for Science Workshop (NeurIPS 2025)
摘要:用最少的实验数据准确预测蛋白质适合度是蛋白质工程中的一个持续挑战。我们介绍了PRIMO(PRotein In-context Mutation Oracle),这是一个基于transformer的框架,它利用上下文学习和测试时训练来快速适应新的蛋白质和检测,而无需大型特定任务的数据集。通过将序列信息、辅助zero-shot预测和来自许多测定的稀疏实验标签编码为预训练掩蔽语言建模范例中的统一标记集,PRIMO通过基于偏好的损失函数学习优先考虑有希望的变体。在不同的蛋白质家族和属性中-包括取代和插入缺失突变-PRIMO优于zero-shot和完全监督的基线。这项工作强调了将大规模预训练与有效的测试时间适应相结合的力量,以解决具有挑战性的蛋白质设计任务,其中数据收集昂贵且标签可用性有限。
摘要:Accurately predicting protein fitness with minimal experimental data is a persistent challenge in protein engineering. We introduce PRIMO (PRotein In-context Mutation Oracle), a transformer-based framework that leverages in-context learning and test-time training to adapt rapidly to new proteins and assays without large task-specific datasets. By encoding sequence information, auxiliary zero-shot predictions, and sparse experimental labels from many assays as a unified token set in a pre-training masked-language modeling paradigm, PRIMO learns to prioritize promising variants through a preference-based loss function. Across diverse protein families and properties-including both substitution and indel mutations-PRIMO outperforms zero-shot and fully supervised baselines. This work underscores the power of combining large-scale pre-training with efficient test-time adaptation to tackle challenging protein design tasks where data collection is expensive and label availability is limited.
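摘要提到的“基于偏好的损失函数”,一种常见形式是Bradley-Terry式成对损失:L = -log sigmoid(s_better - s_worse)。下面的草图演示其基本性质;论文的具体损失形式可能不同。

```python
# Bradley-Terry style pairwise preference loss (a common form of the
# "preference-based loss" the abstract mentions; the paper's exact
# objective may differ).
import math

def preference_loss(score_better, score_worse):
    """Push the preferred variant's score above the other's:
       L = -log(sigmoid(score_better - score_worse))."""
    margin = score_better - score_worse
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

l_wrong = preference_loss(-1.0, 1.0)   # model ranks the pair incorrectly
l_tie = preference_loss(0.0, 0.0)      # no margin -> loss = log 2
l_right = preference_loss(2.0, -1.0)   # correct ranking with large margin
```

损失只取决于分数差,因此模型学习的是变体间的相对排序而非绝对适合度值,这与“优先考虑有希望的变体”这一目标一致。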
强化学习(8篇)
【1】GoRL: An Algorithm-Agnostic Framework for Online Reinforcement Learning with Generative Policies
标题:GoRL:一个算法无关的生成策略在线强化学习框架
链接:https://arxiv.org/abs/2512.02581
作者:Chubin Zhang,Zhenglin Wan,Feng Chen,Xingrui Yu,Ivor Tsang,Bo An
备注:27 pages
摘要:强化学习(RL)面临一个持续的矛盾:易于稳定优化的策略往往过于简单,无法表示复杂控制所需的多模态动作分布。高斯策略提供可解的似然和平滑的梯度,但其单峰形式限制了表达能力。相反,基于扩散或流匹配的生成策略可以建模丰富的多模态行为;然而在在线RL中,由于似然不可解以及噪声梯度经深层采样链传播,它们往往不稳定。我们用一个关键的结构性原则来化解这一矛盾:将优化与生成解耦。基于这一见解,我们提出了GoRL(Generative Online Reinforcement Learning),该框架优化一个似然可解的潜在策略,同时利用条件生成解码器来合成动作。双时间尺度的更新调度使潜在策略能够稳定学习,同时解码器逐步提升表达能力,而无需可解的动作似然。在一系列连续控制任务中,GoRL始终优于高斯策略和最近的生成策略基线。值得注意的是,在HopperStand任务中,它达到了870以上的归一化回报,是最强基线的3倍多。这些结果表明,将优化与生成分离为获得既稳定又高度表达的策略提供了一条实用路径。
摘要:Reinforcement learning (RL) faces a persistent tension: policies that are stable to optimize are often too simple to represent the multimodal action distributions needed for complex control. Gaussian policies provide tractable likelihoods and smooth gradients, but their unimodal form limits expressiveness. Conversely, generative policies based on diffusion or flow matching can model rich multimodal behaviors; however, in online RL, they are frequently unstable due to intractable likelihoods and noisy gradients propagating through deep sampling chains. We address this tension with a key structural principle: decoupling optimization from generation. Building on this insight, we introduce GoRL (Generative Online Reinforcement Learning), a framework that optimizes a tractable latent policy while utilizing a conditional generative decoder to synthesize actions. A two-timescale update schedule enables the latent policy to learn stably while the decoder steadily increases expressiveness, without requiring tractable action likelihoods. Across a range of continuous-control tasks, GoRL consistently outperforms both Gaussian policies and recent generative-policy baselines. Notably, on the HopperStand task, it reaches a normalized return above 870, more than 3 times that of the strongest baseline. These results demonstrate that separating optimization from generation provides a practical path to policies that are both stable and highly expressive.
【2】CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning
标题:CUDA-L2:通过强化学习超越cuBLAS矩阵相乘性能
链接:https://arxiv.org/abs/2512.02551
作者:Songqiao Su,Xiaofei Sun,Xiaoya Li,Albert Wang,Jiwei Li,Chris Shum
摘要:本文提出CUDA-L2,一个结合大型语言模型(LLM)与强化学习(RL)来自动优化半精度通用矩阵乘法(HGEMM)CUDA内核的系统。以CUDA执行速度作为RL奖励,CUDA-L2在1,000种配置上自动优化HGEMM内核。CUDA-L2系统性地超越了迄今为止的主要matmul基线,从广泛使用的torch.matmul到Nvidia最先进的闭源库cuBLAS和cuBLASLt。在离线模式(内核连续执行、无时间间隔)下,CUDA-L2平均比torch.matmul快22.0%;在最优布局配置(normal-normal NN与transposed-normal TN)下比cuBLAS快19.2%;比cuBLASLt-heuristic(查询cuBLASLt库并按启发式建议选择算法)快16.8%;比最具竞争力的cuBLASLt-AutoTuning(从cuBLASLt建议的最多100个候选中选出最快算法)快11.4%。在服务器模式(内核以随机间隔执行、模拟实时推理)下,相对torch.matmul、cuBLAS、cuBLASLt-heuristic和cuBLASLt-AutoTuning的加速分别进一步提升至28.7%、26.0%、22.4%和15.9%。CUDA-L2表明,即使是HGEMM这类对性能最关键、已被高度优化的内核,也能通过LLM引导的RL自动化、在人力无法企及的规模上系统性探索配置空间而得到改进。项目与代码见 github.com/deepreinforce-ai/CUDA-L2
摘要:In this paper, we propose CUDA-L2, a system that combines large language models (LLMs) and reinforcement learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels. Using CUDA execution speed as the RL reward, CUDA-L2 automatically optimizes HGEMM kernels across 1,000 configurations. CUDA-L2 systematically outperforms major matmul baselines to date, from the widely-used torch.matmul to state-of-the-art Nvidia's closed-source libraries, i.e., cuBLAS and cuBLASLt. In offline mode, where kernels are executed consecutively without time intervals, CUDA-L2 yields +22.0% over torch.matmul on average; +19.2% over cuBLAS using the optimal layout configuration (normal-normal NN and transposed-normal TN); +16.8% over cuBLASLt-heuristic, which queries the cuBLASLt library and selects the algorithm based on the heuristic's suggestion; and +11.4% over the most competitive cuBLASLt-AutoTuning model, which selects the fastest algorithm from up to 100 candidates from cuBLASLt's suggestions. In server mode, where kernels are executed at random intervals simulating real-time inference, the speedups further increase to +28.7%, +26.0%, +22.4%, and +15.9% for torch.matmul, cuBLAS, cuBLASLt-heuristic, and cuBLASLt-AutoTuning respectively. CUDA-L2 shows that even the most performance-critical, heavily-optimized kernels like HGEMM can be improved through LLM-guided RL automation by systematically exploring configuration spaces at scales impractical for humans. Project and code can be found at github.com/deepreinforce-ai/CUDA-L2
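以执行速度为RL奖励这一设定可以用如下示意表达(假设性定义,奖励形式与惩罚值均为虚构,非CUDA-L2源代码):奖励取候选内核相对基线的加速比,编译或运行失败的内核记负奖励。

```python
# 以CUDA执行速度作为RL奖励的示意(假设性定义,非CUDA-L2源代码):
# 奖励为候选内核相对基线(如cuBLAS)的加速比减1,失败内核给予惩罚。
def speed_reward(candidate_ms, baseline_ms, failed=False, penalty=-1.0):
    if failed or candidate_ms <= 0:
        return penalty
    return baseline_ms / candidate_ms - 1.0   # 大于0表示快于基线

r_fast = speed_reward(0.80, 1.00)             # 比基线快25%
r_slow = speed_reward(1.25, 1.00)             # 比基线慢20%
r_bad  = speed_reward(0.0, 1.00, failed=True) # 失败内核
```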
【3】Dual-Robust Cross-Domain Offline Reinforcement Learning Against Dynamics Shifts
标题:对抗动态变化的双稳健跨域离线强化学习
链接:https://arxiv.org/abs/2512.02486
作者:Zhongjian Qiao,Rui Yang,Jiafei Lyu,Xiu Li,Zhongxiang Dai,Zhuoran Yang,Siyang Gao,Shuang Qiu
摘要:单域离线强化学习(RL)通常受限于数据覆盖不足,而跨域离线RL通过利用来自其他动态发生偏移的域的额外数据来缓解这一问题。然而,现有研究主要关注训练时鲁棒性(处理训练数据中的动态偏移),忽视了部署到实际场景时对动态扰动的测试时鲁棒性。本文研究跨域离线RL中针对动态偏移的双重(训练时与测试时)鲁棒性。我们首先通过实验表明,用跨域离线RL训练的策略在评估期间的动态扰动下表现脆弱,尤其是在目标域数据有限时。为此,我们引入一种新的鲁棒跨域Bellman(RCB)算子,它增强了测试时对动态扰动的鲁棒性,同时对分布外的动态转移保持保守,从而保证训练时鲁棒性。为进一步抵消RCB算子可能引起的价值高估或低估,我们在框架中引入了两种技术:动态价值惩罚和Huber损失,从而得到实用的双重鲁棒跨域离线RL(Dual-RObust Cross-domain Offline RL, DROCO)算法。在各种动态偏移场景下的大量实验结果表明,DROCO优于强基线,并对动态扰动表现出更强的鲁棒性。
摘要:Single-domain offline reinforcement learning (RL) often suffers from limited data coverage, while cross-domain offline RL handles this issue by leveraging additional data from other domains with dynamics shifts. However, existing studies primarily focus on train-time robustness (handling dynamics shifts from training data), neglecting the test-time robustness against dynamics perturbations when deployed in practical scenarios. In this paper, we investigate dual (both train-time and test-time) robustness against dynamics shifts in cross-domain offline RL. We first empirically show that the policy trained with cross-domain offline RL exhibits fragility under dynamics perturbations during evaluation, particularly when target domain data is limited. To address this, we introduce a novel robust cross-domain Bellman (RCB) operator, which enhances test-time robustness against dynamics perturbations while staying conservative to the out-of-distribution dynamics transitions, thus guaranteeing the train-time robustness. To further counteract potential value overestimation or underestimation caused by the RCB operator, we introduce two techniques, the dynamic value penalty and the Huber loss, into our framework, resulting in the practical Dual-RObust Cross-domain Offline RL (DROCO) algorithm. Extensive empirical results across various dynamics shift scenarios show that DROCO outperforms strong baselines and exhibits enhanced robustness to dynamics perturbations.
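摘要中“在扰动动态下取保守估计”与“用Huber损失抑制价值估计偏差”的组合,可用如下极简示意说明(假设性简化:扰动集合用少量候选下一状态价值代替,非论文中RCB算子的精确定义):

```python
# 鲁棒Bellman目标 + Huber损失的示意(假设性简化,非论文实现):
# 对下一状态价值在一个扰动集合内取最小值以获得测试时鲁棒性,
# 并用Huber损失抑制由此带来的高估/低估对训练的冲击。
def robust_bellman_target(r, gamma, next_values):
    # next_values: 扰动动态下若干候选下一状态的价值
    return r + gamma * min(next_values)

def huber(td_error, delta=1.0):
    a = abs(td_error)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)

target = robust_bellman_target(r=1.0, gamma=0.99, next_values=[2.0, 1.5, 1.8])
loss = huber(target - 3.0)   # 与当前价值估计3.0之间的TD误差
```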
【4】QJoin: Transformation-aware Joinable Data Discovery Using Reinforcement Learning
标题:QJoin:使用强化学习的转换感知联合数据发现
链接:https://arxiv.org/abs/2512.02444
作者:Ning Wang,Sainyam Galhotra
摘要:在大型异构数据仓库中发现哪些表可以连接、以及通过何种变换连接,是数据集成与数据发现中的核心挑战。传统的连接发现方法主要面向等值连接,假设连接键完全匹配或接近匹配。这类技术在干净、规范化良好的数据库中很有效,但在标识符格式不一致、被嵌入或跨多列拆分的开放或联邦环境中会失效。近似连接或模糊连接能缓解轻微的字符串差异,却无法捕获系统性的变换。我们提出QJoin,一个跨连接任务学习并复用变换策略的强化学习框架。QJoin在一种独特性感知奖励下训练智能体,该奖励在相似度与键的区分度之间取得平衡,使其能够探索简洁、高价值的变换链。为加速新的连接任务,我们引入两种复用机制:(i)智能体迁移,用预训练智能体初始化新策略;(ii)变换复用,为相似的列簇缓存成功的算子序列。在AutoJoin Web基准(31个表对)上,QJoin取得了91.0%的平均F1分数。在NYC+Chicago开放数据集的19,990个连接任务上,QJoin通过复用将运行时间最多减少7.4%(13,747秒)。这些结果表明,变换学习与复用能让连接发现既更准确又更高效。
摘要:Discovering which tables in large, heterogeneous repositories can be joined and by what transformations is a central challenge in data integration and data discovery. Traditional join discovery methods are largely designed for equi-joins, which assume that join keys match exactly or nearly so. These techniques, while efficient in clean, well-normalized databases, fail in open or federated settings where identifiers are inconsistently formatted, embedded, or split across multiple columns. Approximate or fuzzy joins alleviate minor string variations but cannot capture systematic transformations. We introduce QJoin, a reinforcement-learning framework that learns and reuses transformation strategies across join tasks. QJoin trains an agent under a uniqueness-aware reward that balances similarity with key distinctiveness, enabling it to explore concise, high-value transformation chains. To accelerate new joins, we introduce two reuse mechanisms: (i) agent transfer, which initializes new policies from pretrained agents, and (ii) transformation reuse, which caches successful operator sequences for similar column clusters. On the AutoJoin Web benchmark (31 table pairs), QJoin achieves an average F1-score of 91.0%. For 19,990 join tasks in NYC+Chicago open datasets, QJoin reduces runtime by up to 7.4% (13,747 s) through reuse. These results demonstrate that transformation learning and reuse can make join discovery both more accurate and more efficient.
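摘要中的“独特性感知奖励”(在相似度与键区分度之间折衷)可以用如下假设性公式示意(相似度度量与奖励形式均为虚构占位,非论文定义),它惩罚把所有键映射成同一个值的退化变换:

```python
# QJoin中"独特性感知奖励"思想的示意(假设性公式,非论文精确定义):
# 奖励 = 匹配相似度 x 变换后连接键的独特性。
def uniqueness(keys):
    return len(set(keys)) / len(keys)

def similarity(a, b):
    # 简单的逐位匹配率,作为字符串相似度的占位实现
    hits = sum(x == y for x, y in zip(a, b))
    return hits / max(len(a), len(b))

def reward(transformed_keys, target_keys):
    sims = [max(similarity(k, t) for t in target_keys) for k in transformed_keys]
    return (sum(sims) / len(sims)) * uniqueness(transformed_keys)

good = reward(["a1", "b2", "c3"], ["a1", "b2", "c3"])  # 正常变换
bad  = reward(["a1", "a1", "a1"], ["a1", "b2", "c3"])  # 退化变换:所有键相同
```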
【5】Dynamic Configuration of On-Street Parking Spaces using Multi Agent Reinforcement Learning
标题:使用多智能体强化学习的路边停车位动态配置
链接:https://arxiv.org/abs/2512.02406
作者:Oshada Jayasinghe,Farhana Choudhury,Egemen Tanin,Shanika Karunasekera
摘要:随着出行需求空前增长,交通拥堵已成为大多数城市地区的主要问题。为路边停车分配空间会限制可供行驶的有效道路宽度,进一步阻碍交通流。随着车辆到基础设施互联技术的进步,我们探索如何通过动态配置路边停车位,将路边停车对交通拥堵的影响降到最低。为此,我们将动态路边停车位配置表述为一个优化问题,并针对问题的特点采用数据驱动的方法。我们提出的解决方案是一个两层的多智能体强化学习框架,天然可扩展到大型道路网络。车道级智能体负责为每条车道决定最优停车位配置;我们引入一种新颖的深度Q学习架构,有效利用长短期记忆网络和图注意力网络来捕获该问题中明显的时空相关性。街区级智能体控制车道级智能体的动作,并在街区周围维持足够的停车供给。我们使用SUMO在合成数据和墨尔本市的真实数据上进行了一组全面的实验。实验表明,所提框架能显著减少车辆的平均行程时间损失,最高可达47%,而停车步行距离的增加可以忽略不计。
摘要:With increased travelling needs more than ever, traffic congestion has become a major concern in most urban areas. Allocating spaces for on-street parking, further hinders traffic flow, by limiting the effective road width available for driving. With the advancement of vehicle-to-infrastructure connectivity technologies, we explore how the impact of on-street parking on traffic congestion could be minimized, by dynamically configuring on-street parking spaces. Towards that end, we formulate dynamic on-street parking space configuration as an optimization problem, and we follow a data driven approach, considering the nature of our problem. Our proposed solution comprises a two-layer multi agent reinforcement learning based framework, which is inherently scalable to large road networks. The lane level agents are responsible for deciding the optimal parking space configuration for each lane, and we introduce a novel Deep Q-learning architecture which effectively utilizes long short term memory networks and graph attention networks to capture the spatio-temporal correlations evident in the given problem. The block level agents control the actions of the lane level agents and maintain a sufficient level of parking around the block. We conduct a set of comprehensive experiments using SUMO, on both synthetic data as well as real-world data from the city of Melbourne. Our experiments show that the proposed framework could reduce the average travel time loss of vehicles significantly, reaching up to 47%, with a negligible increase in the walking distance for parking.
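车道级智能体的决策可以退化成一个极简的表格型Q学习玩具来理解(假设性示意:动作集与奖励中的拥堵、停车需求权重均为虚构;论文实际使用的是结合LSTM与图注意力网络的深度Q网络):

```python
import random

# 车道级智能体Q学习更新的示意(表格型简化,非论文架构):
# 动作为车道上开放的停车位数量,奖励在拥堵惩罚与未满足的停车需求之间折衷。
random.seed(0)
actions = [0, 5, 10]                 # 假设的动作集
Q = {a: 0.0 for a in actions}
alpha, eps = 0.2, 0.1

def lane_reward(a):
    congestion = 0.02 * a                      # 车位越多,有效路宽越窄
    unmet_parking = 0.08 * max(0, 5 - a)       # 车位不足的停车需求惩罚
    return -(congestion + unmet_parking)

for _ in range(2000):
    a = random.choice(actions) if random.random() < eps else max(Q, key=Q.get)
    Q[a] += alpha * (lane_reward(a) - Q[a])    # 单步(老虎机式)Q更新
best = max(Q, key=Q.get)                       # 该玩具设定下的最优配置为5
```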
【6】Reinforcement Learning in POMDP's via Direct Gradient Ascent
标题:通过直接梯度上升在POMDP中进行强化学习
链接:https://arxiv.org/abs/2512.02383
作者:Jonathan Baxter,Peter L. Bartlett
摘要:本文讨论了基于梯度的方法在受控POMDP中直接优化策略性能的理论与实验两方面。我们引入GPOMDP,一种类似REINFORCE的算法,用于估计平均奖励关于随机策略参数的梯度的近似值。该算法的主要优点是:它只需要底层马尔可夫链的一条样本路径;它只使用一个自由参数$\beta \in [0,1)$,该参数在偏差-方差权衡方面有自然的解释;并且它不需要底层状态的知识。我们证明了GPOMDP的收敛性,并展示了如何在共轭梯度过程中使用GPOMDP产生的梯度估计来寻找平均奖励的局部最优。
摘要:This paper discusses theoretical and experimental aspects of gradient-based approaches to the direct optimization of policy performance in controlled POMDPs. We introduce GPOMDP, a REINFORCE-like algorithm for estimating an approximation to the gradient of the average reward as a function of the parameters of a stochastic policy. The algorithm's chief advantages are that it requires only a single sample path of the underlying Markov chain, it uses only one free parameter $\beta \in [0,1)$, which has a natural interpretation in terms of bias-variance trade-off, and it requires no knowledge of the underlying state. We prove convergence of GPOMDP and show how the gradient estimates produced by GPOMDP can be used in a conjugate-gradient procedure to find local optima of the average reward.
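摘要描述的GPOMDP估计器(单条样本路径、资格迹参数beta控制偏差-方差权衡)可按如下方式实现;这里用一个双臂老虎机玩具问题验证梯度估计的方向(策略与奖励设定为虚构的演示用例):

```python
import math, random

# GPOMDP梯度估计的示意:
#   z_{t+1}     = beta * z_t + grad log pi(a_t | theta)
#   Delta_{t+1} = Delta_t + (r_{t+1} * z_{t+1} - Delta_t) / (t + 1)
rng = random.Random(0)
theta = [0.0, 0.0]        # softmax策略的两个logit(固定,仅估计梯度)
beta = 0.9                # 资格迹折扣:偏差-方差权衡参数

def policy(theta):
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    s = sum(e)
    return [x / s for x in e]

z = [0.0, 0.0]
delta = [0.0, 0.0]
for t in range(5000):
    p = policy(theta)
    a = rng.choices([0, 1], weights=p)[0]
    grad_log = [-p[0], -p[1]]
    grad_log[a] += 1.0                       # grad log pi(a|theta)
    r = 1.0 if a == 0 else 0.0               # 玩具问题:臂0奖励更高
    z = [beta * zi + g for zi, g in zip(z, grad_log)]
    delta = [d + (r * zi - d) / (t + 1) for d, zi in zip(delta, z)]
# delta估计平均奖励的梯度:应指向提高臂0概率的方向
```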
【7】FOVA: Offline Federated Reinforcement Learning with Mixed-Quality Data
标题:FOVA:使用混合质量数据的离线联邦强化学习
链接:https://arxiv.org/abs/2512.02350
作者:Nan Qiao,Sheng Yue,Ju Ren,Yaoxue Zhang
备注:Accepted by IEEE/ACM ToN
摘要:离线联邦强化学习(FRL)是联邦学习与离线强化学习的结合,近年来引起了越来越多的关注。尽管已有一些进展,我们发现大多数现有离线FRL方法在混合质量数据下性能急剧下降,即各客户端的日志行为(离线数据)由质量参差不齐的策略收集而来。为克服这一局限,本文提出一种新的基于投票的离线FRL框架,名为FOVA。它利用投票机制在本地策略评估中识别高回报动作,减轻来自各异本地学习策略的低质量行为的负面影响。此外,基于优势加权回归(AWR),我们构建了一致的本地与全局训练目标,显著提升了FOVA的效率和稳定性。我们还进行了广泛的理论分析,并严格证明FOVA学到的策略相对行为策略享有严格的策略改进。大量实验在广泛使用的基准上验证了所提算法相对现有基线的显著性能增益。
摘要:Offline Federated Reinforcement Learning (FRL), a marriage of federated learning and offline reinforcement learning, has attracted increasing interest recently. Albeit with some advancement, we find that the performance of most existing offline FRL methods drops dramatically when provided with mixed-quality data, that is, the logging behaviors (offline data) are collected by policies with varying qualities across clients. To overcome this limitation, this paper introduces a new vote-based offline FRL framework, named FOVA. It exploits a vote mechanism to identify high-return actions during local policy evaluation, alleviating the negative effect of low-quality behaviors from diverse local learning policies. Besides, building on advantage-weighted regression (AWR), we construct consistent local and global training objectives, significantly enhancing the efficiency and stability of FOVA. Further, we conduct an extensive theoretical analysis and rigorously show that the policy learned by FOVA enjoys strict policy improvement over the behavioral policy. Extensive experiments corroborate the significant performance gains of our proposed algorithm over existing baselines on widely used benchmarks.
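FOVA所基于的优势加权回归(AWR)权重可按如下通用形式示意(简化写法,温度与裁剪参数为假设值,非论文实现):高优势动作获得指数级更大的回归权重,从而削弱低质量行为的影响。

```python
import math

# AWR权重的示意:w_i 正比于 exp(优势_i / 温度),并做数值裁剪与归一化。
def awr_weights(advantages, temperature=1.0, clip=20.0):
    scaled = [max(-clip, min(clip, a / temperature)) for a in advantages]
    w = [math.exp(s) for s in scaled]
    total = sum(w)
    return [x / total for x in w]

# 高优势(2.0)的动作权重远大于低优势(-2.0)的动作
w = awr_weights([2.0, 0.0, -2.0])
```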
【8】Improved Training Mechanism for Reinforcement Learning via Online Model Selection
标题:通过在线模型选择改进强化学习训练机制
链接:https://arxiv.org/abs/2512.02214
作者:Aida Afshar,Aldo Pacchiano
摘要:我们研究了强化学习中的在线模型选择问题,其中选择器可以访问一类强化学习代理,并学习自适应地选择具有正确配置的代理。我们的目标是通过将在线模型选择方法集成到强化学习训练过程中来提高效率和性能。我们研究了在实践中有效识别正确配置的理论特征,并从理论角度解决了三个实际标准:1)有效的资源分配,2)非平稳动态下的适应性,以及3)不同种子的训练稳定性。我们的理论结果伴随着强化学习中各种模型选择任务的经验证据,包括神经结构选择,步长选择和自模型选择。
摘要:We study the problem of online model selection in reinforcement learning, where the selector has access to a class of reinforcement learning agents and learns to adaptively select the agent with the right configuration. Our goal is to establish the improved efficiency and performance gains achieved by integrating online model selection methods into reinforcement learning training procedures. We examine the theoretical characterizations that are effective for identifying the right configuration in practice, and address three practical criteria from a theoretical perspective: 1) Efficient resource allocation, 2) Adaptation under non-stationary dynamics, and 3) Training stability across different seeds. Our theoretical results are accompanied by empirical evidence from various model selection tasks in reinforcement learning, including neural architecture selection, step-size selection, and self model selection.
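“自适应选择配置正确的智能体”这类在线模型选择,常可用多臂老虎机式的选择器来理解。下面是一个假设性的UCB资源分配玩具(真实回报与探索系数均为虚构;摘要中的方法涉及更多理论准则):

```python
import math, random

# 在线模型选择的示意:用UCB在一组RL智能体配置间自适应分配训练资源。
def ucb_select(counts, means, t, c=1.0):
    for i, n in enumerate(counts):
        if n == 0:
            return i                      # 每个配置先各尝试一次
    scores = [m + c * math.sqrt(math.log(t) / n) for m, n in zip(means, counts)]
    return scores.index(max(scores))

random.seed(0)
true_perf = [0.2, 0.5, 0.8]               # 三个配置的真实回报均值(虚构)
counts = [0, 0, 0]
means = [0.0, 0.0, 0.0]
for t in range(1, 301):
    i = ucb_select(counts, means, t)
    r = true_perf[i] + random.gauss(0, 0.1)
    counts[i] += 1
    means[i] += (r - means[i]) / counts[i]
best = counts.index(max(counts))          # 资源应集中到最优配置2
```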
医学相关(6篇)
【1】ALDI-ray: Adapting the ALDI Framework for Security X-ray Object Detection
标题:ALDI-ray:调整ALDI框架用于安全X射线对象检测
链接:https://arxiv.org/abs/2512.02696
作者:Omid Reza Heidari,Yang Wang,Xinxin Zuo
备注:Submitted to ICASSP 2026 Conference
摘要:目标检测中的域自适应对于真实世界应用至关重要,因为分布偏移会降低模型性能。由于扫描设备和环境条件的差异,安全X射线成像带来独特挑战,导致显著的域差异。为此,我们应用ALDI++,一个集成自蒸馏、特征对齐和增强训练策略的域自适应框架,以有效缓解该领域的域偏移。我们在EDS数据集上进行了大量实验,证明ALDI++在多个自适应场景中超越了最先进(SOTA)的域自适应方法。特别地,采用Vision Transformer for Detection(ViTDet)骨干网络的ALDI++取得了最高的平均精度均值(mAP),印证了基于Transformer的架构在跨域目标检测中的有效性。此外,我们按类别的分析显示检测准确率得到一致提升,强化了模型在不同目标类别上的鲁棒性。我们的研究结果确立了ALDI++作为域自适应目标检测的高效解决方案,为安全X射线图像中的性能稳定性和跨域泛化设定了新的基准。
摘要:Domain adaptation in object detection is critical for real-world applications where distribution shifts degrade model performance. Security X-ray imaging presents a unique challenge due to variations in scanning devices and environmental conditions, leading to significant domain discrepancies. To address this, we apply ALDI++, a domain adaptation framework that integrates self-distillation, feature alignment, and enhanced training strategies to mitigate domain shift effectively in this area. We conduct extensive experiments on the EDS dataset, demonstrating that ALDI++ surpasses the state-of-the-art (SOTA) domain adaptation methods across multiple adaptation scenarios. In particular, ALDI++ with a Vision Transformer for Detection (ViTDet) backbone achieves the highest mean average precision (mAP), confirming the effectiveness of transformer-based architectures for cross-domain object detection. Additionally, our category-wise analysis highlights consistent improvements in detection accuracy, reinforcing the robustness of the model across diverse object classes. Our findings establish ALDI++ as an efficient solution for domain-adaptive object detection, setting a new benchmark for performance stability and cross-domain generalization in security X-ray imagery.
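ALDI++组件之一的自蒸馏通常依赖教师模型参数的指数滑动平均(EMA)更新,其通用形式如下(示意写法,动量取值为假设,非该框架源代码):

```python
# 自蒸馏中教师模型EMA更新的示意:teacher <- m * teacher + (1 - m) * student
def ema_update(teacher, student, momentum=0.999):
    return [t * momentum + s * (1 - momentum) for t, s in zip(teacher, student)]

teacher = [1.0, 0.0]   # 玩具参数向量
student = [0.0, 1.0]
for _ in range(1000):
    teacher = ema_update(teacher, student, momentum=0.99)
# 多步之后教师参数缓慢趋向学生参数
```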
【2】CLEF: Clinically-Guided Contrastive Learning for Electrocardiogram Foundation Models
标题:CREF:心电图基础模型的临床指导对比学习
链接:https://arxiv.org/abs/2512.02180
作者:Yuxuan Shu,Peter H. Charlton,Fahim Kawsar,Jussi Hernesniemi,Mohammad Malekzadeh
备注:The code is available at https://github.com/Nokia-Bell-Labs/ecg-foundation-model
摘要:心电图(ECG)是心血管健康的关键诊断工具。单导联ECG记录已被集成到临床级和消费级可穿戴设备中。虽然在未标注ECG上对基础模型进行自监督预训练可以提升诊断性能,但现有方法并未纳入来自临床元数据的领域知识。我们提出一种新的对比学习方法,即临床引导对比学习:利用一个成熟的临床风险评分对负样本对进行自适应加权。它使ECG嵌入的相似度与受试者之间具有临床意义的差异对齐,并有处理缺失元数据的明确机制。在MIMIC-IV数据集中161K名患者的12导联ECG上,我们仅使用常规收集的元数据、无需逐样本ECG标注,在三个规模上预训练了单导联ECG基础模型,统称为CLEF。我们在7个留出数据集的18个临床分类与回归任务上评估CLEF,并与5个基础模型基线和3种自监督算法进行对比。当在12导联ECG数据上预训练并在导联I数据上测试时,CLEF优于自监督基础模型基线:中等规模的CLEF在分类上平均AUROC至少提升2.6%,在回归上平均MAE至少降低3.2%。与现有自监督学习算法相比,CLEF的平均AUROC至少提升1.8%。此外,当仅在导联I数据上为分类任务进行预训练时,CLEF的性能与以监督方式训练的最先进模型ECGFounder相当。总体而言,CLEF能够实现更准确、更可扩展的单导联ECG分析,推动远程健康监测的发展。代码与预训练的CLEF模型见 github.com/Nokia-Bell-Labs/ecg-foundation-model。
摘要:The electrocardiogram (ECG) is a key diagnostic tool in cardiovascular health. Single-lead ECG recording is integrated into both clinical-grade and consumer wearables. While self-supervised pretraining of foundation models on unlabeled ECGs improves diagnostic performance, existing approaches do not incorporate domain knowledge from clinical metadata. We introduce a novel contrastive learning approach that utilizes an established clinical risk score to adaptively weight negative pairs: clinically-guided contrastive learning. It aligns the similarities of ECG embeddings with clinically meaningful differences between subjects, with an explicit mechanism to handle missing metadata. On 12-lead ECGs from 161K patients in the MIMIC-IV dataset, we pretrain single-lead ECG foundation models at three scales, collectively called CLEF, using only routinely collected metadata without requiring per-sample ECG annotations. We evaluate CLEF on 18 clinical classification and regression tasks across 7 held-out datasets, and benchmark against 5 foundation model baselines and 3 self-supervised algorithms. When pretrained on 12-lead ECG data and tested on lead-I data, CLEF outperforms self-supervised foundation model baselines: the medium-sized CLEF achieves average AUROC improvements of at least 2.6% in classification and average reductions in MAEs of at least 3.2% in regression. Comparing with existing self-supervised learning algorithms, CLEF improves the average AUROC by at least 1.8%. Moreover, when pretrained only on lead-I data for classification tasks, CLEF performs comparably to the state-of-the-art ECGFounder, which was trained in a supervised manner. Overall, CLEF enables more accurate and scalable single-lead ECG analysis, advancing remote health monitoring. Code and pretrained CLEF models are available at: github.com/Nokia-Bell-Labs/ecg-foundation-model.
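“用临床风险评分自适应加权负样本对”的思想可以用一个加权InfoNCE损失示意(假设性简化:权重公式、温度与缺失值处理方式均为虚构,非论文公式):风险差异越大的负对排斥越强,评分缺失时退回权重1。

```python
import math

# 临床引导加权对比损失的示意(假设性简化,非CLEF论文公式)。
def clinical_weighted_infonce(sim_row, risk, anchor=0, pos=1, tau=0.1):
    # sim_row: 锚样本与各样本的相似度; risk: 风险评分(None表示缺失)
    n = len(sim_row)
    weights = []
    for j in range(n):
        if j == anchor:
            w = 0.0                                   # 排除自身
        elif j == pos or risk[anchor] is None or risk[j] is None:
            w = 1.0                                   # 正对或缺失评分:权重1
        else:
            w = 1.0 + abs(risk[anchor] - risk[j])     # 风险差越大,负对权重越大
        weights.append(w)
    z = [math.exp(s / tau) for s in sim_row]
    denom = sum(w * zi for w, zi in zip(weights, z))
    return -math.log(z[pos] / denom)

sim_row = [1.0, 0.9, 0.5, 0.4]
loss_flat = clinical_weighted_infonce(sim_row, [0.0, 0.0, 0.0, 0.0])
loss_wtd  = clinical_weighted_infonce(sim_row, [0.0, 0.0, 3.0, 3.0])
# 风险差异大的负对权重更高,损失相应更大
```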
【3】An Improved Ensemble-Based Machine Learning Model with Feature Optimization for Early Diabetes Prediction
标题:用于早期糖尿病预测的改进的基于集成的机器学习模型具有特征优化
链接:https://arxiv.org/abs/2512.02023
作者:Md. Najmul Islam,Md. Miner Hossain Rimon,Shah Sadek-E-Akbor Shamim,Zarif Mohaimen Fahad,Md. Jehadul Islam Mony,Md. Jalal Uddin Chowdhury
备注:Accepted for presentation at the 7th International Conference on Trends in Computational and Cognitive Engineering (TCCE-2025), 12-13 November 2025. This manuscript contains 10 pages and 7 figures
摘要:糖尿病是一个严重的全球性健康问题,成功的干预取决于早期发现;然而,相互重叠的风险因素和数据不对称使预测变得困难。本研究的目标是利用大规模健康调查数据,构建一个既准确又可理解的糖尿病分类机器学习框架,产生有助于临床决策的结果。我们使用BRFSS数据集评估了多种监督学习技术,并采用SMOTE与Tomek Links来纠正类别不平衡。为提升预测性能,我们同时考察了单个模型与堆叠等集成技术。本研究使用2015年BRFSS数据集,包含约253,680条记录和22个数值特征。单个模型Random Forest、XGBoost、CatBoost和LightGBM均取得了约0.96的强ROC-AUC性能。由XGBoost与KNN组成的堆叠集成取得了最佳总体结果:准确率94.82%,ROC-AUC为0.989,PR-AUC为0.991,表明召回率与精确率之间达到了良好平衡。我们还提出并开发了一个基于React Native、后端采用Python Flask的应用,以支持早期糖尿病预测,为用户提供便捷高效的健康监测工具。
摘要:Diabetes is a serious worldwide health issue, and successful intervention depends on early detection. However, overlapping risk factors and data asymmetry make prediction difficult. Our aim is to use extensive health survey data to create a machine learning framework for diabetes classification that is both accurate and comprehensible, producing results that aid clinical decision-making. Using the BRFSS dataset, we assessed a number of supervised learning techniques. SMOTE and Tomek Links were used to correct class imbalance. To improve prediction performance, both individual models and ensemble techniques such as stacking were investigated. The 2015 BRFSS dataset, which includes roughly 253,680 records with 22 numerical features, is used in this study. Strong ROC-AUC performance of approximately 0.96 was attained by the individual models Random Forest, XGBoost, CatBoost, and LightGBM. The stacking ensemble with XGBoost and KNN yielded the best overall results with 94.82% accuracy, ROC-AUC of 0.989, and PR-AUC of 0.991, indicating a favourable balance between recall and precision. In our study, we proposed and developed a React Native-based application with a Python Flask backend to support early diabetes prediction, providing users with an accessible and efficient health monitoring tool.
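研究中用于纠正类别不平衡的SMOTE,其核心是在少数类样本与其近邻之间线性插值生成合成样本。下面是该插值步骤的极简示意(实际工作应使用imbalanced-learn中的SMOTE与Tomek Links组合,此处仅演示核心插值):

```python
import random

# SMOTE核心插值的示意:synthetic = x + lam * (neighbor - x), lam ∈ [0, 1)
def smote_sample(x, neighbor, rng):
    lam = rng.random()
    return [a + lam * (b - a) for a, b in zip(x, neighbor)]

rng = random.Random(0)
minority = [[1.0, 2.0], [2.0, 3.0]]            # 两个少数类样本(虚构)
synthetic = smote_sample(minority[0], minority[1], rng)
# 合成样本位于两个少数类样本的连线上
```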
【4】A Real-time Face Mask Detection and Social Distancing System for COVID-19 using Attention-InceptionV3 Model
标题:使用Attention-InceptionV3模型的COVID-19实时口罩检测和社交距离系统
链接:https://arxiv.org/abs/2411.05312
作者:Abdullah Al Asif,Farhana Chowdhury Tisha
摘要:由于COVID-19,世界正经历着最致命的流行病之一。这种传染性病毒像野火一样在全世界蔓延。为了最大限度地减少病毒传播,世界卫生组织(WHO)制定了强制佩戴口罩并保持6英尺物理距离的协议。在本文中,我们开发了一个系统,可以检测该距离是否得到适当维持、以及人们是否正确佩戴口罩。我们在系统中使用定制的Attention-InceptionV3模型来识别这两项内容。我们使用了两个不同的数据集,共10,800张图像,包含佩戴与未佩戴口罩的图像。训练准确率达到98%,验证准确率达到99.5%。系统的精确率约为98.2%,帧率(FPS)为25.0。借助该系统,我们可以识别病毒传播可能性最高的高风险区域,这有助于当局及时定位这些危险区域,并提醒当地居民立即采取适当的预防措施。
摘要:One of the deadliest pandemics is now happening in the current world due to COVID-19. This contagious virus is spreading like wildfire around the whole world. To minimize the spreading of this virus, World Health Organization (WHO) has made protocols mandatory for wearing face masks and maintaining 6 feet physical distance. In this paper, we have developed a system that can detect the proper maintenance of that distance and people are properly using masks or not. We have used the customized attention-inceptionv3 model in this system for the identification of those two components. We have used two different datasets along with 10,800 images including both with and without Face Mask images. The training accuracy has been achieved 98% and validation accuracy 99.5%. The system can conduct a precision value of around 98.2% and the frame rate per second (FPS) was 25.0. So, with this system, we can identify high-risk areas with the highest possibility of the virus spreading zone. This may help authorities to take necessary steps to locate those risky areas and alert the local people to ensure proper precautions in no time.
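社交距离检查环节可归结为对检测到的人员质心做两两距离判断。下面是一个假设性示意(像素到英尺的换算系数为虚构,实际取决于相机标定):

```python
import math

# 社交距离检查的示意:小于6英尺的人员对视为违规。
def violations(centroids, feet_per_px=0.05, min_feet=6.0):
    out = []
    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            d = math.dist(centroids[i], centroids[j]) * feet_per_px
            if d < min_feet:
                out.append((i, j))
    return out

# 三个检测框质心(像素坐标,虚构):前两人相距2.5英尺,违规
pairs = violations([(0, 0), (50, 0), (400, 0)])
```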
【5】Comparing Baseline and Day-1 Diffusion MRI Using Multimodal Deep Embeddings for Stroke Outcome Prediction
标题:使用多模式深度嵌入比较基线和第1天的扩散MRI预测中风结局
链接:https://arxiv.org/abs/2512.02088
作者:Sina Raeisadigh,Myles Joshua Toledo Tan,Henning Müller,Abderrahmane Hedjoudje
备注:5 pages, 5 figures, 2 tables
摘要:本研究比较了基线(J0)与24小时(J1)弥散磁共振成像(MRI)对急性缺血性卒中(AIS)后3个月功能结局的预测能力。分析了74例具有配对表观弥散系数(ADC)扫描和临床资料的AIS患者。将三维ResNet-50嵌入与结构化临床变量融合,经主成分分析降维(不超过12个成分),并使用线性支持向量机分类,采用八折分层分组交叉验证。J1多模态模型取得了最高的预测性能(AUC = 0.923 +/- 0.085),优于基于J0的配置(AUC <= 0.86)。纳入病变体积特征进一步提高了模型的稳定性和可解释性。这些发现表明,治疗后早期弥散MRI的预后价值优于治疗前成像;结合MRI、临床和病变体积特征,可为预测AIS患者3个月功能结局提供一个稳健且可解释的框架。
摘要:This study compares baseline (J0) and 24-hour (J1) diffusion magnetic resonance imaging (MRI) for predicting three-month functional outcomes after acute ischemic stroke (AIS). Seventy-four AIS patients with paired apparent diffusion coefficient (ADC) scans and clinical data were analyzed. Three-dimensional ResNet-50 embeddings were fused with structured clinical variables, reduced via principal component analysis (<=12 components), and classified using linear support vector machines with eight-fold stratified group cross-validation. J1 multimodal models achieved the highest predictive performance (AUC = 0.923 +/- 0.085), outperforming J0-based configurations (AUC <= 0.86). Incorporating lesion-volume features further improved model stability and interpretability. These findings demonstrate that early post-treatment diffusion MRI provides superior prognostic value to pre-treatment imaging and that combining MRI, clinical, and lesion-volume features produces a robust and interpretable framework for predicting three-month functional outcomes in AIS patients.
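流程中“经主成分分析降到不超过12个成分”一步,可用numpy的SVD按如下方式实现(示意写法,嵌入维度2048为假设值):

```python
import numpy as np

# 基于SVD的PCA降维示意:中心化后投影到前k个右奇异向量上。
def pca_reduce(X, k=12):
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    k = min(k, Vt.shape[0])
    return Xc @ Vt[:k].T

rng = np.random.default_rng(0)
X = rng.normal(size=(74, 2048))    # 74名患者的深度嵌入(维度为假设)
Z = pca_reduce(X, k=12)            # 降到12维,方差按成分递减
```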
【6】Parallel Multi-Circuit Quantum Feature Fusion in Hybrid Quantum-Classical Convolutional Neural Networks for Breast Tumor Classification
标题:混合量子经典卷积神经网络中并行多路量子特征融合用于乳腺肿瘤分类
链接:https://arxiv.org/abs/2512.02066
作者:Ece Yurtseven
摘要:量子机器学习已成为一种有前景的方法,可改善医学成像等高维数据领域的特征提取与分类任务。在这项工作中,我们提出一种混合量子-经典卷积神经网络(QCNN)架构,用于BreastMNIST数据集的二分类,这是区分良性与恶性乳腺肿瘤的标准化基准。我们的架构将经典卷积特征提取与两个不同的量子电路集成:一个幅度编码变分量子电路(VQC)和一个带循环纠缠的角度编码VQC电路,两者均在四个量子比特上实现。这些电路生成的量子特征嵌入与经典特征融合,形成联合特征空间,随后由全连接分类器处理。为确保公平,混合QCNN与基线经典CNN进行了参数量匹配,使我们能够分离量子层的贡献。两个模型在相同条件下使用Adam优化器和二元交叉熵损失训练。五次独立运行的实验评估表明,与经典CNN相比,混合QCNN在分类准确率上取得了统计显著的提升,经单侧Wilcoxon符号秩检验验证(p = 0.03125),并得到Cohen's d = 2.14的大效应量支持。我们的结果表明,混合QCNN架构能够利用纠缠和量子特征融合来增强医学图像分类任务。这项工作为评估生物医学应用中的混合量子模型建立了统计验证框架,并指出了扩展到更大数据集及在近期量子硬件上部署的路径。
摘要:Quantum machine learning has emerged as a promising approach to improve feature extraction and classification tasks in high-dimensional data domains such as medical imaging. In this work, we present a hybrid Quantum-Classical Convolutional Neural Network (QCNN) architecture designed for the binary classification of the BreastMNIST dataset, a standardized benchmark for distinguishing between benign and malignant breast tumors. Our architecture integrates classical convolutional feature extraction with two distinct quantum circuits: an amplitude-encoding variational quantum circuit (VQC) and an angle-encoding VQC circuit with circular entanglement, both implemented on four qubits. These circuits generate quantum feature embeddings that are fused with classical features to form a joint feature space, which is subsequently processed by a fully connected classifier. To ensure fairness, the hybrid QCNN is parameter-matched against a baseline classical CNN, allowing us to isolate the contribution of quantum layers. Both models are trained under identical conditions using the Adam optimizer and binary cross-entropy loss. Experimental evaluation in five independent runs demonstrates that the hybrid QCNN achieves statistically significant improvements in classification accuracy compared to the classical CNN, as validated by a one-sided Wilcoxon signed rank test (p = 0.03125) and supported by large effect size of Cohen's d = 2.14. Our results indicate that hybrid QCNN architectures can leverage entanglement and quantum feature fusion to enhance medical image classification tasks. This work establishes a statistical validation framework for assessing hybrid quantum models in biomedical applications and highlights pathways for scaling to larger datasets and deployment on near-term quantum hardware.
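角度编码VQC的编码部分可以在不借助量子计算库的情况下示意如下(假设性简化:每个特征经RY旋转作用于|0>,四比特态为各比特态的张量积,并省略了论文电路中的循环纠缠):

```python
import math

# 角度编码量子特征映射的示意:RY(th)|0> = [cos(th/2), sin(th/2)],
# 多个量子比特的态为各单比特态的张量积(此处未加纠缠)。
def angle_encode(thetas):
    state = [1.0]
    for th in thetas:
        qubit = [math.cos(th / 2), math.sin(th / 2)]
        state = [a * q for a in state for q in qubit]   # 张量积
    return state

psi = angle_encode([0.3, 1.1, 2.0, 0.7])   # 4个特征 -> 16维归一化态向量
```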
蒸馏|知识提取(2篇)
【1】Joint Distillation for Fast Likelihood Evaluation and Sampling in Flow-based Models
标题:基于流的模型中快速似然评估和采样的联合蒸馏
链接:https://arxiv.org/abs/2512.02636
作者:Xinyue Ai,Yutong He,Albert Gu,Ruslan Salakhutdinov,J Zico Kolter,Nicholas Matthew Boffi,Max Simchowitz
摘要:对数似然评估为生成模型带来重要能力,包括模型比较、某些微调目标以及许多下游应用。然而矛盾的是,当今一些最好的生成模型(扩散与基于流的模型)仍需数百到数千次神经函数评估(NFE)才能计算单个似然。虽然近期的蒸馏方法已成功将采样加速到少数几步,但这是以似然的易处理性为代价的:现有方法要么完全放弃似然计算,要么仍需在整条轨迹上做昂贵的积分。我们提出快速流联合蒸馏(F2D2)框架,它将采样与似然评估所需的NFE数量同时降低两个数量级。我们的关键洞见是:在连续归一化流中,采样与似然的耦合ODE由共享的底层速度场计算得到,这使我们能够用单个模型联合蒸馏采样轨迹和累积散度。F2D2是模块化的,与现有基于流的少步采样模型兼容,只需额外增加一个散度预测头。实验证明,F2D2能以少步评估获得精确的对数似然,同时保持高样本质量,解决了基于流的生成模型中长期存在的计算瓶颈。作为该方法的一个应用,我们提出一种轻量级自引导方法,使2步MeanFlow模型仅用一次额外的反向NFE就优于1024步的教师模型。
摘要:Log-likelihood evaluation enables important capabilities in generative models, including model comparison, certain fine-tuning objectives, and many downstream applications. Yet paradoxically, some of today's best generative models -- diffusion and flow-based models -- still require hundreds to thousands of neural function evaluations (NFEs) to compute a single likelihood. While recent distillation methods have successfully accelerated sampling to just a few steps, they achieve this at the cost of likelihood tractability: existing approaches either abandon likelihood computation entirely or still require expensive integration over full trajectories. We present fast flow joint distillation (F2D2), a framework that simultaneously reduces the number of NFEs required for both sampling and likelihood evaluation by two orders of magnitude. Our key insight is that in continuous normalizing flows, the coupled ODEs for sampling and likelihood are computed from a shared underlying velocity field, allowing us to jointly distill both the sampling trajectory and cumulative divergence using a single model. F2D2 is modular, compatible with existing flow-based few-step sampling models, and requires only an additional divergence prediction head. Experiments demonstrate F2D2's capability of achieving accurate log-likelihood with few-step evaluations while maintaining high sample quality, solving a long-standing computational bottleneck in flow-based generative models. As an application of our approach, we propose a lightweight self-guidance method that enables a 2-step MeanFlow model to outperform a 1024 step teacher model with only a single additional backward NFE.
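“采样ODE与似然ODE共享同一速度场”这一关键洞见可用一个一维玩具验证(欧拉积分;取速度场v(x) = -x,其散度恒为-1,便于与解析解对照):dx/dt = v(x),d(log p)/dt = -div v(x)。

```python
import math

# 连续归一化流中耦合ODE的示意:沿同一速度场同时积分样本与对数似然修正。
def integrate(x0, T=1.0, steps=1000):
    dt = T / steps
    x, dlogp = x0, 0.0
    for _ in range(steps):
        v = -x            # 玩具速度场 v(x) = -x
        div_v = -1.0      # 其散度恒为-1
        x += v * dt
        dlogp += -div_v * dt
    return x, dlogp

x, dlogp = integrate(2.0)
# 解析解:x(T) = x0 * exp(-T), 对数似然修正 = T
```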
【2】FDRMFL:Multi-modal Federated Feature Extraction Model Based on Information Maximization and Contrastive Learning
标题:FDRMFL:基于信息最大化和对比学习的多模式联邦特征提取模型
链接:https://arxiv.org/abs/2512.02076
作者:Haozhe Wu
备注:14pages,6figures
摘要:本文研究了多模态数据回归中的特征提取问题。为解决现实世界中的三个核心挑战:有限和非IID数据、多模态信息的有效提取和融合、以及模型学习中的灾难性遗忘,提出了一种任务驱动的监督多模态联邦特征提取方法。该方法集成了多模态信息提取和对比学习机制,并能适应不同的神经网络结构作为每种模态数据的潜在映射函数。它支持每个客户端自主学习多模态数据的低维表示,并可以通过参数调整灵活控制低维特征内预测变量中关于响应变量的有效信息的保留程度。该方法构建的多约束学习框架利用均方误差损失保证回归精度;通过互信息保持约束、对称Kullback-Leibler散度约束和模型间对比约束的协同作用,分别实现了任务相关信息的保持,多模态特征的提取、融合与对齐,以及非IID场景下表征漂移和灾难性遗忘的缓解。这确保了特征提取过程始终以提升下游回归任务的性能为中心。仿真和真实数据分析的实验结果表明,与经典特征提取方法相比,该方法在下游回归任务上实现了更显著的性能提升。
摘要:This study focuses on the feature extraction problem in multi-modal data regression. To address three core challenges in real-world scenarios: limited and non-IID data, effective extraction and fusion of multi-modal information, and susceptibility to catastrophic forgetting in model learning, a task-driven supervised multi-modal federated feature extraction method is proposed. The method integrates multi-modal information extraction and contrastive learning mechanisms, and can adapt to different neural network structures as the latent mapping functions for data of each modality. It supports each client to independently learn low-dimensional representations of multi-modal data, and can flexibly control the degree of retention of effective information about the response variable in the predictive variables within the low-dimensional features through parameter tuning. The multi-constraint learning framework constructed by the method guarantees regression accuracy using Mean Squared Error loss. Through the synergistic effect of mutual information preservation constraint, symmetric Kullback-Leibler divergence constraint, and inter-model contrastive constraint, it achieves the retention of task-related information, the extraction, fusion, and alignment of multi-modal features, and the mitigation of representation drift and catastrophic forgetting in non-IID scenarios, respectively. This ensures that the feature extraction process always centers on improving the performance of downstream regression tasks. Experimental results from simulations and real-world data analysis demonstrate that the proposed method achieves more significant performance improvement on downstream regression tasks compared with classical feature extraction techniques.
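框架中的“对称Kullback-Leibler散度约束”一项,其核心量是两个分布之间的对称KL,可示意如下(此处以离散分布为例,仅演示该约束项的计算,非论文的完整损失):

```python
import math

# 对称KL散度的示意:D_sym(p, q) = KL(p || q) + KL(q || p)。
def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(p, q):
    return kl(p, q) + kl(q, p)

p = [0.7, 0.2, 0.1]   # 两个模态低维表示诱导的分布(虚构示例)
q = [0.5, 0.3, 0.2]
d = symmetric_kl(p, q)
```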
Clustering (1 paper)
【1】CREST: Universal Safety Guardrails Through Cluster-Guided Cross-Lingual Transfer
Link: https://arxiv.org/abs/2512.02711
Authors: Lavish Bansal, Naman Mishra
Note: 8 pages, 5 figures, under review
Abstract: Ensuring content safety in large language models (LLMs) is essential for their deployment in real-world applications. However, existing safety guardrails are predominantly tailored for high-resource languages, leaving a significant portion of the world's population underrepresented who communicate in low-resource languages. To address this, we introduce CREST (CRoss-lingual Efficient Safety Transfer), a parameter-efficient multilingual safety classification model that supports 100 languages with only 0.5B parameters. By training on a strategically chosen subset of only 13 high-resource languages, our model utilizes cluster-based cross-lingual transfer from a few to 100 languages, enabling effective generalization to both unseen high-resource and low-resource languages. This approach addresses the challenge of limited training data in low-resource settings. We conduct comprehensive evaluations across six safety benchmarks to demonstrate that CREST outperforms existing state-of-the-art guardrails of comparable scale and achieves competitive results against models with significantly larger parameter counts (2.5B parameters and above). Our findings highlight the limitations of language-specific guardrails and underscore the importance of developing universal, language-agnostic safety systems that can scale effectively to serve global populations.
Super-Resolution|Denoising|Deblurring|Dehazing (1 paper)
【1】SplatSuRe: Selective Super-Resolution for Multi-view Consistent 3D Gaussian Splatting
Link: https://arxiv.org/abs/2512.02172
Authors: Pranav Asthana, Alex Hanson, Allen Tu, Tom Goldstein, Matthias Zwicker, Amitabh Varshney
Note: Project page: https://splatsure.github.io/
Abstract: 3D Gaussian Splatting (3DGS) enables high-quality novel view synthesis, motivating interest in generating higher-resolution renders than those available during training. A natural strategy is to apply super-resolution (SR) to low-resolution (LR) input views, but independently enhancing each image introduces multi-view inconsistencies, leading to blurry renders. Prior methods attempt to mitigate these inconsistencies through learned neural components, temporally consistent video priors, or joint optimization on LR and SR views, but all uniformly apply SR across every image. In contrast, our key insight is that close-up LR views may contain high-frequency information for regions also captured in more distant views, and that we can use the camera pose relative to scene geometry to inform where to add SR content. Building from this insight, we propose SplatSuRe, a method that selectively applies SR content only in undersampled regions lacking high-frequency supervision, yielding sharper and more consistent results. Across Tanks & Temples, Deep Blending and Mip-NeRF 360, our approach surpasses baselines in both fidelity and perceptual quality. Notably, our gains are most significant in localized foreground regions where higher detail is desired.
Autonomous Driving|Vehicles|Lane Detection, etc. (1 paper)
【1】Learning Multimodal Embeddings for Traffic Accident Prediction and Causal Estimation
Link: https://arxiv.org/abs/2512.02920
Authors: Ziniu Zhang, Minxuan Duan, Haris N. Koutsopoulos, Hongyang R. Zhang
Note: 17 pages. To appear in KDD'26 Datasets
Abstract: We consider analyzing traffic accident patterns using both road network data and satellite images aligned to road graph nodes. Previous work for predicting accident occurrences relies primarily on road network structural features while overlooking physical and environmental information from the road surface and its surroundings. In this work, we construct a large multimodal dataset across six U.S. states, containing nine million traffic accident records from official sources, and one million high-resolution satellite images for each node of the road network. Additionally, every node is annotated with features such as the region's weather statistics and road type (e.g., residential vs. motorway), and each edge is annotated with traffic volume information (i.e., Average Annual Daily Traffic). Utilizing this dataset, we conduct a comprehensive evaluation of multimodal learning methods that integrate both visual and network embeddings. Our findings show that integrating both data modalities improves prediction accuracy, achieving an average AUROC of $90.1\%$, which is a $3.7\%$ gain over graph neural network models that only utilize graph structures. With the improved embeddings, we conduct a causal analysis based on a matching estimator to estimate the key contributing factors influencing traffic accidents. We find that accident rates rise by $24\%$ under higher precipitation, by $22\%$ on higher-speed roads such as motorways, and by $29\%$ due to seasonal patterns, after adjusting for other confounding factors. Ablation studies confirm that satellite imagery features are essential for achieving accurate prediction.
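The matching-estimator analysis described in this abstract can be sketched in a few lines of nearest-neighbor covariate matching. The covariates, treatment, and outcome below are synthetic stand-ins (not the paper's dataset), and one-nearest-neighbor Euclidean matching is an illustrative choice:

```python
import numpy as np

def matching_att(X, treated, y, n_neighbors=1):
    """Estimate the average treatment effect on the treated (ATT)
    via nearest-neighbor covariate matching.

    X        : (n, d) array of confounders (e.g., road type, season)
    treated  : (n,) boolean treatment indicator (e.g., high precipitation)
    y        : (n,) outcome (e.g., accident rate)
    """
    Xt, Xc = X[treated], X[~treated]
    yt, yc = y[treated], y[~treated]
    effects = []
    for xi, yi in zip(Xt, yt):
        # distance from this treated unit to every control unit
        d = np.linalg.norm(Xc - xi, axis=1)
        nn = np.argsort(d)[:n_neighbors]        # closest control units
        effects.append(yi - yc[nn].mean())      # unit-level contrast
    return float(np.mean(effects))

# toy data: outcome = 1.0 * treatment + 0.5 * confounder + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
t = (X[:, 0] + rng.normal(size=500)) > 0
y = 1.0 * t + 0.5 * X[:, 0] + rng.normal(scale=0.1, size=500)
print(round(matching_att(X, t, y), 2))  # close to the true effect 1.0
```

Because matching is done on the confounder itself, the naive treated-vs-control mean difference (which is biased upward here) is corrected toward the true effect.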
Federated Learning|Privacy Protection|Encryption (1 paper)
【1】Decentralized Fairness Aware Multi Task Federated Learning for VR Network
Link: https://arxiv.org/abs/2512.02513
Authors: Krishnendu S. Tharakan, Carlo Fischione
Note: Accepted for IEEE Globecom Workshop 2025
Abstract: Wireless connectivity promises to unshackle virtual reality (VR) experiences, allowing users to engage from anywhere, anytime. However, delivering seamless, high-quality, real-time VR video wirelessly is challenging due to the stringent quality of experience requirements, low latency constraints, and limited VR device capabilities. This paper addresses these challenges by introducing a novel decentralized multi task fair federated learning (DMTFL) based caching scheme that caches and prefetches each VR user's field of view (FOV) at base stations (BSs) based on caching strategies tailored to each BS. Federated learning (FL) in its naive form often biases toward certain users, and a single global model fails to capture the statistical heterogeneity across users and BSs. In contrast, the proposed DMTFL algorithm personalizes content delivery by learning individual caching models at each BS. These models are further optimized to perform well under any target distribution, while providing theoretical guarantees via Rademacher complexity and a probably approximately correct (PAC) bound on the loss. Using a realistic VR head-tracking dataset, our simulations demonstrate the superiority of our proposed DMTFL algorithm compared to baseline algorithms.
Reasoning|Analysis|Understanding|Interpretation (11 papers)
【1】Pruning AMR: Efficient Visualization of Implicit Neural Representations via Weight Matrix Analysis
Link: https://arxiv.org/abs/2512.02967
Authors: Jennifer Zvonek, Andrew Gillette
Abstract: An implicit neural representation (INR) is a neural network that approximates a spatiotemporal function. Many memory-intensive visualization tasks, including modern 4D CT scanning methods, represent data natively as INRs. While INRs are prized for being more memory-efficient than traditional data stored on a lattice, many visualization tasks still require discretization to a regular grid. We present PruningAMR, an algorithm that builds a mesh with resolution adapted to geometric features encoded by the INR. To identify these geometric features, we use an interpolative decomposition pruning method on the weight matrices of the INR. The resulting pruned network is used to guide adaptive mesh refinement, enabling automatic mesh generation tailored to the underlying resolution of the function. Starting from a pre-trained INR--without access to its training data--we produce a variable resolution visualization with substantial memory savings.
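The column-selection step behind an interpolative decomposition can be sketched with a greedy pivoted projection. This is a simplified stand-in for the paper's pruning procedure, and the exactly rank-3 toy "layer" is an assumption for illustration:

```python
import numpy as np

def id_prune_columns(W, k):
    """Pick k "skeleton" columns of a weight matrix W via greedy pivoted
    projection (a simple stand-in for a pivoted-QR interpolative
    decomposition), then express all columns in terms of the kept ones."""
    R = W.astype(float).copy()
    idx = []
    for _ in range(k):
        j = int(np.argmax(np.linalg.norm(R, axis=0)))  # largest residual column
        idx.append(j)
        q = R[:, j] / np.linalg.norm(R[:, j])
        R -= np.outer(q, q @ R)                        # deflate that direction
    P = np.linalg.lstsq(W[:, idx], W, rcond=None)[0]   # k x n interpolation matrix
    return idx, P

# toy "layer" with only 3 independent directions among 16 neurons
rng = np.random.default_rng(1)
W = rng.normal(size=(32, 3)) @ rng.normal(size=(3, 16))
idx, P = id_prune_columns(W, 3)
err = np.linalg.norm(W - W[:, idx] @ P) / np.linalg.norm(W)
print(len(idx), err < 1e-8)  # 3 kept columns reconstruct the exact-rank-3 layer
```

Columns outside `idx` correspond to neurons whose contribution is (approximately) a linear combination of the kept ones, which is what makes them candidates for pruning.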
【2】FiMMIA: scaling semantic perturbation-based membership inference across modalities
Link: https://arxiv.org/abs/2512.02786
Authors: Anton Emelyanov, Sergei Kudriashov, Alena Fenogenova
Note: System demo track paper for EACL 2026
Abstract: Membership Inference Attacks (MIAs) aim to determine whether a specific data point was included in the training set of a target model. Although numerous methods have been developed for detecting data contamination in large language models (LLMs), their performance on multimodal LLMs (MLLMs) falls short due to the instabilities introduced through multimodal component adaptation and possible distribution shifts across multiple inputs. In this work, we investigate multimodal membership inference and address two issues: first, by identifying distribution shifts in the existing datasets, and second, by releasing an extended baseline pipeline to detect them. We also generalize the perturbation-based membership inference methods to MLLMs and release \textbf{FiMMIA} -- a modular \textbf{F}ramework for \textbf{M}ultimodal \textbf{MIA}.\footnote{The source code and framework have been made publicly available under the MIT license via \href{https://github.com/ai-forever/data_leakage_detect}{link}. The video demonstration is available on \href{https://youtu.be/a9L4-H80aSg}{YouTube}.} Our approach trains a neural network to analyze the target model's behavior on perturbed inputs, capturing distributional differences between members and non-members. Comprehensive evaluations on various fine-tuned multimodal models demonstrate the effectiveness of our perturbation-based membership inference attacks in multimodal domains.
【3】Sparse Computations in Deep Learning Inference
Link: https://arxiv.org/abs/2512.02550
Authors: Ioanna Tasou, Panagiotis Mpakos, Angelos Vlachos, Dionysios Adamopoulos, Georgios Giannakopoulos, Konstantinos Katsikopoulos, Ioannis Karaparisis, Maria Lazou, Spyridon Loukovitis, Areti Mei, Anastasia Poulopoulou, Angeliki Dimitriou, Giorgos Filandrianos, Dimitrios Galanopoulos, Vasileios Karampinis, Ilias Mitsouras, Nikolaos Spanos, Petros Anastasiadis, Ioannis Doudalis, Konstantinos Nikas, George Retsinas, Paraskevi Tzouveli, Christina Giannoula, Nectarios Koziris, Nikela Papadopoulou, Giorgos Stamou, Athanasios Voulodimos, Georgios Goumas
Abstract: The computational demands of modern Deep Neural Networks (DNNs) are immense and constantly growing. While training costs usually capture public attention, inference demands are also contributing in significant computational, energy and environmental footprints. Sparsity stands out as a critical mechanism for drastically reducing these resource demands. However, its potential remains largely untapped and is not yet fully incorporated in production AI systems. To bridge this gap, this work provides the necessary knowledge and insights for performance engineers keen to get involved in deep learning inference optimization. In particular, in this work we: a) discuss the various forms of sparsity that can be utilized in DNN inference, b) explain how the original dense computations translate to sparse kernels, c) provide an extensive bibliographic review of the state-of-the-art in the implementation of these kernels for CPUs and GPUs, d) discuss the availability of sparse datasets in support of sparsity-related research and development, e) explore the current software tools and frameworks that provide robust sparsity support, and f) present evaluation results of different implementations of the key SpMM and SDDMM kernels on CPU and GPU platforms. Ultimately, this paper aims to serve as a resource for performance engineers seeking to develop and deploy highly efficient sparse deep learning models in production.
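As a concrete reference point for the SpMM kernel this survey evaluates, a minimal CSR-format implementation looks like the following. This is a naive reference loop for clarity, not an optimized kernel of the kind the paper reviews:

```python
import numpy as np

def csr_spmm(indptr, indices, data, B):
    """Multiply a CSR sparse matrix A (m x k) with a dense matrix B (k x n).
    Real kernels block, vectorize, and parallelize this triple loop."""
    m, n = len(indptr) - 1, B.shape[1]
    C = np.zeros((m, n))
    for i in range(m):                         # one output row per sparse row
        for p in range(indptr[i], indptr[i + 1]):
            C[i] += data[p] * B[indices[p]]    # scale-and-add one row of B
    return C

# A = [[1, 0, 2],
#      [0, 3, 0]] stored in CSR form
indptr  = [0, 2, 3]
indices = [0, 2, 1]
data    = [1.0, 2.0, 3.0]
B = np.arange(6.0).reshape(3, 2)   # rows [0,1], [2,3], [4,5]
C = csr_spmm(indptr, indices, data, B)
print(C)  # row 0 = B[0] + 2*B[2], row 1 = 3*B[1]
```

The CSR layout (`indptr`/`indices`/`data`) is what makes the inner loop touch only the nonzeros of each row, which is the source of the savings the survey discusses.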
【4】Water Quality Estimation Through Machine Learning Multivariate Analysis
Link: https://arxiv.org/abs/2512.02508
Authors: Marco Cardia, Stefano Chessa, Alessio Micheli, Antonella Giuliana Luminare, Francesca Gambineri
Note: Accepted at the Italian Workshop on Neural Networks (WIRN) 2024
Abstract: The quality of water is key to the quality of the agrifood sector. Water is used in agriculture for fertigation, for animal husbandry, and in the agrifood processing industry. In the context of the progressive digitalization of this sector, the automatic assessment of the quality of water is thus becoming an important asset. In this work, we present the integration of Ultraviolet-Visible (UV-Vis) spectroscopy with Machine Learning in the context of water quality assessment, aiming at ensuring water safety and compliance with water regulations. Furthermore, we emphasize the importance of model interpretability by employing SHapley Additive exPlanations (SHAP) to understand the contribution of absorbance at different wavelengths to the predictions. Our approach demonstrates the potential for rapid, accurate, and interpretable assessment of key water quality parameters.
【5】WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning
Link: https://arxiv.org/abs/2512.02425
Authors: Woongyeong Yeo, Kangsan Kim, Jaehong Yoon, Sung Ju Hwang
Note: Project page: https://worldmm.github.io
Abstract: Recent advances in video large language models have demonstrated strong capabilities in understanding short clips. However, scaling them to hours- or days-long videos remains highly challenging due to limited context capacity and the loss of critical visual details during abstraction. Existing memory-augmented methods mitigate this by leveraging textual summaries of video segments, yet they heavily rely on text and fail to utilize visual evidence when reasoning over complex scenes. Moreover, retrieving from fixed temporal scales further limits their flexibility in capturing events that span variable durations. To address this, we introduce WorldMM, a novel multimodal memory agent that constructs and retrieves from multiple complementary memories, encompassing both textual and visual representations. WorldMM comprises three types of memory: episodic memory indexes factual events across multiple temporal scales, semantic memory continuously updates high-level conceptual knowledge, and visual memory preserves detailed information about scenes. During inference, an adaptive retrieval agent iteratively selects the most relevant memory source and leverages multiple temporal granularities based on the query, continuing until it determines that sufficient information has been gathered. WorldMM significantly outperforms existing baselines across five long video question-answering benchmarks, achieving an average 8.4% performance gain over previous state-of-the-art methods, showing its effectiveness on long video reasoning.
【6】OmniGuard: Unified Omni-Modal Guardrails with Deliberate Reasoning
Link: https://arxiv.org/abs/2512.02306
Authors: Boyu Zhu, Xiaofei Wen, Wenjie Jacky Mo, Tinghui Zhu, Yanan Xie, Peng Qi, Muhao Chen
Abstract: Omni-modal Large Language Models (OLLMs) that process text, images, videos, and audio introduce new challenges for safety and value guardrails in human-AI interaction. Prior guardrail research largely targets unimodal settings and typically frames safeguarding as binary classification, which limits robustness across diverse modalities and tasks. To address this gap, we propose OmniGuard, the first family of omni-modal guardrails that performs safeguarding across all modalities with deliberate reasoning ability. To support the training of OmniGuard, we curate a large, comprehensive omni-modal safety dataset comprising over 210K diverse samples, with inputs that cover all modalities through both unimodal and cross-modal samples. Each sample is annotated with structured safety labels and carefully curated safety critiques from expert models through targeted distillation. Extensive experiments on 15 benchmarks show that OmniGuard achieves strong effectiveness and generalization across a wide range of multimodal safety scenarios. Importantly, OmniGuard provides a unified framework that enforces policies and mitigates risks in omni-modalities, paving the way toward building more robust and capable omni-modal safeguarding systems.
【7】The Effect of Enforcing Fairness on Reshaping Explanations in Machine Learning Models
Link: https://arxiv.org/abs/2512.02265
Authors: Joshua Wolff Anderson, Shyam Visweswaran
Note: 10 pages, 3 figures, 2 tables
Abstract: Trustworthy machine learning in healthcare requires strong predictive performance, fairness, and explanations. While it is known that improving fairness can affect predictive performance, little is known about how fairness improvements influence explainability, an essential ingredient for clinical trust. Clinicians may hesitate to rely on a model whose explanations shift after fairness constraints are applied. In this study, we examine how enhancing fairness through bias mitigation techniques reshapes Shapley-based feature rankings. We quantify changes in feature importance rankings after applying fairness constraints across three datasets: pediatric urinary tract infection risk, direct anticoagulant bleeding risk, and recidivism risk. We also evaluate multiple model classes on the stability of Shapley-based rankings. We find that increasing model fairness across racial subgroups can significantly alter feature importance rankings, sometimes in different ways across groups. These results highlight the need to jointly consider accuracy, fairness, and explainability in model assessment rather than in isolation.
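Shapley-based feature rankings of the kind compared in this study can be computed exactly for small feature counts by enumerating coalitions. The linear model, input, and zero background below are illustrative assumptions, not the paper's models or data:

```python
import numpy as np
from itertools import combinations
from math import factorial

def shapley_values(f, x, background):
    """Exact Shapley attributions for the prediction f(x); features absent
    from a coalition are replaced by a background (e.g., dataset-mean) value."""
    d = len(x)
    phi = np.zeros(d)

    def v(S):                       # value of coalition S
        z = background.copy()
        z[list(S)] = x[list(S)]
        return f(z)

    for i in range(d):
        others = [j for j in range(d) if j != i]
        for r in range(d):
            for S in combinations(others, r):
                w = factorial(r) * factorial(d - r - 1) / factorial(d)
                phi[i] += w * (v(S + (i,)) - v(S))  # weighted marginal contribution
    return phi

# linear model: the Shapley value of feature i reduces to w_i * (x_i - bg_i)
w = np.array([2.0, -1.0, 0.5])
f = lambda z: float(w @ z)
x = np.array([1.0, 1.0, 1.0])
bg = np.zeros(3)
phi = shapley_values(f, x, bg)
print(phi)                          # ~ [2.0, -1.0, 0.5]
print(np.argsort(-np.abs(phi)))     # feature-importance ranking by |phi|
```

Comparing the `argsort` ranking before and after retraining under a fairness constraint is exactly the kind of rank-shift measurement the abstract describes.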
【8】How Market Volatility Shapes Algorithmic Collusion: A Comparative Analysis of Learning-Based Pricing Algorithms
Link: https://arxiv.org/abs/2512.02134
Authors: Aheer Sravon, Md. Ibrahim, Devdyuti Mazumder, Ridwan Al Aziz
Abstract: Autonomous pricing algorithms are increasingly influencing competition in digital markets; however, their behavior under realistic demand conditions remains largely unexamined. This paper offers a thorough analysis of four pricing algorithms -- Q-Learning, PSO, Double DQN, and DDPG -- across three classic duopoly models (Logit, Hotelling, Linear) and under various demand-shock regimes created by auto-regressive processes. By utilizing profit- and price-based collusion indices, we investigate how the interactions among algorithms, market structure, and stochastic demand collaboratively influence competitive outcomes. Our findings reveal that reinforcement-learning algorithms often sustain supra-competitive prices under stable demand, with DDPG demonstrating the most pronounced collusive tendencies. Demand shocks produce notably varied effects: Logit markets suffer significant performance declines, Hotelling markets remain stable, and Linear markets experience shock-induced profit inflation. Despite marked changes in absolute performance, the relative rankings of the algorithms are consistent across different environments. These results underscore the critical importance of market structure and demand uncertainty in shaping algorithmic competition, while also contributing to the evolving policy discussions surrounding autonomous pricing behavior.
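One common profit-based collusion index normalizes realized profit between the competitive (Nash) and fully collusive (monopoly) benchmarks; whether this exact normalization matches the paper's index is an assumption, and the profit figures below are hypothetical:

```python
def collusion_index(avg_profit, nash_profit, monopoly_profit):
    """Profit-based collusion index: 0 at the competitive (Nash) benchmark,
    1 at the fully collusive (monopoly) benchmark."""
    return (avg_profit - nash_profit) / (monopoly_profit - nash_profit)

# hypothetical per-firm profits from a simulated duopoly run
delta = collusion_index(avg_profit=0.25, nash_profit=0.22, monopoly_profit=0.34)
print(delta)  # 0.25: the algorithms capture a quarter of the collusive surplus
```

A price-based index has the same shape with prices substituted for profits, which is why the two indices are typically reported together.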
【9】Mixed precision accumulation for neural network inference guided by componentwise forward error analysis
Link: https://arxiv.org/abs/2503.15568
Authors: El-Mehdi El Arar, Silviu-Ioan Filip, Theo Mary, Elisa Riccietti
Abstract: This work proposes a mathematically founded mixed precision accumulation strategy for the inference of neural networks. Our strategy is based on a new componentwise forward error analysis that explains the propagation of errors in the forward pass of neural networks. Specifically, our analysis shows that the error in each component of the output of a linear layer is proportional to the condition number of the inner product between the weights and the input, multiplied by the condition number of the activation function. These condition numbers can vary widely from one component to the other, thus creating a significant opportunity to introduce mixed precision: each component should be accumulated in a precision inversely proportional to the product of these condition numbers. We propose a numerical algorithm that exploits this observation: it first computes all components in low precision, uses this output to estimate the condition numbers, and recomputes in higher precision only the components associated with large condition numbers. We test our algorithm on various networks and datasets and confirm experimentally that it can significantly improve the cost--accuracy tradeoff compared with uniform precision accumulation baselines.
【10】From Betti Numbers to Persistence Diagrams: A Hybrid Quantum Algorithm for Topological Data Analysis
Link: https://arxiv.org/abs/2512.02081
Authors: Dong Liu
Note: 11 pages
Abstract: Persistence diagrams serve as a core tool in topological data analysis, playing a crucial role in pathological monitoring, drug discovery, and materials design. However, existing quantum topological algorithms, such as the LGZ algorithm, can only efficiently compute summary statistics like Betti numbers, failing to provide persistence diagram information that tracks the lifecycle of individual topological features, severely limiting their practical value. This paper proposes a novel quantum-classical hybrid algorithm that achieves, for the first time, the leap from "quantum computation of Betti numbers" to "quantum acquisition of practical persistence diagrams." The algorithm leverages the LGZ quantum algorithm as an efficient feature extractor, mining the harmonic form eigenvectors of the combinatorial Laplacian as well as Betti numbers, constructing specialized topological kernel functions to train a quantum support vector machine (QSVM), and learning the mapping from quantum topological features to persistence diagrams. The core contributions of this algorithm are: (1) elevating quantum topological computation from statistical summaries to pattern recognition, greatly expanding its application value; (2) obtaining more practical topological information in the form of persistence diagrams for real-world applications while maintaining the exponential speedup advantage of quantum computation; (3) proposing a novel hybrid paradigm of "classical precision guiding quantum efficiency." This method provides a feasible pathway for the practical implementation of quantum topological data analysis.
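While the paper's pipeline is quantum-classical, the persistence diagram it targets is a classical object, and its 0-dimensional case can be illustrated with the standard union-find computation over a filtration of edges:

```python
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))
        self.birth = [0.0] * n      # all vertices enter at filtration value 0

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

def persistence_0d(n_vertices, edges):
    """0-dimensional persistence diagram as (birth, death) pairs for the
    connected components of a filtration. `edges` is a list of
    (filtration_value, u, v); by the elder rule, the younger component
    dies when two components merge."""
    uf = UnionFind(n_vertices)
    diagram = []
    for t, u, v in sorted(edges):
        ru, rv = uf.find(u), uf.find(v)
        if ru == rv:
            continue                       # edge closes a cycle: no 0-dim event
        young, old = (ru, rv) if uf.birth[ru] >= uf.birth[rv] else (rv, ru)
        diagram.append((uf.birth[young], t))
        uf.parent[young] = old
    diagram.append((0.0, float("inf")))    # one component never dies
    return diagram

# path graph 0-1-2 with edges appearing at t=0.3 and t=0.7
print(persistence_0d(3, [(0.3, 0, 1), (0.7, 1, 2)]))
# -> [(0.0, 0.3), (0.0, 0.7), (0.0, inf)]
```

Each (birth, death) pair tracks one component's lifecycle, which is exactly the per-feature information that Betti numbers alone discard.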
【11】From 'What-is' to 'What-if' in Human-Factor Analysis: A Post-Occupancy Evaluation Case
Link: https://arxiv.org/abs/2512.02060
Authors: Xia Chen, Ruiji Sun, Philipp Geyer, André Borrmann, Stefano Schiavon
Note: 17 pages, 5 figures
Abstract: Human-factor analysis typically employs correlation analysis and significance testing to identify relationships between variables. However, these descriptive ('what-is') methods, while effective for identifying associations, are often insufficient for answering causal ('what-if') questions. Their application in such contexts often overlooks confounding and colliding variables, potentially leading to bias and suboptimal or incorrect decisions. We advocate for explicitly distinguishing descriptive from interventional questions in human-factor analysis, and applying causal inference frameworks specifically to these problems to prevent methodological mismatches. This approach disentangles complex variable relationships and enables counterfactual reasoning. Using post-occupancy evaluation (POE) data from the Center for the Built Environment's (CBE) Occupant Survey as a demonstration case, we show how causal discovery reveals intervention hierarchies and directional relationships that traditional associational analysis misses. The systematic distinction between causally associated and independent variables, combined with intervention prioritization capabilities, offers broad applicability to complex human-centric systems, for example, in building science or ergonomics, where understanding intervention effects is critical for optimization and decision-making.
Detection (1 paper)
【1】PhishSnap: Image-Based Phishing Detection Using Perceptual Hashing
Link: https://arxiv.org/abs/2512.02243
Authors: Md Abdul Ahad Minhaz, Zannatul Zahan Meem, Md. Shohrab Hossain
Note: IEEE standard formatting, 3 pages, 3 figures
Abstract: Phishing remains one of the most prevalent online threats, exploiting human trust to harvest sensitive credentials. Existing URL- and HTML-based detection systems struggle against obfuscation and visual deception. This paper presents \textbf{PhishSnap}, a privacy-preserving, on-device phishing detection system leveraging perceptual hashing (pHash). Implemented as a browser extension, PhishSnap captures webpage screenshots, computes visual hashes, and compares them against legitimate templates to identify visually similar phishing attempts. A \textbf{2024 dataset of 10,000 URLs} (70\%/20\%/10\% train/validation/test) was collected from PhishTank and Netcraft. Due to security takedowns, a subset of phishing pages was unavailable, reducing dataset diversity. The system achieved \textbf{0.79 accuracy}, \textbf{0.76 precision}, and \textbf{0.78 recall}, showing that visual similarity remains a viable anti-phishing measure. The entire inference process occurs locally, ensuring user privacy and minimal latency.
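Hash-based visual comparison of this kind can be sketched with a block-average hash and a Hamming-distance check; this is a simplified stand-in for the DCT-based pHash variant, and the synthetic "screenshots" below are random arrays for illustration:

```python
import numpy as np

def average_hash(img, hash_size=8):
    """Downsample a grayscale image to hash_size x hash_size by block
    averaging, then threshold at the mean to get a 64-bit fingerprint."""
    h, w = img.shape
    img = img[:h - h % hash_size, :w - w % hash_size]
    blocks = img.reshape(hash_size, h // hash_size,
                         hash_size, w // hash_size).mean(axis=(1, 3))
    return (blocks > blocks.mean()).flatten()

def hamming(a, b):
    """Number of differing hash bits; small means visually similar."""
    return int(np.count_nonzero(a != b))

# a page "screenshot" and a slightly brightened copy hash almost identically,
# while an unrelated page differs in roughly half its bits
rng = np.random.default_rng(0)
page = rng.random((64, 64))
tweaked = np.clip(page + 0.02, 0, 1)
other = rng.random((64, 64))
d_same = hamming(average_hash(page), average_hash(tweaked))
d_diff = hamming(average_hash(page), average_hash(other))
print(d_same, d_diff)  # small distance vs. large distance
```

A detector then flags a page whose hash is within a small Hamming radius of a known brand template but whose URL does not belong to that brand.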
Classification|Recognition (3 papers)
【1】SAND Challenge: Four Approaches for Dysarthria Severity Classification
Link: https://arxiv.org/abs/2512.02669
Authors: Gauri Deshpande, Harish Battula, Ashish Panda, Sunil Kumar Kopparapu
Note: 7 pages, 5 figures
Abstract: This paper presents a unified study of four distinct modeling approaches for classifying dysarthria severity in the Speech Analysis for Neurodegenerative Diseases (SAND) challenge. All models tackle the same five-class classification task using a common dataset of speech recordings. We investigate: (1) a ViT-OF method leveraging a Vision Transformer on spectrogram images, (2) a 1D-CNN approach using eight 1-D CNNs with majority-vote fusion, (3) a BiLSTM-OF approach using nine BiLSTM models with majority-vote fusion, and (4) a Hierarchical XGBoost ensemble that combines glottal and formant features through a two-stage learning framework. Each method is described, and their performances on a validation set of 53 speakers are compared. Results show that while the feature-engineered XGBoost ensemble achieves the highest macro-F1 (0.86), the deep learning models (ViT, CNN, BiLSTM) attain competitive F1-scores (0.70) and offer complementary insights into the problem.
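The majority-vote fusion used by the 1D-CNN and BiLSTM ensembles can be sketched directly; the tie-breaking rule below (earliest model wins) is an assumption, since the abstract does not specify one:

```python
from collections import Counter

def majority_vote(predictions):
    """Fuse per-model class predictions for one utterance by majority vote;
    ties are broken in favor of the earliest model's prediction."""
    counts = Counter(predictions)
    best = max(counts.values())
    for p in predictions:            # first listed model wins among tied classes
        if counts[p] == best:
            return p

# nine hypothetical BiLSTM models voting over five severity classes 0..4
votes = [2, 2, 3, 2, 1, 3, 2, 4, 2]
print(majority_vote(votes))  # -> 2
```

With an odd number of voters and a clear plurality, as in the nine-model BiLSTM ensemble, the fused label is simply the modal class.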
【2】Modeling and Inverse Identification of Interfacial Heat Conduction in Finite Layer and Semi-Infinite Substrate Systems via a Physics-Guided Neural Framework
标题:通过物理引导神经框架对有限层和半无限基片系统中界面热传导进行建模和逆识别
链接:https://arxiv.org/abs/2512.02618
作者:Wenhao Sha,Tienchong Chang
摘要:半导体器件中的热传递由芯片和衬底组件主导,其中有限芯片层内产生的热量消散到具有高得多的热物理性质的半无限衬底中。这种不匹配产生陡峭的界面温度梯度,使得瞬态热响应对界面高度敏感。传统的数值求解器需要过度的离散化来解决这些动力学问题,而物理信息神经网络(PINN)通常在材料界面附近表现出不稳定的收敛性和物理一致性的损失。为了解决这些挑战,我们引入了HeatTransFormer,一个物理指导的Transformer架构,用于界面主导的扩散问题。该框架集成了物理信息时空采样,基于拉普拉斯的激活仿真分析扩散解决方案,和无掩模的注意力机制,支持双向时空耦合。这些组件使模型能够解决陡峭的梯度,保持物理一致性,并在PINN通常失败的地方保持稳定。当应用于有限层和半无限衬底配置时,HeatTransFormer在界面上产生相干温度场。再加上一个物理约束的逆策略,它进一步使可靠的识别三个未知的热性能同时使用外部测量。总体而言,这项工作表明,物理指导的Transformer架构提供了一个统一的框架,在接口为主的热系统的正向和反向建模。
摘要:Heat transfer in semiconductor devices is dominated by chip and substrate assemblies, where heat generated within a finite chip layer dissipates into a semi-infinite substrate with much higher thermophysical properties. This mismatch produces steep interfacial temperature gradients, making the transient thermal response highly sensitive to the interface. Conventional numerical solvers require excessive discretization to resolve these dynamics, while physics-informed neural networks (PINNs) often exhibit unstable convergence and loss of physical consistency near the material interface. To address these challenges, we introduce HeatTransFormer, a physics-guided Transformer architecture for interface-dominated diffusion problems. The framework integrates physically informed spatiotemporal sampling, a Laplace-based activation emulating analytical diffusion solutions, and a mask-free attention mechanism supporting bidirectional spatiotemporal coupling. These components enable the model to resolve steep gradients, maintain physical consistency, and remain stable where PINNs typically fail. HeatTransFormer produces coherent temperature fields across the interface when applied to a finite layer and semi-infinite substrate configuration. Coupled with a physics-constrained inverse strategy, it further enables reliable identification of three unknown thermal properties simultaneously using only external measurements. Overall, this work demonstrates that physics-guided Transformer architectures provide a unified framework for forward and inverse modeling in interface-dominated thermal systems.
【3】Laplace Approximation For Tensor Train Kernel Machines In System Identification
标题:系统辨识中张量训练核机器的拉普拉斯近似
链接:https://arxiv.org/abs/2512.02532
作者:Albert Saiapin,Kim Batselier
备注:6 pages, 2 figures, 4 tables. Submitted to IFAC 2026. Code available at: https://github.com/AlbMLpy/laplace-ttkm
摘要:为了解决高斯过程(GP)回归的可扩展性限制,已经提出了几种近似技术。其中一种方法基于张量网络,它可以利用指数数量的基函数而不产生指数级的计算成本。然而,将该模型扩展为完全概率化的形式会带来若干设计上的挑战。特别是对于张量训练(TT)模型,尚不清楚应以贝叶斯方式处理哪一个TT核心。我们引入了一种贝叶斯张量训练核机器,它应用拉普拉斯近似来估计选定TT核心上的后验分布,并对精度超参数采用变分推理(VI)。实验表明,核心的选择在很大程度上独立于TT秩和特征结构,并且VI可以取代交叉验证,同时带来高达65倍的训练加速。我们在一个逆动力学问题上验证了该方法的有效性。
摘要:To address the scalability limitations of Gaussian process (GP) regression, several approximation techniques have been proposed. One such method is based on tensor networks, which utilizes an exponential number of basis functions without incurring exponential computational cost. However, extending this model to a fully probabilistic formulation introduces several design challenges. In particular, for tensor train (TT) models, it is unclear which TT-core should be treated in a Bayesian manner. We introduce a Bayesian tensor train kernel machine that applies Laplace approximation to estimate the posterior distribution over a selected TT-core and employs variational inference (VI) for precision hyperparameters. Experiments show that core selection is largely independent of TT-ranks and feature structure, and that VI replaces cross-validation while offering up to 65x faster training. The method's effectiveness is demonstrated on an inverse dynamics problem.
表征(2篇)
【1】FAIRY2I: Universal Extremely-Low Bit QAT framework via Widely-Linear Representation and Phase-Aware Quantization
标题:FAIRY 2I:通过宽线性表示和相感知量化的通用极低位QAT框架
链接:https://arxiv.org/abs/2512.02901
作者:Feiyu Wang,Xinyu Tan,Bokai Huang,Yihao Zhang,Guoan Wang,Peizhuang Cong,Tong Yang
备注:15 pages, 3 figures
摘要:大型语言模型(LLM)已经彻底改变了人工智能,但其庞大的内存和计算需求需要激进的量化,表示正被不断推向单比特的理论极限。虽然像iFairy这样的复值LLM与实值LLM相比为低比特表示提供了更好的机会,但它们需要从头开始训练,从而无法利用庞大的预训练实值基础模型生态。在这里,我们介绍了Fairy2i,这是一个通用框架,可以将预训练的实值层转换为等效的宽线性复数形式,在复用现有检查点的同时实现极低比特量化。通过证明实值映射与宽线性映射之间的无损数学等价性,我们将标准Transformer转换到复数域,并采用以四次单位根为码本的高效相位感知量化方案。此外,我们引入了一个递归残差量化机制,迭代地最小化量化误差,使推理可以通过高效的免乘法累加进行。我们证明了Fairy2i在有效2比特精度下将LLaMA-2 7B的性能恢复到几乎与全精度基线相当的水平,显著优于最先进的实值二值和三值量化方法。这项工作弥合了复值算术的表示效率与预训练模型的实际效用之间的差距,为商用硬件上的高效推理铺平了新的道路。
摘要:Large language models (LLMs) have revolutionized artificial intelligence, yet their massive memory and computational demands necessitate aggressive quantization, increasingly pushing representations toward the theoretical limit of a single bit. While complex-valued LLMs, such as iFairy, offer a superior chance for low-bit representation compared to real-valued counterparts, they require training from scratch, preventing the utilization of the vast ecosystem of pre-trained real-valued foundation models. Here we present Fairy2i, a universal framework that transforms pre-trained real-valued layers into an equivalent widely-linear complex form, enabling extremely low-bit quantization while reusing existing checkpoints. By proving a lossless mathematical equivalence between real and widely-linear maps, we convert standard Transformers into the complex domain and employ a phase-aware quantization scheme with a highly efficient codebook of fourth roots of unity. Furthermore, we introduce a recursive residual quantization mechanism that iteratively minimizes quantization error, allowing inference to proceed via efficient multiplication-free accumulation. We demonstrate that Fairy2i restores the performance of LLaMA-2 7B at an effective 2-bit precision to levels nearly comparable with full-precision baselines, significantly outperforming state-of-the-art real-valued binary and ternary quantization methods. This work bridges the gap between the representational efficiency of complex-valued arithmetic and the practical utility of pre-trained models, paving a new way for efficient inference on commodity hardware.
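摘要中"四次单位根码本的相位感知量化"与"递归残差量化"的思路可用下面的numpy草图说明(仅为示意:以残差平均幅值作尺度因子是此处的假设,并非论文的具体实现):

```python
import numpy as np

# Codebook: the fourth roots of unity {1, i, -1, -i}
CODEBOOK = np.array([1, 1j, -1, -1j])

def phase_quantize(w):
    """Map each complex weight to the nearest fourth root of unity."""
    d = np.abs(w[..., None] - CODEBOOK)  # distance to each codeword, per weight
    return CODEBOOK[np.argmin(d, axis=-1)]

def residual_quantize(w, stages=3):
    """Recursive residual quantization: each stage quantizes what the
    previous stages missed, scaled by the residual's mean magnitude
    (a simplifying assumption for this sketch)."""
    approx = np.zeros_like(w, dtype=complex)
    for _ in range(stages):
        r = w - approx
        scale = np.mean(np.abs(r))
        if scale == 0:
            break
        approx = approx + scale * phase_quantize(r / scale)
    return approx

w = np.array([0.8 + 0.1j, -0.3 + 0.9j, 0.05 - 0.7j])
err1 = np.linalg.norm(w - residual_quantize(w, stages=1))
err3 = np.linalg.norm(w - residual_quantize(w, stages=3))
```

多级残差量化使 `err3` 小于单级的 `err1`,对应摘要中"迭代地最小化量化误差"的机制。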
【2】Representation of Inorganic Synthesis Reactions and Prediction: Graphical Framework and Datasets
标题:无机合成反应的表示和预测:图形框架和数据集
链接:https://arxiv.org/abs/2512.02947
作者:Samuel Andrello,Daniel Alabi,Simon J. L. Billinge
备注:For associated code and datasets, see https://github.com/8bitsam/actiongraph-testbench
摘要:虽然机器学习已经能够快速预测具有新特性的无机材料,但确定如何合成这些材料的挑战在很大程度上仍未解决。以前的工作主要集中在预测前体或反应条件,很少涉及完整的合成途径。我们介绍了ActionGraph,一个有向无环图框架,以合成操作的形式编码无机合成反应的化学结构和程序结构。使用13,017个来自Materials Project的文本挖掘固态合成反应,我们表明,将PCA降维后的ActionGraph邻接矩阵纳入$k$-最近邻检索模型可显著改进合成途径预测。虽然ActionGraph框架仅使前体和操作的F1分数(对不同数量的PCA分量取平均)分别提高1.34%和2.76%,但操作序列长度匹配准确度提高了3.4倍(从15.8%提高到53.3%)。我们观察到一个有趣的权衡:前体预测性能在10-11个PCA分量处达到峰值,而操作预测则持续提升至30个分量。这表明组成信息主导前体选择,而结构信息对操作排序至关重要。总的来说,ActionGraph框架展示了强大的潜力,随着进一步采用,其全部好处都可以有效实现。
摘要:While machine learning has enabled the rapid prediction of inorganic materials with novel properties, the challenge of determining how to synthesize these materials remains largely unsolved. Previous work has largely focused on predicting precursors or reaction conditions, but only rarely on full synthesis pathways. We introduce the ActionGraph, a directed acyclic graph framework that encodes both the chemical and procedural structure, in terms of synthesis operations, of inorganic synthesis reactions. Using 13,017 text-mined solid-state synthesis reactions from the Materials Project, we show that incorporating PCA-reduced ActionGraph adjacency matrices into a $k$-nearest neighbors retrieval model significantly improves synthesis pathway prediction. While the ActionGraph framework only results in a 1.34% and 2.76% increase in precursor and operation F1 scores (average over varying numbers of PCA components) respectively, the operation length matching accuracy rises 3.4 times (from 15.8% to 53.3%). We observe an interesting trade-off where precursor prediction performance peaks at 10-11 PCA components while operation prediction continues improving up to 30 components. This suggests composition information dominates precursor selection while structural information is critical for operation sequencing. Overall, the ActionGraph framework demonstrates strong potential, and with further adoption, its full range of benefits can be effectively realized.
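"PCA降维后的邻接矩阵 + $k$-最近邻检索"这一流程可以用纯numpy草图示意(反应库规模、图尺寸与分量数均为玩具设定,与论文数据无关):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: flattened adjacency matrices of 50 known reactions (8x8 graphs)
library = rng.random((50, 64))

def pca_fit(X, k):
    """Return the mean and top-k principal directions of X (via SVD)."""
    mu = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, vt[:k]

def pca_transform(X, mu, comps):
    return (X - mu) @ comps.T

def knn_retrieve(query, X_red, n_neighbors=3):
    """Indices of the reactions whose reduced graphs are closest to the query."""
    d = np.linalg.norm(X_red - query, axis=1)
    return np.argsort(d)[:n_neighbors]

mu, comps = pca_fit(library, k=10)
lib_red = pca_transform(library, mu, comps)

# Querying with a graph already in the library should retrieve itself first
query = pca_transform(library[7:8], mu, comps)[0]
neighbors = knn_retrieve(query, lib_red)
```

真实流程中检索到的近邻反应提供前体与操作序列的候选;此处只演示降维与检索两步。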
3D|3D重建等相关(2篇)
【1】DF-Mamba: Deformable State Space Modeling for 3D Hand Pose Estimation in Interactions
标题:DF-Mamba:交互中3D手位姿估计的可变形状态空间建模
链接:https://arxiv.org/abs/2512.02727
作者:Yifan Zhou,Takehiko Ohkawa,Guwenxiao Zhou,Kanoko Goto,Takumi Hirose,Yusuke Sekikawa,Nakamasa Inoue
备注:Accepted to WACV 2026. Project page: https://tkhkaeio.github.io/projects/25-dfmamba/index.html
摘要:日常手部交互建模通常会遇到严重的遮挡,例如双手重叠时,这凸显了在3D手部姿态估计(HPE)中进行鲁棒特征学习的需求。为了处理这种被遮挡的手部图像,必须有效地学习局部图像特征(例如被遮挡的关节)与全局上下文(例如来自关节间、双手间或场景的线索)之间的关系。然而,当前大多数3D HPE方法仍然依赖ResNet进行特征提取,而这种CNN的归纳偏置由于对全局上下文的建模能力有限,对3D HPE而言可能并非最优。为了解决这一限制,我们提出了一个利用最新状态空间建模(即Mamba)进行3D HPE视觉特征提取的有效且高效的框架,称为Deformable Mamba(DF-Mamba)。DF-Mamba旨在通过Mamba的选择性状态建模和所提出的可变形状态扫描来捕获标准卷积之外的全局上下文线索。具体来说,对于卷积后的局部特征,我们的可变形扫描在图像内聚合这些特征,同时选择性地保留代表全局上下文的有用线索。这种方法显著提高了结构化3D HPE的准确性,推理速度与ResNet-50相当。我们的实验在五个不同的数据集上进行了广泛评估,涵盖单手和双手场景、仅手部和手-物体交互,以及基于RGB和基于深度的估计。DF-Mamba在所有数据集上的性能都优于最新的图像主干网络(包括VMamba和Spatial-Mamba),并实现了最先进的性能。
摘要:Modeling daily hand interactions often struggles with severe occlusions, such as when two hands overlap, which highlights the need for robust feature learning in 3D hand pose estimation (HPE). To handle such occluded hand images, it is vital to effectively learn the relationship between local image features (e.g., for occluded joints) and global context (e.g., cues from inter-joints, inter-hands, or the scene). However, most current 3D HPE methods still rely on ResNet for feature extraction, and such CNN's inductive bias may not be optimal for 3D HPE due to its limited capability to model the global context. To address this limitation, we propose an effective and efficient framework for visual feature extraction in 3D HPE using recent state space modeling (i.e., Mamba), dubbed Deformable Mamba (DF-Mamba). DF-Mamba is designed to capture global context cues beyond standard convolution through Mamba's selective state modeling and the proposed deformable state scanning. Specifically, for local features after convolution, our deformable scanning aggregates these features within an image while selectively preserving useful cues that represent the global context. This approach significantly improves the accuracy of structured 3D HPE, with comparable inference speed to ResNet-50. Our experiments involve extensive evaluations on five divergent datasets including single-hand and two-hand scenarios, hand-only and hand-object interactions, as well as RGB and depth-based estimation. DF-Mamba outperforms the latest image backbones, including VMamba and Spatial-Mamba, on all datasets and achieves state-of-the-art performance.
【2】Training Dynamics of Learning 3D-Rotational Equivariance
标题:学习3D旋转等方差的训练动力学
链接:https://arxiv.org/abs/2512.02303
作者:Max W. Shen,Ewa Nowara,Michael Maser,Kyunghyun Cho
备注:Accepted to Transactions on Machine Learning Research (TMLR)
摘要:虽然数据增强被广泛用于训练对称无关(symmetry-agnostic)模型,但它们学习遵循对称性的速度和效果仍不清楚。我们通过推导一个有原则的等变误差度量来研究这一点:对于凸损失,它计算总损失中可归因于所学对称性缺陷的百分比。我们的实证研究聚焦于高维分子任务(流匹配、力场预测、体素去噪)上的3D旋转等变性,发现模型在1k-10k训练步内就能将等变误差快速降低到保留集损失的$\leq$2%,该结果对模型和数据集规模均稳健。这是因为学习3D旋转等变性是一个比主预测任务更容易的学习任务,其损失地形更平滑、条件更好。对于3D旋转,非等变模型在整个训练过程中的损失惩罚很小,因此除非等变模型的"效率差距"被缩小,否则非等变模型每GPU小时可能达到比等变模型更低的测试损失。我们还从实验和理论上研究了相对等变误差、学习梯度和模型参数之间的关系。
摘要:While data augmentation is widely used to train symmetry-agnostic models, it remains unclear how quickly and effectively they learn to respect symmetries. We investigate this by deriving a principled measure of equivariance error that, for convex losses, calculates the percent of total loss attributable to imperfections in learned symmetry. We focus our empirical investigation to 3D-rotation equivariance on high-dimensional molecular tasks (flow matching, force field prediction, denoising voxels) and find that models reduce equivariance error quickly to $\leq$2\% held-out loss within 1k-10k training steps, a result robust to model and dataset size. This happens because learning 3D-rotational equivariance is an easier learning task, with a smoother and better-conditioned loss landscape, than the main prediction task. For 3D rotations, the loss penalty for non-equivariant models is small throughout training, so they may achieve lower test loss than equivariant models per GPU-hour unless the equivariant ``efficiency gap'' is narrowed. We also experimentally and theoretically investigate the relationships between relative equivariance error, learning gradients, and model parameters.
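"等变误差"的精神——衡量模型输出在多大程度上未能与输入旋转可交换——可以在2D上用一个简化的相对度量示意(这只是说明性代理,并非论文针对凸损失推导的那个分解):

```python
import numpy as np

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def equivariance_error(f, X, n_rotations=64, seed=0):
    """Mean squared mismatch between f(Rx) and R f(x) over random 2D rotations,
    normalized by the mean squared output -- a simple relative measure."""
    rng = np.random.default_rng(seed)
    num, den = 0.0, 0.0
    for theta in rng.uniform(0, 2 * np.pi, n_rotations):
        R = rotation(theta)
        out_rot_in = f(X @ R.T)   # rotate input, then apply the model
        rot_out = f(X) @ R.T      # apply the model, then rotate the output
        num += np.mean((out_rot_in - rot_out) ** 2)
        den += np.mean(rot_out ** 2)
    return num / den

X = np.random.default_rng(1).normal(size=(100, 2))

equivariant = lambda x: 2.0 * x                        # commutes with rotation
non_equivariant = lambda x: x * np.array([2.0, 0.5])   # axis-dependent scaling

err_eq = equivariance_error(equivariant, X)
err_ne = equivariance_error(non_equivariant, X)
```

各向同性缩放的 `err_eq` 为零,而按坐标轴缩放的模型有明显误差,对应文中"学到的对称性不完美"在损失中的占比。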
编码器(1篇)
【1】Quantum feature encoding optimization
标题:量子特征编码优化
链接:https://arxiv.org/abs/2512.02422
作者:Tommaso Fioravanti,Brian Quanz,Gabriele Agliardi,Edgar Andres Ruiz Guzman,Ginés Carrascal,Jae-Eun Park
摘要:量子机器学习(QML)有望在复杂性和准确性方面增强机器学习建模。该领域的一个关键挑战是输入数据的编码,它在决定QML模型的性能方面起着关键作用。在这项工作中,我们解决了编码中一个基本上未被研究、且为QML建模所独有的方面——我们不是调整用于编码的拟设(ansatz),而是考虑调整数据如何传递给拟设。我们具体实现了将经典数据操作(即特征的排序、选择和加权)作为预处理步骤的QML管道,并评估编码的这些方面是否会对QML模型性能产生显著影响,以及它们是否可以被有效优化以提升性能。我们在各种各样的数据集、拟设和电路规模上,以一种有代表性的QML方法进行的实验结果表明,通过优化特征在拟设中的编码方式,我们可以显著且一致地提高QML模型的性能,这为在未来QML应用中集成这些技术提供了有力的理由。最后,我们在真实量子硬件上以100量子比特电路运行该方法,并同样成功取得了改进的QML建模性能,证明了这种方法的实际可行性。
摘要:Quantum Machine Learning (QML) holds the promise of enhancing machine learning modeling in terms of both complexity and accuracy. A key challenge in this domain is the encoding of input data, which plays a pivotal role in determining the performance of QML models. In this work, we tackle a largely unaddressed aspect of encoding that is unique to QML modeling -- rather than adjusting the ansatz used for encoding, we consider adjusting how data is conveyed to the ansatz. We specifically implement QML pipelines that leverage classical data manipulation (i.e., ordering, selecting, and weighting features) as a preprocessing step, and evaluate if these aspects of encoding can have a significant impact on QML model performance, and if they can be effectively optimized to improve performance. Our experimental results, applied across a wide variety of data sets, ansatz, and circuit sizes, with a representative QML approach, demonstrate that by optimizing how features are encoded in an ansatz we can substantially and consistently improve the performance of QML models, making a compelling case for integrating these techniques in future QML applications. Finally we demonstrate the practical feasibility of this approach by running it using real quantum hardware with 100 qubit circuits and successfully achieving improved QML modeling performance in this case as well.
优化|敛散性(4篇)
【1】OptPO: Optimal Rollout Allocation for Test-time Policy Optimization
标题:OptPO:测试时策略优化的最优Rollout分配
链接:https://arxiv.org/abs/2512.02882
作者:Youkang Wang,Jian Wang,Rubing Chen,Tianyi Zeng,Xiao-Yong Wei,Qing Li
备注:Work in Progress
摘要:测试时策略优化使大型语言模型(LLM)能够利用自生成rollout的反馈来适应分布变化。然而,现有方法依赖固定预算的多数投票来估计奖励,从而产生大量计算冗余。我们提出了测试时策略优化的最优Rollout分配(OptPO),一个自适应分配推理预算的有原则框架。通过将投票过程表述为贝叶斯序贯概率比检验,一旦共识答案的后验置信度超过指定阈值,OptPO就会动态停止采样。至关重要的是,它利用保留的rollout进行同策略(on-policy)更新,可与PPO或GRPO等算法无缝集成,而不需要真实标签。在各种推理基准测试中,与固定样本基线相比,OptPO显著降低了rollout开销,同时保持或提高了准确性。通过将统计最优停止与测试时学习相统一,OptPO为测试时自适应提供了一个计算高效的范例。源代码将在论文被接收后于https://open-upon-acceptance开放。
摘要:Test-time policy optimization enables large language models (LLMs) to adapt to distribution shifts by leveraging feedback from self-generated rollouts. However, existing methods rely on fixed-budget majority voting to estimate rewards, incurring substantial computational redundancy. We propose Optimal Rollout Allocation for Test-time Policy Optimization (OptPO), a principled framework that adaptively allocates inference budgets. By formulating the voting process as a Bayesian sequential probability ratio test, OptPO dynamically halts sampling once the posterior confidence in a consensus answer exceeds a specified threshold. Crucially, it utilizes the retained rollouts for on-policy updates, seamlessly integrating with algorithms like PPO or GRPO without requiring ground-truth labels. Across diverse reasoning benchmarks, OptPO significantly reduces rollout overhead compared to fixed-sample baselines while preserving or improving accuracy. By unifying statistically optimal stopping with test-time learning, OptPO offers a computationally efficient paradigm for test-time adaptation. The source code will be open upon acceptance at https://open-upon-acceptance.
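OptPO的自适应停止思想——把多数投票视为序贯检验,在共识答案的后验置信度越过阈值时停止采样——可用如下简化草图说明(以Beta(1,1)先验对"领先答案占比超过1/2"建模、用蒙特卡罗近似后验概率,均为此处的简化假设,并非论文的具体检验形式):

```python
import numpy as np

def adaptive_vote(sample_answer, threshold=0.95, max_rollouts=64, seed=0):
    """Draw rollout answers one at a time; stop once the posterior probability
    that the leading answer is the true majority (rate > 1/2 under a
    Beta(1,1) prior) exceeds `threshold`.

    `sample_answer` stands in for one model rollout; a real system would
    generate a completion and extract its final answer.
    """
    rng = np.random.default_rng(seed)
    counts = {}
    n = 0
    for n in range(1, max_rollouts + 1):
        a = sample_answer()
        counts[a] = counts.get(a, 0) + 1
        k = max(counts.values())
        # Beta(1 + k, 1 + n - k) posterior; P(p > 0.5) by Monte Carlo
        p_major = np.mean(rng.beta(1 + k, 1 + n - k, size=20000) > 0.5)
        if p_major >= threshold:
            break
    leader = max(counts, key=counts.get)
    return leader, n

# A deterministic stream of rollout answers (one dissent among "42"s)
stream = iter(["42", "42", "7"] + ["42"] * 61)
answer, used = adaptive_vote(lambda: next(stream))
```

共识很强时采样远早于预算上限停止;有异议答案时停止被推迟,体现了按难度自适应分配rollout的效果。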
【2】SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization
标题:SeeNav-Agent:通过视觉提示和步骤级策略优化增强视觉语言导航
链接:https://arxiv.org/abs/2512.02631
作者:Zhengcheng Wang,Zichuan Lin,Yijun Yang,Haobo Fu,Deheng Ye
备注:12 pages,6 figures
摘要:现有的基于大型视觉语言模型(LVLM)的视觉语言导航(VLN)代理经常遭受感知错误、推理错误和规划错误,这显著阻碍了它们的导航性能。为了解决这些局限性,本工作提出了一个新的VLN代理框架,命名为SeeNav-Agent。首先,为了减少VLN代理视觉模块的感知幻觉,在输入空间中引入双视图视觉提示(VP)技术,这也可以提高代理对当前空间状态的理解。随后,设计了一种新的步骤级强化微调(RFT)方法——步骤奖励组策略优化(SRGPO)——用于VLN代理的后训练。在SRGPO中,我们首先为导航任务定义可验证的过程奖励,然后通过对不同导航步骤随机分组来执行高效的步骤级优势估计。SRGPO为VLN代理的强化学习过程提供密集的奖励信号,并增强其规划能力。在EmbodiedBench导航基准上的实验结果表明,通过引入zero-shot VP模块,GPT-4.1实现了86.7%的导航成功率,超过目前最好的LVLM约20个百分点(pp)。通过基于SRGPO的后训练,Qwen2.5-VL-3B模型的导航成功率达到72.3%,比现有最佳LVLM模型高出5.6个百分点。此外,与GRPO和GiGPO等RFT算法相比,所提出的SRGPO在训练稳定性、收敛效率和泛化能力等方面都有显著提高。
摘要:Existing Vision-Language Navigation (VLN) agents based on Large Vision-Language Models (LVLMs) often suffer from perception errors, reasoning errors, and planning errors, which significantly hinder their navigation performance. To address these limitations, a novel VLN agent framework, named SeeNav-Agent, is proposed in this work. First, to reduce perception hallucinations of the visual module of the VLN agent, a dual-view Visual Prompt (VP) technique is introduced in the input space, which can also improve the agent's understanding of current spatial states. Subsequently, a novel step-level Reinforcement Fine-Tuning (RFT) method, Step Reward Group Policy Optimization (SRGPO), is designed for the post-training of VLN agents. In SRGPO, we first define verifiable process rewards for the navigation task, and then perform efficient step-level advantage estimation by randomly grouping different navigation steps. SRGPO provides dense reward signals for the reinforcement learning process of the VLN agent and enhances its planning capability. Experimental results on the EmbodiedBench Navigation benchmark indicate that by introducing the zero-shot VP module, the GPT-4.1 achieves a navigation success rate of 86.7%, surpassing the current best LVLM by approximately 20 percentage points (pp). Through post-training based on SRGPO, the Qwen2.5-VL-3B model reaches a navigation success rate of 72.3%, outperforming the best existing LVLM model by 5.6 pp. Moreover, compared to RFT algorithms such as GRPO and GiGPO, the proposed SRGPO demonstrates significant improvements in training stability, convergence efficiency, and generalization capability.
【3】A Fully First-Order Layer for Differentiable Optimization
标题:用于可微优化的完全一阶层
链接:https://arxiv.org/abs/2512.02494
作者:Zihao Zhao,Kai-Chia Mo,Shing-Hei Ho,Brandon Amos,Kai Wang
摘要:可微优化层使学习系统能够通过求解嵌入的优化问题来做出决策。然而,通过隐式微分计算梯度需要求解带有Hessian项的线性系统,这在计算和存储上都很昂贵。为了解决这一挑战,我们提出了一种仅使用一阶信息计算梯度的新算法。其关键洞察是将可微优化改写为双层优化问题,并利用双层方法的最新进展。具体来说,我们引入了一个有效集拉格朗日超梯度预言机,它避免了Hessian求值,并提供有限时间、非渐近的近似保证。我们证明了仅使用一阶信息即可在$\tilde{\mathcal{O}}(1)$时间内计算近似超梯度,从而使约束双层优化的总体复杂度为$\tilde{\mathcal{O}}(\delta^{-1}\varepsilon^{-3})$,与非光滑非凸优化的最佳已知速率相匹配。此外,我们还发布了一个可以很容易地从现有求解器改造而来的开源Python库。我们的代码可从以下网址获得:https://github.com/guaguakai/FFOLayer。
摘要:Differentiable optimization layers enable learning systems to make decisions by solving embedded optimization problems. However, computing gradients via implicit differentiation requires solving a linear system with Hessian terms, which is both compute- and memory-intensive. To address this challenge, we propose a novel algorithm that computes the gradient using only first-order information. The key insight is to rewrite the differentiable optimization as a bilevel optimization problem and leverage recent advances in bilevel methods. Specifically, we introduce an active-set Lagrangian hypergradient oracle that avoids Hessian evaluations and provides finite-time, non-asymptotic approximation guarantees. We show that an approximate hypergradient can be computed using only first-order information in $\tilde{\mathcal{O}}(1)$ time, leading to an overall complexity of $\tilde{\mathcal{O}}(\delta^{-1}\varepsilon^{-3})$ for constrained bilevel optimization, which matches the best known rate for non-smooth non-convex optimization. Furthermore, we release an open-source Python library that can be easily adapted from existing solvers. Our code is available here: https://github.com/guaguakai/FFOLayer.
【4】Safeguarded Stochastic Polyak Step Sizes for Non-smooth Optimization: Robust Performance Without Small (Sub)Gradients
标题:用于非光滑优化的带保障随机Polyak步长:无需小(次)梯度的稳健性能
链接:https://arxiv.org/abs/2512.02342
作者:Dimitris Oikonomou,Nicolas Loizou
备注:28 pages, 15 figures
摘要:随机Polyak步长(SPS)已被证明是随机梯度下降(SGD)的一种有前途的步长选择,在光滑凸和非凸优化问题(包括深度神经网络训练)上相对于最先进的方法具有竞争力的性能。然而,该方法向非光滑设置的扩展仍处于早期阶段,往往依赖插值假设或需要知道最优解。在这项工作中,我们为随机次梯度方法提出了一种新的SPS变体——带保障的SPS(SPS$_{safe}$)——并在不需要强假设的情况下为非光滑凸优化提供严格的收敛保证。我们进一步将动量纳入更新规则,得到同样紧的理论结果。在凸基准和深度神经网络上的综合实验印证了我们的理论:所提出的步长加速了收敛、降低了方差,并始终优于现有的自适应基线。最后,在深度神经网络训练中,我们的方法通过缓解梯度消失问题展示了稳健的性能。
摘要:The stochastic Polyak step size (SPS) has proven to be a promising choice for stochastic gradient descent (SGD), delivering competitive performance relative to state-of-the-art methods on smooth convex and non-convex optimization problems, including deep neural network training. However, extensions of this approach to non-smooth settings remain in their early stages, often relying on interpolation assumptions or requiring knowledge of the optimal solution. In this work, we propose a novel SPS variant, Safeguarded SPS (SPS$_{safe}$), for the stochastic subgradient method, and provide rigorous convergence guarantees for non-smooth convex optimization with no need for strong assumptions. We further incorporate momentum into the update rule, yielding equally tight theoretical results. Comprehensive experiments on convex benchmarks and deep neural networks corroborate our theory: the proposed step size accelerates convergence, reduces variance, and consistently outperforms existing adaptive baselines. Finally, in the context of deep neural network training, our method demonstrates robust performance by addressing the vanishing gradient problem.
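带保障的Polyak步长的核心形式 gamma_t = min(gamma_max, (f_i(x_t) - f_i^*) / ||g_t||^2) 可以在一个非光滑玩具问题上示意(以 gamma_max 截断作为"保障"是对论文机制的简化假设;对非负损失取 f_i^* = 0):

```python
import numpy as np

def sps_safe_step(loss, grad_sq, gamma_max=1.0, lower_bound=0.0, eps=1e-12):
    """Safeguarded stochastic Polyak step size: the Polyak ratio capped at
    gamma_max, so tiny (sub)gradients cannot blow the step up.
    `lower_bound` plays the role of f_i^* (0 for non-negative losses)."""
    return min(gamma_max, (loss - lower_bound) / (grad_sq + eps))

def subgrad_method(x0, steps=200):
    """Minimize the non-smooth f(x) = |x_1| + |x_2| with subgradients."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = np.sign(x)                      # a subgradient of the L1 norm
        f = np.abs(x).sum()
        x = x - sps_safe_step(f, g @ g) * g
    return x

x_final = subgrad_method([3.0, -2.0])
```

当(次)梯度很小而损失不小时,未截断的Polyak比值会爆炸;保障项将步长限制在 `gamma_max` 之内,例如 `sps_safe_step(5.0, 1e-20)` 返回 1.0。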
预测|估计(7篇)
【1】StockMem: An Event-Reflection Memory Framework for Stock Forecasting
标题:StockMem:用于股票预测的事件-反思记忆框架
链接:https://arxiv.org/abs/2512.02720
作者:He Wang,Wenyilin Xiao,Songqiao Han,Hailiang Huang
摘要:由于市场波动性及其对实时事件的敏感性,股票价格预测具有挑战性。虽然大型语言模型(LLM)为基于文本的预测提供了新途径,但其在金融领域的应用受到新闻数据噪声大以及文本中缺乏明确答案的阻碍。通用记忆架构难以识别价格变动的关键驱动因素。为了解决这个问题,我们提出了StockMem,一个事件-反思双层记忆框架。它将新闻结构化为事件,并沿两个维度进行挖掘:横向整合归并每日事件,而纵向追踪捕获事件演变,以提取反映市场预期差异的增量信息。由此构建了一个时序事件知识库。通过分析事件-价格动态,该框架进一步形成因果经验的反思知识库。在预测时,它检索相似的历史场景,并结合当前事件、增量数据和过去经验进行推理。实验表明,StockMem优于现有记忆架构,并通过追踪影响价格的信息链提供更优、可解释的推理,增强了金融预测中的决策透明度。
摘要:Stock price prediction is challenging due to market volatility and its sensitivity to real-time events. While large language models (LLMs) offer new avenues for text-based forecasting, their application in finance is hindered by noisy news data and the lack of explicit answers in text. General-purpose memory architectures struggle to identify the key drivers of price movements. To address this, we propose StockMem, an event-reflection dual-layer memory framework. It structures news into events and mines them along two dimensions: horizontal consolidation integrates daily events, while longitudinal tracking captures event evolution to extract incremental information reflecting market expectation discrepancies. This builds a temporal event knowledge base. By analyzing event-price dynamics, the framework further forms a reflection knowledge base of causal experiences. For prediction, it retrieves analogous historical scenarios and reasons with current events, incremental data, and past experiences. Experiments show StockMem outperforms existing memory architectures and provides superior, explainable reasoning by tracing the information chain affecting prices, enhancing decision transparency in financial forecasting.
【2】Hybrid (Penalized Regression and MLP) Models for Outcome Prediction in HDLSS Health Data
标题:HDLSS健康数据结果预测的混合(惩罚回归和MLP)模型
链接:https://arxiv.org/abs/2512.02489
作者:Mithra D K
摘要:我展示了将成熟的机器学习技术应用于NHANES健康调查数据以预测糖尿病状态的一个实例。我将基线模型(逻辑回归、随机森林、XGBoost)与一种混合方法进行比较,后者使用XGBoost特征编码器和一个轻量级多层感知器(MLP)头。实验表明,在处理后的NHANES子集上,该混合模型相比基线获得了更高的AUC和平衡准确率。我发布了代码和可复现的脚本以鼓励复现。
摘要:I present an application of established machine learning techniques to NHANES health survey data for predicting diabetes status. I compare baseline models (logistic regression, random forest, XGBoost) with a hybrid approach that uses an XGBoost feature encoder and a lightweight multilayer perceptron (MLP) head. Experiments show the hybrid model attains improved AUC and balanced accuracy compared to baselines on the processed NHANES subset. I release code and reproducible scripts to encourage replication.
【3】TabGRU: An Enhanced Design for Urban Rainfall Intensity Estimation Using Commercial Microwave Links
标题:TabGRU:使用商业微波链路估计城市降雨强度的增强设计
链接:https://arxiv.org/abs/2512.02465
作者:Xingwang Li,Mengyun Chen,Jiamou Liu,Sijie Wang,Shuanggen Jin,Jafet C. M. Andersson,Jonas Olsson,Remco van de Beek,Hai Victor Habi,Congzheng Han
摘要:面对全球城市化加速和极端天气事件日益频繁的情况,高分辨率的城市降雨监测对于建设具有复原力的智慧城市至关重要。商业微波链路(CMLs)是一种新兴的数据源,在这方面具有巨大的潜力。传统的基于CMLs的降雨反演依赖于基于物理的模型,这些模型通常会遇到信号噪声和非线性衰减等现实世界的复杂性。为了解决这些限制,本文提出了一种基于Transformer和双向门控递归单元(BiGRU)的新型混合深度学习架构,我们将其命名为TabGRU。该设计协同捕获CML信号数据中的长期依赖性和局部顺序特征。该模型通过可学习的位置嵌入和注意力池机制进一步增强,以提高其动态特征提取和泛化能力。该模型在瑞典哥德堡的公共基准数据集上进行了验证(2015年6月至9月)。该评估使用了两个雨量计(Torp和Barl)在测试期间(8月22日至31日)的12个子链接,涵盖了大约10个不同的降雨事件。提出的TabGRU模型表现出一致的优势,优于深度学习基线,并在Torp站点(0.91)和Barl站点(0.96)实现了高决定系数(R2)。此外,与基于物理学的方法相比,TabGRU保持了更高的准确性,并且在缓解PL模型在峰值降雨事件期间观察到的显著高估问题方面特别有效。本次评估证实,TabGRU模型能够有效克服传统方法的局限性,为测试条件下基于CML的城市降雨监测提供了稳健、准确的解决方案。
摘要:In the face of accelerating global urbanization and the increasing frequency of extreme weather events, high-resolution urban rainfall monitoring is crucial for building resilient smart cities. Commercial Microwave Links (CMLs) are an emerging data source with great potential for this task. While traditional rainfall retrieval from CMLs relies on physics-based models, these often struggle with real-world complexities like signal noise and nonlinear attenuation. To address these limitations, this paper proposes a novel hybrid deep learning architecture based on the Transformer and a Bidirectional Gated Recurrent Unit (BiGRU), which we name TabGRU. This design synergistically captures both long-term dependencies and local sequential features in the CML signal data. The model is further enhanced by a learnable positional embedding and an attention pooling mechanism to improve its dynamic feature extraction and generalization capabilities. The model was validated on a public benchmark dataset from Gothenburg, Sweden (June-September 2015). The evaluation used 12 sub-links from two rain gauges (Torp and Barl) over a test period (August 22-31) covering approximately 10 distinct rainfall events. The proposed TabGRU model demonstrated consistent advantages, outperforming deep learning baselines and achieving high coefficients of determination (R2) at both the Torp site (0.91) and the Barl site (0.96). Furthermore, compared to the physics-based approach, TabGRU maintained higher accuracy and was particularly effective in mitigating the significant overestimation problem observed in the PL model during peak rainfall events. This evaluation confirms that the TabGRU model can effectively overcome the limitations of traditional methods, providing a robust and accurate solution for CML-based urban rainfall monitoring under the tested conditions.
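TabGRU中BiGRU分支的基本构件——门控循环单元——其前向计算可用纯numpy最小化示意(权重随机初始化、维度为玩具设定;真实模型还包含Transformer分支、位置嵌入与注意力池化):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class GRUCell:
    """Minimal gated recurrent unit (forward pass only).
    Weight shapes: input dim d_in, hidden dim d_h."""
    def __init__(self, d_in, d_h, seed=0):
        rng = np.random.default_rng(seed)
        self.Wz = rng.normal(scale=0.1, size=(d_h, d_in + d_h))
        self.Wr = rng.normal(scale=0.1, size=(d_h, d_in + d_h))
        self.Wh = rng.normal(scale=0.1, size=(d_h, d_in + d_h))

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(self.Wz @ xh)            # update gate
        r = sigmoid(self.Wr @ xh)            # reset gate
        h_tilde = np.tanh(self.Wh @ np.concatenate([x, r * h]))
        return (1 - z) * h + z * h_tilde

def bigru(cell_f, cell_b, xs, d_h):
    """Run the sequence forward and backward, concatenating final states."""
    hf = np.zeros(d_h)
    for x in xs:
        hf = cell_f.step(x, hf)
    hb = np.zeros(d_h)
    for x in reversed(xs):
        hb = cell_b.step(x, hb)
    return np.concatenate([hf, hb])

# Toy CML signal sequence: 3 time steps, 2 features each
xs = [np.array([0.5, -1.0]), np.array([1.0, 0.2]), np.array([-0.3, 0.7])]
feat = bigru(GRUCell(2, 4, seed=1), GRUCell(2, 4, seed=2), xs, d_h=4)
```

双向扫描得到的 `feat`(此处为8维)即可馈入下游回归头估计降雨强度。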
【4】Forecasting MBTA Transit Dynamics: A Performance Benchmarking of Statistical and Machine Learning Models
标题:预测MBTA交通动态:统计和机器学习模型的性能基准
链接:https://arxiv.org/abs/2512.02336
作者:Sai Siddharth Nalamalpu,Kaining Yuan,Aiden Zhou,Eugene Pinsky
备注:14 pages 9 figures
摘要:马萨诸塞湾交通管理局(MBTA)是波士顿的主要公共交通提供商,运营多种交通工具,包括火车,地铁和公共汽车。然而,该系统经常面临延误和乘客量的波动,这对效率和乘客满意度产生了负面影响。为了进一步了解这种现象,本文比较了现有的和独特的方法的性能,以确定最佳的方法,在预测门控站入口的地铁系统(地铁使用的代理)和整个MBTA系统的延误数量。为了做到这一点,这项研究考虑了倾向于影响公共交通的因素,如星期几,季节,压力,风速,平均温度和降水。本文评估了10种统计和机器学习模型在预测第二天地铁使用量方面的性能。在预测延迟计数,模型的数量扩展到每天11个,通过引入自激点过程模型,代表MBTA延迟建模的点过程框架的独特应用。这项研究涉及实验选择性包含的功能,以确定功能的重要性,通过均方根误差(RMSE)测试模型的准确性。值得注意的是,与天气数据相比,提供一周中的某一天或某个季节的数据对预测准确性有更大的好处;事实上,提供天气数据通常会影响性能,这表明模型有过拟合的趋势。
摘要:The Massachusetts Bay Transportation Authority (MBTA) is the main public transit provider in Boston, operating multiple means of transport, including trains, subways, and buses. However, the system often faces delays and fluctuations in ridership volume, which negatively affect efficiency and passenger satisfaction. To further understand this phenomenon, this paper compares the performance of existing and unique methods to determine the best approach in predicting gated station entries in the subway system (a proxy for subway usage) and the number of delays in the overall MBTA system. To do so, this research considers factors that tend to affect public transportation, such as day of week, season, pressure, wind speed, average temperature, and precipitation. This paper evaluates the performance of 10 statistical and machine learning models on predicting next-day subway usage. On predicting delay count, the number of models is extended to 11 per day by introducing a self-exciting point process model, representing a unique application of a point-process framework for MBTA delay modeling. This research involves experimenting with the selective inclusion of features to determine feature importance, testing model accuracy via Root Mean Squared Error (RMSE). Remarkably, it is found that providing either day of week or season data has a more substantial benefit to predictive accuracy compared to weather data; in fact, providing weather data generally worsens performance, suggesting a tendency of models to overfit.
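"提供星期几信息比天气信息更有利于预测精度"这类特征消融,可以用一个合成数据的最小实验示意(数据生成方式与系数均为假设,仅演示RMSE对比的做法):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
day = rng.integers(0, 7, n)            # day of week, 0-6
weather = rng.normal(size=n)           # toy weather covariate (weak signal)
day_effect = np.array([5., 5., 5., 5., 6., 2., 1.])   # weekday/weekend pattern
ridership = day_effect[day] + 0.1 * weather + rng.normal(scale=0.5, size=n)

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def fit_predict(X, y):
    """Least-squares fit with an intercept; returns in-sample predictions."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ coef

pred_day = fit_predict(np.eye(7)[day], ridership)        # day-of-week only
pred_weather = fit_predict(weather[:, None], ridership)  # weather only

rmse_day = rmse(ridership, pred_day)
rmse_weather = rmse(ridership, pred_weather)
```

在这个合成设定下,仅用星期几特征的RMSE明显低于仅用天气特征,与文中对真实MBTA数据的观察方向一致。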
【5】Unlocking the Power of Boltzmann Machines by Parallelizable Sampler and Efficient Temperature Estimation
标题:通过可并行化采样器和高效的温度估计释放Boltzmann机器的力量
链接:https://arxiv.org/abs/2512.02323
作者:Kentaro Kubo,Hayato Goto
备注:16 pages, 14 figures
摘要:玻尔兹曼机(BM)是一种强大的基于能量的生成模型,但其高昂的训练成本使其实际应用在很大程度上局限于使用对比散度这一高效学习方法训练的受限玻尔兹曼机(RBM)。更精确的学习通常需要马尔可夫链蒙特卡罗(MCMC)玻尔兹曼采样,但由于更具表达力的模型难以并行化,这种采样非常耗时。为了解决这一限制,我们首先提出了一个新的玻尔兹曼采样器,其灵感来自一种称为模拟分岔(SB)的量子启发组合优化方法。这种受SB启发的方法,我们将其命名为Langevin SB(LSB),可以实现并行采样,同时保持与MCMC相当的精度。此外,它不仅适用于RBM,还适用于具有一般耦合的BM。然而,LSB无法控制输出玻尔兹曼分布的逆温度,这会阻碍学习并降低性能。为克服这一限制,我们还开发了一种在学习过程中估计逆温度的有效方法,称为条件期望匹配(CEM)。通过结合LSB和CEM,我们为比RBM表达能力更强的BM建立了一个高效的学习框架。我们将此框架称为采样器自适应学习(SAL)。SAL为RBM之外的基于能量的生成建模开辟了新的途径。
摘要:Boltzmann machines (BMs) are powerful energy-based generative models, but their heavy training cost has largely confined practical use to Restricted BMs (RBMs) trained with an efficient learning method called contrastive divergence. More accurate learning typically requires Markov chain Monte Carlo (MCMC) Boltzmann sampling, but it is time-consuming due to the difficulty of parallelization for more expressive models. To address this limitation, we first propose a new Boltzmann sampler inspired by a quantum-inspired combinatorial optimization called simulated bifurcation (SB). This SB-inspired approach, which we name Langevin SB (LSB), enables parallelized sampling while maintaining accuracy comparable to MCMC. Furthermore, this is applicable not only to RBMs but also to BMs with general couplings. However, LSB cannot control the inverse temperature of the output Boltzmann distribution, which hinders learning and degrades performance. To overcome this limitation, we also developed an efficient method for estimating the inverse temperature during the learning process, which we call conditional expectation matching (CEM). By combining LSB and CEM, we establish an efficient learning framework for BMs with greater expressive power than RBMs. We refer to this framework as sampler-adaptive learning (SAL).
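LSB的"Langevin式"采样思路可用最简单的未调整Langevin动力学(ULA)示意——此处以一维二次能量为目标以便核验平稳方差,并非论文中由SB导出的具体动力学:

```python
import numpy as np

def langevin_sample(grad_E, x0, beta=1.0, eta=0.01, n_steps=100_000, seed=0):
    """Unadjusted Langevin dynamics targeting p(x) ~ exp(-beta * E(x)).
    Returns the second half of the trajectory (first half = burn-in)."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(size=n_steps)
    x = float(x0)
    traj = np.empty(n_steps)
    for t in range(n_steps):
        x = x - eta * grad_E(x) + np.sqrt(2 * eta / beta) * noise[t]
        traj[t] = x
    return traj[n_steps // 2:]

# Quadratic energy E(x) = x^2 / 2, so the target is N(0, 1/beta)
samples = langevin_sample(grad_E=lambda x: x, x0=3.0, beta=2.0)
emp_var = samples.var()
```

对 beta=2 的二次能量,经验方差应接近 1/beta = 0.5;这也正对应摘要中的问题——简单Langevin采样的有效逆温度依赖于离散化,需要CEM一类的温度估计来校准。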
【6】DPWMixer: Dual-Path Wavelet Mixer for Long-Term Time Series Forecasting
标题:DPWMixer:用于长期时间序列预测的双路径小波混合器
链接:https://arxiv.org/abs/2512.02070
作者:Li Qianyang,Zhang Xingjun,Wang Shaoxun,Wei Jia
摘要:长期时间序列预测(LTSF)是计算智能领域的一项重要任务。虽然基于Transformer的模型能有效捕获长距离依赖关系,但它们通常受困于二次复杂度,并因数据稀疏而过拟合。相反,高效的线性模型难以刻画复杂的非线性局部动态。此外,现有的多尺度框架通常依赖平均池化,其相当于非理想的低通滤波器,导致频谱混叠和高频瞬态的不可逆损失。为此,本文提出了DPWMixer,一种计算高效的双路径架构。该框架建立在取代传统池化的无损哈尔小波金字塔之上,利用正交分解在不损失信息的前提下显式解耦趋势与局部波动。为了处理这些分量,我们设计了双路径趋势混合器,它集成了用于宏观趋势锚定的全局线性映射和用于微观动态演化的灵活的基于补丁的MLP-Mixer。最后,自适应多尺度融合模块按通道平稳性加权,整合不同尺度的预测以优化合成。在八个公共基准上的大量实验表明,我们的方法相对最先进的基线取得了一致的改进。代码可在https://github.com/hit636/DPWMixer上获得。
摘要:Long-term time series forecasting (LTSF) is a critical task in computational intelligence. While Transformer-based models effectively capture long-range dependencies, they often suffer from quadratic complexity and overfitting due to data sparsity. Conversely, efficient linear models struggle to depict complex non-linear local dynamics. Furthermore, existing multi-scale frameworks typically rely on average pooling, which acts as a non-ideal low-pass filter, leading to spectral aliasing and the irreversible loss of high-frequency transients. In response, this paper proposes DPWMixer, a computationally efficient Dual-Path architecture. The framework is built upon a Lossless Haar Wavelet Pyramid that replaces traditional pooling, utilizing orthogonal decomposition to explicitly disentangle trends and local fluctuations without information loss. To process these components, we design a Dual-Path Trend Mixer that integrates a global linear mapping for macro-trend anchoring and a flexible patch-based MLP-Mixer for micro-dynamic evolution. Finally, an adaptive multi-scale fusion module integrates predictions from diverse scales, weighted by channel stationarity to optimize synthesis. Extensive experiments on eight public benchmarks demonstrate that our method achieves a consistent improvement over state-of-the-art baselines. The code is available at https://github.com/hit636/DPWMixer.
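摘要中"无损哈尔小波金字塔取代平均池化"的核心观点可以用一个极简示例说明(纯属示意,并非论文实现):一层哈尔变换同时保留近似与细节两组系数,因此可以精确重构原序列;而平均池化只保留低通的一半,高频瞬态无法恢复。

```python
# 示意性草图(非论文实现):一层哈尔变换 = 近似系数 + 细节系数,
# 可以无损重构;平均池化只保留低通的一半,细节不可恢复。
import math

def haar_level(x):
    """对偶数长度序列做一层哈尔变换,返回(近似系数, 细节系数)。"""
    approx = [(x[2 * i] + x[2 * i + 1]) / math.sqrt(2) for i in range(len(x) // 2)]
    detail = [(x[2 * i] - x[2 * i + 1]) / math.sqrt(2) for i in range(len(x) // 2)]
    return approx, detail

def haar_inverse(approx, detail):
    """由近似系数与细节系数精确重构原序列。"""
    x = []
    for a, d in zip(approx, detail):
        x.append((a + d) / math.sqrt(2))
        x.append((a - d) / math.sqrt(2))
    return x

series = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
approx, detail = haar_level(series)
reconstructed = haar_inverse(approx, detail)

# 平均池化丢弃了细节系数,例如5到9的跳变再也无法恢复
pooled = [(series[2 * i] + series[2 * i + 1]) / 2 for i in range(len(series) // 2)]
```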
【7】Opening the Black Box: Nowcasting Singapore's GDP Growth and its Explainability
标题:打开黑匣子:当前预测新加坡GDP增长及其解释性
链接:https://arxiv.org/abs/2512.02092
作者:Luca Attolico
备注:PhD thesis, University of Macerata (2025). PhD programme: Quantitative Methods for Policy Evaluation (Cycle XXXVII). Supervisors: Rosaria Romano, Jamus Jerome Lim
摘要:及时评估当前状况至关重要,特别是对于像新加坡这样的小型开放经济体,外部冲击会迅速传导至国内活动。我们使用约70个指标的高维面板(涵盖1990年第一季度至2023年第二季度的经济和金融指标),开发了季度GDP增长的实时临近预测框架。分析涵盖惩罚回归、降维方法、集成学习算法和神经架构,并以随机游走、AR(3)和动态因子模型为基准。该流程通过带贝叶斯超参数优化的扩展窗口滚动前推(walk-forward)设计保持时间顺序,并使用移动块自助法(moving block bootstrap)构建预测区间以及特征重要性度量的置信带,同时采用特定于模型和基于XAI的可解释性工具。模型置信集(Model Confidence Set)程序识别统计上更优的学习器,再通过简单、加权和指数加权方案进行组合,由此产生的时变权重为模型贡献提供了可解释的表示。预测能力通过Giacomini-White检验评估。实证结果表明,惩罚回归、降维模型和GRU网络的表现始终优于所有基准,RMSFE降低约40-60%;组合可带来进一步的收益。特征归因方法表明,工业生产、对外贸易和劳动力市场指标是新加坡短期增长动态的主要驱动因素。
摘要:Timely assessment of current conditions is essential especially for small, open economies such as Singapore, where external shocks transmit rapidly to domestic activity. We develop a real-time nowcasting framework for quarterly GDP growth using a high-dimensional panel of approximately 70 indicators, encompassing economic and financial indicators over 1990Q1-2023Q2. The analysis covers penalized regressions, dimensionality-reduction methods, ensemble learning algorithms, and neural architectures, benchmarked against a Random Walk, an AR(3), and a Dynamic Factor Model. The pipeline preserves temporal ordering through an expanding-window walk-forward design with Bayesian hyperparameter optimization, and uses moving block-bootstrap procedures both to construct prediction intervals and to obtain confidence bands for feature-importance measures. It adopts model-specific and XAI-based explainability tools. A Model Confidence Set procedure identifies statistically superior learners, which are then combined through simple, weighted, and exponentially weighted schemes; the resulting time-varying weights provide an interpretable representation of model contributions. Predictive ability is assessed via Giacomini-White tests. Empirical results show that penalized regressions, dimensionality-reduction models, and GRU networks consistently outperform all benchmarks, with RMSFE reductions of roughly 40-60%; aggregation delivers further gains. Feature-attribution methods highlight industrial production, external trade, and labor-market indicators as dominant drivers of Singapore's short-run growth dynamics.
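摘要中的扩展窗口滚动前推(walk-forward)设计可以用几行代码示意(非论文代码,expanding_window_splits等名称均为虚构):每一折的训练集只包含测试期之前的观测,从而保持时间顺序。

```python
# 示意性草图(非论文代码,函数名为虚构):扩展窗口滚动前推划分,
# 每个测试期只使用其之前的观测训练,保持时间顺序。
def expanding_window_splits(n_obs, min_train, horizon=1):
    """返回 (训练索引, 测试索引) 列表。"""
    splits = []
    for end in range(min_train, n_obs - horizon + 1):
        splits.append((list(range(end)), list(range(end, end + horizon))))
    return splits

splits = expanding_window_splits(n_obs=10, min_train=6)
# 共4折:训练集长度依次为6、7、8、9,测试集均为紧随其后的一个季度
```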
其他神经网络|深度学习|模型|建模(26篇)
【1】Learning Physically Consistent Lagrangian Control Models Without Acceleration Measurements
标题:无需加速度测量即可学习物理一致的拉格朗日控制模型
链接:https://arxiv.org/abs/2512.03035
作者:Ibrahim Laiche,Mokrane Boudaoud,Patrick Gallinari,Pascal Morin
备注:Submitted to the L4DC 2026
摘要:本文研究了拉格朗日系统的建模和控制涉及非保守力使用的混合方法,不需要加速度计算。它特别侧重于物理一致的模型,这是基于模型的控制合成必不可少的推导和识别。拉格朗日或哈密顿神经网络提供了有用的结构保证,但这种模型的学习往往会导致不一致的模型,特别是在训练数据有限,部分和噪声的真实物理系统。出于这一观察和目标,利用这些模型的基于模型的非线性控制,学习算法依赖于一个原始的损失函数,提出了提高拉格朗日系统的物理一致性。不同的基于学习的建模方法与所提出的解决方案的比较分析表明,在物理一致性方面的学习模型,模拟和实验系统的显着改善。然后利用模型的一致性来证明,在实验基准,反馈线性化和基于能量的控制技术的方法的实际意义。
摘要:This article investigates the modeling and control of Lagrangian systems involving non-conservative forces using a hybrid method that does not require acceleration calculations. It focuses in particular on the derivation and identification of physically consistent models, which are essential for model-based control synthesis. Lagrangian or Hamiltonian neural networks provide useful structural guarantees but the learning of such models often leads to inconsistent models, especially on real physical systems where training data are limited, partial and noisy. Motivated by this observation and the objective to exploit these models for model-based nonlinear control, a learning algorithm relying on an original loss function is proposed to improve the physical consistency of Lagrangian systems. A comparative analysis of different learning-based modeling approaches with the proposed solution shows significant improvements in terms of physical consistency of the learned models, on both simulated and experimental systems. The model's consistency is then exploited to demonstrate, on an experimental benchmark, the practical relevance of the proposed methodology for feedback linearization and energy-based control techniques.
【2】LORE: A Large Generative Model for Search Relevance
标题:LORE:搜索相关性的大生成模型
链接:https://arxiv.org/abs/2512.03025
作者:Chenji Lu,Zhuo Chen,Hui Zhao,Zhiyuan Zeng,Gang Zhao,Junjie Ren,Ruicong Xu,Haoran Li,Songyan Liu,Pengjie Wang,Jian Xu,Bo Zheng
摘要:成就:我们介绍了LORE,一个在电子商务搜索中基于大生成模型的相关性的系统框架。经过三年多的部署和迭代,LORE在在线GoodRate指标上实现了累计+27%的改进。本报告分享了在其整个开发生命周期中获得的宝贵经验,涵盖数据、特征、训练、评估和部署。洞察:虽然现有工作应用思维链(CoT)来增强相关性,但它们经常遇到性能上限。我们认为,这源于将相关性视为单一任务,缺乏有原则的分解。我们的关键见解是,相关性包括不同的能力:知识与推理、多模态匹配以及规则遵循。我们认为,定性驱动的分解是突破当前性能瓶颈的关键。贡献:LORE为LLM相关性生命周期提供了完整的蓝图。主要贡献包括:(1)结合通过SFT进行渐进式CoT合成与通过RL进行人类偏好对齐的两阶段训练范式。(2)一个旨在评估这些核心能力的全面基准RAIR。(3)将离线LLM能力高效迁移到在线系统的查询频率分层部署策略。LORE为其他垂直领域提供了实用的解决方案和方法参考。
摘要:Achievement. We introduce LORE, a systematic framework for Large Generative Model-based relevance in e-commerce search. Deployed and iterated over three years, LORE achieves a cumulative +27% improvement in online GoodRate metrics. This report shares the valuable experience gained throughout its development lifecycle, spanning data, features, training, evaluation, and deployment. Insight. While existing works apply Chain-of-Thought (CoT) to enhance relevance, they often hit a performance ceiling. We argue this stems from treating relevance as a monolithic task, lacking principled deconstruction. Our key insight is that relevance comprises distinct capabilities: knowledge and reasoning, multi-modal matching, and rule adherence. We contend that a qualitative-driven decomposition is essential for breaking through current performance bottlenecks. Contributions. LORE provides a complete blueprint for the LLM relevance lifecycle. Key contributions include: (1) A two-stage training paradigm combining progressive CoT synthesis via SFT with human preference alignment via RL. (2) A comprehensive benchmark, RAIR, designed to evaluate these core capabilities. (3) A query frequency-stratified deployment strategy that efficiently transfers offline LLM capabilities to the online system. LORE serves as both a practical solution and a methodological reference for other vertical domains.
【3】ProteinPNet: Prototypical Part Networks for Concept Learning in Spatial Proteomics
标题:ProteinPNet:用于空间蛋白质组学概念学习的原型部分网络
链接:https://arxiv.org/abs/2512.02983
作者:Louis McConnell,Jieran Sun,Theo Maffei,Raphael Gottardo,Marianna Rapsomaniki
摘要:了解肿瘤微环境(TME)的空间结构对于推进精确肿瘤学至关重要。我们提出了ProteinPNet,一个基于原型部分网络、从空间蛋白质组学数据中发现TME模体的新框架。与传统的事后可解释性模型不同,ProteinPNet通过监督训练直接学习有判别力、可解释且忠实的空间原型。我们在带有真值模体的合成数据集上验证了我们的方法,并进一步在真实世界的肺癌空间蛋白质组学数据集上进行了测试。ProteinPNet始终能识别出与不同肿瘤亚型一致、具有生物学意义的原型。通过图分析和形态学分析,我们表明这些原型捕获了可解释的特征,指向免疫浸润和组织模块化方面的差异。我们的结果突出了基于原型的学习在揭示TME内可解释空间生物标志物方面的潜力,并对空间组学中的机制发现具有重要意义。
摘要:Understanding the spatial architecture of the tumor microenvironment (TME) is critical to advance precision oncology. We present ProteinPNet, a novel framework based on prototypical part networks that discovers TME motifs from spatial proteomics data. Unlike traditional post-hoc explainability models, ProteinPNet directly learns discriminative, interpretable, faithful spatial prototypes through supervised training. We validate our approach on synthetic datasets with ground truth motifs, and further test it on a real-world lung cancer spatial proteomics dataset. ProteinPNet consistently identifies biologically meaningful prototypes aligned with different tumor subtypes. Through graphical and morphological analyses, we show that these prototypes capture interpretable features pointing to differences in immune infiltration and tissue modularity. Our results highlight the potential of prototype-based learning to reveal interpretable spatial biomarkers within the TME, with implications for mechanistic discovery in spatial omics.
【4】Hypothesis Testing for Generalized Thurstone Models
标题:广义Thurstone模型的假设检验
链接:https://arxiv.org/abs/2512.02912
作者:Anuran Makur,Japneet Singh
备注:35 pages, 9 figures
摘要:在这项工作中,我们开发了一个假设检验框架,以确定成对比较数据是否由给定选择函数$F$对应的潜在广义Thurstone模型$\mathcal{T}_F$生成。以往的工作主要集中于此类模型的参数估计和不确定性量化,而我们解决的是$\mathcal{T}_F$模型极小极大假设检验这一基本问题。我们通过引入一般成对比较模型与$\mathcal{T}_F$模型类之间分离距离的概念来刻画这一检验问题,然后推导出检验临界阈值的上界和下界,它们依赖于观测图的拓扑结构。对于完全观测图的特殊情况,该阈值的尺度为$\Theta((nk)^{-1/2})$,其中$n$是个体数量,$k$是每对之间的比较次数。此外,我们基于分离距离提出了一个假设检验,构造了置信区间,利用逆鞅技术建立了第一类和第二类错误概率的时间一致上界,并利用信息论方法推导了极小极大下界。最后,我们通过在合成和真实世界数据集上的实验验证了我们的结果。
摘要:In this work, we develop a hypothesis testing framework to determine whether pairwise comparison data is generated by an underlying generalized Thurstone model $\mathcal{T}_F$ for a given choice function $F$. While prior work has predominantly focused on parameter estimation and uncertainty quantification for such models, we address the fundamental problem of minimax hypothesis testing for $\mathcal{T}_F$ models. We formulate this testing problem by introducing a notion of separation distance between general pairwise comparison models and the class of $\mathcal{T}_F$ models. We then derive upper and lower bounds on the critical threshold for testing that depend on the topology of the observation graph. For the special case of complete observation graphs, this threshold scales as $\Theta((nk)^{-1/2})$, where $n$ is the number of agents and $k$ is the number of comparisons per pair. Furthermore, we propose a hypothesis test based on our separation distance, construct confidence intervals, establish time-uniform bounds on the probabilities of type I and II errors using reverse martingale techniques, and derive minimax lower bounds using information-theoretic methods. Finally, we validate our results through experiments on synthetic and real-world datasets.
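广义Thurstone模型的设定可以用一个小模拟说明(仅为背景示意,与论文的检验方法无关;这里假设选择函数F为logistic,此时模型退化为Bradley-Terry模型):个体i战胜j的概率为F(theta_i - theta_j)。

```python
# 背景示意(与论文的检验方法无关):广义Thurstone模型下,
# 个体i战胜j的概率为F(theta_i - theta_j)。这里假设F为logistic,
# 即Bradley-Terry特例,并用模拟检查经验胜率是否与F一致。
import math, random

random.seed(0)

def F(x):
    # logistic选择函数(本示例的假设)
    return 1.0 / (1.0 + math.exp(-x))

theta = {"A": 1.0, "B": 0.0}   # 潜在强度
k = 20000                       # 每对的比较次数
wins_A = sum(random.random() < F(theta["A"] - theta["B"]) for _ in range(k))
empirical = wins_A / k
expected = F(1.0)               # 约0.731
```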
【5】VLA Models Are More Generalizable Than You Think: Revisiting Physical and Spatial Modeling
标题:VLA模型比您想象的更可推广:重新审视物理和空间建模
链接:https://arxiv.org/abs/2512.02902
作者:Weiqi Li,Quande Zhang,Ruifeng Zhai,Liang Lin,Guangrun Wang
摘要:视觉-语言-动作(VLA)模型具有很强的分布内性能,但在新的相机视点和视觉扰动下会急剧退化。我们发现,这种脆弱性主要源于空间建模而非物理建模的失准。为了解决这个问题,我们提出了一个单次(one-shot)适应框架,通过轻量级的可学习更新重新校准视觉表示。我们的第一种方法,特征令牌调制(FTM),对视觉令牌施加全局仿射变换,仅用4K参数就将Libero视点准确率从48.5%提高到87.1%。在此基础上,特征线性自适应(FLA)为ViT编码器引入低秩更新,使用470万个参数实现了90.8%的成功率,以远低于LoRA规模微调的成本达到其水平。总之,这些结果揭示了预训练VLA模型中大量未开发的鲁棒性,并证明有针对性的最小视觉适应足以恢复视点泛化。
摘要:Vision-language-action (VLA) models achieve strong in-distribution performance but degrade sharply under novel camera viewpoints and visual perturbations. We show that this brittleness primarily arises from misalignment in Spatial Modeling, rather than Physical Modeling. To address this, we propose a one-shot adaptation framework that recalibrates visual representations through lightweight, learnable updates. Our first method, Feature Token Modulation (FTM), applies a global affine transformation to visual tokens and improves Libero viewpoint accuracy from 48.5% to 87.1% with only 4K parameters. Building on this, Feature Linear Adaptation (FLA) introduces low-rank updates to the ViT encoder, achieving 90.8% success with 4.7M parameters -- matching LoRA-scale finetuning at far lower cost. Together, these results reveal substantial untapped robustness in pretrained VLA models and demonstrate that targeted, minimal visual adaptation is sufficient to restore viewpoint generalization.
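按我们对摘要的理解(示意性草图,并非官方实现),特征令牌调制(FTM)对所有视觉令牌施加同一个可学习的全局仿射变换,对d维令牌只需2d个参数,这与摘要中极小参数量的说法一致:

```python
# 按我们对摘要的理解给出的示意(并非官方实现):FTM对每个视觉令牌
# 施加同一个可学习的全局仿射变换 t -> gamma * t + beta,
# d维令牌只需2d个参数。
d = 4                      # 令牌维度(玩具取值)
gamma = [1.0] * d          # 可学习缩放,初始化为恒等
beta = [0.0] * d           # 可学习偏移,初始化为零
n_params = 2 * d

def modulate(tokens):
    return [[g * v + b for g, v, b in zip(gamma, tok, beta)] for tok in tokens]

tokens = [[0.5, -1.0, 2.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
out = modulate(tokens)     # 恒等初始化下令牌保持不变
```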
【6】From Navigation to Refinement: Revealing the Two-Stage Nature of Flow-based Diffusion Models through Oracle Velocity
标题:从导航到细化:通过Oracle Speed揭示基于流的扩散模型的两阶段性质
链接:https://arxiv.org/abs/2512.02826
作者:Haoming Liu,Jinnuo Liu,Yanhao Li,Liuyang Bai,Yunkai Ji,Yuanhe Guo,Shenji Wan,Hongyi Wen
备注:Preprint version; 15 pages, 16 figures
摘要:基于流的扩散模型已经成为跨图像和视频训练生成模型的主流范式。然而,人们对其记忆-泛化行为仍知之甚少。在这项工作中,我们重新审视流匹配(FM)目标并研究其边际速度场,它具有封闭形式的表达式,从而可以精确计算oracle FM目标。对这个oracle速度场的分析表明,基于流的扩散模型本质上构成了一个两阶段的训练目标:由混合数据模式引导的早期阶段,以及由最近的数据样本主导的后期阶段。两阶段目标导致不同的学习行为:早期导航阶段在数据模式间泛化以形成全局布局,而后期细化阶段则越来越多地记忆细粒度细节。利用这些见解,我们解释了时间步平移调度(timestep-shifted schedules)、无分类器引导区间(classifier-free guidance intervals)和潜在空间设计选择等实用技术的有效性。我们的研究加深了对扩散模型训练动态的理解,并为指导未来的架构和算法改进提供了原则。
摘要:Flow-based diffusion models have emerged as a leading paradigm for training generative models across images and videos. However, their memorization-generalization behavior remains poorly understood. In this work, we revisit the flow matching (FM) objective and study its marginal velocity field, which admits a closed-form expression, allowing exact computation of the oracle FM target. Analyzing this oracle velocity field reveals that flow-based diffusion models inherently formulate a two-stage training target: an early stage guided by a mixture of data modes, and a later stage dominated by the nearest data sample. The two-stage objective leads to distinct learning behaviors: the early navigation stage generalizes across data modes to form global layouts, whereas the later refinement stage increasingly memorizes fine-grained details. Leveraging these insights, we explain the effectiveness of practical techniques such as timestep-shifted schedules, classifier-free guidance intervals, and latent space design choices. Our study deepens the understanding of diffusion model training dynamics and offers principles for guiding future architectural and algorithmic improvements.
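摘要所说的"边际速度场具有封闭形式"可以在一维玩具例子中直接验证(示意性计算,并非论文代码):对线性插值路径x_t=(1-t)x0+t*x1且x0~N(0,1),oracle速度是各数据样本条件速度(x1_i-x)/(1-t)的后验加权平均;小t时各模式权重接近均匀,t接近1时权重集中于最近样本,正对应摘要中的两阶段行为。

```python
# 一维玩具示例(示意性计算,非论文代码):对线性路径
# x_t = (1 - t) * x0 + t * x1 且 x0 ~ N(0, 1),oracle速度是各数据样本
# 条件速度 (x1_i - x) / (1 - t) 的后验加权平均。
import math

data = [-2.0, 2.0]   # 两个数据"模式"

def posterior_weights(x, t):
    logs = [-(x - t * x1) ** 2 / (2 * (1 - t) ** 2) for x1 in data]
    m = max(logs)
    ws = [math.exp(l - m) for l in logs]
    z = sum(ws)
    return [w / z for w in ws]

def oracle_velocity(x, t):
    ws = posterior_weights(x, t)
    return sum(w * (x1 - x) / (1 - t) for w, x1 in zip(ws, data))

early = posterior_weights(0.5, 0.05)   # 早期:两个模式都有贡献
late = posterior_weights(0.5, 0.95)    # 后期:最近样本(x1=2)占主导
```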
【7】Embedding networks with the random walk first return time distribution
标题:具有随机游走首次返回时间分布的嵌入网络
链接:https://arxiv.org/abs/2512.02694
作者:Vedanta Thapar,Renaud Lambiotte,George T. Cantwell
摘要:我们提出将随机游走的首次返回时间分布(FRTD)作为一种可解释且有数学基础的节点嵌入。FRTD为每个节点分配一个概率质量函数,使我们能够使用离散分布的标准度量来定义任意一对节点之间的距离。我们给出了几个支持FRTD嵌入的论据。首先,我们证明FRTD严格比特征值谱的信息更丰富,但不足以完全识别图,从而将FRTD等价置于共谱性与同构之间。其次,我们论证了节点之间的FRTD等价捕获了结构相似性。第三,我们通过实验证明,FRTD嵌入在网络对齐任务中优于手工设计的图度量。最后,我们表明,近似匹配目标FRTD的随机网络也保留了其他显著特征。这些结果共同证明了FRTD是复杂网络的一种简单且有数学原则的嵌入。
摘要:We propose the first return time distribution (FRTD) of a random walk as an interpretable and mathematically grounded node embedding. The FRTD assigns a probability mass function to each node, allowing us to define a distance between any pair of nodes using standard metrics for discrete distributions. We present several arguments to motivate the FRTD embedding. First, we show that FRTDs are strictly more informative than eigenvalue spectra, yet insufficient for complete graph identification, thus placing FRTD equivalence between cospectrality and isomorphism. Second, we argue that FRTD equivalence between nodes captures structural similarity. Third, we empirically demonstrate that the FRTD embedding outperforms manually designed graph metrics in network alignment tasks. Finally, we show that random networks that approximately match the FRTD of a desired target also preserve other salient features. Together these results demonstrate the FRTD as a simple and mathematically principled embedding for complex networks.
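FRTD本身可以用几行动态规划在小图上精确计算(示意性实现,非论文代码):在4-环上所有节点结构等价,因此任意两个节点FRTD之间的全变差距离为零,这正体现了摘要所说的"FRTD等价捕获结构相似性"。

```python
# 示意性实现(非论文代码):用动态规划精确计算简单随机游走的FRTD,
# dist[i] = 首次返回v发生在第i+2步的概率(无自环时第1步不可能返回)。
def frtd(adj, v, k_max):
    deg = {u: len(nb) for u, nb in adj.items()}
    p = {u: 0.0 for u in adj}          # 第1步后、尚未返回v的质量分布
    for u in adj[v]:
        p[u] += 1.0 / deg[v]
    dist = []
    for _ in range(2, k_max + 1):
        new_p = {u: 0.0 for u in adj}
        returned = 0.0
        for u, mass in p.items():
            if mass == 0.0:
                continue
            share = mass / deg[u]
            for w in adj[u]:
                if w == v:
                    returned += share   # 这一步首次回到v
                else:
                    new_p[w] += share
        dist.append(returned)
        p = new_p
    return dist

def total_variation(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}   # 4-环
d0 = frtd(cycle, 0, 6)   # [1/2, 0, 1/4, 0, 1/8]
d1 = frtd(cycle, 1, 6)
```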
【8】Distill, Forget, Repeat: A Framework for Continual Unlearning in Text-to-Image Diffusion Models
标题:提取、忘记、重复:文本到图像扩散模型中持续忘记学习的框架
链接:https://arxiv.org/abs/2512.02657
作者:Naveen George,Naoki Murata,Yuhta Takida,Konda Reddy Mopuri,Yuki Mitsufuji
备注:Preprint
摘要:近年来,在庞大的网络规模数据集上训练的视觉生成模型快速增长,与数据隐私法规和版权法(如GDPR的"被遗忘权")产生了巨大张力。这就需要机器遗忘(MU)在不付出高昂再训练成本的情况下移除特定概念。然而,现有的MU技术从根本上不适合删除请求顺序到达的真实场景,这种设置被称为持续遗忘(CUL)。在持续场景中天真地应用一次性方法会引发稳定性危机,导致一连串退化,其特征是保留崩溃、对相关概念不断累积的附带损害,以及生成质量的急剧下降。为了解决这一关键挑战,我们引入了一种新的基于生成蒸馏的持续遗忘框架,确保在删除请求序列下实现有针对性且稳定的遗忘。通过将每个遗忘步骤重新构建为多目标的师生蒸馏过程,该框架利用持续学习的原理保持模型完整性。在10步顺序基准上的实验表明,我们的方法以更高的保真度遗忘目标概念,且不会显著干扰保留概念的性能或整体图像质量,大大优于基线。该框架为负责任地部署和维护大规模生成模型提供了一条可行途径,使行业能够以切实有效的方式遵从持续的数据删除请求。
摘要:The recent rapid growth of visual generative models trained on vast web-scale datasets has created significant tension with data privacy regulations and copyright laws, such as GDPR's "Right to be Forgotten." This necessitates machine unlearning (MU) to remove specific concepts without the prohibitive cost of retraining. However, existing MU techniques are fundamentally ill-equipped for real-world scenarios where deletion requests arrive sequentially, a setting known as continual unlearning (CUL). Naively applying one-shot methods in a continual setting triggers a stability crisis, leading to a cascade of degradation characterized by retention collapse, compounding collateral damage to related concepts, and a sharp decline in generative quality. To address this critical challenge, we introduce a novel generative distillation based continual unlearning framework that ensures targeted and stable unlearning under sequences of deletion requests. By reframing each unlearning step as a multi-objective, teacher-student distillation process, the framework leverages principles from continual learning to maintain model integrity. Experiments on a 10-step sequential benchmark demonstrate that our method unlearns forget concepts with better fidelity and achieves this without significant interference to the performance on retain concepts or the overall image quality, substantially outperforming baselines. This framework provides a viable pathway for the responsible deployment and maintenance of large-scale generative models, enabling industries to comply with ongoing data removal requests in a practical and effective manner.
【9】Tensor Network Based Feature Learning Model
标题:基于张量网络的特征学习模型
链接:https://arxiv.org/abs/2512.02547
作者:Albert Saiapin,Kim Batselier
备注:11 pages, 2 figures, 2 tables. Code available at: https://github.com/AlbMLpy/TN-FL-Model
摘要:人们提出了许多近似方法来规避基于核的算法的三次复杂度,使其能够应用于大规模数据集。一种策略是考虑学习问题的原始形式,通过张量积结构的多项式和傅立叶特征将数据映射到更高维空间。这些张量积特征带来的维数灾难可通过对模型参数进行张量网络重新参数化而有效解决。然而,模型训练的另一个重要方面,即识别最佳特征超参数,尚未得到解决,通常使用标准交叉验证方法进行处理。在本文中,我们介绍了特征学习(FL)模型,它通过将张量积特征表示为可学习的规范多元分解(CPD)来解决这个问题。利用这种CPD结构,我们使用交替最小二乘(ALS)优化方法,在学习模型参数的同时高效地学习与不同特征相关的超参数。我们通过在不同维度和规模的真实数据上的实验证明了FL模型的有效性。结果表明,FL模型的训练速度始终比标准交叉验证模型快3-5倍,而预测质量与之相当。
摘要:Many approximations were suggested to circumvent the cubic complexity of kernel-based algorithms, allowing their application to large-scale datasets. One strategy is to consider the primal formulation of the learning problem by mapping the data to a higher-dimensional space using tensor-product structured polynomial and Fourier features. The curse of dimensionality due to these tensor-product features was effectively solved by a tensor network reparameterization of the model parameters. However, another important aspect of model training - identifying optimal feature hyperparameters - has not been addressed and is typically handled using the standard cross-validation approach. In this paper, we introduce the Feature Learning (FL) model, which addresses this issue by representing tensor-product features as a learnable Canonical Polyadic Decomposition (CPD). By leveraging this CPD structure, we efficiently learn the hyperparameters associated with different features alongside the model parameters using an Alternating Least Squares (ALS) optimization method. We prove the effectiveness of the FL model through experiments on real data of various dimensionality and scale. The results show that the FL model can be consistently trained 3-5 times faster than a standard cross-validated model while achieving prediction quality on par with it.
【10】WorldPack: Compressed Memory Improves Spatial Consistency in Video World Modeling
标题:WorldPack:压缩内存改善视频世界建模中的空间一致性
链接:https://arxiv.org/abs/2512.02473
作者:Yuta Oshima,Yusuke Iwasawa,Masahiro Suzuki,Yutaka Matsuo,Hiroki Furuta
摘要:视频世界模型已经吸引了显着的关注,因为它们能够产生高保真的未来视觉观察条件下,过去的观察和导航行动。时间和空间一致的长期世界建模一直是一个长期存在的问题,即使是最近的最先进的模型也无法解决,这是由于长上下文输入的计算成本非常昂贵。在本文中,我们提出了WorldPack,一个视频世界模型,具有高效的压缩内存,显着提高了空间的一致性,保真度和长期生成的质量,尽管更短的上下文长度。我们的压缩内存包括轨迹包装和内存检索;轨迹包装实现了高上下文效率,内存检索保持了卷展的一致性,并帮助需要空间推理的长期生成。我们的性能是用LoopNav评估的,LoopNav是Minecraft上的一个基准,专门用于评估长期一致性,我们验证了WorldPack明显优于强大的最先进的模型。
摘要:Video world models have attracted significant attention for their ability to produce high-fidelity future visual observations conditioned on past observations and navigation actions. Temporally- and spatially-consistent, long-term world modeling has been a long-standing problem, unresolved with even recent state-of-the-art models, due to the prohibitively expensive computational costs for long-context inputs. In this paper, we propose WorldPack, a video world model with efficient compressed memory, which significantly improves spatial consistency, fidelity, and quality in long-term generation despite much shorter context length. Our compressed memory consists of trajectory packing and memory retrieval; trajectory packing realizes high context efficiency, and memory retrieval maintains the consistency in rollouts and helps long-term generations that require spatial reasoning. Our performance is evaluated with LoopNav, a benchmark on Minecraft, specialized for the evaluation of long-term consistency, and we verify that WorldPack notably outperforms strong state-of-the-art models.
【11】Risk-Sensitive Q-Learning in Continuous Time with Application to Dynamic Portfolio Selection
标题:连续时间的风险敏感Q学习及其在动态投资组合选择中的应用
链接:https://arxiv.org/abs/2512.02386
作者:Chuhan Xie
摘要:本文研究连续时间下的风险敏感强化学习问题,其中环境由可控随机微分方程描述,目标是累积报酬的潜在非线性泛函。我们证明,当该泛函是一个优化确定性等价(OCE)时,最优策略关于一个增广环境是马尔可夫的。我们还提出了CT-RS-q,一个基于新颖鞅刻画方法的风险敏感q学习算法。最后,我们对一个动态投资组合选择问题进行了仿真研究,并说明了我们算法的有效性。
摘要:This paper studies the problem of risk-sensitive reinforcement learning (RSRL) in continuous time, where the environment is characterized by a controllable stochastic differential equation (SDE) and the objective is a potentially nonlinear functional of cumulative rewards. We prove that when the functional is an optimized certainty equivalent (OCE), the optimal policy is Markovian with respect to an augmented environment. We also propose CT-RS-q, a risk-sensitive q-learning algorithm based on a novel martingale characterization approach. Finally, we run a simulation study on a dynamic portfolio selection problem and illustrate the effectiveness of our algorithm.
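摘要中的优化确定性等价(OCE)定义为OCE_u(X)=sup_eta{eta+E[u(X-eta)]}。下面的小例子(仅为背景说明,与论文算法无关)用网格搜索验证:对指数效用u(t)=(1-exp(-g t))/g,该上确界等于熵风险度量-(1/g)log E[exp(-g X)]。

```python
# 背景示意(与论文算法无关):OCE_u(X) = sup_eta { eta + E[u(X - eta)] }。
# 对指数效用 u(t) = (1 - exp(-g*t)) / g,上确界有封闭形式
# -(1/g) * log E[exp(-g*X)](熵风险度量),这里用网格搜索验证。
import math

X = [1.0, 2.0, -0.5, 0.0]   # 等概率的收益情形
g = 2.0                      # 风险厌恶参数

def objective(eta):
    u = lambda t: (1.0 - math.exp(-g * t)) / g
    return eta + sum(u(x - eta) for x in X) / len(X)

grid_best = max(objective(k / 1000.0) for k in range(-3000, 3001))
closed_form = -(1.0 / g) * math.log(sum(math.exp(-g * x) for x in X) / len(X))
```

由于OCE是风险厌恶的,其封闭形式值低于期望收益均值。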
【12】Retrieval-Augmented Memory for Online Learning
标题:用于在线学习的检索增强记忆
链接:https://arxiv.org/abs/2512.02333
作者:Wenzhang Du
备注:11 pages, 3 figures
摘要:检索增强模型将参数化预测器与非参数记忆相结合,但它们在具有概念漂移的流式监督学习中的应用还没有得到很好的理解。我们研究了非平稳环境中的在线分类,并提出了在线学习的检索增强记忆(RAM-OL),这是随机梯度下降的一个简单扩展,维护一个过去样本的小缓冲区。在每个时间步,RAM-OL在隐藏表示空间中检索当前输入的几个最近邻,并在当前样本和检索到的邻居上联合更新模型。我们比较了朴素重放变体与门控重放变体,后者使用时间窗口、相似性阈值和梯度重新加权来约束邻居,以在快速重用相关历史数据与对过时状态的鲁棒性之间取得平衡。从理论角度,我们在有界漂移模型下解释RAM-OL,并讨论当模式随时间重现时,检索如何降低适应成本并改善后悔常数。在实证方面,我们在一个简单的在线多层感知器上实例化RAM-OL,并在来自电价、电力负荷和航班延误数据的三个真实数据流上对其进行评估。在强烈和周期性漂移的流上,RAM-OL将prequential准确率提高了约7个百分点,并大大降低了随机种子之间的方差;而在嘈杂的航班流上,门控变体与纯在线基线表现相当。这些结果表明,检索增强记忆是概念漂移下在线学习的一个实用且鲁棒的工具。
摘要:Retrieval-augmented models couple parametric predictors with non-parametric memories, but their use in streaming supervised learning with concept drift is not well understood. We study online classification in non-stationary environments and propose Retrieval-Augmented Memory for Online Learning (RAM-OL), a simple extension of stochastic gradient descent that maintains a small buffer of past examples. At each time step, RAM-OL retrieves a few nearest neighbours of the current input in the hidden representation space and updates the model jointly on the current example and the retrieved neighbours. We compare a naive replay variant with a gated replay variant that constrains neighbours using a time window, similarity thresholds, and gradient reweighting, in order to balance fast reuse of relevant past data against robustness to outdated regimes. From a theoretical perspective, we interpret RAM-OL under a bounded drift model and discuss how retrieval can reduce adaptation cost and improve regret constants when patterns recur over time. Empirically, we instantiate RAM-OL on a simple online multilayer perceptron and evaluate it on three real-world data streams derived from electricity pricing, electricity load, and airline delay data. On strongly and periodically drifting streams, RAM-OL improves prequential accuracy by up to about seven percentage points and greatly reduces variance across random seeds, while on a noisy airline stream the gated variant closely matches the purely online baseline. These results show that retrieval-augmented memory is a practical and robust tool for online learning under concept drift.
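RAM-OL的核心循环可以按我们对摘要的理解草绘如下(示意性代码,非作者实现;线性模型和欧氏距离检索均为简化假设):维护一个过去样本缓冲区,检索当前输入的最近邻,并在当前样本与检索到的邻居上联合做SGD更新。

```python
# 按摘要理解的简化草图(非作者代码;线性模型与欧氏距离检索均为假设):
# 维护过去样本缓冲区,检索当前输入的最近邻,并联合做SGD更新。
def nearest(buffer, x, m=2):
    return sorted(buffer, key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], x)))[:m]

w = [0.0, 0.0]   # 线性模型参数
lr = 0.1
buffer = []

def sgd_step(batch):
    global w
    for x, y in batch:
        pred = sum(wi * xi for wi, xi in zip(w, x))
        err = pred - y
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]

stream = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0), ([1.0, 0.1], 1.0)]
for x, y in stream:
    sgd_step([(x, y)] + nearest(buffer, x))   # 当前样本 + 检索到的邻居
    buffer.append((x, y))
```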
【13】Limitations of Membership Queries in Testable Learning
标题:测试性学习中会员资格的局限性
链接:https://arxiv.org/abs/2512.02279
作者:Jane Lange,Mingda Qiao
备注:Conference: ITCS 2026
摘要:成员查询(MQ)通常会加速学习任务,特别是在特定分布的设置中。我们证明,在Rubinfeld和Vasilyan [RV23]的可测试学习模型中,成员查询无法将可测试学习算法的时间复杂度降低到仅用样本的分布特定学习的复杂度以下。在可测试学习模型中,每当数据分布满足期望的性质时,学习器必须输出一个假设,并且如果它输出假设,该假设必须是接近最优的。我们给出了从基于样本的布尔概念类反驳([Vadhan17, KL18]中提出)到带查询的可测试学习(TL-Q)的一般归约。通过[KL18]中从学习到反驳的归约,这给出了TL-Q的下界。其结果是,相对于一个概念类和一个分布族,没有任何$m$样本TL-Q算法能比最好的$m$样本PAC学习器在时间效率上有超多项式的提升。最后,我们定义了一类"统计"MQ算法,它涵盖许多已知的分布特定MQ学习器,例如基于影响力估计或子立方体条件统计查询的学习器。我们证明,这一类中的TL-Q算法蕴含高效的统计查询反驳和学习算法。因此,结合已知的SQ维下界,我们的结果意味着这些高效的成员查询学习器无法被可测试化。
摘要:Membership queries (MQ) often yield speedups for learning tasks, particularly in the distribution-specific setting. We show that in the \emph{testable learning} model of Rubinfeld and Vasilyan [RV23], membership queries cannot decrease the time complexity of testable learning algorithms beyond the complexity of sample-only distribution-specific learning. In the testable learning model, the learner must output a hypothesis whenever the data distribution satisfies a desired property, and if it outputs a hypothesis, the hypothesis must be near-optimal. We give a general reduction from sample-based \emph{refutation} of boolean concept classes, as presented in [Vadhan17, KL18], to testable learning with queries (TL-Q). This yields lower bounds for TL-Q via the reduction from learning to refutation given in [KL18]. The result is that, relative to a concept class and a distribution family, no $m$-sample TL-Q algorithm can be super-polynomially more time-efficient than the best $m$-sample PAC learner. Finally, we define a class of ``statistical'' MQ algorithms that encompasses many known distribution-specific MQ learners, such as those based on influence estimation or subcube-conditional statistical queries. We show that TL-Q algorithms in this class imply efficient statistical-query refutation and learning algorithms. Thus, combined with known SQ dimension lower bounds, our results imply that these efficient membership query learners cannot be made testable.
【14】Verifying Closed-Loop Contractivity of Learning-Based Controllers via Partitioning
标题:通过分区验证基于学习的控制器的闭环收缩性
链接:https://arxiv.org/abs/2512.02262
作者:Alexander Davydov
摘要:我们研究非线性控制系统的闭环收缩性验证问题,其中控制器和收缩度量均由神经网络参数化。利用区间分析和区间界传播,我们推导出一个易于处理且可扩展的闭环收缩性充分条件,该条件归结为检查一个对称Metzler矩阵的主特征值是否非正。我们将这一充分条件与域划分策略相结合,并将其整合到训练中。所提出的方法在倒立摆系统上得到了验证,展示了学习可证满足收缩条件的神经网络控制器和收缩度量的能力。
摘要:We address the problem of verifying closed-loop contraction in nonlinear control systems whose controller and contraction metric are both parameterized by neural networks. By leveraging interval analysis and interval bound propagation, we derive a tractable and scalable sufficient condition for closed-loop contractivity that reduces to checking that the dominant eigenvalue of a symmetric Metzler matrix is nonpositive. We combine this sufficient condition with a domain partitioning strategy to integrate this sufficient condition into training. The proposed approach is validated on an inverted pendulum system, demonstrating the ability to learn neural network controllers and contraction metrics that provably satisfy the contraction condition.
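摘要中的证书最终归结为"检查对称Metzler矩阵的主特征值非正"。下面用幂迭代给出一个玩具验证(示意性实现,并非论文的验证器):

```python
# 玩具验证(并非论文的验证器):用幂迭代检查对称Metzler矩阵
#(非对角元非负)的主特征值是否非正,先按Gershgorin界平移使
# 最大特征值成为模最大特征值。
def lambda_max(A, iters=500):
    n = len(A)
    shift = max(sum(abs(x) for x in row) for row in A)   # Gershgorin界
    B = [[A[i][j] + (shift if i == j else 0.0) for j in range(n)] for i in range(n)]
    v = [1.0] * n
    for _ in range(iters):
        w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = max(abs(x) for x in w) or 1.0
        v = [x / norm for x in w]
    # 收敛向量的Rayleigh商
    Av = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(a * b for a, b in zip(Av, v)) / sum(b * b for b in v)

M = [[-3.0, 1.0], [1.0, -2.0]]        # 对称Metzler矩阵
certificate = lambda_max(M) <= 0.0    # 特征值为(-5 ± sqrt(5))/2,均为负
```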
【15】On the Approximation of Phylogenetic Distance Functions by Artificial Neural Networks
标题:人工神经网络对系统发生距离函数的逼近
链接:https://arxiv.org/abs/2512.02223
作者:Benjamin K. Rosenzweig,Matthew W. Hahn
备注:10 pages
摘要:推断一组生物体之间的系统发育关系是现代生物学中的一个基本问题。虽然基于距离的分层聚类算法在这项任务上取得了早期的成功,但它们已被基于复杂分子进化模型的贝叶斯和最大似然搜索程序所取代。在这项工作中,我们描述了能够逼近经典系统发育距离函数的最小神经网络架构,以及在各种分子进化模型下学习距离所需的性质。与基于模型的推断(以及最近提出的无模型卷积和Transformer网络)相比,这些架构计算开销小,并可扩展到大量的分类单元和分子特征。学习到的距离函数泛化良好,并且在给定适当训练数据集的情况下,取得了与最先进推断方法相当的结果。
摘要:Inferring the phylogenetic relationships among a sample of organisms is a fundamental problem in modern biology. While distance-based hierarchical clustering algorithms achieved early success on this task, these have been supplanted by Bayesian and maximum likelihood search procedures based on complex models of molecular evolution. In this work we describe minimal neural network architectures that can approximate classic phylogenetic distance functions and the properties required to learn distances under a variety of molecular evolutionary models. In contrast to model-based inference (and recently proposed model-free convolutional and transformer networks), these architectures have a small computational footprint and are scalable to large numbers of taxa and molecular characters. The learned distance functions generalize well and, given an appropriate training dataset, achieve results comparable to state-of-the-art inference methods.
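作为背景(与论文无关的经典例子),这类网络要逼近的目标之一是Jukes-Cantor距离d=-(3/4)ln(1-4p/3),其中p为两条对齐序列的错配比例:

```python
# 经典背景示例(非论文内容):Jukes-Cantor距离
# d = -(3/4) * ln(1 - 4p/3),p为两条对齐序列的错配比例。
import math

def jukes_cantor(seq_a, seq_b):
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    p = mismatches / len(seq_a)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

d = jukes_cantor("ACGTACGTAC", "ACGTACGAAC")   # 10个位点中1个错配
```

由于校正了多重替换,d略大于原始错配比例p=0.1。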
【16】WhAM: Towards A Translative Model of Sperm Whale Vocalization
标题:WhAM:迈向抹香鲸发声的翻译模型
链接:https://arxiv.org/abs/2512.02206
作者:Orr Paradise,Pranav Muralikrishnan,Liangyuan Chen,Hugo Flores García,Bryan Pardo,Roee Diamant,David F. Gruber,Shane Gero,Shafi Goldwasser
备注:NeurIPS 2025
摘要:抹香鲸通过被称为尾音(coda)的短促咔哒声序列进行交流。我们提出了WhAM(鲸鱼声学模型),第一个能够从任意音频提示生成合成抹香鲸尾音的基于Transformer的模型。WhAM通过微调VampNet(一个在音乐音频上预训练的掩码声学令牌模型)构建,使用了过去二十年收集的1万条尾音录音。通过迭代的掩码令牌预测,WhAM生成高保真的合成尾音,保留了源录音的关键声学特征。我们使用Fréchet Audio Distance并通过与海洋生物学专家的感知研究来评估WhAM的合成尾音。在包括节奏、社会单元和元音分类在内的下游分类任务中,尽管WhAM是为生成而非分类训练的,其学习到的表示仍取得了强劲的性能。我们的代码可在https://github.com/Project-CETI/wham上获得
摘要:Sperm whales communicate in short sequences of clicks known as codas. We present WhAM (Whale Acoustics Model), the first transformer-based model capable of generating synthetic sperm whale codas from any audio prompt. WhAM is built by finetuning VampNet, a masked acoustic token model pretrained on musical audio, using 10k coda recordings collected over the past two decades. Through iterative masked token prediction, WhAM generates high-fidelity synthetic codas that preserve key acoustic features of the source recordings. We evaluate WhAM's synthetic codas using Fréchet Audio Distance and through perceptual studies with expert marine biologists. On downstream classification tasks including rhythm, social unit, and vowel classification, WhAM's learned representations achieve strong performance, despite being trained for generation rather than classification. Our code is available at https://github.com/Project-CETI/wham
【17】Modelling the Doughnut of social and planetary boundaries with frugal machine learning
标题:用节俭的机器学习建模社会和地球边界的甜甜圈
链接:https://arxiv.org/abs/2512.02200
作者:Stefano Vrizzi,Daniel W. O'Neill
摘要:社会和地球边界的“甜甜圈”已经成为评估环境和社会可持续性的流行框架。在这里,我们提供了一个概念验证分析,展示了机器学习(ML)方法如何应用于甜甜圈的简单宏观经济模型。首先,我们展示了如何使用ML方法来找到与“生活在甜甜圈内”相一致的政策参数。其次,我们展示了强化学习代理如何在参数空间中识别通向所需政策的最优轨迹。我们测试的方法,包括随机森林分类器和$Q$-学习,是节俭的机器学习方法,能够找到同时实现环境和社会可持续性的政策参数组合。下一步是将这些方法应用于更复杂的生态宏观经济模型。
摘要:The 'Doughnut' of social and planetary boundaries has emerged as a popular framework for assessing environmental and social sustainability. Here, we provide a proof-of-concept analysis that shows how machine learning (ML) methods can be applied to a simple macroeconomic model of the Doughnut. First, we show how ML methods can be used to find policy parameters that are consistent with 'living within the Doughnut'. Second, we show how a reinforcement learning agent can identify the optimal trajectory towards desired policies in the parameter space. The approaches we test, which include a Random Forest Classifier and $Q$-learning, are frugal ML methods that are able to find policy parameter combinations that achieve both environmental and social sustainability. The next step is the application of these methods to a more complex ecological macroeconomic model.
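The $Q$-learning half of the approach can be sketched in a few lines of plain Python. Everything below is an illustrative toy, not the paper's model: the 1-D parameter grid, the "safe band" standing in for living within the Doughnut, and all hyperparameters are assumptions.

```python
import random

random.seed(0)
STATES = range(11)              # toy 1-D policy-parameter grid (assumed)
ACTIONS = (-1, +1)              # step the parameter down or up
SAFE = set(range(4, 7))         # hypothetical "inside the Doughnut" band

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
alpha, gamma, eps = 0.5, 0.9, 0.2

def step(s, a):
    """Clip to the grid; reward 1 only inside the safe band."""
    s2 = min(max(s + a, 0), 10)
    return s2, (1.0 if s2 in SAFE else 0.0)

for _ in range(2000):                       # episodes
    s = random.choice(list(STATES))
    for _ in range(20):                     # steps per episode
        a = random.choice(ACTIONS) if random.random() < eps \
            else max(ACTIONS, key=lambda x: Q[(s, x)])
        s2, r = step(s, a)
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS)
                              - Q[(s, a)])
        s = s2

# Greedy policy: which direction to move the parameter from each value.
policy = {s: max(ACTIONS, key=lambda a, s=s: Q[(s, a)]) for s in STATES}
```

After training, the greedy policy points toward the sustainable band from both sides of the grid, which is the qualitative behaviour the abstract describes for its reinforcement-learning agent.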
【18】PIBNet: a Physics-Inspired Boundary Network for Multiple Scattering Simulations
标题:PIBNet:一个受物理启发的边界网络,用于多重散射模拟
链接:https://arxiv.org/abs/2512.02049
作者:Rémi Marsal,Stéphanie Chaillat
摘要:边界元法(BEM)为求解无界均匀区域中的多次散射问题提供了一个有效的数值框架,因为它将离散化缩减到区域边界,从而降低了计算复杂度。该过程首先通过求解边界积分方程来确定区域边界上的解迹,之后可以利用边界积分表示以较低的计算成本恢复体积解。由于边界元法的第一步是主要的计算瓶颈,我们引入了PIBNet,一种旨在近似解迹的基于学习的方法。该方法利用受物理启发的基于图的策略来有效地建模障碍物及其长程相互作用。然后,我们介绍了一种新的多尺度图神经网络架构来模拟多次散射。为了训练和评估我们的网络,我们提出了一个由不同类型多次散射问题的多个数据集组成的基准。结果表明,我们的方法不仅在所考虑的任务上超越了现有的最先进的基于学习的方法,而且对障碍物数量增加的设置表现出优越的泛化能力。github.com/ENSTA-U2IS-AI/pibnet
摘要:The boundary element method (BEM) provides an efficient numerical framework for solving multiple scattering problems in unbounded homogeneous domains, since it reduces the discretization to the domain boundaries, thereby condensing the computational complexity. The procedure first consists in determining the solution trace on the boundaries of the domain by solving a boundary integral equation, after which the volumetric solution can be recovered at low computational cost with a boundary integral representation. As the first step of the BEM represents the main computational bottleneck, we introduce PIBNet, a learning-based approach designed to approximate the solution trace. The method leverages a physics-inspired graph-based strategy to model obstacles and their long-range interactions efficiently. Then, we introduce a novel multiscale graph neural network architecture for simulating the multiple scattering. To train and evaluate our network, we present a benchmark consisting of several datasets of different types of multiple scattering problems. The results indicate that our approach not only surpasses existing state-of-the-art learning-based methods on the considered tasks but also exhibits superior generalization to settings with an increased number of obstacles. github.com/ENSTA-U2IS-AI/pibnet
【19】Pharmacophore-based design by learning on voxel grids
标题:通过体素网格学习进行基于药效团的设计
链接:https://arxiv.org/abs/2512.02031
作者:Omar Mahmood,Pedro O. Pinheiro,Richard Bonneau,Saeed Saremi,Vishnu Sresht
摘要:基于配体的药物发现(LBDD)依赖于利用已知的蛋白质靶点结合剂来发现结构上不同、但同样可能结合的分子。该过程通常涉及使用某种分子相似性度量,将已知结合剂(查询)与分子文库进行暴力搜索。一种流行的方法将已知结合剂的药效团形状轮廓覆盖到为每个文库分子枚举的3D构象上,计算重叠,并挑选一组具有高重叠的多样化文库分子。虽然这种虚拟筛选工作流程在命中多样化、支架跳跃和专利规避方面取得了相当大的成功,但它随文库规模扩展性很差,并将候选物的生成限制在现有的文库化合物上。利用基于体素的生成建模的最新进展,我们提出了一个基于药效团的生成模型和工作流程,解决传统基于药效团的虚拟筛选的扩展性和产出问题。我们介绍VoxCap,一种从体素化的分子表示生成SMILES字符串的体素字幕方法。我们提出了两个工作流程,既作为实际用例,也作为基于药效团生成的基准:从头(de-novo)设计,其目标是生成与查询分子具有高药效团形状相似性的新分子;以及快速搜索,其目的是将生成式设计与廉价的2D子结构相似性搜索相结合,以进行高效的命中识别。我们的结果表明,VoxCap在生成多样化的从头命中方面显著优于以前的方法。当与我们的快速搜索工作流程相结合时,VoxCap将计算时间减少了几个数量级,同时为所有查询分子返回命中,从而能够搜索那些暴力搜索难以处理的大型文库。
摘要:Ligand-based drug discovery (LBDD) relies on making use of known binders to a protein target to find structurally diverse molecules similarly likely to bind. This process typically involves a brute force search of the known binder (query) against a molecular library using some metric of molecular similarity. One popular approach overlays the pharmacophore-shape profile of the known binder to 3D conformations enumerated for each of the library molecules, computes overlaps, and picks a set of diverse library molecules with high overlaps. While this virtual screening workflow has had considerable success in hit diversification, scaffold hopping, and patent busting, it scales poorly with library sizes and restricts candidate generation to existing library compounds. Leveraging recent advances in voxel-based generative modelling, we propose a pharmacophore-based generative model and workflows that address the scaling and fecundity issues of conventional pharmacophore-based virtual screening. We introduce \emph{VoxCap}, a voxel captioning method for generating SMILES strings from voxelised molecular representations. We propose two workflows as practical use cases as well as benchmarks for pharmacophore-based generation: \emph{de-novo} design, in which we aim to generate new molecules with high pharmacophore-shape similarities to query molecules, and fast search, which aims to combine generative design with a cheap 2D substructure similarity search for efficient hit identification. Our results show that VoxCap significantly outperforms previous methods in generating diverse \textit{de-novo} hits. When combined with our fast search workflow, VoxCap reduces computational time by orders of magnitude while returning hits for all query molecules, enabling the search of large libraries that are intractable to search by brute force.
【20】Revisiting Theory of Contrastive Learning for Domain Generalization
标题:重新审视领域概括的对比学习理论
链接:https://arxiv.org/abs/2512.02831
作者:Ali Alvandi,Mina Rezaei
备注:19 pages
摘要:对比学习是自监督表示学习中最流行和最强大的方法之一,其目标是将语义相似的样本映射到一起,同时在潜在空间中分离不同的样本。现有的理论方法假设下游任务类是从预训练阶段使用的相同潜在类分布中提取的。然而,在现实世界中,下游任务不仅可能在同一标签空间内表现出分布变化,而且还会引入新的或更广泛的标签空间,从而导致领域泛化的挑战。在这项工作中,我们引入了新的泛化界限,明确考虑了两种类型的不匹配:域偏移和域泛化。具体来说,我们分析了这样的场景:下游任务(i)从相同的潜在类空间中提取类,但具有偏移的分布,或者(ii)涉及超出预训练期间所见的新标签空间。我们的分析揭示了对比学习表示的性能如何取决于预训练和下游分布之间的统计差异。这种扩展的视角使我们能够在涉及预训练潜在类集之外的类分布的平均分类任务上,获得学习表示性能的可证明保证。
摘要:Contrastive learning is among the most popular and powerful approaches for self-supervised representation learning, where the goal is to map semantically similar samples close together while separating dissimilar ones in the latent space. Existing theoretical methods assume that downstream task classes are drawn from the same latent class distribution used during the pretraining phase. However, in real-world settings, downstream tasks may not only exhibit distributional shifts within the same label space but also introduce new or broader label spaces, leading to domain generalization challenges. In this work, we introduce novel generalization bounds that explicitly account for both types of mismatch: domain shift and domain generalization. Specifically, we analyze scenarios where downstream tasks either (i) draw classes from the same latent class space but with shifted distributions, or (ii) involve new label spaces beyond those seen during pretraining. Our analysis reveals how the performance of contrastively learned representations depends on the statistical discrepancy between pretraining and downstream distributions. This extended perspective allows us to derive provable guarantees on the performance of learned representations on average classification tasks involving class distributions outside the pretraining latent class set.
【21】Generative modeling using evolved quantum Boltzmann machines
标题:使用进化的量子Boltzmann机进行生成建模
链接:https://arxiv.org/abs/2512.02721
作者:Mark M. Wilde
备注:30 pages, 2 figures
摘要:Born规则生成建模是量子机器学习的核心任务,旨在学习可以通过测量复杂量子态来有效采样的概率分布。一个希望是量子模型能够有效地捕获仅靠经典方法难以学习和模拟的概率分布。量子玻尔兹曼机在大约十年前即为此目的被提出,但有效的训练方法一直难以捉摸。在本文中,我通过提出一个实用的解决方案来克服这个障碍,该方案训练量子玻尔兹曼机进行Born规则生成建模。该提议中的两个关键成分是经典相对熵的Donsker-Varadhan变分表示,以及[Patel等人,arXiv:2410.12935]的量子玻尔兹曼梯度估计器。我针对一个更一般的拟设给出了主要结果,即进化量子玻尔兹曼机[Minervini等人,arXiv:2501.03367],它结合了参数化的实时和虚时演化。我还展示了如何将研究结果扩展到相对熵之外的其他可区分性度量。最后,我提出了四种不同的混合量子-经典算法来求解训练背后的极大极小优化,并讨论了它们的理论收敛保证。
摘要:Born-rule generative modeling, a central task in quantum machine learning, seeks to learn probability distributions that can be efficiently sampled by measuring complex quantum states. One hope is for quantum models to efficiently capture probability distributions that are difficult to learn and simulate by classical means alone. Quantum Boltzmann machines were proposed about one decade ago for this purpose, yet efficient training methods have remained elusive. In this paper, I overcome this obstacle by proposing a practical solution that trains quantum Boltzmann machines for Born-rule generative modeling. Two key ingredients in the proposal are the Donsker-Varadhan variational representation of the classical relative entropy and the quantum Boltzmann gradient estimator of [Patel et al., arXiv:2410.12935]. I present the main result for a more general ansatz known as an evolved quantum Boltzmann machine [Minervini et al., arXiv:2501.03367], which combines parameterized real- and imaginary-time evolution. I also show how to extend the findings to other distinguishability measures beyond relative entropy. Finally, I present four different hybrid quantum-classical algorithms for the minimax optimization underlying training, and I discuss their theoretical convergence guarantees.
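The classical Donsker-Varadhan ingredient is easy to check numerically. The sketch below involves no quantum machinery: for a pair of small discrete distributions it verifies that the variational bound KL(P||Q) = sup_T E_P[T] - log E_Q[exp(T)] is attained at the optimal witness T* = log(p/q) and only lower-bounds the relative entropy for any other witness.

```python
import math

# Classical Donsker-Varadhan representation of relative entropy:
#   KL(P || Q) = sup_T  E_P[T] - log E_Q[exp(T)],  attained at T* = log(p/q).
P = [0.7, 0.2, 0.1]
Q = [0.4, 0.4, 0.2]

kl = sum(p * math.log(p / q) for p, q in zip(P, Q))

def dv_bound(T):
    """Lower bound on KL(P||Q) induced by a test function T (list of values)."""
    return (sum(p * t for p, t in zip(P, T))
            - math.log(sum(q * math.exp(t) for q, t in zip(Q, T))))

T_star = [math.log(p / q) for p, q in zip(P, Q)]   # optimal witness
T_flat = [0.0, 0.0, 0.0]                           # arbitrary suboptimal witness
```

In the paper's setting, the witness is not enumerated but parameterized and optimized, with expectations estimated from quantum measurements; this toy only illustrates the classical identity.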
【22】Bayesian Physics-Informed Neural Networks for Inverse Problems (BPINN-IP): Application in Infrared Image Processing
标题:用于逆问题的贝叶斯物理信息神经网络(BPINN-IP):在红外图像处理中的应用
链接:https://arxiv.org/abs/2512.02495
作者:Ali Mohammad-Djafari,Ning Chu,Li Wang
备注:31 page, paper in revision, submitted in Journal of the Franklin Institute, 2025
摘要:逆问题出现在科学和工程领域,其目标是从间接和有噪声的观测中推断隐藏的参数或物理场。经典的方法,如变分正则化和贝叶斯推理,为处理不适定性提供了良好的理论基础。然而,这些方法在高维环境中或当正演模型受复杂物理支配时,通常在计算上受限。物理信息神经网络(PINN)最近已经成为一个有前景的框架,通过将物理定律直接嵌入到神经网络的训练过程中来求解逆问题。在本文中,我们介绍了贝叶斯物理信息神经网络(BPINN)框架的一个新视角,通过贝叶斯先验建模明确纳入训练数据生成、建模和测量的不确定性,并利用后验分布进行推理,从而扩展了经典PINN。此外,由于我们专注于逆问题,我们称这种方法为BPINN-IP,并且我们表明,标准PINN公式自然地作为其对应于最大后验(MAP)估计的特例出现。这种统一的公式允许同时利用物理约束、先验知识和数据驱动的推理,同时通过后验分布实现不确定性量化。为了证明所提出框架的有效性,我们考虑红外图像处理中出现的逆问题,包括反卷积和超分辨率,并在模拟和真实工业数据上给出结果。
摘要:Inverse problems arise across scientific and engineering domains, where the goal is to infer hidden parameters or physical fields from indirect and noisy observations. Classical approaches, such as variational regularization and Bayesian inference, provide well established theoretical foundations for handling ill posedness. However, these methods often become computationally restrictive in high dimensional settings or when the forward model is governed by complex physics. Physics Informed Neural Networks (PINNs) have recently emerged as a promising framework for solving inverse problems by embedding physical laws directly into the training process of neural networks. In this paper, we introduce a new perspective on the Bayesian Physics Informed Neural Network (BPINN) framework, extending classical PINNs by explicitly incorporating training data generation, modeling and measurement uncertainties through Bayesian prior modeling and doing inference with the posterior laws. Also, as we focus on the inverse problems, we call this method BPINN-IP, and we show that the standard PINN formulation naturally appears as its special case corresponding to the Maximum A Posteriori (MAP) estimate. This unified formulation allows simultaneous exploitation of physical constraints, prior knowledge, and data-driven inference, while enabling uncertainty quantification through posterior distributions. To demonstrate the effectiveness of the proposed framework, we consider inverse problems arising in infrared image processing, including deconvolution and super-resolution, and present results on both simulated and real industrial data.
【23】Quantum Machine Learning for Secondary Frequency Control
标题:二次频率控制的量子机器学习
链接:https://arxiv.org/abs/2512.02065
作者:Younes Ghazagh Jahed,Alireza Khatiri
备注:8 pages, 6 figures
摘要:电力系统中的频率控制对维持稳定和防止停电至关重要。元启发式算法和机器学习等传统方法在实时适用性和可扩展性方面面临限制。本文提出了一种利用纯变分量子电路(VQC)实现柴油发电机二次频率实时控制的新方法。与混合经典-量子模型不同,所提出的VQC在执行期间独立运行,消除了经典-量子数据交换带来的延迟。VQC通过监督学习进行训练,使用预先计算的查找表将历史频率偏差映射到最佳比例积分(PI)控制器参数。仿真结果表明,在足够的量子测量次数(shots)下,VQC实现了高预测精度(90%以上),并在不同的测试事件中具有良好的泛化能力。量子优化的PI参数显著改善瞬态响应,减少频率波动和稳定时间。
摘要:Frequency control in power systems is critical to maintaining stability and preventing blackouts. Traditional methods like meta-heuristic algorithms and machine learning face limitations in real-time applicability and scalability. This paper introduces a novel approach using a pure variational quantum circuit (VQC) for real-time secondary frequency control in diesel generators. Unlike hybrid classical-quantum models, the proposed VQC operates independently during execution, eliminating latency from classical-quantum data exchange. The VQC is trained via supervised learning to map historical frequency deviations to optimal Proportional-Integral (PI) controller parameters using a pre-computed lookup table. Simulations demonstrate that the VQC achieves high prediction accuracy (over 90%) with sufficient quantum measurement shots and generalizes well across diverse test events. The quantum-optimized PI parameters significantly improve transient response, reducing frequency fluctuations and settling time.
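The lookup-table control scheme can be illustrated without any quantum circuitry. In the sketch below the table entries, the PI gains, and the first-order plant are all invented for illustration; in the paper a trained VQC would play the role of the `gains` lookup.

```python
# Hypothetical lookup table: (max |deviation| in Hz, Kp, Ki).
# In the paper, a trained VQC predicts these gains instead.
LOOKUP = [
    (0.1, 0.5, 0.05),
    (0.5, 1.0, 0.10),
    (9.9, 2.0, 0.30),
]

def gains(dev):
    """Pick PI gains from the first band containing the deviation."""
    for bound, kp, ki in LOOKUP:
        if abs(dev) <= bound:
            return kp, ki
    return LOOKUP[-1][1:]

def simulate(dev0, steps=200, dt=0.1):
    """Discrete PI loop on a toy first-order plant (assumed dynamics)."""
    dev, integral = dev0, 0.0
    kp, ki = gains(dev0)                 # gains chosen once per event
    for _ in range(steps):
        integral += dev * dt
        u = kp * dev + ki * integral     # PI control action
        dev -= 0.5 * u * dt              # toy plant response
    return dev

final = simulate(0.4)                    # deviation driven toward zero
```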
【24】Integration of LSTM Networks in Random Forest Algorithms for Stock Market Trading Predictions
标题:将LSTM网络集成到随机森林算法中用于股市交易预测
链接:https://arxiv.org/abs/2512.02036
作者:Juan C. King,Jose M. Amigo
备注:24 pages, 7 Figures, 2 Tables
摘要:本文的目的是分析和选择股票交易系统,这些系统将不同的模型与金融和微观经济信息等不同性质的数据相结合。具体来说,基于作者以前的工作,并应用机器学习和深度学习的先进技术,我们的目标是为股票市场制定具有经实证检验的统计优势的交易算法,从而改进文献中已发表的结果。我们的方法将长短期记忆(LSTM)网络与基于决策树的算法(如随机森林和梯度提升)相结合。前者分析金融资产的价格模式,后者则以公司的经济数据为输入。使用国际公司数据和10个工作日预测的算法交易数值模拟证实,同时基于基本面和技术变量的方法可以优于不结合这两类变量的常规方法。其中,随机森林是决策树类算法中表现最好的。我们还讨论了如何通过选择技术变量来提升这种混合方法的预测性能。
摘要:The aim of this paper is the analysis and selection of stock trading systems that combine different models with data of different nature, such as financial and microeconomic information. Specifically, based on previous work by the authors and applying advanced techniques of Machine Learning and Deep Learning, our objective is to formulate trading algorithms for the stock market with empirically tested statistical advantages, thus improving results published in the literature. Our approach integrates Long Short-Term Memory (LSTM) networks with algorithms based on decision trees, such as Random Forest and Gradient Boosting. While the former analyze price patterns of financial assets, the latter are fed with economic data of companies. Numerical simulations of algorithmic trading with data from international companies and 10-weekday predictions confirm that an approach based on both fundamental and technical variables can outperform the usual approaches, which do not combine those two types of variables. In doing so, Random Forest turned out to be the best performer among the decision trees. We also discuss how the prediction performance of such a hybrid approach can be boosted by selecting the technical variables.
【25】On the Difficulty of Token-Level Modeling of Dysfluency and Fluency Shaping Artifacts
标题:关于不流畅和流畅性塑造伪影的令牌级建模的困难
链接:https://arxiv.org/abs/2512.02027
作者:Kashaf Gulzar,Dominik Wagner,Sebastian P. Bayerl,Florian Hönig,Tobias Bocklet,Korbinian Riedhammer
备注:6 pages, 1 figure. Accepted to ASRU 2025. This is the arXiv preprint of the accepted paper
摘要:即使对于现代端到端(E2E)自动语音识别(ASR)框架,口吃语音的自动转录仍然是一个挑战。不流畅和流畅性塑造伪影往往被忽视,导致非逐字的转录,临床和研究价值有限。我们提出了一种参数高效的适应方法,将不流畅和流畅性修饰解码为转录中的特殊标记,并在模拟(LibriStutter,英语)和自然(KSoF,德语)口吃语音数据集上进行评估。为了减轻ASR性能差异和对英语的偏向,我们引入了带有语言自适应预训练的多步微调策略。标记化分析进一步突出了分词器以英语为中心的偏向,这对提高德语数据上的性能构成挑战。我们的研究结果证明了轻量级适应技术对不流畅感知ASR的有效性,同时揭示了多语言E2E系统的关键局限。
摘要:Automatic transcription of stuttered speech remains a challenge, even for modern end-to-end (E2E) automatic speech recognition (ASR) frameworks. Dysfluencies and fluency-shaping artifacts are often overlooked, resulting in non-verbatim transcriptions with limited clinical and research value. We propose a parameter-efficient adaptation method to decode dysfluencies and fluency modifications as special tokens within transcriptions, evaluated on simulated (LibriStutter, English) and natural (KSoF, German) stuttered speech datasets. To mitigate ASR performance disparities and bias towards English, we introduce a multi-step fine-tuning strategy with language-adaptive pretraining. Tokenization analysis further highlights the tokenizer's English-centric bias, which poses challenges for improving performance on German data. Our findings demonstrate the effectiveness of lightweight adaptation techniques for dysfluency-aware ASR while exposing key limitations in multilingual E2E systems.
【26】DySTAN: Joint Modeling of Sedentary Activity and Social Context from Smartphone Sensors
标题:DySTAN:智能手机传感器的静坐活动和社交背景联合建模
链接:https://arxiv.org/abs/2512.02025
作者:Aditya Sneh,Nilesh Kumar Sahu,Snehil Gupta,Haroon R. Lone
摘要:从智能手机传感器数据中准确识别人类环境仍然是一个重大挑战,特别是在久坐不动的环境中,学习、听课、放松和进食等活动表现出高度相似的惯性模式。此外,社会背景在理解用户行为方面起着至关重要的作用,但在移动传感研究中往往被忽视。为了解决这些差距,我们引入了LogMe,这是一款移动传感应用程序,它被动地收集智能手机传感器数据(加速度计,陀螺仪,磁力计和旋转矢量),并提示用户每小时进行一次自我报告,以捕获久坐活动和社交环境。使用这个双标签数据集,我们提出了DySTAN(Dynamic Cross-Stitch with Task Attention Network),这是一个多任务学习框架,可以从共享的传感器输入中联合分类两个上下文维度。它将特定任务层与跨任务注意力集成在一起,以有效地模拟细微的区别。与单任务CNN-BiLSTM-GRU(CBG)模型相比,DySTAN将久坐活动宏F1评分提高了21.8%,与最强的多任务基线Sluice Network(SN)相比,提高了8.2%。这些结果表明,建模多个,共同出现的上下文维度,以提高移动上下文识别的准确性和鲁棒性的重要性。
摘要:Accurately recognizing human context from smartphone sensor data remains a significant challenge, especially in sedentary settings where activities such as studying, attending lectures, relaxing, and eating exhibit highly similar inertial patterns. Furthermore, social context plays a critical role in understanding user behavior, yet is often overlooked in mobile sensing research. To address these gaps, we introduce LogMe, a mobile sensing application that passively collects smartphone sensor data (accelerometer, gyroscope, magnetometer, and rotation vector) and prompts users for hourly self-reports capturing both sedentary activity and social context. Using this dual-label dataset, we propose DySTAN (Dynamic Cross-Stitch with Task Attention Network), a multi-task learning framework that jointly classifies both context dimensions from shared sensor inputs. It integrates task-specific layers with cross-task attention to model subtle distinctions effectively. DySTAN improves sedentary activity macro F1 scores by 21.8% over a single-task CNN-BiLSTM-GRU (CBG) model and by 8.2% over the strongest multi-task baseline, Sluice Network (SN). These results demonstrate the importance of modeling multiple, co-occurring context dimensions to improve the accuracy and robustness of mobile context recognition.
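For background, the cross-stitch family of sharing mechanisms referenced in DySTAN's name (and generalized by the Sluice baseline) linearly mixes the two tasks' features. A minimal sketch follows, with mixing coefficients assumed fixed rather than learned as in real cross-stitch units:

```python
# A cross-stitch unit linearly mixes per-task features. alpha is a
# learnable matrix in real models; the values here are illustrative.
alpha = [[0.9, 0.1],
         [0.1, 0.9]]

def cross_stitch(feat_a, feat_b):
    """Mix two tasks' feature vectors elementwise via alpha."""
    mixed_a = [alpha[0][0] * a + alpha[0][1] * b for a, b in zip(feat_a, feat_b)]
    mixed_b = [alpha[1][0] * a + alpha[1][1] * b for a, b in zip(feat_a, feat_b)]
    return mixed_a, mixed_b

act_a, act_b = cross_stitch([1.0, 0.0], [0.0, 1.0])
```

Each task keeps most of its own signal (the diagonal) while leaking a fraction of the other task's features; DySTAN replaces this static mixing with dynamic cross-task attention.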
其他(20篇)
【1】Fast Gaussian Process Approximations for Autocorrelated Data
标题:自相关数据的快速高斯过程逼近
链接:https://arxiv.org/abs/2512.02925
作者:Ahmadreza Chokhachian,Matthias Katzfuss,Yu Ding
备注:Accepted by the INFORMS Journal on Data Science
摘要:本文研究了如何加速在自相关数据上训练的高斯过程模型的计算问题。高斯过程模型是非线性回归应用中常用的强大工具。标准回归建模假设随机样本和独立同分布的噪声。各种加速高斯过程回归的快速近似都在此标准设置下工作。但对于自相关数据,未能考虑自相关会导致一种称为时间过拟合的现象,从而降低模型在新测试实例上的性能。为了处理自相关数据,必须修改现有的快速高斯过程近似;一种这样的方法是将原本相关的数据点分割成块,使得分块后的数据被去相关。这项工作解释了如何使一些现有的高斯过程近似适用于分块数据。在不同应用数据集上的数值实验表明,所提出的方法可以显著加速自相关数据上高斯过程回归的计算,而不影响模型预测性能。
摘要:This paper is concerned with the problem of how to speed up computation for Gaussian process models trained on autocorrelated data. The Gaussian process model is a powerful tool commonly used in nonlinear regression applications. Standard regression modeling assumes random samples and an independently, identically distributed noise. Various fast approximations that speed up Gaussian process regression work under this standard setting. But for autocorrelated data, failing to account for autocorrelation leads to a phenomenon known as temporal overfitting that deteriorates model performance on new test instances. To handle autocorrelated data, existing fast Gaussian process approximations have to be modified; one such approach is to segment the originally correlated data points into blocks in which the blocked data are de-correlated. This work explains how to make some of the existing Gaussian process approximations work with blocked data. Numerical experiments across diverse application datasets demonstrate that the proposed approaches can remarkably accelerate computation for Gaussian process regression on autocorrelated data without compromising model prediction performance.
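The blocking idea can be illustrated on a synthetic AR(1) series; the series, the block length, and all constants below are illustrative assumptions rather than the paper's procedure. Neighbouring raw points are strongly correlated, while means of long contiguous blocks are close to de-correlated:

```python
import math
import random

random.seed(1)
phi, n = 0.95, 20000
x = [0.0]
for _ in range(n - 1):                  # AR(1): x_t = phi*x_{t-1} + noise
    x.append(phi * x[-1] + random.gauss(0, 1))

def corr(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = math.sqrt(sum((u - ma) ** 2 for u in a))
    vb = math.sqrt(sum((v - mb) ** 2 for v in b))
    return cov / (va * vb)

raw_lag1 = corr(x[:-1], x[1:])          # consecutive points: strongly correlated

B = 500                                 # block length >> correlation length
means = [sum(x[i:i + B]) / B for i in range(0, n, B)]
block_lag1 = corr(means[:-1], means[1:])  # adjacent block means: much weaker
```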
【2】Assessing the performance of correlation-based multi-fidelity neural emulators
标题:评估基于相关性的多保真神经模拟器的性能
链接:https://arxiv.org/abs/2512.02868
作者:Cristian J. Villatoro,Gianluca Geraci,Daniele E. Schiavazzi
摘要:当底层的高保真模型计算昂贵时,优化、不确定性量化或推理之类的外循环任务很容易变得难以处理。类似地,数据驱动的架构通常需要大型数据集才能以足够的准确度执行预测任务。缓解这些挑战的一种可能方法是开发多保真度仿真器,利用可能有偏但廉价的低保真度信息,同时使用稀缺而准确的高保真数据校正和改进预测。本研究考察了多保真度神经仿真器的性能,这类神经网络旨在通过将有限的高保真度数据与丰富的低保真度模型解相结合来学习输入到输出的映射。我们考察了这类仿真器在低维和高维函数、具有振荡特性的函数、存在不连续性的情形、参数化相同和不同的模型集合,以及可能大量、潜在受损的低保真源下的性能。在此过程中,我们考虑了大量的架构、超参数和数据集配置,包括具有不同谱偏差的网络(多层感知器、Siren和Kolmogorov-Arnold网络)、各种坐标编码机制、精确或可学习的低保真度信息,以及不同的训练数据集大小。我们通过对每种情况进行等效的单保真度测试,进一步分析了多保真度方法的附加价值,量化了融合多个信息源所获得的性能增益。
摘要:Outer loop tasks such as optimization, uncertainty quantification or inference can easily become intractable when the underlying high-fidelity model is computationally expensive. Similarly, data-driven architectures typically require large datasets to perform predictive tasks with sufficient accuracy. A possible approach to mitigate these challenges is the development of multi-fidelity emulators, leveraging potentially biased, inexpensive low-fidelity information while correcting and refining predictions using scarce, accurate high-fidelity data. This study investigates the performance of multi-fidelity neural emulators, neural networks designed to learn the input-to-output mapping by integrating limited high-fidelity data with abundant low-fidelity model solutions. We investigate the performance of such emulators for low and high-dimensional functions, with oscillatory character, in the presence of discontinuities, for collections of models with equal and dissimilar parametrization, and for a possibly large number of potentially corrupted low-fidelity sources. In doing so, we consider a large number of architectural, hyperparameter, and dataset configurations including networks with a different amount of spectral bias (Multi-Layered Perceptron, Siren and Kolmogorov Arnold Network), various mechanisms for coordinate encoding, exact or learnable low-fidelity information, and for varying training dataset size. We further analyze the added value of the multi-fidelity approach by conducting equivalent single-fidelity tests for each case, quantifying the performance gains achieved through fusing multiple sources of information.
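The core multi-fidelity mechanism, correcting a cheap biased model with a discrepancy term fitted on a handful of high-fidelity samples, can be sketched with a linear correction. The functions and sample points below are invented toys, far simpler than the neural emulators studied:

```python
import math

f_hi = lambda x: math.sin(2 * x)             # expensive "truth" (assumed)
f_lo = lambda x: math.sin(2 * x) + 0.3 * x   # cheap, systematically biased

# Few high-fidelity samples; fit delta(x) = a*x + b to the discrepancy
# by ordinary least squares (closed form for one feature).
xs = [0.0, 0.5, 1.0, 1.5]
d = [f_hi(x) - f_lo(x) for x in xs]
n = len(xs)
sx, sd = sum(xs), sum(d)
sxx = sum(x * x for x in xs)
sxd = sum(x * di for x, di in zip(xs, d))
a = (n * sxd - sx * sd) / (n * sxx - sx * sx)
b = (sd - a * sx) / n

surrogate = lambda x: f_lo(x) + a * x + b    # corrected multi-fidelity emulator

test_pts = [0.3, 0.8, 1.2]
err_lo = max(abs(f_lo(x) - f_hi(x)) for x in test_pts)
err_mf = max(abs(surrogate(x) - f_hi(x)) for x in test_pts)
```

Here the bias is exactly linear, so four high-fidelity evaluations remove it entirely; the paper's networks learn far richer, nonlinear discrepancies.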
【3】Are Detectors Fair to Indian IP-AIGC? A Cross-Generator Study
标题:检测器对印度IP-AIGC公平吗?跨生成器研究
链接:https://arxiv.org/abs/2512.02850
作者:Vishal Dubey,Pallavi Tyagi
摘要:现代图像编辑器可以生成身份保留的AIGC(IP-AIGC),其中同一个人以新的服装、背景或光照出现。目前检测器在这一场景中的鲁棒性和公平性仍不清楚,特别是对于代表性不足的人群。我们提出了据我们所知第一个针对印度和南亚面孔的IP-AIGC检测的系统研究,量化了跨生成器泛化和人群内部的性能。我们从FairFD和HAV-DF中组装以印度为中心的训练划分,并使用具有身份保留提示的商业Web UI生成器(Gemini和ChatGPT)构建两个保留的IP-AIGC测试集(HIDF-img-ip-genai和HIDF-vid-ip-genai)。我们在预训练(PT)和微调(FT)方案下评估了两种最先进的检测器(AIDE和Effort),并报告AUC、AP、EER和准确率。微调产生强大的域内增益(例如,在HAV-DF测试集上Effort的AUC从0.739升至0.944;AIDE的EER从0.484降至0.259),但在印度人群的保留IP-AIGC测试集上性能持续下降(例如,在HIDF-img-ip-genai上AIDE的AUC从0.923降至0.563;Effort从0.740降至0.533),这表明模型过拟合了训练生成器的线索。在非IP的HIDF图像上,PT性能仍然很高,这表明脆弱性特定于身份保留编辑,而非一般的分布偏移。我们的研究确立了IP-AIGC-Indian作为一个具有挑战性且实际相关的场景,并推动保持表示的自适应方法和面向印度的基准构建,以缩小AIGC检测中的泛化差距。
摘要:Modern image editors can produce identity-preserving AIGC (IP-AIGC), where the same person appears with new attire, background, or lighting. The robustness and fairness of current detectors in this regime remain unclear, especially for under-represented populations. We present what we believe is the first systematic study of IP-AIGC detection for Indian and South-Asian faces, quantifying cross-generator generalization and intra-population performance. We assemble Indian-focused training splits from FairFD and HAV-DF, and construct two held-out IP-AIGC test sets (HIDF-img-ip-genai and HIDF-vid-ip-genai) using commercial web-UI generators (Gemini and ChatGPT) with identity-preserving prompts. We evaluate two state-of-the-art detectors (AIDE and Effort) under pretrained (PT) and fine-tuned (FT) regimes and report AUC, AP, EER, and accuracy. Fine-tuning yields strong in-domain gains (for example, Effort AUC 0.739 to 0.944 on HAV-DF-test; AIDE EER 0.484 to 0.259), but consistently degrades performance on held-out IP-AIGC for Indian cohorts (for example, AIDE AUC 0.923 to 0.563 on HIDF-img-ip-genai; Effort 0.740 to 0.533), which indicates overfitting to training-generator cues. On non-IP HIDF images, PT performance remains high, which suggests a specific brittleness to identity-preserving edits rather than a generic distribution shift. Our study establishes IP-AIGC-Indian as a challenging and practically relevant scenario and motivates representation-preserving adaptation and India-aware benchmark curation to close generalization gaps in AIGC detection.
【4】Self-Improving AI Agents through Self-Play
标题:通过自我游戏自我改进人工智能代理
链接:https://arxiv.org/abs/2512.02731
作者:Przemyslaw Chojecki
摘要:我们将心理测量测验组(batteries)的模空间理论框架扩展到动力系统领域。虽然以前的工作将AAI能力得分确立为代理表示空间上的一个静态泛函,但本文将代理形式化为由计算资源$r$参数化的流$\nu_r$,受一个递归的生成器-验证器-更新器(GVU)算子支配。我们证明了该算子在参数流形$\Theta$上生成一个向量场,并将自我改进系数$\kappa$确定为能力泛函沿该流的李导数。这项工作的核心贡献是方差不等式的推导,这是一个谱条件,在温和的正则性条件下足以保证自我改进的稳定性。我们证明了$\kappa > 0$的一个充分条件是:在不计曲率和步长效应的情况下,生成和验证的组合噪声必须足够小。然后,我们应用这种形式化来统一关于语言自我博弈(LSP)、自我纠正和合成数据自举的最新文献。我们证明了STaR、SPIN、Reflexion、GAN和AlphaZero等架构是GVU算子的特定拓扑实现,它们通过过滤、对抗性判别或在形式系统中接地来满足方差不等式。
摘要:We extend the moduli-theoretic framework of psychometric batteries to the domain of dynamical systems. While previous work established the AAI capability score as a static functional on the space of agent representations, this paper formalizes the agent as a flow $\nu_r$ parameterized by computational resource $r$, governed by a recursive Generator-Verifier-Updater (GVU) operator. We prove that this operator generates a vector field on the parameter manifold $\Theta$, and we identify the coefficient of self-improvement $\kappa$ as the Lie derivative of the capability functional along this flow. The central contribution of this work is the derivation of the Variance Inequality, a spectral condition that is sufficient (under mild regularity) for the stability of self-improvement. We show that a sufficient condition for $\kappa > 0$ is that, up to curvature and step-size effects, the combined noise of generation and verification must be small enough. We then apply this formalism to unify the recent literature on Language Self-Play (LSP), Self-Correction, and Synthetic Data bootstrapping. We demonstrate that architectures such as STaR, SPIN, Reflexion, GANs and AlphaZero are specific topological realizations of the GVU operator that satisfy the Variance Inequality through filtration, adversarial discrimination, or grounding in formal systems.
【5】Conformal Correction for Efficiency May be at Odds with Entropy
标题:效率的共形校正可能与熵相冲突
链接:https://arxiv.org/abs/2512.02704
作者:Senrong Xu,Tianyu Wang,Zenan Li,Yuan Yao,Taolue Chen,Feng Xu,Xiaoxing Ma
摘要:共形预测(CP)提供了一个全面的框架,为黑盒机器学习模型生成统计上严格的不确定性集。为了进一步提高CP的效率,有工作提出了共形校正,即使用共形感知的低效率损失来微调基础模型或为其包装一个额外模块。在这项工作中,我们从经验和理论上确定了CP效率和模型预测熵之间的权衡。然后,我们提出了一种熵约束的共形校正方法,探索效率和熵之间更好的帕累托最优。在计算机视觉和图数据集上的大量实验结果证明了该方法的有效性。例如,在给定熵阈值的情况下,它可以将最先进的CP方法的效率显著提高高达34.4%。
摘要:Conformal prediction (CP) provides a comprehensive framework to produce statistically rigorous uncertainty sets for black-box machine learning models. To further improve the efficiency of CP, conformal correction is proposed to fine-tune or wrap the base model with an extra module using a conformal-aware inefficiency loss. In this work, we empirically and theoretically identify a trade-off between the CP efficiency and the entropy of model prediction. We then propose an entropy-constrained conformal correction method, exploring a better Pareto optimum between efficiency and entropy. Extensive experimental results on both computer vision and graph datasets demonstrate the efficacy of the proposed method. For instance, it can significantly improve the efficiency of state-of-the-art CP methods by up to 34.4%, given an entropy threshold.
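For context, the base machinery being corrected is standard split conformal prediction. The sketch below uses synthetic data, a fixed point predictor, and alpha = 0.1 (all illustrative assumptions); the interval width it produces is the "efficiency" that conformal correction tries to shrink:

```python
import random

random.seed(2)

def predict(x):
    """Stand-in point model (assumed known here)."""
    return 2.0 * x

def sample():
    x = random.uniform(0, 1)
    return x, 2.0 * x + random.gauss(0, 0.1)   # true relation + noise

# Split conformal: residual quantile on a held-out calibration set.
cal = [sample() for _ in range(999)]
scores = sorted(abs(y - predict(x)) for x, y in cal)
k = int(0.9 * (len(scores) + 1)) - 1           # ceil((n+1)(1-alpha)) - 1
q = scores[k]                                  # conformal quantile, alpha = 0.1

# Prediction sets [predict(x) - q, predict(x) + q] cover ~90% of new points.
test = [sample() for _ in range(2000)]
cover = sum(abs(y - predict(x)) <= q for x, y in test) / len(test)
width = 2 * q                                  # the "efficiency" of the sets
```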
【6】VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm
标题:VLM-Pruner:高效VLM离心式令牌修剪范式中的空间稀疏性缓冲
链接:https://arxiv.org/abs/2512.02700
作者:Zhenkai Wu,Xiaowen Ma,Zhenliang Ni,Dengming Zhang,Han Shu,Xin Jiang,Xinghao Chen
摘要:视觉语言模型(VLM)在图像理解任务中表现出色,但大量的视觉标记会带来巨大的计算成本,阻碍了在移动设备上的部署。许多剪枝方法仅依赖于令牌重要性,因此忽略了令牌间的冗余,保留了大量重复的令牌并浪费了容量。虽然已经提出了一些冗余感知的方法,他们往往忽略了视觉令牌之间的空间关系。这可能导致保留的标记的选择过于稀疏,无法充分覆盖目标对象的区域。为了解决这些限制,我们提出了VLM-Pruner,一种无需训练的令牌修剪算法,可以显式地平衡冗余和空间稀疏性。我们引入了一个离心令牌修剪范式,使近到远的选择,同时优先保留细粒度的对象细节。此外,我们设计了一个缓冲空间稀疏(BSS)的标准,推迟选择空间上遥远的令牌。我们进一步采用并行贪婪策略进行令牌选择效率。为了减轻修剪造成的信息损失,我们选择性地将丢弃的标记中的突出信息融合到保留的标记中。综合比较表明,VLM-Pruner在五个VLM上始终优于强基线,修剪率为88.9%,同时提供端到端的推理加速。
摘要:Vision-language models (VLMs) excel at image understanding tasks, but the large number of visual tokens imposes significant computational costs, hindering deployment on mobile devices. Many pruning methods rely solely on token importance and thus overlook inter-token redundancy, retaining numerous duplicated tokens and wasting capacity. Although some redundancy-aware approaches have been proposed, they often ignore the spatial relationships among visual tokens. This can lead to overly sparse selections of retained tokens that fail to adequately cover the regions of target objects. To address these limitations, we propose VLM-Pruner, a training-free token pruning algorithm that explicitly balances redundancy and spatial sparsity. We introduce a centrifugal token pruning paradigm that enables near-to-far selection while prioritizing the preservation of fine-grained object details. Moreover, we design a Buffering for Spatial Sparsity (BSS) criterion that defers the selection of spatially distant tokens. We further adopt a parallel greedy strategy to conduct token selection efficiently. To mitigate information loss from pruning, we selectively fuse salient information from the discarded tokens into the retained ones. Comprehensive comparisons demonstrate that VLM-Pruner consistently outperforms strong baselines across five VLMs with an 88.9\% pruning rate, while delivering an end-to-end inference speedup.
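The importance-versus-redundancy trade-off can be sketched with a greedy rule: keep tokens in order of importance, but skip any token too similar to one already kept. The tokens, features, threshold, and budget below are invented for illustration; this is not VLM-Pruner's actual criterion:

```python
import math

# Hypothetical tokens: name -> (importance score, feature vector).
tokens = {
    "t0": (0.9, (1.0, 0.0)),
    "t1": (0.8, (0.99, 0.05)),   # near-duplicate of t0
    "t2": (0.5, (0.0, 1.0)),
    "t3": (0.2, (0.6, 0.6)),
}

def cos(u, v):
    """Cosine similarity between two 2-D feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

kept = []
for name, (imp, vec) in sorted(tokens.items(), key=lambda kv: -kv[1][0]):
    # Keep only if not redundant with everything already kept.
    if all(cos(vec, tokens[k][1]) < 0.9 for k in kept):
        kept.append(name)
    if len(kept) == 2:           # token budget
        break
```

An importance-only pruner would keep t0 and its near-duplicate t1; the redundancy check instead keeps the dissimilar t2, which is the kind of wasted capacity the abstract describes.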
【7】Generative Multi-modal Feedback for Singing Voice Synthesis Evaluation
标题:用于歌唱声音合成评估的生成多模式反馈
链接:https://arxiv.org/abs/2512.02523
作者:Xueyan Li,Yuxin Wang,Mengjie Jiang,Qingzi Zhu,Jiang Zhang,Zoey Kim,Yazhe Niu
备注:16 pages, 5 figures
摘要:歌唱声音合成(SVS)已经取得了显著的进步,使模型能够生成具有准确音高和一致风格的人声。随着这些能力的提高,对可靠评估和优化的需求变得越来越重要。然而,目前的方法,如奖励系统,往往依赖于单一的数字分数,难以捕捉措辞或表达力等多个维度,并且需要昂贵的注释,限制了可解释性和泛化能力。为了解决这些问题,我们提出了一种生成反馈(即奖励模型)框架,为SVS评估提供多维的语言和音频反馈。我们的方法利用音频语言模型来生成文本和音频评论,涵盖旋律、内容和听觉质量等方面。该模型在混合数据集上进行了微调,该数据集结合了人类音乐反应和来自MLLM的合成评论,增强了多样性和语言丰富性。定量实验验证了所提出的数据集和训练策略的有效性,表明该框架产生了音乐上准确且可解释的评价,适合指导生成模型的改进。代码在[https://github.com/opendilab/VocalCritic](https://github.com/opendilab/VocalCritic)
摘要:Singing voice synthesis (SVS) has advanced significantly, enabling models to generate vocals with accurate pitch and consistent style. As these capabilities improve, the need for reliable evaluation and optimization becomes increasingly critical. However, current methods like reward systems often rely on single numerical scores, struggle to capture various dimensions such as phrasing or expressiveness, and require costly annotations, limiting interpretability and generalization. To address these issues, we propose a generative feedback (i.e., reward model) framework that provides multi-dimensional language and audio feedback for SVS assessment. Our approach leverages an audio-language model to generate text and audio critiques, covering aspects such as melody, content, and auditory quality. The model is fine-tuned on a hybrid dataset combining human music reactions and synthetic critiques from MLLMs, enhancing diversity and linguistic richness. Quantitative experiments validate the effectiveness of the proposed dataset and training strategy, demonstrating that the framework produces musically accurate and interpretable evaluations suitable for guiding generative model improvement. The code is at [https://github.com/opendilab/VocalCritic](https://github.com/opendilab/VocalCritic)
【8】Stress-Testing Causal Claims via Cardinality Repairs
标题:通过基数修复对因果声明进行压力测试
链接:https://arxiv.org/abs/2512.02491
作者:Yarden Gabbay,Haoquan Guan,Shaull Almagor,El Kindi Rezig,Brit Youngmann,Babak Salimi
摘要:从观察数据中得出的因果分析支撑着医疗保健、公共政策和经济学等领域的高风险决策。然而,这样的结论可能出人意料地脆弱:即使是微小的数据错误(重复记录或录入错误)也可能彻底改变因果关系。这就提出了一个基本问题:因果声明对数据中微小的、有针对性的修改有多鲁棒?解决这一问题对于确保实证发现的可靠性、可解释性和可重复性至关重要。我们介绍SubCure,一个通过基数修复进行鲁棒性审计的框架。给定一个因果查询和用户指定的估计效应目标范围,SubCure识别一小组元组或子群体,其移除会将估计值移动到所需范围内。这一过程不仅量化了因果结论的敏感性,还精确定位了驱动这些结论的数据的特定区域。我们在元组级和模式级删除设置下形式化了这个问题,并证明两者都是NP完全的。为了扩展到大型数据集,我们开发了高效算法,结合机器遗忘技术来增量更新因果估计,而无需从头重新训练。我们在涵盖不同应用领域的四个真实世界数据集上评估了SubCure。在每种情况下,它都能发现紧凑、高影响力的子集,其删除会显著改变因果结论,揭示出传统方法无法检测到的漏洞。我们的结果表明,基数修复是一个强大且通用的工具,可用于对因果分析进行压力测试,并防范植根于普通数据缺陷的误导性声明。
摘要:Causal analyses derived from observational data underpin high-stakes decisions in domains such as healthcare, public policy, and economics. Yet such conclusions can be surprisingly fragile: even minor data errors, such as duplicate records or entry mistakes, may drastically alter causal relationships. This raises a fundamental question: how robust is a causal claim to small, targeted modifications in the data? Addressing this question is essential for ensuring the reliability, interpretability, and reproducibility of empirical findings. We introduce SubCure, a framework for robustness auditing via cardinality repairs. Given a causal query and a user-specified target range for the estimated effect, SubCure identifies a small set of tuples or subpopulations whose removal shifts the estimate into the desired range. This process not only quantifies the sensitivity of causal conclusions but also pinpoints the specific regions of the data that drive those conclusions. We formalize this problem under both tuple- and pattern-level deletion settings and show both are NP-complete. To scale to large datasets, we develop efficient algorithms that incorporate machine unlearning techniques to incrementally update causal estimates without retraining from scratch. We evaluate SubCure across four real-world datasets covering diverse application domains. In each case, it uncovers compact, high-impact subsets whose removal significantly shifts the causal conclusions, revealing vulnerabilities that traditional methods fail to detect. Our results demonstrate that cardinality repair is a powerful and general-purpose tool for stress-testing causal analyses and guarding against misleading claims rooted in ordinary data imperfections.
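The search SubCure performs can be caricatured with a greedy tuple-removal loop over a naive difference-in-means estimator. This is an illustrative sketch only, not the SubCure algorithm: the real system also handles pattern-level (subpopulation) deletions and uses machine unlearning for incremental estimate updates.

```python
import numpy as np

def cardinality_repair(treat, outcome, lo, hi, max_removals=50):
    """Toy tuple-level repair: greedily drop the single record whose removal
    moves a naive difference-in-means effect estimate closest to the target
    range [lo, hi], stopping once the estimate falls inside it."""
    keep = np.ones(len(outcome), dtype=bool)

    def ate(mask):
        # Naive average-treatment-effect estimate on the surviving records.
        return outcome[mask & treat].mean() - outcome[mask & ~treat].mean()

    removed = []
    for _ in range(max_removals):
        est = ate(keep)
        if lo <= est <= hi:
            break
        best, best_gap = None, abs(est - np.clip(est, lo, hi))
        for i in np.flatnonzero(keep):
            keep[i] = False
            e = ate(keep)
            gap = abs(e - np.clip(e, lo, hi))
            if gap < best_gap:
                best, best_gap = i, gap
            keep[i] = True
        if best is None:
            break                      # no single removal helps
        keep[best] = False
        removed.append(int(best))
    return removed, ate(keep)
```

On data with a few outlier records inflating the effect, the loop isolates exactly those records, which is the kind of "compact, high-impact subset" the abstract describes.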
【9】Data Curation Through the Lens of Spectral Dynamics: Static Limits, Dynamic Acceleration, and Practical Oracles
标题:谱动力学视角下的数据策展:静态极限、动态加速与实用预言机
链接:https://arxiv.org/abs/2512.02409
作者:Yizhou Zhang,Lun Du
摘要:大规模神经模型越来越多地通过数据修剪、合成数据生成、跨模型蒸馏、基于人类反馈的强化学习(RLHF)和基于难度的采样进行训练。虽然其中一些以数据为中心的策略能可靠地提高训练效率和下游性能,但其他策略未能带来有意义的收益,最明显的是自生成的合成数据,它通常只增加数据集的体量而不增强模型能力。我们将数据策展形式化为对采样分布的重新加权,并将其效果映射到数据诱导算子的本征结构上。我们的第一个主要结果表明,静态修剪诱导有界算子,因此不能改变谱尾指数;它至多提供有限区域内的改进,无法改变渐近神经缩放律。我们的第二个结果分析了时间相关的数据策展,表明能够跟踪谱残差并持续重整尾部的理想oracle可以被证明加速学习,尽管实际系统只能近似这种行为。
摘要:Large-scale neural models are increasingly trained with data pruning, synthetic data generation, cross-model distillation, reinforcement learning from human feedback (RLHF), and difficulty-based sampling. While several of these data-centric strategies reliably improve training efficiency and downstream performance, others fail to provide meaningful gains, most notably self-generated synthetic data, which often increases dataset volume without enhancing model capability. We formalize data curation as reweighting the sampling distribution and map its effect onto the eigenstructure of the data-induced operator. Our first main result shows that static pruning induces a bounded operator and therefore cannot change the spectral tail exponent; it provides at most finite-region improvements and cannot alter asymptotic neural scaling. Our second result analyzes time-dependent data curation, showing that an ideal oracle capable of tracking spectral residuals and continuously re-normalizing the tail can provably accelerate learning, although practical systems can only approximate this behavior.
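The first claim, that bounded static reweighting cannot change the spectral tail exponent, can be sanity-checked numerically on a synthetic power-law spectrum. This is a toy illustration under assumed forms, not the paper's proof.

```python
import numpy as np

rng = np.random.default_rng(0)

def tail_exponent(lam, skip=100):
    """Fit the power-law exponent of a descending eigenvalue sequence from
    the slope of log-eigenvalue vs. log-rank, skipping the head."""
    k = np.arange(1, len(lam) + 1)
    slope = np.polyfit(np.log(k[skip:]), np.log(lam[skip:]), 1)[0]
    return -slope

# A bounded reweighting of the sampling distribution rescales eigenvalues
# by bounded factors; after re-sorting, the spectral tail exponent is
# unchanged, matching the bounded-operator result.
lam = np.arange(1.0, 2001.0) ** -1.5               # spectrum with tail exponent 1.5
weights = rng.uniform(0.5, 2.0, size=lam.shape)    # bounded static curation
lam_curated = np.sort(lam * weights)[::-1]
```

The fitted exponent of `lam_curated` stays close to 1.5 despite the 4x spread in per-eigenvalue weights; only the constant in front of the power law moves.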
【10】WISE: Weighted Iterative Society-of-Experts for Robust Multimodal Multi-Agent Debate
标题:WISE:用于鲁棒多模态多智能体辩论的加权迭代专家社会
链接:https://arxiv.org/abs/2512.02405
作者:Anoop Cherian,River Doyle,Eyal Ben-Dov,Suhas Lohit,Kuan-Chuan Peng
摘要:最近的大型语言模型(LLM)在不同的语料库和任务上训练,因而发展出互补的优势。多智能体辩论(MAD)已成为利用这些优势进行稳健推理的流行方式,但它主要被应用于纯语言任务,其在多模态问题上的功效尚未得到充分探索。在本文中,我们研究用MAD解决视觉-语言推理问题。我们的设置将辩论协议推广到具备单模态和多模态能力的异构专家。为此,我们提出加权迭代专家社会(WISE),一个通用且模块化的MAD框架,它将代理划分为生成解决方案的求解器(Solver),以及验证正确性、分配权重并提供自然语言反馈的反思器(Reflector)。为了在辩论回合间聚合代理的解决方案,同时考虑其响应的方差和反馈权重,我们提出了一种用于后处理的改进Dawid-Skene算法,并将其与我们的两阶段辩论模型相结合。我们在SMART-840、VisualPuzzles、EvoChart-QA以及新的SMART-840++数据集上评估了WISE,后者包含程序化生成、难度可控的问题实例。结果表明,在各种多模态任务和LLM配置中,WISE相比最先进的MAD设置和聚合方法始终将准确率提高2-7%。
摘要:Recent large language models (LLMs) are trained on diverse corpora and tasks, leading them to develop complementary strengths. Multi-agent debate (MAD) has emerged as a popular way to leverage these strengths for robust reasoning, though it has mostly been applied to language-only tasks, leaving its efficacy on multimodal problems underexplored. In this paper, we study MAD for solving vision-and-language reasoning problems. Our setup enables generalizing the debate protocol with heterogeneous experts that possess single- and multi-modal capabilities. To this end, we present Weighted Iterative Society-of-Experts (WISE), a generalized and modular MAD framework that partitions the agents into Solvers, that generate solutions, and Reflectors, that verify correctness, assign weights, and provide natural language feedback. To aggregate the agents' solutions across debate rounds, while accounting for variance in their responses and the feedback weights, we present a modified Dawid-Skene algorithm for post-processing that integrates our two-stage debate model. We evaluate WISE on SMART-840, VisualPuzzles, EvoChart-QA, and a new SMART-840++ dataset with programmatically generated problem instances of controlled difficulty. Our results show that WISE consistently improves accuracy by 2-7% over the state-of-the-art MAD setups and aggregation methods across diverse multimodal tasks and LLM configurations.
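For reference, the classic Dawid-Skene EM that WISE's aggregation modifies looks as follows. The paper's two-stage debate integration and feedback weights are not reproduced here, only the standard algorithm it builds on.

```python
import numpy as np

def dawid_skene(votes, n_classes, iters=20):
    """Minimal classic Dawid-Skene EM for aggregating categorical answers.
    votes[i, j] is agent j's class on item i, or -1 if missing. Returns the
    MAP class per item. (WISE uses a *modified* variant; this is the vanilla
    algorithm.)"""
    n_items, n_agents = votes.shape
    # Initialize per-item class posteriors from a soft majority vote.
    T = np.zeros((n_items, n_classes))
    for i in range(n_items):
        for j in range(n_agents):
            if votes[i, j] >= 0:
                T[i, votes[i, j]] += 1
    T /= T.sum(1, keepdims=True)
    for _ in range(iters):
        # M-step: per-agent confusion matrices and class prior.
        pi = np.full((n_agents, n_classes, n_classes), 1e-6)
        for j in range(n_agents):
            for i in range(n_items):
                if votes[i, j] >= 0:
                    pi[j, :, votes[i, j]] += T[i]
        pi /= pi.sum(2, keepdims=True)
        p = T.mean(0) + 1e-12
        # E-step: posterior over each item's true class.
        logT = np.tile(np.log(p), (n_items, 1))
        for i in range(n_items):
            for j in range(n_agents):
                if votes[i, j] >= 0:
                    logT[i] += np.log(pi[j, :, votes[i, j]])
        T = np.exp(logT - logT.max(1, keepdims=True))
        T /= T.sum(1, keepdims=True)
    return T.argmax(1)
```

A useful property, relevant to debate aggregation: a consistently wrong agent is learned to have a flipped confusion matrix, so its votes still contribute signal instead of just noise.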
【11】Spatiotemporal Pyramid Flow Matching for Climate Emulation
标题:气候模拟的时空金字塔流匹配
链接:https://arxiv.org/abs/2512.02268
作者:Jeremy Andrew Irvin,Jiaqi Han,Zikui Wang,Abdulaziz Alharbi,Yufei Zhao,Nomin-Erdene Bayarsaikhan,Daniele Visioni,Andrew Y. Ng,Duncan Watson-Parris
摘要:生成模型有可能改变我们模拟地球气候变化的方式。以前的生成方法依赖天气尺度的自回归进行气候模拟,但这对于长期气候范围来说本质上是缓慢的,并且尚未在非平稳强迫下展示稳定的长程推演。在这里,我们介绍时空金字塔流(SPF),一类新的流匹配方法,可跨空间和时间尺度对数据进行分层建模。受级联视频模型的启发,SPF将生成轨迹划分为时空金字塔,逐步提高空间分辨率以减少计算量,并将每个阶段与相应的时间尺度耦合,以支持在金字塔中任何时间层级直接采样。这种设计,连同以规定的物理强迫(例如温室气体或气溶胶)作为各阶段的条件,能够在多个时间尺度上进行高效的并行气候模拟。在ClimateBench上,SPF在年度和月度时间尺度上优于强流匹配基线和预训练模型,同时提供快速采样,在较粗的时间层级上尤为明显。为了扩展SPF,我们策划了ClimateSuite,这是迄今为止最大的地球系统模拟集合,包含10个气候模型的33,000多个模拟年,也是第一个包含气候干预模拟的数据集。我们发现,扩展后的SPF模型对跨气候模型的留出情景表现出良好的泛化能力。SPF和ClimateSuite共同为跨时间尺度和现实未来情景的准确、高效、概率式气候模拟提供了基础。数据和代码可在 https://github.com/stanfordmlgroup/spf 公开获取。
摘要:Generative models have the potential to transform the way we emulate Earth's changing climate. Previous generative approaches rely on weather-scale autoregression for climate emulation, but this is inherently slow for long climate horizons and has yet to demonstrate stable rollouts under nonstationary forcings. Here, we introduce Spatiotemporal Pyramid Flows (SPF), a new class of flow matching approaches that model data hierarchically across spatial and temporal scales. Inspired by cascaded video models, SPF partitions the generative trajectory into a spatiotemporal pyramid, progressively increasing spatial resolution to reduce computation and coupling each stage with an associated timescale to enable direct sampling at any temporal level in the pyramid. This design, together with conditioning each stage on prescribed physical forcings (e.g., greenhouse gases or aerosols), enables efficient, parallel climate emulation at multiple timescales. On ClimateBench, SPF outperforms strong flow matching baselines and pre-trained models at yearly and monthly timescales while offering fast sampling, especially at coarser temporal levels. To scale SPF, we curate ClimateSuite, the largest collection of Earth system simulations to date, comprising over 33,000 simulation-years across ten climate models and the first dataset to include simulations of climate interventions. We find that the scaled SPF model demonstrates good generalization to held-out scenarios across climate models. Together, SPF and ClimateSuite provide a foundation for accurate, efficient, probabilistic climate emulation across temporal scales and realistic future scenarios. Data and code is publicly available at https://github.com/stanfordmlgroup/spf
【12】Orchestration Framework for Financial Agents: From Algorithmic Trading to Agentic Trading
标题:金融代理的编排框架:从算法交易到代理式交易
链接:https://arxiv.org/abs/2512.02227
作者:Jifeng Li,Arnav Grover,Abraham Alpuerto,Yupeng Cao,Xiao-Yang Liu
备注:Accepted at the Workshop on Generative AI in Finance, 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
摘要:金融市场因其时间动态性和低信噪比,是AI代理的一个关键任务场景。构建一个有效的算法交易系统可能需要一个专业团队历经数年的开发和测试。在本文中,我们提出了一个面向金融代理的编排框架,旨在向大众普及金融智能。我们将传统算法交易系统的每个组件映射为代理,包括规划器、编排器、alpha代理、风险代理、投资组合代理、回测代理、执行代理、审计代理和记忆代理。我们给出两个内部交易示例。对于股票交易任务(2024年4月至2024年12月的小时级数据),我们的方法实现了20.42%的回报,夏普比率为2.63,最大回撤为-3.59%,而标准普尔500指数的回报为15.97%。对于BTC交易任务(2025年7月27日至2025年8月13日的分钟级数据),我们的方法实现了8.39%的回报,夏普比率为0.38,最大回撤为-2.80%,而同期BTC价格上涨了3.80%。我们的代码可在GitHub上获取:https://github.com/Open-Finance-Lab/AgenticTrading
摘要:The financial market is a mission-critical playground for AI agents due to its temporal dynamics and low signal-to-noise ratio. Building an effective algorithmic trading system may require a professional team to develop and test over the years. In this paper, we propose an orchestration framework for financial agents, which aims to democratize financial intelligence to the general public. We map each component of the traditional algorithmic trading system to agents, including planner, orchestrator, alpha agents, risk agents, portfolio agents, backtest agents, execution agents, audit agents, and memory agent. We present two in-house trading examples. For the stock trading task (hourly data from 04/2024 to 12/2024), our approach achieved a return of 20.42%, a Sharpe ratio of 2.63, and a maximum drawdown of -3.59%, while the S&P 500 index yielded a return of 15.97%. For the BTC trading task (minute data from 27/07/2025 to 13/08/2025), our approach achieved a return of 8.39%, a Sharpe ratio of 0.38, and a maximum drawdown of -2.80%, whereas the BTC price increased by 3.80%. Our code is available on GitHub: https://github.com/Open-Finance-Lab/AgenticTrading
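The reported metrics can be reproduced from an equity curve with standard definitions. This is a generic sketch; the paper's exact annualization conventions are not stated in the abstract.

```python
import numpy as np

def perf_metrics(equity, periods_per_year=252):
    """Total return, annualized Sharpe ratio (zero risk-free rate), and
    maximum drawdown of an equity curve sampled once per period."""
    equity = np.asarray(equity, dtype=float)
    rets = equity[1:] / equity[:-1] - 1.0          # simple per-period returns
    total_return = equity[-1] / equity[0] - 1.0
    sharpe = rets.mean() / rets.std() * np.sqrt(periods_per_year)
    running_peak = np.maximum.accumulate(equity)   # peak-to-date
    max_drawdown = ((equity - running_peak) / running_peak).min()
    return total_return, sharpe, max_drawdown
```

For the paper's hourly stock task, `periods_per_year` would be the number of trading hours per year rather than 252 trading days; for the minute-level BTC task it would be minutes per year.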
【13】InstructLR: A Scalable Approach to Create Instruction Dataset for Under-Resourced Languages
标题:InstructLR:一种为资源不足的语言创建指令数据集的可扩展方法
链接:https://arxiv.org/abs/2512.02213
作者:Mamadou K. Keita,Sebastien Diarra,Christopher Homan,Seydou Diallo
摘要:对于最先进的大型语言模型(LLM)而言,为低资源语言(LRL)提供有效的文本生成和聊天界面仍然是一个挑战。这主要是由于难以为LRL整理高质量的指令数据集,这一限制在非洲大陆及其他地区使用的语言中普遍存在。目前的方法,如自动翻译和合成数据生成,经常产生缺乏流畅性甚至正字法一致性的输出。在本文中,我们介绍InstructLR,一个旨在为LRL生成高质量指令数据集的新框架。我们的方法将LLM驱动的文本生成与双层质量过滤机制相结合:一个基于检索增强生成(RAG)n-shot提示的自动过滤层,以及一个人在回路的验证层。在任务定义上借鉴MMLU等基准,InstructLR已促成三个多领域指令基准的创建:ZarmaInstruct-50k、BambaraInstruct-50k和FulfuldeInstruct-50k。
摘要:Effective text generation and chat interfaces for low-resource languages (LRLs) remain a challenge for state-of-the-art large language models (LLMs) to support. This is mainly due to the difficulty of curating high-quality instruction datasets for LRLs, a limitation prevalent in the languages spoken across the African continent and other regions. Current approaches, such as automated translation and synthetic data generation, frequently yield outputs that lack fluency or even orthographic consistency. In this paper, we introduce InstructLR, a novel framework designed to generate high-quality instruction datasets for LRLs. Our approach integrates LLM-driven text generation with a dual-layer quality filtering mechanism: an automated filtering layer based on retrieval-augmented-generation (RAG)-based n-shot prompting, and a human-in-the-loop validation layer. Drawing inspiration from benchmarks such as MMLU in task definition, InstructLR has facilitated the creation of three multi-domain instruction benchmarks: ZarmaInstruct-50k, BambaraInstruct-50k, and FulfuldeInstruct-50k.
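The RAG-based n-shot prompting in the automated filtering layer presumably retrieves similar validated examples as shots for the judge. A minimal sketch under that assumption; the function name and prompt format are hypothetical, not from the paper.

```python
import numpy as np

def build_nshot_prompt(query_vec, example_vecs, examples, n=3):
    """Hypothetical sketch of the RAG-style n-shot step: score validated
    instruction examples by cosine similarity to the candidate's embedding,
    then prepend the top-n as shots before the candidate to be judged."""
    sims = example_vecs @ query_vec / (
        np.linalg.norm(example_vecs, axis=1) * np.linalg.norm(query_vec))
    top = np.argsort(-sims)[:n]                 # indices of the n best shots
    shots = "\n\n".join(examples[i] for i in top)
    return shots + "\n\nCandidate:"
```

Retrieval-matched shots ground the filtering model in fluent, orthographically consistent text of the target language, which is exactly the failure mode the abstract attributes to naive translation pipelines.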
【14】Enforcing Orderedness to Improve Feature Consistency
标题:强制有序性以提高特征一致性
链接:https://arxiv.org/abs/2512.02194
作者:Sophie L. Wang,Alex Quach,Nithin Parsan,John J. Yang
摘要:稀疏自动编码器(SAE)已被广泛用于神经网络的可解释性研究,但其学习到的特征往往随随机种子和超参数设置而变化。我们提出有序稀疏自动编码器(OSAE),它通过(1)对潜在特征建立严格排序,以及(2)确定性地使用每个特征维度来扩展Matryoshka SAE,从而避免了先前嵌套SAE方法基于采样的近似。理论上,我们证明在稀疏字典学习的解唯一(在自然对称性意义下)的设定中,OSAE可解决排列不可识别性问题。在Gemma2-2B和Pythia-70M上的实验表明,与Matryoshka基线相比,OSAE有助于提高一致性。
摘要:Sparse autoencoders (SAEs) have been widely used for interpretability of neural networks, but their learned features often vary across seeds and hyperparameter settings. We introduce Ordered Sparse Autoencoders (OSAE), which extend Matryoshka SAEs by (1) establishing a strict ordering of latent features and (2) deterministically using every feature dimension, avoiding the sampling-based approximations of prior nested SAE methods. Theoretically, we show that OSAEs resolve permutation non-identifiability in settings of sparse dictionary learning where solutions are unique (up to natural symmetries). Empirically on Gemma2-2B and Pythia-70M, we show that OSAEs can help improve consistency compared to Matryoshka baselines.
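One way to read "(1) strict ordering, (2) deterministically using every feature dimension" is a loss summed over all latent prefixes, instead of the sampled prefixes of Matryoshka SAEs. A sketch under that assumption, not the authors' exact objective:

```python
import numpy as np

def ordered_prefix_loss(x, W_enc, W_dec):
    """Illustrative ordered-SAE objective (an assumed reading of the
    abstract): sum reconstruction error over *every* prefix of the latent
    dimensions. Earlier dimensions appear in every prefix, so they must
    carry the most signal, inducing a strict feature ordering without any
    prefix sampling."""
    z = np.maximum(x @ W_enc, 0.0)            # ReLU latent code
    k = z.shape[1]
    loss = 0.0
    for m in range(1, k + 1):                 # deterministic: all prefixes
        x_hat = z[:, :m] @ W_dec[:m]          # reconstruct from first m dims
        loss += np.mean((x - x_hat) ** 2)
    return loss / k
```

Because every prefix contributes, permuting two latent dimensions changes the loss, which is the intuition behind resolving permutation non-identifiability.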
【15】CoatFusion: Controllable Material Coating in Images
标题:CoatFusion:图像中的可控材料涂层
链接:https://arxiv.org/abs/2512.02143
作者:Sagie Levy,Elad Aharoni,Matan Levy,Ariel Shamir,Dani Lischinski
摘要:我们介绍材料涂层,一种新颖的图像编辑任务,模拟应用薄材料层到对象上,同时保留其底层的粗糙和精细的几何形状。材料涂层与现有的“材料转移”方法有着根本的不同,后者旨在取代物体的固有材料,通常会掩盖细微的细节。为了解决这个新任务,我们构建了一个大规模的3D物体合成数据集(110K图像),这些物体具有不同的基于物理的涂层,名为DataCoat110K。然后,我们提出了CoatFusion,一种新的架构,使这一任务的条件下的扩散模型上的二维纹理和颗粒,PBR风格的参数控制,包括粗糙度,金属度,传输,和一个关键的厚度参数。实验和用户研究表明,CoatFusion可以产生逼真、可控的涂层,并且在这项新任务上显著优于现有的材料编辑和传输方法。
摘要:We introduce Material Coating, a novel image editing task that simulates applying a thin material layer onto an object while preserving its underlying coarse and fine geometry. Material coating is fundamentally different from existing "material transfer" methods, which are designed to replace an object's intrinsic material, often overwriting fine details. To address this new task, we construct a large-scale synthetic dataset (110K images) of 3D objects with varied, physically-based coatings, named DataCoat110K. We then propose CoatFusion, a novel architecture that enables this task by conditioning a diffusion model on both a 2D albedo texture and granular, PBR-style parametric controls, including roughness, metalness, transmission, and a key thickness parameter. Experiments and user studies show CoatFusion produces realistic, controllable coatings and significantly outperforms existing material editing and transfer methods on this new task.
【16】Teaching an Online Multi-Institutional Research Level Software Engineering Course with Industry - an Experience Report
标题:与行业一起教授在线多机构研究水平软件工程课程-经验报告
链接:https://arxiv.org/abs/2512.01523
作者:Pankaj Jalote,Y. Raghu Reddy,Vasudeva Varma
备注:7 pages
摘要:新冠疫情使在线教学和学习被广泛接受,学生、教师和行业专业人士都已适应这种模式。可以利用这种适应性来开设在线多机构研究型课程,覆盖个别机构可能缺乏相应授课教师和/或没有足够研究生报名的领域。如果行业对该主题感兴趣,在线授课还便于行业专家参与和贡献。软件工程中的高级主题非常适合尝试这种方法,因为通常希望将软件工程新进展纳入实践的企业,很可能同意参与并做出贡献。在本文中,我们描述了两所机构在行业积极参与下联合讲授题为"AI in Software Engineering"课程的实验,并分享我们和学生的经验。我们相信,对于规模较小、难以独自开设研究型课程的机构,这种协作教学方法可用于在计算机科学的任何应用领域开设研究型课程。
摘要:Covid has made online teaching and learning acceptable and students, faculty, and industry professionals are all comfortable with this mode. This comfort can be leveraged to offer an online multi-institutional research-level course in an area where individual institutions may not have the requisite faculty to teach and/or research students to enroll. If the subject is of interest to industry, online offering also allows industry experts to contribute and participate with ease. Advanced topics in Software Engineering are ideally suited for experimenting with this approach as industry, which is often looking to incorporate advances in software engineering in their practices, is likely to agree to contribute and participate. In this paper we describe an experiment in teaching a course titled "AI in Software Engineering" jointly between two institutions with active industry participation, and share our and student's experience. We believe this collaborative teaching approach can be used for offering research level courses in any applied area of computer science by institutions who are small and find it difficult to offer research level courses on their own.
【17】Rethinking Generalized BCIs: Benchmarking 340,000+ Unique Algorithmic Configurations for EEG Mental Command Decoding
标题:重新思考广义BCI:对EEG心理命令解码的340,000多种独特算法配置进行基准测试
链接:https://arxiv.org/abs/2512.02978
作者:Paul Barbaste,Olivier Oullier,Xavier Vasques
备注:28 pages, 8 figures, 2 tables
摘要:由于参与者之间和参与者内部的差异性,对脑电图(EEG)测量的大脑模式进行鲁棒解码和分类,仍然是现实世界(即科学实验室和医疗设施之外)脑机接口(BCI)应用的主要挑战。在这里,我们提出了一个大规模基准,评估了超过340,000种空间与非线性EEG分类方法的独特组合。我们的方法流程在三个开放获取的EEG数据集上组合了共空间模式(CSP)、黎曼几何、功能连接性以及基于分形或熵的特征。与之前的研究不同,我们的分析在每个参与者层面进行,并跨越多个频段(8-15 Hz和8-30 Hz),从而能够直接评估组水平表现和个体差异。协方差切空间投影(cov-tgsp)和CSP一致取得最高的平均分类精度。然而,它们的有效性强烈依赖于数据集,并且显著的参与者间差异持续存在,在最异构的数据集中尤其明显。重要的是,非线性方法在特定个体上的表现优于空间方法,凸显了个性化流程选择的必要性。我们的研究结果强调,没有通用的"一刀切"方法能够对所有用户或数据集的EEG运动想象模式进行最优解码。未来的工作需要自适应、多模态、甚至全新的方法,才能在实际BCI应用中充分应对神经生理学差异,使系统能够自动适应每个用户的独特之处。
摘要:Robust decoding and classification of brain patterns measured with electroencephalography (EEG) remains a major challenge for real-world (i.e. outside scientific labs and medical facilities) brain-computer interface (BCI) applications due to well-documented inter- and intra-participant variability. Here, we present a large-scale benchmark evaluating over 340,000 unique combinations of spatial and nonlinear EEG classification methods. Our methodological pipeline consists of combinations of Common Spatial Patterns (CSP), Riemannian geometry, functional connectivity, and fractal- or entropy-based features across three open-access EEG datasets. Unlike prior studies, our analysis operates at the per-participant level and across multiple frequency bands (8-15 Hz and 8-30 Hz), enabling direct assessment of both group-level performance and individual variability. Covariance tangent space projection (cov-tgsp) and CSP consistently achieved the highest average classification accuracies. However, their effectiveness was strongly dataset-dependent, and marked participant-level differences persisted, particularly in the most heterogeneous of the datasets. Importantly, nonlinear methods outperformed spatial approaches for specific individuals, underscoring the need for personalized pipeline selection. Our findings highlight that no universal 'one-size-fits-all' method can optimally decode EEG motor imagery patterns across all users or datasets. Future work will require adaptive, multimodal, and possibly novel approaches to fully address neurophysiological variability in practical BCI applications where the system can automatically adapt to what makes each user unique.
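The best-performing method, covariance tangent-space projection (cov-tgsp), can be sketched in a few lines. For brevity this uses the Euclidean mean as the reference point, whereas standard Riemannian pipelines use the geometric (Riemannian) mean:

```python
import numpy as np

def _sym_fn(C, fn):
    """Apply a scalar function to a symmetric matrix via eigendecomposition."""
    w, V = np.linalg.eigh(C)
    return (V * fn(w)) @ V.T

def tangent_space_features(covs):
    """Sketch of covariance tangent-space projection: whiten each trial
    covariance by a reference mean, take the matrix logarithm, and
    vectorize the upper triangle. Eigendecomposition-based matrix functions
    are valid here because the inputs are symmetric positive definite."""
    W = _sym_fn(np.mean(covs, axis=0), lambda w: w ** -0.5)  # whitener
    iu = np.triu_indices(covs.shape[1])
    return np.array([_sym_fn(W @ C @ W, np.log)[iu] for C in covs])
```

The resulting Euclidean feature vectors can then be fed to any ordinary classifier, which is what makes tangent-space methods such a strong, simple baseline in EEG pipelines.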
【18】Molecular Embedding-Based Algorithm Selection in Protein-Ligand Docking
标题:蛋白质-配体对接中基于分子嵌入的算法选择
链接:https://arxiv.org/abs/2512.02328
作者:Jiabao Brad Wang,Siyuan Cao,Hongxuan Wu,Yiliang Yuan,Mustafa Misir
备注:25 pages, 13 figures, 5 tables. Protein-ligand docking, algorithm selection, pretrained embeddings (ESM, ChemBERTa), docking benchmarks, oracle-landscape analysis. Code and data available
摘要:选择有效的对接算法高度依赖于具体情境,没有一种方法能在不同的结构、化学或协议条件下都可靠地工作。我们提出MolAS,一个轻量级的算法选择系统,它利用注意力池化和浅层残差解码器,从预训练的蛋白质-配体嵌入中预测每种算法的性能。仅需数百到数千个带标注的复合物,MolAS即可比单一最佳求解器(SBS)取得最高15%的绝对提升,并在五个不同的对接基准上弥合了17-66%的虚拟最佳求解器(VBS)与SBS之间的差距。对可靠性、嵌入几何结构和求解器选择模式的分析表明,当oracle景观呈现低熵且求解器行为可分离时,MolAS表现出色,但在协议引起的层级变化下会失效。这些发现表明,实现鲁棒对接算法选择的主要障碍不是表示能力,而是求解器排名在不同姿态生成条件下的不稳定性,这使MolAS既是一个实用的域内选择器,也是评估算法选择何时可行的诊断工具。
摘要:Selecting an effective docking algorithm is highly context-dependent, and no single method performs reliably across structural, chemical, or protocol regimes. We introduce MolAS, a lightweight algorithm selection system that predicts per-algorithm performance from pretrained protein-ligand embeddings using attentional pooling and a shallow residual decoder. With only hundreds to a few thousand labelled complexes, MolAS achieves up to 15% absolute improvement over the single-best solver (SBS) and closes 17-66% of the Virtual Best Solver (VBS)-SBS gap across five diverse docking benchmarks. Analyses of reliability, embedding geometry, and solver-selection patterns show that MolAS succeeds when the oracle landscape exhibits low entropy and separable solver behaviour, but collapses under protocol-induced hierarchy shifts. These findings indicate that the main barrier to robust docking AS is not representational capacity but instability in solver rankings across pose-generation regimes, positioning MolAS as both a practical in-domain selector and a diagnostic tool for assessing when AS is feasible.
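The attentional pooling MolAS applies to pretrained embeddings is a standard mechanism; a minimal sketch follows. The exact attention head and the shallow residual decoder on top are omitted, and the learned query is an assumption about the parameterization.

```python
import numpy as np

def attentional_pool(H, w_query):
    """Generic attentional pooling over per-token embeddings: a learned
    query vector scores each token, and the pooled representation is the
    softmax-attention-weighted sum of the embeddings."""
    scores = H @ w_query                 # (n_tokens,) relevance scores
    a = np.exp(scores - scores.max())    # numerically stable softmax
    a /= a.sum()                         # attention weights on the simplex
    return a @ H                         # (d,) pooled embedding
```

Because the output is a convex combination of the token embeddings, the pooled vector stays inside their convex hull, a fixed-size summary suitable as input to a small per-algorithm performance decoder.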
【19】Characterizing Continuous and Discrete Hybrid Latent Spaces for Structural Connectomes
标题:结构连接体的连续和离散混合潜空间的特征
链接:https://arxiv.org/abs/2512.02032
作者:Gaurav Rudravaram,Lianrui Zuo,Adam M. Saunders,Michael E. Kim,Praitayini Kanakaraj,Nancy R. Newlin,Aravind R. Krishnan,Elyssa M. McMaster,Chloe Cho,Susan M. Resnick,Lori L. Beason Held,Derek Archer,Timothy J. Hohman,Daniel C. Moyer,Bennett A. Landman
摘要:结构连接组是详细的图表,描绘了不同的大脑区域是如何物理连接的,为衰老,认知和神经退行性疾病提供了重要的见解。然而,这些连接体是高维的,并且紧密互连,这使得它们难以大规模解释和分析。虽然像PCA和自动编码器这样的低维空间通常用于捕获主要的变异源,但它们的潜在空间通常是连续的,并且不能完全反映连接体中变异性的混合性质,其包括连续的(例如,连接强度)和离散因素(例如,成像部位)。出于这一动机,我们提出了一个变分自动编码器(VAE)与混合潜在空间,共同建模的离散和连续的组件。我们分析了来自六项阿尔茨海默病研究的5,761个连接体的大型数据集,其中包含10个采集协议。每个连接体代表来自一个独特的受试者(3579名女性,2182名男性)的单次扫描,年龄在22岁至102岁之间,其中4338名认知正常,809名轻度认知障碍(MCI),614名阿尔茨海默病(AD)。每个连接体包含BrainCOLOR图谱定义的121个大脑区域。我们训练我们的混合VAE在一个无人监督的方式和特征,每个潜在的组件捕获。我们发现,离散空间在捕获与站点相关的细微差异方面特别有效,使用站点标签实现了0.65的调整后兰德指数(ARI),显著优于PCA和标准VAE,然后进行聚类(p < 0.05)。这些结果表明,混合潜在空间可以解开不同来源的变异连接体在一个无监督的方式,大规模的连接体分析提供了潜力。
摘要:Structural connectomes are detailed graphs that map how different brain regions are physically connected, offering critical insight into aging, cognition, and neurodegenerative diseases. However, these connectomes are high-dimensional and densely interconnected, which makes them difficult to interpret and analyze at scale. While low-dimensional spaces like PCA and autoencoders are often used to capture major sources of variation, their latent spaces are generally continuous and cannot fully reflect the mixed nature of variability in connectomes, which include both continuous (e.g., connectivity strength) and discrete factors (e.g., imaging site). Motivated by this, we propose a variational autoencoder (VAE) with a hybrid latent space that jointly models the discrete and continuous components. We analyze a large dataset of 5,761 connectomes from six Alzheimer's disease studies with ten acquisition protocols. Each connectome represents a single scan from a unique subject (3579 females, 2182 males), aged 22 to 102, with 4338 cognitively normal, 809 with mild cognitive impairment (MCI), and 614 with Alzheimer's disease (AD). Each connectome contains 121 brain regions defined by the BrainCOLOR atlas. We train our hybrid VAE in an unsupervised way and characterize what each latent component captures. We find that the discrete space is particularly effective at capturing subtle site-related differences, achieving an Adjusted Rand Index (ARI) of 0.65 with site labels, significantly outperforming PCA and a standard VAE followed by clustering (p < 0.05). These results demonstrate that the hybrid latent space can disentangle distinct sources of variability in connectomes in an unsupervised manner, offering potential for large-scale connectome analysis.
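The abstract doesn't specify how the hybrid latent is parameterized. A common choice, and an assumption here, is a Gaussian continuous branch plus a Gumbel-softmax relaxed discrete branch:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hybrid_latent(mu, logvar, logits, tau=1.0):
    """One common parameterization of a hybrid discrete+continuous latent
    (an assumption; the abstract does not give the authors' exact choice):
    a Gaussian branch via the reparameterization trick, concatenated with
    a relaxed one-hot branch via Gumbel-softmax at temperature tau."""
    z_cont = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
    g = -np.log(-np.log(rng.random(logits.shape)))   # Gumbel(0, 1) noise
    y = np.exp((logits + g) / tau)
    z_disc = y / y.sum(axis=-1, keepdims=True)       # relaxed one-hot
    return np.concatenate([z_cont, z_disc], axis=-1)
```

As `tau` decreases, the discrete branch approaches a hard one-hot code, matching the paper's finding that categorical factors such as imaging site are best absorbed by the discrete part of the latent space.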
【20】Generative design and validation of therapeutic peptides for glioblastoma based on a potential target ATP5A
标题:基于潜在靶点ATP 5A的胶质母细胞瘤治疗肽的生成设计和验证
链接:https://arxiv.org/abs/2512.02030
作者:Hao Qian,Pu You,Lin Zeng,Jingyuan Zhou,Dengdeng Huang,Kaicheng Li,Shikui Tu,Lei Xu
摘要:胶质母细胞瘤(GBM)仍然是最具侵袭性的肿瘤,迫切需要新的治疗策略。在这里,我们提出了一个结合生成建模与实验验证的干湿结合框架,用于优化靶向ATP5A(GBM的一个潜在肽结合蛋白)的肽。我们的框架引入了首个先导肽条件化生成模型,将探索集中在先导肽周围几何相关的区域,并缓解了从头设计方法的组合复杂性。具体来说,我们提出POTFlow,一个基于先验(Prior)和最优传输(Optimal Transport)的流匹配(Flow-matching)肽优化模型。POTFlow采用二级结构信息(例如螺旋、折叠片、环)作为几何约束,并通过最优传输进一步细化,以产生更短的流动路径。通过这种设计,我们的方法相比五种流行方法实现了最先进的性能。应用于GBM时,我们的方法生成的肽能选择性抑制细胞活力,并在患者来源异种移植(PDX)模型中显著延长生存期。作为首个先导肽条件化流匹配模型,POTFlow有望成为治疗性肽设计的可推广框架。
摘要:Glioblastoma (GBM) remains the most aggressive tumor, urgently requiring novel therapeutic strategies. Here, we present a dry-to-wet framework combining generative modeling and experimental validation to optimize peptides targeting ATP5A, a potential peptide-binding protein for GBM. Our framework introduces the first lead-conditioned generative model, which focuses exploration on geometrically relevant regions around lead peptides and mitigates the combinatorial complexity of de novo methods. Specifically, we propose POTFlow, a Prior and Optimal Transport-based Flow-matching model for peptide optimization. POTFlow employs secondary structure information (e.g., helix, sheet, loop) as geometric constraints, which are further refined by optimal transport to produce shorter flow paths. With this design, our method achieves state-of-the-art performance compared with five popular approaches. When applied to GBM, our method generates peptides that selectively inhibit cell viability and significantly prolong survival in a patient-derived xenograft (PDX) model. As the first lead peptide-conditioned flow matching model, POTFlow holds strong potential as a generalizable framework for therapeutic peptide design.
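The optimal-transport ingredient of POTFlow follows the general minibatch-OT flow matching recipe: re-pair noise and data samples so the interpolation paths the model must learn are shorter. A toy sketch using brute-force assignment; the paper's lead-peptide prior and secondary-structure constraints are omitted.

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(0)

def ot_flow_matching_pair(x0, x1):
    """Generic minibatch-OT coupling for flow matching (the recipe a model
    like POTFlow builds on, not its exact formulation): re-pair noise
    samples x0 with data samples x1 to minimize total squared distance,
    then sample a point and target velocity on the linear path. Brute-force
    assignment is used, so keep the batch tiny."""
    n = len(x0)
    cost = ((x0[:, None, :] - x1[None, :, :]) ** 2).sum(-1)
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i, p[i]] for i in range(n)))
    x1m = x1[list(best)]
    t = rng.random((n, 1))
    x_t = (1 - t) * x0 + t * x1m    # point on the linear probability path
    v_target = x1m - x0             # regression target for the velocity net
    return x_t, v_target, list(best)
```

Real implementations replace the brute-force search with a linear-assignment or Sinkhorn solver; the effect is the same, straighter and shorter flow paths than random pairing.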
机器翻译由腾讯交互翻译提供,仅供参考
点击“阅读原文”获取带摘要的学术速递