
谷歌颠覆性论文 | 嵌套学习:深度学习架构的幻象(中英双语)

本文转自公号:硬科普

Nested Learning: The Illusion of Deep Learning Architectures

嵌套学习:深度学习架构的幻象





Authors 作者

Ali Behrouz, Google Research, USA (alibehrouz@google.com)

阿里·贝胡鲁兹,谷歌研究院,美国

Meisam Razaviyayn, Google Research, USA (razaviyayn@google.com)

梅萨姆·拉扎维扬,谷歌研究院,美国

Peilin Zhong, Google Research, USA (peilinz@google.com)

钟佩琳,谷歌研究院,美国

Vahab Mirrokni, Google Research, USA (mirrokni@google.com)

瓦哈布·米罗克尼,谷歌研究院,美国

Abstract 摘要

Over the last decades, developing more powerful neural architectures and simultaneously designing optimization algorithms to effectively train them have been the core of research efforts to enhance the capability of machine learning models.

在过去的几十年里,开发更强大的神经网络架构,并同时设计有效训练这些架构的优化算法,一直是增强机器学习模型能力的核心研究方向。

Despite recent progress, particularly in developing Language Models (LMs), there are fundamental challenges and unanswered questions about how such models can continually learn/memorize, self-improve, and find “effective solutions.”

尽管近年来取得了大量进展,尤其是在语言模型(LM)的发展上,但关于这些模型如何实现持续学习/记忆、自我改进,并找到“有效的解决方案”,仍存在根本性的挑战和诸多未解之问。

In this paper, we present a new learning paradigm, called Nested Learning (NL), that coherently represents a model with a set of nested, multi-level, and/or parallel optimization problems, each with its own “context flow”.

在本文中,我们提出了一种新的学习范式——称为“嵌套学习(Nested Learning, NL)”。该范式将一个模型一致地表示为一组嵌套的、多层级的、和/或并行的优化问题,每一个优化问题都拥有自身的“上下文流”(context flow)。

NL reveals that existing deep learning methods learn from data by compressing their own context flow, and explains how in-context learning emerges in large models.

NL 揭示了现有深度学习方法实际上是通过压缩其自身的上下文流来从数据中学习,并解释了大型模型中“上下文学习”(in-context learning)是如何自然涌现的。

NL suggests a path (a new dimension to deep learning) to design more expressive learning algorithms with more “levels”, resulting in higher-order in-context learning abilities.

NL 提出了一个方向(也是深度学习的新维度):通过增加更多“层级”来设计更具表达力的学习算法,从而获得更高阶的上下文学习能力。

In addition to its neuroscientifically plausible and mathematically white-box nature, we advocate for its importance by presenting three core contributions:

除了其在神经科学上合理、在数学结构上透明(white-box)的特性外,我们通过提出以下三项核心贡献来强调其重要性:

(1) Deep Optimizers: Based on NL, we show that well-known gradient-based optimizers (e.g., Adam, SGD with Momentum, etc.) are in fact associative memory modules that aim to compress the gradients with gradient descent.

(1)深度优化器(Deep Optimizers):基于 NL,我们展示了诸如 Adam、动量 SGD 等经典的基于梯度的优化器,其实都是通过梯度下降来压缩梯度的“联想记忆模块(associative memory modules)”。

Building on this insight, we present a set of more expressive optimizers with deep memory and/or more powerful learning rules;

基于这一洞见,我们提出了一组更具表达力的优化器,它们拥有深度记忆结构和/或更强的学习规则。

(2) Self-Modifying Titans: Taking advantage of NL’s insights on learning algorithms, we present a novel sequence model that learns how to modify itself by learning its own update algorithm;

(2)自我修改的 Titans(Self-Modifying Titans):利用 NL 对学习算法的洞察,我们提出了一种新的序列模型,它能够通过学习自身的更新算法来“学习如何修改自己”。

and (3) Continuum Memory System: We present a new formulation for memory system that generalizes the traditional viewpoint of “long-term/short-term memory”.

(3)连续体记忆系统(Continuum Memory System):我们提出了一种新的记忆系统形式化方法,它推广了传统的“长期记忆/短期记忆”的观点。

Combining our self-modifying sequence model with the continuum memory system, we present a learning module, called HOPE, showing promising results in language modeling, continual learning, and long-context reasoning tasks.

将我们提出的自我修改序列模型与连续体记忆系统结合,我们推出了一种学习模块——称为 HOPE,它在语言建模、持续学习以及长上下文推理任务中展现了极具前景的结果。

1 Introduction

This version of the paper has been extensively summarized to fit the page limit of NeurIPS camera ready, and some materials, experiments, discussions, and methods are moved to appendix, which might make some parts hard to follow or cause inconsistencies.

本文的这一版本已经被大幅度压缩,以符合 NeurIPS 最终版的篇幅限制;因此,一些材料、实验、讨论和方法被移到了附录。这可能导致正文部分的某些内容较难理解,或存在不一致之处。

To avoid such cases, please read our arXiv version instead [1] (will be available on November 13).

为了避免此类情况,请阅读我们的 arXiv 版本 [1](将于 11 月 13 日发布)。

39th Conference on Neural Information Processing Systems (NeurIPS 2025).

第三十九届神经信息处理系统会议(NeurIPS 2025)。


Figure 1: The uniform and reusable structure, as well as multi-time-scale updates in the brain, are the key components that unlock continual learning in humans. Nested Learning (NL) allows for multi-time-scale updates for each component of the brain, while showing that well-known architectures such as Transformers are in fact linear layers with different update frequencies.

图 1:大脑中统一且可复用的结构,以及多时间尺度的更新机制,是人类实现持续学习的关键因素。嵌套学习(NL)为大脑的每个组成部分提供多时间尺度的更新方式,并展示了诸如 Transformer 之类的经典架构实际上是具有不同更新频率的线性层。

For decades, AI research has focused on designing machine learning algorithms that learn from data [2–5] or experience [6–8], often by optimizing an objective L(θ) over parameters θ ∈ Θ with gradient-based methods.

数十年来,人工智能研究一直专注于设计能够从数据 [2–5] 或经验 [6–8] 中学习的机器学习算法;这些算法通常通过基于梯度的方法,对参数 θ ∈ Θ 上的某个目标函数 L(θ) 进行优化。

While traditional machine learning techniques required careful engineering and domain expertise to design feature extractors, limiting their ability to directly process and learn from natural data [9], deep representation learning offered a fully automated alternative to discover the representations needed for the task.

传统的机器学习技术需要精心的特征工程与领域知识来设计特征提取器,这限制了它们直接处理和学习自然数据的能力 [9];相比之下,深度表征学习提供了一种完全自动化的方式,可以直接发现任务所需的表示。

Thereafter, deep learning has been an inseparable part of the large-scale computational models with seminal success in chemistry and biology [10], games [11, 12], computer vision [13, 14], and multimodal and natural language understanding [15–17].

此后,深度学习成为大规模计算模型中不可分割的一部分,并在化学与生物学 [10]、游戏 [11, 12]、计算机视觉 [13, 14],以及多模态与自然语言理解 [15–17] 等领域取得了开创性的成功。

Stacking of multiple layers, as it is done in deep learning models, provides the models with larger capacity, better expressive power in representing complex features, and more internal computations (e.g., #FLOPS) [18–20], all of which are critical and desirable characteristics for static tasks that require in-distribution predictions over a previously fixed set.

在深度学习模型中,通过堆叠多层结构,可以为模型带来更大的容量、更强的复杂特征表达能力,以及更多的内部计算(例如 FLOPS 数量)[18–20]。这些特征对于那些需要在既定数据分布上进行预测的静态任务来说,都是关键且理想的性质。

This deep design, however, is not a universal solution to all challenges and does not improve the expressive power of models in several respects, for example:

然而,这种深度结构的设计并非应对所有挑战的通用解决方案,也无法在多个方面提升模型的表达能力,例如:

(i) The computational depth of deep models might not change with more layers [21, 22], leaving their ability to implement complex algorithms untouched compared to traditional shallow approaches [23];

(i)深度模型的计算深度可能并不会随着增加更多层数而改变 [21, 22],这使得它们在实现复杂算法的能力上,与传统的浅层方法 [23] 相比并无提升;

(ii) The capacity of some class of parameters might show marginal improvement with increasing the depth/width of the model [24];

(ii)某些类别的参数容量,即便随着模型深度/宽度的增加,也可能只呈现出微弱的提升 [24];

(iii) The training process might converge to a suboptimal solution, mainly due to the suboptimal choice of the optimizer or its hyperparameters;

(iii)训练过程可能会收敛到 次优解,主要原因在于优化器或其超参数的选择并不理想;

and (iv) The model’s ability to quickly adapt to a new task, continually learn, and/or generalize to out-of-distribution data might not change with stacking more layers and requires more careful designs.

(iv)模型在以下方面的能力——快速适应新任务、持续学习、以及对分布外数据进行泛化——并不会因为简单地堆叠更多层数而获得改善,而是需要更精细的设计。

The core efforts to overcome the above challenges and to enhance the capability of deep learning models concentrate on:

克服上述挑战并增强深度学习模型能力的核心努力主要集中在以下方面:

(1) developing more expressive class of parameters (i.e., neural architectures) [13, 25–28];

(1)开发表达能力更强的参数类别(即神经网络架构)[13, 25–28];

(2) introducing objectives that can better model the tasks [29–32];

(2)设计能够更好建模任务需求的目标函数 [29–32];

(3) designing more efficient/effective optimization algorithms to find better solutions or with more resilience to forgetting [33–36];

(3)设计更高效/更有效的优化算法,以找到更优的解,或使模型具备更强的抗遗忘能力 [33–36];

and (4) scaling the model size to enhance its expressivity, when the “right” choice of architecture, objective, and optimization algorithms are made [24, 37, 38].

(4)在架构、目标函数和优化算法均做出“正确选择”的前提下,通过扩大模型规模来增强其表达能力 [24, 37, 38]。

Collectively, these advancements and new findings on scaling patterns of deep models have established the foundations upon which Large Language Models (LLMs) have been built.

总体而言,这些进展以及关于深度模型规模化规律的新发现,共同奠定了构建大型语言模型(LLMs)的基础。

The development of LLMs marks a pivotal milestone in deep learning research: a paradigm shift from task-specific models to more general-purpose systems with various emergent capabilities as a result of scaling the “right” architectures [38, 39].

LLM 的发展标志着深度学习研究的一个关键里程碑:研究范式从面向特定任务的模型,转向通过扩展“正确的架构”而形成的具有多种涌现能力的通用系统 [38, 39]。

Despite all their success and remarkable capabilities in diverse sets of tasks [15, 40, 41], LLMs are largely static after their initial deployment phase, meaning that they successfully perform tasks learned during pre- or post-training, but are unable to continually acquire new capabilities beyond their immediate context.

尽管 LLM 在各种任务中取得成功并展现出卓越能力 [15, 40, 41],但它们在最初部署阶段之后基本是“静态”的——也就是说,它们能成功执行在预训练或后训练阶段学到的任务,但无法持续获得超出其直接上下文的新能力。

The only adaptable component of LLMs is their in-context learning ability–a (known to be emergent) characteristic of LLMs that enables fast adaptation to the context and thus zero- or few-shot task performance [38].

LLM 中唯一可适应的部分,是其“上下文学习能力”(in-context learning)——这一能力已知是涌现性质,它使模型能够快速适应输入上下文,因此可以执行零样本或小样本任务 [38]。

Beyond in-context learning, recent efforts to overcome the static nature of LLMs are either computationally expensive, require external components, lack generalization, and/or suffer from catastrophic forgetting [42–44], which has led researchers to question whether we need to revisit how machine learning models are designed, and whether a new learning paradigm beyond stacking layers is required to unleash the capabilities of LLMs in continual setups.

在上下文学习之外,近期试图克服 LLM 静态特性的努力,要么计算代价极高,要么需要额外组件,要么泛化性不足,或可能遭受严重遗忘问题 [42–44];这促使研究者开始思考:是否需要重新审视机器学习模型的设计方式,是否需要一种超越“堆叠层数”之外的新学习范式,以在持续学习环境中真正释放 LLM 的能力。

Current Models only Experience the Immediate Present.

当前的模型只能体验“直接的当下”。

As an analogy and to better illustrate the static nature of LLMs, we use the example of anterograde amnesia–a neurological condition where a person cannot form new long-term memories after the onset of the disorder, while existing memories remain intact [45].

作为一个类比,并为了更清晰地展示 LLM 的静态特性,我们引用“顺行性遗忘症”(anterograde amnesia)的例子——这是一种神经系统疾病,患者在发病后无法形成新的长期记忆,而已有的记忆仍保持完整 [45]。

This condition limits the person’s knowledge and experiences to a short window of present and long past–before the onset of the disorder–which results in continuously experiencing the immediate present as if it were always new.

这种状况把患者的知识与体验限制在一个很短的“当下窗口”以及疾病发生之前的很久以前的记忆中,使患者不断地把当前时刻体验成“全新的”,仿佛每一刻都是第一次经历。

The memory processing system of current LLMs suffers from a similar pattern.

当前 LLM 的记忆处理系统也遭受着类似的模式。

Their knowledge is limited to either the immediate context that fits into their context window, or the knowledge in MLP layers that stores the long past–before the onset of the “end of pre-training.”

它们的知识要么局限于能够装入“上下文窗口”的即时上下文,要么局限于 MLP 层中存储的“久远过去”的知识——也就是“预训练结束”之前的内容。

This analogy has motivated us to take inspiration from the neurophysiology literature and how the brain consolidates its short-term memories:

这种类比促使我们从神经生理学文献中汲取灵感,探讨大脑如何巩固其短期记忆:

1.1 Human Brain Perspective and Neurophysiological Motivation

1.1 人脑视角与神经生理学动机

Human brain is highly efficient and effective when it comes to continual learning (a.k.a. effective context management), which is often attributed to neuroplasticity—the brain’s remarkable capacity to change itself in response to new experiences, memories, learning, and even damage [46, 47].

在人类大脑中,持续学习(又称为有效的上下文管理)具有极高的效率和效果,这通常归因于“神经可塑性”——即大脑能够根据新的经验、记忆、学习甚至损伤而改变自身结构的卓越能力 [46, 47]。

Recent studies support the view that the formation of long-term memory involves at least two distinct but complementary consolidation processes [48–50]:

近期研究表明,长期记忆的形成至少涉及两种不同但互补的巩固过程 [48–50]:

(1) A rapid “online” consolidation (also known as synaptic consolidation) phase occurs immediately or soon after learning, even during wakefulness.

(1)一种快速的“在线”巩固过程(也称突触巩固),它在学习之后立即发生,或在不久之后发生,甚至可能发生于清醒状态下。

This is when new and initially fragile memory traces are stabilized and begin transferring from short-term to long-term storage;

在这一阶段,新形成且最初脆弱的记忆痕迹会被稳定下来,并开始从短期记忆转移到长期记忆存储中;

(2) An “offline” consolidation (also known as systems consolidation) process replays the recently encoded patterns—during sharp-wave ripples (SWRs) in the hippocampus, coordinated with cortical sleep spindles and slow oscillations—which strengthens and reorganizes the memory and supports its transfer to cortical sites [51–53].

(2)一种“离线”巩固过程(也称系统巩固):在这一过程中,最近编码的模式会被反复重播——例如在海马体的锐波涟漪(SWRs)期间,并与皮层的睡眠纺锤波和慢振荡相协调——这一机制能够强化并重组记忆,并促进其向大脑皮层区域迁移 [51–53]。

Coming back to the analogy of anterograde amnesia, evidence indicates that the condition can impact both stages, but especially the online consolidation phase, mainly because the hippocampus is the gateway for encoding new declarative memories, and so its damage means new information will never be stored in long-term memory.

回到顺行性遗忘症的类比,有证据表明该状况会影响上述两个阶段,但尤其会损害“在线巩固”阶段——主要原因是海马体是编码新陈述性记忆的关键通道,因此海马体受损意味着新信息永远无法进入长期记忆。

As mentioned above, the design of LLMs, and more specifically Transformer-based backbones, suffers from a similar condition after the pre-training phase.

如前所述,LLM 的设计——尤其是基于 Transformer 的主干架构——在预训练阶段之后也遭受类似的问题。

That is, the information provided in the context never impacts the long-term memory parameters (e.g., feedforward layers), and so the model is not capable of acquiring new knowledge or skills, unless the information is still stored in the short-term memory (e.g., attention).

也就是说,在输入上下文中提供的信息永远不会影响模型的长期记忆参数(例如 MLP 前馈层),因此模型无法获得新的知识或技能,除非这些信息仍暂存于短期记忆结构中(例如注意力机制)。

To this end, although the second stage is equally, or even more, crucial for the consolidation of memories, and its absence can damage the process and might cause loss of memory [54, 55], in this work we focus on the first stage: memory consolidation as an online process.

因此,尽管“离线巩固”阶段对记忆巩固同样重要,甚至可能更关键,而其缺失会损害记忆过程、导致记忆丧失 [54, 55],但在本文中,我们主要关注第一个阶段:作为“在线过程”的记忆巩固。

We provide additional discussion on the human brain perspective and its connection to NL in Appendix A.

我们在附录 A 中进一步讨论了人脑视角及其与 NL 的联系。

Notations(符号说明)

Notations. We let x ∈ R^{N×d_in} be the input, Mₜ represent the state of memory/model M at time t, K be the keys, V the values, and Q the query matrices.

符号说明:我们令 x ∈ R^{N×d_in} 为输入;令 Mₜ 表示记忆/模型 M 在时间 t 的状态;令 K 为键矩阵、V 为值矩阵、Q 为查询矩阵。

We use bold lowercase letters with subscript t to refer to the vector corresponding to input t (i.e., xₜ).

我们使用带下标 t 的黑体小写字母来表示与第 t 个输入对应的向量(例如 xₜ)。

We further refer to the distribution of any entity f as p(f).

我们进一步将任何实体 f 的分布记为 p(f)。

Throughout the paper, we use simple MLPs with L_M ≥ 1 layers and residual connections as the architecture of the memory module M(·).

在本文中,我们使用带有 L_M ≥ 1 层并带有残差连接的简单 MLP,作为记忆模块 M(·) 的架构。

When needed, we parameterize the memory module with θ_M ⊇ {W₁, W₂, …, W_{L_M}}, which at least includes the parameters of the linear layers in the MLP.

在需要时,我们将记忆模块参数化为 θ_M ⊇ {W₁, W₂, …, W_{L_M}},这些参数至少包括 MLP 中线性层的参数。

We use superscripts with parentheses to refer to parameters in different levels of nested learning (i.e., different update frequencies): e.g., W^{(ℓ)}.

我们使用带括号的上标表示嵌套学习不同层级(即不同更新频率)中的参数,例如 W^{(ℓ)}。

2 Nested Learning

2 嵌套学习(Nested Learning)

This section discusses the motivations, formal definitions, and general high-level implications of Nested Learning (NL).

本节讨论嵌套学习的研究动机、形式化定义,以及其一般性的高层含义。

We start with a formulation of associative memory and then by using step-by-step examples, we build the intuition behind architecture decomposition and its connection to modeling a neural network as an integrated system of optimization problems.

我们首先从联想记忆(associative memory)的形式化开始,然后通过逐步示例建立一种直观认识,说明如何进行架构分解,以及如何将神经网络建模为一个由多个优化问题组成的集成系统。

We aim to first show how existing methods and concepts in deep learning fall under the NL paradigm and then we present new formulations that go beyond traditional methods and/or provide insights on how to improve existing algorithms and designs.

我们的目标是先展示已有的深度学习方法和概念如何落入 NL 范式之内,随后提出超越传统方法的新形式化,并提供改进现有算法与设计的理论洞见。


Figure 2: The Nested Learning Paradigm represents a machine learning model and its training procedure as a set of nested optimization problems. (Left) An example of a hybrid architecture. While the deep learning perspective, as the flattened image of NL, does not provide insight about the depth of computation in the blocks, NL transparently represents all the inner gradient flows. (Right) A Neural Learning Module: a computational model that learns how to compress its own context flow. For example, the first level corresponds to the model’s outermost training loop, often referred to as the “pre-training” step.

图 2:嵌套学习范式(Nested Learning Paradigm)将一个机器学习模型及其训练过程表示为一组嵌套的优化问题。 (左)一个混合架构的示例。从深度学习视角来看,其“扁平化”的表示无法揭示各个模块内部真实的计算深度;而在 NL 中,所有内部的梯度流都可以被透明地表示出来。 (右)一个“神经学习模块”(Neural Learning Module):一种能够学习如何压缩自身上下文流的计算模型。例如,第一级就对应于模型的最外层训练循环,也就是通常所称的“预训练(pre-training)”步骤。

2.1 Associative Memory

2.1 联想记忆(Associative Memory)

Associative memory—the ability to form and retrieve connections between events—is a fundamental mental process and is an inseparable component of human learning [56].

联想记忆——即在事件之间形成并检索连接的能力——是一种基本的心理过程,也是人类学习中不可分割的组成部分 [56]。

Often in the literature, the concept of memorization and learning are used interchangeably; in neuropsychology literature, however, these two are clearly distinguished.

在文献中,“记忆”和“学习”这两个概念常被互换使用;然而在神经心理学的文献中,这两者是被明确区分的。

More specifically, following neuropsychology literature [57], we build our terminology based on the following definition of memory and learning:

更具体地,遵循神经心理学文献 [57],我们基于以下记忆与学习的定义建立我们的术语:

Learning vs. Memorization:

Memory is a neural update caused by an input, and learning is the process for acquiring effective and useful memory.

学习 vs. 记忆:

记忆是由输入引起的神经更新,而学习 是获得有效且有用记忆的过程。

In this work, our goal is to first show that all the elements of a computational sequence model, including optimizers and neural networks, are associative memory systems that compress their own context flow.

在本研究中,我们的目标首先是展示:计算序列模型的所有组成要素(包括优化器和神经网络)实际上都是“联想记忆系统”,它们的作用是压缩自身的上下文流

Broadly speaking, associative memory is an operator that maps a set of keys to a set of values.

广义而言,联想记忆是一种操作符,它将一组 key(键)映射到一组 value(值)。

We follow the general definition of associative memory by Behrouz et al. [58]:

我们采用 Behrouz 等人 [58] 对联想记忆的通用定义:

Definition 1(联想记忆定义)

Definition 1 (Associative Memory). Given a set of keys K ⊆ R^{d_k} and values V ⊆ R^{d_v}, associative memory is an operator M : K → V that maps the set of keys K to the set of values V. To learn such a mapping from the data, an objective L̃(·; ·) measures the quality of the mapping, and M can be defined as:

M* = arg min_M L̃(M(K); V).    (1)

定义 1(联想记忆)。 给定一组键 K ⊆ R^{d_k} 与一组值 V ⊆ R^{d_v},联想记忆是一个将键映射到值的算子 M : K → V。为了从数据中学习这样的映射,我们引入一个目标函数 L̃(·;·) 来衡量映射质量,并将最优的记忆算子定义为上式(1)。

While the operator itself is a memory and the mapping acts as a memorization process (i.e., memorizing the connections of events in the context), acquiring such effective operator based on the data, is a learning process.

在上述定义中,算子本身即记忆,而映射过程即“记忆化过程”(即:记住上下文中事件之间的关联);而根据数据学习到这样一个有效算子的过程,就是“学习过程”。

It is notable that, here, keys and values can be arbitrary events that the memory aims to map, and are not limited to tokens.

值得注意的是,这里的 key 和 value 可以是任意事件,任何记忆系统希望映射的信息都可以作为 key/value,它们不仅限于 token。

Later in this section, we will discuss that given a context flow, keys and values might be tokens, gradients, sub-sequences, etc.

在本节后续内容中,我们将讨论:在给定一个上下文流时,key/value 可能是 token、梯度、子序列等不同类型的对象。

Furthermore, while the term of associative memory is more common in neuroscience and neuropsychology literature, the above formulation is also closely related to data compression and low-dimensional representation.

此外,尽管“联想记忆”这一术语主要用于神经科学和神经心理学文献,但上述形式化也与数据压缩和低维表示密切相关。

That is, one can interpret the optimization process in Equation 1 as the training process of a network M(·) that aims to compress the mappings into its parameters and so represent them in a lower dimensional space.

也就是说,可以将式(1)中的优化过程理解为:训练一个网络 M(·),使其将 key–value 映射压缩到其参数中,从而在低维空间中表示这些映射。
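作为一个极简的数值示意(并非论文实现;此处目标函数假设为平方误差,数据假设为线性可实现的 key→value 映射),下面的 NumPy 草图用梯度下降训练一个矩阵记忆 M,对应定义 1 中“把映射压缩进参数”的观点:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, d_v, n = 8, 4, 32
K = rng.normal(size=(n, d_k))          # 一组键
W_true = rng.normal(size=(d_v, d_k))   # (假设)用于生成可实现映射的真值算子
V = K @ W_true.T                       # 对应的值

M = np.zeros((d_v, d_k))               # 记忆算子 M: k ↦ M k
eta = 0.1
for _ in range(3000):                  # “学习”= 获得有效记忆的优化过程
    err = K @ M.T - V                  # 当前映射与目标的偏差
    M -= eta * (err.T @ K) / n         # 对 1/(2n)·‖K Mᵀ − V‖² 的梯度步

recall_err = np.abs(K @ M.T - V).max() # “记忆化”质量:训练后的检索误差
assert recall_err < 1e-4
```

其中训练循环对应“学习”(获得有效记忆的过程),训练后的检索误差则衡量“记忆化”的质量。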

In sequence modeling, where keys and values are input tokens (e.g., tokenized text), the choice of objective and the optimization process for solving Equation 1 can result in distinct sequence modeling architectures (see [59] and [58]) such as global/local softmax attention [27], or other modern recurrent models [28, 60, 61].

在序列建模中,当 key 和 value 是输入 token(例如分词后的文本)时,针对公式(1)的目标函数选择与优化过程可能会产生截然不同的序列建模架构(见 [59] 与 [58]),例如全局/局部 softmax 注意力机制 [27],或其他现代的循环模型 [28, 60, 61]。

This simple formulation of sequence models provides us with better understanding of their internal process and also a tool to simply compare their modeling power based on their objective and optimization process.

这个对序列模型的简单形式化,使我们能够更好地理解其内部机制,同时也为基于目标函数与优化过程来比较模型表达能力提供了一个简单工具。

In the following, using step-by-step examples, we discuss how this formulation can be applied to all components of a neural architecture (including its optimization process in pre-training), and how a model is in fact an integrated system of multi-level, nested, and/or parallel memories, each with its own context flow.

接下来,我们将通过逐步示例说明:上述形式化如何应用于神经网络架构的所有组成部分(包括其预训练阶段的优化过程),以及一个模型究竟如何是一个由多层级、嵌套、或并行的记忆系统构成的整体结构,而每个记忆系统都有其自身的上下文流。

Simple Example of MLP Training

一个简单的 MLP 训练示例

We start with a simple example, in which we aim to train a 1-layer MLP (parameterized with W) for task T on dataset D_train = {x₁, …, x_{|D_train|}} by optimizing the objective L(·;·) with gradient descent.

我们从一个简单的示例开始:我们希望针对任务 T,在数据集 D_train = {x₁, …, x_{|D_train|}} 上通过梯度下降优化目标函数 L(·;·),来训练一个单层 MLP(参数为 W)。

In this case, the training process is equivalent to the following optimization problem:

在这种情况下,训练过程等价于以下优化问题:

W* = arg min_W ∑_{x ∈ D_train} L(W; x),    (2)

这是训练目标函数的标准形式。

whose optimization by gradient descent results in a weight update rule equivalent to:

Wₜ₊₁ = Wₜ − ηₜ₊₁ ∇₍Wₜ₎ L(Wₜ; xₜ₊₁) = Wₜ − ηₜ₊₁ (∇ᵧₜ₊₁ L(Wₜ; xₜ₊₁)) xₜ₊₁ᵀ,    (3)

使用梯度下降法求解上述问题,得到的权重更新规则如上式(3);此处利用链式法则,将梯度拆分为对输出 yₜ₊₁ 的梯度与输入向量 xₜ₊₁ 的外积形式。

where yₜ₊₁ = W xₜ₊₁ is the output of the model for input xₜ₊₁.

其中 yₜ₊₁ = W xₜ₊₁ 是模型对输入 xₜ₊₁ 的输出。

Given this formulation, one can let uₜ₊₁ = ∇ᵧₜ₊₁ L(Wₜ; xₜ₊₁) and reformulate the backpropagation process as the solution to an optimization problem on finding an optimal associative memory that maps input data points Dₜᵣₐᵢₙ = {xₜ} to their corresponding uₜ₊₁ values.

根据上述推导,我们可以令 uₜ₊₁ = ∇ᵧₜ₊₁ L(Wₜ; xₜ₊₁),此时反向传播过程即可被重新解释为:寻找一个最优的联想记忆算子,使其将训练数据点 D_train = {xₜ} 映射到对应的 uₜ₊₁ 值。

That is, we let M(·) = Wₜ· parameterize the memory, and use dot-product similarity to measure the quality of Wₜ’s mapping between xₜ₊₁ and ∇ᵧₜ₊₁ L(Wₜ; xₜ₊₁):

Wₜ₊₁ = arg min_W ⟨W xₜ₊₁, uₜ₊₁⟩ + (1 / 2ηₜ₊₁) ‖W − Wₜ‖²_F,

换言之,我们令记忆算子 M(·) = Wₜ·,并使用点积相似度来衡量 Wₜ 将 xₜ₊₁ 映射到 ∇ᵧₜ₊₁ L(Wₜ; xₜ₊₁) 的质量;即:通过最小化点积目标,并加上保持与上一状态接近的正则项来更新记忆。

In the above formulation, uₜ₊₁ = ∇ᵧₜ₊₁ L(Wₜ; xₜ₊₁) can be interpreted as a local surprise signal in representation space that quantifies the mismatch between the current output and the structure the objective L(·;·) enforces.

在上述形式化中,uₜ₊₁ = ∇ᵧₜ₊₁ L(Wₜ; xₜ₊₁) 可以被解释为表示空间中的“局部惊讶信号”(local surprise signal, LSS),用于衡量当前输出与目标函数 L(·;·) 所期望结构之间的偏差程度。

Therefore, this formulation translates the training phase of the model as a process of acquiring effective memory that maps data samples to their Local Surprise Signal (LSS) in representation space–defined as the mismatch between the current output and the structure enforced by the objective L(·;·).

因此,这种形式化把模型的训练阶段重新解释为:获得一种“有效记忆”的过程,这种记忆将数据样本映射到其在表示空间中的局部惊讶信号(LSS)——而 LSS 即“当前输出与目标结构的偏差”。

Accordingly, in this example, our model has a single gradient flow over the data samples, which is only active over the dataset D_train = {x₁, …, x_{|D_train|}} and is frozen for any other data samples afterwards (a.k.a. inference or test time).

因此,在本示例中,模型只有一条关于数据样本的梯度流,它只在训练数据集 D_train = {x₁, …, x_{|D_train|}} 上处于激活状态;而对于训练之后的任何输入(如推理或测试阶段),该梯度流都是“冻结”的。
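正文把反向传播重新解释为“数据点 ↦ 局部惊讶信号”的记忆更新,其关键一步是链式法则分解 ∇_W L = (∇_y L) xᵀ。下面给出一个假设平方损失的最小数值验证草图(非论文代码):

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out = 6, 3
W = rng.normal(size=(d_out, d_in))
x = rng.normal(size=d_in)
target = rng.normal(size=d_out)

def loss(M):                              # (假设)平方损失 L = 1/2‖Mx − target‖²
    return 0.5 * np.sum((M @ x - target) ** 2)

u = W @ x - target                        # 局部惊讶信号 u = ∇_y L(其中 y = W x)
G_outer = np.outer(u, x)                  # 正文的链式法则分解:∇_W L = u xᵀ

eps = 1e-6                                # 用中心差分数值验证该分解
G_num = np.zeros_like(W)
for i in range(d_out):
    for j in range(d_in):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        G_num[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

assert np.allclose(G_num, G_outer, atol=1e-5)
```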

Replacing Gradient Descent with Momentum

(将梯度下降替换为动量 SGD 的示例)

Next, in the above example, we replace the gradient descent algorithm with its enhanced momentum-based variant, resulting in an update rule of:

Wₜ₊₁ = Wₜ − mₜ₊₁,    (7)
mₜ₊₁ = β mₜ + ηₜ₊₁ ∇₍Wₜ₎ L(Wₜ; xₜ₊₁).    (8)

接下来,在上述示例中,我们将梯度下降替换为其增强版——基于动量(momentum)的梯度下降,其更新规则如上:即使用动量项 mₜ₊₁ 替代直接使用梯度,而动量项的更新相当于不断累加过去的梯度(外积结构)。

In Equation 8, given the previous state of Equation 7 (at time t), the value of ∇₍Wₜ₎ L(Wₜ; xₜ₊₁), or similarly ∇ᵧₜ₊₁ L(Wₜ; xₜ₊₁), is independent of the recurrence in Equation 8 and so can be pre-computed beforehand.

在公式(8)中,给定公式(7)在时间 t 的状态,∇₍Wₜ₎ L(Wₜ; xₜ₊₁)(或等价的 ∇ᵧₜ₊₁ L(Wₜ; xₜ₊₁))的值与公式(8)中的递归无关,因此可以提前计算。

To this end, we let uₜ₊₁ = ∇₍Wₜ₎ L(Wₜ; xₜ₊₁), and so Equation 8 can be reformulated as:

mₜ₊₁ = arg min_m −ηₜ₊₁ ⟨m, uₜ₊₁⟩ + (1/2) ‖m − β mₜ‖²_F,    (9)

mₜ₊₁ = arg min_m −ηₜ₊₁ ⟨m xₜ₊₁, ∇ᵧₜ₊₁ L(Wₜ; xₜ₊₁)⟩ + (1/2) ‖m − β mₜ‖²_F.    (10)

因此,我们令 uₜ₊₁ = ∇₍Wₜ₎ L(Wₜ; xₜ₊₁),于是公式(8)可重写为上式:公式(9)表明动量项可视为通过(一步)梯度下降更新的“最优记忆”;公式(10)则是将梯度拆分成对输出的梯度与输入向量的外积后,得到的等价优化形式。

where the optimization problem in Equation 10 is equivalent to one step of gradient descent with adaptive learning rate of ηₜ₊₁.

公式(10)中的优化问题,等价于执行一次(自适应)学习率为 ηₜ₊₁ 的梯度下降步骤。

Given these formulation, one can interpret the momentum term as either:

基于上述形式化,我们可以将动量项解释为:

(1) a key-less associative memory that compresses the gradients into its parameters, or (2) an associative memory that learns how to map data points to their corresponding LSS values.

(1)一种无 key 的联想记忆,将梯度压缩进自身参数中;或 (2)一种学习如何将数据点映射到其对应 LSS 值的联想记忆。

Interestingly, this formulation reveals that gradient descent with momentum is indeed a two-level optimization process, where the memory is optimized by simple gradient descent algorithm.

有趣的是,这种形式化揭示:带动量的梯度下降本质上是一个两层优化过程——外层更新权重,而内层作为“记忆”的动量项则通过简单的梯度下降进行更新。

This process is closely related to Fast Weight Programs (FWPs) [62], where the weight update process (i.e., Equation 7) is the slow network whose momentum weight is generated by a fast network (i.e., Equation 10).

这一过程与快速权重程序(Fast Weight Programs, FWP)[62] 密切相关:在 FWP 中,权重更新过程(即公式 7)被视为“慢网络”(slow network),其动量权重则由一个“快网络”(fast network,对应公式 10)生成。
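上述“两层”解读可以直接数值验证:标准的动量更新,与“内层记忆 m 先对梯度做一步压缩、外层再用 m 更新慢权重”的写法逐步相等。以下为示意草图(β、η 的数值与内层目标的符号约定均为本文假设):

```python
import numpy as np

rng = np.random.default_rng(2)
W0 = rng.normal(size=(3, 4))
beta, eta = 0.9, 0.01

def grad(W):                     # (假设)示例损失 1/2‖W‖² 的梯度,仅作占位
    return W

# 写法一:标准的带动量梯度下降(对应公式 7、8)
W1, m1 = W0.copy(), np.zeros_like(W0)
for _ in range(5):
    m1 = beta * m1 + eta * grad(W1)
    W1 = W1 - m1

# 写法二:两层嵌套 —— 内层把梯度压缩进记忆 m:从 β·m 出发,
# 对内层目标 −η⟨m, g⟩ 做一步步长为 1 的梯度下降;外层再用 m 更新慢权重
W2, m2 = W0.copy(), np.zeros_like(W0)
for _ in range(5):
    g = grad(W2)                 # 内层的“上下文流”是梯度序列
    m2 = beta * m2 - 1.0 * (-eta * g)
    W2 = W2 - m2

assert np.allclose(W1, W2) and np.allclose(m1, m2)
```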

Concluding the above examples, we observed that the training process of a 1-layer MLP with:

总结上述示例,我们可以看到:一个单层 MLP 的训练过程可以被重新解释为以下形式:


(1) Gradient descent is a 1-level associative memory that learns how to map data points to their corresponding LSS-value; and

(1)普通梯度下降(GD) 是一种单层的联想记忆系统,它学习如何将数据点映射到其对应的 LSS 值;

(2) Gradient descent with momentum is a 2-level associative memory (or optimization process), where the inner level learns to store gradient values in its parameters, and the outer level then updates the slow weight (i.e., Wₜ) with the value of the inner-level memory.

(2)带动量的梯度下降(Momentum GD) 是一种两层的联想记忆系统(或两层优化过程):
  • 内层学习将梯度值存入其参数中(即动量项 mₜ);
  • 外层则使用这一记忆(mₜ)来更新“慢权重”Wₜ。

While these are the simplest examples with respect to both architecture and optimizer algorithms, one might ask if a similar conclusion can be made in more complex setups.

虽然这些示例都是在架构与优化算法上最简单的情况,但我们自然会问:在更复杂的设置中,是否也能得到类似的结论?

An Example of Architectural Decomposition

一个架构分解的示例

In the next example, we replace the MLP module with a linear attention [60].

在下一个示例中,我们将 MLP 模块替换为线性注意力(linear attention)[60]。

That is, we aim to train a 1-layer linear attention for task T on a sequence D_train = {x₁, …, x_{|D_train|}} by optimizing the objective L with gradient descent.

也就是说,我们希望通过梯度下降优化目标函数 L,在序列数据 D_train = {x₁, …, x_{|D_train|}} 上训练一个单层的线性注意力模型,来完成任务 T。
Recalling the unnormalized linear attention formulation:

kₜ = xₜ W_k,  vₜ = xₜ Wᵥ,  qₜ = xₜ W_q,

Mₜ = Mₜ₋₁ + vₜ kₜᵀ,    (13)

yₜ = Mₜ qₜ,

回忆非归一化线性注意力(unnormalized linear attention)的形式:第一行利用投影矩阵 W_k、Wᵥ、W_q 从输入 xₜ 生成 kₜ、vₜ 和 qₜ;公式(13)是线性注意力的隐式“记忆矩阵”更新规则;最后一行表明输出由记忆矩阵 Mₜ 与查询向量 qₜ 相乘得到。

As discussed in earlier studies [58, 59], the recurrence in Equation 13 can be reformulated as the optimization process of a matrix-valued associative memory Mₜ(·), in which, it aims to compress the mappings of keys and values into its parameters.

正如先前研究 [58, 59] 所讨论的,公式(13)中的递推关系可以被重新解释为:一个矩阵形式的联想记忆 Mₜ(·) 的优化过程,其目标是将 key–value 映射压缩到自身的参数中。

In more detail, in Definition 1, if we let L̃(Mₜ₋₁; kₜ, vₜ) := −⟨Mₜ₋₁ kₜ, vₜ⟩ and aim to optimize the memory with gradient descent, the memory update rule is (note that ∇L̃ = −vₜ kₜᵀ, and we set the learning rate ηₜ = 1):

Mₜ = Mₜ₋₁ − ηₜ ∇L̃(Mₜ₋₁; kₜ, vₜ) = Mₜ₋₁ + vₜ kₜᵀ,    (15)

更具体地说,根据定义 1,如果我们令 L̃(Mₜ₋₁; kₜ, vₜ) := −⟨Mₜ₋₁ kₜ, vₜ⟩,并通过梯度下降来优化记忆,则记忆的更新规则如上式(15)(注意 ∇L̃ = −vₜ kₜᵀ,且令学习率 ηₜ = 1):该推导显示,线性注意力的记忆更新正是对上述内层目标执行一次梯度下降的结果。

which is equivalent to the update rule of an unnormalized linear attention in Equation 13.

这与公式(13)中给出的“非归一化线性注意力”的更新规则完全一致。
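这一等价关系同样可以数值验证(示意草图,非论文代码;学习率按正文取 ηₜ = 1):

```python
import numpy as np

rng = np.random.default_rng(3)
d_k, d_v, T = 4, 3, 8
ks = rng.normal(size=(T, d_k))
vs = rng.normal(size=(T, d_v))

M_attn = np.zeros((d_v, d_k))               # 线性注意力的记忆矩阵
M_gd = np.zeros((d_v, d_k))                 # 按内层目标做梯度下降的记忆
for k, v in zip(ks, vs):
    M_attn = M_attn + np.outer(v, k)        # 递推:Mₜ = Mₜ₋₁ + vₜ kₜᵀ
    inner_grad = -np.outer(v, k)            # ∇_M [−⟨M k, v⟩] = −v kᵀ
    M_gd = M_gd - 1.0 * inner_grad          # 一步梯度下降,学习率 ηₜ = 1

assert np.allclose(M_attn, M_gd)            # 两条路径逐步相等

q = rng.normal(size=d_k)
y = M_attn @ q                              # 检索:输出 = 记忆矩阵 × 查询向量
assert y.shape == (d_v,)
```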

Also, note that as we observed in the first example, training a linear layer with gradient descent is a 1-layer optimization problem of an associative memory (see Equation 3) and so the general training/updating process of projection layers (i.e., Wₖ, Wᵥ, and W_q) is itself an optimization process of associative memory.

此外,请注意,正如我们在第一个示例中看到的那样,用梯度下降训练线性层本质上是一个“一层联想记忆”的优化问题(见公式 3)。因此,投影层(即 W_k、Wᵥ、W_q)的一般训练/更新过程本身,也是联想记忆的优化过程。

Accordingly, this setup, i.e., training a linear attention with gradient descent, can be seen as a two-level optimization process, where the outer-loop (also known as training process) optimizes the projection layers with gradient descent, while the inner-loop optimizes the inner memory of Mₜ with gradient descent.

因此,在这种设置下——即使用梯度下降训练线性注意力——整个过程可以被视为一个两层的优化体系:

  • 外层(outer-loop):使用梯度下降优化投影层 W_k、Wᵥ、W_q(即训练过程);
  • 内层(inner-loop):使用梯度下降优化线性注意力内部的记忆矩阵 Mₜ。

两者各自拥有独立的记忆更新与梯度流。

Note that, as discussed above, here, we have two associative memories, and so each of which has their own optimization process and gradient flow.

需要注意的是,正如前文所述,在这种架构中存在两个联想记忆系统,并且每一个系统都有其独立的优化过程和梯度流。

That is, in the optimization of outer-level parameters of Wₖ, Wᵥ, and W_q there is no gradient with respect to parameter M(·) and so there is no backpropagation through it.

也就是说,在优化外层参数 W_k、Wᵥ、W_q 时,不会计算关于记忆参数 M(·) 的梯度,因此不会对 M(·) 执行反向传播。

Similarly, in the inner-level, there is no backpropagation through projection layers and they are considered frozen.

同样地,在内层(更新 Mₜ)时,也不会对投影层进行反向传播,投影层在这一过程中被视为“冻结”的。

Furthermore, it is notable that in this example, the above formulation is also closely connected to FWPs perspective of linear attentions [63], where projections are considered slow weights, and memory update in Equation 13 is the fast weight update rule.

此外值得注意的是,在这个例子中,上述形式化同样与 FWP 对线性注意力的视角密切相关 [63]:

  • 投影层被视为慢权重(slow weights)
  • 记忆更新(公式 13)则被视为快权重(fast weights)更新规则

Architectural Decomposition with More Levels

具有更多层级的架构分解

In both above examples, we discussed simple cases, where they can be translated into 2-level optimization processes, which also coincides with their FWPs interpretations.

在以上两个示例中,我们讨论的都是相对简单的情况,它们都可以转换为 两层优化过程,并且其解释方式与 FWP 的观点一致。

In practice, however, we need to use more powerful optimization algorithms to train the model, and/or use more powerful recurrent update rule for memory.

然而,在实际应用中,我们通常需要使用更强大的优化算法来训练模型,和/或使用更复杂的递归更新规则来更新记忆。

As a simple example, assume we use gradient descent with momentum to train a linear attention model.

例如,假设我们使用带动量的梯度下降训练一个线性注意力模型。

In the above examples, we showed how the linear attention component can be decomposed into two nested optimization problems.

在前面的示例中,我们展示了线性注意力组件如何能够分解为两个嵌套的优化问题。

Similarly, here the model can be represented as a 2-level optimization problem, where (1) the inner level optimizes the memory to compress the context using gradient descent (see Equation 15), and (2) the outer level optimizes the projection layers with gradient descent with momentum.

类似地,此时该模型仍可表示为两层优化结构

  1. 内层(inner level):通过梯度下降优化记忆,以压缩上下文(见公式 15);
  2. 外层(outer level):使用带动量的梯度下降来优化投影层。

Interestingly, from the first example, we know that the “gradient descent with momentum” algorithm itself is indeed a 2-level optimization problem, where the momentum term is an associative memory that compresses the past gradients into its parameters.

更有趣的是,从第一个示例我们可以知道:“带动量的梯度下降”本身就是一个两层的优化结构:动量项本身就是一个联想记忆,用来将过去的梯度压缩到其参数中。
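把这三个层级放进一个玩具循环里,可以直观看到各组件更新频率的差异(示意草图:梯度用随机占位符代替真实任务梯度,数值与结构均为本文的假设,仅演示嵌套关系):

```python
import numpy as np

rng = np.random.default_rng(4)
d, seq_len, n_seqs = 4, 16, 8
W = rng.normal(size=(d, d)) * 0.1          # 外层慢权重(以单个投影矩阵示意)
m = np.zeros_like(W)                       # 中层:动量记忆
beta, eta = 0.9, 0.01
updates = {"M_inner": 0, "m_momentum": 0, "W_outer": 0}

for _ in range(n_seqs):
    M = np.zeros((d, d))                   # 内层记忆随每条序列重新开始
    for t in range(seq_len):
        x = rng.normal(size=d)
        k, v = W @ x, x                    # 玩具化的 key/value
        M += np.outer(v, k)                # 内层:每个 token 一步压缩
        updates["M_inner"] += 1
    g = rng.normal(size=W.shape)           # 占位梯度(真实情形来自任务损失)
    m = beta * m + g                       # 中层:把梯度序列压缩进动量记忆
    W = W - eta * m                        # 外层:用动量记忆更新慢权重
    updates["m_momentum"] += 1
    updates["W_outer"] += 1

# 频率层级:f(M) > f(m) = f(W),对应定义 2 的排序
assert updates["M_inner"] == seq_len * n_seqs
assert updates["m_momentum"] == updates["W_outer"] == n_seqs
```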

2.2 Nested Optimization Problems

2.2 嵌套优化问题

In the previous section, we provided examples to demonstrate how one can decompose a machine learning model into a set of nested or multi-level optimization problems.

在前一小节中,我们给出了若干示例,用以说明如何将一个机器学习模型分解为一组嵌套的或多层级的优化问题。

Next, we first aim to present a formal formulation for nested learning problems and then define Neural Learning Module–an integrated computational system that learns from data.

接下来,我们首先给出一个关于嵌套学习问题的形式化表达,然后定义“神经学习模块”(Neural Learning Module)——一种可从数据中学习的综合计算系统。

As we observed in the previous section, while we decomposed the model into a set of optimization processes, it is still unclear whether we can define a hierarchy (or order) over these problems and uniquely represent the model in this format.

正如前文所示,虽然我们能够将模型分解为多个优化过程,但仍不清楚能否在这些优化问题之间定义一个层级结构(或顺序),从而以唯一的形式来表示模型。

Inspired by the hierarchy of brain waves that indicates the information processing frequency rate of each part (discussed in Section 1), we use the update rate of each optimization problem to order the components in multiple levels.

受到大脑脑电波层级结构的启发——脑电波反映了大脑各部分的信息处理频率(在第 1 节中讨论过)——我们使用每个优化问题的更新频率来为组件分层。

To this end, we let one update step over one data point be the unit of time, and define the update frequency of each component as:

为此,我们将“针对一个数据点执行一次更新”视为单位时间,并将每个组件的更新频率定义为:


Definition 2 (Update Frequency). For any component A, which can be a parametric component (e.g., learnable weights or momentum term in gradient descent with momentum) or a non-parametric component (e.g., attention block), we define its frequency, denoted as f_A, as its number of updates per unit of time.

更新频率

定义 2(更新频率)。 对于任意组件 A,它可以是参数化组件(如可学习权重,或带动量梯度下降中的动量项),也可以是非参数组件(如注意力模块)。我们定义其频率 f_A 为:该组件在单位时间内的更新次数。

Given the above update frequency, we can order the components of a machine learning algorithm based on the operator (· ≻ ·).

基于上述更新频率,我们可以使用一个“排序算子”对机器学习算法中的组件进行排序(建立层级)。

We say that A is faster than B, denoted A ≻ B, if: (1) f_A > f_B, or (2) f_A = f_B but the computation of B's state at time t requires the computation of A's state at time t.

我们规定:如果满足以下条件之一,则称 A 快于 B,并记为 A ≻ B:

  1. f_A > f_B,即 A 的更新频率更高;
  2. f_A = f_B,但在时间 t 计算 B 的状态需要先计算 A 的状态。

In this definition, when A ⊁ B and B ⊁ A, we let A ≡_f B, which indicates that A and B have the same update frequency but their computations are independent of each other (later, we provide an example of this case in the AdamW optimizer).

在该定义中,如果既不存在 A ≻ B,也不存在 B ≻ A,我们记为 A ≡_f B,表示 A 与 B 具有相同的更新频率,但它们的计算彼此独立(本文稍后在 AdamW 优化器中给出一个例子)。

Based on the above operator, we sort the components into an ordered set of "levels", where (1) components in the same level have the same update frequency, and (2) the higher the level, the lower its frequency.

基于上述排序算子,我们可以将组件划分为一个有序的“层级集合”,其规则为:

  1. 同一层级的组件具有相同的更新频率;
  2. 层级越高,其更新频率越低。
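As a toy illustration of this ordering, the sketch below groups hypothetical components into levels by their updates-per-unit-time; the component names and rates are invented for illustration and are not from the paper.

作为对这一排序的一个玩具示例,下面的代码按"单位时间更新次数"将若干假想组件分组;组件名称与频率均为示意,并非来自论文。

```python
# Illustrative component frequencies: an attention block is recomputed for
# every token, while slower components update once per chunk or batch.
freqs = {
    "attention":    float("inf"),  # recomputed at every token
    "fast_weights": 1.0,           # updated once per token
    "momentum":     1.0 / 8,       # updated once per 8-token chunk
    "weights":      1.0 / 32,      # updated once per 32-token batch
}

# Sort distinct frequencies descending; level 1 is the fastest,
# and higher levels have lower frequency, matching the text.
levels = {}
for level, f in enumerate(sorted(set(freqs.values()), reverse=True), start=1):
    levels[level] = sorted(name for name, fn in freqs.items() if fn == f)

assert levels[1] == ["attention"]
assert levels[4] == ["weights"]
```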

Notably, based on the above definition, each component has its own optimization problem and, therefore, its own context.

值得注意的是,根据上述定义,每个组件拥有其独立的优化问题,因此也拥有独立的上下文流(context flow)

When we optimize each component's inner objective with gradient-based optimizers, the above statement is equivalent to having an exclusive gradient flow for each component in the model.

虽然我们使用基于梯度的优化器来优化这些内部目标,但上述描述等价于说:模型中的每个组件都拥有专属的梯度流(exclusive gradient flow)

In the general case, however, one can use a non-parametric solution (as we later discuss for attention).

不过在一般情况下,我们也可以采用非参数化的解决方案 (例如本文稍后讨论的注意力机制)。

Neural Learning Module

神经学习模块

Given the above definition of nested learning problems, we define the neural learning module as a new way of representing machine learning models: the model is shown as an interconnected system of components, each with its own gradient flow.

基于上述嵌套学习问题的定义,我们将神经学习模块定义为一种新的机器学习模型表示方式:它将模型呈现为一个由多个组件构成的互联系统,其中每个组件都有其独立的梯度流。

Note that, orthogonal to deep learning, nested learning allows us to define neural learning models with more levels, resulting in more expressive architectures.

需要注意的是,与传统深度学习的视角不同,嵌套学习允许我们定义具有更多层级的神经学习模型,从而得到表达能力更强的架构。


Nested learning allows computational models that are composed of multiple (multi-layer) levels to learn from and process data with different levels of abstraction and time-scales.

嵌套学习使得多层级结构的计算模型能够在 不同抽象层次不同时间尺度上处理与学习数据。

Next, we study optimizers and well-known deep learning architectures from the nested learning perspective, and provide examples of how NL can help enhance those components.

接下来,我们将从嵌套学习的视角考察各类优化器及经典深度学习架构,并给出若干示例,说明 NL 如何帮助增强这些组件。

2.3 Optimizers as Learning Modules

2.3 作为学习模块的优化器

In this section, we start by understanding how well-known optimizers and their variants are special instances of nested learning. Recall the gradient descent method with momentum,

在本节中,我们首先探讨那些众所周知的优化器及其变体如何是嵌套学习(NL)的特殊实例。回顾带动量的梯度下降方法:

(W_{i+1} = W_i + m_{i+1}),  (m_{i+1} = \alpha_{i+1} m_i - \eta_i \nabla\mathcal{L}(W_i; x_i))

where the matrix (or vector) mᵢ is the momentum at step i, and αᵢ and ηᵢ are the adaptive momentum and learning rates, respectively.

其中矩阵(或向量)mᵢ 是第 i 步的动量;而 αᵢ 和 ηᵢ 分别是自适应的动量率与学习率。

Assuming αᵢ₊₁ = 1, the momentum term can be viewed as the result of optimizing the following objective with gradient descent:

当假设 (\alpha_{i+1} = 1) 时,动量项可以被视为通过梯度下降优化下列目标函数而得到的结果:

(\min_m\ \langle m,\ \nabla\mathcal{L}(W_i; x_i)\rangle),即:最小化动量与当前梯度的内积。

This interpretation shows that momentum can indeed be viewed as a meta memory module that learns how to memorize gradients of the objective into its parameters. Building on this intuition, in Section C.4 we show that Adam with a small modification is the optimal associative memory for the models’ gradients.

这种解释表明,动量(momentum)确实可以被视为一种“元记忆模块”(meta memory module),它学习如何将目标函数的梯度记入自身参数中。基于这一直觉,我们在附录 C.4 中展示了:带有一个小改动的 Adam 是用于模型梯度的最优联想记忆。

Next, we show how this perspective can result in designing more expressive optimizers:

接下来,我们展示这种视角如何能够用于设计表达能力更强的优化器:

Extension: More Expressive Association.

扩展:更具表达力的联想机制

As discussed earlier, momentum is a value-less associative memory and so has limited expressive power.

如前所述,动量是一种无 value 的联想记忆 (value-less associative memory),因此其表达能力是有限的。

To address this issue, following the original definition of associative memory (i.e., mapping keys to values), we let the value parameter vᵢ = Pᵢ, so the momentum aims to minimize:

为了解决这一问题,并遵循联想记忆“将 key 映射到 value”这一原始定义,我们令 value 参数 (v_i = P_i),这样动量的优化目标变为最小化:

(\langle m\,\nabla\mathcal{L}(W_i; x_i)^\top,\ P_i\rangle),即:最小化"动量乘以梯度转置"与参数 (P_i) 之间的内积。

using gradient descent, resulting in the update rule:

使用梯度下降,从而得到以下更新规则:

(W_{i+1} = W_i + m_{i+1}),  (m_{i+1} = \alpha_{i+1} m_i - \eta_i\, P_i\, \nabla\mathcal{L}(W_i; x_i)),即:权重使用动量 m 更新,而 m 通过由参数 (P_i) 加权(预条件化)的梯度下降获得。

This formulation is equivalent to preconditioned gradient descent with momentum. In fact, the preconditioning means that the momentum term is an associative memory that learns how to compress the mappings between (P_i) and the gradient term (\nabla \mathcal{L}(W_i; x_i)).

这种形式等价于带预条件化(preconditioning)的动量梯度下降。实际上,预条件化意味着动量项本身就是一个联想记忆,它学习如何压缩 (P_i) 与梯度项 (\nabla\mathcal{L}(W_i; x_i)) 之间的映射关系。

While any reasonable choice of preconditioning (e.g., random features) can improve expressivity, the initial version of GD with momentum per se is still a value-less memory (i.e., it maps all gradients to a single value); the above perspective gives more intuition about which preconditioners are more useful.

虽然任意合理的预条件化方式(如随机特征)都能提高带动量 GD 的表达能力,但原始的动量本质上仍是一个“无 value 的记忆”(即所有梯度被映射到单一数值)。上述视角提供了关于何种预条件化更有意义的直觉。

That is, the momentum acts as a memory that aims to map gradients to their corresponding values, and so a function of the gradients (e.g., information about the Hessian) can provide the memory with more meaningful mappings.

也就是说,动量作为一个记忆模块的目标,是将梯度映射到其对应的“value”;因此,引入梯度的函数(例如包含 Hessian 信息的函数)能够为记忆提供更有意义的映射。

Extension: More Expressive Objectives.

扩展:更具表达力的目标函数。

As discussed by Behrouz et al. [58], optimizing an inner objective of dot-product similarity results in a Hebbian-like update rule, which can make the memory less effective.

正如 Behrouz 等人 [58] 所讨论的,优化一个基于点积相似度的内部目标,会导致得到类似 Hebbian 的更新规则,而这可能使记忆效果变差。

A natural extension of this internal objective is to use an (\ell_2) regression loss (to measure the fitness of the corresponding key-value mapping) and minimize (\lVert m \nabla \mathcal{L}(W_i; x_i)^\top - P_i \rVert_2^2), resulting in the update rule:

对此内部目标的一个自然扩展是使用 (\ell_2) 回归损失(用于衡量 key-value 映射的拟合程度),并最小化 (\lVert m\,\nabla\mathcal{L}(W_i;x_i)^\top - P_i\rVert_2^2),从而得到以下更新规则:

(m_{i+1} = \alpha_{i+1} m_i - \eta_i \big(m_i\, \nabla\mathcal{L}(W_i;x_i)^\top - P_i\big)\, \nabla\mathcal{L}(W_i;x_i)),即:记忆更新中包含一个关于梯度的二次项。

This update is based on the delta rule [64], and so it allows the memory (momentum) to better manage its limited capacity and better memorize the series of past gradients.

此更新基于 delta 规则 [64],因此能够使记忆(动量)更好地管理其有限容量,并更有效地记住一系列过去的梯度。
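As a quick numerical check of the delta-rule form, the sketch below (illustrative NumPy code, not from the paper; the gradient g is stored as a column vector, so `m @ g` plays the role of the text's m ∇L(Wᵢ; xᵢ)ᵀ) performs one inner GD step on the ℓ₂ objective and verifies the equivalent "erase-then-write" form.

作为对 delta 规则形式的一个简单数值验证,下面的 NumPy 示意代码(并非论文代码;梯度 g 以列向量存储)对 ℓ₂ 目标执行一步内层梯度下降,并验证其等价的"先擦除再写入"形式。

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
m = rng.standard_normal((d, d))   # momentum-as-memory (matrix-valued)
g = rng.standard_normal((d, 1))   # current outer gradient (the "key")
P = rng.standard_normal((d, 1))   # value the memory should associate with g
eta = 0.05                        # inner learning rate (illustrative)

# One GD step on ||m g - P||^2: the gradient w.r.t. m is (m g - P) g^T
# (the constant factor 2 is folded into eta).
m_next = m - eta * (m @ g - P) @ g.T

# Equivalent "erase-then-write" (delta-rule) form:
# decay old associations along g, then write the new one.
m_delta = m @ (np.eye(d) - eta * g @ g.T) + eta * P @ g.T

assert np.allclose(m_next, m_delta)
```

The quadratic term g gᵀ is what lets the memory overwrite stale associations along the current key instead of blindly accumulating them, which is the capacity benefit described above.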

Extension: More Expressive Memory.

扩展:更具表达力的记忆机制。

As discussed earlier, momentum can be viewed as a meta memory model that uses a linear layer (i.e., matrix-valued) to compress the past gradient values.

如前所述,动量可视为一种元记忆模型,它使用一个线性层(即矩阵形式)来压缩过去的梯度值。

Due to the linear nature of momentum, only linear functions of past gradients can be learned by its internal objective.

由于动量的线性特性,它的内部目标只能学习到“过去梯度的线性函数”。

To increase the learning capacity of this module, one alternative is to use a more powerful persistent learning module: i.e., replacing the linear, matrix-valued memory of momentum with an MLP.

为了提升这一记忆模块的学习能力,可以采用更强大的持久学习模块:例如,将动量的线性矩阵记忆替换为一个 MLP。

Therefore, momentum, as the memory for past gradients, has more capacity to capture the underlying dynamics of the gradients.

因此,作为“过去梯度的记忆”,这种新的动量机制具有更强的能力来捕捉梯度的底层动态变化。

To this end, we extend the formulation in Equation 17 as:

为此,我们将公式(17)的形式扩展为:

(W_{i+1} = W_i + m_{i+1}\big(\nabla\mathcal{L}(W_i; x_i)\big))

其中 m(·) 不再是线性(矩阵形式)的记忆,而是一个更丰富的记忆函数(如 MLP):权重使用该记忆函数的输出进行更新,而动量本身仍通过对其内部目标(例如点积相似度或 (\ell_2) 回归)执行梯度下降来更新其参数。
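To make the "MLP as momentum memory" idea concrete, here is a small, self-contained NumPy sketch. It is purely illustrative: the two-layer architecture, sizes, inner ℓ₂ loss, and learning rate are all assumptions, not the paper's design. The MLP memory is trained by inner gradient descent to associate a past gradient (the key) with a value.

为了使"将 MLP 作为动量记忆"的想法更具体,下面给出一个自包含的 NumPy 小示例。它纯属示意:两层结构、尺寸、内层 (\ell_2) 损失与学习率均为假设,并非论文设计。该 MLP 记忆通过内层梯度下降学习将一个过去的梯度(key)与一个 value 关联起来。

```python
import numpy as np

rng = np.random.default_rng(4)
d, h = 4, 8
W1 = rng.standard_normal((h, d)) * 0.1   # MLP-memory parameters (the "momentum")
W2 = rng.standard_normal((d, h)) * 0.1

def memory(g):
    """m(g): read the MLP memory at key g."""
    return W2 @ np.tanh(W1 @ g)

def inner_step(g, target, lr=0.01):
    """One GD step on the inner loss ||m(g) - target||^2 w.r.t. (W1, W2)."""
    global W1, W2
    hdn = np.tanh(W1 @ g)
    err = W2 @ hdn - target                      # residual of the association
    dW2 = err @ hdn.T                            # gradient w.r.t. W2
    dW1 = ((W2.T @ err) * (1 - hdn ** 2)) @ g.T  # backprop through tanh
    W1 -= lr * dW1
    W2 -= lr * dW2
    return float((err ** 2).sum())

g = rng.standard_normal((d, 1))    # a past outer gradient (the key)
tgt = rng.standard_normal((d, 1))  # its associated value
losses = [inner_step(g, tgt) for _ in range(200)]

assert losses[-1] < losses[0]      # the non-linear memory is actually learning
```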

Extension: Non Linear Outputs.

扩展:非线性输出。

Building upon the above perspective, in which we see the momentum as a neural architecture, one common technique to enhance the representation power of the momentum memory module is to apply a non-linearity on top of its output [28, 65].

基于上述视角(即将动量视为一种神经网络结构),增强动量记忆模块表达能力的一种常见方法,是在其输出上添加非线性操作 [28, 65]。

That is, we re-formulate Equation 23 as:

即,我们将公式(23)重新写为:

(W_{i+1} = W_i + \sigma\big(m_{i+1}(\nabla\mathcal{L}(W_i; x_i))\big))

其中 σ(·) 是任意非线性函数:权重更新先经过非线性函数 σ(·),以增强表示能力。

where σ(·) is an arbitrary non-linearity. As an example, we let σ(·) = Newton-Schulz(·), where Newton-Schulz(·) is the iterative Newton–Schulz method [66], and let m(·) be a linear layer; the resulting optimizer is equivalent to the Muon optimizer [34].

其中,σ(·) 是一个任意的非线性函数。作为一个例子,我们令 σ(·) = Newton-Schulz(·),其中 Newton-Schulz(·) 是迭代的 Newton–Schulz 方法 [66],并令 m(·) 为一个线性层;在这种设定下,得到的优化器与 Muon 优化器 [34] 等价。
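The Newton–Schulz step can be sketched in a few lines. The code below is an illustrative variant using the plain cubic iteration X ← 1.5X − 0.5·X Xᵀ X (Muon itself uses tuned quintic coefficients), so it demonstrates the idea of σ(·) orthogonalizing the momentum output rather than reproducing Muon exactly.

Newton–Schulz 步骤可以用几行代码示意。下面的代码采用朴素的三次迭代 X ← 1.5X − 0.5·X Xᵀ X(Muon 本身使用调优过的五次系数),因此它演示的是"σ(·) 对动量输出做正交化"这一思想,而非精确复现 Muon。

```python
import numpy as np

def newton_schulz(M, steps=50):
    """Iteratively push M toward the nearest orthogonal matrix.

    Illustrative cubic Newton-Schulz variant: normalize so the spectral
    norm is <= 1, then iterate X <- 1.5 X - 0.5 X X^T X.
    """
    X = M / (np.linalg.norm(M) + 1e-7)  # Frobenius norm bounds the spectral norm
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

rng = np.random.default_rng(2)
m = rng.standard_normal((4, 4))     # a momentum buffer (illustrative)
O = newton_schulz(m)

# The output is (approximately) orthogonal: O O^T ~ I.
assert np.allclose(O @ O.T, np.eye(4), atol=1e-3)
```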

Going Beyond Simple Backpropagation.


超越简单反向传播。

As discussed earlier in Section 2.1, the pre-training process with backpropagation is a form of associative memory, where input data is mapped to the surprise caused by its predicted output, ∇yᵢ L(Wᵢ; xᵢ):

如同我们在第 2.1 节先前讨论的那样,预训练过程与反向传播可以被视为一种联想记忆,其中输入数据被映射到其预测输出所引起的“惊讶量”(surprise),即 ∇yᵢ L(Wᵢ; xᵢ):

(W_{i+1} = W_i - \eta_i\, \nabla_{y_i}\mathcal{L}(W_i; x_i)\, x_i^\top),即:权重更新可以写成输出梯度与输入向量的外积。

which from the associative memory perspective is equivalent to one step of gradient descent in optimization process of:

从联想记忆的角度看,这等价于对下述优化过程执行一步梯度下降:

(\min_W\ \langle W x_i,\ \nabla_{y_i}\mathcal{L}(W_i; x_i)\rangle),即:最小化权重输出与"惊讶信号"之间的点积。

As we discussed in Appendix C, the above formulation ignores the dependencies among data samples such as xᵢ.

正如我们在附录 C 中讨论的,上述形式会忽略数据样本(如 xᵢ)之间的依赖关系。

To extend it to a more powerful formulation that also considers the dependencies among data points (which is extremely important when we use the optimizer in the token space, as tokens are not independent), we use an L₂ regression objective with one step of gradient descent as follows:

为了将其扩展为更强大的形式,使其能够考虑数据点之间的依赖(尤其在 token 空间中使用优化器时,它们并非独立),我们使用 (\ell_2) 回归目标,并执行一步梯度下降,其形式如下:

(\min_W\ \lVert W x_i + \nabla_{y_i}\mathcal{L}(W_i; x_i)\rVert_2^2),即:以二范数回归的方式,将权重输出与惊讶梯度相关联。

This formulation results in a new variant of gradient descent, which can be simplified as follows:

这种形式会导出一种新的梯度下降变体,其可简化如下:

(W_{i+1} = W_i\big(I - \eta_i\, x_i x_i^\top\big) - \eta_i\, \nabla_{y_i}\mathcal{L}(W_i; x_i)\, x_i^\top)

即:相比标准的外积形式更新,多出了一个 (I - \eta_i x_i x_i^\top) 的投影项。
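The sketch below contrasts plain backprop-as-memory (an outer-product write) with this ℓ₂ variant; the extra projection term falls out of the algebra. All shapes and values are illustrative, and the sign convention follows the derivation above, not any official implementation.

下面的示意代码对比了"作为记忆的朴素反向传播"(外积写入)与上述 (\ell_2) 变体;多出的投影项由代数推导自然得到。所有形状与数值均为示意,符号约定沿用上文推导,而非任何官方实现。

```python
import numpy as np

rng = np.random.default_rng(3)
d_in, d_out = 5, 3
W = rng.standard_normal((d_out, d_in))
x = rng.standard_normal((d_in, 1))    # current input token/sample
gy = rng.standard_normal((d_out, 1))  # output "surprise" grad_y L
eta = 0.1

# Plain backprop-as-memory: only the outer-product write.
W_plain = W - eta * gy @ x.T

# l2 variant: one GD step on ||W x + gy||^2 gives gradient (W x + gy) x^T.
W_l2 = W - eta * (W @ x + gy) @ x.T

# Same update, rearranged: a projection term plus the usual write,
# matching W (I - eta x x^T) - eta gy x^T from the text.
W_l2_alt = W @ (np.eye(d_in) - eta * x @ x.T) - eta * gy @ x.T

assert np.allclose(W_l2, W_l2_alt)
```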

Here, we use this optimizer as the internal optimizer of our HOPE architecture.

在此,我们将该优化器作为 HOPE 架构的内部优化器。

3 HOPE: A Self-Referential Learning Module with Continuum Memory

3 HOPE:具有连续体记忆的自指学习模块

Existing architectural backbones consist of (1) a working memory module (e.g., attention), which is responsible for actively fusing information across the sequence length, and (2) a feed-forward layer (e.g., MLP) that fuses information across features and acts as the persistent memory or knowledge storage of the pre-training phase.

现有的架构骨干通常由两部分组成: (1) 工作记忆模块(例如注意力机制),负责在序列长度维度上主动融合信息; (2) 前馈层(例如 MLP),在特征维度融合信息,并在预训练阶段充当持久记忆或知识存储。

From the NL perspective, pre-training is the phase in which the outermost level of the learning module is updated over its limited context flow. Accordingly, in the continual setup, this outermost level is rarely updated over time, and so its corresponding knowledge storage rarely needs to be updated.

从 NL 的视角看,预训练阶段是“学习模块最外层”在其有限上下文流上被更新的阶段。因此在持续学习情境中,预训练阶段的知识存储本身也应极少更新。

Given this intuition, we extend the traditional viewpoint of the long-term/short-term memory system and suggest a knowledge-storage feed-forward block for each level (frequency domain).

基于这一直觉,我们扩展传统的“长期记忆/短期记忆系统”观念,并提出在每个(频率域)层级设置一个知识存储前馈结构。

Given the definition of frequency, the Continuum Memory System (CMS) is formalized as a chain of MLP blocks MLP^{(f₁)}(·), …, MLP^{(f_k)}(·), each of which is associated with a chunk size of C^{(ℓ)} = max_t C^{(ℓ)}_t, such that given input x = {x_1, …, x_T} the output of the chain is calculated as (we disregard normalizations for the sake of clarity):

根据频率的定义,"连续体记忆系统"(CMS)被形式化为一个 MLP 模块链:MLP^{(f₁)}(·), …, MLP^{(f_k)}(·),每个模块对应一个块大小 C^{(ℓ)} = \max_t C^{(ℓ)}_t。对于输入序列 (x = {x_1, \dots, x_T}),该 MLP 链的输出计算如下(为清晰起见省略归一化):

(y_t = \text{MLP}^{(f_k)}\big(\text{MLP}^{(f_{k-1})}(\cdots\, \text{MLP}^{(f_1)}(x_t))\big))

where the parameters of ℓ-th MLP block, i.e., θ_t^{(ℓ)}, are updated every C^{(ℓ)} steps:

其中第 ℓ 个 MLP 模块的参数 θ_t^{(ℓ)} 每隔 C^{(ℓ)} 步更新一次:

(\theta^{(\ell)}_{t+1} = \theta^{(\ell)}_{t} - \eta^{(\ell)}_t \sum_{i=t-C^{(\ell)}+1}^{t} f\big(\theta^{(\ell)}_{t}; x_i\big))(当 (C^{(\ell)} \mid t) 时;否则 (\theta^{(\ell)}_{t+1} = \theta^{(\ell)}_{t}))。即:每 (C^{(\ell)}) 步进行一次块式(chunk-wise)的参数更新。

In Appendix B.1, we discuss different variants of this formulation, including fully nested MLP layers. Here, η_t^{(ℓ)} are the learning rates corresponding to θ^{(ℓ)}, and f(·) is the error component of an arbitrary optimizer (e.g., ∇_{θ^{(ℓ)}} f(θ_t^{(ℓ)}; x_t) in gradient descent).

在附录 B.1 中,我们讨论了这一形式的不同变体,包括完全嵌套的 MLP 层。其中 η_t^{(ℓ)} 是对应 θ^{(ℓ)} 的学习率,而 f(·) 是任意优化器的误差分量(例如梯度下降中的 ∇_{θ^{(ℓ)}} f(θ_t^{(ℓ)}; x_t))。
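The multi-frequency update schedule can be illustrated with a toy loop: each level accumulates an error signal at every token but only applies it when its own chunk-size clock ticks. Chunk sizes and the scalar "parameters" below are invented for illustration and are not the paper's configuration.

这种多频率更新的调度方式可以用一个玩具循环来示意:每个层级在每个 token 处累积误差信号,但只在其自身的块大小"时钟"触发时才应用更新。下面的块大小与标量"参数"均为示意,并非论文中的配置。

```python
# Toy Continuum-Memory-style schedule: level 0 is the fastest,
# higher levels update less frequently (larger chunk size).
C = [1, 4, 16]            # chunk size per level (illustrative)
theta = [0.0, 0.0, 0.0]   # one scalar "parameter" per level
accum = [0.0, 0.0, 0.0]   # accumulated error signal per level
updates = [0, 0, 0]       # how often each level actually updated

T = 16
for t in range(1, T + 1):
    for lvl, chunk in enumerate(C):
        accum[lvl] += 1.0           # stand-in for f(theta; x_t)
        if t % chunk == 0:          # this level's clock ticks
            theta[lvl] -= 0.1 * accum[lvl]
            accum[lvl] = 0.0
            updates[lvl] += 1

# Faster levels update more often: 16, 4, and 1 times over 16 tokens.
assert updates == [16, 4, 1]
```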

The conventional Transformer block [27] is a special instance of this formulation, where k = 1. Notably, Equation 31 provides an important interpretation: the parameters θ^{(ℓ)} are responsible for compressing their own context into their parameters, and so they are representatives of the abstract knowledge of their context.

传统的 Transformer 块 [27] 是该形式的一个特殊实例,其中 k = 1。值得注意的是,公式(31)提供了一个重要解释:参数 θ^{(ℓ)} 负责将其自身的上下文压缩进其参数中,因此它们代表着该上下文的抽象知识。

HOPE. We further present a self-referential learning system that leverages Titans [28] and our variant of gradient descent discussed in Section B.1. Combining this self-referential sequence model with the continuum memory system results in the HOPE architecture.

HOPE。我们进一步提出一种自指学习系统(self-referential learning system),它结合了 Titans [28] 以及我们在附录 B.1 中讨论的梯度下降变体。将这种自指序列模型与连续体记忆系统结合,即构成 HOPE 架构。


Figure 3: A comparison of HOPE architectural backbone with Transformers (Normalization and potential data-dependent components are removed for the sake of clarity).

图 3:HOPE 架构骨干与 Transformer 的比较(为清晰起见,已移除归一化与可能与数据相关的组件)。


Table 1: Performance of HOPE and baselines on language modeling and common-sense reasoning tasks. Hybrid models are marked with ∗.

表 1:HOPE 及基线模型在语言建模与常识推理任务上的性能表现。带 ∗ 的模型表示混合模型。

4 Experiments

4 实验

For the sake of space, in the main paper we report the results of HOPE's evaluation on language modeling and common-sense reasoning tasks.

为了节省篇幅,在主文中,我们报告 HOPE 在语言建模和常识推理任务上的评估结果。

However, we report an extensive set of results in the appendix, including experiments on optimizers, the emergence of in-context learning, the continual learning abilities of HOPE, ablation studies, long-context tasks, etc.

然而,在附录中我们报告了更加全面的结果,包括对优化器的实验、上下文学习(in-context learning)的出现、HOPE 的持续学习能力、消融实验、长上下文任务等内容。

Details about the experimental setups and other used datasets are in Appendix G.

关于实验设置以及所使用的其他数据集的详细信息见附录 G。

Language Modeling and Common-sense Reasoning

语言建模与常识推理

We follow recent sequence modeling studies [28, 67, 68] and report the results of HOPE and baselines with sizes of 340M, 760M, and 1.3B parameters on language modeling as well as common-sense reasoning downstream tasks.

我们遵循近期序列建模研究 [28, 67, 68] 的设置,报告 HOPE 及其基线模型(规模为 340M、760M 和 1.3B)在语言建模和常识推理下游任务上的结果。

These results are reported in Table 1.

这些结果展示在表 1 中。

HOPE demonstrates strong performance across all scales and benchmark tasks, outperforming both Transformers and recent modern recurrent neural networks, including Gated DeltaNet and Titans.

HOPE 在所有模型规模与基准任务上都表现出非常优异的性能,超过了 Transformer 以及最近的现代循环神经网络(包括 Gated DeltaNet 和 Titans)。

Comparing HOPE to Titans and Gated DeltaNet, we can see that dynamically changing the key, value, and query projections based on the context, as well as a deep memory module, can result in a model with lower perplexity and higher accuracy on the benchmarks.

将 HOPE 与 Titans 和 Gated DeltaNet 进行比较,我们可以看到:根据上下文动态地改变 key、value 和 query 的投影方式,再加上一个深度记忆模块,可以得到一个在基准测试中同时具有更低困惑度(perplexity)和更高准确率的模型。

