Nested Learning: The Illusion of Deep Learning Architectures
Authors
Ali Behrouz, Google Research, USA, alibehrouz@google.com
Meisam Razaviyayn, Google Research, USA, razaviyayn@google.com
Peilin Zhong, Google Research, USA, peilinz@google.com
Vahab Mirrokni, Google Research, USA, mirrokni@google.com
Abstract
Over the last decades, developing more powerful neural architectures and simultaneously designing optimization algorithms to effectively train them have been the core of research efforts to enhance the capability of machine learning models.
Despite recent progress, particularly in developing Language Models (LMs), there are fundamental challenges and unanswered questions about how such models can continually learn/memorize, self-improve, and find “effective solutions”.
In this paper, we present a new learning paradigm, called Nested Learning (NL), that coherently represents a model with a set of nested, multi-level, and/or parallel optimization problems, each with its own “context flow”.
NL reveals that existing deep learning methods learn from data through compressing their own context flow, and explains how in-context learning emerges in large models.
NL suggests a path (a new dimension to deep learning) to design more expressive learning algorithms with more “levels”, resulting in higher-order in-context learning abilities.
In addition to its neuroscientifically plausible and mathematically white-box nature, we advocate for its importance by presenting three core contributions:
(1) Deep Optimizers: Based on NL, we show that well-known gradient-based optimizers (e.g., Adam, SGD with Momentum, etc.) are in fact associative memory modules that aim to compress the gradients with gradient descent.
Building on this insight, we present a set of more expressive optimizers with deep memory and/or more powerful learning rules;
(2) Self-Modifying Titans: Taking advantage of NL’s insights on learning algorithms, we present a novel sequence model that learns how to modify itself by learning its own update algorithm;
and (3) Continuum Memory System: We present a new formulation for memory systems that generalizes the traditional viewpoint of “long-term/short-term memory”.
Combining our self-modifying sequence model with the continuum memory system, we present a learning module, called HOPE, showing promising results in language modeling, continual learning, and long-context reasoning tasks.
This version of the paper has been extensively summarized to fit the page limit of NeurIPS camera ready, and some materials, experiments, discussions, and methods are moved to appendix, which might make some parts hard to follow or cause inconsistencies.
To avoid such cases, please read our arXiv version instead [1] (will be available on November 13).
39th Conference on Neural Information Processing Systems (NeurIPS 2025).
Figure 1: The uniform and reusable structure, as well as multi-time-scale updates in the brain, are the key components that unlock continual learning in humans. Nested Learning (NL) allows for multi-time-scale updates for each component of the brain, while showing that well-known architectures such as Transformers are in fact linear layers with different update frequencies.
For decades, AI research has focused on designing machine learning algorithms that learn from data [2–5] or experience [6–8], often by optimizing an objective $\mathcal{L}(\theta)$ over parameters $\theta \in \Theta$ with gradient-based methods.
While traditional machine learning techniques required careful engineering and domain expertise to design feature extractors, limiting their ability to directly process and learn from natural data [9], deep representation learning offered a fully automated alternative to discover the representations needed for the task.
Thereafter, deep learning has been an inseparable part of the large-scale computational models with seminal success in chemistry and biology [10], games [11, 12], computer vision [13, 14], and multimodal and natural language understanding [15–17].
Stacking of multiple layers, as it is done in deep learning models, provides the models with larger capacity, better expressive power in representing complex features, and more internal computations (e.g., #FLOPS) [18–20], all of which are critical and desirable characteristics for static tasks that require in-distribution predictions over a previously fixed set.
This deep design, however, is not a universal solution to all the challenges and cannot help the expressive power of the models in multiple aspects, for example:
(i) The computational depth of deep models might not change with more layers [21, 22], leaving their ability to implement complex algorithms untouched compared to traditional shallow approaches [23];
(ii) The capacity of some classes of parameters might show only marginal improvement with increasing depth/width of the model [24];
(iii) The training process might converge to a suboptimal solution, mainly due to a suboptimal choice of the optimizer or its hyperparameters;
and (iv) The model’s ability to adapt quickly to a new task, continually learn, and/or generalize to out-of-distribution data might not change with stacking more layers and requires more careful designs.
The core efforts to overcome the above challenges and to enhance the capability of deep learning models concentrate on:
(1) developing more expressive classes of parameters (i.e., neural architectures) [13, 25–28];
(2) introducing objectives that can better model the tasks [29–32];
(3) designing more efficient/effective optimization algorithms to find better solutions or with more resilience to forgetting [33–36];
and (4) scaling the model size to enhance its expressivity, when the “right” choice of architecture, objective, and optimization algorithm is made [24, 37, 38].
Collectively, these advancements and new findings on scaling patterns of deep models have established the foundations upon which Large Language Models (LLMs) have been built.
The development of LLMs marks a pivotal milestone in deep learning research: a paradigm shift from task-specific models to more general-purpose systems with various emergent capabilities as a result of scaling the “right” architectures [38, 39].
Despite all their success and remarkable capabilities in diverse sets of tasks [15, 40, 41], LLMs are largely static after their initial deployment phase, meaning that they successfully perform tasks learned during pre- or post-training, but are unable to continually acquire new capabilities beyond their immediate context.
The only adaptable component of LLMs is their in-context learning ability, a (known to be emergent) characteristic of LLMs that enables fast adaptation to the context and thus the ability to perform zero- or few-shot tasks [38].
Beyond in-context learning, recent efforts to overcome the static nature of LLMs are computationally expensive, require external components, lack generalization, and/or might suffer from catastrophic forgetting [42–44], which has led researchers to question whether there is a need to revisit how machine learning models are designed and whether a new learning paradigm beyond stacking of layers is required to unleash the capabilities of LLMs in continual setups.
Current Models only Experience the Immediate Present.
As an analogy and to better illustrate the static nature of LLMs, we use the example of anterograde amnesia, a neurological condition where a person cannot form new long-term memories after the onset of the disorder, while existing memories remain intact [45].
This condition limits the person’s knowledge and experiences to a short window of the present and the long past (before the onset of the disorder), which results in continuously experiencing the immediate present as if it were always new.
The memory processing system of current LLMs suffers from a similar pattern.
Their knowledge is limited to either the immediate context that fits into their context window, or the knowledge in MLP layers that stores the long past, before the onset of “end of pre-training.”
This analogy has motivated us to take inspiration from the neurophysiology literature and how the brain consolidates its short-term memories:
1.1 Human Brain Perspective and Neurophysiological Motivation
The human brain is highly efficient and effective when it comes to continual learning (a.k.a. effective context management), which is often attributed to neuroplasticity, the brain’s remarkable capacity to change itself in response to new experiences, memories, learning, and even damage [46, 47].
This is when new and initially fragile memory traces are stabilized and begin transferring from short-term to long-term storage;
(2) An “offline” consolidation (also known as systems consolidation) process, which replays the recently encoded patterns during sharp-wave ripples (SWRs) in the hippocampus, coordinated with cortical sleep spindles and slow oscillations, strengthens and reorganizes the memory and supports its transfer to cortical sites [51–53].
Coming back to the analogy of anterograde amnesia, evidence indicates that the condition can impact both stages, but especially the online consolidation phase, mainly because the hippocampus is the gateway for encoding new declarative memories, and so its damage means new information will never be stored in long-term memory.
As mentioned above, the design of LLMs, and more specifically Transformer-based backbones, suffers from a similar condition after the pre-training phase.
That is, the information provided in the context never impacts the long-term memory parameters (e.g., feedforward layers), and so the model is not capable of acquiring new knowledge or skills unless the information is still stored in the short-term memory (e.g., attention).
To this end, although the second stage is equally, or even more, crucial for the consolidation of memories, and its absence can damage the process and might cause loss of memory [54, 55], in this work, we focus on the first stage: memory consolidation as an online process.
We provide additional discussion on the human brain perspective and its connection to NL in Appendix A.
Notations. We let $x \in \mathbb{R}^{N \times d_{\text{in}}}$ be the input, $\mathcal{M}_t$ represent the state of memory/model $\mathcal{M}$ at time $t$, $K$ be the keys, $V$ be the values, and $Q$ be the query matrices.
We use bold lowercase letters with a subscript $t$ to refer to the vector that corresponds to input $t$ (i.e., $\mathbf{k}_t$, $\mathbf{v}_t$, and $\mathbf{q}_t$).
We further refer to the distribution of any entity $f$ as $p(f)$.
Throughout the paper, we use simple MLPs with $L_{\mathcal{M}} \geq 1$ layers and residual connections as the architecture of the memory module $\mathcal{M}(\cdot)$.
When needed, we parameterize the memory module with $\theta_{\mathcal{M}} \supseteq \{W_1, W_2, \dots, W_{L_{\mathcal{M}}}\}$, which at least includes the parameters of the linear layers in the MLP.
We use superscripts in parentheses to refer to parameters at different levels of nested learning (different update frequencies): i.e., $W^{(\ell)}$.
This section discusses the motivations, formal definitions, and general high-level implications of Nested Learning (NL).
We start with a formulation of associative memory and then by using step-by-step examples, we build the intuition behind architecture decomposition and its connection to modeling a neural network as an integrated system of optimization problems.
We aim to first show how existing methods and concepts in deep learning fall under the NL paradigm and then we present new formulations that go beyond traditional methods and/or provide insights on how to improve existing algorithms and designs.
Figure 2: The Nested Learning paradigm represents a machine learning model and its training procedure as a set of nested optimization problems. (Left) An example of a hybrid architecture. While the deep learning perspective, as the flattened image of NL, does not provide insight about the depth of computation in the blocks, NL transparently represents all the inner gradient flows. (Right) A Neural Learning Module: a computational model that learns how to compress its own context flow. For example, the first level corresponds to the model’s outermost training loop, often referred to as the “pre-training” step.
Associative memory, the ability to form and retrieve connections between events, is a fundamental mental process and an inseparable component of human learning [56].
Often in the literature, the concepts of memorization and learning are used interchangeably; in the neuropsychology literature, however, the two are clearly distinguished.
More specifically, following the neuropsychology literature [57], we build our terminology on the following definition of memory and learning:
Learning vs. Memorization:
Memory is a neural update caused by an input, and learning is the process for acquiring effective and useful memory.
In this work, our goal is to first show that all the elements of a computational sequence model, including optimizers and neural networks, are associative memory systems that compress their own context flow.
Broadly speaking, associative memory is an operator that maps a set of keys to a set of values.
We follow the general definition of associative memory by Behrouz et al. [58]:
Definition 1 (Associative Memory). Given a set of keys $\mathcal{K} \subseteq \mathbb{R}^{d_k}$ and values $\mathcal{V} \subseteq \mathbb{R}^{d_v}$, associative memory is an operator $\mathcal{M} : \mathcal{K} \to \mathcal{V}$ that maps the set of keys $\mathcal{K}$ to the set of values $\mathcal{V}$. To learn such a mapping from the data, an objective $\tilde{\mathcal{L}}(\cdot\,; \cdot)$ measures the quality of the mapping, and $\mathcal{M}$ can be defined as:
$$\mathcal{M}^{*} = \arg\min_{\mathcal{M}} \; \tilde{\mathcal{L}}\big(\mathcal{M}(\mathcal{K}); \mathcal{V}\big). \tag{1}$$
While the operator itself is a memory and the mapping acts as a memorization process (i.e., memorizing the connections of events in the context), acquiring such an effective operator from the data is a learning process.
Furthermore, while the term associative memory is more common in the neuroscience and neuropsychology literature, the above formulation is also closely related to data compression and low-dimensional representation.
That is, one can interpret the optimization process in Equation 1 as the training process of a network M(·) that aims to compress the mappings into its parameters and so represent them in a lower dimensional space.
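As a minimal sketch of this compression view, the NumPy snippet below fits a linear memory to a handful of key-value pairs by gradient descent and then queries it with the stored keys. The squared-error objective, the step size, the toy keys and values, and the helper name `fit_linear_memory` are illustrative assumptions rather than choices made in the paper.

```python
import numpy as np

def fit_linear_memory(keys, values, lr=0.5, steps=60):
    """Fit a linear memory M (so that M @ k is close to v) by gradient descent.

    Assumed inner objective: 0.5 * sum_i ||M k_i - v_i||^2.
    """
    d_v, d_k = values.shape[1], keys.shape[1]
    M = np.zeros((d_v, d_k))
    for _ in range(steps):
        residual = keys @ M.T - values   # (n, d_v): M k_i - v_i for every pair
        grad = residual.T @ keys         # (d_v, d_k): gradient of the objective w.r.t. M
        M -= lr * grad
    return M

# Three orthonormal keys, each paired with a 2-dimensional value (toy data).
keys = np.eye(3)
values = np.array([[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]])

M = fit_linear_memory(keys, values)
retrieved = keys @ M.T                   # query the memory with the stored keys
max_error = float(np.abs(retrieved - values).max())
```

With orthonormal keys the memory can store every pair exactly, so the retrieval error shrinks geometrically; with more pairs than key dimensions, the same procedure would return a lossy compression of the mappings instead.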
In sequence modeling, where keys and values are input tokens (e.g., tokenized text), the choice of objective and the optimization process for solving Equation 1 can result in distinct sequence modeling architectures (see [59] and [58]) such as global/local softmax attention [27], or other modern recurrent models [28, 60, 61].
This simple formulation of sequence models provides us with better understanding of their internal process and also a tool to simply compare their modeling power based on their objective and optimization process.
In the following, using step-by-step examples, we discuss how this formulation can be applied to all components of a neural architecture (including its optimization process in pre-training) and, in fact, how a model is an integrated system of multi-level, nested, and/or parallel memories, each with its own context flow.
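We start with a simple example, in which we aim to train a 1-layer MLP (parameterized with $W$) for task $\mathcal{T}$ on dataset $\mathcal{D}_{\text{train}} = \{x_1, \dots, x_{|\mathcal{D}_{\text{train}}|}\}$ by optimizing the objective $\mathcal{L}(\cdot\,; \cdot)$ with gradient descent.
In this case, the training process is equivalent to the following optimization problem:
$$W^{*} = \arg\min_{W} \; \mathcal{L}(W; \mathcal{D}_{\text{train}}),$$
whose optimization by gradient descent results in a weight update rule equivalent to:
$$W_{t+1} = W_t - \eta_{t+1} \nabla_{W_t} \mathcal{L}(W_t; x_{t+1}) = W_t - \eta_{t+1} \nabla_{y_{t+1}} \mathcal{L}(W_t; x_{t+1}) \otimes x_{t+1}, \tag{3}$$
where $y_{t+1} = W x_{t+1}$ is the output of the model for input $x_{t+1}$, and the second equality follows from the chain rule, which splits the gradient into the outer product of the gradient with respect to the output and the input vector.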
Given this formulation, one can let $u_{t+1} = \nabla_{y_{t+1}} \mathcal{L}(W_t; x_{t+1})$ and reformulate the backpropagation process as the solution to an optimization problem: finding an optimal associative memory that maps input data points $\mathcal{D}_{\text{train}} = \{x_t\}$ to their corresponding $u_{t+1}$ values.
That is, we let $\mathcal{M}(\cdot) = W_t\,\cdot$ parameterize the memory, and use dot-product similarity to measure the quality of $W_t$’s mapping between $x_{t+1}$ and $\nabla_{y_{t+1}} \mathcal{L}(W_t; x_{t+1})$:
$$W_{t+1} = \arg\min_{W} \; \langle W x_{t+1}, u_{t+1} \rangle + \frac{1}{2\eta_{t+1}} \lVert W - W_t \rVert_F^2 ,$$
that is, the memory is updated by minimizing the dot-product term plus a regularizer that keeps it close to its previous state.
In the above formulation, uₜ₊₁ = ∇ᵧₜ₊₁ L(Wₜ; xₜ₊₁) can be interpreted as a local surprise signal in representation space that quantifies the mismatch between the current output and the structure the objective L(·;·) enforces.
Therefore, this formulation translates the training phase of the model as a process of acquiring effective memory that maps data samples to their Local Surprise Signal (LSS) in representation space–defined as the mismatch between the current output and the structure enforced by the objective L(·;·).
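The equivalence between a gradient-descent step and an associative-memory update can be checked numerically. The sketch below assumes a squared-error loss and assumes the memory objective takes the proximal form $\langle W x, u \rangle + \lVert W - W_t \rVert_F^2 / (2\eta)$; the dimensions, the random data, and the helper name `memory_objective_grad` are hypothetical. It verifies that the ordinary gradient-descent step is a stationary point of that (strictly convex) objective and is therefore its minimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, eta = 4, 3, 0.1

W_t = rng.normal(size=(d_out, d_in))
x = rng.normal(size=d_in)
target = rng.normal(size=d_out)

# Assumed loss: L(W; x) = 0.5 * ||W x - target||^2, so the local surprise signal is
# u = dL/dy = W_t x - target, and dL/dW = outer(u, x).
u = W_t @ x - target
W_gd = W_t - eta * np.outer(u, x)        # plain gradient-descent step

def memory_objective_grad(W):
    """Gradient of <W x, u> + ||W - W_t||_F^2 / (2 * eta) with respect to W."""
    return np.outer(u, x) + (W - W_t) / eta

# The gradient vanishes at the gradient-descent step, up to floating-point error.
grad_norm = float(np.linalg.norm(memory_objective_grad(W_gd)))
```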
Accordingly, in this example, our model has a single gradient flow over the data samples, which is only active over the dataset $\mathcal{D}_{\text{train}} = \{x_1, \dots, x_{|\mathcal{D}_{\text{train}}|}\}$ and will be frozen for any other data samples afterwards (a.k.a. inference or test time).
Next, in the above example, we replace the gradient descent algorithm with its enhanced momentum-based variant, resulting in an update rule of:
$$W_{t+1} = W_t + m_{t+1}, \tag{7}$$
$$m_{t+1} = m_t - \eta_{t+1} \nabla_{W_t} \mathcal{L}(W_t; x_{t+1}) = m_t - \eta_{t+1} \nabla_{y_{t+1}} \mathcal{L}(W_t; x_{t+1}) \otimes x_{t+1}, \tag{8}$$
that is, the weights are updated with the momentum term $m_{t+1}$ rather than with the gradient directly, and the momentum term accumulates the outer-product structure of past gradients.
In Equation 8, given the previous state of Equation 7 (at time $t$), the value of $\nabla_{W_t} \mathcal{L}(W_t; x_{t+1})$, or similarly $\nabla_{y_{t+1}} \mathcal{L}(W_t; x_{t+1})$, is independent of the recurrence in Equation 8 and so can be pre-computed.
To this end, we let $u_{t+1} = \nabla_{W_t} \mathcal{L}(W_t; x_{t+1})$, and so Equation 8 can be reformulated as:
$$W_{t+1} = W_t + m_{t+1},$$
$$m_{t+1} = \arg\min_{m} \; \langle m, u_{t+1} \rangle + \frac{1}{2\eta_{t+1}} \lVert m - m_t \rVert_F^2 = \arg\min_{m} \; \langle m\, x_{t+1}, \nabla_{y_{t+1}} \mathcal{L}(W_t; x_{t+1}) \rangle + \frac{1}{2\eta_{t+1}} \lVert m - m_t \rVert_F^2, \tag{10}$$
where the optimization problem in Equation 10 is equivalent to one step of gradient descent with adaptive learning rate $\eta_{t+1}$.
Given this formulation, one can interpret the momentum term as either:
(1) a key-less associative memory that compresses the gradients into its parameters, or (2) an associative memory that learns how to map data points to their corresponding LSS values.
Interestingly, this formulation reveals that gradient descent with momentum is indeed a two-level optimization process, where the memory is optimized by simple gradient descent algorithm.
This process is closely related to Fast Weight Programs (FWPs) [62], where the weight update process (i.e., Equation 7) is the slow network whose momentum weight is generated by a fast network (i.e., Equation 10).
(2) Gradient descent with momentum is a 2-level associative memory (or optimization process) in which the inner level learns to store gradient values in its parameters, and the outer level then updates the slow weight (i.e., $W_t$) with the value of the inner-level memory.
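The two-level reading of momentum can be made concrete with a small sketch. It assumes a squared-error loss, the sign convention in which the weights add the momentum state and the momentum state subtracts the scaled gradient, and no momentum decay. The helper names `grad_loss` and `inner_memory_step` are hypothetical. The inner level takes one gradient step on the linear objective $\langle m, g \rangle$, the outer level reads the memory, and the result coincides with the classical heavy-ball recursion.

```python
import numpy as np

def grad_loss(W, x, target):
    """Gradient of the assumed loss 0.5 * ||W x - target||^2 with respect to W."""
    return np.outer(W @ x - target, x)

def inner_memory_step(m, g, eta):
    """One gradient-descent step on the inner objective <m, g> (its gradient w.r.t. m is g)."""
    return m - eta * g

rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 4))
targets = rng.normal(size=(5, 3))
eta = 0.05

# Nested view: the inner level writes gradients into m, the outer level reads m.
W_nested = np.zeros((3, 4))
m = np.zeros_like(W_nested)
# Classical view: heavy-ball momentum with momentum coefficient 1.
W_classic = np.zeros((3, 4))
velocity = np.zeros_like(W_classic)

for x, target in zip(xs, targets):
    m = inner_memory_step(m, grad_loss(W_nested, x, target), eta)   # inner level
    W_nested = W_nested + m                                         # outer level

    velocity = 1.0 * velocity - eta * grad_loss(W_classic, x, target)
    W_classic = W_classic + velocity

same_trajectory = bool(np.allclose(W_nested, W_classic))
```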
While these are the simplest examples with respect to both architecture and optimization algorithm, one might ask whether a similar conclusion can be drawn in more complex setups.
In the next example, we replace the MLP module with linear attention [60].
That is, we aim to train a 1-layer linear attention for task $\mathcal{T}$ on a sequence $\mathcal{D}_{\text{train}} = \{x_1, \dots, x_{|\mathcal{D}_{\text{train}}|}\}$ by optimizing the objective $\mathcal{L}$ with gradient descent.
Recalling the unnormalized linear attention formulation:
$$k_t = W_k x_t, \qquad v_t = W_v x_t, \qquad q_t = W_q x_t,$$
$$\mathcal{M}_t = \mathcal{M}_{t-1} + v_t k_t^{\top}, \tag{13}$$
$$y_t = \mathcal{M}_t q_t,$$
that is, the projection matrices $W_k$, $W_v$, and $W_q$ generate the keys, values, and queries from the input $x_t$, the matrix-valued memory $\mathcal{M}_t$ is updated by the recurrence, and the output is the product of the memory and the query vector.
As discussed in earlier studies [58, 59], the recurrence in Equation 13 can be reformulated as the optimization process of a matrix-valued associative memory $\mathcal{M}_t(\cdot)$, which aims to compress the mappings of keys and values into its parameters.
In more detail, in Definition 1, if we let $\tilde{\mathcal{L}}(\mathcal{M}_{t-1}; k_t, v_t) := -\langle \mathcal{M}_{t-1} k_t, v_t \rangle$ and aim to optimize the memory with gradient descent, the memory update rule is (note that $\nabla \tilde{\mathcal{L}}(\mathcal{M}_{t-1}; k_t, v_t) = -v_t k_t^{\top}$ and the learning rate is $\eta_t = 1$):
$$\mathcal{M}_{t} = \mathcal{M}_{t-1} - \eta_{t} \nabla \tilde{\mathcal{L}}(\mathcal{M}_{t-1}; k_{t}, v_{t}) = \mathcal{M}_{t-1} + v_{t} k_{t}^{\top}, \tag{15}$$
which is equivalent to the update rule of an unnormalized linear attention in Equation 13.
Also, note that as we observed in the first example, training a linear layer with gradient descent is a 1-layer optimization problem of an associative memory (see Equation 3) and so the general training/updating process of projection layers (i.e., Wₖ, Wᵥ, and W_q) is itself an optimization process of associative memory.
Accordingly, this setup, i.e., training a linear attention with gradient descent, can be seen as a two-level optimization process, where the outer loop (also known as the training process) optimizes the projection layers with gradient descent, while the inner loop optimizes the inner memory $\mathcal{M}_t$ with gradient descent.
Note that, as discussed above, here we have two associative memories, each with its own optimization process and gradient flow.
That is, in the optimization of the outer-level parameters $W_k$, $W_v$, and $W_q$, there is no gradient with respect to the parameter $\mathcal{M}(\cdot)$ and so there is no backpropagation through it.
Similarly, in the inner level, there is no backpropagation through the projection layers and they are considered frozen.
Furthermore, it is notable that in this example, the above formulation is also closely connected to the FWP perspective of linear attention [63], where projections are considered slow weights, and the memory update in Equation 13 is the fast weight update rule.
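The inner level of this two-level picture can be sketched in a few lines of NumPy. The snippet assumes column-vector projections ($k_t = W_k x_t$, and likewise for $v_t$ and $q_t$), the inner objective $-\langle \mathcal{M} k, v \rangle$ with learning rate 1, and random frozen projection matrices; all dimensions and the helper name `inner_grad` are hypothetical. It checks that running gradient descent on the inner memory reproduces causal unnormalized linear attention.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_in, d = 6, 5, 4
X = rng.normal(size=(T, d_in))
# Outer-level (slow) weights: projection matrices, frozen during the inner loop.
W_k, W_v, W_q = (rng.normal(size=(d, d_in)) for _ in range(3))

def inner_grad(M, k, v):
    """Gradient of the inner objective -<M k, v> with respect to M, i.e. -v k^T."""
    return -np.outer(v, k)

M = np.zeros((d, d))                     # inner-level (fast) memory
outputs_nested = []
for x in X:
    k, v, q = W_k @ x, W_v @ x, W_q @ x
    M = M - 1.0 * inner_grad(M, k, v)    # one gradient step with learning rate 1
    outputs_nested.append(M @ q)
outputs_nested = np.stack(outputs_nested)

# Reference: causal unnormalized linear attention, y_t = sum_{s<=t} v_s (k_s . q_t).
K, V, Q = X @ W_k.T, X @ W_v.T, X @ W_q.T
scores = np.tril(Q @ K.T)                # scores[t, s] = q_t . k_s for s <= t, else 0
outputs_reference = scores @ V

matches = bool(np.allclose(outputs_nested, outputs_reference))
```

Note that the inner loop never differentiates through `W_k`, `W_v`, or `W_q`; training those projections would be a separate, slower optimization with its own gradient flow.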
Architectural Decomposition with More Levels
In both of the above examples, we discussed simple cases that can be translated into 2-level optimization processes, which also coincide with their FWP interpretations.
In practice, however, we need to use more powerful optimization algorithms to train the model, and/or use more powerful recurrent update rules for memory.
As a simple example, assume we use gradient descent with momentum to train a linear attention model.
In the above examples, we showed how the linear attention component can be decomposed into two nested optimization problems.
Similarly, here the model can be represented as a 2-level optimization problem, where (1) the inner level optimizes the memory to compress the context using gradient descent (see Equation 15), and (2) the outer level optimizes the projection layers with gradient descent with momentum.
Interestingly, from the first example, we know that the “gradient descent with momentum” algorithm is itself a 2-level optimization problem, where the momentum term is an associative memory that compresses the past gradients into its parameters.
In the previous section, we provided examples to demonstrate how one can decompose a machine learning model into a set of nested or multi-level optimization problems.
Next, we first aim to present a formal formulation for nested learning problems and then define Neural Learning Module–an integrated computational system that learns from data.
As we observed in the previous section, while we decomposed the model into a set of optimization processes, it is still unclear whether we can define a hierarchy (or order) over these problems and uniquely represent the model in this format.
Inspired by the hierarchy of brain waves that indicates the information processing frequency rate of each part (discussed in Section 1), we use the update rate of each optimization problem to order the components in multiple levels.
To this end, we let one update step over one data point be the unit of time, and define the update frequency rate of each component as:
Definition 2 (Update Frequency). For any component $A$, which can be a parametric component (e.g., learnable weights or the momentum term in gradient descent with momentum) or a non-parametric component (e.g., an attention block), we define its frequency, denoted as $f_A$, as its number of updates per unit of time.
Given the above update frequency, we can order the components of a machine learning algorithm based on the operator $(\cdot \succ \cdot)$.
We let $A$ be faster than $B$ and denote $A \succ B$ if: (1) $f_A > f_B$, or (2) $f_A = f_B$ but the computation of $B$’s state at time $t$ requires the computation of $A$’s state at time $t$.
In this definition, when $A \nsucc B$ and $B \nsucc A$, we let $A \equiv_f B$, which indicates that $A$ and $B$ have the same update frequency, but their computations are independent of each other (later, we provide an example of this case in the AdamW optimizer).
Based on the above operator, we sort the components into an ordered set of “levels”, where (1) components in the same level have the same update frequency, and (2) the higher the level is, the lower its frequency.
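The ordering can be illustrated with a small sketch that sorts components into levels. The function name `assign_levels`, the `requires` dependency map, and the example frequencies are hypothetical and chosen only for illustration; the sketch assumes that dependencies among equally frequent components contain no cycles.

```python
def assign_levels(frequencies, requires=None):
    """Sort components into levels: level 1 is the fastest, higher levels are slower.

    `frequencies` maps a component name to its number of updates per unit of time.
    `requires` maps a component to the components whose state at time t it needs;
    among equally frequent components, the required one counts as faster.
    """
    requires = requires or {}

    def depth(name):
        # Length of the longest chain of same-frequency components computed before `name`.
        deps = [d for d in requires.get(name, []) if frequencies[d] == frequencies[name]]
        return 1 + max(depth(d) for d in deps) if deps else 0

    keys = {name: (-freq, depth(name)) for name, freq in frequencies.items()}
    rank = {key: level for level, key in enumerate(sorted(set(keys.values())), start=1)}
    levels = {}
    for name, key in keys.items():
        levels.setdefault(rank[key], []).append(name)
    return {level: sorted(names) for level, names in sorted(levels.items())}

# Hypothetical frequencies, measured in updates per processed token.
example_frequencies = {
    "attention_memory": 1.0,         # updated at every token
    "momentum_state": 1 / 512,       # updated once per optimizer step
    "projection_weights": 1 / 512,   # updated once per optimizer step
}
# The weight update at a step reads the momentum state computed at the same step.
example_requires = {"projection_weights": ["momentum_state"]}

levels = assign_levels(example_frequencies, example_requires)
```

In this toy setup the momentum state and the projection weights share a frequency, but the weights need the momentum state first, so the tie-breaking rule places the momentum state one level below the weights.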
Notably, based on the above definition, each component has its own optimization problem and thus its own context.
When we optimize a component’s inner objective with gradient-based optimizers, the above statement is equivalent to having an exclusive gradient flow for each component in the model.
In the general case, however, one can use a non-parametric solution (as we later discuss for attention).
Neural Learning Module
Given the above definition of nested learning problems, we define the neural learning module as a new way of representing machine learning models that shows the model as an interconnected system of components, each with its own gradient flow.
Note that, orthogonal to deep learning, nested learning allows us to define neural learning models with more levels, resulting in more expressive architectures.
Nested learning allows computational models that are composed of multiple (multi-layer) levels to learn from and process data with different levels of abstraction and time-scales.
Next, we study optimizers and well-known deep learning architectures from the nested learning perspective, and provide examples of how NL can help to enhance those components.
In this section, we start by understanding how well-known optimizers and their variants are special instances of nested learning. Recall the gradient descent method with momentum,
$$W_{i+1} = W_i + m_{i+1}, \qquad m_{i+1} = \alpha_{i+1} m_i - \eta_{i+1} \nabla \mathcal{L}(W_i; x_i), \tag{17}$$
where the matrix (or vector) $m_i$ is the momentum at state $i$, and $\alpha_i$ and $\eta_i$ are the adaptive momentum and learning rates, respectively.
Assuming $\alpha_{i+1} = 1$, the momentum term can be viewed as the result of optimizing the following objective with gradient descent:
$$\min_{m} \; \langle m \nabla \mathcal{L}(W_i; x_i)^{\top}, I \rangle ,$$
that is, minimizing the inner product between the momentum and the current gradient.
This interpretation shows that momentum can indeed be viewed as a meta memory module that learns how to memorize gradients of the objective into its parameters. Building on this intuition, in Section C.4 we show that Adam with a small modification is the optimal associative memory for the models’ gradients.
Next, we show how this perspective can result in designing more expressive optimizers:
Extension: More Expressive Association.
As discussed earlier, momentum is a value-less associative memory and so has limited expressive power.
To address this issue, following the original definition of associative memory (i.e., mapping keys to values), we let the value parameter be $v_i = P_i$, and so the momentum aims to minimize:
$$\min_{m} \; \langle m \nabla \mathcal{L}(W_i; x_i)^{\top}, P_i \rangle$$
using gradient descent, resulting in the update rule:
$$W_{i+1} = W_i + m_{i+1}, \qquad m_{i+1} = \alpha_{i+1} m_i - \eta_{i+1} P_i \nabla \mathcal{L}(W_i; x_i),$$
that is, the weights are updated with $m_{i+1}$, which is itself obtained by gradient descent on gradients weighted by $P_i$.
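The step from the inner objective to the preconditioned update can be verified numerically. The sketch below assumes the inner objective is the Frobenius inner product $\langle m g^{\top}, P \rangle$, with $g$ standing for the loss gradient, a random square matrix standing in for $P$, and $\alpha$ acting as a decay on the previous momentum state; the shapes, the values of `alpha` and `eta`, and the helper name `inner_objective` are hypothetical. It compares the analytic gradient $P g$ against central finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 3, 4
m = rng.normal(size=(d_out, d_in))       # momentum (memory) state
g = rng.normal(size=(d_out, d_in))       # gradient of the loss w.r.t. the weights
P = rng.normal(size=(d_out, d_out))      # value / preconditioner (random stand-in)

def inner_objective(m_):
    """<m g^T, P>: Frobenius inner product between m g^T and P."""
    return float(np.sum((m_ @ g.T) * P))

# Analytic gradient of the inner objective with respect to m.
analytic = P @ g

# Central finite differences as an independent check.
numeric = np.zeros_like(m)
eps = 1e-6
for i in range(d_out):
    for j in range(d_in):
        step = np.zeros_like(m)
        step[i, j] = eps
        numeric[i, j] = (inner_objective(m + step) - inner_objective(m - step)) / (2 * eps)

gradient_matches = bool(np.allclose(analytic, numeric, atol=1e-6))

# One inner gradient step (with decay alpha) gives the preconditioned momentum update.
alpha, eta = 0.9, 0.1
m_next = alpha * m - eta * (P @ g)
```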
This formulation is equivalent to preconditioning the momentum GD. In fact, preconditioning means that the momentum term is an associative memory that learns how to compress the mappings between $P_i$ and the gradient term $\nabla \mathcal{L}(W_i; x_i)$.
While any reasonable choice of preconditioning (e.g., random features) can improve the expressivity of the initial version of GD with momentum, which per se is a value-less memory (i.e., it maps all gradients to a single value), the above perspective gives more intuition about which preconditioners are more useful.
That is, the momentum acts as a memory that aims to map gradients to their corresponding values, and so a function of the gradients (e.g., information about the Hessian) can provide the memory with more meaningful mappings.
As discussed by Behrouz et al. [58], optimizing an inner objective of dot-product similarity results in a Hebbian-like update rule, which can cause the memory to be less effective.
A natural extension of this internal objective is to use an $\ell_2$ regression loss (for measuring the fitness of the corresponding key-value mapping) and minimize the loss function $\lVert m \nabla \mathcal{L}(W_i; x_i)^{\top} - P_i \rVert_2^2$, resulting in the update rule of:
This update is based on the delta rule [64], and so it allows the memory (momentum) to better manage its limited capacity and better memorize the series of past gradients.
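For the $\ell_2$ regression objective, a short derivation sketch shows where the delta-rule structure comes from. It assumes the Frobenius norm, a single gradient step of size $\eta/2$, and no momentum decay ($\alpha = 1$); the shorthand $g_i$ is introduced here for the loss gradient, and the result is a sketch under these assumptions rather than the paper's stated equation.

```latex
\begin{aligned}
g_i &:= \nabla \mathcal{L}(W_i; x_i), \\
\nabla_m \lVert m\, g_i^{\top} - P_i \rVert_F^2 &= 2\,(m\, g_i^{\top} - P_i)\, g_i, \\
m_{i+1} &= m_i - \eta\,(m_i\, g_i^{\top} - P_i)\, g_i
         = m_i\,(I - \eta\, g_i^{\top} g_i) + \eta\, P_i\, g_i .
\end{aligned}
```

The factor $(I - \eta\, g_i^{\top} g_i)$ erases part of what the memory already stores along the current gradient before the new association is written, which is the characteristic delta-rule behavior. The sign attached to $P_i g_i$ depends on the sign convention chosen for the regression target.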
Extension: More Expressive Memory.
As discussed earlier, momentum can be viewed as a meta memory model that uses a linear layer (i.e., matrix-valued) to compress the past gradient values.
Due to the linear nature of momentum, only linear functions of past gradients can be learned by its internal objective.
To increase the learning capacity of this module, one option is to use more powerful persistent learning modules, i.e., replacing the linear matrix-valued memory for momentum with an MLP.
Therefore, momentum, as the memory for the past gradients, has more capacity to capture the underlying dynamics of the gradients.
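To this end, we extend the formulation in Equation 17 as:
\[
W_{i+1} = W_i + m_{i+1}(u_i), \qquad
m_{i+1} = \alpha_{i+1} m_i - \eta_t \nabla \mathcal{L}^{(2)}(m_i; u_i, I),
\]
where \(u_i = \nabla \mathcal{L}(W_i; x_i)\) and \(\nabla \mathcal{L}^{(2)}(\cdot)\) is the gradient of the internal objective of momentum (e.g., dot-product similarity).
The weight update thus uses a richer memory function, while the momentum itself is updated by optimizing its internal objective.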
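Extension: Non-Linear Outputs.
Building upon the above perspective, in which we see the momentum as a neural architecture, one common technique to enhance the representation power of the momentum memory module is to apply a non-linearity on top of its output [28, 65]:
\[
W_{i+1} = W_i + \sigma\!\left(m_{i+1}(u_i)\right), \qquad
m_{i+1} = \alpha_{i+1} m_i - \eta_t \nabla \mathcal{L}^{(2)}(m_i; u_i, I),
\]
where \(u_i = \nabla \mathcal{L}(W_i; x_i)\), \(\nabla \mathcal{L}^{(2)}(\cdot)\) is the gradient of the internal objective of momentum, and σ(·) is an arbitrary non-linearity. As an example, if we let σ(·) = Newton-Schulz(·), where Newton-Schulz(·) is the iterative Newton–Schulz method [66], and let m(·) be a linear layer, the resulting optimizer is equivalent to the Muon optimizer [34].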
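As an illustration of a non-linearity applied to the momentum output, the sketch below orthogonalizes a linear momentum buffer with a Newton–Schulz iteration before it is added to the weights. It is a simplified stand-in under stated assumptions: it uses the classical cubic iteration X ← 1.5 X − 0.5 X Xᵀ X rather than the tuned coefficients of any particular Muon implementation, and it applies the step size after the non-linearity. The function names and hyperparameters (`beta`, `lr`, `steps`) are hypothetical.

```python
import numpy as np

def newton_schulz_orthogonalize(M, steps=30):
    """Approximate the orthogonal polar factor of M.

    Uses the classical cubic Newton-Schulz iteration, which pushes every
    non-zero singular value of the normalized matrix towards 1.
    """
    X = M / (np.linalg.norm(M) + 1e-12)   # Frobenius norm: singular values <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def muon_like_step(W, m, grad, beta=0.9, lr=0.02):
    """Linear momentum memory followed by a non-linear (orthogonalizing) output."""
    m = beta * m + grad                   # linear memory of past gradients
    W = W - lr * newton_schulz_orthogonalize(m)
    return W, m

rng = np.random.default_rng(0)
W0 = rng.normal(size=(4, 6))              # hypothetical weight matrix
grad = rng.normal(size=(4, 6))            # stand-in for a gradient
m0 = np.zeros_like(W0)

U = newton_schulz_orthogonalize(grad)     # rows become (nearly) orthonormal
W1, m1 = muon_like_step(W0, m0, grad)
```

Because the output of the iteration has all singular values close to one, the size of the weight update is controlled by `lr` alone rather than by the scale of the accumulated gradients.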
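As discussed earlier in Section 2.1, the pre-training process and backpropagation are a form of associative memory, where input data is mapped to the surprise caused by its predicted output, \(\nabla_{y_i} \mathcal{L}(W_i; x_i)\):
\[
W_{i+1} = W_i - \eta_{i+1} \nabla_{W_i} \mathcal{L}(W_i; x_i) = W_i - \eta_{i+1} \nabla_{y_i} \mathcal{L}(W_i; x_i) \otimes x_i, \qquad x_i \sim \mathcal{D}_{\text{train}},
\]
which from the associative memory perspective is equivalent to one step of gradient descent in the optimization process of:
\[
\min_{W} \; \left\langle W x_i, \nabla_{y_i} \mathcal{L}(W_i; x_i) \right\rangle,
\]
that is, minimizing the dot product between the output of the weights and the surprise signal.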
As discussed in Appendix C, the above formulation ignores the dependencies among data samples such as \(x_i\).
To extend it to a more powerful formulation that also considers the dependencies among data points (which is extremely important when we use the optimizer in the token space, as tokens are not independent), we use an \(\ell_2\) regression objective with one step of gradient descent as follows:
\[
\min_{W} \; \left\lVert W x_i - \nabla_{y_i} \mathcal{L}(W_i; x_i) \right\rVert_2^2.
\]
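This formulation results in a new variant of gradient descent, which can be simplified as follows:
\[
\begin{aligned}
W_{i+1} &= W_i \left(I - x_i x_i^\top\right) - \eta_{i+1} \nabla_{W_i} \mathcal{L}(W_i; x_i) \\
        &= W_i \left(I - x_i x_i^\top\right) - \eta_{i+1} \nabla_{y_i} \mathcal{L}(W_i; x_i) \otimes x_i, \qquad x_i \sim \mathcal{D}_{\text{train}}.
\end{aligned}
\]
The first line adds a projection term \(I - x_i x_i^\top\) to the weights, and the second line re-expresses the gradient term in outer-product form.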
Here, we use this optimizer as the internal optimizer of our HOPE architecture.
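The effect of the projection term can be seen in a small NumPy sketch for a single linear layer y = W x, assuming the update takes the form W ← W (I − x xᵀ) − η ∇_y L ⊗ x. The function name, the toy squared loss, the dimensions, and the unit-norm input are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def projected_gd_step(W, x, grad_y, lr):
    """One step of  W <- W (I - x x^T) - lr * outer(grad_y, x)."""
    d_in = x.shape[0]
    forget = np.eye(d_in) - np.outer(x, x)   # removes the component along x
    return W @ forget - lr * np.outer(grad_y, x)

rng = np.random.default_rng(0)
d_out, d_in = 3, 5                           # hypothetical layer shape
W = rng.normal(size=(d_out, d_in))
x = rng.normal(size=d_in)
x /= np.linalg.norm(x)                       # unit-norm input (assumption)
target = rng.normal(size=d_out)              # hypothetical regression target

# Gradient of the toy loss 0.5 * ||W x - target||^2 with respect to y = W x.
grad_y = W @ x - target
lr = 0.1
W_new = projected_gd_step(W, x, grad_y, lr)

# Response of the updated weights to the same input, and to an orthogonal one.
response = W_new @ x
z = rng.normal(size=d_in)
z -= (z @ x) * x                             # make z orthogonal to x
```

With a unit-norm input, the updated weights respond to x with exactly −η ∇_y L, while their response to any direction orthogonal to x is unchanged. Plain gradient descent would instead add the new outer-product term on top of whatever W already stored along x.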
3 HOPE: A Self-Referential Learning Module with Continuum Memory
Existing architectural backbones consist of (1) a working memory module (e.g., attention), which is responsible for actively fusing information across the sequence length, and (2) a feed-forward layer (e.g., MLP) that fuses information across features and acts as the persistent memory or knowledge storage of the pre-training phase.
From the NL perspective, pre-training is the phase in which the outermost level of the learning module is updated over its limited context flow. Accordingly, in the continual setup, such a pre-training phase is also rarely updated over time, and so its corresponding knowledge storage needs to be updated only rarely.
Given this intuition, we extend the traditional viewpoint of a long-term/short-term memory system and suggest a knowledge storage feed-forward layer for each level (frequency domain).
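Given the definition of frequency, the Continuum Memory System (CMS) is formalized as a chain of MLP blocks \(\mathrm{MLP}^{(f_1)}(\cdot), \dots, \mathrm{MLP}^{(f_k)}(\cdot)\), each of which is associated with a chunk size of \(C^{(\ell)} := \max_{\ell} C^{(\ell)} / f_\ell\), such that given input \(x = \{x_1, \dots, x_T\}\) the output of the chain is calculated as (we disregard normalizations for the sake of clarity):
\[
y_t = \mathrm{MLP}^{(f_k)}\!\left(\mathrm{MLP}^{(f_{k-1})}\!\left(\cdots \mathrm{MLP}^{(f_1)}(x_t)\right)\right),
\]
where the parameters of the \(\ell\)-th MLP block, i.e., \(\theta^{(\ell)}\), are updated every \(C^{(\ell)}\) steps:
\[
\theta^{(\ell)}_{i+1} = \theta^{(\ell)}_{i} -
\begin{cases}
\sum_{t = i - C^{(\ell)}}^{i} \eta^{(\ell)}_t \, f\!\left(\theta^{(\ell)}_t; x_t\right) & \text{if } i \equiv 0 \pmod{C^{(\ell)}}, \\
0 & \text{otherwise.}
\end{cases}
\tag{31}
\]
That is, a batched parameter update is performed once every \(C^{(\ell)}\) steps.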
In Appendix B.1, we discuss different variants of this formulation, including fully nested MLP layers. Here η_t^{(ℓ)} are the learning rates corresponding to θ^{(ℓ)}, and f(·) is the error component of an arbitrary optimizer (e.g., ∇_{θ^{(ℓ)}} ℒ(θ_t^{(ℓ)}; x_t) in gradient descent).
The conventional Transformer block [27] is a special instance of this formulation, where k = 1. Notably, Equation 31 provides an important interpretation: the parameters θ^{(ℓ)} are responsible for compressing their own context into their parameters, and so they represent abstract knowledge of their context.
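The multi-frequency update schedule can be sketched as follows. This is a toy illustration under several assumptions: each level is a single tanh layer standing in for an MLP block, the local objective (reconstructing the block's own input) is hypothetical, and the chunk sizes (1, 4, 16) and all names are chosen only for the example. The point it shows is the scheduling: every level accumulates its error terms at each step but only applies them once per chunk.

```python
import numpy as np

class MemoryBlock:
    """One level of the chain; its weights change only every `chunk` steps."""

    def __init__(self, dim, chunk, lr, rng):
        self.W = np.eye(dim) + 0.01 * rng.normal(size=(dim, dim))
        self.chunk = chunk
        self.lr = lr
        self.acc = np.zeros_like(self.W)     # accumulated error terms
        self.num_updates = 0

    def forward(self, x):
        return np.tanh(self.W @ x)

    def accumulate(self, x, target):
        # Gradient of 0.5 * ||tanh(W x) - target||^2 with respect to W.
        y = np.tanh(self.W @ x)
        delta = (y - target) * (1.0 - y ** 2)
        self.acc += self.lr * np.outer(delta, x)

    def maybe_update(self, step):
        # Apply the accumulated update only at chunk boundaries.
        if step % self.chunk == 0:
            self.W -= self.acc
            self.acc[:] = 0.0
            self.num_updates += 1

rng = np.random.default_rng(0)
dim, T = 4, 16
blocks = [MemoryBlock(dim, chunk=c, lr=0.05, rng=rng) for c in (1, 4, 16)]
xs = rng.normal(size=(T, dim))

for step, x in enumerate(xs, start=1):
    h = x
    for block in blocks:
        out = block.forward(h)               # output with the current parameters
        block.accumulate(h, target=h)        # hypothetical local objective
        block.maybe_update(step)
        h = out

update_counts = [b.num_updates for b in blocks]
```

Over 16 steps, the fastest level is updated at every step and the slowest only once, so each level compresses a context of a different length into its parameters.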
HOPE. We further present a self-referential learning system that builds on Titans [28] and our variant of gradient descent discussed in Section B.1. Combining this self-referential sequence model with the continuum memory system results in the HOPE architecture.
Figure 3: A comparison of HOPE architectural backbone with Transformers (Normalization and potential data-dependent components are removed for the sake of clarity).
Table 1: Performance of HOPE and baselines on language modeling and common-sense reasoning tasks. Hybrid models are marked with ∗.
4 Experiments
For the sake of space, in the main paper we report the results of HOPE's evaluation on language modeling and common-sense reasoning tasks.
However, we report an extensive set of results in the appendix, including experiments on optimizers, the emergence of in-context learning, the continual learning abilities of HOPE, ablation studies, long-context tasks, etc.
Details about the experimental setups and the other datasets used are in Appendix G.
Language Modeling and Common-sense Reasoning
We follow recent sequence modeling studies [28, 67, 68] and report the results of HOPE and baselines with sizes of 340M, 760M, and 1.3B parameters on language modeling and common-sense reasoning downstream tasks.
HOPE demonstrates strong performance across all scales and benchmark tasks, outperforming both Transformers and recent recurrent neural networks, including Gated DeltaNet and Titans.
Comparing HOPE to Titans and Gated DeltaNet, we can see that dynamically changing the key, value, and query projections based on the context, as well as using a deep memory module, can result in a model with lower perplexity and higher accuracy on the benchmarks.