Google's V2 Version of 'Attention Is All You Need': Nested Learning

Hello everyone, I'm PaperAgent, not Agent!

Recently, Google's Gemini 3 Pro and Gemini 3 Pro Image (Nano Banana Pro) have been stealing the spotlight, while OpenAI has been exploring the application value of its own GPT-5, publishing a lengthy 89-page research report on using GPT-5 to accelerate scientific research; it's worth a read.

Today, the focus is on sharing Google's latest research achievement, which netizens are calling the V2 of 'Attention Is All You Need': Nested Learning.


Nested Learning is a brand new machine learning method that views the model as a set of smaller, nested optimization problems, each with its own independent internal workflow, thereby mitigating or even completely avoiding "catastrophic forgetting"—the issue of sacrificing performance on old tasks when learning new ones.


1 Why another "new paradigm"?

| | Deep learning (old narrative) | Nested Learning (new narrative) |
| --- | --- | --- |
| Expressiveness | stack the network "deeper" → expressiveness ↑ | "nest and decompose" the network → expressiveness ↑ |
| Training | patch the whole model at once | each level patches itself, at its own frequency |
| Memory | attention + FFN | arbitrary "key-value" optimization subsystems |

Fig 1: Analogy between brain EEG bands (Δ/Θ/α/β/γ) and NL's multi-time-scale updates: early layers refresh at high frequency, later layers integrate at low frequency. This is NL's core intuition.

2 Nested Learning's three key techniques

2.1 Associative memory = everything

Definition:

An associative memory ℳ is an operator mapping keys 𝒦 to values 𝒱; training solves ℳ* = argmin_ℳ ℒ(ℳ(𝒦); 𝒱).
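To make the definition concrete, here is a minimal Python sketch (toy data, a linear ℳ, squared loss, all assumed for illustration): the memory is simply whatever operator best regresses the values from the keys.

```python
import numpy as np

# Minimal sketch of the definition above, on assumed toy data: an associative
# memory M is the operator that best regresses values V from keys K,
# M* = argmin_M L(M(K); V). For a linear M and squared loss this is least squares.
rng = np.random.default_rng(0)
K = rng.normal(size=(32, 8))               # 32 keys of dimension 8
V = rng.normal(size=(32, 4))               # their associated values, dimension 4

M, *_ = np.linalg.lstsq(K, V, rcond=None)  # closed-form argmin of ||K @ M - V||^2
print("recall error:", float(np.mean((K @ M - V) ** 2)))
```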

Treat gradients, tokens, and historical hidden states as keys/values to get (a toy momentum sketch follows this list):

• SGD → 1-layer memory (maps x → local surprise signal LSS)

• Momentum → 2-layer memory (outer updates W, inner updates momentum m)

• Linear Attention → 2-layer memory (outer updates projection matrix, inner updates Mt)
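Here is a toy sketch of the second bullet, with an assumed linear-regression setup (the names W, m, alpha, eta are illustrative, not from the paper): momentum SGD is two nested memories, an inner one that compresses the gradient stream and an outer one (the weights) that reads it.

```python
import numpy as np

# Toy sketch (assumed setup, not the paper's code): momentum SGD viewed as two
# nested memory levels. The inner memory m compresses the stream of gradients;
# the outer memory W (the weights) is updated by reading m.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 8))        # inputs
y = X @ rng.normal(size=8)           # targets from a linear teacher

W = np.zeros(8)                      # outer (slow) level: model weights
m = np.zeros(8)                      # inner (fast) level: momentum buffer
alpha, eta = 0.9, 0.05

for t in range(2000):
    i = t % len(X)
    grad = (W @ X[i] - y[i]) * X[i]  # the "local surprise signal" for this sample
    m = alpha * m + grad             # inner level: associative memory over past gradients
    W = W - eta * m                  # outer level: slow weights read the inner memory

print("final squared error:", float(np.mean((X @ W - y) ** 2)))
```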

Fig 2: A "Russian doll" decomposition of a hybrid RNN + attention model into multi-level gradient flows.

2.2 Update frequency = hierarchy order

Frequency definition: f_A = the number of updates module A makes per unit time. Sorting modules by f_A yields the nested hierarchy: high frequency ≻ low frequency. Each level then has its own exclusive gradient flow, with no level blocking another.
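A minimal sketch of this scheduling idea, with made-up module names and periods (not the paper's API): every module keeps its own clock, accumulates gradients between ticks, and applies its update only when its period elapses.

```python
from collections import Counter

# Illustrative sketch of "update frequency = hierarchy order": fast modules apply
# an update every step; slower modules accumulate gradients and apply them once
# per period, so each level has its own gradient flow. Names and periods are assumptions.
update_period = {"fast_attention": 1, "mid_mlp": 8, "slow_memory": 64}
accumulated = {name: 0.0 for name in update_period}
update_counts = Counter()

def apply_update(name, avg_grad):
    # Placeholder for an optimizer step on this module's parameters.
    update_counts[name] += 1

def training_step(t, grads):
    """grads: module name -> (toy scalar) gradient produced at step t."""
    for name, period in update_period.items():
        accumulated[name] += grads[name]
        if (t + 1) % period == 0:                 # this module's clock ticks now
            apply_update(name, accumulated[name] / period)
            accumulated[name] = 0.0

for t in range(64):
    training_step(t, {name: 0.1 for name in update_period})

print(update_counts)   # fast_attention: 64, mid_mlp: 8, slow_memory: 1
```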

2.3 Optimizer as memory module

After treating momentum as a "keyless memory", further extensions become possible (a sketch of the first row follows the table):

| Extension | Formula | Effect |
| --- | --- | --- |
| Add a value parameter P | m ← αm − ηP∇ | automatic gradient preprocessing |
| Switch to an L2 loss | Δ-rule update | capacity utilization ↑ |
| Replace m with an MLP | Deep Momentum GD | captures nonlinear gradient dynamics |
| Nest Newton-Schulz | σ(·) = Newton-Schulz | equivalent to the Muon optimizer |
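As a concrete illustration of the first row, here is a toy sketch under assumed shapes (the matrix P, the toy objective ‖w‖², and all hyperparameters are illustrative): the value parameter P preprocesses each gradient before it is written into the momentum memory m.

```python
import numpy as np

# Toy sketch of momentum with a "value" parameter P: m <- alpha*m - eta * P @ grad.
# P preprocesses the gradient before it enters the memory m. Here P is a fixed
# illustrative matrix; in Nested Learning it would itself be learned.
d = 4
rng = np.random.default_rng(1)
P = np.eye(d) + 0.1 * rng.normal(size=(d, d))   # illustrative value parameter
m = np.zeros(d)
alpha, eta = 0.9, 0.05

def momentum_step(w, grad):
    global m
    m = alpha * m - eta * (P @ grad)            # inner memory of preprocessed gradients
    return w + m                                # outer level applies the memory's output

w = rng.normal(size=d)
for _ in range(300):
    w = momentum_step(w, 2 * w)                 # gradient of the toy objective ||w||^2
print("||w|| after optimization:", float(np.linalg.norm(w)))
```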

3 Self-modifying sequence models: Titans

Key idea:

"Since the optimizer is memory, can the model learn to write its own optimizer?"

Titans =

• Slow weights (large MLP) → long-term memory

• Fast weights (linear attention) → working memory

Learner L(·) → based on the current context, outputs in real time how the slow weights should be updated (i.e., it generates the ΔW rule).

During training, L(·)'s parameters are also trained by SGD, leading to "gradient optimizes gradient optimizer" nesting.
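A toy sketch of this self-modification loop (illustrative only, not the Titans implementation; the learner here is just a tiny linear map that outputs a gate and a step size): the slow weights are updated by whatever rule the learner emits from the current context.

```python
import numpy as np

# Toy sketch of the self-modification idea: a small learner L maps the current
# context to the update applied to the slow weights, i.e. it outputs the Delta-W
# rule instead of a hand-written SGD rule. In Titans this role is played by fast
# weights, and L's own parameters are in turn trained by SGD.
rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

W_slow = np.zeros(4)                 # long-term memory (slow weights)
L_params = rng.normal(size=(2, 4))   # learner's parameters (assumed shape)

def learner_delta_w(context, grad):
    """Map the context to the coefficients of a slow-weight update rule."""
    gate, log_lr = L_params @ context          # two scalars read off the context
    lr = 0.1 * sigmoid(log_lr)                 # learned, context-dependent step size
    return -lr * sigmoid(gate) * grad          # the Delta-W the learner proposes

context, grad = rng.normal(size=4), rng.normal(size=4)
delta_w = learner_delta_w(context, grad)
W_slow = W_slow + delta_w                      # slow weights follow the learner's rule
print("Delta-W applied:", np.round(delta_w, 3))
```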

4 HOPE: Turning memory into "frequency bands"

Continuum Memory System (CMS)

• Breaks traditional "long/short-term memory" into a series of MLPs across continuous frequency bands: MLP^(f₁), …, MLP^(fₖ), each updated per its chunk length C(ℓ).

• Slower bands → longer contexts → more abstract knowledge.

HOPE = CMS + the Titans self-modification core (a toy CMS sketch follows).

Fig 3: Update frequencies of a Transformer vs. HOPE.
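A toy sketch of the CMS idea under assumed shapes (linear memories instead of MLPs, hand-picked chunk lengths, a Hebbian-style write, none of it from the HOPE code): each band only updates once its chunk C(ℓ) is full, so slower bands summarize longer stretches of the token stream.

```python
import numpy as np

# Toy Continuum Memory System: a chain of memories MLP^(f1), ..., MLP^(fk), here
# reduced to linear maps, where level l only updates once per chunk of C(l) tokens.
# Slower levels see longer chunks and so integrate longer contexts.
rng = np.random.default_rng(3)
d = 8
chunk_lengths = [1, 4, 16]                       # assumed C(l) for three frequency bands
levels = [np.zeros((d, d)) for _ in chunk_lengths]
buffers = [[] for _ in chunk_lengths]

def write_token(x, lr=0.1):
    """Feed one token embedding x through the memory chain."""
    for l, (C, buf) in enumerate(zip(chunk_lengths, buffers)):
        buf.append(x)
        if len(buf) == C:                        # this band's chunk is full -> update it
            chunk = np.mean(buf, axis=0)         # summary of the last C(l) tokens
            # Hebbian-style write: store the chunk summary as a key/value association.
            levels[l] += lr * np.outer(chunk, chunk)
            buf.clear()

for _ in range(32):                              # stream 32 toy tokens
    write_token(rng.normal(size=d))

print("updates per band:", [32 // C for C in chunk_lengths])  # [32, 8, 2]
```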

5 Experiment overview

Table 1 gives the complete results.

Conclusion:

• At equal parameter counts, HOPE achieves the lowest perplexity and the highest average accuracy on common sense tasks;

• As model size grows, the gap between HOPE and Titans narrows, but HOPE still consistently outperforms Transformer++;

• HOPE excels at memory management on long-context needle-in-a-haystack (NIAH) tasks, showing that CMS offers a more efficient way to handle long information sequences.

Performance comparison of different architectures on language modeling (perplexity, left) and common sense reasoning (accuracy, right): HOPE, Titans, Samba, and a Transformer baseline.

Performance comparison across difficulty levels on long-context tasks: HOPE, Titans, TTT, and Mamba2.

• NIAH-PK: needle-in-a-haystack pass-key retrieval

• NIAH-H: needle-in-a-haystack number retrieval

• NIAH-W: needle-in-a-haystack word retrieval

Finally, one figure sums it all up.

https://research.google/blog/introducing-nested-learning-a-new-ml-paradigm-for-continual-learning/

https://abehrouz.github.io/files/NL.pdf

Main Tag: Nested Learning

Sub Tags: Continual Learning, HOPE, Titans, Catastrophic Forgetting

