Google's V2 Version of 'Attention Is All You Need': Nested Learning

Hello everyone, I'm PaperAgent, not Agent!

Recently, Google's Gemini 3 Pro and Gemini 3 Pro Image (Nano Banana Pro) have been stealing the spotlight, while OpenAI has been exploring the application value of its own GPT-5, publishing a lengthy 89-page research report on using GPT-5 to accelerate scientific research; it's worth a read.

Today, the focus is on sharing Google's latest research achievement, which netizens are calling the V2 of 'Attention Is All You Need': Nested Learning.


Nested Learning is a brand new machine learning method that views the model as a set of smaller, nested optimization problems, each with its own independent internal workflow, thereby mitigating or even completely avoiding "catastrophic forgetting"—the issue of sacrificing performance on old tasks when learning new ones.


1 Why another "new paradigm"?

| | Deep learning (old narrative) | Nested Learning (new narrative) |
| --- | --- | --- |
| Expressiveness | stack the network "deeper" → expressiveness ↑ | "nest and decompose" the network → expressiveness ↑ |
| Training | patch the whole model at once | each level patches itself, at its own frequency |
| Memory | attention + FFN | arbitrary "key-value" optimization subsystems |

Fig 1: Analogy between brain EEG bands (Δ/Θ/α/β/γ) and NL's multi-time-scale updates: early layers refresh at high frequency, later layers integrate at low frequency. This is NL's core intuition.

2 Nested Learning's three key techniques

2.1 Associative memory = everything

Definition:

An associative memory ℳ is an operator mapping keys 𝒦 to values 𝒱; training solves ℳ* = argmin_ℳ ℒ(ℳ(𝒦); 𝒱).
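To make the definition concrete, here is a minimal Python sketch (toy data, a linear ℳ, squared loss, all assumed for illustration): the memory is simply whatever operator best regresses the values from the keys.

```python
import numpy as np

# Minimal sketch of the definition above, on assumed toy data: an associative
# memory M is the operator that best regresses values V from keys K,
# M* = argmin_M L(M(K); V). For a linear M and squared loss this is least squares.
rng = np.random.default_rng(0)
K = rng.normal(size=(32, 8))               # 32 keys of dimension 8
V = rng.normal(size=(32, 4))               # their associated values, dimension 4

M, *_ = np.linalg.lstsq(K, V, rcond=None)  # closed-form argmin of ||K @ M - V||^2
print("recall error:", float(np.mean((K @ M - V) ** 2)))
```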

Treat gradients, tokens, and historical hidden states as keys/values to get (a toy momentum sketch follows this list):

• SGD → 1-layer memory (maps x → local surprise signal LSS)

• Momentum → 2-layer memory (outer updates W, inner updates momentum m)

• Linear Attention → 2-layer memory (outer updates projection matrix, inner updates Mt)
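Here is a toy sketch of the second bullet, with an assumed linear-regression setup (the names W, m, alpha, eta are illustrative, not from the paper): momentum SGD is two nested memories, an inner one that compresses the gradient stream and an outer one (the weights) that reads it.

```python
import numpy as np

# Toy sketch (assumed setup, not the paper's code): momentum SGD viewed as two
# nested memory levels. The inner memory m compresses the stream of gradients;
# the outer memory W (the weights) is updated by reading m.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 8))        # inputs
y = X @ rng.normal(size=8)           # targets from a linear teacher

W = np.zeros(8)                      # outer (slow) level: model weights
m = np.zeros(8)                      # inner (fast) level: momentum buffer
alpha, eta = 0.9, 0.05

for t in range(2000):
    i = t % len(X)
    grad = (W @ X[i] - y[i]) * X[i]  # the "local surprise signal" for this sample
    m = alpha * m + grad             # inner level: associative memory over past gradients
    W = W - eta * m                  # outer level: slow weights read the inner memory

print("final squared error:", float(np.mean((X @ W - y) ** 2)))
```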

Fig 2: A "Russian doll" decomposition of a hybrid RNN + attention model into multi-level gradient flows.

2.2 Update frequency = hierarchy order

Frequency definition: f_A = the number of updates module A makes per unit time. Sorting modules by f_A yields the nested hierarchy: high frequency ≻ low frequency. Each level then has its own exclusive gradient flow, with no level blocking another.
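A minimal sketch of this scheduling idea, with made-up module names and periods (not the paper's API): every module keeps its own clock, accumulates gradients between ticks, and applies its update only when its period elapses.

```python
from collections import Counter

# Illustrative sketch of "update frequency = hierarchy order": fast modules apply
# an update every step; slower modules accumulate gradients and apply them once
# per period, so each level has its own gradient flow. Names and periods are assumptions.
update_period = {"fast_attention": 1, "mid_mlp": 8, "slow_memory": 64}
accumulated = {name: 0.0 for name in update_period}
update_counts = Counter()

def apply_update(name, avg_grad):
    # Placeholder for an optimizer step on this module's parameters.
    update_counts[name] += 1

def training_step(t, grads):
    """grads: module name -> (toy scalar) gradient produced at step t."""
    for name, period in update_period.items():
        accumulated[name] += grads[name]
        if (t + 1) % period == 0:                 # this module's clock ticks now
            apply_update(name, accumulated[name] / period)
            accumulated[name] = 0.0

for t in range(64):
    training_step(t, {name: 0.1 for name in update_period})

print(update_counts)   # fast_attention: 64, mid_mlp: 8, slow_memory: 1
```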

2.3 Optimizer as memory module

After treating momentum as a "keyless memory", further extensions become possible (a sketch of the first row follows the table):

| Extension | Formula | Effect |
| --- | --- | --- |
| Add a value parameter P | m ← αm − ηP∇ | automatic gradient preprocessing |
| Switch to an L2 loss | Δ-rule update | capacity utilization ↑ |
| Replace m with an MLP | Deep Momentum GD | captures nonlinear gradient dynamics |
| Nest Newton-Schulz | σ(·) = Newton-Schulz | equivalent to the Muon optimizer |
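As a concrete illustration of the first row, here is a toy sketch under assumed shapes (the matrix P, the toy objective ‖w‖², and all hyperparameters are illustrative): the value parameter P preprocesses each gradient before it is written into the momentum memory m.

```python
import numpy as np

# Toy sketch of momentum with a "value" parameter P: m <- alpha*m - eta * P @ grad.
# P preprocesses the gradient before it enters the memory m. Here P is a fixed
# illustrative matrix; in Nested Learning it would itself be learned.
d = 4
rng = np.random.default_rng(1)
P = np.eye(d) + 0.1 * rng.normal(size=(d, d))   # illustrative value parameter
m = np.zeros(d)
alpha, eta = 0.9, 0.05

def momentum_step(w, grad):
    global m
    m = alpha * m - eta * (P @ grad)            # inner memory of preprocessed gradients
    return w + m                                # outer level applies the memory's output

w = rng.normal(size=d)
for _ in range(300):
    w = momentum_step(w, 2 * w)                 # gradient of the toy objective ||w||^2
print("||w|| after optimization:", float(np.linalg.norm(w)))
```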

3 Self-modifying sequence models: Titans

Key idea:

"Since the optimizer is memory, can the model learn to write its own optimizer?"

Titans =

• Slow weights (large MLP) → long-term memory

• Fast weights (linear attention) → working memory

Learner L(·) → based on the current context, outputs in real time how the slow weights should be updated (i.e., it generates the ΔW rule).

During training, L(·)'s parameters are also trained by SGD, leading to "gradient optimizes gradient optimizer" nesting.
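A toy sketch of this self-modification loop (illustrative only, not the Titans implementation; the learner here is just a tiny linear map that outputs a gate and a step size): the slow weights are updated by whatever rule the learner emits from the current context.

```python
import numpy as np

# Toy sketch of the self-modification idea: a small learner L maps the current
# context to the update applied to the slow weights, i.e. it outputs the Delta-W
# rule instead of a hand-written SGD rule. In Titans this role is played by fast
# weights, and L's own parameters are in turn trained by SGD.
rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

W_slow = np.zeros(4)                 # long-term memory (slow weights)
L_params = rng.normal(size=(2, 4))   # learner's parameters (assumed shape)

def learner_delta_w(context, grad):
    """Map the context to the coefficients of a slow-weight update rule."""
    gate, log_lr = L_params @ context          # two scalars read off the context
    lr = 0.1 * sigmoid(log_lr)                 # learned, context-dependent step size
    return -lr * sigmoid(gate) * grad          # the Delta-W the learner proposes

context, grad = rng.normal(size=4), rng.normal(size=4)
delta_w = learner_delta_w(context, grad)
W_slow = W_slow + delta_w                      # slow weights follow the learner's rule
print("Delta-W applied:", np.round(delta_w, 3))
```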

4 HOPE: Turning memory into "frequency bands"

Continuum Memory System (CMS)

• Breaks traditional "long/short-term memory" into a series of MLPs across continuous frequency bands: MLP^(f₁), …, MLP^(fₖ), each updated per its chunk length C(ℓ).

• Slower bands → longer contexts → more abstract knowledge.

HOPE = CMS + the Titans self-modification core (a toy CMS sketch follows).

Fig 3: Update frequencies of a Transformer vs. HOPE.
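A toy sketch of the CMS idea under assumed shapes (linear memories instead of MLPs, hand-picked chunk lengths, a Hebbian-style write, none of it from the HOPE code): each band only updates once its chunk C(ℓ) is full, so slower bands summarize longer stretches of the token stream.

```python
import numpy as np

# Toy Continuum Memory System: a chain of memories MLP^(f1), ..., MLP^(fk), here
# reduced to linear maps, where level l only updates once per chunk of C(l) tokens.
# Slower levels see longer chunks and so integrate longer contexts.
rng = np.random.default_rng(3)
d = 8
chunk_lengths = [1, 4, 16]                       # assumed C(l) for three frequency bands
levels = [np.zeros((d, d)) for _ in chunk_lengths]
buffers = [[] for _ in chunk_lengths]

def write_token(x, lr=0.1):
    """Feed one token embedding x through the memory chain."""
    for l, (C, buf) in enumerate(zip(chunk_lengths, buffers)):
        buf.append(x)
        if len(buf) == C:                        # this band's chunk is full -> update it
            chunk = np.mean(buf, axis=0)         # summary of the last C(l) tokens
            # Hebbian-style write: store the chunk summary as a key/value association.
            levels[l] += lr * np.outer(chunk, chunk)
            buf.clear()

for _ in range(32):                              # stream 32 toy tokens
    write_token(rng.normal(size=d))

print("updates per band:", [32 // C for C in chunk_lengths])  # [32, 8, 2]
```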

5 Experiment overview

Table 1 gives the complete results.

Conclusion:

• At equal parameter counts, HOPE achieves the lowest perplexity and the highest average accuracy on common sense tasks;

• As model size grows, the gap between HOPE and Titans narrows, but HOPE still consistently outperforms Transformer++;

• HOPE excels at memory management on long-context needle-in-a-haystack (NIAH) tasks, showing that CMS offers a more efficient way to handle long information sequences.

Performance comparison of different architectures on language modeling (perplexity, left) and common sense reasoning (accuracy, right): HOPE, Titans, Samba, and a Transformer baseline.

Performance comparison across difficulty levels on long-context tasks: HOPE, Titans, TTT, and Mamba2.

• NIAH-PK: needle-in-a-haystack pass-key retrieval

• NIAH-H: needle-in-a-haystack number retrieval

• NIAH-W: needle-in-a-haystack word retrieval

Finally, one figure sums it all up.

https://research.google/blog/introducing-nested-learning-a-new-ml-paradigm-for-continual-learning/

https://abehrouz.github.io/files/NL.pdf

Main Tag: Nested Learning

Sub Tags: Continual Learning, HOPE, Titans, Catastrophic Forgetting

