Hello everyone, I'm PaperAgent, not Agent!
Recently, Google's Gemini 3 Pro and Gemini 3 Pro Image (Nano Banana Pro) have been stealing the spotlight, while OpenAI has been exploring the application value of its own GPT-5, publishing a lengthy 89-page research report on using GPT-5 to accelerate scientific research; it is worth a read.
Today, the focus is on Google's latest research result, which netizens are calling the V2 of "Attention Is All You Need": Nested Learning.
Nested Learning is a brand new machine learning method that views the model as a set of smaller, nested optimization problems, each with its own independent internal workflow, thereby mitigating or even completely avoiding "catastrophic forgetting"—the issue of sacrificing performance on old tasks when learning new ones.
1 Why another "new paradigm"?
Deep learning's old narrative:
• Stack the network "deeper" → expressiveness ↑
• Training = patching the whole model at once
• Memory = attention + FFN
Nested Learning's new narrative:
• "Nest and disassemble" the network → expressiveness ↑
• Training = each level patches itself, at its own frequency
• Memory = arbitrary "key-value" optimization subsystems
Fig 1: Analogy between brain EEG frequency bands (δ/θ/α/β/γ) and NL's multi-time-scale updates: early layers refresh at high frequency, later layers integrate at low frequency. This is NL's core intuition.
2 Nested Learning's three key techniques
2.1 Associative memory = everything
Definition:
An associative memory ℳ is an operator mapping keys 𝒦 to values 𝒱; training solves ℳ* = argmin_ℳ ℒ(ℳ(𝒦); 𝒱).
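To make the definition concrete, here is a minimal toy sketch of mine (not code from the paper): a linear memory ℳ is fitted to key/value pairs by gradient descent on a squared-error instance of ℒ(ℳ(𝒦); 𝒱).

```python
import numpy as np

# Toy associative memory: a linear operator M that maps keys K to values V.
# "Training" solves  M* = argmin_M L(M(K); V)  with a squared-error loss.
rng = np.random.default_rng(0)
d_k, d_v, n = 8, 4, 64

K = rng.normal(size=(n, d_k))      # keys
V = rng.normal(size=(n, d_v))      # values to be memorized
M = np.zeros((d_k, d_v))           # the memory itself

lr = 0.05
for _ in range(500):
    err = K @ M - V                # residual of the memory's recall
    M -= lr * (K.T @ err) / n      # gradient step on ||K M - V||^2 / (2n)

print("recall error:", float(np.mean((K @ M - V) ** 2)))
```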
Treating gradients, tokens, and historical hidden states as keys/values yields the correspondences below (see the sketch after this list):
• SGD → 1-layer memory (maps x → local surprise signal LSS)
• Momentum → 2-layer memory (outer updates W, inner updates momentum m)
• Linear Attention → 2-layer memory (outer updates the projection matrices, inner updates the state matrix M_t)
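Read this way, the optimizers themselves are memories. Below is a rough sketch of my own (toy quadratic loss, arbitrary hyperparameters) showing SGD as a one-level memory and momentum as a two-level nested memory:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(16, 16))
A = A @ A.T + 16 * np.eye(16)          # well-conditioned toy quadratic loss
b = rng.normal(size=16)
grad = lambda w: A @ w - b             # the "local surprise signal" at w

# SGD = one-level memory: the weights w directly absorb each surprise signal.
w_sgd = np.zeros(16)
for _ in range(300):
    w_sgd -= 0.01 * grad(w_sgd)

# Momentum = two-level nested memory:
#   inner level: m memorizes a running (keyless) summary of past gradients,
#   outer level: w is updated from the inner memory m, not the raw gradient.
w_mom, m = np.zeros(16), np.zeros(16)
alpha, eta = 0.9, 0.005
for _ in range(300):
    m = alpha * m - eta * grad(w_mom)  # inner memory update
    w_mom = w_mom + m                  # outer update reads the inner memory

print("SGD residual:     ", float(np.linalg.norm(A @ w_sgd - b)))
print("momentum residual:", float(np.linalg.norm(A @ w_mom - b)))
```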
Fig 2: A "Russian doll" breakdown of a hybrid RNN+Attention model into multi-level gradient flows.
2.2 Update frequency = hierarchy order
Frequency definition: f_A = the number of updates of module A per unit time. Sorting modules by f_A gives the nested hierarchy: high frequency ≻ low frequency. Each level then has its own dedicated gradient flow, so levels do not block one another.
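Here is a tiny bookkeeping sketch (the module names and periods are hypothetical, chosen only to illustrate the ordering): compute f_A for a few modules and sort them into the nested hierarchy.

```python
# f_A = number of updates of module A per unit time (here: per token).
# Sorting by f_A yields the nested hierarchy: high frequency ≻ low frequency,
# and each level keeps its own gradient flow / update schedule.
updates_per_token = {
    "attention_state": 1.0,        # refreshed at every token
    "fast_memory_mlp": 1 / 8,      # assumed: updated once per 8-token chunk
    "slow_memory_mlp": 1 / 64,     # assumed: updated once per 64-token chunk
    "backbone_weights": 1 / 2048,  # assumed: updated once per training batch
}

hierarchy = sorted(updates_per_token, key=updates_per_token.get, reverse=True)
print(" ≻ ".join(hierarchy))
# attention_state ≻ fast_memory_mlp ≻ slow_memory_mlp ≻ backbone_weights
```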
2.3 Optimizer as memory module
Once momentum is treated as a "keyless memory", further extensions become possible (a sketch of the first one follows the table):
Extension | Formula | Effect
Add a value parameter P | m ← α m − η P ∇ | automatic gradient preprocessing
Switch to an L2 loss | Δ-rule update | capacity utilization ↑
Replace m with an MLP | Deep Momentum GD | captures nonlinear gradient dynamics
Nest Newton-Schulz | σ(·) = Newton-Schulz | equivalent to the Muon optimizer
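As a concrete instance of the first row, here is a hedged sketch (toy quadratic problem; the choice of P is mine and deliberately idealized) of momentum with a value/preconditioning parameter P, i.e. m ← αm − ηP∇:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(8, 8))
A = A @ A.T + np.eye(8)            # toy quadratic loss
b = rng.normal(size=8)
grad = lambda w: A @ w - b

# Momentum with a value parameter P:  m <- alpha*m - eta * (P @ grad).
# With P = I this is plain momentum; a non-trivial P preprocesses
# (preconditions) the gradient before it is written into the memory m.
P = np.linalg.inv(A)               # idealized choice, purely for illustration
w, m = np.zeros(8), np.zeros(8)
alpha, eta = 0.5, 0.3
for _ in range(100):
    m = alpha * m - eta * (P @ grad(w))
    w = w + m

print("residual:", float(np.linalg.norm(A @ w - b)))
```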
3 Self-modifying sequence models: Titans
Key idea:
"Since the optimizer is memory, can the model learn to write its own optimizer?"
Titans =
• Slow weights (large MLP) → long-term memory
• Fast weights (linear attention) → working memory
• Learner L(·) → outputs, in real time and based on the current context, the rule for updating the slow weights (i.e., it generates the ΔW update rules).
During training, L(·)'s parameters are themselves trained by SGD, which yields the nesting of "gradients optimizing the gradient optimizer".
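A very rough sketch of that idea (the shapes, names, and the scalar-gate design are my own simplifications, not the paper's exact formulation): a learner reads the current token and decides how strongly to write into the slow weights.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 16
W_slow = np.zeros((d, d))                  # slow weights: long-term memory
theta = rng.normal(scale=0.1, size=d)      # learner L(.): here just a linear gate

def learner_gate(x, theta):
    # L(.) reads the current context and emits the update rule; in this toy
    # version the "rule" is reduced to a single write strength in (0, 1).
    return 1.0 / (1.0 + np.exp(-theta @ x))

for t in range(256):                       # stream of tokens
    x = rng.normal(size=d)
    k, v = x, np.tanh(x)                   # toy key/value derived from the token
    eta_t = learner_gate(x, theta)         # context-dependent "ΔW" strength
    # Delta-rule-style write into the slow memory, modulated by the learner:
    W_slow += eta_t * np.outer(v - W_slow @ k, k) / d
    # In real training, theta would itself receive gradients from the outer
    # loss, giving the "gradient optimizes the gradient optimizer" nesting.

print("slow-memory norm:", float(np.linalg.norm(W_slow)))
```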
4 HOPE: Turning memory into "frequency bands"
Continuum Memory System (CMS)
• Breaks the traditional "long/short-term memory" split into a series of MLPs over a continuum of frequency bands, MLP^(f₁), …, MLP^(fₖ), each updated according to its own chunk length C(ℓ) (see the sketch after this list).
• Slower bands → longer contexts → more abstract knowledge.
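A minimal scheduling sketch of CMS (the chunk lengths C(ℓ), the number of levels, and the per-token "gradient" are placeholders of mine): each level accumulates its signal and only applies an update at its own chunk boundary.

```python
import numpy as np

# CMS scheduling sketch: level l is only updated every C(l) tokens, so fast
# bands track recent context while slow bands integrate over long spans.
chunk_len = {1: 1, 2: 8, 3: 64}                 # hypothetical C(l) per level
params = {l: np.zeros(4) for l in chunk_len}    # stand-ins for MLP^(f_l) weights
grad_buf = {l: np.zeros(4) for l in chunk_len}  # per-level accumulated gradient
n_updates = {l: 0 for l in chunk_len}

for t in range(1, 513):                         # token stream
    g_t = np.sin(0.1 * t) * np.ones(4)          # stand-in for a per-token gradient
    for l, C in chunk_len.items():
        grad_buf[l] += g_t
        if t % C == 0:                          # level l's chunk boundary
            params[l] -= 0.01 * grad_buf[l] / C # apply the averaged update
            grad_buf[l] = np.zeros(4)
            n_updates[l] += 1

print(n_updates)   # {1: 512, 2: 64, 3: 8}: slower bands update far less often
```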
HOPE = CMS + the Titans self-modification core.
Fig 3: Comparison of the update frequencies of a Transformer vs HOPE.
5 Experiment overview
Table 1 (complete results):
Conclusion:
• At equal parameter counts, HOPE has the lowest perplexity and the highest average accuracy on common-sense tasks;
• As models scale up, the gap between HOPE and Titans narrows, but HOPE still consistently outperforms Transformer++;
• HOPE excels at memory management on long-context needle-in-a-haystack (NIAH) tasks, suggesting that CMS offers a more efficient way to handle extended information sequences.
Performance comparison of different architectures on language modeling (perplexity, left) and common sense reasoning (accuracy, right) tasks: HOPE, Titans, Samba, and Transformer baseline.
Performance comparison across difficulty levels on long-context tasks (HOPE, Titans, TTT, Mamba2)
• NIAH-PK: needle-in-a-haystack passkey retrieval
• NIAH-H: needle-in-a-haystack number retrieval
• NIAH-W: needle-in-a-haystack word retrieval
Finally, one figure sums it up:
https://research.google/blog/introducing-nested-learning-a-new-ml-paradigm-for-continual-learning/