Xinzhiyuan Report
Edited by: Aeneas, KingHZ
【Xinzhiyuan Guide】Is the Transformer killer here? MoR, the new architecture just released by KAIST, Mila, Google DeepMind, and other institutions, doubles inference speed and halves memory usage. It redraws the performance boundaries of LLMs and comprehensively outperforms the traditional Transformer. Netizens exclaim: 'Mind-blowing! Another game-changing bombshell has arrived.'
Just now, the teams from KAIST, Mila, and Google DeepMind released a major breakthrough—
a new LLM model architecture named Mixture-of-Recursions (MoR).
This innovative architecture is considered by the industry to have the potential to be a Transformer killer!
It boosts inference speed by 2 times, reduces training FLOPs, and directly halves KV cache memory.
Ultimately, at parameter scales ranging from 135M to 1.7B, MoR directly established a new Pareto front: achieving lower perplexity, higher few-shot accuracy, and over 2 times higher throughput with the same training FLOPs.
It comprehensively outperforms traditional Transformers!
Paper link: https://arxiv.org/abs/2507.10524
In fact, the academic community has long recognized that Transformers are too complex and have astonishing computational demands.
For example, Albert Gu, a CMU professor and author of the Mamba architecture, recently stated that Transformer models have significant limitations and that the very concept of 'tokens' is nonsense.
And Google Product Lead Logan Kilpatrick openly pointed out the flaws in the attention mechanism—that infinite context is impossible to achieve—and emphasized the necessity for comprehensive innovation at the core architectural layer.
Google DeepMind's research today aligns perfectly with the views of these experts.
In response, netizens were utterly blown away.
Some predict that latent space reasoning might bring about the next major breakthrough.
Clearly, for tasks that involve hierarchical decomposition, such as code, mathematics, and logic, MoR is a game-changing bombshell.
Some even commented: It looks like Hinton's Capsule Networks have been reborn.
Google DeepMind's Big Move
Recursive Magic Streamlines LLMs and Boosts Speed
Given the current state of LLM development, what's next? Relying on stacking parameters and adding layers to make them smarter?
This research tells us: true mastery never relies on brute force, but on the art of design.
The brand-new architecture they've created, Mixture-of-Recursions (MoR), directly doubles LLM inference speed!
So, what exactly did MoR do?
In short, it accomplished two things:
1. Not all tokens are treated equally
When an LLM processes text, it breaks sentences into tokens. However, words like 'of,' 'is,' or 'in' don't require deep reasoning; a single forward pass is sufficient. Complex tokens, on the other hand, need to pass through the same stack of layers multiple times.
MoR's cleverness lies in its token-specific approach.
MoR's secret weapon is a small router that scores the hidden state of each token. Only high-scoring tokens continue to loop, while the rest exit early.
2. Loop reuse: One module handles everything
The traditional Transformer approach is to constantly 'stack layers'; the more layers, the stronger the processing capability. However, the cost of this is memory and computing power: models become slower and more expensive.
MoR, on the other hand, takes the opposite approach: it designs a shared block through which each token loops at most 4 times, and as soon as the router says 'done,' the token exits the loop early (a minimal sketch of this loop follows below).
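To make the loop concrete, here is a minimal, illustrative PyTorch sketch of the idea under our own assumptions: a single shared block is applied up to max_recursions times, and a tiny linear router decides per token whether to keep looping or exit early. All names here (MoRSketch, shared_block, router, threshold) are ours, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MoRSketch(nn.Module):
    """Minimal sketch of the Mixture-of-Recursions idea: one shared block,
    a tiny router, and per-token early exit. Not the official implementation."""

    def __init__(self, d_model: int, n_heads: int, max_recursions: int = 4):
        super().__init__()
        # One shared stack of layers, reused at every recursion step.
        self.shared_block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # A lightweight router that scores each token's hidden state.
        self.router = nn.Linear(d_model, 1)
        self.max_recursions = max_recursions

    def forward(self, h: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
        # h: (batch, seq_len, d_model)
        active = torch.ones(h.shape[:2], dtype=torch.bool, device=h.device)
        for _ in range(self.max_recursions):
            if not active.any():
                break  # every token has already exited
            updated = self.shared_block(h)
            # For clarity the block is applied to all tokens and then masked;
            # the real method would compute only for the still-active tokens.
            h = torch.where(active.unsqueeze(-1), updated, h)
            # The router scores each token's hidden state; low scores exit the loop.
            keep = torch.sigmoid(self.router(h)).squeeze(-1) > threshold
            active = active & keep
        return h
```

In this sketch, an easy token whose score drops below the threshold after the first pass stops there, while harder tokens take up to four passes through the same shared weights.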
In essence, if Transformer is a massive factory assembly line, MoR is more like a highly efficient special forces unit. The future of AI will likely no longer be about who is heavier, but who is better at division of labor, scheduling, and saving effort.
And Google DeepMind has keenly grasped this point, demonstrating an early prototype of this trend.
True Adaptive Computation
Simply scaling up language models according to scaling laws can indeed make their capabilities skyrocket, but the compute and cost required for training and deployment soar along with them.
Common 'streamlining' methods currently involve either parameter sharing (saving VRAM) or on-demand computation (saving computing power).
However, there is still a lack of an architecture that can organically integrate both.
Mixture-of-Recursions (MoR) fully leverages the potential of recursive Transformers (see Figure 1), successfully combining the two.
Figure 1: Overview of Mixture-of-Recursions (MoR)
(Left) Each recursive step includes a fixed layer stack and a router (gray box in the middle) that determines whether a token continues recursion.
(Middle) Full model structure, where the shared recursive step is applied to each token up to Nr times, depending on routing decisions.
(Right) Example of a routing pattern showing token-level recursion depth, where darker colors indicate more active computation for that token within the recursive block. The numbers at the bottom indicate, in different colors, how many recursion steps each text token takes: 1, 2, or 3.
In a unified architecture, MoR simultaneously achieves three efficiency optimizations:
parameter compression through weight sharing; redundant computation reduction through dynamic routing; and memory overhead reduction through intelligent caching.
Recursive Mixture Architecture
During pre-training and inference, MoR dynamically adjusts recursion steps for each token, relying on two major components:
routing mechanisms and KV caching strategies.
Routing Mechanisms: Expert-choice vs. Token-choice
Inspired by top-k gating mechanisms, researchers proposed Expert-choice routing (see Figure 2a).
In this mode, each recursion depth can be seen as an 'expert,' and in each recursion round these experts select the top-k tokens they deem most worthy of further processing.
To keep recursion consistent, the team also introduced a hierarchical filtering mechanism: only tokens selected at recursion step r are eligible to compete for step r+1.
This design mimics an early-exit mechanism: early in training, the model learns to automatically single out the tokens that need deeper processing and to concentrate computational resources on the most difficult ones.
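As a rough illustration of one expert-choice step with hierarchical filtering (our own sketch, not the paper's code; the function name and arguments are hypothetical), the helper below keeps only the top-k scoring tokens among those that survived the previous step.

```python
import torch

def expert_choice_step(scores: torch.Tensor, k: int, prev_selected: torch.Tensor) -> torch.Tensor:
    """One illustrative expert-choice selection step.

    scores:        (batch, seq_len) router scores for the current recursion step
    k:             number of tokens this 'expert' (recursion depth) keeps
    prev_selected: (batch, seq_len) bool mask of tokens chosen at the previous step
    Returns a bool mask of tokens that take this recursion step.
    """
    # Hierarchical filtering: only tokens selected at step r may compete at step r+1.
    masked = scores.masked_fill(~prev_selected, float("-inf"))
    topk_idx = masked.topk(k, dim=-1).indices
    selected = torch.zeros_like(prev_selected)
    selected.scatter_(1, topk_idx, torch.ones_like(topk_idx, dtype=torch.bool))
    # Guard: if fewer than k tokens survived, drop any -inf positions picked by topk.
    return selected & prev_selected
```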
Unlike the former, token-choice routing (see Figure 2b) determines how many recursive passes each token will take from the outset.
Specifically, based on the hidden state of the first layer, the model calculates a score for each expert (e.g., via softmax or sigmoid).
Assuming there are Nr experts, each corresponding to one recursion depth, the model assigns the token to the expert with the highest score, say expert i. The token is then sent through the first i recursion steps, processed sequentially.
In this method, the recursion depth of a token is determined upon entering the network, avoiding re-selection at each layer and improving inference efficiency.
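A minimal sketch of what such a token-choice router could look like (our own illustration; the class name and layer choices are assumptions, not the authors' code):

```python
import torch
import torch.nn as nn

class TokenChoiceRouter(nn.Module):
    """Illustrative token-choice router: each token commits up front to a
    recursion depth in {1, ..., n_recursions}. Not the official implementation."""

    def __init__(self, d_model: int, n_recursions: int):
        super().__init__()
        # One logit per 'expert', i.e. per candidate recursion depth.
        self.proj = nn.Linear(d_model, n_recursions)

    def forward(self, h0: torch.Tensor) -> torch.Tensor:
        # h0: (batch, seq_len, d_model), the hidden state feeding the first recursion.
        probs = torch.softmax(self.proj(h0), dim=-1)
        # Choosing expert i means the token will run the first i recursion steps.
        depth = probs.argmax(dim=-1) + 1  # (batch, seq_len), values in 1..n_recursions
        return depth
```

A token assigned depth 3 then passes through the shared block three times in a row and stops, with no re-routing at later steps.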
Table 2 (left) compares the two methods:
Expert-choice routing has the advantage of near-ideal computational load balancing. However, it is prone to information leakage, because the top-k selection compares tokens across the sequence, which can break causality during autoregressive inference.
In contrast, token-choice routing naturally avoids such leakage, since each token's recursion depth is decided from its own hidden state. But this method results in uneven load distribution across experts.
Table 2: Comparison of routing and KV caching strategies. (Left) Summary of two routing strategies: expert-choice vs. token-choice; (Right) Relative cost efficiency of caching strategies compared to standard Transformer.
Figure 2: Architectural components of Mixture-of-Recursions (MoR). (a) Expert-choice routing; (b) Token-choice routing; (c) KV caching strategies.
KV Caching Strategies: Per-recursion layer caching vs. Cross-recursion sharing
For the MoR model, researchers proposed two KV caching strategies:
per-recursion layer caching and cross-recursion sharing.
1. Per-recursion layer caching (see Figure 2c, top) is 'selective caching': only tokens routed to a specific recursion layer will generate and store their KV pairs in that layer.
Attention computation is performed only within the cache of the current recursion layer. This design helps achieve localized computation, significantly improving memory usage efficiency and reducing I/O burden.
2. Cross-recursion sharing (see Figure 2c, bottom): KV pairs are generated and cached only in the first recursion layer, then reused across all subsequent layers. Under this mechanism, the number of queries participating in attention computation at each layer may decrease.
This means that all tokens, regardless of whether they continue to participate in computation in subsequent layers, can fully access historical context without recalculation.
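To make the difference between the two strategies concrete, here is a toy bookkeeping sketch (ours, not the paper's code; the function and argument names are hypothetical) that records which token positions contribute KV entries to each recursion step's cache.

```python
from collections import defaultdict

def kv_cache_layout(token_depths, n_recursions, mode="per_recursion"):
    """Return {recursion_step: [token positions whose K/V live in that step's cache]}.

    token_depths: list with one assigned recursion depth per token position.
    """
    caches = defaultdict(list)
    if mode == "per_recursion":
        # Selective caching: a token writes K/V only at the steps it actually takes.
        for pos, depth in enumerate(token_depths):
            for step in range(1, depth + 1):
                caches[step].append(pos)
    else:  # "cross_recursion"
        # Shared caching: K/V are written once at the first step and reused,
        # read-only, by every later recursion step.
        caches[1] = list(range(len(token_depths)))
        for step in range(2, n_recursions + 1):
            caches[step] = caches[1]
    return dict(caches)

# Example: four tokens routed to depths 1, 3, 2, 3 respectively.
print(kv_cache_layout([1, 3, 2, 3], n_recursions=3, mode="per_recursion"))
# {1: [0, 1, 2, 3], 2: [1, 2, 3], 3: [1, 3]}
print(kv_cache_layout([1, 3, 2, 3], n_recursions=3, mode="cross_recursion"))
# {1: [0, 1, 2, 3], 2: [0, 1, 2, 3], 3: [0, 1, 2, 3]}
```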
Table 2 (right) compares the two caching strategies:
Per-recursion layer caching: KV memory and I/O burden are compressed to about half of the original.
Cross-recursion sharing: it compresses attention computation only linearly (fewer active queries, but against the full KV length), and its KV read/write volume is higher, which may become a performance bottleneck.
Table 3: Comparison of MoR, Recursive Transformer, and Standard Transformer under conditions of equal computation and equal token count.
Experiments
Researchers pre-trained models from scratch, using a Llama-based Transformer architecture with configurations drawn from the SmolLM open-source models, and evaluated them on the FineWeb-Edu validation set and six few-shot benchmark test sets.
Key Results
Under the same training computational budget, MoR outperforms baseline models with fewer parameters
Under the same training budget (16.5e18 FLOPs), researchers compared the MoR model with standard Transformer and recursive Transformer.
For four model scales (135M, 360M, 730M, and 1.7B parameters), the validation loss corresponding to different computational budgets is shown in the figure.
As shown in Table 3, the MoR model, using expert-choice routing and two recursions (Nr=2), achieved lower validation loss and higher few-shot average accuracy than the standard baseline.
This is due to MoR's higher computational efficiency, allowing it to process more training tokens under the same FLOPs budget.
Under the same data volume, MoR still outperforms baseline models with less computation
To isolate the impact of architectural differences, researchers conducted an analysis under the premise of a fixed number of training tokens (20B).
The results confirmed that, with 25% fewer training FLOPs, the MoR model (Nr=2) still achieved lower validation loss and higher accuracy, surpassing both the standard and recursive baselines.
Compared to the standard baseline, MoR models reduced training time by 19% and peak memory usage by 25%.
This is attributed to the specifically designed hierarchical filtering mechanism and recursion-based attention mechanism.
Furthermore, MoR's performance is also influenced by routing and caching strategies.
IsoFLOP Analysis
One of the core criteria for evaluating a new model architecture design is whether its performance can continue to improve as model scale and computational load increase.
Therefore, the research team comprehensively compared MoR with standard Transformer (Vanilla) and recursive Transformer.
Experimental Setup
Experiments were conducted with four model scales: 135M, 360M, 730M, and 1.7B parameters.
For both recursive Transformer and MoR configurations, the number of recursions was uniformly set to 3.
Pre-training was performed under three different computational budgets: 2e18, 5e18, and 16.5e18 FLOPs.
MoR Architecture: Scalable and Parameter-Efficient
As shown in Figure 3, MoR consistently outperformed recursive baseline models across all parameter scales and computational budgets.
Although MoR performed slightly worse than the standard Transformer at the smallest scale (135M), this gap rapidly narrowed as the model scale increased.
When the parameter scale exceeded 360M, MoR not only matched the standard Transformer but even performed superiorly under low and medium computational budgets.
Overall, these results indicate that MoR scales well and is highly parameter-efficient, making it a strong alternative to the standard Transformer.
Inference Throughput Evaluation
Through parameter sharing, MoR can leverage continuous depth-wise batching to significantly boost throughput during inference.
This mechanism keeps GPU utilization high by immediately filling freed slots with new tokens as soon as old sequences finish during decoding.
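The scheduling idea can be sketched in a few lines (a simplified illustration under our own assumptions; request objects with a done flag and the run_one_step callable are hypothetical, not the paper's system):

```python
from collections import deque

def continuous_batching(requests, batch_size, run_one_step):
    """Toy scheduler illustrating continuous (depth-wise) batching: as soon as a
    sequence finishes, a waiting request takes its slot, so the GPU batch stays full."""
    waiting = deque(requests)
    active = []
    while waiting or active:
        # Refill freed slots immediately instead of waiting for the whole batch to finish.
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        run_one_step(active)                     # one decode step for the current batch
        active = [r for r in active if not r.done]
```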
Experimental Setup
At the 360M parameter scale, the team tested the MoR model with different recursion depths (2, 3, and 4).
MoR significantly boosts inference throughput through depth-wise batching
As shown in Figure 4a, MoR variants exceeded the throughput of the standard Transformer in both batch-size settings.
Greater recursion depth lets more tokens exit early, which reduces KV cache usage and further boosts inference speed. For example, at the maximum batch setting (B=Max), MoR-4 achieves up to a 2.06x speedup.
Experiments show that combining the deep batching mechanism with an early exit strategy can greatly accelerate the MoR model's practical inference speed.
For more content and details on ablation studies, please refer to the original paper.
References:
https://arxiv.org/abs/2507.10524
https://x.com/rohanpaul_ai/status/1945342236310561091
https://www.rohan-paul.com/p/landmark-research-from-google-deepmind