The Recursive-Reasoning HRM Model, Reimagined! TRM, a Two-Layer Network (7M Parameters), Outperforms LLMs!

In short, the authors have radically reworked the already groundbreaking previous work (HRM, only 27M parameters), demonstrating that true "depth" comes not from stacking network layers but from the number of computations. By letting an extremely simple two-layer network perform repeated recursive reasoning, this model, stripped of all fancy designs, exhibits stronger logical reasoning on hard puzzle benchmarks than SOTA large models, perfectly illustrating the idea of "Less is More." (Original paper title at the end of the article. Published on arXiv on 06 Oct 2025 by Samsung SAIL Montréal.)

First Stage: Identifying Core Concepts

Analysis of the Paper's Motivation

The paper's starting point is clearly defined, primarily addressing a dilemma faced by the current field of Artificial Intelligence, especially Large Language Models (LLMs):

  • The "Fragility" of Large Models: While models like GPT-4 are knowledgeable and powerful, they perform imperfectly when handling tasks requiring strict, multi-step, precise reasoning (e.g., solving Sudoku, navigating mazes, or handling abstract visual puzzles like ARC-AGI). They generate answers token by token, and if one step is wrong, the entire solution might fail. This is akin to a talented but occasionally careless genius making fatal small errors.

  • Limitations of Existing Methods: To boost LLM reasoning, methods like Chain-of-Thought (CoT) were proposed, encouraging the model to "think" before answering. However, this approach is computationally expensive, relies on high-quality "thought process" data for training, and sometimes the generated "thought process" itself is incorrect.

  • A Promising but Complex Predecessor: Against this backdrop, a paper titled "Hierarchical Reasoning Model (HRM)" proposed a new idea: using two small networks for "recursive thinking," mimicking different-frequency activities in the brain. HRM achieved surprising results on several puzzle tasks, demonstrating the potential of small models for deep reasoning. However, its design was complex, relying on intricate biological metaphors and a fixed-point theorem whose assumptions may not hold in practice, which also made its implementation inefficient.

Therefore, the authors' motivation can be summarized as: Can we inherit the advantages of HRM's "recursive reasoning" but implement it in a simpler, more robust, more efficient, and more powerful way? The paper's title, "Less is More," perfectly captures this motive—achieving better results with fewer parameters, simpler theory, and a more direct method.

Analysis of the Paper's Main Contributions

The core contribution of this paper is the proposal of a new model called the Tiny Recursive Model (TRM), which drastically simplifies and improves upon the predecessor HRM.

  • Key Innovations:

    • Structural Simplification: Simplified HRM's complex dual-network (one high-frequency, one low-frequency) structure into a single, smaller network.

    • Theoretical Simplification: Abandoned HRM's reliance on the complex and potentially inapplicable "Fixed-Point Theorem," no longer requiring the assumption that the model's thought process reaches a stable "equilibrium."

    • Conceptual Simplification: Discarded HRM's obscure biological "hierarchical reasoning" explanation, proposing a more intuitive understanding: the model maintains a "current answer" and a "current idea (or scratchpad)," and iteratively updates between the two.

    • Efficiency Improvement: Simplified HRM's adaptive halting mechanism (ACT), which decides when to stop the attempt-improve loop, removing the need for a second forward pass and thereby improving training efficiency.

    • Performance Leap: Ultimately, this smaller, simpler TRM model not only far surpassed HRM on multiple high-difficulty reasoning tasks (like Sudoku, Maze, ARC-AGI) but also defeated top-tier Large Language Models with tens of thousands of times its parameters.

  • Key Supporting Technologies:

    • Full Recursive Backpropagation: TRM no longer approximates gradients by only passing them through the last thinking step (like HRM). Instead, gradients are backpropagated through the entire "thought-solution" recursive process. While this increases computation per step, it ensures the model learns more robustly and effectively.

    • Deep Supervision: This is a core mechanism inherited and optimized from HRM. The model performs multiple "attempt-improve" loops. The output of each attempt is used to calculate the loss, and the model uses the results of this attempt (answer and idea) as the starting point for the next attempt, continuing to improve. This process simulates a very deep network but avoids massive memory overhead.

    • Reinterpretation of Latent States: TRM redefined HRM's two abstract latent variables (h and y) into the more understandable "idea/scratchpad" (latent reasoning z) and "answer" (predicted answer y). This simple conceptual shift clarifies the entire operational logic of the model.

  • Significant Results: The most striking result is the extreme manifestation of small model triumphing over large ones. A TRM model with only 7 million parameters achieved 45% accuracy on the ARC-AGI-1 test, a level similar to or even higher than LLMs with hundreds of billions or even trillions of parameters (like Gemini 2.5 Pro). On extremely difficult Sudoku tasks, TRM reached 87% accuracy, while large models achieved 0%. This strongly proves that for certain types of reasoning problems, excellent algorithmic architecture design is more important than simply scaling parameter count.

Identifying Difficulties in Understanding

  • Key Concepts to Understand:

    • Recursive Reasoning: The core of the method. Understanding how the model repeatedly calls itself to progressively optimize the answer.

    • Deep Supervision: Key to effective learning. Understanding why training is not done all at once but step-by-step, and how the output of the previous step serves as the input for the next.

    • HRM vs. TRM Comparison: Appreciating TRM's finesse largely relies on understanding the "subtractions" made from HRM and why these subtractions were effective, particularly dropping the fixed-point-theorem justification and the 1-step gradient approximation.

  • Most Challenging Conceptual Part: The most challenging part is understanding how "recursive reasoning" and "deep supervision" work collaboratively. Specifically, during one complete "attempt" (one deep supervision step), the model performs T rounds of recursion. In the first T-1 rounds, the recursion only serves to improve the model's "idea" and "answer" but is not used for learning (i.e., no gradient calculation). Only in the final round of recursion does the model "turn on the gradient switch," allowing the learning signal (loss) to backpropagate and update network weights. Simultaneously, the output state of this round is "frozen" (detached) to serve as the starting point for the next "attempt." This mechanism is intricate but is fundamental to the model's ability to "think deeply" without "memory explosion."

The optimal starting point for explanation is the TRM's core operational loop.

Second Stage: Deep Explanation of Core Concepts

Designing a Real-Life Analogy: A Student Solving an Extremely Difficult Sudoku Puzzle

Imagine a very smart student tackling an extremely difficult Sudoku puzzle. The student doesn't solve it in one go; their process is iterative:

  • Tools for Solving: They have two items:

    1. An Official Answer Sheet (Sudoku Grid): This is the final answer to be submitted (Predicted Answer y).

    2. A Large Scratchpad (Scratchpad): This is where they perform reasoning, calculations, and record possibilities (Latent Reasoning z).

  • The Solving Process:

    1. Initial Observation: The student glances at the puzzle and fills in a few highly certain numbers on the answer sheet.

    2. Deep Thinking (Scratchpad Phase): Next, they focus on the scratchpad. Looking at the current answer sheet and the puzzle, they start deriving conclusions furiously on the scratchpad: "If position A is 3, then position B must be 5, and C must be 8..." They repeat these sequences of logical deductions, constantly updating the scratchpad over several turns, but they do not write these deductions on the answer sheet yet. (Internal recursion, n times of updating z)

    3. Updating the Answer (Answer Sheet Phase): After much deliberation, they form a mature set of ideas on the scratchpad. They then turn back and, based on the final conclusions on the scratchpad, update the answer sheet, perhaps erasing a previous uncertain number and filling in a new, well-reasoned one. (One time of updating y)

    4. End of One Attempt: They have now completed one full round of "attempt and improvement." The answer on the sheet is more complete than before.

  • Teacher's Supervision and Learning:

    • Staged Check: After the student completes one round of "attempt and improvement," the teacher reviews their answer sheet against the correct answer and tells them: "You did well this step, but a few numbers are still wrong." (Deep Supervision: calculate loss based on y)

    • Reflection and Learning: Upon hearing the feedback, the student only reviews the reasoning process of that specific round (the deductions on the scratchpad and the final decision to write the answer), reflecting on what went wrong to adjust their solving strategy. They do not reflect on every thought they've had since starting the puzzle, as that would be too exhausting. (Single-step backpropagation: gradients flow only through the final recursion round)

    • Starting a New Attempt: They then take the current answer sheet and scratchpad content as the new starting point, beginning the next "deep thinking → update answer" loop, striving for a better outcome. (State detachment: y.detach(), z.detach())

This process repeats many times until the student perfectly solves the Sudoku puzzle.

Mapping Analogy to Actual Technology

The effectiveness of TRM is rooted in the explicit separation of function for the latent variables. This can be mapped as follows:

  • Student → TRM model (a single small network)

  • Sudoku puzzle → Input question (x)

  • Answer sheet → Predicted answer (y)

  • Scratchpad → Latent reasoning (z)

  • Deep thinking (n times) → Internal recursion (updating z)

  • Updating the answer (1 time) → Answer update step (updating y)

  • One full attempt-improve round → One recursive call

  • Teacher's feedback → Deep supervision

  • Reflecting only on the last round → Single-step backpropagation (through the final recursion round only)

  • Current state as the new start → State detachment (.detach())

In-Depth Technical Details

A complete training step (one step in the Deep Supervision loop) is as follows:

  1. Perform T-1 rounds of "no-gradient" recursive optimization: This stage corresponds to the student solving the problem before the teacher checks. This uses a code block like with torch.no_grad(): for j in range(T-1): y, z = latent_recursion(x, y, z, n). Here, torch.no_grad() prevents gradient recording, saving memory.

  2. Perform 1 round of "with-gradient" recursive optimization: This is the critical moment for learning. This involves calling y, z = latent_recursion(x, y, z, n) outside the no-grad context.

  3. Internal latent_recursion function: This involves:

    • n times of thinking (updating the scratchpad z): the loop for i in range(n): z = net(x, y, z) executes. Mathematically: new idea = NN_idea_part( Concatenate( original problem, current answer, current idea ) ). Every thought is based on global information.

    • 1 time of updating the answer (the answer sheet y): the call y = net(y, z) executes. Mathematically: new answer = NN_answer_part( Concatenate( current answer, final idea ) ). The original problem x is no longer needed here; the model updates the answer based on the refined thoughts.

  4. Calculate Loss, Backpropagation, and Model Update: The loss is calculated using loss = softmax_cross_entropy(output_head(y), y_true). The loss.backward() command flows the gradient only through the single "with-gradient" latent_recursion call, updating the network weights.

  5. State Reset for Next Attempt: When the function returns, y and z are .detach()ed, cutting their connection to the computation graph. The detached states become the improved initial states for the next Deep Supervision loop.

This loop repeats until the maximum number of deep-supervision steps (N_sup, up to 16) is reached or the ACT mechanism triggers early stopping. A minimal code sketch of the latent_recursion call described above follows.
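
To make the inner loop concrete, below is a minimal PyTorch-style sketch of the latent_recursion call described above. It is an illustration under simplifying assumptions, not the paper's reference code: the single network net is assumed to accept one tensor, so x, y, and z are combined by simple addition, and all tensors share the same shape.

```python
import torch
import torch.nn as nn

def latent_recursion(net: nn.Module, x: torch.Tensor, y: torch.Tensor,
                     z: torch.Tensor, n: int = 6):
    """One 'attempt': think n times on the scratchpad z, then update the answer y once.

    Illustrative assumptions: net is the single tiny network and takes one tensor,
    so the problem x, current answer y, and scratchpad z are combined by addition.
    """
    for _ in range(n):      # n rounds of "thinking": z is refined from x, y, and z
        z = net(x + y + z)
    y = net(y + z)          # 1 round of "acting": y is refreshed from y and z only (no x)
    return y, z
```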

Summary

Through the "student solving Sudoku" analogy, one can deeply understand TRM's core mechanism:

  • TRM decomposes complex reasoning tasks into two steps: "thinking" (updating z) and "acting" (updating y).

  • By repeatedly iterating this "thinking-acting" loop within the framework of "deep supervision," TRM simulates an extremely deep reasoning process using a small network.

  • The crucial .detach() operation and the "single-step backpropagation" mechanism make this deep simulation computationally feasible, preventing memory explosion. This is the secret weapon behind TRM's "less is more" success.

The core mathematical principle can be summarized popularly as: "Constantly trying and failing, but learning only from the most recent mistake, and treating the corrected result as the new starting point."

Third Stage: Detailed Step-by-Step Procedure

Step 1: Preparation and Initialization

  1. Input Data: A sample is taken from the training set, including: Problem x (a 9x9 Sudoku grid with blanks) and True Answer y_true (the complete, correct solution).

  2. Model and State Initialization: TRM Network net (a randomly initialized, shallow NN, e.g., 2 layers). Latent States (y and z are initialized as zero or random vectors, matching the model's hidden layer dimension).

Step 2: Entering the Deep Supervision Loop (The Outer Loop)

The model will perform multiple (up to N_sup = 16) "attempt-improve" steps on this Sudoku problem. The following describes one complete iteration:

  1. Input Encoding: The input Sudoku problem x is transformed into a high-dimensional vector representation x_embed via an embedding layer. This x_embed serves as the "constant problem context" throughout all subsequent recursive steps.
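
As a concrete illustration of Steps 1 and 2, here is a minimal sketch of the data encoding and state initialization. The sequence length, vocabulary, hidden width, and the 2-layer MLP stand-in for the TRM network are illustrative placeholders, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

SEQ_LEN, VOCAB, D_MODEL = 81, 10, 512          # 9x9 Sudoku cells; digits 0 (blank) to 9; hidden width (illustrative)

embed = nn.Embedding(VOCAB, D_MODEL)           # token embedding for the puzzle cells
net = nn.Sequential(                           # stand-in for the tiny 2-layer TRM network
    nn.Linear(D_MODEL, D_MODEL), nn.GELU(),
    nn.Linear(D_MODEL, D_MODEL),
)

x_tokens = torch.randint(0, VOCAB, (1, SEQ_LEN))   # placeholder puzzle x (batch of 1)
y_true   = torch.randint(0, VOCAB, (1, SEQ_LEN))   # placeholder true solution y_true

x_embed = embed(x_tokens)                      # constant problem context for all recursions
y = torch.zeros_like(x_embed)                  # current answer embedding, initially "blank"
z = torch.zeros_like(x_embed)                  # latent scratchpad, initially "blank"
```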

Step 3: Executing Deep Recursion (The deep_recursion Function)

This is the core of TRM, consisting of T (e.g., 3) rounds of internal recursion, allowing sufficient "thinking" time before a learning (gradient update) step.

  1. Warm-up Thinking Phase (T-1 rounds, no gradient):

    • Goal: Improve the current y and z as much as possible without learning. (Student's private practice before the teacher checks).

    • Execution Flow: The model executes the latent_recursion function T-1 times (e.g., 2 times). In the first round, the current x_embed, y, and z are input into latent_recursion. Inside, the model performs n rounds of "thinking" (the internal reasoning loop, updating z), culminating in a final idea z_final, and then updates the answer y once. The output y and z become the input for the next round. All remaining warm-up rounds repeat this process.

    • Key Point: All these computations occur under torch.no_grad(), consuming no gradient memory.

  2. Formal Learning Phase (Last 1 round, with gradient):

    • Goal: Execute the exact same recursive process, but this time record all calculation steps so the model can learn from them.

    • Execution Flow: The model executes the latent_recursion function 1 more time, using the final y and z from the "warm-up phase" as input. The process is identical (n updates of z, then 1 update of y).

    • Key Point: This calculation is not under torch.no_grad(), so the computational graph is fully constructed.
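
Continuing the sketches above, the deep_recursion wrapper for this step can be written as follows. Again a sketch under the same illustrative assumptions; names and defaults such as T = 3 mirror the description in the text rather than the exact reference implementation.

```python
import torch

def deep_recursion(net, x, y, z, n: int = 6, T: int = 3):
    """One deep-supervision step: T-1 warm-up rounds without gradients, then 1 round with gradients."""
    with torch.no_grad():                       # warm-up thinking: improve y and z, record nothing
        for _ in range(T - 1):
            y, z = latent_recursion(net, x, y, z, n)
    y, z = latent_recursion(net, x, y, z, n)    # formal learning round: full computation graph
    return y, z
```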

Step 4: Calculating Loss, Backpropagation, and Model Update

  1. Generate Final Prediction: The final latent answer y obtained from the "Formal Learning Phase" is decoded via an output head (output_head) to produce the predicted answer y_pred, which is compared to the true answer y_true.

  2. Calculate Loss: The primary task loss (e.g., cross-entropy loss) is calculated between y_pred and y_true. An optional ACT loss may also be calculated to promote early termination.

  3. Backpropagation: The loss.backward() command is invoked. The gradient flows backward only through the computational graph constructed during the "Formal Learning Phase" (Step 3.2), updating the weights of the net NN. Crucially, the gradient does not flow into the "Warm-up Thinking Phase."

  4. Parameter Update: The optimizer updates all weights of the net NN. One learning step is complete.
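
A sketch of this learning step, reusing net, x_embed, y, z, and y_true from the earlier sketches. The output_head and halt_head modules and the way the ACT term is formed here are illustrative assumptions rather than the paper's exact losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

output_head = nn.Linear(D_MODEL, VOCAB)        # decodes y into per-cell digit logits
halt_head   = nn.Linear(D_MODEL, 1)            # optional ACT head: "is the answer already correct?"
params = list(net.parameters()) + list(output_head.parameters()) + list(halt_head.parameters())
optimizer = torch.optim.AdamW(params, lr=1e-4)

y, z = deep_recursion(net, x_embed, y, z)      # Step 3: only the last round builds a graph

logits = output_head(y)                                        # (batch, 81, VOCAB)
task_loss = F.cross_entropy(logits.transpose(1, 2), y_true)    # main loss over all cells
is_solved = (logits.argmax(-1) == y_true).all(-1).float()      # did this attempt fully solve the puzzle?
act_loss = F.binary_cross_entropy_with_logits(halt_head(y.mean(dim=1)).squeeze(-1), is_solved)
loss = task_loss + act_loss                    # optional ACT term encourages early stopping

optimizer.zero_grad()
loss.backward()        # gradients flow only through the single with-grad recursion round
optimizer.step()
```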

Step 5: State Reset, Preparing for Next Iteration

  1. Detach States: Upon returning from the deep_recursion function, the output y and z are subjected to .detach(). This operation severs their gradient history connection.

  2. Enter Next Deep Supervision Iteration: The detached y and z become the initial states for the next Deep Supervision iteration. The model returns to Step 2 and repeats the entire process.

This loop continues until the N_sup limit (e.g., 16 steps) is reached or ACT triggers early termination for the current sample.
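
Putting Steps 2 through 5 together, the outer deep-supervision loop can be sketched as below, reusing the pieces defined above; N_SUP and the early-exit check are illustrative, and only the main task loss is shown for brevity.

```python
N_SUP = 16                                     # maximum "attempt-improve" steps per sample

y = torch.zeros_like(x_embed)                  # fresh answer and scratchpad for this sample
z = torch.zeros_like(x_embed)

for step in range(N_SUP):
    y, z = deep_recursion(net, x_embed, y, z)             # Step 3: T-1 no-grad rounds + 1 with-grad round

    logits = output_head(y)                               # Step 4: deep supervision at every attempt
    loss = F.cross_entropy(logits.transpose(1, 2), y_true)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    y, z = y.detach(), z.detach()                          # Step 5: cut the graph; states seed the next attempt

    if (logits.argmax(-1) == y_true).all():                # ACT-style early exit once fully correct (illustrative)
        break
```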

Fourth Stage: Experimental Design and Validation Analysis

1. Main Experiment Design Interpretation: Validating the Core Thesis

  • Core Claim: TRM achieves better performance than its predecessor HRM and massive LLMs on difficult reasoning tasks, using fewer parameters and a simpler structure.

  • Experimental Design: Direct performance showdowns across multiple benchmarks:

    • Datasets: Sudoku-Extreme & Maze-Hard (representing classic hard problems requiring precise, long-range, symbolic reasoning where "one mistake is fatal"); ARC-AGI-1 & ARC-AGI-2 (the gold standard for abstract reasoning, measuring strong inductive and generalization abilities, deemed crucial for AGI). These datasets are specifically chosen to test the reasoning "shortcomings" of LLMs, powerfully highlighting the advantage of specialized architectures like TRM.

    • Metric: Accuracy, as answers are unique and deterministic.

    • Baselines: HRM (the predecessor); Direct Prediction (a non-recursive model of the same size, proving the necessity of the "recursive" mechanism); top-tier LLMs (e.g., Gemini, Claude, DeepSeek), illustrating that TRM's architectural advantage surpasses capability gains from sheer scale.

  • Main Results and Conclusion (Table 4 and Table 5):

    • On Sudoku-Extreme, the TRM-MLP version achieved a striking 87.4% accuracy, compared to HRM's 55.0% and the 0.0% achieved by all LLMs. This definitively proves TRM's overwhelming advantage in symbolic logic reasoning.

    • On ARC-AGI-1 and ARC-AGI-2, TRM-Att achieved 44.6% and 7.8% accuracy, respectively, significantly higher than HRM (40.3%/5.0%) and most LLMs (e.g., Gemini 2.5 Pro at 37.0%/4.9%). This shows TRM's recursive mechanism is equally effective in abstract visual reasoning.

    • Conclusion: The main experiments strongly support the core thesis. TRM is not just a successful iteration of HRM but proposes a more effective and efficient solution for specific high-difficulty reasoning domains than the "scaling is all you need" LLM paradigm.

2. Ablation Study Analysis: Contribution of Internal Components

Table 1 isolates the contribution of each TRM design choice by changing one component at a time:

  • w/ 1-step gradient vs. TRM: The TRM's "Full Recursive Backpropagation" is replaced by HRM's "1-step gradient approximation." The accuracy plummets from 87.4% to 56.5%. Proof: This eloquently proves that passing the gradient through the entire recursive process is the most critical factor for TRM's performance leap. HRM's approximation loses too much learning information.

  • w/ separate fH, fL vs. TRM: TRM's single network is reverted to HRM's dual-network structure. Accuracy drops from 87.4% to 82.4%. Proof: This shows that the dual network is not only parameter-heavier but less effective. The single network likely learns more generalized reasoning capabilities through weight sharing, reinforcing "Less is More."

  • w/ 4-layers, n=3 vs. TRM: TRM's 2-layer network is deepened to 4 layers while reducing the recursion count n to keep the total computation similar. Accuracy drops from 87.4% to 79.5%. Proof: Increasing network depth tends to cause overfitting on small datasets. Increasing "computational depth" via recursion is a more effective regularization method, forcing the small network to be reused and learn more general functions.

  • w/ self-attention vs. TRM-MLP (on Sudoku): The channel-mixing MLP in TRM-MLP is replaced with standard self-attention. Accuracy drops from 87.4% to 74.7%. Proof: For fixed-grid problems like Sudoku, the global receptive field of self-attention may be overly flexible, proving less effective than a simpler MLP. This highlights that architecture choice must match task characteristics.

3. Insightful Experiment Analysis: Intrinsic Properties of the Method

  • Theoretical Hypothesis Validation (Table 2 - Different numbers of latent features): Experiments tested three variants: 1) Single feature (y only); 2) Multi-feature (splitting z into multiple features); 3) Standard TRM (y + z). Conclusion: Both Single (71.9%) and Multi-feature (77.6%) significantly underperformed the standard TRM (87.4%). This strongly proves that the explicit separation of function between "answer" (y) and "idea" (z) in the latent space is crucial.

  • Visualization Analysis (Figure 6 - Latent State Visualization): Authors decoded and visualized the latent states y and z generated by a pre-trained model on Sudoku. Conclusion: The visualization clearly showed that the decoded y looked like a partially completed Sudoku solution, while the decoded z was a set of abstract, unintelligible numerical patterns. This provides intuitive and powerful evidence for the reinterpretation of HRM's latent variables, allowing readers to "see" the model's internal workings.

  • Performance vs. Depth Trade-off Analysis (Table 3): They defined effective depth in terms of the recursion counts n and T and compared TRM and HRM performance at similar effective depths. Conclusion: TRM consistently outperformed HRM at every comparable depth level (e.g., at effective depth ~48, TRM was 87.4%, HRM was 61.6%). This cleverly ruled out the possibility that "TRM is better because it computes more," proving that TRM's architecture itself is fundamentally more efficient and better utilizes each computation step.


Original paper title: Less is More: Recursive Reasoning with Tiny Networks


