Hierarchical Reasoning Model

Paper: https://arxiv.org/abs/2506.21734

Code: https://github.com/sapientinc/HRM


Abstract

Reasoning, the process of designing and executing sequences of complex goal-directed actions, remains a critical challenge in artificial intelligence. Current large language models (LLMs) primarily employ Chain-of-Thought (CoT) techniques, which suffer from fragile task decomposition, high data requirements, and high latency. Inspired by the hierarchical and multi-timescale processing in the human brain, we propose the Hierarchical Reasoning Model (HRM), a novel recurrent architecture capable of achieving significant computational depth while maintaining training stability and efficiency. HRM executes sequential reasoning tasks in a single forward pass through two interdependent recurrent modules without explicit supervision of intermediate processes: a high-level module responsible for slow, abstract planning, and a low-level module handling fast, detailed computations. With only 27 million parameters, HRM achieves outstanding performance on complex reasoning tasks using just 1000 training samples. The model requires no pre-training or CoT data, yet it achieves near-perfect performance on challenging tasks including complex Sudoku puzzles and optimal pathfinding in large mazes. Furthermore, HRM surpasses larger models with significantly longer context windows on the Abstraction and Reasoning Corpus (ARC), a key benchmark for measuring artificial general intelligence capabilities. These results highlight HRM's potential as a transformative advancement towards general computation and general reasoning systems.


1 Introduction

Deep learning, as its name suggests, originated from the idea of stacking more layers to achieve stronger representational capabilities and superior performance. However, despite the remarkable success of large language models, their core architecture is surprisingly shallow. This imposes fundamental limitations on their most anticipated capability: reasoning. The fixed depth of standard Transformers places them in computational complexity classes such as AC0 or TC0, making them unable to solve problems requiring polynomial time. Large language models are not Turing-complete and thus, at least in a purely end-to-end manner, cannot execute the complex algorithmic reasoning necessary for deliberative planning or symbolic manipulation. For example, our results on Sudoku show that increasing the depth of Transformer models improves performance, but even very deep models remain far from optimal (see Figure 2), supporting the view that the current scaling paradigm of large language models is fundamentally limited for such tasks.

[Figure 2: Sudoku performance as a function of Transformer depth.]

The literature in the field of large language models primarily relies on Chain-of-Thought (CoT) prompting techniques for reasoning. CoT externalizes the reasoning process into token-level language expressions by decomposing complex tasks into simpler intermediate steps, using shallow models to generate text sequentially. However, CoT for reasoning is merely a temporary measure, not a satisfactory solution. It relies on fragile, human-defined task decomposition, where an error or misordering in any single step can lead to the failure of the entire reasoning process. This reliance on explicit linguistic steps shackles reasoning to token-level patterns. Consequently, CoT reasoning often requires large amounts of training data and generates numerous tokens in complex reasoning tasks, leading to slow response times. We need a more efficient method to minimize these data requirements.

To this end, we explore “latent reasoning,” where the model computes within its internal hidden state space. This aligns with a cognitive view that language is a tool for human communication, not the vehicle of thought itself; the brain maintains long, coherent reasoning chains in latent space with astonishing efficiency, without constantly translating them back into language. However, the capacity for latent reasoning is still fundamentally limited by the effective computational depth of the model. Simply stacking layers is extremely difficult due to the vanishing gradient problem, which severely impacts training stability and effectiveness. Recurrent architectures, as natural alternatives for sequential tasks, often suffer from premature convergence, leading to the failure of subsequent computational steps, and rely on biologically unrealistic, computationally expensive, and memory-intensive “Backpropagation Through Time” (BPTT) for training.

The human brain provides an incredibly inspiring blueprint for achieving the effective computational depth lacking in current artificial models. The brain organizes computation hierarchically across cortical regions operating at different timescales, enabling deep, multi-stage reasoning. Recurrent feedback loops continuously optimize internal representations, allowing slow, high-level regions to guide fast, low-level circuits to execute tasks, achieving hierarchical processing while maintaining global consistency. Notably, the brain achieves this depth while avoiding the high credit assignment costs associated with traditional recurrent networks due to backpropagation through time.

Inspired by this hierarchical and multi-timescale biological structure, we propose the Hierarchical Reasoning Model (HRM). HRM aims to significantly enhance effective computational depth. The model comprises two coupled recurrent modules: a high-level (H) module for abstract, deliberative reasoning, and a low-level (L) module for fast, detailed computations. This structure avoids the rapid convergence problem of standard recurrent models through a process we call “hierarchical convergence.” The low-level module updates rapidly, performing multiple computational steps and reaching a local equilibrium before the high-level module advances a step. At this point, the low-level module is reset, entering a new phase of computation towards another local equilibrium state.

Furthermore, we propose a single-step gradient approximation method to train HRM, which improves training efficiency and eliminates the need for BPTT. This design maintains constant memory usage throughout backpropagation (O(1), compared to O(T) for BPTT, where T is the number of timesteps), making it scalable and more biologically plausible.

With enhanced effective depth, HRM excels in tasks requiring extensive search and backtracking. Using only 1000 input-output samples, without pre-training or Chain-of-Thought supervision, HRM learned to solve problems infeasible for state-of-the-art large language models. For example, in complex Sudoku puzzles (Sudoku-Extreme Full) and optimal path search tasks in 30×30 mazes, HRM achieved near-perfect accuracy, while state-of-the-art CoT methods failed entirely (0% accuracy). In the Abstraction and Reasoning Corpus (ARC) AGI challenge—a benchmark for inductive reasoning—HRM, trained from scratch on only the official dataset (approx. 1000 samples) with 27 million parameters and a 30×30-grid context (900 tokens), achieved 40.3%, significantly outperforming leading CoT-based models such as o3-mini-high (34.5%) and Claude 3.7 (8K context, 21.2%), despite those models' far larger parameter counts and context lengths (see Figure 1). This achievement points to a promising direction for developing next-generation AI reasoning systems with general computational capabilities.

[Figure 1: Performance of HRM versus baseline models on ARC-AGI, Sudoku-Extreme, and Maze-Hard.]

2 Hierarchical Reasoning Model

We propose the Hierarchical Reasoning Model (HRM), designed based on three fundamental principles observed in neural computation in the brain:

Hierarchical processing: The brain processes information in a hierarchical structure of cortical regions. High-level regions integrate information over longer timescales and form abstract representations, while low-level regions are responsible for more immediate, detailed sensory and motor processing.

Temporal separation: These hierarchies in the brain operate at different intrinsic timescales, reflected in neural rhythms (e.g., slow theta waves at 4–8 Hz, fast gamma waves at 30–100 Hz). This temporal separation allows high-level areas to stably guide fast computations in low-level circuits.

Recurrent connections: The brain features extensive recurrent connections. These feedback loops support iterative optimization of internal representations, leading to more accurate and context-sensitive results, at the cost of additional processing time. Furthermore, the brain largely avoids the tricky deep credit assignment problem associated with Backpropagation Through Time (BPTT).

[Architecture figure and module update equations omitted. In summary: given an input embedding, HRM runs N high-level cycles of T low-level timesteps each. At every timestep, the L-module updates its state zL conditioned on the current high-level state zH and the input; at the end of each cycle, the H-module updates its state zH from the L-module's final state.]

HRM is explicitly designed to combat the premature convergence typical of standard recurrent networks through a process we call “hierarchical convergence.” In each cycle, the low-level module (L-module, a recurrent network) stably converges to a local equilibrium state; this equilibrium, however, depends on the high-level state zH supplied during that cycle. After completing T timesteps, the high-level module (H-module) integrates the result of this sub-computation (the final state zL of the low-level module) and performs its own state update. The update of zH establishes an entirely new context for the low-level module, essentially “restarting” its computational path and initiating a new phase of convergence towards another local equilibrium state.

This process enables HRM to perform a series of distinct yet stable nested computations: the high-level module guides the overall problem-solving strategy, while the low-level module performs the dense search or fine-grained optimization required for each step. Although a standard RNN might approach convergence within T iterations, the effective computational depth of the hierarchical convergence mechanism reaches N×T steps. As shown by the experimental results in Figure 3, this mechanism allows HRM to maintain high computational activity (forward residual) over multiple timesteps, while the activity of a standard RNN rapidly decays; simultaneously, it still achieves stable convergence. This enables HRM to exhibit superior performance at any computational depth, as shown in Figure 2.
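To make this nested computation concrete, the following is a minimal PyTorch-style sketch of the forward recurrence, assuming f_L and f_H are the two recurrent update modules (implemented in the paper as Transformer blocks; see the architecture details below). The call signatures and the default values of N and T are illustrative, not the authors' exact code.

```python
import torch
import torch.nn as nn

def hrm_forward(f_L: nn.Module, f_H: nn.Module, x_emb: torch.Tensor,
                z_L: torch.Tensor, z_H: torch.Tensor,
                N: int = 2, T: int = 4):
    """Nested recurrence: T fast low-level steps per high-level step,
    over N high-level cycles, for an effective depth of N * T."""
    for _ in range(N):
        for _ in range(T):
            # L-module: fast update conditioned on the current high-level
            # state and the input embedding.
            z_L = f_L(z_L, z_H, x_emb)
        # H-module: slow update from the L-module's final (locally
        # converged) state; the new context "restarts" the L-module.
        z_H = f_H(z_H, z_L)
    return z_L, z_H
```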

[Figure 3: Forward residual over timesteps, showing HRM maintaining high computational activity while a standard RNN's activity rapidly decays.]

Gradient Approximation: Recurrent models typically use Backpropagation Through Time (BPTT) to compute gradients. However, BPTT requires storing all hidden states during the forward pass and combining them with gradients during backpropagation, which demands O(T) memory (where T is the number of timesteps). This substantial memory overhead forces the use of smaller batch sizes, leading to inefficient GPU utilization, especially for large-scale networks. Furthermore, since retaining the complete historical trajectory over time is biologically implausible, the brain likely does not implement BPTT.

Fortunately, if a recurrent neural network converges to a fixed point, we can avoid unfolding its state sequence by performing a single-step backpropagation at that equilibrium point. Moreover, this mechanism could potentially be implemented in the brain solely through local learning rules. Based on this finding, we propose a single-step approximation method for HRM gradients—using only the gradients of the last state of each module and treating other states as constants. Thus, the gradient propagation path is:

output head → final high-level state zH → final low-level state zL → input embedding

The above method requires only O(1) memory, avoids temporal unfolding, and can be easily implemented through automatic differentiation frameworks like PyTorch, as shown in Figure 4. Since each module only needs to backpropagate errors through its most recent local synaptic activity, this method aligns highly with the view that cortical credit assignment relies on short-range, temporally localized mechanisms (rather than global activity pattern replay).
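Concretely, this can be expressed in PyTorch roughly as follows (a sketch consistent with the description above, not the repository's exact code): every update except the final L-step and final H-step runs under torch.no_grad(), so autograd stores activations only for the last update of each module.

```python
import torch

def hrm_forward_one_step_grad(f_L, f_H, x_emb, z_L, z_H, N=2, T=4):
    """One-step gradient approximation: O(1) memory instead of BPTT's O(T)."""
    with torch.no_grad():
        for i in range(N):
            for j in range(T):
                # Skip the very last L-step; it is recomputed below
                # with gradient tracking enabled.
                if not (i == N - 1 and j == T - 1):
                    z_L = f_L(z_L, z_H, x_emb)
            if i < N - 1:  # the last H-step is likewise deferred
                z_H = f_H(z_H, z_L)
    # Final L- and H-updates with gradients: backpropagation touches only
    # each module's most recent step, treating all earlier states as constants.
    z_L = f_L(z_L, z_H, x_emb)
    z_H = f_H(z_H, z_L)
    return z_L, z_H
```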

[Figure 4 (PyTorch pseudocode for the one-step gradient) and the subsequent figures and equations are omitted. This part of the paper describes deep supervision, in which training proceeds through successive supervision segments that each run the recurrent process and receive their own loss, and the Adaptive Computation Time (ACT) mechanism, in which a Q-head predicts the values of the “halt” and “continue” actions to decide, up to a maximum of Mmax segments, when to stop computing.]

We can now define the loss function for the learning process. The overall loss for each supervision segment combines the Q-head loss and the sequence-to-sequence loss:

$$\mathcal{L} = \mathcal{L}_{\text{seq}}(\hat{y},\, y) \;+\; \mathcal{L}_{Q}\big(\hat{Q},\, \hat{G}\big)$$

where $\hat{y}$ is the predicted output sequence, $y$ the target sequence, $\hat{Q}$ the Q-head's predicted values for the “halt” and “continue” actions, and $\hat{G}$ the corresponding targets.

Minimizing the above loss function enables accurate predictions and near-optimal halting decisions. The choice of a “halt” action ends the supervision loop. In practice, sequences are processed in batches, which can be easily handled by replacing any halted samples in a batch with new samples from the data loader.
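As a sketch, the per-segment loss could be computed as below. The tensor names, and the use of binary cross-entropy for the Q-head term, are assumptions based on the description above rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def segment_loss(seq_logits, targets, q_logits, q_targets):
    """Per-segment loss = sequence-to-sequence loss + Q-head loss.

    seq_logits: (batch, seq_len, vocab) token predictions
    targets:    (batch, seq_len) ground-truth tokens
    q_logits:   (batch, 2) Q-head outputs for the "halt"/"continue" actions
    q_targets:  (batch, 2) targets for those two action values
    """
    seq_loss = F.cross_entropy(seq_logits.flatten(0, 1), targets.flatten())
    q_loss = F.binary_cross_entropy_with_logits(q_logits, q_targets)
    return seq_loss + q_loss
```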

Figure 5 illustrates the performance comparison between two HRM variants: one utilizing the ACT mechanism, and another using a fixed number of computation steps comparable to ACT’s Mmax parameter. The results indicate that ACT can adaptively adjust its computational resources based on task complexity, achieving significant computational savings with minimal impact on performance.

[Figure 5: ACT versus a fixed-computation variant, and inference-time scaling with the computation limit Mmax.]

Scalability at Inference Time

An effective neural network model should be able to leverage additional computational resources during inference to improve performance. As shown in Figure 5-(c), HRM can seamlessly achieve inference-time scaling by simply increasing the computation limit parameter Mmax, without requiring further training or modification of the network structure.

Additional computational resources are particularly effective for tasks requiring deep reasoning. For Sudoku problems—which typically require long-term planning—HRM exhibits strong reasoning and scaling capabilities. On the other hand, we found that for ARC-AGI challenge tasks, the performance improvement from additional computational resources was very limited, as solutions to these problems typically require only a few transformations.

Stability of Q-learning in ACT

The deep Q-learning underpinning our ACT mechanism is known to be prone to instability, often requiring stabilization techniques such as replay buffers and target networks, both absent from our design. However, our method achieves stability through intrinsic properties of the model and its training process. Recent theoretical work by Gallici et al. has shown that Q-learning can converge if network parameters are bounded, weight decay is used during training, and post-normalization layers are present. Our model satisfies these conditions through its Post-Norm architecture, which employs RMSNorm (a variant of layer normalization), and through the AdamW optimizer: AdamW has been shown to solve an L∞-constrained optimization problem, keeping model parameters bounded within 1/λ, where λ is the weight-decay coefficient.


Both the low-level and high-level recurrent modules fL and fH are implemented using encoder-only Transformer blocks with identical architecture and dimensions. These modules accept multiple inputs, which we merge via simple element-wise addition, although more complex merging techniques (e.g., gating mechanisms) might improve performance and are left for future research. In this work, across all Transformer blocks including those in baseline models, we have incorporated enhancements found in modern large language models based on the Llama architecture. These improvements include Rotary Position Embeddings, Gated Linear Units, RMSNorm, and the removal of bias terms from linear layers.

Additionally, both HRM and recurrent Transformer models implement a Post-Norm architecture, with weights initialized via truncated LeCun normal initialization, while excluding scaling and bias parameters in RMSNorm. All parameters are optimized using the Adam-atan2 optimizer, a scale-invariant variant of Adam, combined with a constant learning rate that includes linear warm-up.
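The sketch below illustrates these architectural choices: Post-Norm placement (normalization after each residual addition), parameter-free RMSNorm, bias-free linear layers, and a gated feed-forward unit. Rotary position embeddings are omitted for brevity, and nn.MultiheadAttention merely stands in for the actual attention implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # RMSNorm without learnable scale or bias, matching the setup above.
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

class PostNormBlock(nn.Module):
    """Illustrative encoder-only block with Post-Norm residual structure."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, bias=False,
                                          batch_first=True)
        # Gated Linear Unit feed-forward with bias-free projections.
        self.w_gate = nn.Linear(dim, 4 * dim, bias=False)
        self.w_up = nn.Linear(dim, 4 * dim, bias=False)
        self.w_down = nn.Linear(4 * dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = rms_norm(x + attn_out)       # Post-Norm: normalize after residual
        ff = self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
        return rms_norm(x + ff)          # Post-Norm after the FFN residual
```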

3 Results

This section first introduces the three benchmark tasks: ARC-AGI, Sudoku, and Maze, followed by an overview of baseline models and their results. Figures 6-(a,b,c) visually present these three benchmark tasks, which were carefully selected to evaluate AI models' reasoning capabilities in different aspects.

[Figure 6: Examples of the ARC-AGI, Sudoku, and Maze benchmark tasks (panels a–c) and subset difficulty statistics (panel d).]

3.1 Benchmark Tasks

ARC-AGI Challenge Task

The ARC-AGI benchmark evaluates general fluid intelligence through IQ-test-like puzzles that require inductive reasoning. The original version, ARC-AGI-1, presents challenges as input-output grid pairs, forcing AI systems to extract and generalize abstract rules from a few examples. Each task provides several input-output example pairs (typically 2–3 pairs) and one test input. AI models have two chances to generate the correct output grid. While some argue that mastering ARC-AGI signifies achieving true artificial general intelligence, its primary purpose is actually to reveal key bottlenecks in the current development of artificial general intelligence. Indeed, traditional deep learning methods and Chain-of-Thought (CoT) techniques face significant challenges on ARC-AGI-1, mainly because the task requires models to generalize to entirely new tasks.

Addressing the limitations found in ARC-AGI-1, ARC-AGI-2 significantly expands the benchmark, offering a more comprehensive and carefully optimized set of tasks. These new tasks place greater emphasis on deep compositional reasoning, multi-step logic, context-dependent rule application, and symbolic abstraction capabilities. Human calibration studies show that these tasks are challenging but solvable for humans, while being much harder for current AI systems, thus providing a clearer standard for measuring general reasoning ability.

Sudoku-Extreme

Sudoku is a 9×9 logic puzzle that requires each row, each column, and each 3×3 block to contain digits 1 through 9 exactly once. A model's prediction is considered correct if its output exactly matches the unique solution of the puzzle. Due to its complex logical structure, Sudoku is often used as a popular benchmark task for evaluating the logical reasoning capabilities of machine learning models.

The most commonly used Sudoku dataset in current research is the Kaggle dataset, where all puzzles can be fully solved using basic single-digit techniques. Another widely used dataset is the 17-clue puzzle collection, which superficially appears more challenging due to its minimal clue count. This perception is misleading, however: because 17 is the minimum number of clues that guarantees a unique Sudoku solution, these clues must be highly orthogonal to one another, and this orthogonal arrangement paradoxically yields many direct, easily solvable reasoning paths.

We propose “Sudoku-Extreme,” a more challenging new dataset that integrates the simpler datasets mentioned above, as well as puzzles recognized by the Sudoku community as extremely difficult for human players:

Easy puzzles: from the Kaggle dataset, the 17-clue dataset, and unbiased samples drawn from the Sudoku puzzle distribution, totaling 1,149,158 puzzles.

Hard puzzles: from the Magictour 1465, Forum-Hard, and Forum-Extreme subsets, totaling 3,104,157 puzzles.

The integrated data underwent a strict 90/10 train-test split, ensuring that puzzles in the test set could not be derived from equivalent transformations of any sample in the training set. “Sudoku-Extreme” is a down-sampled subset of this data, containing 1000 training samples. We use Sudoku-Extreme in our main experiments (Figure 1), focusing on few-shot learning scenarios. To ensure convergence and control overfitting in our analytical experiments (Figures 2, 3, and 5), we use the full training data “Sudoku-Extreme-Full,” comprising 3,831,994 samples.

We measure puzzle difficulty by the number of search backtracks (i.e., “guesses”) required by tdoku, an intelligent Sudoku solver program. This program uses propositional logic to reduce the number of guesses. Our Sudoku-Extreme dataset requires an average of 22 backtracks per puzzle, significantly higher than existing datasets; for example, the recently hand-designed Sudoku-Bench dataset requires an average of only 0.45 backtracks per puzzle. The complexity levels of these subsets are shown in Figure 6-(d).

Maze-Hard

This task requires finding the optimal path in a 30×30 maze and is often used to train large language models on search tasks because of its high interpretability. We adopt the instance-generation method proposed by Lehnert et al., with an additional filtering criterion: only instances with difficulty greater than 110 are retained, where “difficulty” is defined as the length of the shortest path (a measure corresponding to the linear runtime of the wavefront breadth-first-search algorithm on GPUs). A path is considered correct only if it is valid and optimal (i.e., the shortest path from start to goal). The training and test sets each contain 1000 samples.
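A minimal sketch of this difficulty criterion, assuming mazes are encoded as binary grids (0 = free, 1 = wall); the encoding and helper names are illustrative:

```python
from collections import deque

def shortest_path_length(grid, start, goal):
    """Breadth-first search; returns the optimal path length in steps,
    which serves as the instance's difficulty, or None if unreachable."""
    n, m = len(grid), len(grid[0])
    dist = {start: 0}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            return dist[(r, c)]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < n and 0 <= nc < m and grid[nr][nc] == 0 \
                    and (nr, nc) not in dist:
                dist[(nr, nc)] = dist[(r, c)] + 1
                queue.append((nr, nc))
    return None

def is_hard(grid, start, goal, threshold=110):
    # Keep only instances whose shortest path exceeds the threshold.
    d = shortest_path_length(grid, start, goal)
    return d is not None and d > threshold
```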

3.2 Evaluation Details

For all benchmark tasks, HRM models are initialized with random weights and trained in a sequence-to-sequence framework using input-output sample pairs. The 2D input and output grids are flattened and padded to the maximum sequence length. Final performance results are shown in Figure 1. Notably, HRM achieved these performance levels using only about 1000 training samples per task, without requiring pre-training or Chain-of-Thought (CoT) labels.

For the ARC-AGI challenge tasks, we use all input-output example pairs from the training and evaluation sets. Data augmentation is performed by applying transformations such as translation, rotation, flipping, and color permutation to the puzzles. A learnable special token is added before each task example to indicate its puzzle type. During the testing phase, for each test input in the evaluation set, we follow these steps: (1) Generate and solve 1000 augmented variants, applying inverse augmentation transformations to the prediction results of each variant to restore the original form; (2) Select the two most frequent prediction results as the final output. All results are reported on the evaluation set.
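A sketch of this augment-and-vote procedure is shown below. Here augmentations is assumed to be a list of (transform, inverse) function pairs, model.solve is a hypothetical helper returning a predicted grid in hashable form (e.g., a tuple of tuples), and the top two restored predictions become ARC's two allowed attempts.

```python
from collections import Counter

def predict_arc(model, test_input, augmentations, n_variants=1000):
    """Solve augmented variants, undo each augmentation, and vote."""
    votes = Counter()
    for transform, inverse in augmentations[:n_variants]:
        pred = model.solve(transform(test_input))  # solve the augmented puzzle
        votes[inverse(pred)] += 1                  # restore the original frame
    # The two most frequent restored predictions are the final outputs.
    return [grid for grid, _ in votes.most_common(2)]
```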

For Sudoku puzzles, we perform data augmentation via band and digit permutations; no augmentation is used for the maze task. For both tasks, predictions are produced in a single inference pass.

In ARC-AGI tasks, CoT model scores are from the official leaderboard; for Sudoku and maze tasks, scores are obtained via corresponding API evaluations.

In Figure 1, baseline models are grouped based on whether they are pre-trained and whether CoT is used. The “Direct pred” baseline refers to “direct prediction without CoT and no pre-training,” with the same training setup as HRM, simply replacing the model with a Transformer architecture. Interestingly, on the ARC-AGI-1 task, the “Direct pred” baseline performs comparably to Liao and Gu—who constructed a carefully designed, domain-specific equivariant network for this task, trained from scratch without pre-training. By replacing the Transformer architecture with HRM’s hierarchical framework and introducing the ACT mechanism, our performance improved by more than twofold.

On the Sudoku-Extreme and Maze-Hard benchmarks, the performance gap between HRM and baseline methods is extremely large: baseline methods are almost unable to solve these tasks at all. Tasks requiring long reasoning chains are particularly difficult for CoT-based methods. Using only 1000 training samples, the “Direct pred” baseline, an 8-layer Transformer of the same scale as HRM, completely failed on these complex reasoning problems. When trained on the larger Sudoku-Extreme-Full dataset, however, it was able to solve some simple Sudoku puzzles, reaching 16.9% accuracy (see Figure 2). Research by Lehnert et al. showed that a conventional Transformer with 175 million parameters, trained on 1 million samples over multiple epochs, still performed very poorly on the 30×30 maze task, with accuracy below 20% even under the pass@64 metric.

3.3 Visualization of Intermediate Timesteps

Although HRM excels at complex reasoning tasks, it raises an interesting question: what underlying reasoning algorithms does the HRM neural network actually implement? Answering this question is crucial for enhancing model interpretability and gaining a deeper understanding of HRM's solution space.

[Figures omitted: visualizations of HRM's intermediate prediction steps on the maze, Sudoku, and ARC tasks.]

In the maze task, HRM appears to first explore several potential paths simultaneously, then eliminate blocked or inefficient routes, construct a preliminary outline of the solution, and refine it over multiple iterations. In the Sudoku task, its strategy resembles depth-first search: the model appears to explore candidate solutions and backtrack when it hits dead ends. For ARC tasks, HRM takes a different approach, incrementally adjusting the board and iterating until a solution is reached. Unlike Sudoku, which involves frequent backtracking, ARC solution paths follow a more coherent progression, similar to hill-climbing optimization.

Importantly, the model demonstrates its ability to adapt to different reasoning methods, likely selecting an effective strategy for each specific task. Further research is needed to gain a more comprehensive insight into these solution strategies.

4 Brain Correspondence

A key principle in systems neuroscience is that the functional versatility of brain regions—their ability to process diverse and complex tasks—is closely related to the dimensionality of their neural representations. High-level cortical regions responsible for complex reasoning and decision-making must cope with a variety of diverse tasks, thus requiring more flexible and context-dependent processing mechanisms. In dynamical systems, this flexibility is often achieved through higher-dimensional state-space trajectories, supporting richer potential computational patterns. This principle forms an observable hierarchy of dimensions, where the position of brain regions in the information processing hierarchy is positively correlated with their effective dimensionality. To quantify this phenomenon, we can examine the “Participation Ratio” (PR), a standard metric for measuring the effective dimension of high-dimensional representations.

$$\mathrm{PR} = \frac{\left(\sum_i \lambda_i\right)^2}{\sum_i \lambda_i^2}$$

where {λi} are the eigenvalues of the covariance matrix of neural activity trajectories. Intuitively, higher PR values indicate variance is more evenly distributed across more dimensions, corresponding to higher-dimensional representations; conversely, lower PR values indicate variance concentrated in a few principal components, reflecting a more compact, lower-dimensional structure.
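For concreteness, a minimal NumPy sketch of this computation, assuming states is a (samples × units) matrix of hidden states collected along neural activity trajectories:

```python
import numpy as np

def participation_ratio(states: np.ndarray) -> float:
    """PR = (sum_i λ_i)^2 / sum_i λ_i^2, with λ_i the eigenvalues of the
    covariance matrix of the collected states."""
    cov = np.cov(states, rowvar=False)      # covariance across units
    eigvals = np.linalg.eigvalsh(cov)       # symmetric matrix -> real spectrum
    eigvals = np.clip(eigvals, 0.0, None)   # guard against tiny negatives
    return float(eigvals.sum() ** 2 / np.sum(eigvals ** 2))
```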

This dimensional hierarchy can be observed, for instance, in the mouse cortex: PR values of population neural activity show a monotonic increase from low-level sensory regions to high-level association regions, supporting the correlation between representational dimensionality and functional complexity (see Figure 8(a,b)).

[Figure 8: Participation-ratio analysis, comparing the dimensionality hierarchy in mouse cortex with that of HRM's trained and untrained modules.]

We examine whether HRM reproduces this neuroscience principle by computing the PR values of its two recurrent modules after training on the Sudoku-Extreme-Full dataset. The PR is computed based on covariance matrices constructed from neural states collected across multiple neural activity trajectories solving Sudoku. The results show striking consistency with biological findings: the states of the low-level module (zL) occupy a relatively small subspace, with a participation ratio of 30.22; while the states of the high-level module (zH) operate in a much larger subspace, reaching a participation ratio of 89.95, as shown in Figure 8(c). Furthermore, Figure 8(d) shows that as the number of independent tasks (trajectories) increases from 10 to 100, the dimensionality of zH significantly rises, while the dimensionality of zL remains stable. These results indicate a spontaneous separation in representational capacity between the two modules, corresponding to their respective functional roles.

To verify that this hierarchical structure is an emergent property of training rather than an artifact of the network architecture itself, we conducted a control analysis: we initialized an identical HRM architecture with random weights and, without any training, had it process the same task-specific inputs as the trained model while measuring the PR values of its modules.

The results, shown in Figure 8(e,f), present a stark contrast: in the untrained network, the high-level and low-level modules do not exhibit hierarchical separation; their PR values are both low and almost indistinguishable from each other. This control analysis validates that the dimensional hierarchy is an emergent property of the model during the process of learning complex reasoning.

The PR ratio between the high-level and low-level modules in HRM (zH / zL ≈ 2.98) is very close to the ratio measured in the mouse cortex (≈ 2.25). In contrast, traditional deep networks often exhibit “neural collapse,” where the features of the last layer converge to a low-dimensional subspace. HRM breaks this collapse pattern, instead developing high-dimensional representations in its high-level module. This is significant because such high-dimensional representations are considered crucial for cognitive flexibility and are typical features of high-level brain areas like the prefrontal cortex (PFC), which is central to complex reasoning.

This structural similarity suggests that the model autonomously discovered some fundamental organizational principle. By partitioning representations into a high-capacity, high-dimensional subspace (zH) and a more specialized, low-dimensional subspace (zL), HRM spontaneously evolved a basic organizational scheme believed to be essential for robust and flexible reasoning in biological systems. This provides a potential mechanistic explanation for why this model succeeds in complex, long-range tasks that models lacking such differentiated internal structure struggle to cope with.

However, we emphasize that the current evidence is only correlational. While causal relationships could be tested through interventions (e.g., limiting the dimensionality of the high-level module), such operations can have complex interfering effects on the training process itself, making accurate interpretation difficult in deep learning. Therefore, the causal necessity of this emergent hierarchical structure remains an important topic for future research.

5 Related Work

Reasoning and Algorithmic Learning

Given the central role of reasoning problems in artificial intelligence and their close connection to algorithms, researchers have long explored neural network architectures capable of learning algorithms from training instances. This research direction includes Neural Turing Machines (NTM), Differentiable Neural Computers (DNC), and Neural GPUs—all of which construct iterative neural architectures that simulate computational hardware to execute algorithms, and learn algorithms through data training. Another important work in this field is Recurrent Relational Networks (RRN), which execute algorithms on graph-structured representations via graph neural networks.

In recent years, researchers have combined algorithmic learning methods with Transformer-based architectures. The Universal Transformer extends the capabilities of standard Transformer models by introducing recurrent mechanisms between layers and adaptive halting mechanisms. Geiping et al. showed that Transformers with recurrent structures can generalize to more recurrent steps during inference than during training. Shen et al. proposed incorporating continuous recurrent reasoning tokens into Transformers. Additionally, TransNAR combines recurrent graph neural networks with language models.

Building on the success of Chain-of-Thought (CoT)-based reasoning, a series of studies have proposed fine-tuning methods that use reasoning paths generated by search algorithms (such as A*) as targets for supervised fine-tuning (SFT).

We also mention adaptive halting mechanisms designed to allocate additional computational resources for more complex problems, such as Adaptive Computation Time (ACT) for recurrent neural networks, and subsequent work like PonderNet, which aims to improve the stability of this resource allocation process.

HRM further expands the boundaries of algorithmic learning through a brain-inspired computational architecture, achieving exceptional data efficiency and model expressiveness, successfully discovering complex and diverse algorithms with only 1000 training samples.

Brain-inspired Reasoning Architectures

Building models with brain-like reasoning capabilities has long been a pursuit in the field of neuromorphic computing. Spaun is a typical example, which uses spiking neural networks to construct different modules corresponding to brain regions such as the visual cortex and prefrontal cortex. This design enables the model to perform a range of cognitive tasks, from memory recall to simple reasoning puzzles. However, its reasoning relies on manually designed algorithms, which may limit its ability to learn new tasks.

Another important model is the Tolman-Eichenbaum Machine (TEM), inspired by the role of the hippocampal-entorhinal cortex system in spatial and relational memory tasks. TEM proposes that medial entorhinal cortex cells build the foundation for structured knowledge, while hippocampal cells associate this foundation with sensory information. This mechanism enables TEM to generalize and explains the emergence of various neuron types such as grid cells, boundary cells, and place cells.

Another class of methods is neural sampling models, which treat neural signaling processes as inference over probability distributions, operating similarly to Boltzmann machines. These models typically require manually setting rules for specific reasoning tasks.

Essentially, while previous models have made progress on simple reasoning problems, HRM is designed to tackle complex tasks that even advanced large language models struggle with, without requiring pre-training or task-specific manual design.

Hierarchical Memory

Hierarchical multi-timescale structures also play an important role in how the brain processes memory. Models such as Hierarchical Sequential Models and Clockwork RNNs use multiple recurrent modules operating at different timescales to more effectively capture long-range dependencies in sequences, thereby mitigating the forgetting problem in RNNs.

Similar mechanisms have also been applied to linear attention methods to memorize long contexts (see the discussion of linear attention in Section 6). Since HRM focuses on reasoning tasks, it adopts full attention for simplicity of design. Incorporating hierarchical memory mechanisms into HRM could be a promising direction for future research.

6 Discussion

Turing Completeness of HRM

Similar to earlier neural algorithmic reasoners (e.g., the Universal Transformer), HRM is computationally universal given sufficient memory and time. In other words, it belongs to the class of models that can simulate arbitrary Turing machines, thereby overcoming the computational-capacity limitations of standard Transformers discussed in the introduction. However, because earlier neural algorithmic reasoners were trained as recurrent networks, they were prone to premature convergence and relied on computationally and memory-intensive BPTT (Backpropagation Through Time); thus, despite an effective computational depth exceeding that of standard Transformers, they remained limited in practice. HRM addresses both challenges and adds adaptive computation, allowing it to be trained on long reasoning processes and to solve complex puzzles requiring depth-first search and backtracking, moving it closer to practical Turing completeness.

Chain-of-Thought based Reinforcement Learning

In addition to fine-tuning with human-annotated Chain-of-Thought (CoT), reinforcement learning (RL) is another widely adopted training method. However, recent research indicates that RL primarily serves to activate existing, CoT-like reasoning capabilities within models, rather than discovering entirely new reasoning mechanisms. Furthermore, RL combined with CoT is known for unstable training and data inefficiency, often requiring extensive exploration and carefully designed reward functions. In contrast, HRM relies on dense gradient-based supervision signals rather than sparse reward signals. Additionally, HRM naturally operates in a continuous space, which is more biologically plausible and allows for dynamic allocation of different computational resources based on the varying complexity of individual tokens in reasoning and planning, avoiding treating all tokens uniformly.

Linear Attention Mechanisms

Recurrent structures are not only studied for their potential in general computation but also explored as alternatives to the attention mechanism in Transformers, as standard attention suffers from quadratic growth in time and memory complexity. Recurrent alternatives achieve more efficient architectural designs by processing input tokens one by one sequentially and predicting the next token at each timestep, similar to earlier RNN-based language models.

Some linear attention variants (e.g., Log-linear Attention) adopt RNN-like state update mechanisms that can be interpreted as propagating aggregated statistics across multiple timescales, thereby preserving long-range contextual information without incurring the quadratic memory growth of standard self-attention. However, merely replacing the attention mechanism does not change the fact that Transformers remain fixed-depth models, still relying on Chain-of-Thought as a compensatory mechanism. Notably, linear attention can handle longer contexts via compressed key-value caches, making it more suitable for deployment on resource-constrained edge devices.

7 Conclusion

This study presents the Hierarchical Reasoning Model (HRM), a brain-inspired architecture that achieves significant computational depth through hierarchical structures and multi-timescale processing, without sacrificing training stability and efficiency. With only 27 million parameters and trained on 1000 samples, HRM can effectively solve challenging reasoning tasks such as ARC, Sudoku, and complex maze navigation—tasks that typically pose significant challenges for current large language models and Chain-of-Thought methods.

While the brain highly relies on hierarchical structures for most cognitive functions, these ideas have largely remained in academic research and have not been widely translated into practical applications. Current mainstream artificial intelligence methods still tend towards non-hierarchical models. Our research challenges this established paradigm, demonstrating that hierarchical reasoning models can serve as a viable alternative to current mainstream Chain-of-Thought reasoning methods, marking an important step towards a foundational framework with Turing-complete general computational capabilities.
