Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces
https://arxiv.org/pdf/2410.09918v2
In cognitive theory, human thinking is governed by two systems: System 1, which is fast and intuitive, and System 2, which is slower but more deliberate and logical. Recent research shows that integrating System 2 thought processes into Transformer models (including Large Language Models, LLMs) can significantly enhance their reasoning capabilities. However, models that only simulate System 2 thinking incur higher computational costs and slower response times. To address this challenge, we propose Dualformer, a single Transformer model that seamlessly integrates both fast and slow reasoning modes.
Dualformer is obtained by training on data that includes randomized reasoning traces, where different parts of the reasoning traces are randomly discarded during training. These discarding strategies are specifically designed based on the structure of the reasoning traces, similar to analyzing our thought processes and forming mental shortcuts through patterns. During the inference phase, our model can be configured to output only the solution (fast mode), or to output both the reasoning chain and the final answer (slow mode), or to automatically decide which mode to enable (auto mode).
In all cases, Dualformer outperforms corresponding baseline models in terms of performance and computational efficiency: (1) In slow mode, Dualformer achieves an optimal solution rate of 97.6% for unseen 30×30 maze navigation tasks, surpassing the 93.3% performance of the Searchformer baseline model trained on full reasoning trace data, while reducing inference steps by 45.5%; (2) In fast mode, Dualformer achieves an 80% optimal rate for these tasks, significantly outperforming the Solution-Only model (which only achieves 30% optimal rate) trained solely on answer data; (3) In auto mode, Dualformer achieves a 96.6% optimal rate, while reducing inference steps by 59.9% compared to Searchformer.
Our method also achieves performance improvements through Large Language Model fine-tuning on mathematical problems, demonstrating the broad applicability of this technique beyond task-specific models.
1 Introduction
In psychology, the dual process theory (Wason and Evans, 1974) explains that thinking can arise through two different processes. Typically, one process is implicit (automatic and unconscious), while the other is explicit (controlled and conscious).
The renowned book "Thinking, Fast and Slow" (Kahneman, 2017) deeply explores the concept of dual process theory. It describes two distinct modes of thinking: System 1 corresponds to implicit processes, being fast, intuitive, and emotional; System 2 corresponds to explicit processes, being slower, more inferential, and logical. Kahneman broadly applied this theory to various cognitive and behavioral activities to illustrate how these two systems influence our decision-making and reasoning processes. As described in the book, System 1 is prone to cognitive biases and systemic errors due to a lack of active control; in contrast, System 2 is better suited for tasks requiring careful analysis, deliberation, and planning.
Transformers (Vaswani et al., 2017), a sequence modeling architecture, have become a core technology for various foundational models, including Large Language Models (LLMs) (Dosovitskiy, 2020; Baevski et al., 2020; Radford et al., 2021; Touvron et al., 2021; Hsu et al., 2021; Touvron et al., 2023; Dubey et al., 2024). In many studies, Transformers are widely used to solve reasoning and planning problems, such as Zhou et al. (2022); Kojima et al. (2022); Pallagani et al. (2022); Valmeekam et al. (2023a); Chen et al. (2024); Gundawar et al. (2024); Wang and Zhou (2024). Interestingly, the dual process theory also applies to the field of Artificial Intelligence, where we can classify a Transformer's reasoning modes as "fast" and "slow." In fast mode, the Transformer directly outputs the final solution without any reasoning steps; in slow mode, the model generates intermediate steps of thought, such as search traces when finding the shortest path, outputting them along with the final plan. These two inference modes closely mirror the trade-offs of the two human thinking systems. As discussed in previous research (Wei et al., 2022; Valmeekam et al., 2023b; Lehnert et al., 2024; Gandhi et al., 2024; Saha et al., 2024), models running in fast mode have lower computational costs and faster response times, but are less accurate and optimal than models in slow mode.
This raises a question: Can we integrate both fast and slow modes into Transformer-based reasoning agents, allowing them to complement each other, just as humans possess two distinct thinking systems?
Multiple methods have been proposed. One popular paradigm is to first construct a pure System 2 model and then fine-tune it to be more efficient like System 1, for example, by "distilling" System 2's output into System 1's form (Yu et al., 2024; Deng et al., 2024), or by bootstrapping from symbolic systems (e.g., Searchformer (Lehnert et al., 2024), Stream of Search (Gandhi et al., 2024)) or knowledge distillation (Wang et al., 2023; Shridhar et al., 2023) to improve the inference efficiency of existing System 2. However, these methods often require additional fine-tuning, are computationally expensive, and make it difficult for the final system to flexibly switch between System 1 and System 2 at runtime. To address this, Saha et al. (2024) designed an explicit meta-controller to switch between two different systems.
In this study, we demonstrate a surprising finding: a simple data processing method is sufficient to achieve on-the-fly configuration of System 1 and System 2 when solving reasoning tasks. The resulting model, Dualformer, can be easily configured to operate in fast or slow mode during inference; if no mode is specified, the model can decide which mode to use on its own. Specifically, to simulate System 2's reasoning process, our Transformer is trained on data containing reasoning traces and final solutions. Utilizing the structural characteristics of reasoning steps, we designed specific trace discarding strategies, making the generated traces resemble the "shortcuts" taken by System 1 in its thought process. In extreme cases, we completely discard the entire reasoning trace, encouraging the Transformer to directly output the final solution, skipping all intermediate steps. During training, we randomly select these structured trace discarding strategies. See Section 4 for details.
We first applied this framework to train an encoder-decoder Transformer model to solve pathfinding problems, where reasoning traces are generated by the A* search algorithm. We considered two task domains: maze navigation and Sokoban games, consistent with the setup in Lehnert et al. (2024), and adopted the same tokenization scheme. Interestingly, we found that these types of problems are challenging for current state-of-the-art Large Language Models (such as o1-preview and o1-mini), whose output paths often "pass through walls" (see examples in Appendix G).
In each reasoning mode, Dualformer outperforms established baseline models, achieving stronger performance in both problem-solving and optimality rates. Additionally, Dualformer significantly enhances the diversity of generated plans, capable of finding more unique paths to the goal. Notably, even in slow mode, Dualformer demonstrates high efficiency, generating significantly shorter reasoning traces than baseline models. Next, we applied this framework to fine-tune Large Language Models to answer mathematical problems. Following the method of Yu et al. (2023), training samples were derived from the MATH dataset (Hendrycks et al., 2021), with answers rewritten by the Llama-3.1-70B-Instruct model to include detailed intermediate reasoning steps. Similarly, the resulting LLMs also showed higher effectiveness and efficiency.
2 Related Work
Learning Planning and Reasoning
Significant efforts have been made by researchers to enhance the planning and reasoning capabilities of Transformer-based models in long-horizon tasks. Currently, two main categories of methods have been developed.
The first category of methods leverages existing Large Language Models (LLMs). For instance, researchers train LLMs to call external symbolic solvers, as seen in the work of Ahn et al. (2022), Besta et al. (2024), Sel et al. (2023), He-Yueya et al. (2023), Liu et al. (2023), and Silver et al. (2024). Pallagani et al. (2022, 2024) investigated fine-tuning LLMs to use symbolic solvers; Schick et al. (2023) and Hao et al. (2024) fine-tuned LLMs to use external tools; Hao et al. (2023) proposed fine-tuning LLMs to reason and plan within a world model. Among these, some studies attempt to integrate System 1 and System 2 thinking modes into the LLM's reasoning process. Weston and Sukhbaatar (2023) introduced "System 2 Attention (S2A)," a more deliberate attention mechanism aimed at improving LLM reasoning by reducing the model's reliance on spurious correlations in context. Yu et al. (2024) distilled System 2 outputs into a System 1 model, aiming to recompile high-quality results generated by System 2 methods back into the LLM's generation process without intermediate reasoning token sequences.
The second category of methods aims to train Transformer models from scratch to perform planning and reasoning independently (Lehnert et al., 2024; Saha et al., 2024; Gandhi et al., 2024), often using task-specific linguistic representations. Both Lehnert et al. (2024) and Gandhi et al. (2024) attempted to train LLMs for search by linguistically representing the search process: Lehnert et al. (2024) developed Searchformer, a model trained to mimic the A* search algorithm's process in path planning problems; Gandhi et al. (2024) applied their model to the "Countdown game." In concurrent work, Saha et al. (2024) trained two separate models managed by an external meta-controller: one model generated fast responses without reasoning traces, while the other responded slower but included full reasoning chains. Similarly, Lin et al. (2023) used multiple models to achieve slow and fast thinking through an agentic workflow. Our research is closely related to the aforementioned works but differs significantly: unlike Lehnert et al. (2024) and Gandhi et al. (2024), who do not modify reasoning traces in their training data, our method randomizes reasoning traces during training; unlike Saha et al. (2024) and Lin et al. (2023), we do not use any explicit controllers, nor do we train two separate networks for each mode, but instead integrate the functionalities of both fast and slow modes into a single model.
Generating Synthetic Data with Large Language Models
Large Language Models (LLMs) have been widely used for synthetic data generation in multiple domains. For example, Wei et al. (2021), Longpre et al. (2023), and Taori et al. (2023) constructed synthetic instruction datasets by sampling from diverse templates containing natural language instructions. This approach has also been applied in the visual domain (Liu et al., 2024b,a; Zhu et al., 2023; Brooks et al., 2023; Peng et al., 2023). To enhance LLMs' ability to solve mathematical problems, Yu et al. (2023) developed a method to rewrite, validate, and augment the original MATH dataset (Hendrycks et al., 2021) using specially designed prompts. Similar methods have been further explored in other studies, including research by Yuan et al. (2023), Luo et al. (2023), Lee et al. (2023), Yue et al. (2023), and Tong et al. (2025).
3 Preliminaries
Our work builds upon the research of Lehnert et al. (2024). To enable planning, we train a Transformer model on token sequences that sequentially represent the planning task, the A* algorithm's computation process, and the optimal solution found by the A* search. The tokenization method is shown in Figure 3.1.
As a simple example, consider a 3×3 maze navigation task, where the goal is to find the shortest path from a start cell to a goal cell without passing through wall cells; the A* algorithm determines an optimal path. A token sequence expresses both the task and the maze structure, and these tokens also serve as the input prompt for Dualformer. The solution is presented as a "plan token sequence" listing the coordinates along the path.
The A* algorithm also produces a search trace sequence that records the dynamics of the search process, as shown in Figure 4.1. Recall that A* is a pathfinding algorithm on weighted graphs. A "create" clause adds the node given by the subsequent coordinates to the search frontier, while a "close" clause adds the node to the closed set. Each clause, whether create or close, is followed by four tokens, x, y, c0, and c1, encoding the node's coordinates, the cost accumulated since the start (cost-since-start), and the heuristic estimate of the remaining cost, respectively. For more details on the A* algorithm and tokenization methods, we refer readers to Russell and Norvig (2021) and Lehnert et al. (2024).
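To make the trace format concrete, below is a minimal sketch of how one clause could be serialized, under assumed conventions; the function name and the cost-token spelling are illustrative, not the paper's exact vocabulary.

```python
def clause_tokens(kind: str, x: int, y: int,
                  cost_since_start: int, heuristic: int) -> list[str]:
    """Serialize one A* event ("create" or "close") into a token clause.

    Illustrative only: cost values are rendered as tokens like "c3",
    mirroring the c0/c1 cost tokens described in the text.
    """
    assert kind in ("create", "close")
    return [kind, str(x), str(y), f"c{cost_since_start}", f"c{heuristic}"]

# Example: clause_tokens("create", 2, 1, 3, 4) -> ["create", "2", "1", "c3", "c4"]
```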
4 Structured Trace Discarding and Randomized Training
Searchformer, proposed by Lehnert et al. (2024), has proven effective in solving various complex decision-making tasks. However, it still has two significant limitations.
First, the model operates only in slow mode, outputting lengthy reasoning chains that significantly increase inference time. While this issue can be mitigated through bootstrapping (Lehnert et al., 2024)—an iterative optimization technique involving multiple rollout and fine-tuning cycles—that process incurs substantial additional computation.
Second, Searchformer struggles to generate diverse solutions, often repeatedly sampling the same reasoning paths. For example, in our tests on 1000 30×30 maze problems, its reasoning chains averaged over 1500 tokens, yet it found only 7.6 distinct feasible paths on average across 64 responses (see Section 5.1.2).
To address these challenges, we propose a training framework that leverages randomized reasoning traces. Our method is inspired by two lines of research: first, we noted that although Searchformer is trained on full A* search traces, its generated reasoning chains are shorter and only roughly outline the search process; second, research indicates that humans often rely on "shortcuts" and patterns in decision-making, a concept known as System 1 thinking (Kahneman, 2017). These observations, combined with the success of dropout techniques (Hinton, 2012; Srivastava et al., 2014)—which involve randomly dropping units in neural networks during training—prompted us to explore the possibility of using randomized reasoning traces within our framework. We aim to selectively discard parts of the trace content for each training sample, utilizing the structural features of A* search traces. The specific methods are described below.
As shown in Figure 4.1, the A* search trace includes two clauses: "create" and "close," each comprising the node's coordinates and its (estimated) cost from the start and heuristic cost to the goal. To obtain Dualformer, we leverage the structure of the search trace to discard parts of its information in each training sample. There are three natural types of discarding:
- D1: Discard a close clause
- D2: Discard cost tokens within a clause
- D3: Discard a create clause
Based on these fundamental types, we designed four levels of discarding strategies, each building on the previous one (a code sketch follows the list):
- Level 1 Strategy: Remove all close clauses from the search trace.
- Level 2 Strategy: Building on Level 1, further remove all cost tokens within all clauses.
- Level 3 Strategy: Building on Level 2, additionally discard 30% of the create clauses at random.
- Level 4 Strategy: The most aggressive strategy, discarding the entire reasoning trace.
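The following is a minimal sketch of these four levels, assuming the trace has been parsed into a list of clauses, each a dict with a "kind" field ("create" or "close") and a "tokens" field [x, y, c0, c1]; the parsing itself is omitted.

```python
import random

def drop_trace(trace: list[dict], level: int, p_create_drop: float = 0.3) -> list[dict]:
    """Apply one of the four discarding levels (Section 4) to a parsed A* trace."""
    if level == 0:
        return trace                        # keep the complete trace
    if level == 4:
        return []                           # Level 4: discard the whole trace
    clauses = [c for c in trace if c["kind"] == "create"]       # Level >= 1: drop close clauses
    if level == 1:
        return clauses
    clauses = [{"kind": c["kind"], "tokens": c["tokens"][:2]}   # Level >= 2: drop cost tokens
               for c in clauses]
    if level == 2:
        return clauses
    # Level 3: additionally drop 30% of the create clauses at random
    return [c for c in clauses if random.random() > p_create_drop]
```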
By randomly applying these structured discarding strategies during training, the model is encouraged to learn to directly derive correct answers from incomplete or highly simplified reasoning processes, thereby simulating System 1's fast intuitive thinking while retaining System 2's full reasoning capabilities.
Figure 4.1 illustrates our discarding strategies using the previously mentioned maze task as an example. Intuitively, Level 1 discarding instructs Dualformer to effectively skip the "close-set" computations in A* search; Level 2 discarding prompts Dualformer to skip both the "close-set" and cost computations simultaneously; Levels 3 and 4 discarding encourage Dualformer to omit some or all search steps. We will show in Section 5 that these strategies effectively guide Dualformer to learn more concise and efficient search and reasoning processes.
To enhance the diversity of training data, we did not treat discarding as a data preprocessing step. Instead, during training, for each training sample in each batch, we randomly sample a discarding strategy from a categorical distribution Cat(p₀, p₁, p₂, p₃, p₄), where p₁ to p₄ represent the probabilities of applying Level 1 to Level 4 discarding strategies, respectively, and p₀ represents the probability of retaining the complete reasoning trace. This training framework allows Dualformer to learn from multiple simplified traces, even for the same training sample, as the same sample may reappear in different batches with different discarding methods.
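A hypothetical sketch of this per-sample randomization is shown below, reusing drop_trace from the sketch above; the probabilities are the maze configuration chosen in Section 5.1, and flatten() is an assumed helper that serializes clauses back into a flat token list.

```python
import random

P = [0.45, 1/6, 1/6, 1/6, 0.05]  # p0 (keep full trace), p1..p4 (Levels 1-4)

def randomize_sample(prompt_tokens: list, trace: list, plan_tokens: list) -> list:
    """Draw a discarding level from Cat(p0, ..., p4) each time a sample is batched."""
    level = random.choices(range(5), weights=P)[0]
    return prompt_tokens + flatten(drop_trace(trace, level)) + plan_tokens
```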
Comparison with Token Masking Techniques Interested readers might wonder whether our training framework resembles the token masking techniques used by masked language models such as BERT (Devlin, 2018; Liu, 2019; Song, 2019; Gauthier and Levy, 2019; Sinha et al., 2021; Kitouni et al., 2024). Our method differs significantly. First, standard masking applies uniform random masking to input tokens across the sequence, whereas our discarding strategies apply only to the search-trace portion. Second, masked language models typically use bidirectional attention and aim to predict the masked tokens; Dualformer uses causal attention, its training objective is purely next-token prediction, and its overall goal is to enhance reasoning and planning. Our training process also offers computational advantages: discarding tokens shortens the input sequence, saving compute. For example, training Dualformer on the 30×30 maze task with 8 Tesla V100 32GB GPUs takes 30 hours, versus 36 hours with full reasoning traces. More training details can be found in Section 5 and Appendix A.
4.1 Controllable Generation
An appealing feature of Dualformer is that, during inference, it can be easily controlled via prompts to generate results in "fast" or "slow" mode. The control mechanism is extremely simple: we add a beginning-of-sequence (bos) token and a control token (either "plan" or "create") to the standard prompt (which includes environment and task descriptions). If "plan" is used, Dualformer enters fast mode, directly outputting the final path plan, skipping all intermediate reasoning steps; conversely, if "create" is inserted after bos, Dualformer enters slow mode, generating a full reasoning trace along with the final solution. Specific examples can be found in Appendix B. If only the standard prompt is used (without control tokens), Dualformer will simulate the dual-process mechanism of human decision-making—autonomously deciding which type of response to generate based on the specific context, corresponding to System 1 and System 2 reasoning modes, respectively.
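A minimal sketch of this control mechanism is given below, with token names taken from the text and everything else assumed:

```python
def build_prompt(task_tokens: list[str], mode: str | None = None) -> list[str]:
    """Append bos and, optionally, a control token selecting the inference mode."""
    prompt = task_tokens + ["bos"]
    if mode == "fast":
        prompt.append("plan")    # skip the trace; emit the final plan directly
    elif mode == "slow":
        prompt.append("create")  # begin a full reasoning trace
    # mode=None: auto mode, the model itself decides which token to emit next
    return prompt
```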
5 Experiments
Our experiments aim to answer the following questions:
- Dualformer: Does it outperform corresponding baseline models in fast, slow, and auto modes? Can it generate more diverse planning solutions?
- In slow mode, can Dualformer achieve faster inference, i.e., output shorter reasoning traces?
- Can structured trace discarding techniques be generalized to Large Language Models (LLMs) trained on natural language datasets?
We address questions 1 and 2 in Section 5.1, where we train Transformer models to solve maze navigation tasks and related Sokoban games, using a method similar to Searchformer (Lehnert et al., 2024). To answer question 3, we fine-tuned Llama-3-8B and Mistral-7B models in Section 5.2 to solve mathematical problems.
5.1 Navigation Tasks: Mazes and Sokoban
We followed Lehnert et al. (2024)'s method, considering maze and Sokoban tasks, and used the same datasets. Both datasets contain 10⁵ training samples with complete A* search traces. The A* algorithm implementation is non-deterministic: when path costs are equal, the algorithm makes random choices; the expansion order of child nodes is also random. Maze sizes vary from 15×15 to 30×30. For all maze tasks, we randomly generated 30%–50% wall cells as obstacles and randomly sampled start and end positions. Sokoban maps are 7×7, containing two randomly placed docks, two boxes, and one worker position. We also randomly added two additional wall cells inside the map. For specific map generation procedures and example images, see Appendix A.1.
First, in Section 4.1, we demonstrated that Dualformer can be explicitly controlled to operate in fast or slow mode: in fast mode, it outputs only the final solution, while in slow mode, it generates a complete reasoning trace. In Sections 5.1.1 and 5.1.2, we compare Dualformer with corresponding baseline models in each mode. We systematically evaluate model performance using various metrics, including the correctness, optimality, and diversity of generated solutions, as well as reasoning trace length. Finally, in Section 5.1.5, we conduct ablation studies on design choices.
Hyperparameter Settings We instantiate Dualformer using the same encoder-decoder Transformer architecture as Lehnert et al. (2024). The encoder is based on the T5 architecture (Raffel et al., 2020) with rotary embeddings, and the decoder employs a GPT-style architecture. For the maze environment, a 15-million-parameter (15M) model is used, and for the Sokoban environment, a 46-million-parameter (46M) model. All models are trained on 100,000 training samples. The maze environment is trained for 4×10⁵ iterations, and the Sokoban environment for 8×10⁵ iterations. Detailed information on the model architecture and other hyperparameters can be found in Appendix A.
We trained Dualformer using the structured trace discarding strategies described in Section 4. For the discarding strategy applied to each training sample, we experimented with three sets of probability configurations: (1) {p₀ = 0.45, p₁ = p₂ = p₃ = 1/6, p₄ = 0.05}, (2) {p₀ = 0.8, p₁ = p₂ = p₃ = p₄ = 0.05}, (3) {p₀ = 0.7, p₁ = 0.05, p₂ = p₃ = 0.1, p₄ = 0.05}, and selected the set that yielded the lowest validation error. The final choices were: Group (1) for maze tasks, and Group (3) for Sokoban tasks.
Baseline Models For fast mode, our baseline is the "Solution-Only model," which has the same architecture as Dualformer but is trained only on sequence data containing optimal final solutions without any reasoning traces. For slow mode, the baseline is the "Complete-Trace model," i.e., a model trained on data containing complete A* search traces. This model is referred to as the "search augmented model" in Lehnert et al. (2024) and is also the original Searchformer model without search dynamics bootstrapping. We will also compare Dualformer with bootstrapped models in Section 5.1.2. All these models have the same number of parameters: 15 million for maze problems and 46 million for Sokoban problems.
Evaluation Metrics We use two metrics to evaluate whether the model generates correct and optimal solutions: 1-Solved-64 and 1-Optimal-64. Specifically, for each evaluation task (e.g., a maze or a Sokoban game), we randomly sample 64 responses from the trained model. Regardless of the generated reasoning trace, we only parse and evaluate the final solution part. If at least one of the 64 solutions is correct (i.e., feasible and reaches the target location), the task is marked as successful under the 1-Solved-64 metric; if at least one solution is optimal, it is marked as successful under the 1-Optimal-64 metric. We repeat this process for 1000 unseen evaluation tasks and report the average success rate.
To examine the robustness of each method, we also report the 3-Solved-64 and 3-Optimal-64 metrics: a task is marked as successful only if at least 3 solutions are correct or optimal.
Additionally, we use "Success Weighted by Cost" (SWC) (Wu et al., 2019) to measure the overall quality of generated solutions, aggregating over their costs: SWC = (1/64) Σᵢ₌₁⁶⁴ I(solution i is correct) · (c∗ / cᵢ), where I is the indicator function, c∗ is the cost of the optimal solution, and cᵢ is the cost of the i-th solution. A higher SWC score indicates that the generated solutions are closer to optimal; if all generated solutions are optimal, SWC attains its maximum value of 1.
Finally, to quantify the diversity of generated solutions, we count the number of unique correct solutions produced for each task among 64 responses and report the average value over 1000 evaluation tasks.
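A sketch of these four metrics under stated assumptions: each of the 64 responses is parsed into an object with illustrative fields is_correct (feasible and reaches the goal), cost, and path, and c_star is the optimal cost for the task.

```python
def task_metrics(solutions: list, c_star: float, k: int = 1):
    """Compute k-Solved-64, k-Optimal-64, SWC, and diversity for one task."""
    correct = [s for s in solutions if s.is_correct]
    solved = len(correct) >= k                                     # k-Solved-64
    optimal = sum(s.cost == c_star for s in correct) >= k          # k-Optimal-64
    swc = sum(c_star / s.cost for s in correct) / len(solutions)   # SWC in [0, 1]
    diversity = len({tuple(s.path) for s in correct})              # unique correct plans
    return solved, optimal, swc, diversity
```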
5.1.1 Fast Mode
Table 5.1 reports the performance of Dualformer versus the baseline "Solution-Only model" on maze and Sokoban tasks. In terms of generating correct and optimal solutions, Dualformer significantly outperforms the baseline model on both 1-Solved-64 and 1-Optimal-64 metrics. Simultaneously, it also clearly surpasses the baseline on the 3-Solved-64 and 3-Optimal-64 metrics, indicating Dualformer's stronger robustness in solution generation.
In particular, the performance gap widens as task difficulty increases. For the largest 30×30 maze task, Dualformer's 1-Optimal-64 rate is 2.8 times that of the "Solution-Only model," and its 3-Optimal-64 rate reaches 2.97 times. Dualformer's SWC score is also significantly higher than the baseline's, exceeding 0.9 in all environments, indicating that the solutions it generates are of high quality, with costs close to optimal.
Furthermore, for all problems considered, Dualformer consistently generates more diverse solutions. In Appendix C, we show an example maze and plot the unique correct paths generated by Dualformer and the baseline model. An interesting observation is that Dualformer's diversity score (i.e., the average number of unique correct solutions generated in 64 responses) increases with larger maze sizes. Intuitively, larger mazes offer more possible paths to the same target location.
This suggests that Dualformer has learned the structural features of mazes, whereas the "Solution-Only model" might merely be memorizing optimal paths, as its diversity score is close to 1 across all maze sizes.
5.1.2 Slow Mode
Table 5.2 reports the results of Dualformer operating in slow mode. The corresponding baseline is the "Complete-Trace model," which uses the same architecture and is trained on data containing full A* search traces. In addition to the metrics reported previously, we also report the average length of the reasoning traces over the 64 responses for each of the 1000 evaluation tasks.
The results show that Dualformer achieved improvements in both planning ability and inference speed. Across all correctness and optimality metrics—including problem-solving rate, optimality rate, and SWC score—Dualformer outperformed the Complete-Trace model. Furthermore, Dualformer generated significantly shorter reasoning traces than the baseline model. On average, Dualformer reduced reasoning trace length by 49.4% across five tasks. Consistent with previous findings, Dualformer also generated more diverse solutions than the baseline model. For specific examples, please refer to Appendix C.
Comparison with Search Dynamics Bootstrapping Methods
The Complete-Trace model is the foundational Searchformer model proposed in Lehnert et al. (2024). That research also introduced a "search dynamics bootstrapping" method to enhance the model's performance on Sokoban tasks, similar to the methods in Anthony et al. (2017) and Zelikman et al. (2022). After training the Searchformer model, it is fine-tuned on a newly constructed bootstrapping dataset. Specifically, for each Sokoban level in the original dataset, 32 responses are generated, and the shortest optimal response among them is added to the new training dataset. This process can be repeated multiple times, allowing Searchformer to progressively learn to generate shorter responses.
Table 5.4 compares Dualformer with Searchformer models fine-tuned with up to 3 bootstrapping steps. The results show that Dualformer matches or outperforms the bootstrapped models on most metrics, while using at least 45.1% fewer reasoning steps.
Notably, each bootstrapping step requires generating a total of 3.2×10⁶ response samples and an additional 10⁴ iterations of fine-tuning. This means that, including the initial 8×10⁵ iterations of pre-training, Searchformer at the 3rd bootstrapping step requires a total of 8.3×10⁵ iterations of training and 9.6×10⁶ rollouts, which is computationally very expensive. In contrast, Dualformer only requires a single training phase, totaling 8×10⁵ iterations, and does not need any additional rollout processes, significantly reducing computational overhead.
5.1.3 Auto Mode: Dual Process
Besides controlling Dualformer's inference mode by injecting control tokens after the "bos" token, we can also sample from it directly, allowing it to freely decide its operating mode, similar to the dual-process mechanism in human decision-making. We refer to this mode as Dualformer's auto mode. Table 5.3 reports the relevant results. In all tasks we considered, Dualformer in auto mode similarly outperformed both the Complete-Trace model and the Solution-Only model.
An interesting question worth exploring is: Can Dualformer automatically adjust its operating mode based on problem difficulty? To investigate this, we generated 64 responses in auto mode for each of 1000 unseen mazes (with wall density varying between 0.3 and 0.5) and then analyzed the proportion of feasible solutions that were slow-mode paths. The results are shown in Figure 5.1. As wall density increases (implying potentially more challenging mazes), the proportion of slow-mode paths also rises. This could reflect two factors: (1) fast-mode responses are less likely to produce feasible solutions on harder problems, so slow-mode paths make up a larger share of the feasible ones; and (2) Dualformer genuinely chooses slow mode more often. Similarly, we observed that as maze size grows and problems become harder, Dualformer adopts the slow-thinking mode more frequently.
5.1.4 Generalization Performance
Thus far, all our experiments have been tested on mazes with the same dimensions and wall densities as the training set. Next, we further investigate Dualformer's generalization ability in out-of-distribution (OOD) scenarios. In this section, we consider a Dualformer model trained on 20×20 mazes, with wall density uniformly randomly sampled between 0.3 and 0.5 during training. During the testing phase, we varied the wall density from 0.1 to 0.6. Table 5.5 presents the test results in slow mode, with 50 unseen maze instances tested for each case.
As wall density increases, the maze becomes harder and the input prompt also grows longer. For in-distribution test cases (wall densities of 0.3, 0.4, 0.5), Dualformer (slow mode) achieved nearly 100% optimality, but performance degraded when the wall density increased to 0.6 (an out-of-distribution case). Interestingly, at lower wall densities (0.1 and 0.2), Dualformer (slow mode) failed to solve any maze. We speculate that when the wall density is very low, the prompt is too short for the model to generalize.
In addition to varying wall density, we also tested Dualformer's performance in slow mode on rectangular mazes (e.g., 20 in height, 10 in width). The model was still trained on 20×20 square mazes. Table 5.6 shows the test results. Surprisingly, Dualformer demonstrated good generalization capabilities in these scenarios.
5.1.5 Ablation Study
As discussed in Section 4, the randomized traces used to train Dualformer originate from a combination of various trace discarding strategies, and there are multiple possible ways to combine these strategies. In this section, we conduct an ablation analysis of our design choices.
First, to enable both fast and slow reasoning modes, a simple alternative method is to mix "solution-only data" and "complete-trace data" during training, meaning we set p₁ = p₂ = p₃ = 0 in our randomization strategy. We refer to such variants as Mix-p models, where p represents the proportion of "solution-only data" in the training data. It is important to note that the "Solution-Only model" is essentially a Mix-1 model, while the "Complete-Trace model" is a Mix-0 model. Next, we compare Dualformer with Mix-p models of different p values in both reasoning modes.
Second, our discarding strategies are hierarchically and structurally designed. For example, Level 2 discarding is an extension based on Level 1 strategy. We further explore how model performance changes when discarding strategies are stopped at a certain specific level.
Comparison with Mix-p Models Figure D.1 compares the 1-Optimal-64 rate of Dualformer with Mix-p models. We tested eight p values: 0 (equivalent to the Complete-Trace model), 0.05, 0.1, 0.2, 0.4, 0.6, 0.8, and 1.0 (equivalent to the Solution-Only model). In both reasoning modes, Dualformer outperformed all Mix-p models in the five tasks we considered. In fact, Dualformer also comprehensively outperformed Mix-p models across all other metrics, as detailed in Appendix D. Notably, in slow mode, Dualformer is the model that generates the shortest reasoning traces, demonstrating the highest inference efficiency.
Analysis of Randomized Strategy Combinations We also compare Dualformer with variants in which the application of discarding strategies stops at a specific level. For the maze environment, we fixed the probability of using the complete reasoning trace at p₀ = 0.5 for each training sample and adjusted the probabilities of the other discarding strategies; for the Sokoban environment, we fixed the probability of Level 1 discarding at p₁ = 0.05. Table E.1 lists the probability distributions of the discarding strategies used by all models, and Table 5.7 presents the experimental results.
It should be noted that these variants cannot operate in fast mode because they were not trained on any Level 4 discarded data (i.e., completely discarded traces). Therefore, we only report their performance in slow mode. As the level of discarding strategies increases, the length of the generated reasoning traces gradually shortens. In terms of other metrics, models employing Level 1 + Level 2 + Level 3 discarding perform comparably to Dualformer. However, Dualformer offers a more comprehensive advantage by having shorter reasoning traces while also being capable of operating in fast mode.
5.2 Application to Large Language Model Training: Mathematical Reasoning
In this section, we demonstrate the effectiveness of structured trace discarding for fine-tuning large language models (LLMs) to solve mathematical problems. Specifically, we fine-tune Llama-3-8B and Mistral-7B on a dataset of mathematical problems whose answers include detailed reasoning steps, and we design trace discarding methods that likewise exploit the structure of these reasoning traces. We compare the fine-tuned models with baselines obtained by directly fine-tuning on the same dataset.
Dataset We evaluate all models using a dataset called Aug-MATH, which originates from the MATH dataset (Hendrycks et al., 2021). The original dataset contains 7500 training samples and 5000 test samples of mathematical problems and solutions. Following Yu et al. (2023)'s method, we invoked the Llama-3.1-70B-Instruct model to rewrite the original answers in a specified format, ensuring they included more detailed intermediate reasoning steps. To enhance the diversity of reasoning traces, we sampled 4 different LLM responses for each problem with a temperature of 0.7 and top-p = 0.9. The final dataset comprises 30,000 training samples and 5,000 test samples. Appendix F.1 illustrates the prompt template we used for answer rewriting, along with a specific training example before and after rewriting.
Structured Trace Discarding and Randomization The rewritten mathematical problem answers generated by Llama-3.1-70B-Instruct contain, on average, 6 to 10 intermediate reasoning steps, each potentially comprising multiple sentences. We adopt a randomization training method similar to the framework proposed in Section 4: in each training batch, for each sampled training instance, we randomly discard each of its intermediate reasoning steps with probability p. The experimental results for different p values will be reported below.
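A minimal sketch of this step dropping is shown below, where split_into_steps() and join_steps() are assumed helpers for segmenting and reassembling the rewritten answers:

```python
import random

def drop_steps(answer: str, p: float = 0.1) -> str:
    """Drop each intermediate reasoning step independently with probability p."""
    steps = split_into_steps(answer)   # assumed helper: answer -> list of steps
    kept = [s for s in steps if random.random() > p]
    return join_steps(kept)            # assumed helper: list of steps -> answer
```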
Hyperparameter Settings We fine-tuned both base models—Mistral-7B and Llama-3-8B—for two epochs with a batch size of 32, using the AdamW optimizer (Loshchilov and Hutter, 2019) with learning rates of 8×10⁻⁶ for Mistral and 5×10⁻⁶ for Llama-3, selected via validation loss. The full learning-rate search procedure and the remaining hyperparameters are detailed in Appendix A.2.
Evaluation Methods Similar to Dualformer, we evaluate model performance in both fast and slow modes: in fast mode, the Large Language Model is required to directly output the final answer; in slow mode, it is required to solve the problem by step-by-step reasoning. Evaluation uses zero-shot prompting, consistent with Yu et al. (2023). The prompt templates used for each mode can be found in Appendix F.2. We use the following two evaluation metrics (a code sketch follows the list):
- Greedy@1 (Dubey et al., 2024; Yu et al., 2023): For each problem, generate a single response at temperature 0 and check whether its answer is correct.
- pass@20 (Chen et al., 2021): For each problem, sample 20 responses at temperature 0.5; a problem counts as solved if at least one answer is correct, and we report the fraction of problems solved.
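A sketch of both metrics under stated assumptions: generate() stands in for sampling from the fine-tuned model and is_correct() for the answer checker; neither is the paper's actual evaluation harness.

```python
def greedy_at_1(model, problem) -> bool:
    """One greedy response per problem, judged for correctness."""
    return is_correct(generate(model, problem, temperature=0.0, n=1)[0])

def pass_at_20(model, problem) -> bool:
    """Solved if any of 20 sampled responses is correct."""
    return any(is_correct(r) for r in generate(model, problem, temperature=0.5, n=20))
```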
For reference, we also report the results of directly fine-tuning these models on the original MATH dataset.
Results The results are shown in Table 5.8. We tested four p values: 0.1, 0.2, 0.3, and 0.4. The experimental results indicate that the proposed training strategy improved both the effectiveness and efficiency of the two Large Language Models.
We first analyze the results for the Mistral-7B model. In slow-mode inference, the model fine-tuned with trace discarding and randomized training outperformed the baseline model directly fine-tuned on the Aug-MATH dataset. When p = 0.1, the Greedy@1 metric showed an absolute improvement of 1.7% (equivalent to a relative performance gain of 10%); when p = 0.2 and p = 0.3, the improvement was 0.9%; and when p = 0.4, the improvement was 0.1%.
On the pass@20 metric, our model also outperformed the baseline when p = 0.1, 0.2, and 0.3, with accuracy reaching 61.9% at best. Under both evaluation methods, the average reasoning trace length shortened as p increased.
Similarly, in fast-mode inference, our model also achieved higher accuracy rates. The Llama-3-8B model also showed a similar trend of performance improvement.
Finally, for reference, we list the results of fine-tuning the Mistral-7B and Llama-3-8B models on the original MATH dataset. These models perform significantly worse than those fine-tuned on Aug-MATH.
6 Conclusion
We proposed a simple and easily implementable framework for training Transformer models to solve reasoning and planning tasks. We deeply analyzed the structure of reasoning traces and designed corresponding discarding strategies to simulate "shortcut" behaviors in human thought processes. By randomly applying these discarding strategies to training samples, the resulting Dualformer model can be controlled to operate in fast, slow, or auto mode during inference—in auto mode, the model decides on its own which reasoning method to employ. Dualformer achieved performance improvements in both maze navigation tasks and Sokoban games, while also reducing the number of reasoning steps required.
Notably, our method is not limited to training task-specific models from scratch. We applied the same idea to fine-tuning Large Language Models (LLMs) on mathematical problems and likewise achieved performance improvements. Furthermore, the proposed framework reduces computational overhead, because discarding reasoning traces shortens input sequence lengths.
Future work can further investigate whether this method contributes to model scalability and explore combining methods such as curriculum learning and hierarchical planning to enable Dualformer to adapt to more complex task scenarios.
Appendix A: Network Architecture and Hyperparameters
We used the same encoder-decoder Transformer architecture as Lehnert et al. (2024) for Dualformer. Each token is first converted into a one-hot vector and then mapped to an embedding vector by an embedding layer. The embeddings then pass through the subsequent layers shown in Lehnert et al. (2024, Figure 4). We used RoPE embeddings for positional encoding and no dropout in the architecture.
For model size, architectural parameters, and batch size, we used the same settings as Lehnert et al. (2024). Specifically, for maze tasks, we used a 15-million-parameter model with 3 attention heads, 6 layers, and a hidden size of 64. We optimized with AdamW (Loshchilov and Hutter, 2019) at a learning rate of 2.5e-4 and a batch size of 16, with β₁ and β₂ set to 0.9 and 0.99, respectively. A linear warm-up was applied over the first 2000 gradient updates, followed by a cosine learning rate schedule (Loshchilov and Hutter, 2016).
For Sokoban tasks, we used a model with 46 million parameters, comprising 4 attention heads and 8 layers, with a hidden layer size of 96. We used a batch size of 64 and a learning rate of 7.5e-5, while other hyperparameters were the same as described above.
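Below is a sketch of this optimization setup in PyTorch (maze settings shown), assuming `model` is the 15M-parameter Transformer and 4×10⁵ total iterations; SequentialLR stitches the linear warm-up to the cosine decay.

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=2.5e-4, betas=(0.9, 0.99))
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1e-3, total_iters=2000)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=400_000 - 2000)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[2000])
# step the scheduler once per gradient update
```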
A.1 Mazes and Sokoban
We used the same dataset as Lehnert et al. (2024), available at https://github.com/facebookresearch/searchformer. The maze and Sokoban datasets each contain 10 million examples; we sorted them by the "id" field in MongoDB and used the first 100,000 (following the same method as Lehnert et al., 2024). For reference, the dataset was generated as follows. For maze tasks, 30%–50% of cells were first randomly designated as walls, then a start and a target position were randomly selected, and the A* algorithm was applied to generate an optimal plan. For Sokoban tasks, a 7×7 grid map was used with two randomly inserted wall cells as obstacles; two docks, two boxes, and one worker position were randomly placed. Once generated, a game was added to the dataset only if it could be solved by the A* algorithm.
A.2 Mathematical Reasoning
We used the implementation provided at https://github.com/meta-llama/llama-recipes to fine-tune the models.
We trained all models for 2 epochs with a batch size of 32. For the Llama model, we used the AdamW optimizer with a learning rate of 5×10⁻⁶; for the Mistral model, the learning rate was 8×10⁻⁶. The learning rates were selected by searching over 2×10⁻⁶, 5×10⁻⁶, and 8×10⁻⁶ and choosing the one with the lowest validation loss; we then retrained on the complete training set with the selected rate and report those results. For the remaining hyperparameters we used default values: no linear learning-rate warm-up, no weight decay, and no multi-step gradient accumulation. We used AdamW with betas = (0.9, 0.999) and eps = 1e-8, a multiplicative step learning-rate decay with γ = 0.85, and "packing" as the batching strategy.
Appendix C: Diversity of Generated Plans
Dualformer excels over baseline models in finding unique feasible solutions. To illustrate this intuitively, we selected an example maze task and generated 64 responses using Dualformer in fast mode. Figure C.1 plots all unique feasible paths found by fast-mode Dualformer, as well as paths found by the Solution-Only baseline (64 responses). Fast-mode Dualformer identified 42 unique feasible paths, while the Solution-Only model found only 3. Similarly, Figure C.2 compares slow-mode Dualformer with the Complete-Trace (Searchformer) baseline. Slow-mode Dualformer found 39 unique feasible paths, whereas the Complete-Trace model found only 17.