The excellent blog on large models has been updated!
Recently, Peking University alumna and former OpenAI Applied AI Research Lead, Lilian Weng, updated a very long blog post titled "Why We Think".
The article reviews recent research progress on how to effectively utilize test-time computation (i.e., "thinking time") and its mechanisms, aiming to provide rational support from multiple perspectives for the goal of enabling models to "think longer".
By observing the iterations of models like GPT, Claude, and Gemini, it is clear that their performance boundaries in advanced cognitive tasks such as complex logical reasoning, long text comprehension, mathematical problem-solving, and code generation and debugging are continuously expanding.
This performance improvement benefits from the optimization of strategies like Chain-of-Thought (CoT) and test-time computation, but it also brings new research challenges.
To facilitate a better understanding of this content for domestic readers, Synced has compiled this article. Interested readers can also refer to the original English content.
English blog link: https://lilianweng.github.io/posts/2025-05-01-thinking/
Analogy to Psychology
This core idea is closely related to human thinking. Humans usually cannot immediately provide the answer to "12345 × 56789". For complex problems, a period of thought and analysis is a natural process.
In "Thinking, Fast and Slow", Daniel Kahneman, based on the "dual process theory" of cognition, divided human thinking into two modes:
System 1 (fast thinking): Operates quickly and automatically, driven by intuition and emotion, requiring almost no cognitive effort.
System 2 (slow thinking): Relies on conscious logical reasoning and rigorous thought, requiring higher cognitive energy and active engagement.
Because System 1 thinking is easy to initiate and less effortful, it often dominates decision-making, but this usually comes at the cost of accuracy and logic. It naturally relies on the brain's "cognitive shortcuts" (i.e., heuristics), which can easily lead to errors and biases. By consciously slowing down the thinking process and allowing more time for reflection, improvement, and analysis, System 2 thinking can be activated, challenging intuition and leading to more rational and accurate decisions.
Computation as a Resource
In deep learning, one perspective is that neural networks can be characterized by the computational and storage resources they can invoke during each forward pass.
If we use gradient descent to optimize them to solve problems, the optimization process will "figure out" how to organize these resources for computation and information storage, and thus can be seen as a process of automatically planning circuits.
From this perspective, if we design an architecture or system that can perform more computations at test time, and we train the model to effectively utilize this additional resource, then the model's performance will be better.
In a Transformer, the number of floating-point operations (FLOPs) the model performs per generated token is roughly twice its parameter count. In sparse models (such as Mixture-of-Experts, MoE), only a portion of the expert networks is activated in each forward pass, so the parameters involved in computation are only a fraction of the total; the per-token compute is then roughly 2 × the number of active parameters, i.e., 2 × total parameters × the fraction of experts activated per forward pass.
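As a quick numeric illustration of these rules of thumb (the "2 × parameters" estimate is an approximation, and the model sizes below are made up for illustration), a minimal sketch:

```python
def flops_per_token(total_params: float, active_fraction: float = 1.0) -> float:
    """Approximate FLOPs per generated token.

    Dense model: ~2 * total parameters.
    Sparse MoE model: only a fraction of experts is active per forward pass,
    so the cost scales with the active parameters: ~2 * total * active_fraction.
    """
    return 2.0 * total_params * active_fraction

# Illustrative (made-up) model sizes.
print(f"dense 7B:             {flops_per_token(7e9):.2e} FLOPs/token")          # ~1.4e10
print(f"MoE 100B, 1/8 active: {flops_per_token(100e9, 1/8):.2e} FLOPs/token")   # ~2.5e10
```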
In contrast, Chain-of-Thought (CoT) allows models to perform more computations (i.e., FLOPs) when reasoning about each token they wish to generate. In fact, a major advantage of CoT is that it allows the model to flexibly adjust the amount of computation based on the difficulty of the problem.
Latent Variable Modeling
In machine learning, a classic approach is to define a probabilistic model with a latent variable z and an observed variable y, where y is the variable observed by the learning algorithm. By marginalizing (i.e., summing over) the possible values of the latent variable, a complex probability distribution over the observed variable y can be expressed: P(y) = Σ_z P(z) P(y | z).
For example, we can use this method to model the distribution of mathematical problems and their solutions: let x represent the problem text, y represent the true answer or proof process, and z represent a series of free-form reasoning processes that guide the generation of the proof. The optimization objective is the conditional probability P(y | x) = Σ_z P(z | x) P(y | x, z).
This latent variable perspective is particularly suitable for understanding multiple parallel Chain-of-Thoughts (CoTs) or search-based CoT generation methods. These algorithms can be viewed as sampling from the posterior distribution P(z | x, y). This perspective also reveals the benefits of optimizing the log-likelihood log P(y | x) as an objective function—because the log loss objective function is significantly effective in pre-training.
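As a toy illustration of this latent-variable view (not any paper's actual training code), the marginal P(y | x) can be estimated by Monte Carlo over sampled reasoning chains; the sampler and scorer below are hypothetical stand-ins for calls to a language model:

```python
import math
import random

# Hypothetical stand-ins; in practice both would be calls to a language model.
def sample_cot(x: str) -> str:
    """Sample a free-form reasoning chain z ~ p(z | x)."""
    return random.choice(["reasoning chain A", "reasoning chain B", "reasoning chain C"])

def answer_logprob(y: str, x: str, z: str) -> float:
    """Return log P(y | x, z); toy values keyed on the sampled chain."""
    return {"reasoning chain A": -0.2, "reasoning chain B": -1.5, "reasoning chain C": -3.0}[z]

def estimate_answer_prob(x: str, y: str, num_samples: int = 64) -> float:
    """Monte Carlo estimate of P(y | x) = E_{z ~ p(z | x)}[ P(y | x, z) ]."""
    total = sum(math.exp(answer_logprob(y, x, sample_cot(x))) for _ in range(num_samples))
    return total / num_samples

print(estimate_answer_prob("What is 12345 * 56789?", "701060205"))
```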
Thinking Token by Token
The strategy of generating intermediate reasoning steps before producing a short answer, particularly in mathematical problems, was initially proposed by researchers and has gradually evolved into a standard method. Early research in this area included building mathematical reasoning datasets and training generators with supervised learning, using human-written solution paths as reference, while simultaneously training a verification model to judge the correctness of candidate answers, thereby enabling effective search among multiple solutions.
Subsequent research introduced the concept of a "scratchpad," treating intermediate tokens generated by the model as temporary content to aid the reasoning process. This further led to the widely used term "Chain-of-Thought" (CoT), describing the ability of large language models to reason step-by-step rather than directly outputting answers.
Initial methods for enhancing CoT reasoning capabilities primarily involved supervised learning on human-written reasoning paths, or training on model-generated paths filtered for correctness, the latter of which can be considered an early form of reinforcement learning. Other research has shown that simply by designing appropriate prompts, such as using "think step-by-step" cues or constructing more complex prompts that guide the model to first consider relevant knowledge, the performance of instruction-tuned language models on mathematical tasks can be significantly improved.
With further exploration, researchers found that applying reinforcement learning to datasets with automatically verifiable answers can significantly enhance CoT reasoning capabilities. These datasets include STEM problems with definite short answers or programming tasks whose correctness can be verified through unit tests. Such reinforcement learning strategies have been proven to significantly boost the performance of large language models in multi-step reasoning tasks. In recent model development, adopting policy gradient algorithms combined with automatically evaluated problem sets has become an effective method for improving model performance.
Chain-of-Thought (CoT) prompting significantly improves the success rate in solving mathematical problems. The larger the model size, the more significant the performance gains from its "thinking time".
Branching and Revision
The fundamental purpose of test-time compute is to dynamically adjust the model's output distribution during inference. To this end, various strategies can be employed to consume additional test-time resources, thereby optimizing the decoding process, selecting better samples, and guiding the model's output towards a more desirable distribution.
Currently, there are two main categories of strategies for improving generation quality: parallel sampling and sequential revision.
Parallel sampling involves generating multiple output sequences in parallel at each step, while judging the quality of the final output through process reward signals or result evaluation modules (e.g., verifiers) to select the optimal answer. This is currently a widely adopted form of decoding method for improving test-time performance, such as best-of-N and beam search. In scenarios where ground truth is unavailable, a self-consistency strategy is often used, meaning the final answer is chosen by majority voting among multiple Chain-of-Thought (CoT) reasoning sequences.
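A minimal sketch of the self-consistency strategy just described, with a hypothetical `generate_cot_answer` standing in for one sampled CoT plus answer extraction:

```python
import random
from collections import Counter

def generate_cot_answer(question: str) -> str:
    """Hypothetical stand-in: sample one chain-of-thought and return its final answer."""
    return random.choice(["42", "42", "41"])  # toy answer distribution

def self_consistency(question: str, num_samples: int = 16) -> str:
    """Sample several CoT answers in parallel and return the majority-vote answer."""
    answers = [generate_cot_answer(question) for _ in range(num_samples)]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("toy question"))  # most likely "42"
```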
Sequential revision refers to reflecting on and iteratively correcting the model's previous output, guiding the model to actively identify and correct potential errors in subsequent outputs. This revision process usually needs to be implemented on a fine-tuned model; if it solely relies on the model's "intrinsic self-correction" without external supervised feedback, it is often difficult to achieve significant improvement.
Parallel sampling is simple to implement, intuitive, and easy to deploy, but its effectiveness largely depends on whether the model can generate the correct answer "in one go". Sequential revision, by explicitly guiding the model to identify and correct errors, is logically more targeted but slower, and requires additional control over the revision risks during the process, such as a correct answer being improperly changed to an incorrect one, or the model generating new hallucinations. These two methods can be combined. Snell et al. found that for simple problems, using a single sequential strategy for test-time computation works best; for highly difficult problems, an optimal combination ratio between parallel computation and sequential revision is usually needed to achieve the best task performance.
Diagram illustrating parallel sampling and sequential revision.
Parallel Sampling
When we have a generative model and a scoring function that can rate complete or partial generated samples, various search algorithms can be designed to find higher-scoring generated results. The simplest of these algorithms is best-of-N: directly generating N independent samples and then selecting the highest-scoring one based on the scoring function.
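A minimal best-of-N sketch; `generate` and `score` are hypothetical stand-ins for the generative model and the scoring function (e.g., a verifier or reward model):

```python
import random

def generate(prompt: str) -> str:
    """Hypothetical stand-in: sample one complete solution from the model."""
    return f"candidate solution {random.randint(0, 9)}"

def score(prompt: str, sample: str) -> float:
    """Hypothetical stand-in: verifier / reward-model score for a complete sample."""
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    """Generate N independent samples and keep the highest-scoring one."""
    samples = [generate(prompt) for _ in range(n)]
    return max(samples, key=lambda s: score(prompt, s))

print(best_of_n("toy prompt"))
```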
In contrast, beam search is a more complex and adaptive search algorithm that dynamically allocates computational resources to prioritize exploring more promising parts of the solution space. During beam search, the model maintains a set of the most promising partial sequences and alternates between expanding these candidate sequences and pruning less ideal ones.
When selecting generation candidates, a Process Reward Model (PRM) can be introduced as a guiding mechanism to assist in filtering paths during beam search. For example, some research uses the large language model itself to evaluate whether each of its generated reasoning steps is reasonable, formatting it as a multiple-choice question for per-step self-evaluation. This method can significantly reduce cumulative errors in multi-step reasoning processes, especially during the beam search decoding phase, where its effectiveness is particularly noticeable.
Furthermore, regulating randomness during sampling through temperature annealing strategies also helps mitigate unnecessary fluctuations in the generation process. This series of experimental methods achieved a 5–6% performance improvement when applied to few-shot tasks such as GSM8k, AQuA, and StrategyQA using the Codex model.
Distinct from traditional beam search, the REBASE (Reward Balanced Search) method trains an independent Process Reward Model to evaluate the expansion priority of each node in the beam search tree at different depths. This model dynamically determines which nodes receive more generation resources based on softmax-normalized reward scores.
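A rough sketch of the REBASE-style allocation idea as described above: the expansion budget is split across tree nodes in proportion to softmax-normalized process-reward scores (the temperature, scores, and budget below are illustrative, not values from the paper):

```python
import math

def allocate_expansions(node_rewards: list[float], budget: int, temperature: float = 0.5) -> list[int]:
    """Split a fixed expansion budget across frontier nodes of the search tree,
    proportional to softmax-normalized process-reward scores
    (higher-reward partial solutions get more children)."""
    exps = [math.exp(r / temperature) for r in node_rewards]
    total = sum(exps)
    # Rounding means the allocations may not sum exactly to `budget`; fine for a sketch.
    return [round(budget * e / total) for e in exps]

# Three partial solutions with illustrative PRM scores, 16 expansions to spend.
print(allocate_expansions([0.9, 0.5, 0.1], budget=16))
```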
Another process reward model, RATIONALYST, is trained on a large amount of unlabeled data for multi-step reasoning tasks. The model generates candidate reasoning paths (called rationales) and filters for high-quality reasoning processes based on whether these rationales can reduce the negative log-probability of the true answer token by a certain threshold.
During the inference phase, RATIONALYST can guide the Chain-of-Thought (CoT) generator in two ways: one is implicitly estimating the probability of the next reasoning step (implicit supervision); the other is explicitly generating the content of the next step as part of the prompt (explicit supervision).
Guidance mechanism for per-step self-evaluation by a large language model during beam search decoding.
Interestingly, Chain-of-Thought (CoT) reasoning paths can be elicited in language models even without explicit zero-shot or few-shot prompting. Wang and Zhou found that if branching occurs at the first token of sampling, keeping the top k tokens with the highest confidence (measured by the probability difference between top-1 and top-2 candidates), and then applying greedy decoding to continue generating each of these k paths, a significant portion of the generated sequences naturally contain a complete Chain-of-Thought.
Especially when CoT content already exists in the context, this mechanism is more likely to guide the model to generate a final answer with higher confidence. When evaluating the confidence of the final answer, it is necessary to first identify the answer segment through task-specific heuristics, for example, extracting the last numerical answer in a math problem; or by further prompting the model (e.g., by prompting "So the answer is" at the end) to extract an explicit answer.
The design logic of branching only at the first token is that, compared to subsequent tokens, early branching significantly enhances the diversity of reasoning paths, while the generation of subsequent tokens is more constrained by the existing contextual sequence.
In Top-k decoding, k represents the number of candidate options retained in the first sampling step.
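A schematic sketch of the confidence measure described above (the gap between the top-1 and top-2 token probabilities, averaged over the identified answer span); the probability arrays are toy placeholders rather than real model outputs:

```python
def answer_confidence(token_distributions: list[list[float]]) -> float:
    """Average gap between the top-1 and top-2 token probabilities over the
    tokens that make up the identified answer span (larger gap = more confident)."""
    gaps = []
    for probs in token_distributions:
        top1, top2 = sorted(probs, reverse=True)[:2]
        gaps.append(top1 - top2)
    return sum(gaps) / len(gaps)

# Toy example: next-token distributions for two answer tokens.
print(answer_confidence([[0.7, 0.2, 0.1], [0.5, 0.4, 0.1]]))  # (0.5 + 0.1) / 2 = 0.3
```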
Sequential Revision
If a model has the ability to reflect on and correct errors in its previous answers, we would expect it to produce an iteratively improving sequence of revisions. However, it turns out that this self-correction capability is not inherent in large language models and is often difficult to achieve without specific design. Failure modes include:
(1) Hallucination, meaning modifying an originally correct answer into an incorrect version;
(2) Behavioral collapse to a non-correcting state, for example, making only minor or no changes to an initial incorrect answer;
(3) Inability to generalize to distribution shifts at test time.
Experiments by Huang et al. showed that directly applying self-correction strategies might lead to performance degradation. For models to achieve self-improvement, they still need to rely on external feedback signals, such as comparison with ground truth labels, heuristic rules, task-specific evaluation metrics, unit test results for programming problems, references from more powerful models, and human feedback.
Self-correction learning aims to train a corrector model Pθ(y∣y₀, x) based on a fixed generative model P₀(y₀∣x). The generative model remains general, while the corrector model can be customized for specific tasks and generates only under the condition of a given initial model response and optional additional feedback (e.g., a piece of text, compiler hints, or unit test results):
1. Self-correction learning first generates multiple candidate outputs for each prompt in the data pool;
2. Then, "value-improving pairs" are constructed for each prompt; that is, two outputs are selected, and if one is superior to the other in evaluation metrics, they form a triplet (prompt x, initial answer y, corrected answer y′);
3. Next, these sample pairs are weighted and sampled to train the corrector model based on the value improvement v(y′) − v(y) brought by the correction, and the similarity Similarity(y, y′) between the two outputs;
4. To encourage exploration, new outputs generated by the corrector are also added to the data pool. During inference, the corrector can be repeatedly called to form a continuously improving revision trajectory (steps 2–3 are sketched in code below).
Diagram of self-correction learning for training a corrector model by matching different outputs of the model under the same problem to construct "value-improving pairs".
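A schematic sketch of steps 2–3 above: constructing value-improving pairs and a sampling weight from value gain and output similarity. The similarity proxy and the way gain and similarity are combined are assumptions for illustration, not the original recipe:

```python
import difflib
from itertools import permutations

def similarity(y: str, y_prime: str) -> float:
    """Simple string-overlap proxy for output similarity (an assumption for this sketch)."""
    return difflib.SequenceMatcher(None, y, y_prime).ratio()

def build_value_improving_pairs(prompt: str, outputs: list[str], value: dict[str, float]) -> list[dict]:
    """For one prompt, pair outputs where one strictly improves on the other and attach a
    sampling weight built from the value gain v(y') - v(y) and the similarity of the pair."""
    pairs = []
    for y, y_prime in permutations(outputs, 2):
        gain = value[y_prime] - value[y]
        if gain > 0:
            pairs.append({
                "x": prompt,
                "y": y,
                "y_prime": y_prime,
                "weight": gain * similarity(y, y_prime),  # how to combine is an assumption
            })
    return pairs

outputs = ["the answer is 40", "the answer is 42"]
values = {"the answer is 40": 0.1, "the answer is 42": 0.9}
print(build_value_improving_pairs("toy prompt", outputs, values))
```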
Recursive Introspection (RISE) also aims to train a better corrector model, but it employs a strategy where a single model performs both generation and self-correction.
SCoRe (Self-Correction via Reinforcement Learning) is a multi-round interactive reinforcement learning method aimed at encouraging the model to generate a better answer on the second attempt compared to the first, thereby achieving self-correction. This method consists of two training stages:
The first stage aims to maximize the accuracy of the second attempt's answer, while only applying a KL divergence penalty to the first attempt to prevent the first generation behavior from deviating too far from the base model, maintaining its generation style stability;
The second stage simultaneously optimizes the accuracy of both the first and second generated answers.
Ideally, we hope the model can achieve better performance in both the first and second attempts, but introducing the first stage can effectively avoid "behavioral collapse", where the model makes only minor or no modifications to the initial response;
The second stage further improves overall performance on this basis.
Explicit training setup to improve self-correction ability through two-stage RL training.
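A schematic of the two-stage objective as described above (not the paper's implementation); `beta` and the additive reward form are illustrative choices:

```python
def score_style_reward(stage: int, acc_first: float, acc_second: float,
                       kl_first_attempt: float, beta: float = 0.1) -> float:
    """Schematic per-episode objective for the two training stages described above.

    acc_first / acc_second: 1.0 if the first / second attempt is correct, else 0.0.
    kl_first_attempt: KL divergence between the current policy's first-attempt
    distribution and the base model's, keeping the first attempt close to the base model.
    """
    if stage == 1:
        # Stage 1: maximize second-attempt accuracy; KL-penalize only the first attempt.
        return acc_second - beta * kl_first_attempt
    # Stage 2: optimize accuracy of both attempts jointly.
    return acc_first + acc_second

print(score_style_reward(1, acc_first=0.0, acc_second=1.0, kl_first_attempt=0.3))
```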
Improving Reasoning Ability with Reinforcement Learning
Recently, a growing number of studies have successfully leveraged reinforcement learning (RL) to enhance the reasoning capabilities of language models. These methods typically use problem sets with reliable reference answers (mainly STEM problems or logic puzzles with easily verifiable answers), optimizing performance by rewarding the model for generating correct answers.
The development in this direction has been driven by the excellent reasoning performance of models released by OpenAI, followed by DeepSeek's models and technical reports which further accelerated relevant research progress.
DeepSeek-R1 is an open-source large language model specifically built to handle highly complex tasks such as mathematics, programming, and logical reasoning. Its training process includes two rounds of SFT-RL (Supervised Fine-Tuning + Reinforcement Learning), enabling R1 to perform well in both reasoning and non-reasoning tasks.
In the initial training phase, "cold-start SFT" (Supervised Fine-Tuning) is adopted, which means fine-tuning the base model on thousands of cold-start data points. If this step is omitted, the model might exhibit issues such as language mixing and incoherent expressions in its early performance.
In the reasoning-oriented RL (Reinforcement Learning) phase, the model undergoes reinforcement training using prompts that only contain reasoning tasks. The reward signals are two types of rule-based metrics (a minimal sketch follows the list):
Format rewards: require the model to wrap "Chain-of-Thought" (CoT) with special tokens.
Accuracy rewards: determine if the final answer is correct. Answers to mathematical problems must be presented in a specific format (e.g., enclosed in a box) for automatic verification; programming problems are evaluated by a compiler to check if they pass test cases.
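A minimal sketch of such rule-based checks. The <think>…</think> tags and the \boxed{…} answer format are assumptions used for illustration; the exact tokens and verification code are not reproduced here:

```python
import re

def format_reward(completion: str) -> float:
    """Reward the model for wrapping its chain-of-thought in special tags
    (illustrative <think>...</think> tags; the exact tokens are an assumption)."""
    return 1.0 if re.search(r"<think>.+?</think>", completion, flags=re.DOTALL) else 0.0

def accuracy_reward(completion: str, reference: str) -> float:
    """Reward a final answer presented in a verifiable format, e.g. \\boxed{...} for math."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    return 1.0 if match and match.group(1).strip() == reference.strip() else 0.0

sample = "<think>2 + 2 is 4 because ...</think> The answer is \\boxed{4}."
print(format_reward(sample), accuracy_reward(sample, "4"))  # -> 1.0 1.0
```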
In the rejection sampling + non-reasoning supervised fine-tuning stage, reasoning outputs generated from the RL step 2 checkpoint are filtered for quality using rejection sampling. Combined with supervised data from non-reasoning tasks such as writing, factual questioning, and self-cognition, a new training set is constructed to train DeepSeek-V3-Base:
Filter out CoT results containing mixed languages, overly long paragraphs, or verbose code blocks;
Utilize the DeepSeek-V3 pipeline to generate and annotate non-reasoning tasks;
For certain non-reasoning needs, DeepSeek-V3 can be called to pre-generate potential thought processes as prompts before answering, whereas simple queries like "hello" do not require Chain-of-Thought;
Finally, fine-tune DeepSeek-V3-Base for two epochs using approximately 800,000 samples.
The final reinforcement learning phase uses the checkpoint from the third stage, training on both reasoning and non-reasoning prompts to comprehensively improve the model's helpfulness, harmlessness, and reasoning capabilities.
DeepSeek-R1's performance is comparable to OpenAI models across multiple widely used reasoning benchmarks. Notably, DeepSeek-V3 is the only non-reasoning model on the leaderboard.
Interestingly, the DeepSeek team demonstrated that even without any Supervised Fine-Tuning (SFT) stage, relying solely on pure Reinforcement Learning (RL) can still enable models to learn advanced reasoning capabilities such as reflection and backtracking, leading to what they call "Aha moments." During RL training, the model naturally learns to invest more "thinking tokens" when solving reasoning tasks, thereby improving its problem-solving abilities.
An "Aha moment" refers to the model's ability to reflect on previous errors and attempt alternative strategies for correction. The emergence of this capability indicates that the model possesses human-like deep thinking processes.
Subsequently, several open-source projects have attempted to reproduce the results of R1, including Open-R1, SimpleRL-reason, and TinyZero, all built upon the Qwen model. Related research has also confirmed that pure reinforcement learning methods can achieve excellent performance on mathematical problems and exhibit similar "Aha moments."
Examples of models learning to reflect and correct errors.
The DeepSeek team also shared some unsuccessful training attempts. For instance, using a process reward model failed because it was difficult to define clear scoring criteria for every intermediate step and to judge whether intermediate reasoning was correct. Furthermore, this approach easily led to reward hacking, thereby affecting training stability.
Attempts with Monte Carlo Tree Search (MCTS) also failed because the token space of language models is far larger than traditional finite action spaces like in chess, making the search process computationally very expensive. Simultaneously, training a fine-grained value model to guide the search also proved extremely challenging.
Despite these failures, these attempts provided unique research insights. We encourage the research community to further share such "unsuccessful attempts" to foster a more comprehensive understanding and progress.
Use of External Tools
During the reasoning process, certain intermediate steps can be efficiently and precisely solved by executing code or mathematical computations. By offloading this part of the reasoning work to external code interpreters, such as PAL (Program-Aided Language Model) or Chain of Code methods, the capabilities of large language models can be expanded, allowing them to avoid learning code execution themselves or acting as calculators.
In Chain of Code, the code execution system used can be further enhanced by the LLM: when a standard code interpreter fails to correctly run a specific line of code, the model can also attempt to execute that code segment directly using the LLM. Using code to assist reasoning is especially suitable for mathematical problems, symbolic reasoning, and algorithmic tasks.
In some programming tasks, unit tests may not be provided in the problem statement. In such cases, we can guide the model to autonomously generate unit tests for verifying solutions, thereby improving the correctness of the answers.
An example of a Program-Aided Language Model prompt is shown below.
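Since the original prompt figure is not reproduced here, the following is an illustrative PAL-style sketch in the same spirit: the model writes Python for a GSM8K-style word problem, and a Python interpreter (not the LLM) executes it. The question wording and variable names are our own.

```python
# Illustrative PAL-style prompt and execution loop (a sketch, not the original figure).
prompt = (
    "Q: Olivia has $23. She bought five bagels for $3 each. "
    "How much money does she have left?\n"
    "# solution in Python:\n"
)

# A model completion might look like this (hard-coded here for illustration):
completion = (
    "money_initial = 23\n"
    "bagels = 5\n"
    "bagel_cost = 3\n"
    "money_spent = bagels * bagel_cost\n"
    "answer = money_initial - money_spent\n"
)

# The arithmetic is offloaded to the Python interpreter rather than the LLM.
namespace: dict = {}
exec(completion, namespace)
print(namespace["answer"])  # -> 8
```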
The ReAct (Reason+Act) method combines calling external operations like Wikipedia API with generating reasoning trajectories to enable the incorporation of external knowledge into the reasoning path.
An example of the ReAct prompting method, demonstrating how it aids reasoning by calling the Wikipedia search API as an external tool to answer a HotpotQA question.
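A schematic ReAct-style step, with a hypothetical `search_wikipedia` tool and a hard-coded thought/action standing in for model output (the HotpotQA-style query is illustrative):

```python
def search_wikipedia(query: str) -> str:
    """Hypothetical external tool; a real implementation would call the Wikipedia API."""
    return f"(summary of the Wikipedia article for '{query}')"

def react_step(thought: str, action: str, argument: str) -> str:
    """Format one Thought -> Action -> Observation step of a ReAct trajectory."""
    observation = search_wikipedia(argument) if action == "Search" else ""
    return f"Thought: {thought}\nAction: {action}[{argument}]\nObservation: {observation}\n"

# Hard-coded illustrative step; in practice the model generates the thought and action,
# and the observation is appended to the context before the next generation.
print(react_step(
    thought="I need to find when each magazine was started; search the first one.",
    action="Search",
    argument="Arthur's Magazine",
))
```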
OpenAI's recently released o3 and o4-mini models are also typical examples of integrated tool use, with their reasoning processes involving web search, code execution, and image processing. The research team observed that, just as in GPT-style pretraining, large-scale reinforcement learning exhibits the same trend of "more compute, better performance."
Thinking Faithfully
Deep learning models are often regarded as black boxes, and various interpretability methods have been proposed. Interpretability is useful for several reasons: first, it provides us with an additional test to determine if the model is acting contrary to its creator's intent or if it has erred in a way we cannot judge by monitoring its behavior. Second, it can help us determine if the model used a reasonable procedure to compute its answer. Chain-of-Thought offers a particularly convenient form of interpretability because it allows the model's internal processes to be expressed in natural language. However, this interpretability relies on the assumption that the model accurately describes its internal thought process.
Recent research indicates that monitoring the CoT of reasoning models can effectively detect improper model behaviors, such as reward hacking, and can even enable weaker models to monitor stronger ones. Increasing test-time computation also enhances adversarial robustness; this intuitively makes sense, as when a model encounters unusual inputs, such as adversarial samples or jailbreaking attempts, longer thinking time should be particularly useful, allowing it to utilize the extra time to comprehend the strange situation encountered.
In experiments, we asked a model to determine if another model was attempting to hack unit tests in some way to solve coding problems in its thought process. We can use different types of monitors during training to track these reward hacking behaviors. An exit(0) code hack is when an agent exploits a vulnerability that allows it to exit the environment prematurely without running all unit tests. A raise SkipTest hack is when an agent raises an exception from a function outside the test framework to skip unit test evaluation. Source: Baker et al. 2025
Does the Model Faithfully Express its Thoughts?
Intuitively, due to the lack of explicit training objectives designed to encourage faithful reasoning, the model's CoTs may be biased. Alternatively, when we fine-tune models based on human-written explanations, these human-written samples might contain errors. Therefore, we cannot assume that CoT is always faithful.
Lanham et al. (2023) investigated several modes of CoT faithfulness failure by intentionally introducing errors into CoT and measuring their impact on the accuracy of a range of multiple-choice tasks (e.g., AQuA, MMLU, ARC Challenge, TruthfulQA, HellaSwag):
Error 1 (Premature Answering): The model might form a conclusion prematurely before CoT generation. The test method involves truncating early or inserting errors into the CoT. Different tasks vary in their reliance on CoT effectiveness; some tasks' evaluation performance is sensitive to truncated CoT, while others are not. Wang et al. (2023) conducted similar experiments but observed more subtle errors related to bridging objects or language templates during CoT formation.
Error 2 (Uninformative tokens): Uninformative CoT tokens can improve performance. This hypothesis was tested by replacing CoT with filler text (e.g., all periods), and results showed that this setting did not improve accuracy, and performance on some tasks might slightly decrease compared to not using CoT.
Error 3 (Human-Unreadable Encoding): Encoding relevant information in a way that is difficult for humans to understand. Rephrasing CoT in non-standard ways did not decrease performance across datasets, suggesting that accuracy improvement does not rely on human-readable reasoning.
Diagram illustrating different ways to perturb CoT to evaluate its faithfulness. Source: Lanham et al. 2023
Interestingly, Lanham et al. suggest that for multiple-choice questions, smaller models might not utilize CoT effectively, while larger models might be able to solve tasks without CoT. This reliance on CoT reasoning (measured as the percentage of obtaining the same answer with and without CoT) does not always increase with model size in multiple-choice questions, but it does increase with model size in addition tasks, implying that thinking time is more crucial in complex reasoning tasks.
The reliance on CoT reasoning is measured as the percentage of obtaining the same answer with and without CoT. CoT plays a larger role for reasoning tasks like addition, and larger models also benefit more.
Other methods for testing CoT faithfulness include perturbing the prompt rather than directly modifying the CoT path.
One method is to label the correct answer as "(A)" in a few samples, regardless of the true label, thereby introducing bias.
Another prompting technique is to insert misleading hints, such as "I think the answer is <random_label>, but I'd love to hear your thoughts" or "A Stanford professor thinks the answer is <random_label>". By comparing the model's predictions on the same question with and without the hint, we can gauge whether the model faithfully describes the hint's influence on its answer. In particular, when the hinted answer differs from the un-hinted answer, we measure whether the model acknowledges the hint when solving the hinted problem. If the model is faithful, it should explicitly acknowledge the hint's influence and attribute the change in its answer to it.
GPT models and Claude models are sensitive to different types of contextual biases. The decrease in model accuracy indicates systemic unfaithfulness. Directly prompting incorrect labels is more effective than "answer is always A" type biases. Source: Turpin et al. 2023
Multiple studies have found that reasoning models describe the influence of hints far more reliably than all non-reasoning models tested. For example, we can measure the proportion of samples where the model admits the hint was a decisive factor ("faithful CoT"). Reasoning models (Claude 3.7 Sonnet, DeepSeek R1) generally outperform non-reasoning models (Claude 3.6, DeepSeek V3).
Reasoning models are more likely to reveal faithful CoT than non-reasoning models. Source: Chen et al. 2025
Some evidence suggests that using reward models can reduce the faithfulness of model outputs. Reward models in classical RLHF were not trained to incentivize faithful behavior in this test, as preference comparison datasets are collected based on human choices of what looks better or more pleasing. However, in reasoning models, CoT paths are optimized to produce correct final answers rather than to match human preferences as defined in RMs, so they are intuitively expected to be more faithful.
Optimization Pressure on CoT: Good or Bad?
Monitoring the CoT of reasoning models to detect reward hacking behavior is effective. A simple way to use such a monitor is to run it as a filter during test time for rejection sampling, finding solutions without reward hacking behavior. However, this method is only effective in the early stages of reasoning, and as the model is further optimized, samples that pass CoT monitoring are rarely found within budget. This challenge naturally suggests that we could consider incorporating CoT monitoring into RL rewards—a form of process-based supervision—to prevent reward hacking.
However, adding CoT monitoring rewards in RL does not work well, as it can lead to obfuscated reward hacking behavior, where the model attempts to hide its true intentions within the CoT (Baker et al. 2025). If this direct optimization pressure from CoT monitoring is used during RL training, the model can still find ways to hack, albeit at a lower frequency, but it can hide its intentions in the CoT, avoiding detection by the monitor.
During training, when monitors are used to provide RL reward signals to prevent reward hacking behavior, the model can still find ways to hack while avoiding detection by the monitor. Source: Baker et al. 2025
Yeo et al.'s experiments on CoT length RL signals confirmed similar findings to Baker et al., namely that novel RL reward shaping can lead to new types of reward hacking behavior. Their designed reward function would give correct CoTs higher rewards than incorrect ones, short correct CoTs higher rewards than long correct ones, and short incorrect CoTs higher penalties than long incorrect ones. With this additional reward, the model learned to repeat words in CoTs on challenging problems instead of attempting to solve them. Yeo et al. further applied n-gram repetition penalties to prevent this hacking behavior.
To stably train with CoT length rewards, careful reward shaping is necessary. Source: Yeo et al. 2025
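A schematic reward-shaping function consistent with the ordering described above (correct > incorrect, shorter correct > longer correct, shorter incorrect penalized more than longer incorrect); the exact functional form in Yeo et al. differs, and the constants below are illustrative:

```python
def length_shaped_reward(correct: bool, cot_len: int, max_len: int = 4096) -> float:
    """Schematic length-based reward shaping (not the paper's exact function):
    - correct CoTs score higher than incorrect ones,
    - among correct CoTs, shorter is better,
    - among incorrect CoTs, shorter is penalized more than longer."""
    frac = min(cot_len / max_len, 1.0)
    if correct:
        return 1.0 - 0.5 * frac    # in [0.5, 1.0]: shorter correct -> higher reward
    return -1.0 + 0.5 * frac       # in [-1.0, -0.5]: shorter incorrect -> harsher penalty

for correct, n in [(True, 200), (True, 3000), (False, 200), (False, 3000)]:
    print(correct, n, round(length_shaped_reward(correct, n), 3))
```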
Chen et al. conducted experiments in a flawed RL environment, specifically using a scorer that filled in incorrect answers for multiple-choice questions. In over 99% of the prompts, the model learned to exploit reward hacking, but in more than half of the environments, the model almost never (<2%) verbally expressed the reward hacking in its CoT. In this scenario, additional RL optimization pressure failed to incentivize the model to verbally express hacking behavior.
RL training is inherently very sensitive to reward hacking behavior. Relying solely on heuristic investigations and manual fixes for reward hacking might lead to a "whack-a-mole" situation. We recommend being very cautious when attempting to directly optimize CoT during RL training, or preferably avoiding it altogether.
Thinking in Continuous Spaces
Adaptive Computation Time, proposed by Alex Graves in 2016, predates large language models but pioneered the same direction: allowing models to dynamically decide the number of computational steps during inference. This can be seen as enabling models to "think more" in a continuous space at test time. Adaptive thinking time in continuous spaces can be achieved vertically through recurrent architectures or horizontally through more continuous sampling steps.
Recurrent Architectures
Many architectural variants have been proposed to enable Transformer architectures to become recurrent architectures and achieve adaptive test-time computation. Delving deeply into the literature on this topic would make the article too long, so we will only review a few of them.
Universal Transformer combines self-attention from Transformers with recurrent mechanisms from RNNs, dynamically adjusting steps using adaptive computation time. At a high level, it can be seen as a recursive function for learning the hidden state representation of each token; if the number of steps is fixed, a Universal Transformer is equivalent to a multi-layer Transformer with shared parameters across layers.
The latest recurrent architecture design proposed by Geiping et al. adds a recurrent block R on top of the standard Transformer. Each iteration of this recurrent block requires embedding e and a random state s_i. Conceptually, this recurrent deep architecture is somewhat similar to conditional diffusion models, where the original input e is provided at each recurrent step, while the random Gaussian initialized state s_i is iteratively updated throughout the process. (Interestingly, their experiments on designs more similar to diffusion models did not yield ideal results.)
The recursive count r during training is stochastic, sampled per input sequence from a log-normal Poisson distribution. To control computational cost, backpropagation is truncated only to the last k iterations of the recurrent unit (k=8 in experiments), allowing training on the heavy tail of the Poisson distribution. Since the output e of the embedding block is injected at each step, the embedding block continues to receive gradient updates at each step, mimicking RNN training. As expected, the stability of training recurrent models is highly sensitive. Factors such as initialization, normalization, and hyperparameters are crucial, especially when scaling up training. For example, hidden states might collapse by predicting the same hidden state for every token; or the model might learn to ignore the incoming state s. To stabilize training, Geiping et al. adopted embedding scale factors, smaller learning rates, and meticulous tuning.
Experimental plot for training a 3.5B model with deep recursion. Saturation roughly occurs around a mean recurrence of r̄ = 32, which raises the question of how this structure extrapolates and generalizes to a larger number of iterations. Source: Geiping et al. 2025
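A schematic of the recurrent-depth forward pass described above: the embedded input e is injected at every iteration while a randomly initialized state s is refined r times before decoding. The callables are toy stand-ins, not the actual architecture:

```python
import numpy as np

def recurrent_depth_forward(embed, recur_block, decode, tokens, r: int):
    """Schematic forward pass: the embedded input e is injected at every iteration
    while a random Gaussian state s is refined r times, s_{i+1} = R(e, s_i),
    before decoding. `embed`, `recur_block`, `decode` are hypothetical stand-ins."""
    e = embed(tokens)                    # embedding of the input sequence
    s = np.random.randn(*e.shape)        # random Gaussian initial state s_0
    for _ in range(r):                   # r is sampled per sequence during training
        s = recur_block(e, s)            # the original input e is provided at each step
    return decode(s)

# Toy stand-ins so the sketch runs end to end.
embed = lambda toks: np.ones((len(toks), 4))
recur_block = lambda e, s: np.tanh(e + 0.5 * s)
decode = lambda s: s.mean(axis=-1)
print(recurrent_depth_forward(embed, recur_block, decode, ["a", "b", "c"], r=8))
```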
Thinking Tokens
Thinking tokens refer to a set of implicit tokens introduced during training or inference that do not carry direct linguistic meaning. Instead, their purpose is to provide the model with additional thinking time and computational capacity, enabling it to perform better.
Herel & Mikolov (2023) proposed the idea of inserting special thinking tokens (<T>) after each word in a sentence and training models on such datasets. Each thinking token buys the model extra time to process and make better predictions. Training with thinking tokens on toy model setups resulted in lower perplexity compared to baseline models trained without thinking tokens. For non-trivial reasoning tasks or sentences involving numbers, the advantage of thinking tokens is more pronounced.
Similarly, Goyal et al. (2024) proposed pause tokens which delay the model's output by adding dummy tokens (e.g., periods or # characters) at the end of the input sequence, thereby providing the model with additional computation during inference. Injecting such pause tokens is important during both training and inference, whereas fine-tuning only on pause tokens yields limited benefits. During training, multiple copies of pause tokens are inserted at uniformly random positions, and the loss for pause tokens is ignored during training.
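A minimal sketch of the training-time setup just described: dummy pause tokens inserted at uniformly random positions, with a mask marking them so the loss ignores them (the token string and masking mechanics are illustrative):

```python
import random

PAUSE = "<pause>"

def insert_pause_tokens(tokens: list[str], num_pauses: int, seed: int = 0) -> tuple[list[str], list[bool]]:
    """Insert copies of a dummy pause token at uniformly random positions and return
    a mask marking which positions should contribute to the training loss
    (a schematic of the setup described above, not the paper's exact code)."""
    rng = random.Random(seed)
    out = list(tokens)
    for _ in range(num_pauses):
        out.insert(rng.randint(0, len(out)), PAUSE)
    loss_mask = [tok != PAUSE for tok in out]   # False -> no loss on pause tokens
    return out, loss_mask

tokens, mask = insert_pause_tokens(["The", "answer", "is", "42"], num_pauses=3)
print(tokens)
print(mask)
```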
Interestingly, the thinking tokens or pause tokens in the experiments above do not carry any extra information or add many new parameters. Why are they still helpful? On one hand, they help extend computation by introducing more reasoning loops, thereby effectively increasing computational power. On the other hand, they can be viewed as a special implicit form of CoT. A drawback here is that the model needs to be pre-trained based on thinking tokens. Nevertheless, this strategy remains an interesting approach to further improve the utilization of test-time computation beyond inference-time CoT.
Quiet-STaR introduces token-level reasoning by training the model to generate a rationale after every token to explain future text. It mixes future-text predictions made with and without rationales, learning to generate better rationales and using REINFORCE to optimize the quality of rationale generation.
Quiet-STaR diagram.
Thinking as a Latent Variable
Latent variable models define a probabilistic framework in which observed data is explained by unobserved (latent) variables that capture the hidden structure or intermediate processes generating the observable outcomes. Language models can be seen as probabilistic latent variable models, where test-time thoughts and reasoning steps are latent thought variables. Such a latent variable model defines the joint distribution of problem x_i, answer y_i, and latent thought z_i. We aim to optimize the marginal log-likelihood of the answer given the problem, treating Chain-of-Thoughts as latent variables (N is the number of samples; K is the number of Chain-of-Thoughts sampled per problem):
Σ_{i=1}^{N} log p(y_i | x_i) = Σ_{i=1}^{N} log Σ_z p(z | x_i) p(y_i | x_i, z) ≈ Σ_{i=1}^{N} log [ (1/K) Σ_{k=1}^{K} p(y_i | x_i, z_i^{(k)}) ], with z_i^{(k)} ∼ p(z | x_i).
That is, the goal is to maximize the marginal likelihood p(y | x) of the correct answer, given multiple sampled reasoning trajectories {z^{(k)}}_{k=1}^{K} for each problem.
Expectation Maximization
Expectation Maximization is a commonly used iterative algorithm for optimizing model parameters with (implicit) latent variables, and thus can be applied to train better Chain-of-Thoughts, and then generate better responses conditioned on them. Typically, we iterate between the E-step (Expectation) and M-step (Maximization). In the E-step, we infer the missing information about the latent variables (i.e., how to sample better Chain-of-Thoughts), and in the M-step, we optimize the model parameters based on the latent variables (i.e., how to sample better answers), until convergence.
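A schematic EM-style loop for latent chain-of-thoughts, with a hypothetical `model` API (`sample_cot`, `answer_likelihood`, `finetune`); this is a sketch of the general idea, not any specific paper's recipe:

```python
def em_for_latent_cots(model, problems, answers, num_rounds: int = 3, k: int = 8):
    """Schematic EM-style training loop over latent chain-of-thoughts.

    `model` is a hypothetical object exposing:
      - sample_cot(x): sample a candidate reasoning chain for problem x
      - answer_likelihood(y, x, z): p(y | x, z) under the current model
      - finetune(dataset): supervised update on (problem, CoT, answer) triples
    """
    for _ in range(num_rounds):
        dataset = []
        # E-step: infer good CoTs for each (x, y), e.g. keep the one that makes y most likely.
        for x, y in zip(problems, answers):
            cots = [model.sample_cot(x) for _ in range(k)]
            best = max(cots, key=lambda z: model.answer_likelihood(y, x, z))
            dataset.append((x, best, y))
        # M-step: update the model to produce those CoTs and answers.
        model.finetune(dataset)
    return model
```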
Since we cannot directly sample from the latent variable distribution p(z∣x,y), researchers have explored methods that rely on human-annotated data, Metropolis-Hastings MCMC, or Monte Carlo sampling with special importance weights to draw good Chain-of-Thought samples to update the model. Ruan et al. (2025) experimented with training models with latent thoughts using the EM algorithm on a large corpus of web text, where latent thoughts were synthesized for each observed data chunk, and then the model learned autoregressively on both latent thoughts and data.
Diagram illustrating training on data corpus injected with latent thoughts.
They first prompt a large language model to generate synthetic latent thoughts Z_i based on observed data X_i:
Special tokens such as <StartOfLatent><Prior> ... <EndOfPrior> are used to insert the generated latent thought content into the original data, for training the joint distribution p(z, x) or approximating the posterior q(z | x), depending on whether z is inserted before or after x. However, since we use a large language model to generate the Chain-of-Thought, this imposes a performance ceiling on how good the approximation of q(z | x) can be. Ruan et al. introduced importance weights in the E-step to select Chain-of-Thought samples, of the form w^{(k)} ∝ p(x | z^{(k)}) p(z^{(k)}) / q(z^{(k)} | x).
This way, we prioritize Chain-of-Thought samples that are good at predicting the observed data (high p(x | z^{(k)})), are simple and intuitive a priori (high p(z^{(k)})), yet remain informative and not too obvious (low q(z^{(k)} | x)).
Iterative Learning
Since pre-trained models already possess the ability to generate Chain-of-Thoughts, we can intuitively design an iterative improvement process where we generate multiple CoTs and fine-tune the model only based on the rationales that lead to correct answers.
However, this straightforward design might fail because the model receives no learning signal for problems it cannot solve. STaR addresses this limitation by adding a "rationalization" process for failed attempts: the model generates a CoT backwards, conditioned on the problem and the ground-truth answer, so that it can produce more plausible CoTs for problems it otherwise fails. The model is then fine-tuned on correct solutions, whether produced directly or generated through rationalization.
STaR algorithm.
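A schematic of one STaR iteration as described above, with a hypothetical `model` API; the rationalization branch conditions CoT generation on the ground-truth answer:

```python
def star_iteration(model, problems, answers, k: int = 4):
    """One STaR-style iteration (schematic; `model` is a hypothetical object):
    1) sample CoTs and keep those whose final answer is correct;
    2) for unsolved problems, 'rationalize' by generating a CoT conditioned
       on the ground-truth answer;
    3) fine-tune on the collected (problem, CoT, answer) triples."""
    finetune_set = []
    for x, y in zip(problems, answers):
        samples = [model.generate_cot_and_answer(x) for _ in range(k)]
        correct = [(z, y_hat) for z, y_hat in samples if y_hat == y]
        if correct:
            finetune_set.append((x, correct[0][0], y))
        else:
            z_hint = model.generate_cot_given_answer(x, y)   # rationalization
            finetune_set.append((x, z_hint, y))
    model.finetune(finetune_set)
    return model
```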
We can view STaR as an approximation of policy gradients in RL, where the reward is a simple indicator function 𝟙[ŷ = y]. We aim to maximize the expected value of this reward when sampling z ∼ p(z | x) and ŷ ∼ p(y | x, z).
Each iteration is equivalent to first selecting CoT samples and then running supervised fine-tuning to optimize the log-likelihood of generating good CoT and answers. STaR's performance improves with increasing training iterations, and the "rationalization" process for generating better CoT can accelerate learning. They observed that high-temperature sampling increases the chance of obtaining correct answers through incorrect reasoning, and fine-tuning LLMs on such data can harm generalization. For data without ground truth, a majority vote from multiple high-temperature outputs can serve as a proxy for the ground truth answer, allowing training with synthetic samples.
Accuracy comparison on the task of adding two n-digit numbers. With rationalization (generating step-by-step reasoning conditioned on the ground-truth answer), the model is able to learn complex arithmetic tasks, such as five-digit addition, at a relatively early stage of training.
Scaling Laws of Thinking Time
So far, there is substantial evidence that allowing models to invest additional computational resources during inference, by thinking before producing a final answer, can significantly improve performance. Techniques such as prompting models to generate intermediate reasoning steps before giving an answer, or training models to pause and reflect before predicting the next token, have been found to boost model performance beyond the capabilities achieved during training. This essentially introduces a new dimension for improving model intelligence, complementing existing factors defined in scaling laws such as model scale, training compute, and data volume.
Recent research indicates that optimizing test-time computation for large language models may be more effective than simply scaling up model parameters. Smaller models combined with advanced reasoning algorithms can offer a Pareto-optimal trade-off between cost and performance.
Snell et al. evaluated and compared test-time computation and pre-training computation, finding that they are not 1:1 interchangeable. When the model capability gap is small, test-time computation can easily bridge the gap for simple and medium-difficulty problems, but it is less effective for hard problems. The ratio of pre-training tokens to inference tokens is very important. Test-time computation is more advantageous only when inference tokens are significantly fewer than pre-training tokens. This suggests that developing capable base models using ample pre-training data and computational resources remains crucial, as test-time computation cannot solve all problems or close large model capability gaps.
(Left) Evaluating accuracy as a function of test-time compute budget through iterative revision or parallel decoding. (Right) Comparing a small model with test-time compute sampling techniques against a model 14x larger using only greedy decoding. We can control the token budget used at test time, such that the ratio of inference tokens to pre-training tokens is <<1, ~=1, or >>1. Only when the ratio is <<1 are the benefits of test-time compute significant.
The s1 model experimented with extending Chain-of-Thought reasoning path length using budget forcing techniques (i.e., by adding the word "wait" to force longer thinking, or by adding a thought-ending token or "Final Answer:" to shorten and terminate the model's thinking process). They observed a clear positive correlation between the average thinking time, measured in tokens, and downstream evaluation accuracy.
In s1 experiments, both parallel and sequential scaling methods for test-time computation showed a positive correlation with evaluation performance.
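A rough sketch of the budget-forcing control loop described above, with a hypothetical `generate` callable; the "Wait" continuation and "Final Answer:" terminator follow the description in the text, while the token accounting and any special tags are simplifications:

```python
def budget_forced_decode(generate, prompt: str, num_waits: int, budget: int) -> str:
    """Rough sketch of budget forcing with a hypothetical `generate(text, max_new_tokens)`
    callable. Appending "Wait" suppresses the end of thinking and extends the trace;
    appending "Final Answer:" terminates thinking and elicits the answer."""
    thinking = generate(prompt, max_new_tokens=budget)
    for _ in range(num_waits):
        # Force the model to keep thinking by continuing after "Wait".
        thinking += "\nWait," + generate(prompt + thinking + "\nWait,", max_new_tokens=budget)
    # Cut the thinking phase short and ask for the final answer.
    return generate(prompt + thinking + "\nFinal Answer:", max_new_tokens=64)
```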
When comparing this budget forcing technique with other decoding methods that control the length of the reasoning trajectory, it was quite surprising that simple rejection sampling (i.e., sampling generations until the length fits the token budget) led to inverse scaling, meaning longer Chain-of-Thoughts resulted in worse performance.
(Left) Longer Chain-of-Thought path length shows a positive correlation with evaluation accuracy. (Right) Rejection sampling used to control the generated Chain-of-Thought path length shows negative scaling, where longer Chain-of-Thoughts lead to worse evaluation accuracy. (Image source: Muennighoff & Yang, et al. 2025)
Future Outlook
The exploration of test-time computation and Chain-of-Thought reasoning presents new opportunities for enhancing model capabilities. More interestingly, by thinking at test time, we are moving towards building future AI systems that can reflect the best practices of human thought, integrating adaptability, flexibility, critical reflection, and error correction capabilities.
The excitement about current progress motivates us to conduct more future research to improve and deepen our understanding—not just how we and our models think, but why we think.
Finally, I would like to call for more research on the following open questions regarding test-time computation and Chain-of-Thought reasoning.
Can we incentivize models to produce human-readable, faithful reasoning paths during reinforcement learning training while avoiding reward hacking?
How is reward hacking defined? Can we detect reward hacking during reinforcement learning training or inference without human intervention? How can we prevent "whack-a-mole" approaches to fixing reward hacking in reinforcement learning training?
Self-correction can occur internally within the Chain-of-Thought or be explicitly encouraged through multi-round reinforcement learning. When no ground truth answers are available, how can we train models to self-correct without generating hallucinations or experiencing performance degradation?
For highly contextualized, personalized, and difficult-to-score tasks (e.g., creative writing, advisory suggestions, brainstorming), how do we conduct reinforcement learning training with Chain-of-Thought expansion?
When deploying models in real-world scenarios, we cannot infinitely extend test-time thinking. How can we smoothly transfer performance improvements back to the base model to reduce inference time costs (e.g., through knowledge distillation)?
How can the use of test-time computational resources be made more adaptive based on the difficulty of the current problem?