Reasoning models often exhibit behaviors that look like self-reflection, but the question remains:
Can these behaviors truly drive effective exploration of new strategies?
In response, a team from Northwestern University, Google, and Google DeepMind questioned the relationship between traditional reinforcement learning and reflection, and proposed a Bayesian adaptive reinforcement learning method that, for the first time, explains why, how, and when models should reflect and explore new strategies.
By comparing models trained with traditional reinforcement learning and the new method, researchers found:
On a synthetic task in which the model must output three identical characters in a row within 3 steps, traditional RL tends to pursue a single path to the end, while the new method learns to rule out invalid hypotheses and switch to new strategies when appropriate.
Furthermore, in mathematical reasoning tasks, the new method achieved higher accuracy on most benchmarks and models, while requiring fewer tokens to solve problems.
More interestingly, the team discovered that the number of reflections is not the sole determinant of performance; base models often exhibit many futile reflections that bring no substantial information gain.
Details are elaborated below.
Bayesian Adaptive Reinforcement Learning Inspires Reflective Exploration
Intuitively, trial-and-error steps at test time are beneficial only when they bring information gain; yet during RL training, models are never explicitly told how much information gain trial-and-error and reflection actually provide.
In fact, existing reinforcement learning paradigms built on the Markov assumption have an inherent limitation: exploration happens only during training, and at deployment (test) time the agent typically just exploits the deterministic policy it has learned.
Moreover, the Markov assumption means that an RL agent makes decisions based solely on the current state; historical information (e.g., earlier trial-and-error and the accompanying reflective thoughts) influences the policy only through whatever is compressed into the current state representation.
The researchers point out that this traditional paradigm may let models score highly by memorizing training solutions without ever truly learning to reflect, and that in-context trial-and-error therefore provides the model with no information gain.
So, is reflective exploration during testing truly useful? How can effective reflective exploration strategies be learned?
To answer these questions, the researchers investigated a Bayesian adaptive RL framework, abbreviated as BARL, which differs from traditional RL.
Its core idea is to cast the LLM's reflective exploration as a Bayesian adaptive reinforcement learning problem: by explicitly modeling uncertainty about the environment, the model can explore adaptively during the reasoning process.
Simply put, BARL is no longer restricted to the Markov assumption of traditional RL. Instead, it accounts for uncertainty about the underlying MDP (e.g., how effective different strategies are for a given problem), which requires incorporating all historical observations, including reward feedback, into decision-making.
This framework naturally balances the exploitation of reward maximization and the exploration of information acquisition.
Specifically, in BARL, the team assumes that the model faces a task with unknown elements, which can be described by a set of hypothetical MDPs (Markov Decision Processes) that represent these uncertainties.
The model maintains a posterior probability (belief) for each hypothetical MDP, which is continuously updated throughout the reasoning process.
Whenever the model selects an action (e.g., generating the next thought step), it updates its belief about various hypotheses based on the observed results.
BARL's policy is not optimized for a single deterministic environment; instead, it directly maximizes the expected cumulative return under the posterior distribution. In other words, when making a decision the model weighs both "how much return does this action bring" and "how much does it reduce uncertainty."
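To make this concrete, here is a minimal sketch (in Python, not the paper's implementation) of the Bayes-adaptive loop described above: a posterior is maintained over a few hypothesized "strategies" standing in for the hypothetical MDPs, actions are chosen by their posterior-weighted expected return, and the posterior is updated from observed rewards. The hypothesis set, likelihoods, and value numbers are illustrative assumptions.

```python
# Minimal sketch of the Bayes-adaptive loop (illustrative only, not the paper's code).
# Each "hypothesis" stands in for one candidate MDP, reduced here to which action
# it considers correct and how likely that action is to be rewarded.
hypotheses = {
    "strategy_A": {"good_action": 0, "p_success": 0.9},
    "strategy_B": {"good_action": 1, "p_success": 0.9},
}
belief = {name: 0.5 for name in hypotheses}  # uniform prior over the hypotheses

def action_value(action, hyp):
    """Expected reward of `action` if hypothesis `hyp` described the true MDP."""
    return hyp["p_success"] if action == hyp["good_action"] else 0.05

def posterior_weighted_value(action):
    """Expected return under the current posterior: the quantity BARL's policy maximizes."""
    return sum(belief[name] * action_value(action, hyp) for name, hyp in hypotheses.items())

def update_belief(action, observed_reward):
    """Bayes rule: reweight each hypothesis by how well it predicted the observed reward."""
    for name, hyp in hypotheses.items():
        predicted = action_value(action, hyp)
        likelihood = predicted if observed_reward > 0 else (1.0 - predicted)
        belief[name] *= likelihood
    total = sum(belief.values())
    for name in belief:
        belief[name] /= total

# One decision step: act greedily w.r.t. the posterior-weighted value, observe the
# outcome, and update the belief; a failed attempt shifts probability mass away from
# the hypothesis that recommended it, which is what triggers a strategy switch.
action = max([0, 1], key=posterior_weighted_value)
update_belief(action, observed_reward=0.0)  # suppose the attempt earns no reward
print(belief)
```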
BARL explicitly incorporates testing performance into its optimization objective, encouraging the model to consider unknown situations by maximizing expected returns under the posterior.
The model learns that only active exploration can sustain high returns in unknown situations; reflection thus serves to acquire key information and avoid pursuing a single wrong path to the end.
In short, BARL makes the model realize:
Timely reflection and trying different approaches can lead to higher returns, and this is precisely the incentive from which reflective behavior emerges.
A New Reinforcement Learning Algorithm for Reasoning Models
The researchers derive the explicit mathematical form of BARL's decision rule for reasoning models; at its core is how to compute the expected value of candidate decisions under the posterior.
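The exact expression is given in the paper; as a rough schematic consistent with the description that follows (the symbols below are our own shorthand, not the paper's notation), the posterior-weighted value has the shape

```latex
Q(h_t, a) \;\approx\; \sum_{k=1}^{N} w_k \,\mathbb{E}\!\left[ R \mid y_k, h_t, a \right],
\qquad
w_k \;\propto\;
\underbrace{b(y_k \mid h_t)}_{\text{assessed quality of candidate } y_k}
\cdot
\underbrace{\exp\!\big(-\delta_k(h_t)\big)}_{\text{correction: observed vs. expected reward}}
```

where h_t is the reasoning history so far, y_1, ..., y_N are candidate answers, b is the model's belief about each candidate, and δ_k measures how far the rewards observed so far deviate from what candidate y_k predicted.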
This posterior-weighted value is a weighted sum of the expected returns of multiple candidate answers (e.g., the N answers in best-of-N). The weights combine the model's own assessment of each candidate's quality with a "correction term" that measures the deviation between the actually observed outcomes and the model's expectations.
The correction term acts as a reflection signal: if a strategy was initially highly favored by the model but the reward feedback turns out to be poor, the discrepancy quickly reduces the weight of that hypothesis, in effect telling the model "perhaps it is time to switch to a new idea." This answers when the model should reflect and explore.
Through this mechanism, BARL's decision formula guides the model at each step to determine whether reflection is needed and when to switch strategies.
This is the essence of BARL's reflective decision-making: the model weighs "continuing the current approach" against "trying a new approach" on the basis of the Bayesian posterior.
This update process encourages the model to combine and switch different reasoning strategies, much like connecting multiple possible problem-solving paths and quickly switching to another when one proves unworkable midway.
BARL achieves this automatically through end-to-end RL optimization, giving LLMs a principled guide to when and how to reflect during reasoning, effectively linearizing best-of-N sampling into a single long chain of thought (CoT).
Synthetic Task Case Study: A Clearer Comparison of RL and BARL
To intuitively demonstrate how BARL exhibits reflective exploration capabilities during testing, the authors designed a synthetic task: the model needs to output three consecutive identical characters (0/1/2) within 3 steps to receive a reward.
During training, the prompt character is always 0 or 1, and the model learns to output 000 or 111, respectively, to earn the reward; at test time, however, the prompt character switches to 2.
Intuitively, the deterministic strategy learned during training no longer works for the new character, so the model must explore the correct output pattern on the fly.
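As a concrete reference, below is one possible, highly simplified rendering of this task in Python; the hidden "rewarded character," the reward value, and the test-time step budget are illustrative assumptions, not details from the paper.

```python
class ThreeInARowTask:
    """Toy environment: reward 1 once the rewarded character appears three times in a row."""

    def __init__(self, prompt_char, rewarded_char, max_steps=3):
        self.prompt_char = prompt_char      # '0' or '1' during training, '2' at test time
        self.rewarded_char = rewarded_char  # hidden from the agent
        self.max_steps = max_steps
        self.history = []

    def step(self, char):
        """Emit one character; return (reward, done)."""
        self.history.append(char)
        success = self.history[-3:] == [self.rewarded_char] * 3
        done = success or len(self.history) >= self.max_steps
        return (1.0 if success else 0.0), done

# Training: the prompt reveals which character is rewarded, so "000"/"111" can be memorized.
train_env = ThreeInARowTask(prompt_char="0", rewarded_char="0")
# Test: the unseen prompt '2' appears; a memorized mapping no longer applies, and an agent
# given a larger step budget must try one hypothesis and switch if it earns no reward.
test_env = ThreeInARowTask(prompt_char="2", rewarded_char="2", max_steps=9)
```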
Two models challenged this task: one trained with traditional Markovian RL, and the other with the BARL method.
Markovian RL quickly maximized training accuracy, essentially memorizing the answers.
BARL also learned the correct output patterns during training, but more interestingly, it simultaneously learned to adjust its strategy based on uncertainty—a difference that only became apparent during testing.
The testing phase revealed vastly different behaviors: when the prompt changed to the new character 2, Markovian RL, which had only memorized the fixed outputs (000/111) during training, could not generalize and almost always answered incorrectly, with test accuracy close to zero.
The BARL agent, however, exhibited "reflection" capability. It would first attempt a certain strategy, and if the initial attempt did not yield a reward, it would quickly reflect and switch to try another possible sequence.
The figure below illustrates the decision-making differences between Markovian RL and BARL on this synthetic task: the Markovian policy blindly follows a single path to the end, while the BARL policy rules out invalid hypotheses and switches to new strategies when appropriate.
As shown in the left panel, the Markovian RL model quickly approached 100% accuracy during training but failed almost completely at test time. The BARL model in the middle panel not only improved training performance but also achieved markedly higher accuracy at test time.
Notably, the right panel shows that if BARL is given some prior knowledge about the task structure (e.g., "the rewarded pattern is one character repeated three times"), its convergence speed and final performance improve further.
This indicates that candidate strategies need both enough diversity to cover unknown situations and enough plausibility to avoid wasted exploration.
Mathematical Reasoning Task: Comprehensive Performance Improvement, Significant Token Savings
The researchers also applied BARL to mathematical reasoning with LLMs, comparing it against GRPO and a "Progress" reward baseline (step-level rewards based on the probability of reaching the correct answer).
BARL achieved higher accuracy on most benchmarks and models.
Moreover, BARL showed a clear efficiency advantage.
The authors measured the number of tokens each method consumed to solve problems; BARL generated markedly shorter outputs while achieving comparable or higher accuracy.
In other words, the BARL model does not pay a verbosity cost for reflecting multiple times; each reflection is more targeted and effective.
The authors also observed another interesting phenomenon: the number of reflections itself is not the sole determinant of performance.
Base models often exhibit many futile reflections that do not bring substantial information gain. In contrast, BARL's reflective behavior is more "purposeful."
The researchers computed a Bayesian value for each step of the model's chain of thought. Simply put, this is a score that jointly measures how much a step contributes to the final solution and how much information gain it brings.
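As a rough illustration of such a score (our own toy definition, not the paper's exact metric), the per-step value could combine the step's expected contribution to the reward with the information gain it produces, measured as the reduction in entropy of the model's belief over hypotheses:

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete belief distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def step_value(expected_reward, belief_before, belief_after):
    """Toy per-step score: expected contribution to the final reward plus the
    information gain of the step (entropy reduction of the belief over hypotheses)."""
    info_gain = entropy(belief_before) - entropy(belief_after)
    return expected_reward + info_gain

# A step that rules out one of two equally likely hypotheses gets credit for its
# information gain even if its immediate expected reward is zero.
print(step_value(0.0, belief_before=[0.5, 0.5], belief_after=[0.95, 0.05]))  # ~0.49
```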
The results showed that the Bayesian value of BARL's actions was consistently and significantly higher than that of traditional RL models: the steps it chose either helped solve the problem (high reward) or probed new possible paths (high information gain), rather than being wasted blindly.
In contrast, although base models sometimes output a great deal of self-checking content, these "reflection" steps were judged to be of low value because they failed to use information effectively and often remained superficial.
The authors also trained a length-constrained GRPO variant, artificially limiting the solution process to at most 32 tokens so that the model tends to skip elaborate reasoning and give the final answer directly.
Its training accuracy eventually converged to roughly that of standard GRPO, while the generated solutions became shorter and shorter, nearly degenerating into memorized answers.
In other words, Markovian RL can indeed reach optimality during training by sacrificing the thought process, but such a strategy hits a wall on new problems at test time. This further confirms that traditional RL can neither explain the benefit of reflective exploration nor incentivize the emergence of self-reflection.
Finally, the researchers have released the training code and paper.
The paper's first author, Shenao Zhang, is a second-year Ph.D. student at Northwestern University. His research focuses on large language models and reinforcement learning, particularly LLM alignment, reasoning, and agents, and aims to build intelligent systems that can actively acquire information and self-improve to achieve superhuman intelligence.
Training Code: https://github.com/shenao-zhang/BARL
Paper: https://arxiv.org/abs/2505.20561
— End —