LLM + RL Questioned: Deliberately Using Incorrect Rewards Still Significantly Boosts Math Benchmarks, Causing a Stir in the AI Community

Content from: Synced

Editors: Zenan

What have we been training for so long?

This is the most 'absurd' paper of the year.

With the release of this paper, the value of combining Large Language Models (LLMs) with Reinforcement Learning (RL) is being called into question.

This past Tuesday, a paper from the University of Washington, Allen Institute for AI, and Berkeley created a stir in the AI community.


Paper: https://github.com/ruixin31/Rethink_RLVR/blob/main/paper/rethink-rlvr.pdf

Project Link: https://github.com/ruixin31/Rethink_RLVR/tree/main

The authors refute the prevailing reinforcement learning methods in the large model domain, having discovered that:

Training the Qwen2.5-Math-7B model with spurious rewards can also improve its MATH-500 scores; with random rewards, the score increased by 21%, and with incorrect rewards, it improved by 25% (while true rewards led to a 28.8% increase).

What is going on? Are large model training techniques truly effective? The authors of this work have introduced their findings in a blog post:

Challenging Conventional Views on RLVR

Recently, Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard method for enhancing the reasoning capabilities of Large Language Models (LLMs). The conventional view holds that high-quality supervisory signals are crucial for effective RLVR training. Recent research challenges this assumption, showing that RLVR on a single sample, or even without supervision, can still yield significant gains on Qwen-Math models.
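For concreteness, here is a minimal sketch of what a standard "verifiable" reward looks like in this setting: reward 1 if the final \boxed{} answer matches the ground truth, else 0. The helper below is an illustrative simplification, not the paper's implementation.

```python
import re

def extract_boxed_answer(response: str):
    """Return the content of the last \\boxed{...} in the response, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None

def verifiable_reward(response: str, ground_truth: str) -> float:
    """Standard RLVR reward: 1 if the extracted answer matches the ground truth."""
    return 1.0 if extract_boxed_answer(response) == ground_truth else 0.0
```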

However, we are compelled to ask: Where do the training signals in single-sample or unsupervised RLVR come from? What are the minimum requirements for a reward to provide meaningful RLVR training signals?

Our findings are startling.

Spurious Rewards, Even Random or Incorrect Ones, Significantly Improve Qwen-Math Performance

We found that RLVR can significantly improve mathematical reasoning capabilities through what we call 'spurious rewards'—signals that provide minimal or even misleading guidance.

Here are some interesting rewards we experimented with:

  • Format Reward: A reward granted solely for answers that contain the \boxed{} expression. Since this format is also specified in the prompts given to the model, this reward essentially measures 'prompt adherence'.

  • Random Reward: Completely arbitrary feedback; literally, 1 if random.random() < rate else 0

  • Incorrect Reward: Deliberately incorrect supervisory signals. The incorrect but plausible labels are obtained as follows (a code sketch follows this list):

  1. Sort the model's rollouts by frequency

  2. Take the most common answer

  3. Discard the sample if the answer is correct

  4. Train on the subset where the model's most common answer is incorrect, and use that specific answer as the training label.
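Below is a minimal sketch of the three spurious rewards described above, under illustrative assumptions; the helper names and data structures are hypothetical, not the paper's code.

```python
import random
from collections import Counter

def format_reward(response: str) -> float:
    """Reward 1 for any response containing a \\boxed{} expression, regardless of correctness."""
    return 1.0 if "\\boxed{" in response else 0.0

def random_reward(response: str, rate: float = 0.5) -> float:
    """Completely arbitrary feedback, independent of the response."""
    return 1.0 if random.random() < rate else 0.0

def build_incorrect_labels(rollout_answers, ground_truth):
    """Keep the model's most common rollout answer as the training label only when
    that answer is wrong; otherwise drop the problem.
    rollout_answers: dict problem_id -> list of answers; ground_truth: dict problem_id -> answer."""
    labels = {}
    for problem, answers in rollout_answers.items():
        most_common, _ = Counter(answers).most_common(1)[0]
        if most_common != ground_truth[problem]:
            labels[problem] = most_common
    return labels

def incorrect_reward(response_answer: str, incorrect_label: str) -> float:
    """Reward agreement with the deliberately incorrect label."""
    return 1.0 if response_answer == incorrect_label else 0.0
```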

We also compared with other weak rewards studied in the literature:

  • Majority-Vote Reward: Using the majority-voted answer as the label.

  • One-Shot Reinforcement Learning: Performing standard Reinforcement Learning with Verifiable Rewards (RLVR) on a single sample.
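For comparison, a minimal sketch of a majority-vote (TTRL-style) reward, again with hypothetical names: the label is simply the most common answer among the model's own rollouts for the same question.

```python
from collections import Counter

def majority_vote_reward(rollout_answers, response_answer: str) -> float:
    """Reward agreement with the majority answer across the model's own rollouts."""
    majority_answer, _ = Counter(rollout_answers).most_common(1)[0]
    return 1.0 if response_answer == majority_answer else 0.0
```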


MATH-500 accuracy after 150 steps of RLVR training on different training signals. Even 'spurious rewards' can lead to significant MATH-500 improvements on Qwen models. Note that these reward signals do not transfer to other models such as Llama3 and OLMo2, whose reasoning priors differ.

Starting with the Qwen2.5-Math-7B model, which is widely used in the AI community for reinforcement learning, we achieved performance improvements comparable to models supervised with ground truth values across various mathematical reasoning benchmarks.

This finding directly challenges the current understanding of the role reinforcement learning plays in enhancing AI reasoning capabilities.

A Twist: Spurious Rewards Are Not Effective for All Models

When we extended our experiments to other model families not specifically optimized for mathematical reasoning (including Qwen2.5-Base, Olmo2, and Llama3 variants), we observed some interesting phenomena:

Unlike Qwen-Math, the other models showed very limited gains from 'spurious rewards'.

(We primarily discuss performance on MATH-500; for more results concerning AMC, AIME 2024, and especially the AIME 2025 test set, which is beyond the training data cutoff, please refer to the full paper.)

  • First, a sanity check with ground-truth labels: they improved the performance of all models. When running plain GRPO with true labels, we observed gains across all model families, with Qwen and Qwen-Math improving more than the Llama and OLMo models.

  • What about majority voting? Previous research has proposed methods to improve model consistency. We found that this indeed benefits most models but not OLMo.

  • What if we only reward responses that contain \boxed{}? Merely training the model to produce parseable output led to a significant performance boost for Qwen models, an absolute improvement of up to 49.9% for Qwen2.5-1.5B. However, this reward hurt Llama3.2-3B-Instruct and OLMo2-SFT-7B, reducing their performance by 7.3% and 5.3%, respectively. Interestingly, performance began to decline gradually after peaking; we hypothesize this is because the model had already 'learned' the format, and further training provided no additional information.

  • Incorrect rewards—this is where things get interesting. We found that they still significantly improved the performance of Qwen models but had no effect on Llama models and harmed OLMo-Base and OLMo-SFT models.

  • Finally, what if we don't observe the model itself and simply assign random rewards of 0 or 1? Is this still effective? You guessed it, it works for Qwen models, but not for others.

Note that random rewards did not work for Qwen2.5-1.5B and only started to take effect on Qwen2.5-7B after roughly 120 steps. Based on this observation, we trained for longer (300 steps) and found that these runs converged to a lower level than runs trained with signal-bearing rewards.

This architecture-dependent behavior suggests that RLVR's effectiveness relies more on pre-existing model capabilities than on the quality of supervisory signals.

Practical Warnings for Future Work

The Qwen model, with its open-source weights and high performance on reasoning tasks, has become the de facto choice for RLVR research in the open-source community—a recent series of RLVR studies have drawn conclusions based on Qwen-centric experiments (see the original paper for a list).

We examined two recent studies in which weakly supervised RLVR performs well on Qwen models, and found that their conclusions do not generalize to other model families.

  • Test-Time Reinforcement Learning: This paper proposes performing RLVR on test samples and uses majority-voted answers under an on-policy setting to calculate rewards.

  • One-Shot Reinforcement Learning: This paper showed that RLVR with just one sample can achieve performance comparable to standard RLVR on training datasets.

We evaluated two recently proposed weakly supervised RL methods, TTRL and One-Shot RL, across multiple base models. We found that the proposed training rewards consistently worked on Qwen models. However, with few exceptions, these same signals generally did not yield benefits on other model families, echoing the limited generalization we observed when training with spurious rewards.

Therefore, we suggest that future RLVR research should be validated on other models.

What Makes RLVR with Spurious Rewards Effective?

Now, you might be curious—why does this happen? Why are all these spurious rewards effective on the Qwen-Math model? Where exactly is the magic?

Overall, we hypothesize that the differences in RLVR training outcomes are due to the distinct reasoning strategies learned by each model during pre-training. Specifically, some strategies might be easily elicited by RLVR, while others might be harder to manifest or simply not exist.

We identified one such pre-existing strategy: generating code to aid mathematical reasoning, which Qwen-Math effectively leverages, whereas other model families utilize it less. We investigate code reasoning as an illuminating case study, though this is not a complete explanation: we observed other behaviors that are also easily elicited and often correlated with performance, such as 'non-repetition'. See the paper for more details.

An Illuminating Case Study: Code Reasoning

Through careful analysis, we discovered a crucial insight: even before RLVR training, Qwen-Math generated Python code to solve math problems 65.0% of the time. More strikingly, it often produced correct code output and the correct answer to the problem without a code executor.

However, this frequent and high-quality code reasoning ability is not present in other models.
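How one might measure this 'code reasoning frequency' in practice, as a rough sketch; the detection heuristic below is our assumption, not the paper's exact classifier.

```python
def uses_code_reasoning(response: str) -> bool:
    """Crude heuristic: does the response contain a Python code block or typical Python syntax?"""
    markers = ("```python", "def ", "print(", "import ")
    return any(m in response for m in markers)

def code_reasoning_frequency(responses) -> float:
    """Fraction of responses that rely on code reasoning."""
    responses = list(responses)
    return sum(uses_code_reasoning(r) for r in responses) / max(len(responses), 1)
```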


Below is an example in which Qwen2.5-Math-7B accurately evaluates an expression to 15 decimal places, one more than an iPhone calculator.


Example of Qwen2.5-Math-7B's code reasoning response. The problem was randomly selected from the MATH-500 test set. Note that both the code and its execution result were autoregressively generated by Qwen2.5-Math-7B. No external code interpreter was provided to the model.

After applying RLVR, regardless of reward quality, the frequency of this code reasoning increased to over 90% on average.

This shift in reasoning strategy—rather than the acquisition of new reasoning skills—appears to be the driver of performance improvements. Qwen models learned to use more code reasoning through RLVR training. The transition from linguistic reasoning to code reasoning effectively boosted performance.


For the Qwen-Math and Qwen models, code frequency is highly correlated with performance: more code, more correct answers, and vice versa. However, in models that can generate code but not high-quality code (e.g., OLMo2-7B-SFT), the correlation is reversed.

Fine-grained Accuracy Tracking — How much do we benefit simply from choosing the right reasoning strategy?

More interestingly, we tracked problems where the reasoning strategy changed before and after RLVR and analyzed where the performance gains originated. We found:

  • Spurious rewards were more aggressive in converting model behavior to code reasoning and rarely transformed initially code-based reasoning into natural language reasoning. Impressively, it appears that RLVR based on spurious rewards made the correct choice—for problems that switched from natural language reasoning to code reasoning, performance sharply increased by about 55%. On the other hand, true label rewards boosted natural language reasoning performance by 60.2%! The flowchart below contains more detailed explanations.


We further quantified the contribution of each strategy transition behavior to each model's performance gain. It's really cool to see this: if a model excels at code reasoning (code accuracy > language accuracy), RLVR's gain primarily comes from the transition from language to code reasoning; if a model is not proficient in code reasoning (code accuracy < language accuracy), RLVR's gain primarily comes from the transition from code to language reasoning.
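A sketch of how such a breakdown can be computed (field names are hypothetical): bucket each problem by its strategy transition before vs. after RLVR and measure the accuracy change within each bucket.

```python
from collections import defaultdict

def transition_breakdown(records):
    """records: iterable of dicts with keys
    'strategy_before' / 'strategy_after' ('code' or 'language') and
    'correct_before' / 'correct_after' (bools)."""
    buckets = defaultdict(lambda: {"n": 0, "gain": 0})
    for r in records:
        key = (r["strategy_before"], r["strategy_after"])  # e.g. ('language', 'code')
        buckets[key]["n"] += 1
        buckets[key]["gain"] += int(r["correct_after"]) - int(r["correct_before"])
    total = sum(b["n"] for b in buckets.values())
    return {key: {"share": b["n"] / total,           # fraction of problems in this bucket
                  "avg_gain": b["gain"] / b["n"]}    # mean accuracy change within the bucket
            for key, b in buckets.items()}
```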


Partial contribution to overall performance gain, averaged over rewards that successfully guided the model's reasoning strategy.

Based on these strong correlations we initially observed, we hypothesize that code reasoning is one of the reasoning behaviors in Qwen models that leads to good mathematical performance.

To validate our hypothesis, we explicitly constrained models to generate code reasoning through prompting and reinforcement learning. We observed a strong correlation between the frequency of code reasoning and benchmark performance across all tested models. (The direction of correlation depends on the specific model's code quality).

Inducing Code Reasoning via Prompting

We simply prompted the model to begin its response with 'Let's solve this using Python.' This simple approach significantly improved the performance of the Qwen-Math models but decreased the performance of the Llama and OLMo models.
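A minimal sketch of this prompting intervention; the model name, plain-prompt format, and decoding settings are assumptions for illustration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-Math-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

problem = "What is the sum of the first 100 positive integers?"
# Force the code-reasoning strategy by fixing the start of the response.
prompt = f"{problem}\nLet's solve this using Python.\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```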

Inducing Code Reasoning via Reinforcement Learning (RL)

Given the success of the prompting experiment, we designed an additional spurious reward: granting a reward whenever the model's response contained the string 'python'. This greatly encouraged all models to use code reasoning (after 50 training steps, over 99% of responses contained code).
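The corresponding reward is trivial to write down; a sketch:

```python
def python_string_reward(response: str) -> float:
    """Reward 1 whenever the response contains the literal string 'python', else 0."""
    return 1.0 if "python" in response else 0.0
```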


The chart shows a similar trend, and the effect is even more pronounced when reinforcement learning is used to push models toward Python code: the performance of the Qwen-Math and Qwen2.5-7B models improved, while that of the other models decreased.

But Why Random?

We were perplexed when we saw the training curve climb with rewards generated by random.random() < 0.5. How can completely meaningless, uninformed rewards genuinely facilitate model learning?

This paradox led us to search for AI's analogue of the 'London dispersion force', the way electrically neutral atoms still mysteriously attract each other. After digging into GRPO, we found that the clipping term might be the key. We ran an ablation study on the clipping factor using three methods:

  • (a) Directly disable clipping in the loss calculation.

  • (b) Adjust the batch sizes for training and inference to align the inference model with the policy.

  • (c) Reduce the inference batch size to maintain equivalent conditions.

Methods (b) and (c) ensure only one gradient update per inference step, naturally avoiding clipping constraints.
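For reference, here is a minimal sketch of the PPO-style clipped surrogate at the heart of GRPO; epsilon and tensor shapes are illustrative assumptions, not the paper's exact code. Method (a) corresponds to dropping the clamp below.

```python
import torch

def grpo_clipped_loss(logp_new: torch.Tensor,
                      logp_old: torch.Tensor,
                      advantages: torch.Tensor,
                      epsilon: float = 0.2) -> torch.Tensor:
    """Clipped policy-gradient surrogate: the clamp caps how far each update can
    move the policy away from the rollout (old) policy."""
    ratio = torch.exp(logp_new - logp_old)                       # per-token policy ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return -torch.mean(torch.min(unclipped, clipped))            # maximize the surrogate
```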


Performance and code-reasoning frequency during an ablation of the GRPO clipping term for Qwen2.5-Math-7B. Training with random rewards and clipping enabled increases code reasoning and improves performance.

With standard GRPO clipping, random rewards led to roughly a 21% performance improvement for Qwen2.5-Math-7B and increased the frequency of code reasoning. However, when we eliminated the clipping effect using any of the three methods above, random rewards yielded no improvement.

We speculate this is due to a bias inherent in the GRPO formulation itself. Under the effect of clipping, random rewards do not teach the model anything about task quality; instead, they trigger a concentration effect, causing the model to focus on its existing distribution of reasoning patterns. When clipping is disabled, this concentration mechanism disappears entirely.

Implications and Future Work

  • Spurious Rewards Work by Amplifying Existing Capabilities: RLVR with spurious rewards can act as a mechanism that amplifies and surfaces useful reasoning representations learned during pre-training. When new RLVR methods are proposed, their benefits should be examined to determine whether they go beyond surfacing such pre-existing patterns, so the true extent of learning can be assessed.

  • Test Claims About RL Methods on More Model Families: Different model families possess different pre-existing capabilities, so future RLVR research should be validated on diverse models rather than relying on a single 'de facto standard' choice. As we have shown, even entirely spurious reward signals can easily produce significant gains on Qwen models.

  • Understand Your Model First: We should be more aware that reasoning patterns learned during pre-training significantly influence downstream RLVR training behavior—both when designing pre-training methods and when using pre-trained models for RLVR.

References:

https://rethink-rlvr.notion.site/Spurious-Rewards-Rethinking-Training-Signals-in-RLVR-1f4df34dac1880948858f95aeb88872f
