In the evolution of large language models, Reinforcement Learning with Human Feedback (RLHF) is undoubtedly one of the most significant paradigms: it transformed models from "mechanical conversationalists" into "mirrors of human preference." However, RLHF also has a fatal flaw—it doesn't truly require models to reason. Consequently, we often see models provide answers that are "plausible but superficial," appearing satisfactory on the surface but lacking logical depth.
On the other hand, Reinforcement Learning with Verifiable Rewards (RLVR), which emerged in recent years, has demonstrated astonishing power in verifiable tasks such as mathematics and code. It requires models to first write out an explicit reasoning trajectory, and then uses rules to determine the correctness of the answer. This allows models to excel at "problem-solving" but makes it difficult to extend to open-ended tasks, as there is no single "right or wrong" standard in these scenarios.
So, can we combine the "spirit" of RLHF with the "form" of RLVR? Can we enable models to think explicitly and generate responses that align with human preferences?
The latest paper from Princeton's Danqi Chen group provides an answer: Reinforcement Learning with Model-rewarded Thinking (RLMT). It compels models to "write down long-chain reasoning" before answering, and then uses a preference reward model to evaluate the final answer.
Experimental results show that an 8B model, powered by RLMT, can approach or even surpass GPT-4o and Claude-3.7 Sonnet in chat and creative tasks.
Paper Title:
Language Models that Think, Chat Better
Paper Link:
https://arxiv.org/pdf/2509.20357
Code Link:
https://github.com/princeton-pli/RLMT
This is not just a technical breakthrough but a paradigm shift. Below, we will follow the paper's main logical thread to progressively dissect RLMT's core ideas and experimental findings.
The Form and Spirit of RLMT
If RLHF is seen as a "mirror of human preference" and RLVR as a "steel ruler for verifiable reasoning," then RLMT attempts to unite both: requiring the model to think explicitly while generating answers that meet human expectations.
In RLMT, the model is compelled to first write a thought trajectory z, and then produce the final answer y. Unlike RLVR, which uses strict validators to determine "right or wrong," the evaluator here is a preference reward model r. Thus, the training objective becomes:
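(Sketched here in standard notation; the paper's exact formulation may include additional regularization terms. $x$ is a prompt drawn from the training distribution $\mathcal{D}$, $z$ the thinking trajectory, $y$ the final answer, and $r_\phi$ the preference reward model.)

$$
J_{\text{RLMT}}(\theta) \;=\; \mathbb{E}_{x \sim \mathcal{D},\; (z,\,y) \sim \pi_\theta(\cdot \mid x)}\big[\, r_\phi(x, y) \,\big]
$$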
For better understanding, let's review the two "parent approaches":
RLHF Objective Function:
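(Again a standard-form sketch: the same preference reward $r_\phi$ is applied to a directly generated answer, typically with a KL penalty of strength $\beta$ against a reference policy $\pi_{\text{ref}}$.)

$$
J_{\text{RLHF}}(\theta) \;=\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r_\phi(x, y) \,\big] \;-\; \beta\, \mathbb{D}_{\text{KL}}\!\big[\pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)\big]
$$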
RLVR Objective Function:
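(Standard-form sketch: the model also emits a thinking trajectory $z$, but the reward is a rule-based verifier $R$, e.g. $1$ if the final answer matches the reference answer and $0$ otherwise.)

$$
J_{\text{RLVR}}(\theta) \;=\; \mathbb{E}_{x \sim \mathcal{D},\; (z,\,y) \sim \pi_\theta(\cdot \mid x)}\big[\, R(x, y) \,\big]
$$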
In comparison, RLMT continues RLVR's "think first, then answer" generation method, but the ultimate reward mechanism is not a rigid right-or-wrong criterion, but an RLHF-style human preference model. This forces the model to generate a reasoning chain while remaining flexible in open-domain scenarios.
Figure 1 illustrates the structural differences among the three: RLHF directly uses preference rewards, RLVR emphasizes strict verification, and RLMT combines "explicit thinking" with "preference scoring."
▲ Figure 1. The RLMT framework combines RLVR's explicit thinking process with RLHF's preference reward mechanism.
Figure 2 provides an example of RLMT: when faced with an open-ended question, the model first writes a checklist or a draft plan, and then generates the final answer.
▲ Figure 2. RLMT enables models to explicitly generate reasoning trajectories before answering, transforming thinking style from checklist to iterative revision.
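To make the mechanics concrete, here is a minimal sketch (not the authors' code) of what a single RLMT rollout could look like: the policy emits a thinking trace wrapped in explicit tags, followed by the answer, and only the final answer is scored by the preference reward model. The `policy.generate` and `reward_model.score` calls, as well as the `<think>` tag format, are illustrative assumptions.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>\s*(.*)", re.DOTALL)

def rlmt_rollout(policy, reward_model, prompt):
    """One RLMT rollout: the policy first writes a thinking trace z,
    then a final answer y; only y (given the prompt) is scored by the
    preference reward model."""
    completion = policy.generate(prompt)          # hypothetical API
    match = THINK_RE.match(completion)
    if match is None:
        # Malformed output with no explicit thinking block: a common
        # choice is to assign zero reward to enforce the format.
        return completion, None, 0.0
    thinking, answer = match.group(1), match.group(2)
    reward = reward_model.score(prompt, answer)   # hypothetical API
    return answer, thinking, reward
```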
Decomposition of Effective Components
The paper's ablation studies indicate that RLMT's success is not due to a single innovation but a superposition of multiple factors:
The strength of the reward model is a critical foundation. The authors used the Skywork series of reward models and found that RLMT performed significantly better when the reward model was stronger; conversely, a weaker reward model led to an overall performance decline.
Prompt distribution is more important than data scale. Compared to accumulating large-scale instruction data, selecting a WildChat-IF subset (approximately 7.5k samples) that is closer to real chat contexts yielded more stable benefits.
Algorithm choice is not the sole factor. RLMT operated effectively under three optimizers—GRPO, PPO, and DPO—with GRPO achieving the best results, but the overall differences were not decisive.
These factors collectively ensure that RLMT not only "looks reasonable" mathematically but also "runs smoothly" in engineering practice.
From Validation to Breakthrough
Is explicit thinking truly useful?
The paper's first question is: if models are forced to "think first and then answer," is there any benefit?
The answer is in the upper part of Table 1. For the same 8B model, RLMT outperformed RLHF by 1.5–4 points on almost all open-domain benchmarks. The improvements were most pronounced on WildBench and AlpacaEval2. This proves that "explicit thinking" is not a burden but an aid.
▲ Table 1. In the upper part, RLMT significantly outperforms RLHF on the WB, AE2, and CWv3 tasks.
From "Small Models" to "Big Opponents"
Table 2 compares the RLMT 8B model with GPT-4o and Claude-3.7 Sonnet. On WB and AE2, 8B-RLMT not only surpassed GPT-4o but even edged past Claude. Although a gap remained on AH2 and CWv3, its overall average of 54.1 was higher than GPT-4o's 53.2.
This indicates that RLMT for the first time enables small models to "compete" with flagship commercial models.
▲ Table 2. 8B-RLMT surpasses GPT-4o and Claude on some tasks.
Mathematical Logic ≠ General Reasoning
Figure 3 reveals that RLVR models trained exclusively in the mathematical domain are almost ineffective when transferred to open-domain tasks, whereas RLMT maintains stable performance on tasks like WildBench.
The logic is clear: reasoning chains need to be coupled with appropriate reward signals. Purely verifiable "right or wrong" cannot be generalized to open-ended scenarios.
▲ Figure 3. Mathematical domain RLVR models perform poorly on WildBench, while RLMT maintains its advantage.
What if SFT is skipped entirely?
The bottom part of Table 1 provides the answer: Zero-RLMT.
On Qwen-2.5-7B, Zero-RLMT achieved an average score of 36.4, surpassing Instruct's 35.0.
On Llama-3.1-8B, the overall score was slightly lower (28.7 vs. 30.8), but Zero-RLMT led on chat capability (AvgChat) by 5.5 points.
This shows that RLMT does not hinge on extensive SFT: it can still work effectively even when started directly from a base model.
▲ Table 1. In the lower part, Zero-RLMT surpasses Instruct across the board on Qwen and shows stronger chat capability on Llama.
Algorithm Choice is Just a Detail
Table 3 indicates that whether DPO, PPO, or GRPO is used, RLMT consistently outperforms RLHF. The difference is that GRPO is optimal, scoring 1–3 points higher than PPO and about 5 points higher than DPO. However, the core gain comes from "explicit thinking + preference rewards," not the specific optimizer.
▲ Table 3. GRPO performs best, but RLMT is valid across different optimizers.
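As a rough illustration of the group-relative scoring at GRPO's core (the full training loop also involves a clipped policy-gradient objective and usually a KL penalty, both omitted here), the sketch below computes advantages for a group of answers sampled from the same prompt and scored by a preference reward model; the reward values are made up.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as used by GRPO: each completion in a
    group sampled from the same prompt is compared against the group's
    mean reward, normalized by the group's standard deviation."""
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards) or 1.0   # guard against zero std
    return [(r - mean_r) / std_r for r in rewards]

# Example: four sampled answers for one prompt, scored by the preference RM.
group_rewards = [0.62, 0.48, 0.71, 0.55]
print(grpo_advantages(group_rewards))
```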
Ablation Study: Verifying Which Factors Are Truly Key
In the methods section, the authors proposed an "effective components" hypothesis: the strength of the reward model, the quality of the training prompt distribution, and the source of the warm-start data might determine the final performance.
Table 4's ablation study verifies this from three angles:
Prompt mixture: The results show that the WildChat-IF subset performs best, improving performance more than UltraFeedback or random mixtures. This corroborates the earlier point: "fit" of the training distribution is more crucial than data scale.
Warm-start source: Here, the authors did not use Gemini-2.5 but instead warmed up with SFT data generated by GPT-4.1-mini. The results show that even with GPT-4.1-mini, RLMT still trains successfully and follows trends similar to the original setup. This indicates that the warm-start source is not a decisive factor.
Reward model strength: Skywork-V2 significantly outperforms V1 and ArmoRM. A stronger reward model not only boosts chat task scores but also reduces performance degradation on non-chat tasks.
In other words, Table 4 provides an empirical validation of the "decomposition of effective components" mentioned earlier: the reward model and prompt distribution are the most important levers, while warm-start source and optimization algorithms are merely details.
▲ Table 4. Ablation study confirms that the reward model and prompt distribution are the true levers for RLMT.
Evolution of Thinking Style
Figure 4 shows that the RLMT model gradually learns a reasoning style of "setting constraints—segmenting topics—iterative revision," rather than a checklist-style enumeration.
▲ Figure 4. RLMT shifts the model's thinking style from "linear checklist" to "iterative planning + revision."
Figure 5 reveals that as training progresses, the lengths of the thinking trace and the answer grow in tandem. This is not padding; rather, the reasoning chain is gradually solidifying into a habit.
▲ Figure 5. During RLMT training, thinking and answering lengths grow synchronously, reflecting a more systematic reasoning habit.
From Formula to Style: What has RLMT truly changed?
The value of RLMT is not reflected only in score improvements. What it truly changes is the model at two levels: the level of the objective formula and the level of generation style.
At the formulaic level, RLMT cleverly unifies RLHF's human preference rewards with RLVR's explicit thinking trajectories into a single objective function. This means "logic" and "preference" are no longer separate but bound together in one training process.
At the stylistic level, RLMT reshapes the model's generation habits. Figures 4 and 5 in the experiments clearly show: the model evolves from a checklist-style straightforward enumeration to an iterative planning approach more akin to human thought. It no longer settles for "writing down a few key points" but learns to "set constraints—segment topics—continuously revise."
Therefore, RLMT's contribution is not just a "score-boosting trick": it plants the seed of "greater wisdom" in small models.
From "Mirror" and "Steel Ruler" to "The Third Path"
The introduction of RLMT not only continues the "spirit" of RLHF and the "form" of RLVR but also opens up a "third path." It addresses a long-standing dilemma: how to make models both logically coherent and human-pleasing.
The potential of this new path lies in at least two directions:
Refinement of Reward Models: With the emergence of stronger preference models, the effectiveness of RLMT will continue to increase.
Multimodality and Tool Use: If future RLMT-driven thinking goes beyond text to include images, code execution, and search planning, it could truly become a "general reasoning foundation."
At a time when RLHF is reaching its limits and RLVR remains confined within its boundaries, RLMT shows us a new possibility: through this post-training paradigm, small models can approach or even match the strongest commercial models.
This is not just an experimental breakthrough but a paradigm shift. From "mirror" and "steel ruler" to "the third path," RLMT may well be a crucial node on the road to more general intelligence.