The MLNLP community is a well-known machine learning and natural language processing community in China and abroad, with an audience of NLP master's and doctoral students, university professors, and industry researchers. The community's vision is to promote communication and progress among academia, industry, and enthusiasts in natural language processing and machine learning, at home and abroad, especially the progress of beginners.
Source | Qubit
Authors | Meng Chen, Luyu
RLIF (Reinforcement Learning from Internal Feedback), a new reinforcement learning paradigm for large models that replicates DeepSeek-R1-style long chain-of-thought reasoning, has become a hot topic.
Xuandong Zhao, co-first author from the UC Berkeley team, summarizes the result this way:
Large models can learn complex reasoning by optimizing their own confidence, without ever needing access to ground-truth answers.
Specifically, the new method requires no external reward signals or labeled data, using only the model's own confidence as an intrinsic reward signal.
Compared with GRPO, which relies on external reward signals, the new method improves base models' performance on mathematical tasks without needing standard answers, and performs even better on code tasks.
At almost the same time, another paper, "RENT: Reinforcement Learning via Entropy Minimization," reached similar conclusions.
The authors note that the main difference between the two lies in how confidence is measured: one uses KL divergence, the other entropy minimization.
After seeing this, the VP of Engineering at Dropbox commented: "Confidence is all you need."
Confidence-Driven Reinforcement Learning
For a long time, training large models primarily relied on two methods:
Either large amounts of manual annotation (e.g., ChatGPT's RLHF) or verifiable standard answers (e.g., DeepSeek's RLVR).
The former is costly and may introduce bias, while the latter is limited to domains with clear answers, such as mathematics and programming.
So, as AI gradually approaches or even surpasses human capabilities, can models rely solely on intrinsic signals they generate themselves and break free from dependence on external supervision?
To address this question, the UC Berkeley team proposed a new training method, INTUITOR, which uses the KL divergence between a uniform distribution and the model's predicted next-token distribution as its "confidence level" (self-certainty).
This is analogous to how a person solving a problem thinks more clearly when confident in the answer, whereas low confidence often signals the need to rethink.
By optimizing this intrinsic signal, INTUITOR encourages the model to generate answers it is "more confident" about, and also promotes the generation of more structured reasoning processes.
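For intuition, here is a minimal sketch of how such a confidence signal could be computed, assuming self-certainty is the per-token KL divergence from a uniform distribution to the model's next-token distribution, averaged over the answer; the function names, tensor shapes, and the entropy variant shown for comparison are illustrative, not the authors' released code.

```python
# Minimal sketch of confidence-as-reward, assuming a causal LM that exposes
# next-token logits of shape [seq_len, vocab_size] for the generated answer.
import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    """Average per-token KL(U || p): how far predictions are from uniform.

    Higher values mean sharper, more "confident" predictions.
    """
    log_p = F.log_softmax(logits, dim=-1)              # log p(v | prompt, prefix)
    vocab_size = logits.size(-1)
    # KL(U || p) = sum_v (1/V) * (log(1/V) - log p(v)) = -log V - mean_v log p(v)
    kl_per_token = -torch.log(torch.tensor(float(vocab_size))) - log_p.mean(dim=-1)
    return kl_per_token.mean()

def negative_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Entropy-based confidence, in the spirit of the RENT paper mentioned
    above: maximizing this is equivalent to minimizing predictive entropy."""
    log_p = F.log_softmax(logits, dim=-1)
    p = log_p.exp()
    return (p * log_p).sum(dim=-1).mean()              # mean_i sum_v p(v) log p(v)
```

In this picture, the scalar returned for a sampled answer serves as that answer's reward, with no external grader involved.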
In experiments, small models of 1.5B and 3B also showed long chain-of-thought reasoning behavior similar to DeepSeek-R1.
The paper also points out that intrinsic reward signals provide an additional benefit: they mechanistically reduce the risk of "reward hacking."
Traditional reinforcement learning with external reward signals is prone to "gaming the system," such as models generating syntactically correct but logically flawed code to match test cases, or directly memorizing answers in math problems instead of reasoning.
In INTUITOR, the team found that if offline learning was used, the model also learned to cheat after about 100 training steps: by appending a simple problem it had already solved to its answer to increase its confidence score.
However, using online learning can avoid this problem, as the evaluation criteria evolve with the model's capabilities, rendering cheating strategies ineffective.
Experimental Results: Not only good at solving problems, but also good at generalizing
The team first empirically studied the improvement of LLMs' mathematical reasoning capabilities within the INTUITOR framework.
The experiments used Qwen2.5-1.5B/3B as base models and trained them on the MATH dataset with INTUITOR, using self-certainty as the sole reward signal, alongside two baseline methods (GRPO and GRPO-PV).
Training used chat-style prompts, processed 128 problems per batch, generated 7 candidate solutions per problem, and set the KL penalty coefficient to 0.005.
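As a rough illustration of how such confidence scores could feed a GRPO-style update, the sketch below normalizes each candidate's self-certainty against its own group; the helper name, example numbers, and epsilon constant are assumptions, not the released training code.

```python
# Hypothetical sketch: turning per-candidate self-certainty scores into
# group-relative advantages, as in a GRPO-style policy update.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: [num_candidates] self-certainty scores for one problem.

    Each candidate is scored against its own group, so no external
    reward model, value model, or labeled answer is needed.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example with 7 candidate solutions per problem, as described above.
rewards = torch.tensor([2.31, 2.05, 2.48, 1.97, 2.60, 2.12, 2.40])
advantages = group_relative_advantages(rewards)
# Candidates the model is relatively more confident in get positive
# advantages and are reinforced in the policy-gradient step.
```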
Performance was evaluated on benchmarks for mathematical reasoning, code generation, and instruction following, with results as shown in the figure:
Experiments showed that fine-tuning with INTUITOR transformed Qwen2.5-1.5B from a model that initially output only repetitive, meaningless content and scored below 10% on dialogue tasks into one with far fewer invalid outputs and markedly longer responses.
In terms of structured reasoning ability, the team also found that INTUITOR learned faster early on. For example, with Qwen2.5-3B on the GSM8K benchmark, INTUITOR (0.811) consistently outperformed GRPO (0.758).
Furthermore, INTUITOR also excelled at multi-task generalization. For example, on code generation with Qwen2.5-3B, its performance initially lagged but kept climbing, eventually landing 8% higher than GRPO, a 65% relative improvement.
The team also observed that, during long-chain reasoning, INTUITOR-trained models add natural-language reasoning (e.g., "To solve problem X, first perform step Y") before generating the complete code, and they speculate this may be one reason INTUITOR performs consistently well across tests.
Its evolution process can roughly be described in three stages:
1. The model learns to generate code, improving accuracy and reducing invalid responses.
2. It reasons in natural language before the code to aid its own understanding.
3. It gradually refines its output into effective code accompanied by detailed reasoning.
To evaluate the robustness of self-certainty as a reward, researchers also compared offline self-certainty (rewards from a fixed base model) with online self-certainty (rewards from an evolving policy model).
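The distinction can be pictured roughly as follows, reusing the self_certainty helper sketched earlier; the function names and the Hugging Face-style `.logits` access are assumptions for illustration, not the authors' implementation.

```python
# Sketch of the offline vs. online reward setups compared above.
# `self_certainty` is the helper sketched earlier; models are assumed to be
# Hugging Face-style causal LMs whose outputs expose `.logits`.
import torch

@torch.no_grad()
def reward_offline(frozen_base_model, input_ids: torch.Tensor) -> torch.Tensor:
    # Offline: confidence is always judged by a fixed base model. A drifting
    # policy can learn to game this static judge, e.g. by appending an
    # already-solved easy problem to inflate the score.
    logits = frozen_base_model(input_ids).logits[0]
    return self_certainty(logits)

@torch.no_grad()
def reward_online(current_policy, input_ids: torch.Tensor) -> torch.Tensor:
    # Online: confidence is judged by the current policy itself, so the
    # evaluation standard evolves with the model and static cheats stop working.
    logits = current_policy(input_ids).logits[0]
    return self_certainty(logits)
```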
Additionally, to further evaluate the quality of self-certainty as a reward signal, researchers also analyzed the distribution of self-certainty scores generated by the model in MATH500 responses.
Notably, the INTUITOR model had significantly higher self-certainty for correct answers, and while GRPO improved the model's self-assessment ability, its discriminative power was significantly lower than INTUITOR.
Due to computational resource limitations, experiments were conducted only on relatively small unsupervised corpora. In the future, INTUITOR's advantages could be studied further on larger foundation models and more diverse real-world datasets.
Team Introduction
This research comes from the team of Sergey Levine and Dawn Song (Xiaodong Song) at UC Berkeley. There are five authors: first author Xuandong Zhao (a postdoctoral researcher), co-first author Zhewei Kang (an undergraduate), Aosong Feng from Yale University, and Sergey Levine and Dawn Song.
In 2019, after graduating from Zhejiang University, Xuandong Zhao entered the University of California, Santa Barbara, to pursue a Ph.D. in Computer Science. During this period, he also interned at companies such as Alibaba, Microsoft, and Google.
Since joining UC Berkeley in 2024, in addition to this new work, he has published more than a dozen papers, accepted at venues including ICLR 2025 and ICML 2025.
Additionally, in February this year, Xuandong Zhao and Zhewei Kang co-authored a paper on a self-certainty-based Best-of-N strategy for improving LLMs' reasoning capabilities, which can be seen as a precursor to this work.
Paper link: https://arxiv.org/abs/2505.19590
Code link: https://github.com/sunblaze-ucb/Intuitor
References:
[1] https://x.com/joshclemm/status/1927400772817285264
[2] https://x.com/xuandongzhao/status/1927270931874910259
[3] https://x.com/xuandongzhao/status/192778163679341780
[4] https://arxiv.org/abs/2502.18581