Paper Title:
DeepCritic: Deliberate Critique with Large Language Models
Paper Link:
https://arxiv.org/abs/2505.00662
Code Link:
https://github.com/RUCBM/DeepCritic
Author Team:
Gaoling School of Artificial Intelligence, Renmin University of China; School of Computer Science and Technology, Beijing Jiaotong University
Background
Large Language Models (LLMs) have demonstrated outstanding performance on many tasks by learning from massive amounts of human data and continuously improving through human-supervised feedback. However, as model capability keeps increasing, relying on human supervision becomes ever more costly and difficult. How to supervise these increasingly capable models in a more efficient and scalable way has become a critical and urgent problem.
One potential solution is to use LLMs themselves as critics (LLM Critics) to evaluate and provide feedback on model-generated content. LLM critics generate critiques that help improve model outputs, potentially replacing manual feedback and enabling automated supervision and continuous optimization of LLMs.
However, some studies have found that existing LLM critics still perform relatively poorly when dealing with complex domains such as mathematical reasoning.
The authors analyzed the causes and found that existing LLM critics lack critical thinking: they often merely repeat the verification implied by the logic of the original reasoning steps instead of analyzing those steps deeply from a questioning perspective. As a result, they are easily misled by errors in the original reasoning and fail to uncover the problems (see Figure 1).
This flaw leads to two core problems: first, the accuracy of their judgments is low; second, the critiques they provide lack actionable guidance, making it difficult to give generator models effective directions for correction and improvement.
▲ Figure 1. Existing LLM critics can only generate shallow, superficial critiques, leading to low accuracy. This work trains a critique model that combines iterative critique, multi-angle verification, and meta-critique mechanisms to reason deliberately before making a judgment, producing detailed feedback and accurate judgments.
To address the problem that current LLM critics produce overly superficial critiques on mathematical reasoning tasks, this work proposes the DeepCritic framework, which trains deliberate LLM critics through a two-stage process: supervised fine-tuning (SFT) followed by reinforcement learning (RL).
The DeepCritic-7B-RL model, trained from Qwen2.5-7B-Instruct, significantly outperforms current LLM critics, including GPT-4o, Qwen2.5-72B-Instruct, and similarly sized DeepSeek-R1-Distill models, on various mathematical error-identification benchmarks. DeepCritic-7B-RL can also further improve the test-time scaling results of generator models when used as either a verifier or a critic.
Two-Stage Training Enhances LLM Critique Capability
2.1 Supervised Fine-tuning Teaches LLMs Deliberate Critique
In the first stage, to teach current LLMs the behaviors and format of deep critique, the authors first constructed long chain-of-thought critique data from scratch and performed supervised fine-tuning (SFT) to give LLMs preliminary deep critique capabilities.
Specifically, the authors proposed a staged, progressively strengthened critique generation process that guides the model to think more deeply and self-reflect, thereby improving its judgment accuracy and feedback quality. The generation method includes the following three key steps (a code sketch of the pipeline follows the list):
Initial Critique Generation: First, a small set of problems and solutions with manually labeled step-level correctness is sampled from PRM800K. A large model (Qwen2.5-72B-Instruct) is prompted to critique each reasoning step individually, producing an initial critique for each step.
Deep Critique Generation: However, as noted above, direct critiques from existing large models tend to be superficial and lack genuine critical thinking. Therefore, in this step, given the problem, the reasoning steps, and the initial critique, the model is prompted again to re-evaluate the step from a different angle or with a different verification method, or to re-examine the initial critique itself. This uncovers problems the initial critique missed, as well as problems within the initial critique, yielding a deeper, more reflective meta-critique that can effectively correct initial misjudgments.
Final Critique Fusion and Supervised Fine-tuning: Finally, every deep critique whose judgment agrees with the human label is merged with its corresponding initial critique into a single long chain of thought, forming a more mature and detailed final critique for each step. The per-step final critiques are then concatenated into a deep critique of the entire solution. Approximately 4.5K high-quality SFT examples were constructed in this way. Fine-tuning the base model (Qwen2.5-7B-Instruct) on these data produces the initial critique model DeepCritic-7B-SFT, which possesses multi-round evaluation, multi-angle verification, and meta-critique capabilities.
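As an illustration only, the following Python sketch shows how such a three-step construction pipeline might look. All prompt wording, function names, and the `llm` callable are assumptions made for exposition, not the authors' actual implementation.

```python
# Minimal sketch of the three-step critique-data construction described above.
# `llm` stands for any callable mapping a prompt string to a completion string
# (e.g. Qwen2.5-72B-Instruct behind an API); prompts and helpers are illustrative.

def initial_critique(llm, problem, steps, i):
    # Step 1: critique a single reasoning step on its own.
    prompt = (f"Problem: {problem}\nSteps so far: {steps[:i + 1]}\n"
              f"Critique step {i + 1} and conclude 'correct' or 'incorrect'.")
    return llm(prompt)

def deep_critique(llm, problem, steps, i, init):
    # Step 2: re-verify from a different angle, or meta-critique the
    # initial critique itself, then give a final judgment.
    prompt = (f"Problem: {problem}\nSteps so far: {steps[:i + 1]}\n"
              f"Initial critique: {init}\n"
              "Re-check this step with a different method or question the "
              "initial critique, then conclude 'correct' or 'incorrect'.")
    return llm(prompt)

def judgment(critique_text):
    # Toy judgment extraction: read the verdict from the final line.
    return "incorrect" not in critique_text.splitlines()[-1].lower()

def build_sft_example(llm, problem, steps, human_labels):
    # Step 3: keep only samples whose final judgment matches the human label
    # (human_labels[i] is True if step i is correct), fuse initial + deep
    # critique into one long chain of thought, and concatenate per-step
    # critiques into a whole-solution critique.
    fused_steps = []
    for i, label in enumerate(human_labels):
        init = initial_critique(llm, problem, steps, i)
        deep = deep_critique(llm, problem, steps, i, init)
        if judgment(deep) != label:          # inconsistent with human label
            return None                      # discard this sample
        fused_steps.append(init + "\n\nWait, let me re-check.\n\n" + deep)
    return {"input": f"Problem: {problem}\nSolution:\n" + "\n".join(steps),
            "output": "\n\n".join(fused_steps)}
```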
▲ Figure 2. Two-stage training process diagram
2.2 Reinforcement Learning Motivates LLMs to Critique Deliberately
After completing the first stage of supervised fine-tuning and building a model with preliminary deep critique capabilities, the goal of the second stage is to further unleash the model's potential, making it more accurate and flexible in evaluating complex reasoning processes. To this end, the authors used reinforcement learning (RL) for further training.
The key to the reinforcement learning stage lies in obtaining high-quality data. The authors explored RL training under two different data source settings:
Manually Annotated Data: Directly using the existing human-annotated dataset PRM800K, the most ideal data source thanks to its reliable labels.
Automatically Constructed Data: Given that manual annotation is increasingly expensive and hard to sustain, the authors also designed an automated data construction pipeline that requires no manual annotation.
Specifically, a subset of problems is drawn from NuminaMath-CoT, and Qwen2.5-1.5B/3B/7B-Instruct are used to generate multiple solution paths for each problem; problems that are too easy or too hard are filtered out. For the remaining solution paths, the correctness of each reasoning step is estimated with Monte Carlo sampling (a sketch of this labeling heuristic follows below):
(1) Identifying incorrect steps in erroneous solution paths: the solution is truncated at a given step, and the generator model (Qwen2.5-7B-Instruct) completes it multiple times from that point. If no completion starting from that step or any later step reaches the correct answer, while the majority of completions from every earlier step do, that step is marked as the first erroneous step.
(2) Verifying correct solution paths: For solutions with correct final answers, the same strategy is applied to detect whether there are incorrect intermediate steps, ensuring accurate labels and sample quality.
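The following is a minimal sketch of this Monte Carlo labeling heuristic. The helper `generate_completions`, the sample count, and the majority threshold are illustrative assumptions rather than the paper's exact settings.

```python
# Rough sketch of the Monte Carlo step-labeling heuristic in (1)-(2) above.
# `generate_completions(problem, prefix, n)` is assumed to sample the generator
# model (Qwen2.5-7B-Instruct) n times from a truncated solution and return the
# final answers of the completions.

def step_success_rate(generate_completions, problem, steps, i,
                      gold_answer, n_samples=8):
    # Truncate the solution after step i, let the generator finish it
    # n_samples times, and return the fraction reaching the gold answer.
    prefix = steps[: i + 1]
    answers = generate_completions(problem, prefix, n_samples)
    return sum(ans == gold_answer for ans in answers) / n_samples

def first_error_step(generate_completions, problem, steps, gold_answer):
    # Mark step i as the first error if no completion from step i onward
    # recovers the correct answer, while a majority of completions from
    # every earlier step still do.
    rates = [step_success_rate(generate_completions, problem, steps, i,
                               gold_answer) for i in range(len(steps))]
    for i in range(len(steps)):
        if all(r == 0.0 for r in rates[i:]) and all(r > 0.5 for r in rates[:i]):
            return i          # index of the first erroneous step
    return None               # no confident label; drop the sample
```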
Finally, the DeepCritic-7B-SFT model was trained on 40.7K PRM800K samples or 14.2K automatically constructed samples, yielding the models DeepCritic-7B-RL-PRM800K and DeepCritic-7B-RL-Numina, respectively.
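For intuition, the RL stage can use a simple rule-based, verifiable reward against these step-level labels. The sketch below is an assumption about the reward shape (a match check on the predicted first-error step), not the paper's exact reward design.

```python
# Illustrative rule-based reward for RL training (an assumption): the critique
# earns reward only when its predicted first-error step matches the label.
def critique_reward(predicted_first_error, label_first_error):
    # Each argument is a step index, or None meaning "the solution is correct".
    return 1.0 if predicted_first_error == label_first_error else 0.0
```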
Experimental Results
3.1 Main Experimental Results for Mathematical Critique Tasks
▲ Table 1. Performance of various models on different mathematical critique benchmarks. The metric is the F1 score computed from two accuracies: the accuracy of identifying the first erroneous step in incorrect solution paths and the accuracy of correctly judging error-free solution paths.
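Assuming the common ProcessBench-style convention, this F1 is simply the harmonic mean of the two accuracies:

```python
# F1 as the harmonic mean of the two benchmark accuracies (assumed convention).
def f1(acc_error, acc_correct):
    # acc_error: accuracy of locating the first wrong step in erroneous solutions
    # acc_correct: accuracy of judging error-free solutions as correct
    return 2 * acc_error * acc_correct / (acc_error + acc_correct)

# e.g. f1(0.6, 0.4) == 0.48
```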
The authors systematically evaluated the critique capabilities of different models on multiple mathematical evaluation benchmarks, and the results are shown in Table 1. The main experimental conclusions are as follows:
(1) The critique capabilities of base instruction-tuned models are generally weak, especially for small models; as model scale increases, critique ability improves accordingly.
(2) The DeepSeek-R1-Distill series performs better on mathematical critique tasks thanks to its substantially stronger mathematical reasoning. However, these models tend to rely on their own problem-solving ability to judge the correctness of reasoning steps rather than truly learning to evaluate and critique; as a result, their F1 scores remain relatively low on difficult problems (e.g., Omni-Math).
(3) After fine-tuning on the carefully constructed 4.5K critique examples, DeepCritic-7B-SFT raised the average F1 score from 34.1 to 54.1 compared with the base model Qwen2.5-7B-Instruct, an improvement of 20 percentage points. This demonstrates the high quality of the constructed deliberate critique data and validates the motivation of teaching LLMs to critique deliberately.