DeepSeek makes a big move! New model focuses on mathematical theorem proving, significantly refreshing multiple high-difficulty benchmarks.

In the Putnam competition, the new model DeepSeek-Prover-V2 directly reset the record to 49 problems solved.

The previous record holder had solved only 10 of PutnamBench's 657 problems. That was Kimina-Prover, a collaboration between the Kimi team and Numina, winner of the AIMO 2024 Progress Prize.

DeepSeek-R1, which is not optimized for theorem proving, solved only 1 problem.

This makes the yet-to-be-released R2 even more anticipated.

In addition to the evaluation results, the paper specifically reports the phenomenon of "discovering new skills through reinforcement learning."

Just as R1 brought "aha moments," Prover-V2 also possesses unexpected capabilities.

Specifically, on the Putnam benchmark, the smaller DeepSeek-Prover-V2-7B, running in non-CoT generation mode, solved 13 problems that the 671B model failed to solve.

After carefully examining this model's output, the team found a distinctive pattern in its reasoning: when handling problems involving finite cardinalities, the 7B model frequently uses the Mathlib declarations Cardinal.toNat and Cardinal.natCast_inj, which are conspicuously absent from the output generated by the 671B model.
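For readers unfamiliar with these names: they come from Mathlib, Lean 4's mathematics library, where Cardinal.toNat converts a cardinal to a natural number and Cardinal.natCast_inj states that casting naturals into cardinals is injective. A minimal, hypothetical usage (assuming a recent Mathlib; this is an illustration, not an excerpt from the model's proofs) might look like:

```lean
import Mathlib.SetTheory.Cardinal.Basic

-- Cardinal.natCast_inj : (↑m : Cardinal) = ↑n ↔ m = n
-- Equal cardinal casts of naturals force the naturals themselves to be equal.
example (m n : ℕ) (h : (m : Cardinal) = n) : m = n :=
  Cardinal.natCast_inj.mp h
```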

It should be noted that the 7B model was obtained by first fine-tuning the DeepSeek-Prover-V1.5-Base model using data collected during the reinforcement learning phase of the 671B model, and then performing reinforcement learning.

In other words, the 7B model learned new skills that the 671B model did not.

Unified Model for Formal and Informal Mathematical Proofs

DeepSeek's series of mathematical theorem-proving models, DeepSeek-Prover, has seen three releases:

DeepSeek-Prover in March 2024 (hereinafter referred to as Prover-V1)

DeepSeek-Prover-V1.5 in August 2024 (hereinafter referred to as Prover-V1.5)

DeepSeek-Prover-V2 in May 2025 (hereinafter referred to as Prover-V2)

Prover-V1 mainly explored advancing theorem proving by fine-tuning DeepSeek-Math-7B with a large-scale synthetic dataset.

Prover-V1.5 added reinforcement learning with proof assistant feedback (RLPAF) and Monte Carlo Tree Search methods on top of that.

Prover-V2 further proposed "reinforcement learning with subgoal decomposition" and upgraded the base model from DeepSeek-Math-7B to DeepSeek-V3.

By integrating the long context window and powerful natural language reasoning capabilities of DeepSeek-V3, formal and informal mathematical proofs are unified into a single model.

Prover-V2 also inherited the two generation modes, CoT and non-CoT, proposed by Prover-V1.5.

Next, we will introduce the main aspects of Prover-V2 in detail.

Synthesizing Cold-Start Reasoning Data through Recursive Proof Search

To construct the cold-start dataset, DeepSeek-V3 serves as a unified tool for subgoal decomposition and formalization: it is prompted to decompose each theorem into a high-level proof sketch while simultaneously formalizing the proof steps in Lean 4, producing a sequence of subgoals.

A smaller 7-billion-parameter model is used to handle the proof search for each subgoal, thus alleviating the associated computational burden. Once the decomposition steps for a challenging problem are solved, the complete step-by-step formal proof is paired with the corresponding chain-of-thought from DeepSeek-V3 to create cold-start reasoning data.
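The recursive pipeline described above can be sketched in a few lines of Python. Everything here (function names, the assembly step) is a hypothetical placeholder chosen for illustration, not the paper's actual implementation:

```python
def prove_by_decomposition(theorem, sketch_subgoals, prove_subgoal, assemble):
    """Recursive proof search via subgoal decomposition (illustrative sketch).

    A large model (DeepSeek-V3 in the paper) proposes subgoals as a proof
    sketch; a small 7B prover searches for a proof of each subgoal; if every
    subgoal is closed, the pieces are assembled into a complete formal proof.
    """
    subgoals = sketch_subgoals(theorem)        # high-level sketch -> Lean subgoals
    proofs = []
    for goal in subgoals:
        proof = prove_subgoal(goal)            # cheaper search with the small model
        if proof is None:
            return None                        # any unsolved subgoal fails the attempt
        proofs.append(proof)
    return assemble(theorem, subgoals, proofs)  # stitch subgoal proofs together
```

With toy stand-ins for the three callbacks, a solved decomposition returns the assembled proof, while a single failed subgoal makes the whole attempt fail.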

Reinforcement Learning with Subgoal Decomposition using Synthetic Cold-Start Data

The team carefully selected a set of challenging problems that the 7-billion-parameter prover model could not solve end to end, but whose decomposed subgoals had all been solved successfully.

By combining the proofs for all subgoals, a complete formal proof for the original problem was constructed.

This proof was then appended to the chain-of-thought from DeepSeek-V3, which outlines the corresponding lemma decomposition, thus achieving an organic combination of informal reasoning and subsequent formalization.

The reinforcement learning phase, conducted after fine-tuning the prover model on synthetic cold-start data, further enhances its ability to bridge informal reasoning with formal proof construction. Following the standard training objectives for reasoning models, binary correct/incorrect feedback is used as the primary form of reward supervision.

Specific Training Details

Two-Stage Training:

DeepSeek-Prover-V2 establishes complementary proof generation modes in two stages.

The first stage uses an efficient non-CoT mode, focusing on rapid generation of Lean proof code to speed up iteration and data collection.

The second stage builds upon the results of the first stage, employing a high-precision Chain-of-Thought (CoT) mode, elaborating on intermediate reasoning steps, and using cold-start CoT data for reinforcement learning to enhance the ability to reason about complex problems.

Expert Iteration:

Non-CoT mode training follows the expert iteration paradigm: the current best model generates proof attempts for difficult problems, and attempts that pass Lean verification are added to the supervised fine-tuning (SFT) dataset. Compared with previous versions, the distribution of training problems was adjusted, introducing additional problems as well as problems generated through subgoal decomposition.
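One round of this loop can be sketched as follows; the callback names and the toy "verifier" are assumptions for illustration, not the paper's code:

```python
def expert_iteration_round(problems, generate_proof, lean_verifies, sft_dataset):
    """One expert-iteration round (illustrative sketch).

    Sample a proof attempt for each open problem with the current best model;
    attempts the Lean verifier accepts are added to the SFT dataset, and
    unsolved problems are carried into the next round.
    """
    still_open = []
    for problem in problems:
        attempt = generate_proof(problem)
        if lean_verifies(problem, attempt):
            sft_dataset.append((problem, attempt))
        else:
            still_open.append(problem)
    return still_open

# Toy stand-ins: "verification" just checks the attempt echoes the problem.
sft = []
remaining = expert_iteration_round(
    ["p1", "p2"],
    generate_proof=lambda p: p.upper() if p == "p1" else "bad",
    lean_verifies=lambda p, a: a == p.upper(),
    sft_dataset=sft,
)
```

Over successive rounds the SFT set grows with verified proofs while the pool shrinks to the problems the model still cannot crack.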

Supervised Fine-Tuning:

Supervised fine-tuning was performed on DeepSeek-V3-Base-671B. The training corpus includes data from two complementary sources:

First, non-CoT data collected through expert iteration. The Lean code generated from this data does not include intermediate reasoning steps and is primarily used to strengthen the model's formal verification skills within the Lean theorem proving ecosystem.

Second, cold-start CoT data. This data refines the advanced mathematical reasoning process of DeepSeek-V3 into structured proof paths, explicitly simulating the cognitive process of translating mathematical intuition into a formal proof structure.

Reinforcement Learning:

Training uses the GRPO (Group Relative Policy Optimization) algorithm. Unlike traditional PPO, GRPO requires no separate critic model: it samples a group of candidate proofs for each theorem prompt and optimizes the policy based on their relative rewards.

A binary reward mechanism is used during training: a generated Lean proof receives a reward of 1 if verified as correct, and 0 otherwise.
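Combining the two paragraphs above, the group-relative advantage computation at the heart of GRPO under a binary verifier reward can be sketched as follows. This is a minimal illustration following the common GRPO formulation (mean-centering and standard-deviation scaling within a group), not necessarily the paper's exact details:

```python
def group_relative_advantages(rewards):
    """GRPO-style advantages for one group of sampled proofs.

    Each candidate proof gets a binary reward (1 = Lean verifies it,
    0 = it fails); the advantage is the reward centered by the group
    mean and scaled by the group standard deviation.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # avoid dividing by zero when all rewards agree
    return [(r - mean) / std for r in rewards]

# A group of 4 candidate proofs for one theorem prompt: one verified, three failed.
advs = group_relative_advantages([1, 0, 0, 0])
```

The single verified proof receives a positive advantage and the failures receive negative ones, so the policy is pushed toward whatever the verifier accepted without ever needing a learned value baseline.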

To ensure learning effectiveness, training prompts were carefully selected to include only problems that were sufficiently challenging but solvable by the supervised fine-tuned model.

Distillation of DeepSeek-Prover-V2 7B

For distillation, the context window of DeepSeek-Prover-V1.5-Base-7B was extended to 32,768 tokens, and the model was fine-tuned on data from DeepSeek-Prover-V2-671B, including non-CoT proof data. This enables the smaller model to produce concise formal outputs, providing a cost-effective proving option.

Furthermore, DeepSeek-Prover-V2-7B underwent the same reinforcement learning phase as the 671B model training to further enhance its performance.

The resulting flagship model, Prover-V2-671B, achieves state-of-the-art performance in neural theorem proving, reaching an 88.9% pass rate on the miniF2F test and solving 49 PutnamBench problems. The proofs it generated for the miniF2F dataset are available for separate download.

ProverBench: Formalization of AIME and Textbook Problems

Launched alongside Prover-V2 is ProverBench, a benchmark dataset containing 325 problems. Of these, 15 problems are formalized from recent American Invitational Mathematics Examination (AIME 24 and 25) number theory and algebra questions, providing realistic high school competition level challenges. The remaining 310 problems are drawn from carefully selected textbook examples and instructional tutorials, forming a diverse set of formalized mathematical problems based on teaching needs. The benchmark aims to provide a more comprehensive evaluation of high school competition problems and undergraduate-level mathematical problems.

The DeepSeek-Prover-V2 series was evaluated on three datasets (miniF2F, ProverBench, and PutnamBench); the final overall scores are reported in the paper.

DeepSeek All-Star Lineup

Prover-V2 has 18 authors. The co-first authors, Z.Z. Ren, Zhihong Shao, and Huajian Xin, are core members who also contributed to V3, R1, and earlier versions of the Prover series.

Several researchers who did not participate in the previous two versions (Prover-V1, Prover-V1.5) appeared in the author list.

One example is Shirong Ma, who completed both his undergraduate and master's degrees at Tsinghua. According to public information, he joined DeepSeek after graduating last year and is currently a DeepSeek researcher; he previously contributed to projects ranging from the first DeepSeek LLM and DeepSeek-Coder to R1.

The same goes for Zhe Fu and Yuxuan Liu.

Although neither appeared in the author lists of Prover-V1 or Prover-V1.5, both are senior members of DeepSeek.

Their names can be found in the "Fire-Flyer AI-HPC" paper, published around the same time as Prover-V1/V1.5.

The Fire-Flyer AI-HPC architecture proposed in this research reduces training costs through software and hardware co-design, addressing the shortcomings of traditional supercomputing architectures for AI training needs.

However, this Prover-V2 paper did not mention specific optimization strategies for training or inference infrastructure.

Finally, there is a new face, Hongxuan Tang, about whom specific information is not yet available.

After Prover-V2 was released, it quickly gained attention from the community, with the GitHub repository receiving over 350 stars within 12 hours.

On platforms like X (formerly Twitter) and Hugging Face, netizens engaged in heated discussions.

Zhihong Shao, a core contributor to Prover-V2, actively promoted the results on his personal account.

X engineer @kache praised:

Thank you for your dedication to open science research.

Assistant Professor Chi Jin from Princeton University said:

Congratulations on this amazing work! Breaking the last 10-20% of problems on miniF2F marks a significant leap in capability. The current competition in formal math is fierce, and it's incredible that Kimina held the SOTA for only two weeks before being surpassed by DeepSeek.

Even @Marco Dos Santos, a core contributor to Kimina-Prover, sent his congratulations:

Congrats to the DeepSeek AI team for pushing the SOTA on miniF2F task to 89%!

Excited to see long CoT methods being explored independently by other teams and exhibiting some interesting differences. Formal math is more popular than ever today!

Beyond that, the question netizens care about most remains: when will R2 be released?

Paper:

https://github.com/deepseek-ai/DeepSeek-Prover-V2/blob/main/DeepSeek_Prover_V2.pdf

Model:

https://huggingface.co/deepseek-ai/DeepSeek-Prover-V2-671B

DeepSeek-Prover

https://arxiv.org/abs/2405.14333

DeepSeek-Prover-V1.5

https://arxiv.org/abs/2408.08152

References:

[1] https://trishullab.github.io/PutnamBench/leaderboard.html

[2] https://x.com/zhs05232838/status/1917600755936018715
