From Intuition to "Deep Thinking": Multidimensional Evolution of Large Model Reasoning Capabilities

(Reading time: 15 minutes)

Editor's note: Although the capabilities of artificial intelligence have grown rapidly in recent years, it still falls short on complex reasoning tasks. Researchers at Microsoft Research Asia have been studying this problem from multiple perspectives, continuously exploring new ways to improve the reasoning capabilities of large models: rStar-Math, which uses Monte Carlo Tree Search to simulate the human process of "deep thinking"; Logic-RL, a rule-based reinforcement learning method; LIPS, which integrates the mathematical intuition of large language models with symbolic methods; a new framework for improving the accuracy of autoformalization; a neuro-symbolic framework for automatically generating high-quality supervised mathematical data; and the unified reasoning framework CoR and Critical Plan Step Learning (CPL) – each study provides a new perspective and method for enhancing the reasoning capabilities of large models.

The rapid development of artificial intelligence has enabled it to demonstrate powerful capabilities in many fields, and it has also raised expectations that AI will possess the deep thinking and reasoning abilities needed to help humans solve all kinds of "brain-burning" complex real-world problems.

To improve the reasoning capabilities of large models, researchers at Microsoft Research Asia are conducting research in three directions: First, enhancing reasoning capabilities by improving the models themselves, allowing smaller models to also achieve strong reasoning performance; second, deeply learning the rules of mathematical reasoning to improve the reliability of large language models; and third, enhancing the generalization of reasoning, enabling models to flexibly apply across domains, thereby promoting the progress of general artificial intelligence.


Unleashing the Potential of Large Language Models – Stronger Reasoning in Smaller Models

"Although large models trained on world knowledge have massive knowledge reserves, existing models have not fully exploited their full potential. Furthermore, existing models lack the ability for continuous learning, which is a sharp contrast to the human ability to constantly learn new knowledge and fill cognitive gaps," says Dr. Lihong Zhang, Principal Researcher at Microsoft Research Asia. Excellent reasoning ability often relies on the support of large-scale models. Therefore, some researchers are also exploring how similar reasoning capabilities can be achieved in smaller models.

In the traditional reasoning mode, large models often use simple, direct "intuitive" reasoning to generate answers to complex problems. Although this method is fast, it is prone to errors. In contrast, humans analyze problems step by step, try multiple approaches, and weigh the pros and cons before giving an answer. Given this, researchers proposed rStar-Math, which uses Monte Carlo Tree Search (MCTS) to simulate the human process of "deep thinking," enabling small language models to reach a higher level of reasoning capability.

rStar-Math achieves self-evolution through three steps. First, it breaks complex mathematical problems down into multiple reasoning steps, allowing the model to explore and verify the contribution of each step gradually and ensuring that the reasoning trajectories generated by the small model consist of correct, high-quality intermediate steps. Second, it trains a small model as a Process Preference Model (PPM) to reliably predict a reward label for each mathematical reasoning step, thereby achieving reliable process reward modeling and annotation. Finally, a four-round self-evolution scheme builds cutting-edge policy models and PPMs from scratch: each round uses the latest policy model and PPM for Monte Carlo Tree Search, gradually evolving and training stronger policy models and PPMs.
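To make the search loop concrete, below is a minimal, self-contained sketch of MCTS-guided step-by-step reasoning in the spirit of rStar-Math. The helpers policy_propose_steps and ppm_score are hypothetical stand-ins for the policy model and the PPM, and the random scoring is only there so the sketch runs; it illustrates the selection-expansion-evaluation-backpropagation loop, not the paper's implementation.

```python
import math
import random

class Node:
    def __init__(self, steps, parent=None):
        self.steps = steps          # partial reasoning trajectory
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0            # accumulated PPM reward

def ucb(node, c=1.4):
    # Upper Confidence Bound: balance exploiting high-reward steps
    # against exploring rarely visited ones.
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def policy_propose_steps(steps, k=3):
    # Placeholder: a real system samples k candidate next steps
    # from the policy language model.
    return [steps + [f"step{len(steps)}-{i}"] for i in range(k)]

def ppm_score(steps):
    # Placeholder: a real PPM predicts a reward for the latest step.
    return random.random()

def mcts_search(root_steps, iterations=100, max_depth=8):
    root = Node(root_steps)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend by UCB until reaching a leaf.
        while node.children:
            node = max(node.children, key=ucb)
        # 2. Expansion: ask the policy model for candidate next steps.
        if len(node.steps) < max_depth:
            node.children = [Node(s, node) for s in policy_propose_steps(node.steps)]
            node = random.choice(node.children)
        # 3. Evaluation: score the new step with the PPM.
        reward = ppm_score(node.steps)
        # 4. Backpropagation: propagate the reward up to the root.
        while node:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Return the most-visited first step's trajectory.
    return max(root.children, key=lambda n: n.visits).steps

print(mcts_search(["problem statement"]))
```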

Experiments validated rStar-Math's effectiveness on four small language models (1.5 billion to 7 billion parameters). On the American Invitational Mathematics Examination (AIME), rStar-Math solved an average of 53.3% (8/15) of the problems, ranking among the top 20% of the best high school math students.

rStar-Math: Small LLMs can master math reasoning with self-evolved deep thinking

Paper link:

https://arxiv.org/pdf/2501.04519

Figure 1: rStar-Math schematic diagram

The researchers also proposed the rule-based reinforcement learning method Logic-RL, which improves the model's reasoning ability on complex logic problems by synthesizing logic puzzles as training data. Logic-RL introduces a practical system prompt and a strict format reward function to prevent the reasoning model from taking shortcuts. For example, when generating answers, the model must organize the reasoning process and answer according to a specific format, and only when both the reasoning process and the answer meet the requirements can it receive a higher reward, thus ensuring the completeness and accuracy of the reasoning process.
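As an illustration of how such a format reward might look, here is a minimal sketch in which a completion earns full reward only when it follows a `<think>...</think><answer>...</answer>` structure and the answer is correct. The tag names and reward values are assumptions for illustration, not the exact ones used in Logic-RL.

```python
import re

# Strict format gate: the whole completion must be exactly a <think>
# block followed by an <answer> block, with nothing else around them.
FORMAT = re.compile(
    r"\A<think>(?P<think>.+?)</think>\s*<answer>(?P<answer>.+?)</answer>\Z",
    re.DOTALL,
)

def reward(completion: str, gold_answer: str) -> float:
    """Reward only completions that both follow the required format
    and contain the correct answer; penalize shortcuts."""
    m = FORMAT.match(completion.strip())
    if m is None:
        return -1.0                      # malformed: no credit at all
    if m.group("answer").strip() == gold_answer:
        return 2.0                       # well-formed and correct
    return -0.5                          # well-formed but wrong

print(reward("<think>A lies, so B tells the truth.</think><answer>B</answer>", "B"))
```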

After training with Logic-RL, the 7-billion-parameter model not only performed excellently on logic puzzles but also showed strong generalization on mathematical competition benchmarks such as AIME and AMC, improving accuracy by 125% and 38%, respectively.

Logic-RL: Unleashing LLM reasoning with rule-based reinforcement learning

Paper link:

https://arxiv.org/pdf/2502.14768


Strengthening Mathematical Reasoning Capabilities – More Reliable Reasoning

Mathematics, as the cornerstone of science, possesses rigorous logic and high precision. For artificial intelligence, solving mathematical reasoning problems will greatly enhance AI's reasoning capabilities and promote the wide application of models in various fields. However, relying solely on the natural language processing capabilities of large models often fails to meet the strict standards required for mathematical reasoning. To address this, researchers are using formal and symbolic research methods to help models learn existing human mathematical methods and tools, master mathematical rules, and improve the efficiency and accuracy of reasoning.

"Natural language is human language, not the native language of computers or large models, which cannot directly understand natural language. We hope to convert the output of large language models into code form and map it to axioms, such as '1+1=2,' which are self-evident truths, thereby verifying the correctness of the model's output. This is similar to how humans convert what they hear into their own understanding when communicating, while we convert it into tools that computers can understand through a formalization process," says Dr. Xian Zhang, Senior Researcher at Microsoft Research Asia.

Mathematical language includes mathematical theorems, inequality proofs, etc., which are significantly different from the language system of large language models. To enable large models to understand mathematical problems, it is first necessary to convert mathematical problems into code form through formal and symbolic methods, and then map them to axioms understandable by computers. Based on this, researchers designed the LLM-based inequality prover with symbolic reasoning (LIPS). It creatively integrates the mathematical intuition of large models with domain-specific insights encoded by symbolic methods to determine which parts of mathematical reasoning are best suited for large models and which are better handled by symbolic methods.

By analyzing how humans solve such problems, LIPS extracts two strategies: one is scaling, handled by symbolic methods; the other is rewriting, handled by large models. After evaluating LIPS on 161 challenging inequalities from multiple math competitions, the results show that LIPS demonstrated state-of-the-art performance and significantly outperformed existing large models and symbolic methods without requiring additional training data.
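A minimal sketch of this division of labor appears below: the symbolic side mechanically applies a fixed lemma (here AM-GM) as a "scaling" tactic and verifies the result with a sum-of-squares check, while "rewriting" is left to a placeholder LLM call. The function names and the single lemma are illustrative assumptions, not the tactic library LIPS actually uses.

```python
import sympy as sp

a, b = sp.symbols("a b", positive=True)

def scaling_tactics(expr):
    # Symbolic side: apply fixed lemmas mechanically. Here, only AM-GM:
    # a + b >= 2*sqrt(a*b). A real tactic library would enumerate many.
    candidates = []
    if expr == a + b:
        candidates.append(2 * sp.sqrt(a * b))
    return candidates

def llm_rewrite(goal):
    # LLM side (placeholder): propose an equivalent form of the goal
    # that makes some known lemma applicable.
    return goal  # a real system would call a language model here

goal = a + b
for bound in scaling_tactics(llm_rewrite(goal)):
    # Verify the bound by exhibiting the difference as a perfect square.
    residual = sp.simplify(goal - bound - (sp.sqrt(a) - sp.sqrt(b)) ** 2)
    print(f"{goal} >= {bound}: residual {residual}")  # 0 means proven
```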

Proving Olympiad inequalities by synergizing LLMs and symbolic reasoning

Paper link:

https://openreview.net/pdf?id=FiyS0ecSm0

Figure 2: LIPS symbolic reasoning inequality prover

Although formal methods have shown great potential for large models in various mathematical reasoning tasks, the success rate of large models in automatically formalizing mathematical statements is still low. Specifically, there is a significant gap between the pass@1 rate (the first generated result is correct) and the pass@k rate (at least one of the top k generated results is correct).

To narrow this gap, researchers introduced a new framework that establishes self-consistency for autoformalization along two innovative and complementary dimensions: symbolic equivalence and semantic consistency. Symbolic equivalence extends traditional comparisons (such as final answers and execution behavior) to verify logical equivalence between autoformalization candidates. Semantic consistency catches unexpected reasoning differences that symbolic equivalence might miss by measuring the embedding similarity between informalized (back-translated) results and the original natural-language statements. Together, these checks ensure that the autoformalization process preserves the intended meaning and coherence of the original statements. Experiments on the MATH and miniF2F datasets showed that this method greatly improved autoformalization accuracy, achieving relative improvements of 0.22x to 1.35x across various large language models and baseline methods.
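The selection step can be sketched as follows: generate k candidate formalizations, let candidates vote for each other via a symbolic-equivalence check, add an embedding-similarity score between each candidate's back-translation and the original statement, and keep the highest-scoring candidate. symbolically_equivalent, informalize, and embed_similarity are hypothetical stand-ins (here trivial placeholders so the sketch runs) for a prover-backed equivalence check, an LLM back-translation, and a sentence-embedding model.

```python
from itertools import combinations

def symbolically_equivalent(f1: str, f2: str) -> bool:
    # Placeholder: a real system asks a prover whether f1 <-> f2.
    return f1 == f2

def informalize(formal: str) -> str:
    # Placeholder: an LLM translates the formal statement back to English.
    return formal

def embed_similarity(s1: str, s2: str) -> float:
    # Placeholder: cosine similarity of sentence embeddings; here a
    # crude token-overlap proxy so the sketch runs.
    t1, t2 = set(s1.split()), set(s2.split())
    return len(t1 & t2) / max(len(t1 | t2), 1)

def select_candidate(statement: str, candidates: list[str]) -> str:
    # Symbolic-equivalence votes: candidates that agree with many others
    # are more likely to be faithful formalizations.
    votes = {c: 0 for c in candidates}
    for c1, c2 in combinations(candidates, 2):
        if symbolically_equivalent(c1, c2):
            votes[c1] += 1
            votes[c2] += 1
    # Semantic consistency: back-translate and compare to the original.
    def score(c):
        return votes[c] + embed_similarity(informalize(c), statement)
    return max(candidates, key=score)

print(select_candidate(
    "The sum of two odd numbers is even.",
    ["theorem odd_add_odd ...", "theorem odd_add_odd ...", "theorem wrong ..."]))
```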

Autoformalizing mathematical statements by symbolic equivalence and semantic consistency

Paper link:

https://openreview.net/pdf?id=8ihVBYpMV4

Figure 3: Automatic formalization framework

Furthermore, researchers believe that the extreme scarcity of high-quality mathematical datasets is also a key factor limiting the improvement of large language model mathematical reasoning capabilities. To break through this challenge, researchers proposed a neuro-symbolic framework for automatically generating high-quality, supervised mathematical data. This paradigm combines the strengths of neural and symbolic approaches. On the one hand, it generates diverse mathematical problems through systematic sampling in the symbolic space and uses symbolic solvers to ensure the validity of the problems; on the other hand, large models can effectively support the conversion from the symbolic space to the natural language space, ensuring that the newly generated formalized problems remain consistent with their corresponding natural language versions.
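Below is a minimal sketch of this pipeline, under the simplifying assumption that the symbolic space consists of linear equations: sample a problem symbolically, obtain a guaranteed-correct answer from a symbolic solver, and hand the verbalization to a placeholder LLM call. llm_verbalize is a hypothetical stand-in; a real system would also check that the natural-language version stays consistent with the symbolic one.

```python
import random
import sympy as sp

x = sp.symbols("x")

def sample_symbolic_problem():
    # Systematic sampling in the symbolic space: a random linear
    # equation a*x + b = c with small integer coefficients.
    a, b = random.randint(1, 9), random.randint(1, 9)
    return sp.Eq(a * x + b, random.randint(10, 40))

def llm_verbalize(eq) -> str:
    # Placeholder: a real system prompts an LLM to phrase the problem
    # in natural language and verifies consistency with the symbolic form.
    return f"Solve for x: {eq.lhs} = {eq.rhs}"

eq = sample_symbolic_problem()
answer = sp.solve(eq, x)   # the symbolic solver guarantees a valid label
print(llm_verbalize(eq), "->", answer)
```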

Neuro-symbolic data generation for math reasoning

Paper link:

https://openreview.net/pdf?id=CIcMZGLyZW

Figure 4: Neuro-symbolic framework


Enhancing the Generalization Ability of Large Language Models – More Usable Reasoning

Reasoning generalization is an important indicator of whether artificial intelligence truly possesses universality. Models with strong generalization ability can cross the knowledge boundaries of different fields and "infer other cases from one instance," thereby expanding the application scope and value of artificial intelligence. Researchers have found that after training models with mathematical data, their reasoning ability significantly improves in multiple fields such as science and code. This finding provides a new direction for enhancing the reasoning generalization ability of large models.

By integrating three reasoning paradigms (natural language, code, and symbolic language) into the same reasoning trajectory, researchers proposed the unified reasoning framework CoR (Chain-of-Reasoning). Natural language helps the model understand the problem's context and requirements, code excels at precise computation and logical processing, and symbolic language expresses mathematical and logical relationships concisely and rigorously. CoR allows the model to first reason in one paradigm, then flexibly switch paradigms according to the stage and needs of the problem, continuing collaborative reasoning across multiple paradigms based on previously generated content and achieving reasoning generalization on general mathematical tasks.

Furthermore, by adjusting prompts, the model can change the depth of reasoning and the number of paradigms used, greatly improving its adaptability to different tasks. In tests on five mathematical reasoning datasets, CoR achieved significant improvements and demonstrated surprisingly general mathematical problem-solving ability: it can solve both mathematical calculation problems and mathematical proof problems, as the prompt sketch below illustrates.
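Here is a minimal sketch of how a prompt might request a particular paradigm sequence and depth. The template wording and paradigm tags are assumptions for illustration, not the paper's exact prompt.

```python
# Hypothetical paradigm tags and stage descriptions for a CoR-style prompt.
PARADIGMS = {
    "NL":   "reason in natural language about the problem setup",
    "Sym":  "manipulate formal symbolic expressions rigorously",
    "Code": "write and mentally execute code for exact computation",
}

def build_prompt(problem: str, chain: list[str]) -> str:
    """Ask for one reasoning trajectory that switches paradigms in order;
    a longer chain requests deeper, multi-paradigm reasoning."""
    stages = "\n".join(
        f"{i + 1}. [{p}] {PARADIGMS[p]}" for i, p in enumerate(chain))
    return (f"Problem: {problem}\n"
            f"Solve by working through these stages in order:\n{stages}\n"
            f"Finish with 'Answer:' on its own line.")

print(build_prompt("Prove that n^2 + n is even for every integer n.",
                   ["NL", "Sym", "Code"]))
```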

Chain-of-Reasoning: Towards unified mathematical reasoning in LLMs via a multi-paradigm perspective

Paper link:

https://arxiv.org/pdf/2501.11110

Figure 5: Reasoning process under different paradigms

In addition, existing large models mainly focus on improving reasoning capabilities for specific tasks or domains (such as mathematics or programming) and have not fully addressed generalization across varied reasoning tasks. To enhance this generalization, researchers suggest searching within the action space of high-level abstract plans, rather than being limited to task-specific action spaces, which usually restrict generalization.

By analyzing previous research that used large models to generate reasoning plans and specific task solutions to improve reasoning ability, researchers found that specific task solutions are closely related to specific task skills. In contrast, plans represent abstract thinking for solving problems, such as deciding which knowledge to apply or how to decompose a problem, which helps the model develop broader, task-agnostic capabilities, thereby improving generalization ability.

Dr. Xueting Han, Principal Researcher at Microsoft Research Asia, states, "Humans have some common strategies when thinking about problem-solving. For example, breaking down complex problems into subproblems, extracting key parts from abundant information, and recalling and retrieving existing knowledge based on specific information, like theorems in mathematics or algorithms in programming. By learning these problem-solving strategies, when encountering new problems, large models will also form a similar thinking process to human problem-solving, thus solving problems more effectively."

Based on this, researchers proposed the Critical Plan Step Learning (CPL) method, which consists of two key components: plan-based search and learning critical plan steps through Step-wise Advantage Preference Optimization (Step-APO). Plan-based search uses Monte Carlo Tree Search to explore different plan steps in multi-step reasoning tasks; by building a plan tree, it helps the model acquire task-agnostic skills and improves generalization across tasks. Step-APO integrates the advantage estimates of step-wise preference pairs obtained from Monte Carlo Tree Search, enabling the model to learn fine-grained preferences between steps, identify critical plan steps, and weaken the influence of incorrect steps, thereby strengthening overall reasoning ability.
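As a rough illustration of the idea, below is a schematic, advantage-weighted DPO-style preference loss. This is an assumption-laden sketch of how MCTS advantage gaps might modulate a preference objective, not the paper's exact Step-APO formulation.

```python
import torch
import torch.nn.functional as F

def step_apo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                  adv_w, adv_l, beta=0.1):
    """logp_*  : policy log-probs of preferred (w) / dispreferred (l) steps.
    ref_logp_* : frozen reference-model log-probs of the same steps.
    adv_*      : MCTS advantage estimates for each step (assumed inputs).
    Larger advantage gaps push the preference margin harder, so the model
    focuses on critical plan steps and down-weights marginal ones."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    weight = (adv_w - adv_l).clamp(min=0.0)   # emphasis from MCTS advantages
    return -(weight * F.logsigmoid(margin)).mean()

# Toy tensors standing in for per-step log-probabilities and advantages.
loss = step_apo_loss(
    logp_w=torch.tensor([-1.0, -0.5]), logp_l=torch.tensor([-2.0, -1.5]),
    ref_logp_w=torch.tensor([-1.2, -0.7]), ref_logp_l=torch.tensor([-1.8, -1.4]),
    adv_w=torch.tensor([0.9, 0.6]), adv_l=torch.tensor([0.1, 0.2]))
print(loss)
```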

CPL: Critical plan step learning boosts LLM generalization in reasoning tasks

Paper link:

https://arxiv.org/pdf/2409.08642

Figure 6: CPL schematic diagram


Continuously Expanding the Boundaries of Reasoning Ability and Addressing Large Model Challenges

From strengthening mathematical reasoning to enhancing the generalization of model reasoning, from intuitive quick answers to answers derived from deep thinking, researchers at Microsoft Research Asia continue to probe the boundaries of large model reasoning. By introducing new perspectives and methods, they have not only advanced the frontier of this field but also driven new progress in related research. As the performance and reliability of large language models improve, the application scope of artificial intelligence in real-world scenarios keeps expanding, providing strong technical support for fields such as intelligent education, intelligent healthcare, and intelligent scientific research.

However, we must also recognize that current large models still face numerous challenges, such as hallucination issues when generating content and insufficiently rigorous reasoning processes. These issues can lead to serious consequences in specific application scenarios. For example, in scientific research, deviations in model reasoning can lead to incorrect research directions, causing significant resource waste; in the healthcare field, inaccurate information can directly endanger patients' lives.

In addition to the research mentioned above, researchers at Microsoft Research Asia are also trying to improve the reasoning capabilities of artificial intelligence from many different angles, including using LLMs to automatically generate correctness proofs for Rust code, designing methods that match the unique features of the Verus verification tool; proposing the SAFE framework to address the data scarcity problem in formal verification of Rust code; introducing the Alchemy framework to construct formal theorems by varying symbols, alleviating the data insufficiency problem in Neural Theorem Proving (NTP), and so on. These achievements provide more possibilities for enhancing the reasoning capabilities of large language models and offer rich ideas for future research directions.

Other related research:

AutoVerus: Automated proof generation for rust code

https://arxiv.org/abs/2409.13082

Automated proof generation for rust code via self-evolution

https://arxiv.org/pdf/2410.15756v1

Alchemy: Amplifying theorem-proving capability through symbolic mutation

https://arxiv.org/pdf/2410.15748

Mutual reasoning makes smaller LLMs stronger problem-solvers

https://arxiv.org/pdf/2408.06195
