QbitAI Think Tank | Official Account QbitAI
In 2023, while the industry was still focused on Scaling Law, constantly pushing the boundaries of parameter and data scale, Dr. Li Zhang's team at Microsoft Research Asia chose a different path.
Long before OpenAI o1 was released, Dr. Zhang's team began exploring the deep reasoning capabilities of large models.
"System 2," a term originally from cognitive science, was first introduced to the large model field by her and her team.
Recently, they enabled a 7B model to achieve o1-level mathematical reasoning capabilities through the Monte Carlo search algorithm.
The release of rStar-Math sparked widespread discussion within and outside academic circles.
△ rStar-Math paper
In the current landscape dominated by PPO/GRPO reinforcement learning approaches, what new possibilities will their work bring?
For this "Large Model Innovative Architecture" themed interview, QbitAI invited Dr. Li Zhang, lead author of rStar-Math and Principal Researcher at Microsoft Research Asia, to discuss breaking the IQ ceiling of large models, reward models, and the story behind System 2.
Li Zhang is a Principal Researcher in the System Research Group at MSRA and the project leader for Microsoft's LongRoPE and rStar series of works.
△ Dr. Li Zhang, Principal Researcher, System Research Group, Microsoft Research Asia
Below is a transcript of the interview between QbitAI and Dr. Li Zhang, lead author of rStar-Math and Principal Researcher at Microsoft Research Asia:
IQ Breakthrough
QbitAI: Could you briefly introduce the core work of rStar-Math? Why did you choose this research direction initially?
MSRA Li Zhang: We have consistently focused our research on the broad direction of improving the IQ of large language models, specifically in two areas:
One is enabling models to possess infinite and persistent memory, and the other is enhancing their deep logical reasoning capabilities.
Our rStar-Math work, published in January 2025, is, simply put, the first publicly released work to use the Monte Carlo search algorithm to bring a 7B model's mathematical reasoning close to OpenAI's o1 level.
When we started this work, the industry trend was still centered on scaling law, believing that larger model sizes and more data would lead to better results.
However, we found that although new and larger models were being released regularly, their deep mathematical reasoning ability had not improved significantly.
QbitAI: Did you start working on System 2 before o1 was released in 2024?
MSRA Li Zhang: Yes, it was around May 2023.
When ChatGPT came out in November 2022, everyone was astonished, but we found that it still fell short in some aspects.
As researchers, we focus more on logical reasoning ability, so it was natural for us to hope that large language models could possess strong reasoning capabilities like humans.
Our initial ideas were twofold:
First, we wanted models to be able to use a very long "scratchpad" when solving problems, which led to our LongRoPE work, extending the long-text reasoning window of large models.
△ LongRoPE paper published in February 2024
Second, to effectively utilize this scratchpad, it requires a deep reasoning approach similar to humans, which led to the rStar series of works.
△ rStar-Math's predecessor, rStar paper published in August 2024
QbitAI: Who was the first to introduce the term "System 2" from human cognitive science into the large model domain?
MSRA Li Zhang: It might have been us. More accurately, when we wanted to define this capability, we found this analogous term from human cognitive science.
QbitAI: Why did you believe at that time that System 2 would be a very important research direction for the future?
MSRA Li Zhang: We believe that for large language models to truly be deployed and generalized, other capabilities might be easier to address, but IQ or reasoning ability is the most critical factor.
Looking at top talents across various industries, their professional fields differ—some excel at solving math problems, some at writing code, some at writing or public speaking—but fundamentally, they all possess strong reasoning abilities, which is IQ.
With this foundation, applying large models to other tasks, enabling their deployment, or enhancing social productivity will become much simpler.
△ Diagram distinguishing System 1 and System 2 (fast and slow thinking)
QbitAI: During the research process of rStar-Math, the model itself spontaneously exhibited self-reflection capabilities. What does this signify?
MSRA Li Zhang: This was actually unintentional, an unexpected gain. Looking back, it might indirectly validate that self-reflection is a key capability for improving the IQ of large models.
This self-correction or self-reflection is a way of thinking that humans use in many activities; it can be said to be an essential ability.
We didn't deliberately set out to replicate the "aha moment," but it did serve as a trigger at the time; many teams tried to reproduce it and eventually found that reinforcement learning could elicit this capability.
QbitAI: What is key to stimulating self-reflection capabilities in large models?
MSRA Li Zhang: I personally believe that large model pre-training data inherently contains information about the human self-reflection process.
A large amount of data on the internet naturally includes such content because it's a fundamental advanced thinking mode for humans.
After large models are pre-trained and memorize these patterns, reinforcement learning or Monte Carlo search algorithms will activate this capability.
In the process of solving complex problems, if the model finds that using self-reflection leads to better results, the Monte Carlo algorithm will mark those trajectories as high-quality data;
with reinforcement learning, if the model finds that self-reflection leads to correct answers, it will assign a higher reward to that strategy. In both cases, the end result is that this capability emerges in the model.
△ rStar-Math demonstrating self-reflection capability
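To make the mechanism described above concrete, here is a minimal sketch, with purely illustrative names (Rollout, label_rollouts, gold_answer) that are not from rStar-Math itself: rollouts whose reasoning reaches the verified answer are kept as high-quality training data, so patterns that show up in those rollouts, such as self-reflection steps, get reinforced in the next round of training.

```python
# Illustrative sketch; the names here (Rollout, label_rollouts) are not from rStar-Math.
from dataclasses import dataclass

@dataclass
class Rollout:
    steps: list          # intermediate reasoning steps, possibly including self-reflection
    final_answer: str

def label_rollouts(rollouts, gold_answer):
    """Keep only rollouts whose final answer matches the verified answer as high-quality data."""
    return [r for r in rollouts if r.final_answer == gold_answer]

# If self-reflection steps appear more often in the kept (correct) rollouts,
# they get reinforced when the filtered trajectories are used for the next training round.
```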
Monte Carlo Breakthrough
QbitAI: rStar-Math generated a significant response after its release. Are there any particularly memorable feedbacks?
MSRA Li Zhang: Indeed, rStar-Math received more attention than our previous work, completely exceeding my expectations.
I think it might be because o1 had been out for several months at that time, but there wasn't any public report clearly explaining how it achieved its results.
I know many people were also using similar Monte Carlo search algorithms, but none achieved o1-level performance.
We happened to achieve it, and there were some innovations in our methodology, which might be why it suddenly gained attention.
It felt a bit like the work broke out of its own circle. Academics usually only follow work in their own direction, but at the time many colleagues and friends outside this field messaged me to say that someone had seen our work and wanted to get in touch; that is very rare.
Many media outlets, both domestic and international, also sought interviews with us. There was extensive discussion on X, with some giving high praise, finding it "very incredible" that a 7B model could achieve OpenAI o1-level performance.
Some also debated whether 2025 might be the era of small models, sparking a new round of debate about scaling law versus other approaches.
△ Keras founder François Chollet's evaluation of rStar-Math
QbitAI: Did you encounter any skepticism?
MSRA Li Zhang: Of course, there were roughly two stages.
Initially, before DeepSeek R1 and Kimi 1.5 came out, the main skepticism was "how can a small model be so powerful?" and "can this method generalize to other tasks?" So later, we open-sourced the code and data.
Later, after DeepSeek R1 and Kimi 1.5 were released, some began to discuss whether Monte Carlo search was truly necessary to reproduce OpenAI o1's performance. These questions are all reasonable, as everyone has different perspectives.
QbitAI: What is the fundamental difference between the reward model in Monte Carlo search algorithm and the traditional Best of N reward model?
MSRA Li Zhang: The fundamental difference is that the reward model in the Monte Carlo search algorithm works at the step level: it is a "process reward model."
Best-of-N uses an "outcome reward model" that looks only at the final result, not the process, which is why the Monte Carlo search algorithm performs better.
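As a rough illustration of the difference (the scoring functions below are hypothetical stand-ins for learned reward models, not anything from the paper): an outcome reward model only ranks complete solutions by their final answer, whereas a step-level process reward model scores every intermediate step, so a search procedure can prune weak partial solutions early.

```python
# Illustrative only: score_answer and score_step stand in for learned reward models.

def best_of_n(candidate_solutions, score_answer):
    """Outcome reward model: rank N complete solutions by their final answer only."""
    return max(candidate_solutions, key=lambda sol: score_answer(sol["final_answer"]))

def partial_solution_value(steps, score_step):
    """Process reward model: every intermediate step gets a score, so search can
    expand promising branches and prune weak partial solutions early."""
    step_scores = [score_step(s) for s in steps]
    return min(step_scores)  # one common aggregation: a chain is only as strong as its weakest step
```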
QbitAI: Why does the Monte Carlo search algorithm perform so well on small models? Is its effectiveness limited to small models?
MSRA Li Zhang: Its excellent performance on small models actually indicates its great potential.
We discovered the enormous potential of the Monte Carlo algorithm when we released the initial version of rStar in August 2024.
At that time, we didn't perform any training, not even for the reward model; we simply applied the Monte Carlo search algorithm to small models and found it worked remarkably well, even comparable to models that had undergone specialized fine-tuning.
Because System 2 is a more advanced thinking mode, there's a certain threshold, and the policy model cannot be too weak. Small models are inherently weaker as policy models.
Therefore, to address the weaknesses of small models, such as hallucination, the only thing we did was add code-augmented CoT, which let the Monte Carlo search algorithm perform to its full potential.
△ rStar-Math using code-augmented CoT example
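As a rough idea of what a code-augmented reasoning step can look like (the format below is an assumption for illustration, not the exact rStar-Math template): each step is written as executable Python alongside the natural-language claim, and running the code verifies the intermediate result, so steps whose code fails or contradicts the claim can be filtered out.

```python
# One reasoning step, written as executable Python (format is an illustrative assumption).
# Claim: "The sum of the first 10 positive odd numbers equals 10^2 = 100."
odd_numbers = [2 * k + 1 for k in range(10)]  # 1, 3, 5, ..., 19
result = sum(odd_numbers)
assert result == 10 ** 2  # executing the step confirms the intermediate claim
print(result)             # 100
```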
QbitAI: Was the Monte Carlo search algorithm a mainstream approach before your work was published?
MSRA Li Zhang: It wasn't very mainstream before, but there were indeed some academic works starting to focus on this direction.
QbitAI: After o1 and your work were published, has this method become more mainstream?
MSRA Li Zhang: I haven't seen that trend yet; most people are still focusing on reinforcement learning. However, I know some people in other fields are also trying the Monte Carlo search algorithm.
Because of the attention our work received, some third parties contacted us; for example, a company wants to use the model for AI math education, and some well-known international labs hope to collaborate on code and mathematical-proof tasks.
Interestingly, a smart car manufacturer also contacted us, hoping to replicate our algorithm on their models and asking for our help in answering some questions.
QbitAI: Do you see rStar-Math being adopted in industrial-grade models? In general scenarios, would the search space for the Monte Carlo search algorithm be too large?
MSRA Li Zhang: For very simple problems, there's indeed no need for such a complex method.
The Monte Carlo search algorithm gained prominence initially due to AlphaGo, and it might be inherently more suited for complex tasks.
△ Monte Carlo search algorithm illustration in AlphaGo
For general tasks, it can be used but isn't strictly necessary; a single response from an ordinary large model may already be good enough, without multiple rounds of System 2 search.
Searching multiple times might find a better answer than a single response, but the difference may not be large, so from a cost-effectiveness standpoint the need isn't very strong.
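For readers unfamiliar with the algorithm discussed above, here is a minimal, generic sketch of the selection rule at the heart of Monte Carlo Tree Search, the UCT formula popularized by AlphaGo-style systems; the class and variable names are illustrative and not tied to rStar-Math.

```python
import math
from dataclasses import dataclass

@dataclass
class Node:
    visits: int = 0
    value_sum: float = 0.0

def uct_select(children, c: float = 1.4) -> Node:
    """Pick the child that balances exploitation (mean value) and exploration (visit count)."""
    total_visits = sum(child.visits for child in children)

    def uct(child: Node) -> float:
        if child.visits == 0:
            return float("inf")  # always try unvisited children first
        exploit = child.value_sum / child.visits
        explore = c * math.sqrt(math.log(total_visits) / child.visits)
        return exploit + explore

    return max(children, key=uct)
```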
QbitAI: Will your next research focus more on long text or deep reasoning?
MSRA Li Zhang: Regarding long text, we previously developed LongRoPE, which provided an algorithmic solution to extend the text window of pre-trained models indefinitely.
It has also been validated on Microsoft's Phi series models.
△ Phi-3 Technical Report indicates use of LongRoPE
However, truly extending to such great lengths still requires solving efficiency issues, as well as problems of long-text data and computational resources, which are not my focus at this stage.
We are currently more focused on improving reasoning capabilities, i.e., deep reasoning.
QbitAI: Will you continue to research reward models?
MSRA Li Zhang: Next, we might do three things.
First is to continue optimizing the reward model.
Second is to further improve the policy model's capabilities, hoping it can learn more human-like advanced reasoning methods, such as proactive questioning or other reasoning methods beyond self-reflection.
Third is to expand the task domains. In addition to mathematics, we also want to expand to high-difficulty code reasoning tasks, ultimately achieving general deep reasoning capabilities.
QbitAI: Is solving math problems always the task that requires the highest IQ?
MSRA Li Zhang: I believe so. Mathematical reasoning is fundamentally the task type in large language models that most demands programmatic execution ability and logical rigor.
Some proof problems have taken mathematicians hundreds of years to resolve, so I personally believe mathematical reasoning represents a ceiling of intelligence.
QbitAI: There's a saying that research on improving mathematical ability is more prevalent because answers are unique, data is complete, and verification is easy. Does mathematical ability necessarily represent the ceiling of intelligence?
MSRA Li Zhang: Mathematical tasks are indeed easier to start researching, and their results are easier to verify, but truly improving mathematical reasoning ability is not easy.
For example, on FrontierMath, a high-difficulty math benchmark designed by a group of mathematicians, the strongest models currently reach only about 2% accuracy.
△ Performance of mainstream SOTA models on FrontierMath
Current mathematical research is more prevalent because data is relatively abundant, conditions are more mature, and judgment of quality is clearer.
For some non-proof problems, you don't even need to check the steps, only whether the final answer is correct, which may give the impression that the mathematical abilities of large models are easy to develop.
For other complex human tasks, research conditions might not be mature enough yet, so it feels like everyone is focusing on mathematical ability.
But truly enabling large models to become trustworthy assistants for mathematicians, that road is still very long.
Paper: https://arxiv.org/abs/2501.04519
— End —
Recommended reading from the "Large Model Innovative Architecture" special series:
"Transformer is like a gasoline car, attention-free is the new energy" | Interview with RWKV founder Peng Bo
Mobile phones achieve GPT-level intelligence, a more extreme sparse technology than MoE: saves memory without reducing effectiveness | Interview with FaceMind & Tsinghua's Xiao Chaojun
MiniMax bets on linear attention, making millions of tokens of long text consume only 1/2700 of the computing power | Interview with MiniMax-01 architecture head Zhong Yiran
Large models run smoothly on Raspberry Pi! Enabling terminals with autonomous learning and memory capabilities | Interview with RockAI CEO Liu Fanping
Think Tank in Progress | Large Model Innovative Architecture Special Research Report
Innovation at the model architecture layer is unleashing a profound transformation in artificial intelligence. We firmly believe that both improvements within the Transformer architecture and exploration of non-Transformer architectures are important paths toward AGI. This conversation is the second in a series of specialized dialogues. QbitAI Think Tank sincerely invites other innovators in large model architecture across the industry to connect and share cutting-edge insights and best practices. For cooperation, please contact us.