The MLNLP community is a well-known machine learning and natural language processing community in China and abroad, reaching NLP master's and doctoral students, university faculty, and industry researchers.
The community's vision is to foster communication and progress among academia, industry, and enthusiasts in natural language processing and machine learning, both in China and internationally, with a particular focus on helping students who are just getting started.
Source | Synced Review
Following the popularity of DeepSeek-R1, the R1-style paradigm of reward-based training has sparked a reasoning boom across many fields. Rule-based outcome rewards are simple to implement and unambiguous to judge. But is that truly enough?
In reasoning tasks, if we reward a model only for the "correctness of its result," it is very likely to learn to "take shortcuts" to the answer.
Under this regime, the model never fully establishes a correct thinking strategy. It may even repeatedly reinforce a flawed strategy and drift further off course because a "lucky guess" happened to be rewarded.
To address this issue, a CUHK team, in collaboration with Shanghai AI Lab, has released the multimodal reasoning model SophiaVL-R1. It makes a key change to the R1-style reinforcement learning training framework: instead of rewarding only correct results, it also brings the "thinking process" into the reward system.
Paper Link: https://arxiv.org/abs/2505.17018
Project Address: https://github.com/kxfan2002/SophiaVL-R1
This design not only helps the model learn more general and reliable reasoning strategies, it also significantly improves generalization: on multiple mathematical and general multimodal benchmarks, SophiaVL-R1-7B even outperforms LLaVA-OneVision-72B, a model with 10 times as many parameters. The research team has open-sourced all models, data, and code.
A Good Model Needs Its Thinking Process Scored, Too
The key breakthrough of SophiaVL-R1 is its "thought reward" mechanism: instead of checking only whether the final answer is correct, it also evaluates whether the model's entire reasoning process is reasonable, coherent, and reliable.
The research team carefully built a dataset for scoring thinking processes, covering diverse thinking patterns and error types, and used it to train a "thought scoring model" that evaluates a reasoning trace from multiple perspectives and outputs an overall score.
For example, a reasoning process that reaches the correct answer but whose intermediate logic is badly disjointed, or even outright nonsensical, might receive a thought score of only 0.3; another process that also ends up choosing B, but whose reasoning is meticulous and whose derivation is clear, might score 0.9. It is like a teacher grading an exam: the final result counts, but so do the "process points."
This approach not only improves the model's reasoning quality; more importantly, it teaches the model "how to think" rather than "how to guess."
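As a rough illustration of the idea (a minimal sketch, not the authors' released implementation; the blending weight and helper names below are assumptions), the rule-based outcome reward and the thought score can be pictured as two terms combined into a single training reward:

```python
def outcome_reward(predicted_answer: str, gold_answer: str) -> float:
    """Rule-based result reward: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if predicted_answer.strip() == gold_answer.strip() else 0.0


def combined_reward(predicted_answer: str,
                    gold_answer: str,
                    thought_score: float,
                    w_thought: float = 0.5) -> float:
    """Blend the strict outcome reward with a [0, 1] thought score.

    `thought_score` is assumed to come from a separate thought-scoring model
    that rates how coherent and sound the reasoning trace is.
    """
    return outcome_reward(predicted_answer, gold_answer) + w_thought * thought_score


# A correct answer reached through sloppy reasoning (thought score 0.3)
# earns less total reward than a correct answer with a clear derivation (0.9).
print(combined_reward("B", "B", thought_score=0.3))  # 1.15
print(combined_reward("B", "B", thought_score=0.9))  # 1.45
```

Under this kind of blend, the sloppy-but-correct trace from the example above earns less total reward than the careful one, which is exactly the pressure the thought reward is meant to apply.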
SophiaVL-R1's "Reward Reform"
However, bringing the "process" into the reward mechanism does not mean the two signals can simply be added together.
Because the model's thinking process is free-form text, it can easily "feign seriousness": for example, it might produce a long, seemingly reasonable chain of "logic" that is actually repetitive filler or quietly hides a reasoning flaw. This kind of reward hacking is a very common problem in reinforcement learning.
To address this pain point, SophiaVL-R1 introduces a training algorithm called Trust-GRPO. Its core idea is to use within-group information from GRPO to judge how trustworthy the thought rewards are.
For each question, the method compares the thought rewards assigned to rollouts with correct answers against those with incorrect answers. If incorrect answers are receiving unusually high thought rewards, it automatically lowers the credibility weight of the thought reward, improving the overall stability and trustworthiness of training. An illustrative sketch follows.
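Here is a minimal sketch of this within-group trust idea (the specific comparison rule and the trust function below are illustrative assumptions, not the paper's exact formulation):

```python
from statistics import mean


def trust_weight(thought_scores, is_correct):
    """Return a trust factor in [0, 1] for a GRPO group's thought rewards."""
    correct = [s for s, ok in zip(thought_scores, is_correct) if ok]
    wrong = [s for s, ok in zip(thought_scores, is_correct) if not ok]
    if not correct or not wrong:
        return 1.0  # no contrast available within the group; keep full trust
    gap = mean(correct) - mean(wrong)
    # If wrong answers get thought scores as high as (or higher than) correct
    # ones, the thought reward looks hackable; scale trust down toward 0.
    return max(0.0, min(1.0, 0.5 + gap))


def trusted_rewards(outcome, thought_scores, is_correct, w_thought=0.5):
    """Combine outcome rewards with trust-weighted thought rewards for a group."""
    t = trust_weight(thought_scores, is_correct)
    return [o + w_thought * t * s for o, s in zip(outcome, thought_scores)]


# Example group of four rollouts for the same question: the two wrong answers
# carry suspiciously high thought scores, so the trust factor shrinks.
outcome = [1.0, 1.0, 0.0, 0.0]    # rule-based correctness of the final answers
thought = [0.9, 0.8, 0.85, 0.9]   # scores from the thought-scoring model
print(trusted_rewards(outcome, thought, [True, True, False, False]))
```

The design choice here is that the trust factor is computed per question from the group itself, so no extra supervision is needed to detect when the thought reward is being gamed.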
Experimental Results
On multiple commonly used evaluation benchmarks (MMMU, MME, MathVista, etc.), SophiaVL-R1-7B demonstrates very strong reasoning and generalization. It is highly competitive against GRPO, SFT+GRPO, and PRM-based methods, and on several multimodal math and general evaluation datasets it matches or even surpasses LLaVA-OneVision-72B, a model with 10 times as many parameters.
This highlights a crucial point: reasoning ability is built through the right training paradigm, and SophiaVL-R1's results are a strong illustration of that.
Ablation experiments further show that every component of SophiaVL-R1 contributes to its performance.
The training curves tell a similar story: SophiaVL-R1 not only reaches a better final level but also gets there faster, demonstrating the value of the thought reward signal and the importance of the Trust-GRPO algorithm.
The paper also presents several SophiaVL-R1 reasoning examples, showing that the model can produce high-quality reasoning traces.
For more details, please refer to the original paper.
Technical Exchange Group Invitation
Scan QR code to add assistant on WeChat
Please note: Name-School/Company-Research Direction
(e.g., Xiao Zhang-Harbin Institute of Technology-Dialogue System)
You can apply to join technical exchange groups on Natural Language Processing, PyTorch, and other topics
About Us
The MLNLP community is a grassroots academic community built jointly by machine learning and natural language processing scholars in China and abroad. It has since grown into a well-known machine learning and natural language processing community both domestically and internationally, and aims to promote progress among academia, industry, and enthusiasts in these fields.
The community provides an open exchange platform for practitioners around further study, employment, and research. Everyone is welcome to follow and join us.