Bridging the Gap: LUFFY, a New Reinforcement Learning Paradigm for AI Reasoning



Imagine you are preparing for a high-level math competition. If you only repeatedly memorize the standard answers to past problems without ever trying to solve them yourself, you might be helpless when facing new problem types; conversely, if you work in isolation, only relying on your own trial and error without referencing the problem-solving experience of teachers and experts, your progress will be exceptionally slow. This is analogous to the two long-standing extremes in AI model training: "Imitation Learning" which only copies demonstrations but lacks self-practice, and "Reinforcement Learning" which only explores but does not learn from existing experience.


These two strategies, "learning without practice" and "practicing without learning," each have drawbacks: the former often learns quickly but generalizes poorly, while the latter may explore diligently but is inefficient. So, is there a way to have the best of both worlds, allowing models to both learn from expert experience and maintain independent exploration? Recently, a research team from Shanghai AI Laboratory, together with Westlake University, Nanjing University, and The Chinese University of Hong Kong, proposed a new reinforcement learning paradigm: LUFFY (Learning to reason Under oFF-policY guidance).




Paper Link: https://arxiv.org/abs/2504.14945
Code Repository: https://github.com/ElliottYan/LUFFY


The core idea of LUFFY is to simultaneously leverage expert reasoning trajectories for learning (off-policy guidance) and continue independent trial-and-error exploration (on-policy reasoning) during training, thereby achieving the goal of "learning while practicing, and applying what is learned." Experiments show that LUFFY achieves an average performance leap of +7.0 points across multiple mathematical reasoning challenge tasks and demonstrates significant generalization ability on out-of-distribution tasks.



Figure 1. Overall performance on six competition-level mathematical reasoning benchmarks. Across AIME 2024, AIME 2025, AMC, MATH-500, Minerva Math, and OlympiadBench, LUFFY reaches an average accuracy of 49.6%, an improvement of over +7.0 points compared with existing Zero-RL methods.


Upon release, the work quickly climbed to the top of the Hugging Face community's Daily Papers list, and the arXiv preprint sparked lively discussion among researchers.




The Dilemma of Imitation Learning vs. Reinforcement Learning


Current mainstream large model reasoning training methods can be divided into two categories:


Imitation Learning (SFT): Models learn by referencing expert solution trajectories, similar to "copying solutions from the answer key." While they can quickly learn known methods, they may struggle with new problems and lack autonomy.



Figure 2. Imitation Learning (SFT): Imitating high-quality reasoning trajectories generated by expert models.


Reinforcement Learning (Zero-RL): Models receive reward feedback through continuous trial and error and optimize their policy. While this brings some generalization ability, a weak initial policy easily falls into local optima and struggles to improve beyond a certain point.



Figure 3. Reinforcement Learning: the model interacts with the environment (e.g., a verifier) to receive feedback and continuously optimize its policy.


Each method has its strengths, but also clear limitations. LUFFY was proposed precisely to break this either/or dilemma, combining the advantages of both to address the core problem of letting models "learn deeply and practice broadly."
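To make the contrast concrete, the minimal Python sketch below writes the two training signals side by side. It is illustrative only: the random tensors stand in for the per-token log-probabilities a real policy model would compute, and the binary verifier reward and baseline value are hypothetical.

```python
# Illustrative contrast between the SFT and Zero-RL training signals.
# The random tensors stand in for per-token log-probabilities that a real
# policy model would produce; rewards come from a verifier in practice.
import torch

torch.manual_seed(0)

# --- Imitation learning (SFT): maximize likelihood of an expert trajectory ---
expert_logp = torch.rand(64).log()          # stand-in per-token log-probs
sft_loss = -expert_logp.mean()              # cross-entropy on the demonstration

# --- Zero-RL: reinforce the model's own rollout using a verifier reward ------
rollout_logp = torch.rand(64).log()         # log-probs of self-sampled tokens
reward = 1.0                                # verifier accepts the final answer
baseline = 0.4                              # e.g. a running mean reward
advantage = reward - baseline
rl_loss = -advantage * rollout_logp.sum()   # REINFORCE-style policy gradient

print(f"SFT loss: {sft_loss.item():.3f}   RL loss: {rl_loss.item():.3f}")
```

SFT pulls the model toward whatever the demonstration did, whether or not the model could have produced it on its own; Zero-RL only ever reinforces what the model already samples, which is why a weak starting policy can stall.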


LUFFY's Intuition and Mechanism: Expert Demonstration, Model Exploration


The key idea of LUFFY is to introduce "off-policy guidance" into the reinforcement learning process: reasoning trajectories from stronger models or experts serve as guidance. This differs from the current mainstream reinforcement learning paradigm, which optimizes only against the model's own policy.


This is like a student who, while using classic examples provided by a teacher, continues to independently complete practice problems. In LUFFY, the model is trained by mixing two types of trajectories: one is the online reasoning process generated by its current policy (on-policy), and the other is offline demonstrations borrowed from stronger agents (off-policy). These two types of trajectories are used together for policy optimization, enabling the model to "learn while practicing."
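The following is a minimal, self-contained sketch of this mixing under our own simplifying assumptions (a binary verifier reward and GRPO-style group normalization); it is not the authors' implementation. Per prompt, the model's own rollouts and the external expert demonstrations are placed in one group and share the same advantage computation.

```python
# Sketch of a mixed on-/off-policy rollout group with GRPO-style advantages.
# Rewards, group sizes, and the binary verifier are hypothetical stand-ins.
from dataclasses import dataclass
from typing import List
import statistics

@dataclass
class Trajectory:
    reward: float        # e.g. 1.0 if a verifier accepts the final answer
    off_policy: bool     # True for expert demonstrations, False for own rollouts

def group_advantages(group: List[Trajectory]) -> List[float]:
    """Normalize rewards within one prompt's group, as in GRPO."""
    rewards = [t.reward for t in group]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Hypothetical group: five failed own rollouts, one successful own rollout,
# and two correct expert demonstrations mixed into the same group.
group = (
    [Trajectory(reward=0.0, off_policy=False) for _ in range(5)]
    + [Trajectory(reward=1.0, off_policy=False)]
    + [Trajectory(reward=1.0, off_policy=True) for _ in range(2)]
)

for traj, adv in zip(group, group_advantages(group)):
    kind = "off-policy demo" if traj.off_policy else "on-policy rollout"
    print(f"{kind:18s} reward={traj.reward:.1f} advantage={adv:+.2f}")
```

Correct trajectories, whether the model's own or the expert's, receive positive advantages. When all of the model's own rollouts fail, the expert demonstrations become the only positively weighted members of the group, which is exactly the "learn the key steps from the expert when your own attempt fails" behavior described above.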



Figure 4. LUFFY: A reasoning learning framework for learning while practicing. LUFFY introduces external high-quality reasoning trajectories into the reinforcement learning framework. Through the "policy shaping" mechanism, it combines the advantages of its own attempts (on-policy) and expert demonstrations (off-policy). When the model's own reasoning fails, it learns key steps from expert demonstrations; when it performs well, it maintains independent exploration. This mechanism guides the model to focus on low-probability but critical actions while maintaining exploration ability, thereby achieving continuous evolution and generalization of reasoning capabilities.


Technical Highlights: Mixed-Policy Training and Policy Shaping


LUFFY's implementation is based on the GRPO algorithm framework and revolves around two core mechanisms:


1. Mixed-Policy Training: on-policy trajectories and off-policy demonstrations are used simultaneously, guiding the model toward high-reward actions while retaining its own effective attempts.


2. Policy Shaping Function (Figure 6): a non-linear weighting mechanism strengthens the learning of key steps and prevents premature convergence and policy-entropy collapse, so that exploration is maintained. Figure 5 shows the non-linear weight that policy shaping places on gradient updates and its effect on the model's exploration.



Figure 5. Effect of policy shaping in LUFFY. Left: Comparison of policy entropy during training. Middle: Weight distribution of the loss function based on decision probability for different methods. Right: Comparison of gradient weighting based on decision probability. LUFFY enhances the gradient response to rare (low probability) but important actions through non-linear weighting, thus guiding the model to learn deeper reasoning patterns more effectively from off-policy demonstrations.
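The small sketch below illustrates the kind of reweighting the middle and right panels describe. It assumes, purely for illustration, a shaping function of the form f(p) = p / (p + γ) and treats the off-policy behaviour probability as a constant; the exact function and constants used in the paper may differ.

```python
# Illustrative policy shaping: compare the effective per-token gradient weight
# (the factor multiplying grad log pi_theta) with and without shaping.
# Assumed shaping form: f(p) = p / (p + gamma); gamma = 0.1 is a made-up value.
GAMMA = 0.1

def shaped_weight(p: float, gamma: float = GAMMA) -> float:
    # Chain rule: d f(p)/d theta = f'(p) * p * d log pi / d theta,
    # with f'(p) = gamma / (p + gamma)**2, so the weight is gamma*p/(p+gamma)**2.
    return gamma * p / (p + gamma) ** 2

print(f"{'p':>6} {'plain':>8} {'shaped':>8} {'boost':>7}")
for p in (0.9, 0.5, 0.1, 0.01):
    plain = p                      # weight under the unshaped ratio (f = identity)
    shaped = shaped_weight(p)
    print(f"{p:6.2f} {plain:8.3f} {shaped:8.3f} {shaped / plain:6.2f}x")
```

The boost grows as p shrinks: low-probability tokens in the expert demonstration, which an unshaped objective would barely update, receive a relatively much larger gradient share, matching the behaviour plotted on the right of Figure 5.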



Figure 6. The policy shaping function f() can be seen as importance sampling under a regularization constraint, encouraging the model to focus on low-probability but potentially important action decisions.
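For readers who want the calculation behind this weighting, here is a brief per-token comparison, again assuming the illustrative form f(x) = x/(x+γ) and folding the off-policy behaviour probability into a constant; this is our reading of the mechanism, not a verbatim reproduction of the paper's derivation.

```latex
% Per-token gradient of the off-policy surrogate, with p = \pi_\theta(a_t \mid s_t)
% and A the (group-normalized) advantage of the demonstration.
\begin{aligned}
\text{unshaped:}\quad & \nabla_\theta \big( p \, A \big)
  \;=\; p \,\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A,\\[4pt]
\text{shaped:}\quad & \nabla_\theta \big( f(p)\, A \big)
  \;=\; \frac{\gamma\, p}{(p+\gamma)^2}\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A,
  \qquad f(x)=\frac{x}{x+\gamma}.
\end{aligned}
```

The unshaped weight p vanishes for tokens the current policy finds unlikely, so precisely the expert steps that differ most from the model's habits would be learned most weakly; the shaped weight is larger by a factor approaching 1/γ for small p, which is the "focus on low-probability but critical actions" behaviour the captions describe.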


Experimental Results: Learn It, Use It, and Generalize



Figure 7. Training dynamics: early in training, the LUFFY model gradually adapts to the external guidance, and the length of its reasoning paths approaches that of the off-policy trajectories, showing effective imitation and adjustment. Throughout training, LUFFY maintains a consistently high policy entropy, indicating sustained exploration; in contrast, the entropy of conventional on-policy RL collapses early, reducing its ability to explore.


Across six public mathematical reasoning benchmarks, LUFFY achieved an average improvement of +7.0 points compared to existing Zero-RL methods and demonstrated leading performance on several out-of-distribution test sets.



Figure 8. Performance of LUFFY on six high-difficulty mathematical reasoning benchmarks.



Figure 9. Performance on out-of-distribution test sets (ARC-c, GPQA-diamond, and MMLU-Pro).


LUFFY also shows clear advantages on other models, such as the smaller Qwen2.5-Math-1.5B and the instruction-tuned Qwen2.5-Instruct-7B:



Figure 10. Performance of LUFFY on Qwen2.5-Math-1.5B.



Figure 11. Performance of LUFFY on Qwen2.5-Instruct-7B.


Furthermore, LUFFY clearly outperforms SFT on "reasoning path length": at a comparable accuracy level, it reaches the correct answer with a shorter reasoning process, avoiding unnecessary verbosity. And when the sampling temperature is raised at test time to encourage stronger exploration, LUFFY's performance remains stable while SFT's drops markedly.



Figure 12. Comparison of reasoning length.



Figure 13. Comparison of exploration ability during testing.


Outlook: A New Starting Point for General Reasoning


LUFFY proposes an efficient, stable, and generalizable method for reasoning training that balances learning and practice, allowing models to truly grasp the intrinsic logic of reasoning strategies. In the future, this framework can be extended to AI tasks requiring complex reasoning, such as code generation, scientific QA, and automated planning, to build more general and autonomous agents.


The project is now open-source on GitHub; interested readers are welcome to explore, reproduce, or extend it.


About the Authors




Yan Jianhao is a third-year PhD student at Westlake University, advised by Professor Zhang Yue. His main research interests are post-training techniques for large models, including reinforcement learning, online learning, and model editing. Before his PhD, he worked as a researcher at WeChat AI and won the WMT machine translation competition.




Dr. Li Yafu, currently a researcher at Shanghai AI Laboratory, whose research areas include large language model reasoning, trustworthy AI, and machine translation. He pursued his PhD through a joint program between Zhejiang University and Westlake University, and previously obtained a Master's in AI from the University of Edinburgh and a Bachelor's in Electronic Information Engineering from Wuhan University. Dr. Li Yafu has published numerous research findings at top-tier conferences such as ACL, EMNLP, and ICLR, with over 1800 citations. He was nominated for the ACL 2023 Best Paper Award and serves as an ACL Area Chair and reviewer for multiple international top conferences and journals. During his PhD, he received the National Scholarship and was selected for the Tencent Rhino Bird Elite Program, receiving an Outstanding Scholarship.


© THE END

Please contact this official account for authorization to reprint.

Submissions or press inquiries: liyazhou@jiqizhixin.com
