Category: Reinforcement Learning

Breaking News! DeepSeek Officially Releases 2 Models
US Air Force Integrates AI into Advanced Wargaming
What? RLVR Isn't Learning New Knowledge—It's Learning How to Use Knowledge for Reasoning!
Xiaohongshu Proposes DeepEyesV2: From "Visual Thinking" to "Tool Collaboration", Exploring New Dimensions in Multimodal Intelligence
Microsoft Proposes GAD Framework: Open-Source Models Can Directly Distill Black-Box GPT-5
Reinforcement Learning + Large Model Memory: Mem-α, Enabling Agents to "Learn How to Remember" for the First Time
SJTU PhD's Latest Insights: Clarifying Reinforcement Learning with Just Two Questions
Meta's Two Latest Agent Learning Papers Are Quite Interesting!
The More You Fail, The Faster You Learn! Trajectory Rewriting Allows AI Agents to Create Perfect Experiences from Mistakes!
Abandoning Manual Annotation! Chinese Team Proposes Self-Evolution Algorithm for Multimodal Large Models
First Multi-Round LLM Router Unveiled: Router-R1 Teaches Large Models to "Think–Route–Aggregate"
New Work from Danqi Chen's Group at Princeton: RLHF Insufficient, RLVR Bounded? RLMT Forges a Third Path
ByteDance Breaks the "Entropy Curse" in LLM RL Training, Enabling Models to Learn with Certainty!
Stanford Proposes New RL Paradigm: 3B Model Agent Outperforms Claude, GPT-4
Microsoft Introduces rStar2-Agent: "Thinking Smarter" Proves Far More Effective and Efficient Than Simply "Thinking Longer"
LLMs Dominate Math Boards, Yet Forget How to Chat? CMU et al. Reveal Striking Differences Between SFT and RL!
Evolution and Development Trends of Reinforcement Learning Frameworks
Advancing Silicon-Based Intelligence: Shuchao Bi's Insights on Past, Present, and Future AI
ARPO: Agentic Reinforced Policy Optimization, Enabling Agents to Explore One Step Further at Critical Moments
RAG Revolution! Graph-R1, the First RL-driven Graph Reasoning Agent
Revisiting Qwen3's Abandoned Hybrid Reasoning Mode
Why Can't Language Models Directly Output Answers with Confidence?
DeepSeek-GRPO Importance Weight Design Flaw? Explaining Qwen3's New Reinforcement Learning Algorithm GSPO
Counter-Intuitive RL Research: Directly Providing Answers to LLMs is More Effective Than Detailed Step-by-Step Instructions!
Alibaba Open-Sources Breakthrough Agent Overnight, Directly Challenges OpenAI with State-of-the-Art Performance!