Have you ever asked AI a simple question only to receive a lengthy, verbose answer? Or asked a complex question and gotten a superficial response? Today, I want to share a groundbreaking study that teaches AI "when to think and when to answer directly."
1. The AI Thinking Dilemma: To Reason or Not to Reason?
Modern Large Language Models (LLMs) are capable of complex reasoning through "Chain-of-Thought" (CoT). Simply put, this method allows AI to list the steps to solve a problem, much like humans, before arriving at the final answer.
However, this approach has a clear problem: the AI reasons in detail regardless of how simple or complex the question is. It's like asking a friend "What is 1+1?" and having them earnestly write out: "First, we have the number 1, and then we add the number 1. By the definition of addition, 1+1=2." That is clearly a waste of time!
This "overthinking" leads to three major drawbacks:
(1) Generation of a large number of redundant tokens (the basic unit of AI output)
(2) Increased memory footprint
(3) Significantly higher computational costs
2. Thinkless: A Tool to Teach AI "Timely Thinking"
The paper raises a crucial question: Can AI learn to decide when to think based on task complexity and its own capabilities?
Researchers developed the Thinkless framework, which cleverly uses two control tokens: <short>, indicating a concise answer, and <think>, indicating detailed reasoning. Through reinforcement learning, the AI learns to decide autonomously which answer mode to use for a given question.
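To make this concrete, here is a minimal sketch of how such control-token routing could look at inference time. It is my own illustration, not the paper's released code: the generate() interface, the token budgets, and the stub backend are assumptions; only the <short>/<think> control-token idea comes from the paper.

```python
# Minimal sketch (not the paper's code) of control-token routing at inference.
# `generate(prompt, max_new_tokens) -> str` stands in for any decoding backend.

THINK, SHORT = "<think>", "<short>"

def route_and_answer(generate, question: str) -> str:
    # Step 1: the policy emits a single control token that selects the mode.
    mode = generate(question, max_new_tokens=1)
    if mode.strip() == SHORT:
        # Concise mode: small token budget, answer directly.
        return generate(question + SHORT, max_new_tokens=64)
    # Thinking mode: generous budget for a full chain of thought.
    return generate(question + THINK, max_new_tokens=4096)

# Toy usage with a stub backend that always chooses the short mode.
if __name__ == "__main__":
    def stub_generate(prompt, max_new_tokens):
        return SHORT if max_new_tokens == 1 else "579"
    print(route_and_answer(stub_generate, "Calculate 123 + 456"))  # -> 579
```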
3. How Thinkless Works
This framework trains AI through two stages:
(1) Warm-up Distillation Phase
First, the model learns from two "experts": one model proficient in detailed reasoning and another proficient in concise answers. This process is like a student learning from two teachers with different styles, mastering both answering methods.
This stage establishes a clear mapping between the control tokens and the two answer formats, giving the model a diverse set of outputs to build on during the subsequent reinforcement learning.
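As a rough sketch of what this warm-up data could look like (my own illustration; the record format and function names are assumptions, not the paper's code), each question is paired twice: once with the reasoning expert's long answer placed after <think>, and once with the concise expert's short answer placed after <short>.

```python
# Sketch of assembling a warm-up distillation set: every question gets two
# targets, one per control token, distilled from the two "expert" models.

from typing import Callable, Dict, List

THINK, SHORT = "<think>", "<short>"

def build_warmup_dataset(
    questions: List[str],
    reasoning_expert: Callable[[str], str],  # e.g. a long chain-of-thought model
    concise_expert: Callable[[str], str],    # e.g. a short-answer model
) -> List[Dict[str, str]]:
    records = []
    for q in questions:
        # Long-form target: control token <think> followed by full reasoning.
        records.append({"prompt": q, "target": THINK + reasoning_expert(q)})
        # Short-form target: control token <short> followed by a brief answer.
        records.append({"prompt": q, "target": SHORT + concise_expert(q)})
    return records

# Toy usage with stand-in "experts".
data = build_warmup_dataset(
    ["What is 1 + 1?"],
    reasoning_expert=lambda q: " Step 1: take 1. Step 2: add 1. Answer: 2.",
    concise_expert=lambda q: " 2",
)
print(len(data))  # 2 records: one <think> target, one <short> target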
(2) Decoupled Group Relative Policy Optimization (DeGRPO)
This is the core innovation of Thinkless. Researchers found that traditional optimization methods can lead to "mode collapse" – where the model might completely favor one reasoning mode, losing flexibility.
DeGRPO cleverly decomposes the learning objective into two parts:
1) Mode Selection: a loss term on the control token that governs how the model adjusts its choice of mode based on whether its answers are currently correct.
2) Accuracy Improvement: a loss term on the response tokens that improves the correctness of the answer produced under the selected reasoning mode.
This decoupled design avoids mode collapse, enabling the model to learn accurate outputs and context-sensitive reasoning strategies.
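Below is a much-simplified sketch of the decoupling idea, not a faithful re-implementation of the paper's algorithm: the policy-gradient term for the single control token is normalized and weighted separately from the term for the response tokens, so a long answer cannot drown out the one token that picks the mode. Clipping, KL regularization, and the exact reward design are omitted, and the alpha value shown is purely illustrative.

```python
# Simplified sketch of a DeGRPO-style decoupled loss (assumptions: shapes,
# normalization details, and the alpha default are illustrative, not the paper's).

import torch

def degrpo_style_loss(
    mode_logprob: torch.Tensor,       # log-prob of the chosen control token, shape [G]
    response_logprobs: torch.Tensor,  # padded log-probs of answer tokens, shape [G, T]
    response_mask: torch.Tensor,      # 1 for real tokens, 0 for padding, shape [G, T]
    rewards: torch.Tensor,            # scalar reward per rollout in the group, shape [G]
    alpha: float = 0.001,             # weight on the mode-selection term (hyperparameter)
) -> torch.Tensor:
    # Group-relative advantage, GRPO-style: compare each rollout to its group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # Mode-selection term: a single token per rollout, so no length normalization.
    mode_loss = -(adv * mode_logprob).mean()

    # Accuracy term: response tokens, normalized by each rollout's own length.
    token_loss = -(adv.unsqueeze(1) * response_logprobs * response_mask)
    resp_loss = (token_loss.sum(dim=1) / response_mask.sum(dim=1).clamp(min=1)).mean()

    # Decoupled objective: the two terms are balanced explicitly by alpha,
    # instead of letting long responses dominate the single control token.
    return alpha * mode_loss + resp_loss
```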
4. Results: Saving 50%-90% of Computational Resources
After training, the Thinkless model learned to accurately identify simple queries and respond to them in the more efficient non-thinking mode. It achieved impressive results across multiple benchmarks:
1) On MATH-500, Minerva Algebra, and GSM8K datasets, the use of long-form reasoning was reduced by 50%-90%.
2) On the more challenging AIME tasks, the model naturally adopted a higher proportion of long-form reasoning.
This means AI has become "smarter" – it knows when to think deeply and when to answer directly. This significantly reduces inference costs while maintaining task performance.
5. Training Observations, Examples, and Outlook
The researchers observed some interesting phenomena during training:
U-shaped Learning Curve
In the early stages of training, the model tended to use long-chain reasoning, since it generally yielded higher accuracy. As training progressed, the accuracy of short-chain answers gradually improved, and the model began to explore brief answers more often.
This behavior manifested as an increase in the proportion of short-chain outputs over time, with many short answers achieving perfect accuracy in the later stages of training. Simultaneously, the accuracy of long-chain answers decreased, which was not due to a decline in the model's reasoning ability but because more difficult problems were assigned to the long-chain mode.
Influence of Control Token Weights
The weight on the control token determines how fast mode selection is learned. If the weight is too high, the model updates its mode-selection strategy too quickly and may assign some samples to the long-chain mode prematurely, before the short mode has had enough room to improve on them.
Practical Case Examples
How does Thinkless make decisions when faced with questions of varying complexity?
(1) Simple question: "Calculate 123 + 456"
Mode Selection: Short answer mode (<short>)
Answer: "579"
(2) Moderately complex question: "What is the volume of a sphere if its surface area is 100 square centimeters?"
Mode Selection: Depends on the model's self-assessment of its own capability.
Possible short answer: "The volume of the sphere is approximately 94.0 cubic centimeters." (From 4πr² = 100, r ≈ 2.82 cm, so V = (4/3)πr³ ≈ 94.0 cm³; see the quick numeric check after these examples.)
(3) Complex question: "Prove that the sum of the interior angles of any triangle is equal to 180 degrees."
Mode Selection: Thinking mode (<think>)
Answer: Detailed geometric proof steps...
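For completeness, here is a quick numeric check of example (2); it is my own addition, not part of the paper, and simply confirms the figure quoted above.

```python
# Quick numeric check of example (2): sphere with surface area 100 cm^2.
import math

A = 100.0                          # surface area in cm^2
r = math.sqrt(A / (4 * math.pi))   # from A = 4 * pi * r^2
V = (4 / 3) * math.pi * r ** 3     # volume of the sphere
print(round(r, 3), round(V, 1))    # ~2.821 cm radius, ~94.0 cm^3 volume
```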
While the Thinkless research has achieved significant results, there is still room for further improvement:
(1) Improved Warm-up Phase: Explore better mixed-model construction strategies, such as merging techniques or lightweight fine-tuning methods.
(2) Expansion to More Domains: Currently validated primarily on mathematical problems, future expansion to a wider range of domains is possible.
(3) More Complex Decision Mechanisms: Develop decision systems that can consider more factors, such as user preferences, environmental constraints, etc.
The Thinkless study demonstrates an important idea in AI systems: not all problems require the same depth of thought. This is very similar to human thinking – we also adjust our thought depth based on problem complexity in daily life.
This research not only significantly improves the efficiency of AI systems but also reveals the direction for building more intelligent and natural AI systems. In the future, AI will better understand "when to accelerate and when to slow down," thinking deeply when needed and answering directly when possible, thereby providing a more natural and efficient user experience.
Paper Title: Thinkless: LLM Learns When to Think
Paper Link: https://arxiv.org/abs/2505.13379