Does AI Know When to "Think"? Thinkless Teaches Large Language Models When to Reason

Have you ever asked AI a simple question only to receive a lengthy, verbose answer? Or asked a complex question and gotten a superficial response? Today, I want to share a groundbreaking study that teaches AI "when to think and when to answer directly."

1. The AI Thinking Dilemma: To Reason or Not to Reason?

Modern Large Language Models (LLMs) are capable of complex reasoning through "Chain-of-Thought" (CoT). Simply put, this method allows AI to list the steps to solve a problem, much like humans, before arriving at the final answer.

However, a clear problem exists with this approach: AI consistently uses detailed reasoning regardless of the question's simplicity or complexity. It's like asking a friend "What is 1+1?" and they seriously write down: "First, we have the number 1, and then we add the number 1. According to the definition of addition, 1+1=2." – This is clearly a waste of time!

This "overthinking" leads to three major drawbacks:

(1) Generation of a large number of redundant tokens (the basic unit of AI output)

(2) Increased memory footprint

(3) Significantly higher computational costs

2. Thinkless: A Framework That Teaches AI When to Think

The paper raises a crucial question: Can AI learn to decide when to think based on task complexity and its own capabilities?

Researchers developed the Thinkless framework, which cleverly uses two control tokens: <short>, indicating a concise answer, and <think>, indicating detailed reasoning. Through reinforcement learning, the model learns to autonomously decide which answer mode to use for a given question.
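To make the mechanism concrete, here is a minimal inference sketch (not the authors' code): the first token the model generates is read as its mode decision, and the rest is the answer. The model path is a placeholder, and the Hugging Face-style usage is an illustrative assumption.

```python
# Minimal sketch: read the leading control token as the mode decision.
# Assumes a causal LM whose vocabulary includes the control tokens
# "<short>" and "<think>" (as described in the paper).
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "path/to/thinkless-checkpoint"  # placeholder, not a real model id
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def answer(question: str, max_new_tokens: int = 1024) -> tuple[str, str]:
    inputs = tok(question, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    new_tokens = out[0, inputs["input_ids"].shape[1]:]
    text = tok.decode(new_tokens, skip_special_tokens=False)
    # The leading control token tells us which mode the policy chose.
    mode = "think" if text.lstrip().startswith("<think>") else "short"
    return mode, text

mode, text = answer("Calculate 123 + 456")
print(mode)  # expected: "short" for a trivial arithmetic query
```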

3. How Thinkless Works

This framework trains AI through two stages:

(1) Warm-up Distillation Phase

First, the model learns from two "experts": one model proficient in detailed reasoning and another proficient in concise answers. This process is like a student learning from two teachers with different styles, mastering both answering methods.

This stage establishes a clear mapping between control tokens and answer formats, giving the subsequent reinforcement learning stage a diverse base of output styles to build on.
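As an illustration (my sketch, not the paper's exact recipe), the warm-up data can be built by pairing each question with both experts' answers, each prefixed by its control token, and fine-tuning the student on the result with a standard SFT loss. The expert callables here are hypothetical stand-ins for the two teacher models.

```python
# Sketch of warm-up distillation data: every question appears twice, once per
# expert, prefixed by the matching control token so the student learns the
# mapping from token to answer format.
def build_warmup_dataset(questions, reasoning_expert, concise_expert):
    """reasoning_expert / concise_expert are hypothetical callables that
    return each expert model's answer string for a question."""
    dataset = []
    for q in questions:
        dataset.append({"prompt": q, "target": "<think>" + reasoning_expert(q)})
        dataset.append({"prompt": q, "target": "<short>" + concise_expert(q)})
    return dataset  # then fine-tune the student on this with a standard SFT loss
```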

(2) Decoupled Group Relative Policy Optimization (DeGRPO)

This is the core innovation of Thinkless. The researchers found that traditional optimization methods (such as vanilla GRPO) can lead to "mode collapse", where the model ends up favoring one reasoning mode for everything, losing flexibility.

DeGRPO cleverly decomposes the learning objective into two parts:

1) Mode Selection: governs how the model updates its choice of control token, based on how well each mode currently performs.

2) Accuracy Improvement: Enhances answer content, improving the correctness of answers under the selected reasoning mode.

This decoupled design avoids mode collapse, enabling the model to learn accurate outputs and context-sensitive reasoning strategies.
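Below is a simplified sketch of how such a decoupled objective could look, based on my reading of the paper rather than the official implementation. The weight alpha on the mode term corresponds to the control-token weight discussed in the observations later: setting it too high makes mode selection update faster than the answers can improve.

```python
import torch

def group_advantages(rewards):
    # Standard group-relative advantage: normalize rewards within one rollout group.
    r = torch.as_tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + 1e-6)

def degrpo_loss(logp_mode, logp_resp, resp_lens, advantages, alpha=1.0):
    """Simplified sketch of a decoupled objective (an interpretation, not the
    official implementation).

    logp_mode:  (G,) log-prob of the chosen control token, one per rollout
    logp_resp:  (G,) summed log-prob of the response tokens per rollout
    resp_lens:  (G,) number of response tokens per rollout
    advantages: (G,) group-relative advantages from group_advantages()
    alpha:      weight on the mode-selection term; too high -> the mode
                policy updates before the short mode has room to improve
    """
    # Mode term: a single token per rollout, normalized only over the group,
    # so its gradient is not diluted by thousands of response tokens.
    mode_term = (advantages * logp_mode).mean()
    # Accuracy term: response tokens, normalized by their own length.
    resp_term = (advantages * logp_resp / resp_lens.clamp(min=1)).mean()
    # Negate for gradient descent (we maximize the weighted objective).
    return -(alpha * mode_term + resp_term)
```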

4. Results: Saving 50%-90% of Computational Resources

After training, the Thinkless model learned to accurately identify simple queries and respond in the more efficient non-thinking mode. It achieved impressive results across multiple benchmarks:

1) On MATH-500, Minerva Algebra, and GSM8K datasets, the use of long-form reasoning was reduced by 50%-90%.

2) On the more challenging AIME tasks, the model naturally adopted a higher proportion of long-form reasoning.

This means AI has become "smarter" – it knows when to think deeply and when to answer directly. This significantly reduces inference costs while maintaining task performance.

5. Conclusion

The researchers observed some interesting phenomena during training:

U-shaped Learning Curve

In the early stages of training, the model tended to use long-chain reasoning, since it generally yielded higher accuracy. As training progressed, the accuracy of short-chain answers gradually improved, and the model increasingly explored brief responses.

This behavior manifested as an increase in the proportion of short-chain outputs over time, with many short answers achieving perfect accuracy in the later stages of training. Simultaneously, the accuracy of long-chain answers decreased, which was not due to a decline in the model's reasoning ability but because more difficult problems were assigned to the long-chain mode.

Influence of Control Token Weights

The weights of the control tokens determine the learning speed of mode selection. Overly high weights can cause the model to update its strategy too quickly, potentially assigning some samples to the long-chain mode too early, without giving enough room for performance improvement in the short mode.

Practical Case Examples

How does Thinkless make decisions when faced with questions of varying complexity?

(1) Simple question: "Calculate 123 + 456". Mode selection: short answer mode (<short>). Answer: "579".

(2) Moderately complex question: "What is the volume of a sphere if its surface area is 100 square centimeters?" Mode selection: depends on the model's self-assessment of its capabilities. Possible short answer: "The volume of the sphere is approximately 94.03 cubic centimeters." (See the quick check after this list.)

(3) Complex question: "Prove that the sum of the interior angles of any triangle is equal to 180 degrees." Mode selection: thinking mode (<think>). Answer: detailed geometric proof steps...
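For the sphere example above, here is a quick arithmetic check (plain Python, unrelated to the model itself):

```python
# Verify the sphere example: surface area 100 cm^2 -> volume.
import math

S = 100.0                            # surface area in cm^2
r = math.sqrt(S / (4 * math.pi))     # from S = 4*pi*r^2
V = (4 / 3) * math.pi * r**3         # volume of a sphere
print(round(V, 2))                   # ~94.03 cubic centimeters
```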

While the Thinkless research has achieved significant results, there is still room for further improvement:

(1) Improved Warm-up Phase: Explore better mixed-model construction strategies, such as merging techniques or lightweight fine-tuning methods.

(2) Expansion to More Domains: Currently validated primarily on mathematical problems, future expansion to a wider range of domains is possible.

(3) More Complex Decision Mechanisms: Develop decision systems that can consider more factors, such as user preferences, environmental constraints, etc.

The Thinkless study demonstrates an important idea in AI systems: not all problems require the same depth of thought. This is very similar to human thinking – we also adjust our thought depth based on problem complexity in daily life.

This research not only significantly improves the efficiency of AI systems but also reveals the direction for building more intelligent and natural AI systems. In the future, AI will better understand "when to accelerate and when to slow down," thinking deeply when needed and answering directly when possible, thereby providing a more natural and efficient user experience.

Paper Title: Thinkless: LLM Learns When to Think

Paper Link: https://arxiv.org/abs/2505.13379
