Claude 3.7 Sonnet introduced a new paradigm in which a single model handles both non-thinking and long-reasoning modes. The goal of this line of work is to merge chat models like GPT-4o with reasoning models like OpenAI's o1/o3/o4 series into a single model. This article is a brief summary of the existing works I have reviewed so far (there may be omissions). It does not include works that merely shorten CoT length.
Source | Zhihu
Author | Greytong Sextant
AdaptThink's diagram explains the peculiarity of this setting intuitively: for simple problems, the model should not merely produce a shorter CoT; it should skip the CoT entirely.
Training-Free
Most training-free methods focus on training a router (the LLM itself stays frozen). I found two related works, Self-Route[1] and ThinkSwitcher[2], though I have likely not covered them all. Since they differ little from earlier long2short training-free work, I won't go into more detail here, given limited time and energy.
Finetuning-based
Here, I will only introduce the training methods of three models: Qwen3, Llama-Nemotron, and KAT-V1. Other pure-SFT methods (e.g., AutoL2S[3], Self-Braking Tuning[4], TLDR[5]) can only shorten CoT length; they cannot make a reasoning model choose not to think at all. Methods that use both SFT and RL are introduced in the RL section.
Qwen3[6]
Stages 1 and 2 give Qwen3 its LongCoT capability; the primary objective of Stage 3 is to achieve preliminary adaptive reasoning via SFT.
I have directly translated the specific technical details, and I feel the information density is quite high: The SFT dataset included both thinking and non-thinking data. To ensure that the performance of the Stage 2 model was not affected by adding SFT data, the Qwen team used the Stage 2 model itself to perform rejection sampling on Stage 1 queries, generating thinking data. Non-thinking data was carefully curated to cover various task types, including programming, mathematics, instruction following, multilingual tasks, creative writing, Q&A, and role-playing.
Additionally, the Qwen team used automatically generated checklists to evaluate the quality of responses for non-thinking data. To improve performance on low-resource language tasks, the Qwen team specifically increased the proportion of translation tasks in the dataset. The specific thinking and non-thinking templates are as follows:
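The paper's figure shows the two templates. As a rough illustration of how this dual-template design surfaces in the released Qwen3 checkpoints (a sketch based on the public Qwen3 model card, not the paper's figure; the enable_thinking switch is the one exposed by the chat template):

```python
# Sketch based on the public Qwen3 model card (an assumption, not the paper's figure):
# in non-thinking mode the chat template pre-fills an empty <think>\n\n</think> block so
# the output format stays consistent with thinking mode.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
messages = [{"role": "user", "content": "What is 2 + 2?"}]

# Thinking mode: the model reasons inside <think>...</think> before answering.
prompt_think = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

# Non-thinking mode: the empty think block is inserted and the model answers directly.
prompt_no_think = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
```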
Llama-Nemotron[7]
NVIDIA's Llama-Nemotron was released around the same time. They make no secret of leveraging other models to boost performance: rather than building up LongCoT capability themselves first, they directly blended DeepSeek-R1's reasoning outputs into the SFT data. The specific blending ratio is as follows:
Subsequently, because reasoning ability was still insufficient using distillation alone, RL was further added.
KAT-V1[8]
Kuaishou's model also used DeepSeek-R1 for data. For each query, they generated answers in both think-on and think-off modes (DeepSeek-R1 for think-on, DeepSeek-V3 for think-off), then used a majority vote to decide which mode the query should use. DeepSeek-V3 was also used to generate a rationale for the chosen mode, which the model learns to produce. The overall ratio of think-on to think-off data was roughly 2:1. There is also an AutoThink RL part, but Kuaishou did not detail it in the paper, saying it will be covered in a separate future article... The paper also includes a diagram of the overall training process.
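As a rough illustration of the labeling step (the exact voting rule is not spelled out above, so the criterion below is my assumption), one way to implement it:

```python
# Minimal sketch of the think-on / think-off labeling described above. sample_think_on,
# sample_think_off, and is_correct are hypothetical stand-ins for DeepSeek-R1, DeepSeek-V3,
# and an answer verifier; the voting rule and tie-break are assumptions.
def label_query_mode(query, reference, sample_think_on, sample_think_off, is_correct, n=8):
    on_hits = sum(is_correct(sample_think_on(query), reference) for _ in range(n))
    off_hits = sum(is_correct(sample_think_off(query), reference) for _ in range(n))
    # Assumed tie-break: prefer the cheaper think-off mode when it does at least as well.
    return "think_off" if off_hits >= on_hits else "think_on"
```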
RL-based
AutoThink[9]
This paper first discovered a very interesting phenomenon: prepending an ellipsis to the thinking content makes the model unstable; it may output a full LongCoT or may skip thinking entirely. This suggests that even a long-reasoning model, under such out-of-distribution (OOD) prompts, still retains the ability to not think.
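Roughly, the trigger amounts to pre-filling the assistant turn so that the thinking block begins with an ellipsis; a sketch (the chat markers are illustrative only, not a specific model's format):

```python
# Sketch of the "ellipsis prompt" described above (chat markers are illustrative only):
# generation continues after the ellipsis, and the model sometimes emits a full LongCoT
# and sometimes closes the think block immediately and answers directly.
prompt = (
    "<|user|>\nHow many prime numbers are there below 20?\n"
    "<|assistant|>\n<think>\n..."
)
```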
Therefore, this paper introduced a three-stage RL approach to strengthen this capability:
• Stage 1: Apply a larger reward to correct non-thinking outputs (see the reward sketch after this list), which strengthens and stabilizes the model's dual-mode output capability.
• Stage 2: Train with ordinary rewards to improve performance. Because Stage 1 worked very well, the model did not collapse into thinking-only or non-thinking-only even without additional tricks.
• Stage 3: Stage 2 still produced overly long outputs, so Stage 3 penalizes excessive length.
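For reference, the Stage 1 idea can be sketched as a simple reward rule (the numeric values are placeholders of mine, not the paper's):

```python
# Minimal sketch of the Stage 1 reward shaping described above (reward magnitudes are
# placeholders): a correct answer that skips thinking is rewarded more than a correct
# answer with LongCoT, which stabilizes the model's dual-mode behavior.
def stage1_reward(is_correct: bool, used_thinking: bool) -> float:
    if not is_correct:
        return 0.0
    return 1.0 if used_thinking else 2.0  # extra credit for solving the query without thinking
```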
AdaCoT[10]
This paper did not observe the phenomenon mentioned by AutoThink. Therefore, similar to Qwen3 and Nemotron, data was first collected for SFT to equip the model with basic non-thinking capabilities, followed by RL training. Here, the two parts of the data were not collected separately; instead, a 15B model was directly used to label whether a query was simple enough to be answered directly without thinking.
The reward in the RL stage is straightforward:

$$R(x, y) = R_{\text{base}}(x, y) - \alpha_1 \cdot P_{\text{miss}}(x, y) - \alpha_2 \cdot P_{\text{over}}(x, y) - \gamma \cdot P_{\text{fmt}}(x, y)$$

Here, $R_{\text{base}}$ is the base reward, $P_{\text{miss}}$ penalizes omitting reasoning when it should have been used, $P_{\text{over}}$ penalizes reasoning that is invoked unnecessarily or runs too long, and $P_{\text{fmt}}$ penalizes improperly formatted output. In effect, AutoThink's three stages are folded into a single reward.
Another ingenious technique is called Selective Loss Masking. Concerned that the model might collapse into always reasoning or never reasoning, the authors exclude the first token after <think> (effectively the decision token for whether to think) from the loss calculation. This is very clever: at this stage the model no longer updates its decision of whether to think, so it cannot unlearn or distort what was learned well during SFT. It also sidesteps the collapse problem that AutoThink's Stage 2 worried about (which, there, did not actually occur).
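As a concrete illustration, here is a minimal sketch of Selective Loss Masking under an assumed [batch, seq_len] tensor layout (the helper is mine, not code from the paper):

```python
import torch

def mask_decision_token(loss_mask: torch.Tensor, token_ids: torch.Tensor, think_id: int) -> torch.Tensor:
    """Zero out the loss weight of the decision token, i.e. the first token generated
    after <think>, so RL updates cannot shift the think / no-think choice learned in SFT."""
    masked = loss_mask.clone()
    for b in range(token_ids.size(0)):
        positions = (token_ids[b] == think_id).nonzero(as_tuple=True)[0]
        if positions.numel() > 0 and positions[0] + 1 < token_ids.size(1):
            masked[b, positions[0] + 1] = 0.0  # exclude the decision token from the loss
    return masked
```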
AdaptThink[11]
Several motivational diagrams in this paper are excellent, and the teaser image used at the beginning of this article is also theirs. As shown in the left figure below, No Thinking is not just an efficiency issue; its accuracy is even higher on the simplest problems.
The approach in this paper is very aggressive: since no-thinking simply means <think> immediately followed by </think>, there is no need for SFT to imbue this capability; one can directly optimize the following constrained objective:

$$\max_{\theta}\ \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[\mathbb{1}(y_1 = \texttt{</think>})\big] \quad \text{s.t.} \quad \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[R(x, y)\big] \;\ge\; \mathbb{E}_{x \sim \mathcal{D},\, y' \sim \pi_{\text{ref}}(\cdot \mid x)}\big[R(x, y')\big]$$

that is, maximize how often the model skips thinking while keeping accuracy no worse than the reference (pre-RL) model.

After introducing a Lagrange multiplier and some rearrangement, this becomes optimizing:

$$\max_{\theta}\ \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[R(x, y) - \bar{R}_{\text{ref}}(x) + \delta \cdot \mathbb{1}(y_1 = \texttt{</think>})\big]$$

where $\bar{R}_{\text{ref}}(x)$ is the reference model's mean reward on $x$ (precomputed by sampling) and $\delta$ controls how strongly no-thinking is encouraged.

And since the indicator $\mathbb{1}(\cdot)$ and the reward $R$ are non-differentiable, the bracketed term is treated as an advantage function and optimized using PPO.
During importance sampling, because the original model had not undergone SFT and had essentially no no-thinking behavior, the authors use a sampling distribution that with 50% probability forces the first generated token to be </think> (i.e., a no-thinking response) and with the other 50% probability samples a LongCoT normally.
From the perspective of the loss, PPO will push the model toward not thinking only when the no-thinking response's advantage is positive, i.e., when $R(x, y) + \delta \ge \bar{R}_{\text{ref}}(x)$: the no-thinking answer, plus the bonus $\delta$, must at least match the reference model's average reward. The larger $\delta$ is, the more the model is encouraged to not think.
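To make the resulting update concrete, here is a minimal sketch of the advantage described above, under my own variable names (the delta value and function signature are assumptions, not the paper's):

```python
# Minimal sketch of the AdaptThink-style advantage described above (names and the delta value
# are assumptions): a response that skips thinking earns a bonus delta, and every response is
# compared against the reference model's precomputed mean reward on the same query.
def adaptthink_advantage(reward: float,
                         ref_mean_reward: float,
                         skips_thinking: bool,
                         delta: float = 0.05) -> float:
    bonus = delta if skips_thinking else 0.0
    # A positive advantage pushes PPO to reinforce the sampled behavior; for a no-thinking
    # sample this happens exactly when reward + delta >= ref_mean_reward.
    return reward + bonus - ref_mean_reward
```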
HGPO[12]
This paper also first collects data for SFT to equip the model with basic non-thinking capability, followed by RL training, which is the HGPO referred to in the section title.
The HGPO process is as follows:
• For each query, N/2 candidate responses are sampled separately under the thinking mode (⊢) and non-thinking mode (⊬), meaning each query yields N responses.
• Assign initial reward scores. Rule-based scoring is used for queries with definite answers; otherwise, the reward model Llama-3.1-Tulu-3-8B-RM is used.
• Reward Assignment. Inter-group and intra-group rewards are computed separately. The inter-group reward goes to whichever mode (thinking vs. non-thinking) achieves the higher initial reward score for the same query. The intra-group reward goes to the response with the higher initial reward score within the same mode's group.
• Advantage Estimation. GRPO is used, combining the two rewards above. The interesting part is the inter-group reward: it is applied only to the tokens in the response that determine the thinking mode, i.e., think and no_think (see the sketch after this list).
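A minimal sketch of this two-level reward assignment (binary 0/1 rewards and the mean-based comparison between modes are my assumptions; the paper's exact scheme may differ):

```python
# Minimal sketch of HGPO's reward assignment as described above (binary rewards and the
# mean-based mode comparison are assumptions). For one query we have N/2 scored responses
# per mode; the winning mode gets an inter-group reward (applied only to its mode token),
# and the best response inside each mode's group gets an intra-group reward.
from statistics import mean

def hgpo_rewards(think_scores, no_think_scores):
    # Inter-group: which mode wins for this query.
    inter = {"think": 0.0, "no_think": 0.0}
    winner = "think" if mean(think_scores) >= mean(no_think_scores) else "no_think"
    inter[winner] = 1.0

    # Intra-group: reward the best response within each mode's group.
    intra_think = [1.0 if s == max(think_scores) else 0.0 for s in think_scores]
    intra_no_think = [1.0 if s == max(no_think_scores) else 0.0 for s in no_think_scores]
    return inter, intra_think, intra_no_think
```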
The authors also propose a metric for this adaptive-thinking ability, called Hybrid Accuracy (HAcc). Specifically, for each query the model samples N responses in thinking mode and N in non-thinking mode, which are scored by a reward model; the higher-scoring mode is taken as the preferred reasoning mode. HAcc is then the agreement rate between the mode the model actually chooses on its own and this computed preferred mode.
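A minimal sketch of HAcc under that description (the callables are hypothetical stand-ins for sampling in a forced mode, reading off the model's own chosen mode, and reward-model scoring):

```python
# Minimal sketch of Hybrid Accuracy (HAcc) as described above; sample_in_mode,
# model_chosen_mode, and score are hypothetical stand-ins, not a real API.
from statistics import mean

def hybrid_accuracy(queries, sample_in_mode, model_chosen_mode, score, n=4):
    agree = 0
    for q in queries:
        think_scores = [score(q, sample_in_mode(q, "think")) for _ in range(n)]
        no_think_scores = [score(q, sample_in_mode(q, "no_think")) for _ in range(n)]
        # The mode with the higher reward-model score is the "preferred" mode for this query.
        preferred = "think" if mean(think_scores) >= mean(no_think_scores) else "no_think"
        agree += int(model_chosen_mode(q) == preferred)
    return agree / len(queries)
```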
References
[1] Self-Route: https://arxiv.org/abs/2505.20664
[2] ThinkSwitcher: https://arxiv.org/abs/2505.14183
[3] AutoL2S: https://arxiv.org/abs/2505.22662
[4] Self-Braking Tuning: https://arxiv.org/abs/2505.14604
[5] TLDR: https://arxiv.org/abs/2506.02678
[6] Qwen3: https://arxiv.org/abs/2505.09388
[7] Llama-Nemotron: https://arxiv.org/abs/2505.00949
[8] KAT-V1: https://arxiv.org/abs/2507.08297
[9] AutoThink: https://arxiv.org/abs/2505.10832
[10] AdaCoT: https://arxiv.org/abs/2505.11896
[11] AdaptThink: https://arxiv.org/abs/2505.13417
[12] HGPO: https://arxiv.org/abs/2505.14631