The 'Chatterbox Syndrome' of Large Models: Can't Solve Problems Without Introspection?
Modern large reasoning models (e.g., DeepSeek-R1) routinely insert <think> tags and self-reflection words such as “Wait” and “Hmm” during complex reasoning, talking to themselves the way a person mutters “Wait... let me double-check...” over a hard problem. These tokens, however, trigger redundant verification loops that bloat the inference trace (a single math problem can run to 7,000+ tokens), slowing decoding and wasting compute.
It's like muttering “Let me check again” over and over while solving a problem, yet just going in circles.
Paper: Wait, We Don’t Need to “Wait”! Removing Thinking Tokens Improves Reasoning Efficiency
Link: https://arxiv.org/pdf/2506.08343
NoWait: Equipping Models with a "Keyword Filter"
The team proposes a zero-training-cost solution:
Step 1: Identify “thinking keywords” – collect the statistically high-frequency self-reflection words in reasoning traces (e.g., Wait / Hmm / Alternatively) and compile them into a blacklist.
Step 2: Expand surface variants – account for case, leading spaces, and other word forms, e.g., extending “Wait” to “wait”, “WAIT”, etc., so the model cannot slip past the filter.
Step 3: Mask in real time during inference – at every decoding step, force the logits of these tokens down to a large negative value (effectively zero probability), compelling the model to skip the unnecessary chatter.
This is equivalent to installing an “anti-distraction plugin” on the model: no model parameters are altered, as the sketch below illustrates.
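Here is a minimal sketch of the idea as a decoding-time logits filter, written against the HuggingFace Transformers API. The keyword list mirrors the paper's examples (Wait/Hmm/Alternatively), but the model name, the variant-expansion rules, and the restriction to single-token forms are simplifying assumptions of this sketch, not the authors' released implementation.

```python
# Sketch of NoWait-style keyword suppression at decoding time (assumptions noted above).
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          LogitsProcessor, LogitsProcessorList)

KEYWORDS = ["Wait", "Hmm", "Alternatively"]  # Step 1: the blacklist

def expand_variants(words):
    # Step 2: cover case and leading-space surface forms so the model
    # cannot dodge the filter with "wait", "WAIT", or " Wait".
    variants = set()
    for w in words:
        for form in (w, w.lower(), w.upper(), w.capitalize()):
            variants.update({form, " " + form})
    return variants

class SuppressKeywords(LogitsProcessor):
    # Step 3: at each decoding step, push the banned tokens' logits to
    # -inf, so their sampling probability becomes exactly zero.
    def __init__(self, tokenizer, words):
        ids = set()
        for variant in expand_variants(words):
            toks = tokenizer.encode(variant, add_special_tokens=False)
            if len(toks) == 1:  # this sketch only bans single-token forms
                ids.add(toks[0])
        self.banned = sorted(ids)

    def __call__(self, input_ids, scores):
        scores[:, self.banned] = float("-inf")
        return scores

model_name = "Qwen/QwQ-32B"  # one of the models evaluated in the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto")

prompt = "What is the sum of the first 100 positive integers?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=2048,
    logits_processor=LogitsProcessorList(
        [SuppressKeywords(tokenizer, KEYWORDS)]),
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Because the filter lives entirely in the decoding loop, it can be switched on per request and the underlying weights stay untouched; a production version would add a stateful matcher for multi-token keyword variants.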
Results: Comprehensive Slimming Across Text, Image, and Video Tasks
Textual Reasoning (Math Competition Problems)
On models such as QwQ-32B and Phi-4:
Reasoning chains shortened by 27%-51% (e.g., on the AIME problem set, from ~15,000 tokens to ~10,500)
Accuracy did not drop – it rose (+4.25% on the AMC 2023 task)
[Figure: Text task performance comparison – original vs. NoWait]
Multimodal Tasks (Image + Video)
Vision Model Kimi-VL:
Token usage plummeted by 40-60% (e.g., on EMMA-mini, from 5,734 tokens to 2,269)
Accuracy dipped only slightly, by about 3%
Video Model QvQ-72B:
Reasoning focuses more tightly on temporal logic (e.g., “video opening → progression → ending”)
Fewer redundant self-reflection words, giving more concise logic
[Figure: Vision task performance comparison]
[Figure: Video task performance comparison]
Case Study Comparison
Original output (Qwen3-32B solving a math problem): “Wait, let me check again” appears repeatedly, verifying the same conclusion five times.
NoWait output: goes straight to the key verification points – 30% shorter, with the answer still correct.
[Figure: Example of NoWait simplifying a reasoning chain]
Key Finding: Why Are RL Models More “Robust”?
RL-trained models (e.g., Qwen3-32B): accuracy stayed stable after “Wait” suppression, because RL training preserves the introspection that is actually necessary.
Distilled small models (e.g., Qwen3-4B): accuracy plummeted by 12% (on the AIME 2025 task), because these models lean on a preset reasoning-chain template – cutting its keywords makes the chain collapse.
[Figure: Accuracy drop comparison for distilled models]
Industry Significance
Zero-cost deployment: no retraining or fine-tuning required – plug and play.
Multimodal versatility: the first demonstration that the approach helps across text, image, and video tasks.
A challenge to received wisdom: explicit “self-reflection” is not a necessary step – efficient reasoning can skip the formalities.
Less hesitation when solving problems leads to greater accuracy and speed!