This research was jointly developed by the Institute of Automation, Chinese Academy of Sciences (CASIA) and Tencent Hunyuan. The team members include Qi Yang, Bolin Ni, Shiming Xiang, Han Hu, Houwen Peng, and Jie Jiang.
Background: The Thinking Dilemma of Multimodal Large Models
Currently, leading large models in the industry are all grappling with the problem of "overthinking": adopting an "always-on" detailed reasoning mode no matter how simple the problem is. Whether it is DeepSeek-V3.1, whose hybrid reasoning architecture requires users to manually switch between fast and slow thinking, or GPT-5, which implements adaptive thinking through a massive and costly "expert routing" mechanism, current solutions remain far from truly "intelligent thinking": they either shift the burden of judgment onto users or are constrained by complex system architectures and high deployment costs. A lightweight multimodal large model with genuinely adaptive thinking would therefore give users a smoother interactive experience.
Recently, a new study co-authored by the Tencent Hunyuan team and CASIA introduced the R-4B multimodal large model. Through an adaptive thinking (auto-thinking) mechanism, R-4B changes this status quo, allowing AI to "intelligently switch" its thinking mode like humans. It responds directly to simple questions and performs deep reasoning for complex problems, maximizing answer accuracy while minimizing computational overhead.
Paper Title: R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning
Paper Link: https://arxiv.org/pdf/2508.21113
This core "on-demand thinking" capability sets a new performance benchmark for 4B-scale multimodal models, allowing R-4B to outperform larger models such as Keye-VL-8B and Kimi-VL-A3B-Thinking-2506 on evaluation benchmarks.
At the same time, R-4B achieved excellent results on the authoritative OpenCompass benchmark list.
Topping the OpenCompass multimodal academic leaderboard: ranked No. 1 among multimodal large models under 20B parameters!
First on the OpenCompass multimodal reasoning open-source leaderboard: leading reasoning performance among open-source models!
Currently, the model is available on GitHub and Hugging Face and supports fast deployment with vLLM (a hedged client-side sketch follows the links below). According to the team, it runs on consumer-grade graphics cards, making it suitable for low-power scenarios such as laptops, smart cockpits, and smart homes, and it supports low-cost fine-tuning for vertical domains. Downloads have already exceeded ten thousand.
GitHub Code Repository: https://github.com/yannqi/R-4B
Hugging Face Model Download: https://huggingface.co/YannQi/R-4B
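As a concrete illustration of the vLLM deployment path, here is a minimal client-side sketch that queries a locally served checkpoint through vLLM's OpenAI-compatible API. The serve command, port, and message format follow standard vLLM/OpenAI conventions rather than official R-4B documentation, so treat the specifics as assumptions.

```python
# Hedged sketch: query an R-4B instance served by vLLM's OpenAI-compatible
# server. Assumes the server was started with something like
#   vllm serve YannQi/R-4B --trust-remote-code
# The port and the image URL below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="YannQi/R-4B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/chart.png"}},
            {"type": "text",
             "text": "Summarize the overall trend shown in this chart."},
        ],
    }],
    max_tokens=1024,
)
print(response.choices[0].message.content)
```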
Breakthrough: R-4B's Adaptive Thinking Engine
R-4B's intelligence lies in its adaptive thinking capability:
When encountering simple problems (e.g., simple entity recognition, easy Q&A), it chooses to respond directly and efficiently.
For complex tasks (e.g., mathematical calculations, chart analysis), it automatically switches to a deep thinking mode and generates a detailed thought process (a toy demonstration follows).
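To make the switching behavior concrete, below is a minimal local-inference sketch using Hugging Face transformers. The loading pattern (trust_remote_code, AutoProcessor) and the <think>...</think> delimiter for the reasoning trace are assumptions based on how open auto-thinking checkpoints are commonly packaged; the model card is authoritative.

```python
# Minimal sketch, assuming the checkpoint ships custom code loadable via
# trust_remote_code and wraps any reasoning trace in <think>...</think>
# tags (an assumption; check the model card for the actual delimiter).
import torch
from transformers import AutoModel, AutoProcessor

model_id = "YannQi/R-4B"
model = AutoModel.from_pretrained(
    model_id, torch_dtype=torch.bfloat16,
    trust_remote_code=True, device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

def ask(question: str) -> str:
    messages = [{"role": "user",
                 "content": [{"type": "text", "text": question}]}]
    inputs = processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=2048)
    # Keep special tokens so a <think> trace, if present, remains visible.
    return processor.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=False)

# In auto-thinking mode, a factual query should be answered directly...
print("<think>" in ask("What is the capital of France?"))
# ...while a multi-step problem should trigger a reasoning trace.
print("<think>" in ask("A train covers 120 km in 1.5 h; what is its speed?"))
```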
The core innovation of R-4B lies in its unique two-stage training strategy. To achieve adaptive thinking for general domains, the research team first proposed a bi-mode annealing training strategy, prompting the model to master both thinking and non-thinking capabilities in general domains simultaneously.
This stage can be understood as a "thinking enlightenment" for the model: it is fed two data paradigms simultaneously, one requiring direct answers (non-thinking mode, like daily conversation) and another requiring detailed reasoning (thinking mode, like solving math problems). Through this training, the model masters both response modes, laying a solid foundation for the subsequent adaptive-thinking training. The core of this stage is the data-construction strategy for the two modes in general domains: for objective questions, the consistency of the model's sampled answers is used to measure question difficulty; for subjective questions, prompt engineering is used to judge whether deliberate thought is needed (a toy sketch of the consistency heuristic follows the list below).
Reasoning mode data: Covers multi-step reasoning tasks such as chart analysis and logical inference (e.g., scientific diagrams or mathematical problems).
Non-reasoning mode data: For queries requiring direct factual responses (e.g., entity recognition or simple Q&A).
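Here is a toy sketch of the consistency heuristic for objective questions, under the assumption that "consistency" means the agreement rate among stochastically sampled answers; the function names and the 0.75 threshold are illustrative, not taken from the paper.

```python
# Toy sketch of routing an objective question into the thinking or
# non-thinking data pool based on sampled-answer agreement.
from collections import Counter
from typing import Callable

def route_objective_question(
    sample_answer: Callable[[str], str],  # one stochastic sample per call
    question: str,
    n_samples: int = 8,
    threshold: float = 0.75,  # illustrative cutoff
) -> str:
    """Label a question for the non-thinking vs. thinking data pool."""
    answers = [sample_answer(question) for _ in range(n_samples)]
    consistency = Counter(answers).most_common(1)[0][1] / n_samples
    # High agreement -> the model answers reliably without deliberation,
    # so the question becomes non-thinking (direct-answer) training data.
    return "non-thinking" if consistency >= threshold else "thinking"
```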
After annealing training, the team obtains R-4B-Base, a base model proficient in both thinking and non-thinking modes, which serves as the starting point for adaptive-thinking reinforcement training. On this basis, the team developed the Bi-mode Policy Optimization (BPO) reinforcement learning algorithm. BPO relies neither on elaborately designed reward functions nor on domain-specific data: it uses only rule-based reward signals, starting from mathematical data and generalizing to general domains. Its core is a hybrid bi-mode rollout mechanism that forces the model to explore both thinking and non-thinking trajectories during training, preventing it from collapsing into a single response-mode preference. By rewarding correct responses in both modes simultaneously, the model learns to determine when to think (a simplified sketch follows).
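Below is a simplified sketch of what a hybrid bi-mode rollout with a rule-based reward could look like. All names are illustrative, and the actual BPO objective in the paper may differ in its grouping and advantage computation.

```python
# Simplified sketch of a hybrid bi-mode rollout: sample trajectories in
# BOTH modes for every query so the policy cannot collapse into a single
# response-mode preference. `generate` and `extract_answer` are any
# user-supplied callables; nothing here is the paper's actual code.
from typing import Callable, Dict, List

def bpo_rollout_group(
    generate: Callable[[str, str], str],    # (query, forced_mode) -> response
    extract_answer: Callable[[str], str],   # response -> final answer string
    query: str,
    reference_answer: str,
    n_per_mode: int = 4,
) -> List[Dict]:
    group = []
    for mode in ("thinking", "non-thinking"):
        for _ in range(n_per_mode):
            response = generate(query, mode)
            # Rule-based reward: 1 if the final answer matches, else 0.
            reward = float(extract_answer(response) == reference_answer)
            group.append({"mode": mode, "response": response, "reward": reward})
    # Advantages over the mixed group: on hard queries only thinking
    # rollouts tend to score, so thinking is reinforced there; on easy
    # queries both modes score, so unnecessary thinking gains no edge.
    mean_reward = sum(s["reward"] for s in group) / len(group)
    for s in group:
        s["advantage"] = s["reward"] - mean_reward
    return group
```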
Performance: Small Model, Big Impact
The R-4B-RL model demonstrates outstanding performance in multiple public benchmark tests, setting new records and surpassing larger models like Keye-VL-8B and Kimi-VL-A3B-Thinking-2506.
More importantly, R-4B-RL improves inference efficiency in adaptive thinking mode: it does not waste thinking tokens on simple tasks. This demonstrates the effectiveness of the BPO algorithm: the model achieves adaptive thinking without general-domain reinforcement learning data or any additional reward-function engineering.
Application Prospects: An Intelligent Wave from Research to Industry
R-4B's breakthrough extends beyond technology, opening up broad application scenarios:
Applied Intelligence: In daily Q&A analysis, it automatically switches thinking modes between simple queries (e.g., document content extraction) and complex reasoning (e.g., chart analysis), improving automation efficiency.
Scientific Research: When processing scientific charts, R-4B's deep reasoning mode can analyze multi-step relationships, accurately interpret data, and enhance research efficiency.
Consumer AI: For edge device deployment, R-4B reduces latency and energy consumption with fewer parameters and an adaptive thinking mode, making it suitable for instant Q&A systems.
Demo examples: (1) document content extraction, a simple query answered directly; (2) chart analysis, a complex task handled with deep reasoning.
Conclusion: Adaptive Thinking, Exploring New Paths for AI Development
From bi-mode annealing training to BPO optimization, R-4B not only resolves the thinking dilemma of MLLMs but also demonstrates that adaptive thinking is feasible in small models. Adaptive thinking is not merely a technical optimization; it is a pursuit of balance between efficiency and generality. In an era of soaring AI compute and inference costs, R-4B's lightweight, intelligent design offers an energy-efficient path toward the sustainable development of large models.