A Survey of Research on Reinforcement Learning-based Reasoning Capabilities in Multimodal Large Language Models

Research on integrating Reinforcement Learning (RL) into the reasoning capabilities of Multimodal Large Language Models (MLLMs) is developing rapidly and has become a transformative research frontier. Although MLLMs significantly extend traditional Large Language Models (LLMs) and can process modalities such as images, audio, and video, achieving robust reasoning over multimodal input remains challenging. This article systematically reviews research progress on RL-based multimodal reasoning, covering core algorithm design, innovations in reward mechanisms, and practical applications. We focus on two major categories of RL paradigms, value-free and value-based methods, and discuss how RL enhances reasoning by optimizing reasoning trajectories and aligning multimodal information. We further survey mainstream benchmark datasets, evaluation protocols, and current research limitations, and propose future research directions to address key bottlenecks such as sparse rewards, inefficient cross-modal reasoning, and real-world deployment. Our goal is to provide a systematic and comprehensive reference for scholars advancing RL-based reasoning research in the multimodal era.

The rise of Large Language Models (LLMs) [2, 35, 36, 94, 130] has ushered in a new era for artificial intelligence, with models demonstrating remarkable instruction-following and few-shot learning capabilities [10]. However, achieving human-like intelligence requires not only basic perception but also complex cognition: the ability to reason iteratively through contextual understanding and self-correction. Motivated by this, In-context Learning (ICL) techniques [112, 113, 121] endow LLMs with the ability to reason step by step, a mechanism commonly referred to as Chain-of-Thought (CoT) reasoning [9, 109, 114, 146]. OpenAI's o1 model [45] performs exceptionally well on reasoning tasks and has drawn widespread attention to test-time scaling of reasoning capabilities: by allocating additional computation during inference to enable "slow thinking" [49], the model further improves accuracy on complex questions.

Inspired by the extensive CoT research on LLMs, reasoning in Multimodal Large Language Models (MLLMs) [6, 69, 96, 105, 119] has also progressed rapidly. Typical methods include Best-of-N, Beam Search, and Monte Carlo Tree Search [13, 99, 108, 125, 132]. These methods rely on search mechanisms to generate large amounts of reasoning data, from which models then learn autonomous reasoning capabilities through supervised fine-tuning.
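To illustrate how such search-based data generation works, the following is a minimal Best-of-N sketch (not taken from any of the surveyed papers): the sampler and the verifier are passed in as hypothetical callables, and the highest-scoring reasoning chain is kept, for example as a supervised fine-tuning example.

```python
from typing import Callable, List, Tuple

def best_of_n(
    prompt: str,
    generate_fn: Callable[[str], str],      # hypothetical: samples one reasoning chain from the model
    score_fn: Callable[[str, str], float],  # hypothetical: verifier score for (prompt, chain)
    n: int = 8,
) -> str:
    """Sample n candidate reasoning chains and keep the highest-scoring one."""
    candidates: List[str] = [generate_fn(prompt) for _ in range(n)]
    scored: List[Tuple[float, str]] = [(score_fn(prompt, c), c) for c in candidates]
    return max(scored, key=lambda pair: pair[0])[1]
```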

With advances in Reinforcement Learning (RL) theory and techniques, DeepSeek R1 [37] demonstrated that LLMs can learn complex reasoning autonomously from simple rule-based incentives and lightweight RL algorithms (such as GRPO [85]). This approach allows LLMs to exhibit "Aha Moments" without explicit supervision, manifested as self-reflection and autonomously growing response length during training. Recent studies [43, 63, 76, 150] have extended this method to MLLMs and applied it to tasks such as object recognition [63], semantic segmentation [60], and video analysis [91]. With limited training data, these methods substantially improve MLLM performance, matching Supervised Fine-Tuning (SFT) on in-domain tests and surpassing SFT models in out-of-distribution (OOD) evaluation.
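To make the value-free idea concrete, below is a minimal sketch of the group-relative advantage computation at the heart of GRPO [85], assuming a rule-based 0/1 accuracy reward; the full GRPO objective additionally uses a clipped policy ratio and a KL penalty, which are omitted here.

```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages: each sampled response is scored against the
    mean and standard deviation of its own group, so no value (critic) network
    is needed."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)

# Example: 4 responses sampled for one prompt, rule-based 0/1 accuracy rewards.
rewards = np.array([1.0, 0.0, 0.0, 1.0])
print(grpo_advantages(rewards))  # correct responses receive positive advantages
```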

However, as shown in Figure 1, this rapidly developing trend also presents numerous challenges. Although RL-based methods are effective, most still follow a text-based thinking paradigm, neglecting the crucial role that other modalities play in multimodal scenarios. Furthermore, current RL reasoning methods rely primarily on rule-based reward functions and verifiable answers, and thus fail to generalize to broader scenarios such as open-ended problems without clear answers.

While several surveys have examined the reasoning capabilities of MLLMs [54, 110], none has specifically and systematically explored RL-based reasoning methods in MLLMs. To fill this gap, this paper reviews RL-based MLLM reasoning methods, comprehensively surveying their technical development, methodology, practical applications, and future directions, and aims to serve as a reference and guide for this rapidly evolving research field, thereby promoting continued innovation.

We first introduce the relevant background on MLLMs, Chain-of-Thought reasoning mechanisms, and Reinforcement Learning in Section 2. Then, Section 3 reviews RL algorithm design and optimization strategies in LLMs and MLLMs; Sections 4 to 6 elaborate on the algorithm design, reward mechanisms, and benchmark evaluation of RL-based reasoning methods in MLLMs; finally, Section 7 discusses current limitations and future research directions.

This paper provides a systematic analysis of reinforcement learning-based reasoning methods in MLLMs from the following four key perspectives:

Exploring the key designs and optimization strategies of RL in LLMs and MLLMs: analyzing the core ideas and improvement directions of value-free and value-based methods, discussing innovations that enhance training efficiency, stability, and reasoning performance, and comparing the strengths, weaknesses, and future optimization potential of each method.

Analyzing the algorithmic frameworks, reward function designs, and multimodal fusion strategies of existing RL-based reasoning methods: classifying representative methods by the RL algorithm used, the reward mechanism (accuracy- or structure-oriented; see the sketch after this list), and how multimodal inputs (visual, audio, and temporal information) are integrated.

Surveying benchmark datasets and evaluation protocols for assessing MLLM reasoning capabilities: analyzing how datasets are constructed (data sources, model output collection, and preference annotation), covering reasoning tasks such as mathematical, scientific, spatial, and interactive reasoning, and organizing them by domain specificity and generalization ability.

Identifying current limitations and proposing future research directions: Discussing current challenges, such as sparse and static reward feedback, inefficient reasoning paths, and weak cross-modal collaboration, and exploring promising directions including hierarchical reward modeling, visually guided CoT generation, and lightweight RL frameworks suitable for real-world multimodal agents.
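As referenced above, the sketch below illustrates what a combined accuracy- and structure-oriented rule-based reward might look like; the tag format, weights, and matching logic are assumptions for illustration and vary across the surveyed methods.

```python
import re

def rule_based_reward(response: str, ground_truth: str) -> float:
    """Hypothetical rule-based reward: a structure (format) term plus an
    accuracy term against a verifiable ground-truth answer."""
    # Structure reward: the response should wrap its reasoning and answer in tags.
    format_ok = bool(re.fullmatch(
        r"(?s)\s*<think>.*</think>\s*<answer>.*</answer>\s*", response))
    format_reward = 0.5 if format_ok else 0.0

    # Accuracy reward: the extracted answer must match the ground truth exactly.
    match = re.search(r"<answer>(.*?)</answer>", response, re.S)
    answer = match.group(1).strip() if match else ""
    accuracy_reward = 1.0 if answer == ground_truth.strip() else 0.0

    return accuracy_reward + format_reward
```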


