Evolution and Development Trends of Reinforcement Learning Frameworks

Author | Feng Nie

Original Link: https://zhuanlan.zhihu.com/p/1932578428181279850

This article is for academic sharing only. If there is any infringement, please contact us for deletion.


Robin's Home Page: jianzhnie.github.io/llmtech/#/rlhf/infra/RL-Infra_overview

1. From SFT to Reinforcement Learning: A Paradigm Shift in Model Training

Before OpenAI released the o1 series models in 2024, the mainstream approach to training large models relied primarily on Supervised Fine-Tuning (SFT). SFT updates model parameters by having the model learn "standard answers" and computing the loss between predictions and ground-truth labels. The training process is relatively simple, and deep learning frameworks such as PyTorch and TensorFlow have built a rich set of training-acceleration tools around this paradigm.

However, since the release of the o1 series, the focus of model training has gradually shifted from SFT to Reinforcement Learning (RL). SFT is increasingly treated as a "warm-up" phase whose role is reduced to parameter initialization or policy guidance, while RL plays an ever more critical role in enhancing model capabilities.

1.1. Evolution and Diversification of RL Algorithms

RL algorithms themselves are also iterating rapidly. From early DPO (Direct Preference Optimization) to the classic PPO (Proximal Policy Optimization), and on to more recent methods such as GRPO, RLOO, Reinforce++, and DAPO, RL algorithms keep improving in their policy-update schemes, training stability, and sample efficiency.

Although DPO was once popular due to its simplicity, its limitations have gradually become apparent with increasing task complexity and model scale, and it is now less commonly used in practical engineering. Nevertheless, the overall structure of mainstream RL frameworks remains relatively consistent, with core processes primarily including the following stages:

1.2. Three Major Modules of RL Training Workflow

Module One: Policy Generation (Rollout)

Corresponds to the process of "students finding answers themselves." This is the Rollout phase in RL training, where the model generates responses (actions) based on the current policy, simulating interaction with the environment. This phase is an extension of the model inference process and usually requires extensive sampling to obtain diverse behavior trajectories.
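To make the Rollout phase concrete, here is a minimal sampling sketch using vLLM's offline generation API; the model checkpoint, prompts, and sampling settings are illustrative assumptions, not taken from any particular framework:

```python
# Minimal rollout sketch with vLLM's offline API (illustrative only).
from vllm import LLM, SamplingParams

prompts = [
    "Solve: 12 * 7 = ?",
    "Write a one-line Python function that reverses a string.",
]

# Sample several responses per prompt to obtain diverse behavior trajectories.
sampling_params = SamplingParams(n=4, temperature=1.0, top_p=0.95, max_tokens=512)

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # placeholder policy checkpoint
outputs = llm.generate(prompts, sampling_params)

rollouts = []
for out in outputs:
    for completion in out.outputs:           # n completions per prompt
        rollouts.append({
            "prompt": out.prompt,
            "response": completion.text,
            "token_ids": list(completion.token_ids),
        })
```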

Module Two: Reward Evaluation

Corresponds to the process of "grading students' answers." Traditionally, this stage relies on a Reward Model to evaluate the quality of generated results. At the current stage, as task complexity increases, reward evaluation methods also tend to diversify:

Rule-based evaluation: In fields such as mathematics, physics, and code, scoring is done by matching results with rules.

Lightweight reward model: Training a small model (e.g., 7B parameters) for scoring, with controllable costs and good results.

In many research projects, this module is even simplified as part of Rollout and not given separate emphasis. However, with the rise of Agent behavior simulation, especially in commercial application scenarios (e.g., e-commerce, customer service), the complexity of reward evaluation has significantly increased, and the importance of this module will continue to grow in the future.
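As a concrete illustration of the rule-based case above, here is a minimal scorer sketch for math-style answers; the answer-extraction convention (\boxed{...} or "Answer:") is an assumption for illustration, not any framework's actual rule set:

```python
import re

def rule_based_reward(response: str, reference_answer: str) -> float:
    """Score a response by matching its final answer against the reference.

    A minimal sketch: real rule sets normalize numbers, handle units,
    run code sandboxes for programming tasks, etc.
    """
    # Assumed convention: the final answer appears as \boxed{...} or after "Answer:".
    match = re.search(r"\\boxed\{([^}]*)\}", response) or re.search(r"Answer:\s*(.+)", response)
    if not match:
        return 0.0  # no parseable final answer
    predicted = match.group(1).strip()
    return 1.0 if predicted == reference_answer.strip() else 0.0

# Usage:
# rule_based_reward("... so the result is \\boxed{84}", "84")  -> 1.0
```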

Module Three: Policy Update

Corresponds to the process of "students learning based on grades." This is the core stage of RL training, typically implemented by modifying the loss function on top of traditional training frameworks (e.g., PyTorch, DeepSpeed). Different algorithms (e.g., PPO, DPO, RLOO) implement this stage differently, but the overall structure remains consistent.
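To ground the point about modifying the loss function, here is a minimal PPO-style clipped policy loss in PyTorch; this is the generic textbook form, not the implementation of any particular framework:

```python
import torch

def ppo_clip_loss(logprobs: torch.Tensor,
                  old_logprobs: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_ratio: float = 0.2) -> torch.Tensor:
    """Generic PPO clipped surrogate objective over response tokens.

    logprobs / old_logprobs: per-token log-probs under the current and rollout
    policies; advantages: per-token advantage estimates. All shapes [batch, seq].
    """
    ratio = torch.exp(logprobs - old_logprobs)          # importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * advantages
    # Maximize the surrogate objective => minimize its negation.
    return -torch.mean(torch.min(unclipped, clipped))
```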

1.3 Summary

From an SFT-dominated training paradigm to RL-driven capability enhancement, the training process for large models is undergoing profound changes. While the structure of RL frameworks remains stable, the functions, implementation methods, and importance of their various modules are constantly evolving.

Rollout module: Faces performance challenges brought by long contexts and heterogeneous tasks;

Reward Evaluation module: Evolving from simple rules to complex evaluations, potentially becoming a critical bottleneck in future RL training;

Policy Update module: Relies on underlying training framework performance optimization and algorithm iteration.

With the development of Agent behavior simulation, complex task modeling, multimodal interaction, and other directions, the design of RL frameworks will increasingly focus on inter-module synergy, efficient resource scheduling, and the unification of algorithms and engineering implementations.

2. RL Training Framework Design and Performance Optimization Challenges

Currently, mainstream Reinforcement Learning (RL) training frameworks are typically divided into two core modules: Training and Rollout.

When designing an efficient RL training system, developers face a series of key challenges. Below are the three core issues we have identified during our technology selection and framework design process.

2.1 Challenge One: Synergy and Resource Management between Rollout and Training Modules

Currently, RL training commonly employs an On-policy strategy, meaning that the Rollout and training processes must be executed sequentially. However, with the continuous growth in model scale, distributed multi-GPU training has become an inevitable trend.

Rollout phase: Primarily a memory-intensive task, especially when handling long contexts (e.g., Chain-of-Thought), requiring the maintenance of large amounts of KV Cache (Key-Value Cache).

Training phase: A compute-intensive task involving large-scale parameter updates and gradient calculations.

Although each phase already has many optimization techniques (e.g., memory reuse, pipeline parallelism), efficiently managing these two kinds of heterogeneous workloads within a unified framework, and optimizing the parameter-synchronization mechanism between them, remain key challenges in building an efficient RL system.

2.2 Challenge Two: Diversity of Underlying Training and Inference Frameworks

Currently, there are multiple mainstream training frameworks, such as:

Megatron-LM

DeepSpeed (ZeRO)

PyTorch FSDP

At the same time, inference engines also show a diversifying trend:

vLLM

SGLang

Significant architectural differences among training frameworks and inference engines lead to substantially different implementations of parameter synchronization, inference scheduling, and related logic. In the parameter-update path alone, each framework/engine combination may require a completely different implementation, which places high demands on system maintainability and scalability.

2.3 Challenge Three: Uncertainty Caused by Heterogeneous Batch Execution

Rollout tasks are typically executed in batches, but the complexity of tasks within a batch can vary greatly. Especially when introducing Agent behavior simulation, this heterogeneity becomes more pronounced, potentially leading to decreased overall scheduling efficiency and imbalanced resource utilization.

3. Performance Optimization Analysis

3.1 Initial Implementation and Performance Bottlenecks

In early implementations of RL training, the entire workflow was typically divided into three stages:

1. Inference phase (Rollout): The model generates responses based on the current policy.

2. Evaluation phase: The quality of generated results is scored via a reward model or other mechanisms.

3. Training phase: The policy model is updated based on the scoring results.

This workflow can essentially be implemented based on the SFT (Supervised Fine-Tuning) framework, with the difference being the need to initialize multiple model instances (e.g., policy model, reward model). However, this implementation often presents significant performance bottlenecks in actual operation.

3.2 Memory Optimization Strategies

In large-scale model training, GPU memory consumption primarily includes the following parts:

Model parameters (Parameters)

Gradients

Optimizer states (Optimizer States)

Activations

Taking a 7B-parameter model as an example, under mixed-precision training the FP16/BF16 parameters and gradients alone require approximately 28GB of GPU memory (about 2 bytes each per parameter), and the FP32 optimizer states (parameter copy, momentum, and variance for Adam) occupy another 28GB×3 = 84GB, for a total of roughly 112GB. Clearly, a single GPU cannot hold such a memory footprint.
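The arithmetic above can be checked with a few lines; the 2/2/12-bytes-per-parameter accounting follows the common mixed-precision Adam convention and is an assumption to adjust for your actual setup:

```python
def training_memory_gb(num_params: float,
                       bytes_param: int = 2,    # FP16/BF16 weights
                       bytes_grad: int = 2,     # FP16/BF16 gradients
                       bytes_optim: int = 12):  # FP32 copy + Adam momentum + variance
    """Rough per-replica memory estimate (decimal GB), ignoring activations."""
    params_and_grads = num_params * (bytes_param + bytes_grad) / 1e9
    optimizer_states = num_params * bytes_optim / 1e9
    return params_and_grads, optimizer_states, params_and_grads + optimizer_states

print(training_memory_gb(7e9))  # -> (28.0, 84.0, 112.0)
```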

To address this, the industry has proposed various distributed training strategies:

Data Parallelism (DP): Such as DeepSpeed ZeRO-1/2/3; ZeRO-3 in particular shards parameters and dynamically reconstructs them via All-Gather operations.

Tensor Parallelism (TP) and Pipeline Parallelism (PP): Such as Megatron-LM, which uses a parameter partitioning strategy suitable for large-scale models.

According to research conclusions from NVIDIA's relevant papers, DP and TP/PP perform similarly at scales below a thousand GPUs; however, at larger scales, TP/PP shows a more significant performance advantage due to avoiding the communication overhead of All-Gather operations.

[Table: comparison of Data Parallelism (DP), Tensor Parallelism (TP), and Pipeline Parallelism (PP) strategies across different characteristics.]

3.3 Inference Speed Optimization and Engine Selection

Current mainstream inference engines (e.g., vLLM and SGLang) have achieved significant performance improvements in KV Cache reuse and underlying operator optimization. Nevertheless, parameter synchronization between training and inference engines still presents some challenges:

Outputs generated by inference engines differ in precision from those by training engines;

The current mainstream practice is: use an inference engine for accelerated generation during the Rollout phase, and then have the training engine recalculate logits (only the prefill stage is needed, which is computationally efficient).

Therefore, integrating high-performance inference engines with training frameworks is an effective path to improving overall RL training efficiency. However, how to efficiently combine and coordinate training and inference modules remains a question worthy of in-depth research.
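A minimal sketch of that recomputation step: given the token ids produced by the rollout engine, a single prefill-style forward pass under the training engine yields the per-token log-probabilities consumed by the RL loss. This is generic PyTorch assuming a HuggingFace-style causal LM interface, not any framework's actual code:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def recompute_logprobs(model, input_ids: torch.Tensor, response_start: int) -> torch.Tensor:
    """Recompute per-token log-probs of the rollout under the training engine.

    input_ids: [batch, seq] prompt+response tokens from the rollout engine.
    response_start: index where the response begins. A sketch; real code also
    handles padding, attention masks, and micro-batching.
    """
    logits = model(input_ids).logits                  # single prefill-style forward pass
    logprobs = F.log_softmax(logits[:, :-1], dim=-1)  # position t predicts token t+1
    targets = input_ids[:, 1:].unsqueeze(-1)
    token_logprobs = logprobs.gather(-1, targets).squeeze(-1)
    return token_logprobs[:, response_start - 1:]     # keep only the response tokens
```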

4. Integration of Training Frameworks and Inference Engines

4.1 SPMD and MPMD Concept Analysis

Before discussing how training frameworks and inference engines can be combined, it is necessary to understand two parallel execution modes: SPMD (Single Program, Multiple Data) and MPMD (Multiple Programs, Multiple Data). Briefly, SPMD means multiple processing units execute the same program on different shards of the data, while MPMD means different units run different programs on different data. The former typically needs no central controller to coordinate the workflow, whereas the latter usually does. These two modes are often described as multi-controller and single-controller architectures, respectively:

Multi-controller (SPMD): every worker node executes the same program logic and drives itself, with no centralized control; this keeps the runtime simple and suits scenarios with large datasets but comparatively smaller models.

Single-controller (MPMD): a central controller can dispatch different programs to different worker nodes; this adds implementation complexity but offers greater flexibility, suiting specific application scenarios.

Mainstream deep learning training frameworks such as DeepSpeed and Megatron adopt the SPMD mode, ensuring that all processes follow the same code logic. The situation is different for inference engines such as SGLang and vLLM: although they adhere to the SPMD principle during computation, they do not fit the SPMD/MPMD classification cleanly when it comes to deciding where the next token comes from or how to manage the KV cache. For such situations, systems like Google's Pathways offer more flexible solutions.

Given this background, we should focus more on the communication mechanisms between training frameworks and inference engines regarding training data and model parameters, rather than being limited to whether a single-controller or multi-controller architecture is used.
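The distinction can be made concrete with two tiny patterns (both illustrative sketches): an SPMD/multi-controller script that every rank runs identically, versus a single-controller driver that dispatches work to Ray workers:

```python
# Pattern 1: SPMD / multi-controller -- every rank runs this same script,
# launched e.g. via `torchrun --nproc_per_node=8 train.py`.
import torch.distributed as dist

def spmd_main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    # ... each rank executes identical training code on its own data shard ...

# Pattern 2: single controller -- one driver process dispatches (possibly
# different) work to workers; sketched here with Ray.
import ray

@ray.remote
class Worker:
    def run(self, task: str) -> str:
        return f"done: {task}"

def single_controller_main():
    ray.init()
    workers = [Worker.remote() for _ in range(4)]
    # The driver decides which worker does what (rollout, reward, training, ...).
    tasks = ["rollout", "reward", "train", "eval"]
    results = ray.get([w.run.remote(t) for w, t in zip(workers, tasks)])
    print(results)
```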

4.2 SLIME's Specific Implementation Method

The core challenge between training frameworks and inference engines lies in the communication mechanism for training data and model parameters. To better understand this, we can examine the SLIME and ROLL projects to explore specific implementation solutions.

SLIME is a post-training framework focused on reinforcement learning extensions, defining two main components: RayTrainGroup for the training framework and RolloutGroup for the inference engine.

4.2.1 Data Transfer Mechanism

SLIME implements data transfer between the inference engine and the training module by defining a middleware class, the Buffer. All data is stored in this Buffer (it can even be persisted to disk) and is accessed via a specified rollout ID. In addition, the data-processing functions and the rollout/eval functions used by the Buffer class can be flexibly configured via command-line parameters, which greatly enhances the system's adaptability.

This design makes it more flexible and efficient to meet business needs, especially when facing various special requirements and data formats.

The rollout's generate function obtains its data through this Buffer.

The training framework likewise retrieves the data it needs from the same Buffer.

After generation, the rollout buffer is synchronized to the actor so that training can consume the samples; a sketch of this buffer-mediated flow is shown below.
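Below is a minimal sketch of such a buffer-mediated flow. The class and method names are hypothetical, not SLIME's actual Buffer API; the point is that rollout and training exchange data only through the middleware:

```python
from collections import defaultdict

class RolloutBuffer:
    """Hypothetical middleware: rollout writes samples, training reads them by rollout id."""

    def __init__(self):
        self._data = defaultdict(list)

    def put(self, rollout_id: int, samples: list) -> None:
        self._data[rollout_id].extend(samples)   # could also be persisted to disk

    def get(self, rollout_id: int) -> list:
        return self._data[rollout_id]

# Rollout side: generate and deposit samples.
buffer = RolloutBuffer()
buffer.put(rollout_id=0, samples=[{"prompt": "1+1=?", "response": "2", "reward": 1.0}])

# Training side: fetch the same rollout's samples for the policy update.
batch = buffer.get(rollout_id=0)
```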

4.2.2 Model Parameter Synchronization Mechanism

To enable the rollout engine to correctly synchronize parameters at appropriate times, SLIME passes actor configuration information to the rollout. This involves initializing process groups to update weights after each training stage.

This process includes not only data buffer synchronization but also coordination of parallel configurations between actors, ensuring consistency and accuracy of parameter updates.
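The weight-update path can be sketched as follows. This is a hypothetical helper built on torch.distributed broadcast, not SLIME's actual implementation; it assumes the training ranks and the rollout engine's ranks share a process group and iterate over parameters in the same order:

```python
import torch
import torch.distributed as dist

def sync_weights(named_tensors: dict, group, src_rank: int = 0) -> None:
    """Hypothetical weight-sync step after a training stage.

    Every rank in `group` calls this with same-shaped tensors in the same order;
    the training rank `src_rank` holds the fresh weights, and rollout ranks
    receive them in place, then hand them to the inference engine's
    weight-update entry point.
    """
    for name in sorted(named_tensors):            # identical iteration order on all ranks
        dist.broadcast(named_tensors[name], src=src_rank, group=group)
```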

4.3 ROLL's Specific Implementation Method

ROLL defines multiple roles through a Cluster, with each role responsible for different tasks. This design aligns well with the algorithmic perspective, as from an algorithmic standpoint, the differences between training frameworks and inference engines are not significant, and encapsulating them within a cluster effectively hides these complexities.

4.3.1 Data Transfer Mechanism

Similar to Megatron, ROLL allows for domain-specific sampling, configured in the pipeline.py file. This offers a more convenient solution if users do not want to write data generators. Particularly for reward models, while a unified model is ideal, training difficulty often leads to using different reward models for different domains, which are then aggregated. ROLL supports custom configuration for different domains, batches, and queries to adapt to diverse application scenarios.

4.3.2 Model Parameter Synchronization Mechanism

ROLL's model update logic combines point-to-point communication and collective communication:

Point-to-point communication: Used for parameter updates when source and target sit on the same device; whether they are on the same device is determined from the worker's node_rank and gpu_rank, allowing a direct and efficient data exchange.

Collective communication: Achieved by broadcasting parameters to the target cluster, with broadcast operations performed only by the main process (rank 0), suitable for parameter synchronization across devices.

These two communication strategies correspond to colocate and non-colocate scenarios, ensuring flexibility and efficiency in parameter synchronization.
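A hedged sketch of that decision logic (hypothetical code, not ROLL's actual implementation): colocated worker pairs copy directly on the device, otherwise the main process broadcasts to the target cluster:

```python
import torch
import torch.distributed as dist

def is_colocated(src: dict, dst: dict) -> bool:
    """Same device iff the two workers share node_rank and gpu_rank."""
    return (src["node_rank"], src["gpu_rank"]) == (dst["node_rank"], dst["gpu_rank"])

def sync_param(param: torch.Tensor, dst_buffer: torch.Tensor,
               src_worker: dict, dst_worker: dict, group=None) -> None:
    """Hypothetical dispatch between point-to-point copy and collective broadcast.

    Colocated: a direct in-place copy on the same GPU suffices.
    Non-colocated: every rank in `group` calls this with a same-shaped tensor,
    and rank 0 (the main process) broadcasts its values to the target cluster.
    """
    if is_colocated(src_worker, dst_worker):
        dst_buffer.copy_(param)
    else:
        dist.broadcast(param, src=0, group=group)
```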

4.3.3 Considerations for Cross-Machine Deployment

When all components are on the same machine, hardcoding parameter synchronization is relatively simple, but when it comes to cross-machine deployment, the situation becomes more complex. At this point, it is necessary to consider not only how to effectively manage network communication delays and bandwidth limitations but also how to optimize resource allocation and load balancing in a distributed environment. Furthermore, in a single-controller mode, the controller's pressure increases with the expansion of the cluster scale, especially when processing multimedia data, requiring special attention to potential performance bottlenecks. Therefore, for cross-machine deployment, selecting appropriate communication strategies and optimizing the controller's workload becomes particularly important. However, based on the designs of SLIME and ROLL, the core of parameter synchronization is to notify the GPU to perform synchronization, and the intermediate communication process does not rely on the controller, which offers some convenience and flexibility for cross-machine deployment.

4.4 Colocation and Ray's Application

Placing models such as the Actor, Reference (Ref), Reward, and Critic on the same GPU card is called colocation. However, as previously mentioned, model scale keeps increasing (even a 7B model is already difficult to train on a single card), and multiple models exceeding 1000B parameters are expected to emerge in the second half of the year, which makes the overhead introduced by parallel computing very significant. Reward models, by contrast, are generally smaller (7B-30B is usually sufficient), so deploying them separately often offers better cost-effectiveness.

To manage this complexity, Ray, a powerful distributed computing framework, has been introduced into these projects to help developers reduce the burden of managing the underlying logic. For detailed introductions to Ray-based distributed training workflows and the Ray distributed computing framework, refer to the following articles:

Illustrated Guide to Ray-based Distributed Training Workflow in OpenRLHF

Detailed Explanation of Ray Distributed Computing Framework

Next, we will compare the differences in colocation and non-colocation implementations across four frameworks: SLIME, Verl, ROLL, and OpenRLHF.

4.4.1 SLIME

SLIME defines only two main workers: RayTrainGroup for training and RolloutGroup for inference. Training and inference can either be colocated on the same GPUs or deployed separately; in the non-colocated case, distributed communication is required to synchronize parameters. This design has a high level of abstraction, is easy to understand, and adapts well to different training and inference needs: simply specifying in the configuration whether to colocate causes the relevant operations to be executed automatically at all key stages.

4.4.2 ROLL

For non-colocation scenarios, ROLL allows fine-grained specification of different workers (e.g., actor, critic, reward, etc.) to be deployed on different GPUs, and even configured by iteration. If not manually specified, Ray will automatically handle the deployment. Given the high resource consumption of RL tasks, fine-grained GPU resource configuration helps improve resource utilization efficiency, but this also places higher demands on the algorithm side's resource scheduling capabilities. Clearly, using Ray to manage these complexities is more appropriate.

4.4.3 Verl

VERL adopts a unique approach to implementing colocation and non-colocation deployment. In non-colocation mode, each worker (e.g., actor, critic, reward, etc.) runs as an independent process, relying on Ray for scheduling. In colocation mode, multiple roles share the same Ray actor instance, instantiating multiple worker classes within the same process. By using create_colocated_worker_cls or create_colocated_worker_cls_fused methods, a multi-role class (e.g., WorkerDict/FusedWorker) is dynamically generated, holding multiple worker instances internally. External calls can use a unified interface to invoke methods of different role workers, and internally, these calls are automatically dispatched to the corresponding worker instances. This approach enables co-existence of multiple roles within the same process and can significantly improve performance in some scenarios, such as reducing latency and memory fragmentation caused by inter-process communication.
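The idea of "multiple roles in one process behind a unified dispatch interface" can be sketched as follows; this is a simplified illustration, not verl's actual create_colocated_worker_cls implementation:

```python
class ActorWorker:
    def update_policy(self, batch):
        return f"actor updated on {len(batch)} samples"

class CriticWorker:
    def compute_values(self, batch):
        return [0.0] * len(batch)

class ColocatedWorker:
    """Simplified stand-in for a dynamically generated multi-role worker:
    several role workers live in the same process, and calls are dispatched
    to the right instance by role name."""

    def __init__(self, role_classes: dict):
        self._roles = {name: cls() for name, cls in role_classes.items()}

    def call(self, role: str, method: str, *args, **kwargs):
        return getattr(self._roles[role], method)(*args, **kwargs)

# Usage: one process (or one Ray actor) hosts both roles behind one interface.
worker = ColocatedWorker({"actor": ActorWorker, "critic": CriticWorker})
worker.call("actor", "update_policy", batch=[1, 2, 3])
worker.call("critic", "compute_values", batch=[1, 2, 3])
```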

4.4.4 OpenRLHF

OpenRLHF offers flexible hybrid deployment options, supporting co-location of vLLM engine, Actor, Reference, Reward, and Critic model nodes, as well as partial hybrid deployment or complete separate deployment to accommodate asynchronous training needs. This flexibility allows it to handle diverse application scenarios, but also implies more complex management and optimization requirements.

4.4.5 Conclusion

In summary, in non-colocation scenarios, Ray can indeed help us manage resources more easily, especially when dealing with complex Agent and multi-turn interaction scenarios. However, according to feedback from operations and maintenance teams, Ray's design philosophy conflicts with existing Kubernetes cloud-native production environments, leading to higher management costs when deployed in actual production. Nevertheless, the Ray team is also working on optimizing these issues, for example, by enabling Ray to directly transfer tensor data via NCCL, thereby bypassing object storage and improving efficiency. In the future, we can expect more updates and improvements from Ray.

4.5 Integration of Different Training Frameworks and Inference Engines

When integrating different training frameworks and inference engines, parameter conversion issues may arise. For example, if vLLM uses 4-way Tensor Parallelism (TP=4) while DeepSpeed shards the model across 8 GPUs, appropriate parameter conversion is needed to ensure consistent data transfer. Megatron-LM has similar requirements. When multiple training frameworks and inference engines coexist, the adaptation workload grows multiplicatively with the number of framework/engine combinations, which can easily lead to configuration errors and performance problems.
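As an illustration of the kind of conversion involved (a toy example, not any framework's converter): a weight that is split across 4 training TP ranks must be gathered and re-split before an engine that shards it 8 ways can load it:

```python
import torch

# Toy example: a [1024, 1024] weight split along dim 0 across 4 training TP ranks.
full_weight = torch.randn(1024, 1024)
train_shards = list(torch.chunk(full_weight, chunks=4, dim=0))   # what each of 4 TP ranks holds

# To feed an inference engine that expects 8-way TP along the same dimension,
# the shards must be gathered back and re-chunked.
gathered = torch.cat(train_shards, dim=0)
infer_shards = list(torch.chunk(gathered, chunks=8, dim=0))

assert torch.equal(gathered, full_weight)
assert len(infer_shards) == 8 and infer_shards[0].shape == (128, 1024)
```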

4.6 Decoupled Code Design

Taking SLIME as an example, its architecture is divided into three layers: the top-level RolloutGroup is responsible for managing the overall flow of the inference engine; the middle-layer RolloutRayActor handles specific inference requests; and the bottom-layer SglangEngine implements the specific inference logic. This layered design makes it simple to replace the backend inference engine by just changing the bottom-layer implementation without modifying the upper-level control logic. Similarly, the training framework also adopts a similar layered structure, ensuring system flexibility and maintainability.
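The layering can be summarized with a small interface sketch; the names are hypothetical and only echo the structure described above, not SLIME's actual classes:

```python
from abc import ABC, abstractmethod

class InferenceEngine(ABC):
    """Bottom layer: backend-specific logic (e.g., an SGLang- or vLLM-backed engine)."""
    @abstractmethod
    def generate(self, prompts: list, sampling_params: dict) -> list: ...
    @abstractmethod
    def update_weights(self, named_tensors: dict) -> None: ...

class RolloutActor:
    """Middle layer: handles individual inference requests, agnostic of the backend."""
    def __init__(self, engine: InferenceEngine):
        self.engine = engine
    def rollout(self, prompts, sampling_params):
        return self.engine.generate(prompts, sampling_params)

class RolloutManager:
    """Top layer: owns the overall rollout flow (batching, scheduling, weight sync)."""
    def __init__(self, actors: list):
        self.actors = actors
```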

5. About Agentic RL

Currently, frameworks like ROLL, Verl, and OpenRLHF provide good support for Agentic RL. Although this might increase code complexity, with technological maturity, a clearer design is expected to emerge. In the future, Agentic RL is likely to become mainstream, with existing RL methods becoming part of it.

6. Framework Selection Advice

6.1 Framework Difficulty Analysis

A rapidly developing technological environment means old frameworks can quickly become outdated. Therefore, keeping frameworks concise and highly maintainable is key. New frameworks, having no legacy burden, can adapt to new technology trends more easily.

6.2 Recommended Frameworks

OpenRLHF: A high-performance open-source RLHF framework integrating Ray, vLLM, ZeRO-3, and HuggingFace Transformers.

slime: A newly launched framework with clean code, suitable for researchers who want to attempt bold framework modifications.

ROLL: Emphasizes data processing and asynchronous operation support, particularly suitable for teams exploring Agentic RL in depth.

verl: Stable and well-optimized, suitable for large-scale cluster deployment, especially for resource-rich teams.

Teams can choose the most suitable framework based on their specific needs and technical background. For teams with specific requirements or those looking to expand rapidly, Verl might be a better choice as it has been validated by multiple major companies. For teams pursuing technological innovation and agile development, SLIME or ROLL might be more appealing.

Conclusion

Over the past half-year, we have deeply explored RL training frameworks, Agent frameworks, and inference engine frameworks. Overall, in terms of code volume, Agent frameworks are the most extensive, followed by inference engines and RL training frameworks; in terms of code difficulty, inference engines lead, followed by RL training frameworks and Agent frameworks. It is worth noting that, excluding the complexity of the underlying operators of inference engines, the challenges of RL training frameworks primarily lie in integrating various systems and technologies, which requires framework developers to have a deep understanding of multiple technologies and business logics.

Open-source frameworks such as Verl, SLIME, ROLL, and OpenRLHF each have their own characteristics, reflecting the pursuit and persistence of their authors, and enjoy high community activity. It is fair to say that in the field of open-source RL frameworks, China is in a world-leading position in terms of technical strength and depth of understanding. Although the gap in algorithm talent is not significant, a certain gap remains in hardware resources (such as GPUs).
