SJTU PhD's Latest Insights: Clarifying Reinforcement Learning with Just Two Questions

Reprinted from AI Tech Review, for academic sharing only. Please leave a message for removal if there is any infringement.

From "trial and error" to "optimization", a unified thinking framework for RL.

As the field of artificial intelligence has developed, Reinforcement Learning (RL) has become one of the most fascinating and core research directions in AI. It seeks to answer a fundamental question: When an agent does not have a ready-made answer, how can it autonomously learn optimal behavior through interaction with its environment?

It sounds simple, but it is extremely complex in practice. For decades, researchers have proposed hundreds of algorithms, from the earliest Q-learning to later deep learning-based DDPG, SAC, PPO, IQL... Each method has its own principles, parameters, and assumptions, appearing independent, like a vast and confusing maze.

For those new to reinforcement learning, this complexity can often be frustrating: we seem to be learning countless names, yet struggle to see the connections between them.

However, a recent blog post by Kun Lei, a PhD student from Shanghai Jiao Tong University and Shanghai Qi Zhi Institute, proposes an illuminating framework: all reinforcement learning algorithms can be understood through two questions. First, where does the data come from? Second, how frequently is the policy updated?

These two seemingly simple questions, like two main threads, re-organize the world of reinforcement learning. From them, we can discover that complex RL algorithms are merely different points moving along these two axes.

And when this structure is revealed, the entire algorithmic logic suddenly becomes intuitive, orderly, and easier to understand.


Blog Address: https://lei-kun.github.io/blogs/rl.html


01 Where Does the Data Come From?

The process of reinforcement learning is essentially a cycle of an agent continuously collecting experience and using that experience to improve its policy. The differences between algorithms largely depend on what kind of data they rely on.

The most direct approach is "on-policy learning." In this mode, the agent interacts with the environment and learns simultaneously. Every action yields new data, which is immediately used to update the model and then discarded. These methods are like students constantly practicing in the field, with PPO being a representative algorithm.

The advantages of on-policy learning are flexibility and adaptability, but it can also be costly, as each trial and error might consume time, energy, or even cause losses.
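To make the loop concrete, here is a minimal sketch of an on-policy data cycle in the spirit of PPO. It is not code from the blog; `env`, `policy`, and `update` are hypothetical placeholders standing in for a real environment, policy network, and gradient step.

```python
# Minimal sketch of the on-policy data loop (PPO-style).
# `env`, `policy`, and `update` are hypothetical placeholders.

def on_policy_training(env, policy, update, num_iterations=1000, horizon=2048):
    state = env.reset()
    for _ in range(num_iterations):
        # 1. Collect a fresh batch of experience with the CURRENT policy.
        batch = []
        for _ in range(horizon):
            action = policy.act(state)
            next_state, reward, done = env.step(action)
            batch.append((state, action, reward, next_state, done))
            state = env.reset() if done else next_state

        # 2. Update on this batch only, then throw it away: data gathered
        #    by an older version of the policy is not reused.
        update(policy, batch)
```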

A more economical approach is "off-policy learning." It allows the agent to reuse past experiences repeatedly without having to re-interact with the environment each time. Algorithms store these experiences in a replay buffer and sample them repeatedly for learning when needed. DQN, DDPG, TD3, and SAC belong to this category.

Off-policy learning improves sample efficiency and makes the learning process more stable, making it a mainstream solution in many practical applications.
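A replay-buffer sketch shows where that sample-efficiency gain comes from; again, `env` and `agent` are hypothetical placeholders rather than any specific library's API.

```python
import random
from collections import deque

# Minimal sketch of off-policy learning with a replay buffer (DQN/TD3-style).
# `env` and `agent` are hypothetical placeholders.

def off_policy_training(env, agent, total_steps=100_000, batch_size=256):
    buffer = deque(maxlen=1_000_000)   # experience from many past policies
    state = env.reset()
    for _ in range(total_steps):
        action = agent.act(state)
        next_state, reward, done = env.step(action)
        buffer.append((state, action, reward, next_state, done))
        state = env.reset() if done else next_state

        # Each stored transition can be re-sampled many times, which is
        # where the sample-efficiency gain over on-policy methods comes from.
        if len(buffer) >= batch_size:
            idx = random.sample(range(len(buffer)), batch_size)
            agent.update([buffer[i] for i in idx])
```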

There is also a more extreme approach called "offline learning." Here, the agent relies entirely on a fixed dataset for training and cannot interact with the environment anymore. While seemingly constrained, this method is particularly crucial in high-risk scenarios such as healthcare, autonomous driving, or robot control.

Algorithms must learn the best possible decisions from existing data without trial and error, with CQL and IQL being representative methods of this type.
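The offline case removes environment interaction altogether. The sketch below assumes a pre-collected `dataset` of transitions and a hypothetical `agent`; methods such as CQL and IQL differ mainly in how the update step penalizes actions the data does not support.

```python
import random

# Minimal sketch of offline RL: training happens entirely on a fixed dataset,
# with no env.step() anywhere. `dataset` is a list of (s, a, r, s', done)
# transitions and `agent` is a hypothetical placeholder.

def offline_training(dataset, agent, num_gradient_steps=500_000, batch_size=256):
    for _ in range(num_gradient_steps):
        minibatch = random.sample(dataset, batch_size)
        agent.update(minibatch)   # e.g., a CQL- or IQL-style conservative update
```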

From on-policy to off-policy, and then to offline, the data acquisition methods gradually shift from active exploration to passive utilization. The choice of algorithm often reflects the practical constraints of the task: Can safe trial and error be performed? Can new data be continuously acquired? Is the cost of trial and error acceptable? This is the first dimension of reinforcement learning: where does the data come from?


02 The Rhythm of Learning Updates

The second dimension of reinforcement learning is the rhythm of learning updates. Simply put, it's about how often the agent evaluates its policy and adjusts its behavior.

The simplest approach is "one-step learning." The agent trains once on a fixed dataset and, once a policy is learned, no further improvements are made. Imitation learning is a typical example. It is fast and low-risk, suitable for tasks with high safety requirements or limited data.

Another approach is "multi-step learning." The algorithm updates multiple times on a batch of data until performance converges, then collects new data. This is a compromise: it avoids the high cost of frequent interaction while achieving better performance than one-step training.

The most representative is "iterative learning." These algorithms continuously evolve in a loop of "collecting data—updating the model—collecting more data," with each interaction driving performance improvements. They act like tireless learners, constantly exploring the unknown and refining themselves. PPO and SAC are examples of this approach.

From one-step to multi-step, and then to iterative, the update rhythm of algorithms becomes increasingly dense, signifying a transition from static to dynamic. The different rhythms reflect a trade-off between stability and adaptability.
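Side by side, the three rhythms differ only in where and how often new data enters the loop. The sketch below uses hypothetical `collect(policy, n)` and `update(policy, data)` helpers and is not code from the blog.

```python
# Schematic contrast of the three update rhythms, with hypothetical
# collect(policy, n) and update(policy, data) helpers.

def one_step(policy, dataset, update, steps=100_000):
    # Train once on a fixed dataset and stop (imitation-learning style).
    for _ in range(steps):
        update(policy, dataset)

def multi_step(policy, collect, update, rounds=10, steps_per_round=50_000):
    # Alternate: gather a batch, train on it until performance plateaus, repeat.
    for _ in range(rounds):
        batch = collect(policy, n=10_000)
        for _ in range(steps_per_round):
            update(policy, batch)

def iterative(policy, collect, update, iterations=1_000_000):
    # Collect and update in a tight loop (PPO/SAC-style online learning).
    for _ in range(iterations):
        update(policy, collect(policy, n=1))
```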


03 Towards a More Fundamental Unified Framework

After clarifying the two main threads of "where the data comes from" and "the rhythm of learning updates," the blog post proposes a more fundamental unified perspective: regardless of how the algorithms vary, all reinforcement learning methods are essentially doing two things: evaluating the current policy, and then improving it.

In simple terms, reinforcement learning is like a process of repeated self-practice:

First, evaluate to see how the current policy is performing, identifying which actions are good and which are not;

Then, improve by adjusting the policy based on the evaluation results, making the next decision a bit smarter.

Q-learning, PPO, SAC... they may sound different, but they are all repeating these two actions. The only differences are how they evaluate, how fast they improve, or what data they use.

In the blog, the author uses a set of formulas to unify these two steps:

The Evaluation Phase (Policy Evaluation) is about measuring "how good this policy truly is." The algorithm makes the model predict how much reward an action taken in a certain state will yield, then compares it with actual feedback. If the error is too large, the model is adjusted to make its predictions closer to reality. Online algorithms directly compute with new data, while offline or off-policy algorithms correct biases in old data through methods like importance sampling and weighted averaging.
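In a generic form (not necessarily the exact notation the blog uses), the evaluation step fits a value function Q to a Bellman target, with a weight w(s, a) that equals 1 for fresh on-policy data and acts as an importance-style correction for off-policy or offline data:

```latex
% Generic policy-evaluation loss: fit Q_phi to the Bellman target under the
% current policy pi. The weight w(s,a) is 1 for fresh on-policy data and a
% correction term (e.g., importance weights) for off-policy or offline data.
\[
\mathcal{L}_{\mathrm{eval}}(\phi)
  = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}
    \Big[\, w(s,a)\,\Big( Q_\phi(s,a) - \big(r + \gamma\,
      \mathbb{E}_{a' \sim \pi(\cdot \mid s')}\, Q_{\bar{\phi}}(s',a')\big) \Big)^{2} \Big]
\]
```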

The Improvement Phase (Policy Improvement) is about optimizing the policy itself after obtaining new evaluation results. The model will tend to choose actions that bring higher expected rewards. However, to avoid "overcorrecting" too quickly, many algorithms add constraints or regularization terms, such as preventing the new policy from deviating too much from the old one (this is the idea behind PPO), or maintaining a certain degree of exploration in the policy (this is the role of entropy regularization in SAC).
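The improvement step can then be written, again as a generic template rather than the blog's exact formula, as maximizing estimated value under regularization: a KL penalty with coefficient beta captures the PPO-style constraint of staying close to the old policy, while an entropy bonus with coefficient alpha captures the SAC-style push to keep exploring.

```latex
% Generic policy-improvement objective: prefer higher-value actions while a
% KL penalty keeps pi_theta close to the previous policy (PPO-style) and an
% entropy bonus keeps it exploratory (SAC-style).
\[
\max_{\theta}\;
  \mathbb{E}_{s \sim \mathcal{D}}\Big[
    \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[ Q_\phi(s,a) \big]
    \;-\; \beta\, D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot \mid s)\,\big\|\,\pi_{\mathrm{old}}(\cdot \mid s)\big)
    \;+\; \alpha\, \mathcal{H}\big(\pi_\theta(\cdot \mid s)\big)
  \Big]
\]
```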


From this perspective, the so-called different reinforcement learning algorithms are merely different implementations of these two processes. Some algorithms focus more on the accuracy of evaluation, some emphasize the stability of improvement, some update frequently and iterate quickly, while others are conservative and optimize slowly.

When we view reinforcement learning through "evaluation + improvement," the entire algorithmic system is laid out before us like an unraveling silk cocoon. All methods are no longer isolated techniques but different combinations of these two actions.

After clarifying these two main threads, the blog further extends its perspective to real-world intelligent systems, particularly the rapidly developing robot foundation models.

Kun Lei points out that this rhythm-centric way of thinking aligns closely with the training practices of modern robot foundation models, such as the Generalist team's GEN-0 and Physical Intelligence's pi_0.5. Their growth process resembles a continuously operating data flywheel: the system keeps absorbing new tasks and scenarios, folds them into a unified corpus, and then periodically undergoes retraining or fine-tuning.

Under such mechanisms, multi-step updates become a natural choice. Each training cycle brings small, controlled improvements, conservative enough to avoid the risk of distribution collapse, yet leaving sufficient room for exploration, enabling the model to steadily grow within an expanding data corpus.

Furthermore, as models gradually approach their capability limits, whether the goal is to surpass human performance on specific tasks or to align more precisely with human behavior, researchers typically turn to iterative online reinforcement learning, performing higher-frequency, more targeted evaluation and improvement.

This training strategy, transitioning from multi-step updates to online iteration, has been repeatedly validated in practice. For example, in typical settings like RL-100, multi-step updates achieve stable progress with limited data, and a further stage of online RL can boost model performance while maintaining safety and stability.


04 A Young Researcher at the Forefront of RL


Author's Homepage: https://lei-kun.github.io/

The author of this blog, Kun Lei, is currently a PhD student at Shanghai Jiao Tong University and Shanghai Qi Zhi Institute, advised by Professor Huazhe Xu of Tsinghua University.

Kun Lei graduated from Southwest Jiaotong University, where he began engaging in AI and optimization-related research during his undergraduate studies. He also collaborated on research with Professor Peng Guo of Southwest Jiaotong University and Professor Yi Wang of Auburn University in the United States.

Before pursuing his PhD, he worked as a research assistant at Shanghai Qi Zhi Institute, conducting research in reinforcement learning and robot intelligence with Professor Huazhe Xu. He later completed a four-month research internship at Westlake University, primarily exploring the application of embodied intelligence and reinforcement learning algorithms in real environments.

Kun Lei's research interests encompass deep reinforcement learning, embodied AI, and robot learning. Rather than solely pursuing algorithmic metrics, he is more concerned with how these algorithms can be truly implemented, how reinforcement learning can work effectively not only in simulated environments but also stably in real robotic systems, and how agents can learn quickly and adapt flexibly with limited data.

His blog also reveals that Kun Lei's research style combines engineering practice with intuitive thinking; he seeks clearer understanding rather than more complex models. This article on reinforcement learning exemplifies this approach, as he avoids stacking obscure formulas and instead uses two fundamental questions to outline the logical backbone of reinforcement learning.

The reason reinforcement learning can be daunting is its vast theoretical system and complex formulas. Beginners are often overwhelmed by concepts like Bellman equations, policy gradients, and discounted returns, where each term can expand into several pages of derivations, making it difficult to grasp the core.

The value of this blog lies in bringing everything back to basics. The author doesn't start from complex mathematics but poses two simple questions: Where does the data come from? How frequently is the policy updated?

These seemingly straightforward questions actually touch upon the roots of reinforcement learning. They help readers re-perceive the structure of algorithms, where different methods are no longer isolated techniques but rather different trade-offs revolving around these two dimensions. Through this perspective, the seemingly chaotic forest of reinforcement learning suddenly becomes navigable.

More importantly, this approach is not just a way of explaining, but also a habit of problem-solving. It reminds us that behind complex systems often lie the simplest rules, merely obscured by layers of formulas and terminology. When we return to the principles themselves and understand problems in a structured way, complexity ceases to be an obstacle.
