Reinforcement Learning (RL) is a core machine learning paradigm for solving sequential decision-making problems. However, RL typically requires large amounts of training data and computation, and it generalizes poorly across tasks. Building on ideas from Continual Learning (CL), Continual Reinforcement Learning (CRL) has emerged as a promising research direction that addresses these limitations by enabling agents to continuously learn, adapt to new tasks, and retain previously acquired knowledge.
This article provides a comprehensive examination of CRL, focusing on its core concepts, challenges, and methods. It proposes a new classification system for CRL methods, categorizing them into four types based on the kind of knowledge stored and/or transferred.
I. CRL Overview
A. Definition
Definition of CRL: An extension of Reinforcement Learning (RL) that emphasizes an agent's ability to continuously learn, adapt, and retain knowledge in dynamic, multi-task environments.
Distinction from Traditional RL: Traditional RL typically focuses on a single task, whereas CRL emphasizes maintaining and improving generalization capabilities across a sequence of tasks.
Relationship with Multi-Task RL (MTRL) and Transfer RL (TRL):
MTRL: Handles multiple tasks simultaneously, with a fixed and known task set.
TRL: Transfers knowledge from a source task to a target task, accelerating learning for the target task.
CRL: Tasks typically arrive sequentially, environments continuously change, and the goal is to accumulate knowledge and quickly adapt to new tasks.
B. Challenges
Main Challenges Faced by CRL: Achieving a delicate balance among plasticity, stability, and scalability.
Stability: Avoiding catastrophic forgetting and maintaining performance on old tasks.
Plasticity: The ability to learn new tasks and to leverage prior knowledge to improve performance on them.
Scalability: The ability to learn multiple tasks with limited resources.
C. Metrics
Traditional RL Metrics: Typically use cumulative reward or success rate to measure an agent's performance.
CRL Metrics (see the computation sketch after this list):
Average Performance: The agent's overall performance across all learned tasks.
Forgetting: The degree to which an agent's performance on old tasks declines after subsequent training.
Transfer: The agent's ability to use knowledge from previous tasks to improve performance on future tasks, including forward transfer and backward transfer.
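To make these metrics concrete, the following sketch computes them from a task-performance matrix. It assumes P[i, j] holds the evaluation score on task j measured after training on task i finishes; exact definitions vary across papers (forward transfer, in particular, is often measured against a from-scratch baseline rather than as a raw zero-shot score).

```python
import numpy as np

def crl_metrics(P: np.ndarray) -> dict:
    """Common CRL metrics from a performance matrix.

    P[i, j] = evaluation score on task j, measured right after training
    on task i has finished (tasks are trained sequentially).
    """
    T = P.shape[0]
    final = P[-1]                       # scores on every task at the end of the sequence

    # Average performance: mean final score over all tasks.
    avg_performance = final.mean()

    # Forgetting: how far each earlier task's final score dropped from the
    # best score it ever reached during the sequence.
    best_so_far = P.max(axis=0)
    forgetting = (best_so_far[:-1] - final[:-1]).mean()

    # Backward transfer: effect of later training on earlier tasks
    # (positive values mean later tasks improved earlier ones).
    backward = np.mean([final[j] - P[j, j] for j in range(T - 1)])

    # Forward transfer (one simple variant): zero-shot score on task j
    # just before task j itself is trained.
    forward = np.mean([P[j - 1, j] for j in range(1, T)])

    return {"avg_performance": avg_performance, "forgetting": forgetting,
            "backward_transfer": backward, "forward_transfer": forward}

# Example with 3 tasks: row i = measured after task i, column j = evaluated task j.
P = np.array([[0.9, 0.1, 0.0],
              [0.7, 0.8, 0.2],
              [0.6, 0.7, 0.9]])
print(crl_metrics(P))
```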
D. Tasks
Navigation Tasks: Using a discrete action set in a 2D state space, where the agent explores an unknown environment to reach a target.
Control Tasks: Involving a continuous (e.g., 3D) state space and a continuous action set, where the agent issues control commands to reach specified target states.
Video Games: The state space is typically raw images and the actions are discrete; the agent must perform complex control to achieve game objectives.
E. Benchmarks
CRL Benchmarks: Such as CRL Maze, Lifelong Hanabi, Continual World, etc. These benchmarks differ in terms of the number of tasks, task sequence length, and observation types.
F. Scenario Settings
CRL Scenario Classification:
Lifelong Adaptation: Agents are trained on a sequence of tasks and evaluated only on new tasks.
Non-Stationarity Learning: Tasks differ in reward functions or transition functions, and agents are evaluated on all tasks.
Task Incremental Learning: Tasks differ significantly in reward and transition functions, and agents are evaluated on all tasks.
Task-Agnostic Learning: Agents are trained without task labels or identities, requiring them to infer task changes.
II. CRL Classification
This section systematically reviews the main methods in the field of Continual Reinforcement Learning (CRL) and proposes a new classification system. CRL methods are divided into four main categories based on the type of knowledge stored and/or transferred: Policy-focused, Experience-focused, Dynamic-focused, and Reward-focused methods.
A. Policy-focused Methods
This is the dominant category of methods, emphasizing the storage and reuse of policy or value functions, divided into three sub-categories:
1) Policy Reuse
Retains and reuses complete policies from previous tasks.
Common practice: Using old policies to initialize new policies (e.g., MAXQINIT, ClonEx-SAC); a minimal sketch of this idea appears below.
Advanced methods: Using task composition (e.g., Boolean algebra) to achieve zero-shot generalization (e.g., SOPGOL).
Less scalable, but offers strong knowledge transfer.
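The sketch below illustrates the policy-reuse idea in its simplest form: keep a library of previously learned policies, probe each one briefly on the new task, and fine-tune a copy of the most promising one. This is a generic illustration, not the actual MAXQINIT or ClonEx-SAC procedure, and it assumes a Gymnasium-style environment interface with discrete actions.

```python
import copy
import torch
import torch.nn as nn

class Policy(nn.Module):
    """A small discrete-action policy network."""
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                 nn.Linear(64, n_actions))

    def forward(self, obs):
        return self.net(obs)            # action logits

def probe_return(policy, env, episodes: int = 3) -> float:
    """Average undiscounted return over a few greedy probe episodes."""
    total = 0.0
    for _ in range(episodes):
        obs, _ = env.reset()
        done = False
        while not done:
            with torch.no_grad():
                logits = policy(torch.as_tensor(obs, dtype=torch.float32))
            obs, reward, terminated, truncated, _ = env.step(int(logits.argmax()))
            total += reward
            done = terminated or truncated
    return total / episodes

def init_policy_for_new_task(library, new_env, obs_dim, n_actions):
    """Start a new task from the stored policy that probes best, or from scratch."""
    if not library:
        return Policy(obs_dim, n_actions)              # no prior knowledge: random init
    scores = [probe_return(p, new_env) for p in library]
    best = library[max(range(len(library)), key=lambda i: scores[i])]
    return copy.deepcopy(best)                         # fine-tune a copy; keep the original intact
```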
2) Policy Decomposition
Decomposes policies into shared and task-specific components.
Methods include:
Factorization (e.g., PG-ELLA, LPG-FTW)
Multi-head networks (e.g., OWL, DaCoRL); see the sketch below
Modular structures (e.g., SANE, CompoNet)
Hierarchical structures (e.g., H-DRLN, HLifeRL, MPHRL)
Advantages: Clear structure, strong scalability, suitable for complex tasks.
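The following sketch shows the multi-head flavour of policy decomposition: a shared trunk learns task-general features while a small head is added per task. It illustrates the general structure rather than the specific architectures of OWL, DaCoRL, or the other cited methods.

```python
import torch
import torch.nn as nn

class MultiHeadPolicy(nn.Module):
    """Shared trunk with one task-specific output head per task."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.heads = nn.ModuleList()        # grows by one head for each new task
        self.hidden, self.n_actions = hidden, n_actions

    def add_task(self) -> int:
        """Register a fresh head for a new task and return its task id."""
        self.heads.append(nn.Linear(self.hidden, self.n_actions))
        return len(self.heads) - 1

    def forward(self, obs: torch.Tensor, task_id: int) -> torch.Tensor:
        return self.heads[task_id](self.trunk(obs))    # logits for the requested task

policy = MultiHeadPolicy(obs_dim=8, n_actions=4)
task0 = policy.add_task()
logits = policy(torch.randn(1, 8), task_id=task0)
```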
3) Policy Merging
Merges multiple policies into a single model, saving storage resources.
Techniques include:
Distillation (e.g., P&C, DisCoRL)
Hypernetworks (e.g., HN-PPO)
Masking (e.g., MASKBLC)
Regularization (e.g., EWC, Online-EWC, TRAC); an EWC-style sketch appears below
Advantages: Memory efficient, suitable for resource-constrained scenarios.
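The sketch below illustrates the regularization route to policy merging, in the style of EWC: after a task is finished, store its parameters and a diagonal Fisher estimate, then penalize later updates that move important weights away from those values. Details such as how the Fisher is estimated and how penalties from multiple tasks are combined differ between EWC, Online-EWC, and other variants.

```python
import torch
import torch.nn as nn

def consolidate(model: nn.Module, loss: torch.Tensor):
    """After finishing a task: store parameter values and a squared-gradient
    (diagonal Fisher) estimate computed from one representative loss."""
    model.zero_grad()
    loss.backward()
    old_params = {n: p.detach().clone() for n, p in model.named_parameters()}
    fisher = {n: p.grad.detach() ** 2
              for n, p in model.named_parameters() if p.grad is not None}
    return old_params, fisher

def ewc_penalty(model: nn.Module, old_params: dict, fisher: dict) -> torch.Tensor:
    """Quadratic penalty that discourages moving important weights away from
    the values they had when the previous task was consolidated."""
    penalty = torch.zeros(())
    for name, p in model.named_parameters():
        if name in fisher:
            penalty = penalty + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return penalty

# During training on the next task (lambda_ewc is the regularization strength):
#   total_loss = task_loss + lambda_ewc * ewc_penalty(model, old_params, fisher)
```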
B. Experience-focused Methods
Emphasize the storage and reuse of historical experience, similar to experience replay mechanisms, divided into two types:
1) Direct Replay
Uses an experience buffer to store old task data (e.g., CLEAR, CoMPS, 3RL).
Advantages: Simple and effective, suitable for scenarios with clear task boundaries.
Disadvantages: High memory consumption, privacy risks.
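A minimal direct-replay sketch: a bounded buffer retains transitions from earlier tasks, and each training batch mixes old and current-task data. The 50/50 mixing ratio and the buffer layout are illustrative choices, not the settings used by CLEAR, CoMPS, or 3RL.

```python
import random
from collections import deque

class MixedReplay:
    """Keep old-task transitions in a bounded buffer and mix them into each batch."""

    def __init__(self, old_capacity: int = 100_000):
        self.old = deque(maxlen=old_capacity)   # transitions from previous tasks
        self.new = []                           # transitions from the current task

    def add(self, transition) -> None:
        self.new.append(transition)

    def end_task(self) -> None:
        """Fold the finished task's data into the long-term buffer."""
        self.old.extend(self.new)
        self.new.clear()

    def sample(self, batch_size: int) -> list:
        half = batch_size // 2
        batch = random.sample(self.new, min(half, len(self.new)))
        if self.old:                            # fill the rest with old-task data
            batch += random.sample(self.old, min(batch_size - len(batch), len(self.old)))
        return batch
```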
2) Generative Replay
Uses generative models (e.g., VAE, GAN) to synthesize old task experience (e.g., RePR, SLER, S-TRIGGER).
Advantages: Memory efficient, suitable for scenarios with fuzzy task boundaries or resource constraints.
Disadvantages: Generation quality affects performance.
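The following sketch shows only the batch-mixing step of generative replay: a small VAE (assumed to have been fitted on earlier tasks; its training loop is omitted) synthesizes pseudo-states that are blended with real data from the current task. Methods such as RePR replay full transitions and combine the generator with policy distillation.

```python
import torch
import torch.nn as nn

class StateVAE(nn.Module):
    """Tiny VAE over states; only the decoder is used for replay here."""

    def __init__(self, state_dim: int, latent_dim: int = 8):
        super().__init__()
        self.enc = nn.Linear(state_dim, 2 * latent_dim)   # mean and log-variance (used when fitting the VAE)
        self.dec = nn.Linear(latent_dim, state_dim)
        self.latent_dim = latent_dim

    def sample_states(self, n: int) -> torch.Tensor:
        """Draw latent codes from the prior and decode them into pseudo-states."""
        with torch.no_grad():
            z = torch.randn(n, self.latent_dim)
            return self.dec(z)

def mixed_batch(real_states: torch.Tensor, generator: StateVAE,
                replay_ratio: float = 0.5) -> torch.Tensor:
    """Blend real current-task states with generated old-task states."""
    n_fake = int(len(real_states) * replay_ratio)
    return torch.cat([real_states, generator.sample_states(n_fake)], dim=0)

generator = StateVAE(state_dim=4)          # in practice, fitted on data from earlier tasks
batch = mixed_batch(torch.randn(32, 4), generator)
```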
C. Dynamic-focused Methods
Adapt to non-stationary environments by modeling environmental dynamics (state transition functions), divided into two types:
1) Direct Modeling
Explicitly learns environmental transition functions (e.g., MOLe, LLIRL, HyperCRL).
Advantages: Suitable for tasks requiring long-term planning.
Disadvantages: Complex modeling, high computational overhead.
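The sketch below illustrates direct modeling in its simplest form: an MLP is trained by regression to predict the next state from the current state and action, and the learned model can then be rolled out for planning. The cited methods (e.g., MOLe, LLIRL, HyperCRL) add machinery for detecting dynamics changes and maintaining multiple models, which is omitted here.

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """MLP that predicts the next state from the current state and action."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, state_dim))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))   # predicted next state

def model_loss(model, states, actions, next_states):
    """One-step prediction error used to fit the dynamics model."""
    return nn.functional.mse_loss(model(states, actions), next_states)

model = DynamicsModel(state_dim=4, action_dim=2)
s, a, s_next = torch.randn(64, 4), torch.randn(64, 2), torch.randn(64, 4)
loss = model_loss(model, s, a, s_next)
loss.backward()
```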
2) Indirect Modeling
Uses latent variables or abstract representations to infer environmental changes (e.g., LILAC, 3RL, Continual-Dreamer).
Advantages: More flexible, suitable for environments with unclear task boundaries or dynamic changes.
Often combined with intrinsic reward mechanisms.
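The following sketch illustrates the indirect route: a recurrent encoder compresses recent transitions into a latent context vector and the policy conditions on it, so shifts in the environment surface as shifts in the context rather than in an explicit dynamics model. It mirrors the general idea behind latent-context methods such as LILAC or 3RL without reproducing their architectures.

```python
import torch
import torch.nn as nn

class ContextConditionedPolicy(nn.Module):
    """Policy conditioned on a latent context inferred from recent transitions."""

    def __init__(self, obs_dim: int, action_dim: int, ctx_dim: int = 16):
        super().__init__()
        # The encoder consumes (obs, action, reward) tuples from the recent history.
        self.encoder = nn.GRU(input_size=obs_dim + action_dim + 1,
                              hidden_size=ctx_dim, batch_first=True)
        self.policy = nn.Sequential(nn.Linear(obs_dim + ctx_dim, 64), nn.ReLU(),
                                    nn.Linear(64, action_dim))

    def forward(self, obs: torch.Tensor, history: torch.Tensor) -> torch.Tensor:
        # history: (batch, time, obs_dim + action_dim + 1) recent transitions
        _, h = self.encoder(history)
        ctx = h[-1]                                    # latent context per batch element
        return self.policy(torch.cat([obs, ctx], dim=-1))

policy = ContextConditionedPolicy(obs_dim=4, action_dim=2)
out = policy(torch.randn(8, 4), torch.randn(8, 10, 4 + 2 + 1))
```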
D. Reward-focused Methods
Promote knowledge transfer and exploration by modifying or reshaping reward functions. Common methods include:
Reward Shaping: E.g., SR-LLRL, temporal logic-based shaping methods.
Intrinsic Rewards: E.g., IML and Reactive Exploration, which drive exploration through curiosity (see the sketch after this list).
Inverse Reinforcement Learning (IRL): E.g., ELIRL, learning reward functions from expert demonstrations.
Large Model Assisted Reward Design: E.g., MT-Core, using large language models to generate task-relevant intrinsic rewards.
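Below is a sketch of two generic ideas from this category: potential-based reward shaping, which adds gamma * phi(s') - phi(s) to the environment reward and is known to preserve optimal policies, and a simple count-based intrinsic bonus that rewards rarely visited states. The cited methods use far richer signals (temporal logic specifications, learned curiosity models, LLM-generated rewards); this only illustrates the basic form.

```python
import math
from collections import defaultdict

def shaped_reward(r: float, s, s_next, phi, gamma: float = 0.99) -> float:
    """Potential-based reward shaping with potential function `phi`."""
    return r + gamma * phi(s_next) - phi(s)

class CountBonus:
    """Intrinsic reward proportional to 1/sqrt(visit count) of a (discretized) state."""

    def __init__(self, scale: float = 0.1):
        self.counts = defaultdict(int)
        self.scale = scale

    def __call__(self, state_key) -> float:
        self.counts[state_key] += 1
        return self.scale / math.sqrt(self.counts[state_key])

# Usage with a trivial potential and a tuple-encoded state:
#   bonus = CountBonus()
#   total_reward = shaped_reward(r, s, s_next, phi=lambda s: 0.0) + bonus(tuple(s))
```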