Some marginal notes on initial research
Reading List
While I found a lot of material to read, I will try to keep it as lean as possible to start off.
I also need to refresh and/or broaden my knowledge of RL in general (I think it's a good starting point to assume that I know little, or that what I know is probably wrong). At least for today that's not the point, but I will soon start to revisit basic RL topics in a more systematic way.
Here is what I plan on reading, although I might change it spontaneously:
- A Survey of Continual Reinforcement Learning (Pan et al.)
- A Definition of Continual Reinforcement Learning (Abel et al.)
- HackAtari (Delfosse et al.)
- OCAtari (Delfosse et al.)
- Adaptive Reinforcement Learning (Bram Grooten), PhD Thesis
A Survey of Continual Reinforcement Learning
The paper starts off with an intro to RL and the problem that DRL agents struggle to transfer knowledge efficiently across tasks and adapt to new environments. Typically they need to be re-trained from scratch. There is, however, a lot of research on enabling these agents to do better in such scenarios and avoid catastrophic forgetting.
This is the field of Continual RL (also called lifelong or incremental learning).
- Goal: Create systems that can adapt to new tasks while retaining knowledge from previous ones
- Central Challenge: achieving a balance between stability (of previously learned knowledge) and plasticity (the flexibility to adapt to a new task). Researchers therefore need to address catastrophic forgetting and enable knowledge transfer. One is not useful without the other.
- This is called the "stability-plasticity dilemma".
- Catastrophic Forgetting: learning a new task causes the model to overwrite and lose knowledge of previously learned tasks.
- Knowledge Transfer: leveraging accumulated knowledge from past tasks to improve learning efficiency (and performance) of new or previously seen tasks.
Background
In this section the authors provide a fairly deep refresher of RL, starting with "what's an MDP?", and building from there. Next up they do the same thing for Continual Learning (CL = paradigm in ML that focuses on incrementally updating models to adapt to new tasks while maintaining performance on previous tasks).
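As a quick refresher on the MDP machinery the survey builds on, here is a minimal value-iteration sketch. The 2-state, 2-action MDP (its transition probabilities and rewards) is entirely made up for illustration; only the Bellman backup itself is the standard algorithm:

```python
import numpy as np

# Hypothetical MDP: P[s, a, s'] = transition probabilities, R[s, a] = rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9  # discount factor

V = np.zeros(2)
for _ in range(1000):
    # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * sum_s' P(s,a,s') V(s')
    Q = R + gamma * (P @ V)
    V_new = Q.max(axis=1)          # greedy over actions
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
```

Because the backup is a gamma-contraction, the loop converges geometrically; the resulting `V` is the optimal state-value function for this toy MDP.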
Three categories of strategies to overcome the main issues in CL (forgetting & transfer):
- Replay-based methods: retain a subset of information relevant to previous tasks to help the model recall past knowledge while learning a new task
- Regularization-based methods: constrain model updates to prevent conflicts between parameters learned from old and new tasks
- Parameter isolation methods: segregate model parameters for different tasks, avoiding conflicts and adapting the model structure dynamically
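As a toy illustration of the replay-based idea (everything below is hypothetical, not from the survey): keep a small fixed-size buffer of transitions from earlier tasks and mix them into the batches used while learning the new task. Reservoir sampling keeps the buffer an approximately uniform sample of everything seen so far:

```python
import random

class ReplayBuffer:
    """Fixed-size buffer; reservoir sampling keeps a uniform sample of all transitions seen."""
    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.storage = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, transition):
        self.seen += 1
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            # Replace a random slot with probability capacity / seen.
            idx = self.rng.randrange(self.seen)
            if idx < self.capacity:
                self.storage[idx] = transition

    def sample(self, batch_size):
        return self.rng.sample(self.storage, min(batch_size, len(self.storage)))

# Pretend the agent sees "task A" transitions first, then "task B":
buf = ReplayBuffer(capacity=100)
for step in range(1000):
    buf.add(("task_A" if step < 500 else "task_B", step))
batch = buf.sample(32)  # mixed batch: old-task data counteracts forgetting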
Overview of Research in CRL
Definition of CRL: "extension of traditional RL to a dynamic, multi-task framework where agents continuously learn, adapt and retain knowledge across various tasks" -> CRL is designed for environments that change continuously (tasks often arriving sequentially over time).
Research faces challenges that can be described as a triangular balance across three key aspects: plasticity, stability, and scalability (the first two we already heard of; scalability is the ability of an agent to learn many tasks using limited resources: efficient use of memory and compute, and handling increasingly complex and diverse task distributions).
Metrics: the "tradition" in RL is to measure an agent's performance by recording its expected accumulated reward over an episode. These metrics are designed for single-task evaluation. For CL (especially when the number of tasks is clear and finite and task boundaries are well-defined): Average Performance, Forgetting, Transfer (Forward, Backward).
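These metrics are usually computed from a performance matrix `acc[i, j]` = performance on task j after finishing training on task i. The sketch below uses the commonly used definitions (the survey's exact formulas may differ in details, and the matrix values are made up):

```python
import numpy as np

# Hypothetical performance matrix for 3 sequential tasks:
# acc[i, j] = performance on task j after training on task i.
acc = np.array([[0.9, 0.1, 0.0],
                [0.7, 0.8, 0.2],
                [0.6, 0.7, 0.9]])
N = acc.shape[0]

avg_performance = acc[-1].mean()                       # final performance over all tasks
backward_transfer = np.mean([acc[-1, j] - acc[j, j]    # effect of later training on old tasks
                             for j in range(N - 1)])
forgetting = np.mean([acc[:-1, j].max() - acc[-1, j]   # best-ever minus final, per old task
                      for j in range(N - 1)])
```

Negative backward transfer and positive forgetting (as in this toy matrix) indicate that learning later tasks degraded earlier ones — the catastrophic-forgetting signature.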
Tasks: there are some "typical" tasks for CRL. The main ones are navigation, control and ... (drumroll) ... video games (especially Atari 2600, the most frequently used suite in DRL experiments).
The authors mention three papers where researchers evaluate CRL by combining different games into task sequences: 1 2 3. They all use completely different games as a sequence (not mods of the same game).
Benchmarks: the growth of CRL has been relatively slow because of the large amount of compute necessary for experiments. There are some benchmarks (CRL Maze, Lifelong Hanabi, Continual World, CORA, ...). The primary challenge lies in the design of task sequences -> various factors: task difficulty, task order, number of tasks. There is no ideal single benchmark (yet) and different benchmarks focus on different aspects of CRL. Building one that is comprehensive + standardized is an ongoing challenge.
Scenario Settings: CRL scenarios can be categorized into four main types based on the non-stationarity property and the availability of task identities:
- Lifelong Adaptation: agent trained on sequence of tasks and its performance is evaluated only on new tasks
- Non-Stationarity Learning: tasks in the sequence differ in their reward functions or transition functions (but share the same underlying logic)
- Task Incremental Learning: tasks in the sequence differ from one another in both reward function and transition function -> more distinct. Some even have different state and action spaces.
- Task Agnostic Learning: agent is trained on a sequence of tasks without full knowledge of task labels or identities
Method Review
In this section the authors present a taxonomy of CRL Methods by asking "what knowledge is stored and/or transferred". They come up with four categories:
- Policy-focused (Policy Re-use; Policy Decomposition, e.g. Progressive NNs or hierarchical decomposition; Policy Merging, e.g. Distillation, Regularization and EWC)
- Experience-focused (Direct Replay e.g. CLEAR, Generative Replay)
- Dynamics-focused (Related to Model-Based RL -> learn model of env's dynamics to predict future state/reward. Direct Modeling, Indirect Modeling)
- Reward-focused
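The regularization idea under "Policy Merging" can be made concrete with the EWC-style quadratic penalty: anchor each parameter to its old-task value, weighted by how important it was (its Fisher information). The parameter and Fisher values below are made up; only the penalty formula is the standard EWC regularizer:

```python
import numpy as np

def ewc_penalty(theta, theta_old, fisher, lam):
    """EWC regularizer: (lam / 2) * sum_i F_i * (theta_i - theta_old_i)^2.
    Parameters important for old tasks (large Fisher) are anchored more strongly."""
    return 0.5 * lam * np.sum(fisher * (theta - theta_old) ** 2)

theta_old = np.array([1.0, -2.0, 0.5])   # parameters after finishing the old task
fisher    = np.array([5.0,  0.1, 1.0])   # hypothetical Fisher diagonal (importance)
theta     = np.array([1.2, -1.0, 0.5])   # current parameters while learning the new task

loss_reg = ewc_penalty(theta, theta_old, fisher, lam=10.0)  # added to the new-task loss
```

Note how the large move on the unimportant second parameter costs little, while the small move on the important first parameter dominates the penalty — that asymmetry is the whole point.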
Later on they also write a bit more about other "emerging" research directions such as Task Detection.
Applications
- Robotics
- Game Playing (nice, they even mention HackAtari)
- ...
A Definition of Continual Reinforcement Learning
They start by describing RL as "finding a solution" and compare it to CRL as "endless adaptation". Further, they ask if this endless adaptation might be a better way to model the RL problem. They go on to state that the community lacks a clean and general definition of CRL.
- (Informal) CRL Problem: "An RL problem is an instance of CRL if the best agents never stop learning".
- They follow up with defining some notation and based on that some more definitions which (in my opinion) seem more complex than they need to be...
- The only real insight I got from this paper was that CRL is still not generally defined and "different views" on it persist.
OC-Atari, HackAtari & Co.
- Introduction of "shortcut learning" -> misalignments that are hard to identify. More detailed information on that in: Deep Reinforcement Learning Agents are not even close to Human Intelligence
- Lack of generality of deep RL agents
- Arcade Learning Environment (ALE) = most used RL benchmark (diverse set of challenges, does not require extensive computation and does not suffer from experimenter bias).
- OC-Atari: object-centric representations of Atari games (by reading RAM). Objects = fundamental building blocks that human reasoning relies on -> enables abstraction, generalization ...
- HackAtari: framework to introduce novelty into the Atari environments (basically modded versions of the original environments, created by altering values in RAM). Can be used to test agents, show misgeneralisation, and enables CRL.
- Example: LazyEnemy Pong, altering pattern and/or color of cars in Freeway.
HackAtari categorizes modifications into 4 paradigms:
- Visual Domain Adaptation (e.g. changing colors)
- Dynamics Adaptation (alter gameplay by introducing small perturbations)
- Curriculum Reinforcement Learning: task complexity is incrementally increased -> learn different skills in a structured manner
- Reward Signal Adaptation: modify the reward function in order to test an agent's ability to adapt to new objectives
Related Work: CRL benchmarks -> need to expand Atari benchmarks. HackAtari provides new applications such as benchmarking for generalization (and I would also add forgetting here).
"We believe that comparison methods between different continual methods [Mundt et al., 2021] should be brought to RL" -> important!
Limitations: bound by CPU (slow). The next step is JAXAtari, which is basically a combination/upgrade of OC-Atari and HackAtari that leverages JAX for parallel computing (GPUs baby!) in order to speed up the environments and learning.
In "Deep Reinforcement Learning Agents are not even close to Human Intelligence" a user study compared the performance drop of humans playing a simplified version of a game to that of an RL agent (humans cope far better with the change).
CLEVA-Compass
The authors pick up on continual learning (and its problems). Since there is no agreed-upon formal definition "beyond the idea to continuously observe data [...]" things like reproducibility, interpretation of results and comparability are an even bigger problem with CL.
With this paper (which is a reproducibility work) they try to promote transparency and comparability of reported results for the CL case -> Continual Learning EValuation Assessment (CLEVA) Compass, a visual representation that presents an intuitive chart to identify a work's priorities and context in the broader literature landscape. It also enables a way to determine how methods differ in terms of reported metrics, where they resemble each other, and what elements would be missing for a fairer comparison.
No agreed upon formal definition of continual learning
Threat of "catastrophic inference" (forgetting)
Research often focuses on preventing forgetting (they name numerous methods). But in reality this is just one of CLs challenges.
Many, many related paradigms (multi-task, transfer, domain adaptation, few-shot, curriculum, ...) with different relationships to CL
In depth explanation of the different elements of the CLEVA Compass
"The [task] order has a significant impact on obtainable continual performance depending on the constructed curriculum".
- A Wholistic View of Continual Learning with Deep Neural Networks. This is interesting as it repeatedly emphasizes the relevance of task order.
- Curriculum Learning
Get in Touch
If you have any feedback you want to share with me feel free to reach out at mail@sebastianwette.de. I would be more than happy to chat about it.