Committing to Continual RL: My Thesis Topic, Defined
Benchmarking CRL under controlled distribution shift with JAXtari (working title)
Continual reinforcement learning (CRL) agents must learn a sequence of tasks without losing what they already know, a failure mode known as catastrophic forgetting. Evaluating how well they do this turns out to be hard to do rigorously. The dominant approach in Atari-based RL benchmarking trains agents on different games in sequence. The problem is that different games differ along every axis at once: visuals, mechanics, and reward structure and scale. As a result, forgetting may be easy to measure, but it is hard to tell which factor caused or influenced it most, and that is exactly what you would need to know to understand why methods fail.
This thesis takes a different approach. Instead of switching games, it uses intra-game modification sequences built on JAXtari (a JAX-native reimplementation of ALE / Atari 2600 environments with a modular modification API). By keeping the base game fixed and varying only one axis at a time (visuals, dynamics, or reward), the benchmark makes it possible to attribute forgetting to a specific kind of change.
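To make the single-axis constraint concrete, here is a minimal sketch of how such a sequence could be described as plain data. It deliberately does not use the JAXtari modification API. `Axis`, `Task`, `is_controlled`, the game name "pong", and the magnitude values are all hypothetical placeholders chosen for illustration, and magnitude is treated as an abstract number because how to define it is still an open question.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Axis(Enum):
    """The three modification axes named in the text."""
    VISUAL = "visual"
    DYNAMICS = "dynamics"
    REWARD = "reward"


@dataclass(frozen=True)
class Task:
    """One stage of a sequence (hypothetical structure, not a JAXtari type)."""
    game: str
    axis: Optional[Axis] = None  # None means the unmodified base game
    magnitude: float = 0.0       # placeholder scale; its definition is undecided


def is_controlled(sequence: List[Task]) -> bool:
    """True if the base game is fixed and at most one axis is ever modified."""
    games = {task.game for task in sequence}
    axes = {task.axis for task in sequence if task.axis is not None}
    return len(games) == 1 and len(axes) <= 1


# Base game followed by increasingly strong changes along one axis.
escalating = [
    Task("pong"),
    Task("pong", Axis.DYNAMICS, 0.25),
    Task("pong", Axis.DYNAMICS, 0.75),
]

# Mixing axes within one sequence breaks the controlled setup.
mixed = [
    Task("pong", Axis.VISUAL, 0.5),
    Task("pong", Axis.REWARD, 0.5),
]

print(is_controlled(escalating))
print(is_controlled(mixed))
```

A check like this could run before any training starts, so that a sequence that accidentally changes two axes never ends up in the results.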
The central question: does the type of environment change (and its magnitude) systematically affect how much agents forget and how much they transfer? A secondary question asks whether this interacts with observation modality, comparing pixel-based and object-centric agents on the same sequences under identical conditions.
The output is both an empirical study (of multiple CRL methods) and a reusable benchmark: a principled, controlled alternative to cross-game CRL evaluation that makes forgetting diagnosable rather than just observable.
Open Questions for me to answer
- What is "magnitude" and how can it be measured? Especially: how can magnitude be compared across different mod types?
- How should task distance be defined? Human-perceived vs. agent-perceived difficulty?
- Which game(s) to use initially? Decision based on what? (Speed, Complexity, Similarity to ALE?)
- What do the task sequences look like?
  - base -> dynamic change -> evaluate on base
  - base -> slight change -> stronger change -> ... -> evaluate on base
  - ... ?
- What metrics to use?
- What does the evaluation protocol look like? (This also covers configuration choices such as the training budget.)
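As a starting point for the metrics question, here is a minimal sketch of two candidate measures borrowed from the supervised continual-learning literature: per-task forgetting (best earlier score minus final score) and forward transfer (score on a task before training on it, minus a reference score). This is one possible choice, not a decision. The function names, the `perf` matrix layout, the `reference` baseline, and all numbers are assumptions made up for illustration, not results.

```python
from typing import List, Sequence


def forgetting(perf: Sequence[Sequence[float]]) -> List[float]:
    """Per-task forgetting.

    perf[i][j] is the evaluation score on task j after training stage i.
    For each task j except the last, return the best score reached on j
    between stage j and the second-to-last stage, minus the final score.
    """
    last = len(perf) - 1
    result = []
    for j in range(last):
        best_earlier = max(perf[i][j] for i in range(j, last))
        result.append(best_earlier - perf[last][j])
    return result


def forward_transfer(perf: Sequence[Sequence[float]],
                     reference: Sequence[float]) -> List[float]:
    """Per-task forward transfer for tasks 1..T-1.

    Score on task j right before training on it (after stage j - 1),
    minus a reference score for task j, e.g. from an untrained agent.
    """
    return [perf[j - 1][j] - reference[j] for j in range(1, len(perf))]


# Made-up scores for a three-stage sequence (rows: stages, columns: tasks).
perf = [
    [10.0, 2.0, 1.0],
    [6.0, 9.0, 3.0],
    [5.0, 7.0, 8.0],
]
reference = [1.0, 1.0, 1.0]  # hypothetical untrained-agent scores

print(forgetting(perf))                   # [5.0, 2.0]
print(forward_transfer(perf, reference))  # [1.0, 2.0]
```

Raw score differences like these are only comparable across tasks if the reward scale stays the same, which reward modifications would break, so some form of normalization would likely be needed on top.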
Get in Touch
If you have any feedback you want to share with me, feel free to reach out at mail@sebastianwette.de. I would be more than happy to chat about it.