Sanity check experiments
Before getting all excited and diving deeper into my idea of a forgetting benchmark, I want to run some sanity checks (plus my supervisors think it's a good idea). These initial experiments should give me some orientation and traction and, hopefully, point me in a clearer direction. I sat down and wrote a short abstract of what I plan to do.
An abstract (kind of)
We conduct an initial exploratory study on controlled non-stationarity in reinforcement learning using a parameterized modification of Pong. Specifically, we vary the ball speed to create a sequence of environments with gradually increasing dynamics changes and train an RL agent sequentially on these tasks. By measuring performance and retention across tasks, we analyze how the magnitude of environment change influences forgetting and transfer. This experiment serves as an orientation step toward systematically studying task similarity and shift magnitude in continual reinforcement learning.
Underlying problem
Continual reinforcement learning research typically evaluates agents on sequences of discrete tasks, but the magnitude of change between tasks is rarely quantified or controlled. As a result, it is difficult to understand how different levels of environment shift affect forgetting, transfer, and adaptation.
RQ -> How does the magnitude of controlled environment change influence retention and transfer in continual reinforcement learning?
Research gap
While theoretical work discusses non-stationary MDPs and task similarity, empirical CRL benchmarks rarely provide calibrated, parameterized task changes that allow systematic study of how shift magnitude affects learning dynamics.
A bit more detail
Basically, I want to build a first prototype of a training pipeline (using PQN) that trains an agent incrementally on different modified versions of some base environment. Additionally, I want to collect my first metrics to get a feel for what is actually happening.
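The control flow of that pipeline can be sketched roughly as follows. This is only a sketch of the sequential loop: `train`, `evaluate`, and the task list are hypothetical stand-ins, not the real JAXAtari/PQN API.

```python
def run_sequence(tasks, train, evaluate):
    """Train one agent through `tasks` in order; after each task,
    evaluate on all tasks seen so far. Returns the lower-triangular
    score matrix as a list of rows."""
    agent = None
    scores = []
    for i, task in enumerate(tasks):
        agent = train(agent, task)  # fine-tune the same agent on task i
        scores.append([evaluate(agent, t) for t in tasks[: i + 1]])
    return scores

# stub train/evaluate just to exercise the control flow:
# the "agent" is simply the list of tasks it has seen
train = lambda agent, task: (agent or []) + [task]
evaluate = lambda agent, task: float(task in agent)
print(run_sequence(["T0", "T1", "T2"], train, evaluate))
```

The real version would swap the stubs for PQN training and JAXAtari evaluation rollouts, but the triangular evaluation structure stays the same.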
Framework
The framework of choice here is JAXAtari. That is for a few simple reasons:
- It is fast. Period. (Since I am lucky enough to have access to powerful compute clusters with shiny GPUs, I can leverage parallel environments).
- It offers a neat interface to create and integrate mods.
- With it you can train agents on raw pixels using the renderer, or access the environment's object-centric representations.
Setup
- Train on task T_0 = base Breakout (I switched to Breakout as an example for now)
- Sequentially fine-tune on tasks T_1, ..., T_n, where T_k has a paddle width reduced by k pixels
- After finishing tuning on each T_i, evaluate on all previous tasks. This will result in a lower-triangular score matrix S_{i,j}, which denotes performance on task T_j after training up to task T_i.
After doing this we can calculate the retention as R_{i,j} = S_{i,j} / S_{j,j} and plot the retained performance on older tasks against the task distance d = i - j.
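As a small sketch of that evaluation step: given a lower-triangular score matrix S, where S[i][j] is the score on task j after training through task i, retention normalizes each entry by the score right after training on task j, and can then be averaged by task distance. The numbers below are made up.

```python
import numpy as np

def retention_matrix(S):
    """R[i, j] = S[i, j] / S[j, j]: performance on task j after
    training up to task i, relative to performance right after
    training on task j. Only defined for i >= j."""
    n = S.shape[0]
    R = np.full_like(S, np.nan, dtype=float)
    for i in range(n):
        for j in range(i + 1):
            R[i, j] = S[i, j] / S[j, j]
    return R

def retention_by_distance(R):
    """Average retention grouped by task distance d = i - j."""
    n = R.shape[0]
    return {d: float(np.nanmean([R[j + d, j] for j in range(n - d)]))
            for d in range(n)}

# toy example: 3 tasks, scores on old tasks drop as training continues
S = np.array([[10., 0., 0.],
              [ 8., 9., 0.],
              [ 6., 7., 8.]])
print(retention_by_distance(retention_matrix(S)))  # distance 0 is always 1.0
```

Plotting retention against d is then a one-liner over this dictionary.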
I just saw that there might be another way (that could be more expressive, I think): train the agent on T_0 and then, instead of tuning one agent sequentially on all tasks, tune one agent on T_1, one on T_2, and so on... Could this be more useful for seeing how task distance influences forgetting?
But now the question is whether I can just assume the "hardness gap" between the tasks is proportional to this distance k. Intuitively I would assume there is a relationship between the two... but I am not sure it is linear (is a paddle 2 px smaller linearly harder than one 1 px smaller?). That's where I might need some "difficulty heuristic". I like "number of env steps to reach a threshold score" best so far...
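That heuristic is easy to compute from periodic evaluation logs. A minimal sketch, assuming each task produces a list of (env_steps, eval_score) pairs; the curves and threshold below are made-up numbers, not measurements:

```python
def steps_to_threshold(eval_curve, threshold):
    """Return the first env-step count at which the evaluation score
    reaches `threshold`, or None if it never does.
    `eval_curve` is a list of (env_steps, score) pairs from periodic evals."""
    for env_steps, score in eval_curve:
        if score >= threshold:
            return env_steps
    return None

# hypothetical learning curves for two paddle widths
curve_base  = [(100_000, 5.0), (200_000, 12.0), (300_000, 20.0)]
curve_small = [(100_000, 2.0), (200_000, 7.0), (300_000, 14.0), (400_000, 21.0)]

print(steps_to_threshold(curve_base, 15.0))   # 300000
print(steps_to_threshold(curve_small, 15.0))  # 400000
```

Comparing these step counts across tasks would give a difficulty scale that is independent of the paddle-pixel parameter itself.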
Another idea is to train a fresh agent on each task (without training on anything else) in order to distinguish forgetting of old tasks from inability to learn new tasks (and from positive transfer to new tasks).
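With those fresh-agent baselines in hand, one simple diagnostic is the gap between the sequential agent's score right after finishing a task and the fresh agent's score on that same task. A sketch, with made-up scores (positive values would suggest positive transfer, negative values interference):

```python
import numpy as np

def forward_transfer(S, fresh):
    """Score of the sequential agent right after finishing each task
    (diagonal of the score matrix S) minus the score of a fresh agent
    trained only on that task with the same budget."""
    return np.diag(S) - np.asarray(fresh)

# toy numbers: 3 tasks
S = np.array([[10., 0., 0.],
              [ 8., 9., 0.],
              [ 6., 7., 8.]])
fresh = [10., 7., 9.]  # hypothetical fresh-agent scores per task
print(forward_transfer(S, fresh))
```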
Some design "choices"
- Keep training budget fixed for each task (kind of obvious).
- Use multiple seeds
- Store checkpoints for every task
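These choices could live in one small config object so every run is reproducible from a single place. All field names and values here are hypothetical placeholders:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskSequenceConfig:
    # fixed per-task budget keeps results comparable across tasks
    env_steps_per_task: int = 5_000_000
    seeds: tuple = (0, 1, 2, 3, 4)
    checkpoint_dir: str = "checkpoints"  # one checkpoint per (seed, task)
    paddle_shrink_px: tuple = (0, 1, 2, 3, 4)  # task k shrinks paddle by k px

cfg = TaskSequenceConfig()
print(len(cfg.seeds) * len(cfg.paddle_shrink_px))  # total training runs: 25
```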
Other mods I could work with
For Breakout one could also play around with:
- Ball speed (horizontal and/or vertical)
- Observation degradation (noise, flicker, occlusion)
- Ball visibility
- ...
For Pong it's rather similar. In general I feel like splitting the mods into three separate categories:
- Control Mods
- Dynamics Mods
- Observation Mods
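One way to keep this taxonomy explicit in code is a simple mapping from category to mod names; the identifiers below are illustrative labels for the mods mentioned above, not real JAXAtari mod names:

```python
# hypothetical grouping of the Breakout/Pong mods discussed above
MOD_CATEGORIES = {
    "control":     ["paddle_width"],
    "dynamics":    ["ball_speed_horizontal", "ball_speed_vertical"],
    "observation": ["obs_noise", "obs_flicker", "obs_occlusion",
                    "ball_visibility"],
}

def category_of(mod):
    """Look up which category a mod belongs to (None if unknown)."""
    for category, mods in MOD_CATEGORIES.items():
        if mod in mods:
            return category
    return None

print(category_of("ball_speed_horizontal"))  # dynamics
```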
Where do I stand right now?
- Already created the mods (and tested them)
- Trained an object-centric agent on base Breakout and evaluated it on a small paddle (to see if performance drops with a smaller paddle). Turns out it does. But when I train on the small paddle and evaluate on the base environment, performance stays about the same. This makes sense in a way when working with OC representations (OC -> better generalization?).
What to do next?
- Retry training with a pixel-based agent.
- Afterwards start writing/modifying the training script.
Get in Touch
If you have any feedback you want to share with me feel free to reach out at mail@sebastianwette.de. I would be more than happy to chat about it.