Sebi Wette

Sanity check experiments

Before getting all excited and diving deeper into my idea of a forgetting benchmark, I want to run some sanity checks (plus my supervisors think it's a good idea). These initial experiments should give me some orientation and traction and, hopefully, point me in a clearer direction. I sat down and wrote a short abstract on what I plan to do.

An abstract (kind of)

We conduct an initial exploratory study on controlled non-stationarity in reinforcement learning using a parameterized modification of Pong. Specifically, we vary the ball speed to create a sequence of environments with gradually increasing dynamics changes and train an RL agent sequentially on these tasks. By measuring performance and retention across tasks, we analyze how the magnitude of environment change influences forgetting and transfer. This experiment serves as an orientation step toward systematically studying task similarity and shift magnitude in continual reinforcement learning.

Underlying problem Continual reinforcement learning research typically evaluates agents on sequences of discrete tasks, but the magnitude of change between tasks is rarely quantified or controlled. As a result, it is difficult to understand how different levels of environment shift affect forgetting, transfer, and adaptation.

RQ -> How does the magnitude of controlled environment change influence retention and transfer in continual reinforcement learning?

Research gap While theoretical work discusses non-stationary MDPs and task similarity, empirical CRL benchmarks rarely provide calibrated, parameterized task changes that allow systematic study of how shift magnitude affects learning dynamics.


A bit more detail

Basically I want to build a first prototype of a (PQN) training pipeline that trains incrementally on different modified versions of some base environment. Additionally, I want to collect my first metrics to get a feel for what is actually happening.
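To make the plan concrete, here is a minimal sketch of that incremental loop. The `train`, `evaluate`, and `make_env` helpers are hypothetical placeholders, not the real JAXAtari or PQN API; the point is only the shape of the loop and of the evaluation matrix it produces.

```python
# Sketch of the incremental training pipeline. `train`, `evaluate`, and
# `make_env` are hypothetical stand-ins for the real JAXAtari/PQN calls.
def run_sequence(ball_speeds, train, evaluate, make_env):
    """Train one agent sequentially on tasks T_0..T_n and record the full
    evaluation matrix R[i][j] = score on task j after finishing task i."""
    agent = None
    R = []
    for speed in ball_speeds:
        agent = train(agent, make_env(speed))          # continue training on the next task
        R.append([evaluate(agent, make_env(s))          # evaluate on every task in the sequence
                  for s in ball_speeds])
    return R
```

The returned matrix `R` is exactly what the retention and transfer metrics below would be computed from.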

Framework

The framework of choice here is JAXAtari. That is for a few simple reasons:

Setup

After doing this we can calculate the retention as Retention(i, j) = R(i, j) / R(j, j), where R(i, j) is the evaluation score on task T_j after finishing training on task T_i, and plot the retained performance on older tasks against the task distance d = i − j.
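This is a one-liner with NumPy once the score matrix exists. A small sketch (the averaging over equal distances is my own addition, to get one retention value per d):

```python
import numpy as np

def retention(R):
    """Retention[i, j] = R[i, j] / R[j, j]: performance on task j after
    training on task i, normalized by the score right after learning j."""
    R = np.asarray(R, dtype=float)
    return R / np.diag(R)[np.newaxis, :]   # divide column j by R[j, j]

def retention_vs_distance(R):
    """Average retention as a function of task distance d = i - j > 0."""
    ret = retention(R)
    n = ret.shape[0]
    return {d: float(np.mean([ret[j + d, j] for j in range(n - d)]))
            for d in range(1, n)}
```

Plotting `retention_vs_distance(R)` directly gives the "retention against task distance" curve described above.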

I just saw that there might be another way (that could be more expressive, I think): instead of tuning one agent sequentially on all tasks, train the agent on T0 and then fine-tune a separate copy on each task, one on T1, one on T2, and so on... Could this be more useful for seeing how task distance influences forgetting?

But now the question is whether I can just assume the "hardness gap" between the tasks is equal to this distance d. Intuitively I would assume there is a relationship between the two... but I am not sure it is linear (is a 2px smaller paddle twice as hard as a 1px smaller one?). That's where I might need some "difficulty heuristic". I like "number of env steps to reach a threshold score" the best so far...
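That heuristic is cheap to implement on top of whatever evaluation logging the pipeline already does. A sketch, where `eval_score` is a hypothetical callback returning the rolling evaluation score at a given training step:

```python
# Possible "difficulty heuristic": number of environment steps a fresh
# agent needs before its evaluation score first crosses a threshold.
def steps_to_threshold(eval_score, threshold, eval_every, max_steps):
    """Return the first step at which eval_score(step) >= threshold,
    or None if the threshold is never reached within max_steps."""
    for step in range(eval_every, max_steps + 1, eval_every):
        if eval_score(step) >= threshold:
            return step
    return None
```

Comparing this number across tasks would show directly whether the "hardness gap" grows linearly with d or not.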

Another idea is to train a fresh agent on each task Tk (without training on anything else) in order to distinguish forgetting of old tasks from an inability to learn new tasks (and also from positive transfer to new tasks).
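With those fresh-agent baselines, the decomposition can be sketched as below. The naming (`forgetting`, `transfer`) is my own shorthand for this post, not taken from a specific paper:

```python
# Decompose the sequential agent's scores using fresh-agent baselines:
#   forgetting[j] = R[j][j] - R[n][j]   (drop on old task j after the full sequence)
#   transfer[i]   = R[i][i] - fresh[i]  (gain/loss on task i vs. training from scratch;
#                                        positive = positive transfer, negative =
#                                        the sequential agent learns the task worse)
def forgetting_and_transfer(R, fresh):
    n = len(R) - 1
    forgetting = [R[j][j] - R[n][j] for j in range(n)]
    transfer = [R[i][i] - fresh[i] for i in range(len(R))]
    return forgetting, transfer
```

A task with high `forgetting` but near-zero `transfer` was learned fine and then lost; a negative `transfer` would instead point at an inability to learn the new task at all.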

Some design "choices"

Other mods I could work with

For Breakout one could also play around with:

For Pong it's rather similar. In general I feel like splitting the mods into three separate categories:

  1. Control Mods
  2. Dynamics Mods
  3. Observation Mods
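One lightweight way to organize this split is a small registry mapping each category to named, parameterized modifications. The concrete entries below are illustrative only (ball speed comes up earlier in the post; the others are placeholder guesses):

```python
# Illustrative mod registry; entry names and values are placeholders,
# not actual JAXAtari parameters.
MODS = {
    "control":     {"paddle_speed": [1.0, 1.5, 2.0]},
    "dynamics":    {"ball_speed":   [1.0, 1.25, 1.5, 2.0]},
    "observation": {"color_shift":  [0, 1, 2]},
}

def task_sequence(category, name):
    """Yield per-task settings for a gradually shifting task sequence."""
    for value in MODS[category][name]:
        yield {name: value}
```

Keeping the mods in one registry like this would make it easy to sweep a single axis of change while holding the other two categories fixed.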

Where do I stand right now?

What to do next?

Get in Touch

If you have any feedback you want to share with me feel free to reach out at mail@sebastianwette.de. I would be more than happy to chat about it.


#thesis