Sebi Wette

Notes from my physical notebook: early ideas

When I started looking for a thesis topic, I took some notes in a physical notebook (which I will likely continue to do), and I thought I might just "digitize" them here. Or rather, write them down to give them some structure and maybe gain some insight along the way.

Idea 1

I'll cut the other fields I was considering and only focus on what I wrote down about RL. The first thing I asked myself was how continual learning works for humans and how we manage not to forget everything when we learn something new. I credited this to the size and the modular structure of the human brain (different parts dedicated to different sub-tasks). Keep in mind that this is just me throwing around wild claims. Next I noted that, to keep the problem manageable, I should probably focus on similar types of environments when working with e.g. Atari games (meaning not mixing games that differ in how we perceive them, like a 2D top-down game and a 2D side-scroller).

My first more concrete description of an idea to work on was to try and get inspiration from this "modular structure" of the brain and give an RL agent a skill library.

Idea 2

But then I shifted and began to look at what methods already exist to address catastrophic forgetting. I came across regularization methods, replay methods, architectural methods (some of which are, if you will, a kind of skill library) and many more. Many of these, I found, are inspired by biological brains, which, as I mentioned before, I find really interesting.

The next idea I noted down was a comparison of 3–4 "biologically inspired" algorithms. My concern here was the (lack of) novelty of such a thesis.

Idea 3

Now I drifted more and more towards the idea of creating a standardized benchmark protocol, or "reproducible CRL benchmark suite" as I called it in my notes. The main thought is to take an Atari environment (just one for now) and a set of modifications of it. Together these form a task sequence, so basically a queue of tasks on which the agent is trained/tuned. This process generates a matrix of scores from which we can then calculate retention, transfer, etc.
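To make the score matrix idea concrete, here is a minimal sketch of how retention (forgetting) and forward transfer could be computed from it. This assumes a convention where R[i][j] is the agent's score on task j after finishing training on task i; the metric definitions follow common continual-learning practice, not a fixed spec from my notes.

```python
# Minimal sketch: continual-learning metrics from a score matrix.
# Convention (an assumption): R[i][j] = score on task j after training
# on task i, so rows = training stage, columns = evaluation task.
import numpy as np

def forgetting(R: np.ndarray) -> float:
    """Average drop from a task's best-ever score to its final score."""
    final = R[-1, :-1]              # final scores on all but the last task
    best = R[:-1, :-1].max(axis=0)  # best score each task ever reached
    return float(np.mean(best - final))

def forward_transfer(R: np.ndarray, baseline: np.ndarray) -> float:
    """Average score on task j just before training on it, minus a
    from-scratch baseline: how much earlier tasks helped."""
    T = R.shape[0]
    pre = np.array([R[j - 1, j] for j in range(1, T)])
    return float(np.mean(pre - baseline[1:]))

# Toy 3-task example (made-up numbers).
R = np.array([
    [0.9, 0.1, 0.0],
    [0.7, 0.8, 0.2],
    [0.6, 0.6, 0.9],
])
print(forgetting(R))                     # (0.9-0.6 + 0.8-0.6)/2 = 0.25
print(forward_transfer(R, np.zeros(3)))  # (0.1 + 0.2)/2 = 0.15
```

The nice thing about this formulation is that any additional metric (backward transfer, average accuracy, etc.) is just another function over the same matrix.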

Could we plot a forgetting curve as a function of the difficulty of, or distance between, the base environment and its mods? For this we would need a distance measure between environments, which, after some thought, seems infeasible. We could also check, for example, whether a change in the observations causes more forgetting than a change in the environment's dynamics. And does the order of tasks matter (think curriculum learning)?

For the thesis I could then use such a benchmark to compare some methods.

More on task difficulty

CRL benchmarks often come with a task sequence, where each task is a slightly different MDP. A small modification of the same task yields a non-stationary learning scenario, while a completely different task (domain switching) would be considered task-incremental learning. To be honest, the line is pretty blurry here and there are no real "hard definitions".

When we talk about ordered lists of tasks, what is the metric? By which value are they ordered? This is an important question because different task orders can produce different results.

Of course there are cases where it is easy. If a type of modification scales difficulty along "a single axis", the order is clear.

An example: imagine Pong but the player's paddle is only 2/3 of its original size. Clearly this is harder to play than the original, right? Now what if the paddle is only 1/2 the original size? Even harder! We can write a parameterized mod that just shrinks the player paddle by X percent.

paddleSizeMod(X=10)

When we take 10 instances of this mod, from a 0% smaller paddle (the base environment) to a 90% smaller paddle (the last mod), we can say with confidence what the order of difficulty is.

But for all other types of mods that do crazy stuff to multiple elements of an environment at once, it's not as easy. For those mods, we need some other measure to compare them.

Can we find a nice heuristic to capture this? Could we study human behavior to find such a heuristic?

Get in Touch

If you have any feedback you want to share with me feel free to reach out at mail@sebastianwette.de. I would be more than happy to chat about it.


#thesis