TUD Lecture on RL #2

23 Apr, 2026

Lecture 2: Markov Decision Processes

An agent learns by interacting with its "environment". How can we formalize this?

Sequence of discrete time steps $t = 0, 1, 2, \dots$
At each step
- Receive representation of state $S_{t}$
- Execute action $A_{t}$
- Obtain reward $R_{t + 1}$ and reach a new state $S_{t + 1}$
Sequential interaction between agent and environment generates a trajectory: $S_{0}, A_{0}, R_{1}, S_{1}, A_{1}, R_{2}, S_{2}, \dots$
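The interaction loop can be sketched in a few lines of Python. The two-state environment, its 0.9 success probability, and the uniformly random agent are all made up for illustration; only the ordering of states, actions and rewards follows the notes.

```python
import random

# Hypothetical two-state environment, invented purely for illustration.
# States: 0 and 1. Actions: 0 ("stay") and 1 ("move").
def step(state, action, rng):
    """Return (reward, next_state) for one interaction step."""
    if action == 1:
        # Moving succeeds with probability 0.9 (assumed value).
        next_state = 1 - state if rng.random() < 0.9 else state
    else:
        next_state = state
    reward = 1.0 if next_state == 1 else 0.0
    return reward, next_state

def rollout(num_steps, seed=0):
    """Generate S_0, A_0, R_1, S_1, A_1, R_2, ... as a flat list."""
    rng = random.Random(seed)
    state = 0                                     # S_0
    trajectory = [state]
    for _ in range(num_steps):
        action = rng.choice([0, 1])               # A_t (uniform random agent)
        reward, state = step(state, action, rng)  # R_{t+1}, S_{t+1}
        trajectory += [action, reward, state]
    return trajectory

trajectory = rollout(num_steps=3)
print(trajectory)
```

The flat list has the same layout as the trajectory above: one initial state followed by an (action, reward, state) triple per step.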
A Markov Decision Process (MDP) is a tuple $ℳ = (𝒮, 𝒜, R, P, ι, γ)$ where:
- $𝒮$ is the set containing all states
- $𝒜$ is the set containing all actions
- $R$ is the reward function
- $P$ is the transition function
- $ι$ is the probability dist. over initial states
- $γ \in (0, 1)$ is the discount factor
Types of MDPs
- Discrete (finite) MDP: discrete state and action space
- Continuous MDP: continuous state and discrete/continuous action space
- Deterministic MDP: det. transition function
- Stochastic MDP: stochastic transition function (probability)
The dynamics of an MDP are defined by a probability distribution: $𝒫(s', r ∣ s, a) ≜ \Pr\{S_{t} = s', R_{t} = r ∣ S_{t - 1} = s, A_{t - 1} = a\}$
- Dynamics only depend on the current state $s \in 𝒮$ and the executed action $a \in 𝒜$, not on the earlier history -> known as the Markov property
- All other info about the env. can be derived from this distribution (state transition probability function, expected reward for state-action pairs, ...)
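As a sketch of how those quantities follow from the dynamics: summing over rewards gives the state transition probability $p(s' ∣ s, a) = \sum_{r} 𝒫(s', r ∣ s, a)$, and weighting each reward by its probability gives the expected reward $r(s, a) = \sum_{s'} \sum_{r} r \, 𝒫(s', r ∣ s, a)$. The table `P` below and all its numbers are hypothetical.

```python
# Hypothetical dynamics table: P[(s, a)] lists (probability, next_state, reward).
# All numbers are made up for illustration.
P = {
    (0, 0): [(1.0, 0, 0.0)],
    (0, 1): [(0.9, 1, 1.0), (0.1, 0, 0.0)],
    (1, 0): [(1.0, 1, 1.0)],
    (1, 1): [(0.9, 0, 0.0), (0.1, 1, 1.0)],
}

def transition_prob(P, s, a, s_next):
    """p(s' | s, a): sum the joint dynamics over all rewards."""
    return sum(prob for prob, nxt, _ in P[(s, a)] if nxt == s_next)

def expected_reward(P, s, a):
    """r(s, a): expected reward, summing over next states and rewards."""
    return sum(prob * reward for prob, _, reward in P[(s, a)])

print(transition_prob(P, 0, 1, 1))
print(expected_reward(P, 0, 1))
```

Storing only the non-zero outcomes per state-action pair keeps the table small; a dense array indexed by $(s, a, s', r)$ would be an equally valid choice for small MDPs.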

Interaction between an agent and the env. is modeled through episodes
- Agent starts an episode in one of the possible initial states
- Episode ends when the agent has performed a max. number of steps ("horizon") OR the agent has reached an absorbing state (e.g. game over in an Atari game).
- Episodes are used to train agents and evaluate their behavior

The return is the sum of rewards collected over $T$ steps
- Usually computed over whole episode
- Using the discount factor one can calc. the "discounted return" (it is mathematically convenient to discount rewards + it avoids infinite returns in cyclic MDPs)
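A minimal sketch of the discounted return, assuming the common convention $J_{t} = \sum_{k = 0}^{T - t - 1} γ^{k} R_{t + k + 1}$ (the exact indexing is an assumption, chosen to match the trajectory notation above). The function name and the reward list are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Compute sum_k gamma^k * rewards[k], accumulating from the last reward backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

rewards = [0.0, 0.0, 1.0]   # R_{t+1}, R_{t+2}, R_{t+3}
print(discounted_return(rewards, gamma=0.9))
```

With a constant reward of 1 and $γ < 1$ the sum is a geometric series bounded by $1 / (1 - γ)$, which is why discounting keeps returns finite even when an episode never terminates.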
The goal of RL is to obtain behavior that max. the discounted return starting from any time step. Thus the reward function expresses desired/undesired behavior that should be reinforced/avoided (it is designed by a human expert)
- Reward functions where reward often changes are called dense
- Reward functions where reward is almost constant are called sparse (e.g. always 0 except for 1 upon reaching a goal state)
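To make the distinction concrete, here is a small sketch on a hypothetical 1-D line of cells with the goal at position 5. The sparse variant follows the example in the notes; the dense variant (negative distance to the goal) is one assumed design among many.

```python
GOAL = 5  # hypothetical goal position on a 1-D line of cells 0..5

def sparse_reward(position):
    """1 upon reaching the goal, 0 everywhere else."""
    return 1.0 if position == GOAL else 0.0

def dense_reward(position):
    """Negative distance to the goal: changes with every step (assumed design)."""
    return -float(abs(GOAL - position))

positions = [0, 1, 2, 3, 4, 5]
print([sparse_reward(p) for p in positions])
print([dense_reward(p) for p in positions])
```

Both functions prefer the same goal state, but only the dense one gives feedback before the goal is reached.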

Is a scalar reward an adequate notion of purpose? Sutton's reward hypothesis: all of what we mean by goals and purposes can be well thought of as the maximization of the cumulative sum of a received scalar signal (reward).
A goal should specify what we want to achieve and not how we want to achieve it. The same goal can be specified by infinitely many different reward functions.
The agent must be able to measure success explicitly and frequently

What do we mean by behavior of an agent? -> A policy is a proba. dist. that, given a state $s \in 𝒮$, computes the proba. of executing any action $a \in 𝒜$ from that state. Commonly denoted as $π$
RL aims at obtaining the policy $π$ that maximizes the return obtained from any state -> called "optimal policy", denoted as $π^{*}$
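For a finite MDP, a policy can be stored as a table of probabilities $π(a ∣ s)$. The sketch below assumes two states and two actions; the table `pi` and its numbers are hypothetical.

```python
import random

# Hypothetical tabular policy: pi[s][a] is the probability of action a in state s.
pi = {
    0: [0.2, 0.8],
    1: [1.0, 0.0],   # deterministic in state 1
}

def sample_action(pi, state, rng):
    """Draw an action index according to the distribution pi[state]."""
    actions = list(range(len(pi[state])))
    return rng.choices(actions, weights=pi[state], k=1)[0]

rng = random.Random(0)
samples = [sample_action(pi, 0, rng) for _ in range(1000)]
print(sum(samples) / len(samples))  # empirical frequency of action 1 in state 0
```

A deterministic policy is the special case in which each row puts probability 1 on a single action, as in state 1 here.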

The value function $V^{π} (s)$ of a state $s$ under a policy $π$ is the expected discounted return when starting in $s$ and following $π$ : $V^{π} (s) ≜ 𝔼_{π} [J_{t} ∣ S_{t} = s]$
The action value function $Q^{π} (s, a)$ of taking an action $a$ in state $s$ under a policy $π$ is the expected discounted return when starting in $s$, executing $a$, and following $π$ afterwards: $Q^{π} (s, a) ≜ 𝔼_{π} [J_{t} ∣ S_{t} = s, A_{t} = a]$
The value functions induced by an optimal policy are called optimal value functions (optimal value functions can be used to compute optimal policies)
- There is always a deterministic optimal policy for any MDP
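One way to see how optimal value functions yield an optimal policy: act greedily with respect to the optimal action values, $π^{*}(s) = \arg\max_{a} Q^{*}(s, a)$, which also gives $V^{*}(s) = \max_{a} Q^{*}(s, a)$. The sketch assumes a tabular `Q_star` whose numbers are made up.

```python
# Hypothetical optimal action values Q*[s][a]; the numbers are made up.
Q_star = {
    0: [1.0, 2.5],
    1: [3.0, 0.5],
}

def greedy_policy(Q):
    """Pick, in every state, an action with the highest action value."""
    return {s: max(range(len(values)), key=values.__getitem__)
            for s, values in Q.items()}

def state_values(Q):
    """V*(s) = max_a Q*(s, a)."""
    return {s: max(values) for s, values in Q.items()}

pi_star = greedy_policy(Q_star)
V_star = state_values(Q_star)
print(pi_star, V_star)
```

The resulting policy is deterministic: it maps each state to a single action.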

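The following equation is known as the Bellman (expectation) equation: $V^{π}(s) = \sum_{a \in 𝒜} π(a ∣ s) \sum_{s' \in 𝒮} \sum_{r} 𝒫(s', r ∣ s, a) [r + γ V^{π}(s')]$
Can be expressed concisely using matrices: $V^{π} = R^{π} + γ P^{π} V^{π}$, where $R^{π}$ is the vector of expected rewards and $P^{π}$ the state transition matrix under $π$
- Is a linear equation in $V^{π}$ and can be solved directly, e.g. with Gaussian elimination or an iterative linear solver (only feasible for small MDPs)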
- there are many iterative methods for large MDPs: Dynamic Programming, Monte-Carlo RL and Temporal Difference Learning
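A sketch of both routes for a tiny example, assuming the policy has already been folded into a state transition matrix $P^{π}$ and an expected reward vector $R^{π}$, so that the Bellman expectation equation reads $(I - γ P^{π}) V^{π} = R^{π}$. The two-state numbers and all function names are hypothetical. The iterative variant repeatedly applies the right-hand side of the equation; it is only meant to illustrate the idea.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix [A | b] on copies, so the inputs stay untouched.
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for row in range(col + 1, n):
            factor = M[row][col] / M[col][col]
            for k in range(col, n + 1):
                M[row][k] -= factor * M[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(M[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (M[row][n] - tail) / M[row][row]
    return x

def evaluate_direct(P_pi, R_pi, gamma):
    """Solve (I - gamma * P_pi) V = R_pi for V."""
    n = len(R_pi)
    A = [[(1.0 if i == j else 0.0) - gamma * P_pi[i][j] for j in range(n)]
         for i in range(n)]
    return solve_linear(A, R_pi)

def evaluate_iterative(P_pi, R_pi, gamma, sweeps=1000):
    """Repeatedly apply V <- R_pi + gamma * P_pi V, starting from zero."""
    n = len(R_pi)
    V = [0.0] * n
    for _ in range(sweeps):
        V = [R_pi[i] + gamma * sum(P_pi[i][j] * V[j] for j in range(n))
             for i in range(n)]
    return V

# Hypothetical two-state example: state 1 is a rewarding self-loop.
P_pi = [[0.5, 0.5],
        [0.0, 1.0]]
R_pi = [0.0, 1.0]
gamma = 0.9

V_direct = evaluate_direct(P_pi, R_pi, gamma)
V_iter = evaluate_iterative(P_pi, R_pi, gamma)
print(V_direct, V_iter)
```

Gaussian elimination costs on the order of $|𝒮|^{3}$ operations, which is why the direct route is limited to small state spaces.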
Bellman's principle of optimality
- An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.