Est. time to complete: 1 hour 30 mins


Parts of this tutorial have been adapted from Reinforcement Learning: An Introduction

In the previous tutorial, we saw how reinforcement learning algorithms learn a policy. The algorithm’s aim is to find the optimal policy: the policy whose actions maximise the sum of future rewards received.

In this tutorial, we start by better defining the goal of learning the optimal policy. We then introduce the key concept (value functions) and equation (Bellman Equation) that allow us to build our first reinforcement learning algorithm in Tutorial 3!

1. Return $G_t$

In Tutorial 1 we discussed, informally, the objective of reinforcement learning algorithms. We said that the goal of a reinforcement learning algorithm is to maximise the cumulative reward it receives in the long run.

We define this as the return, denoted $G_t$.

Simple Return Formula

The simplest way to express the return $G_t$ is as the sum of all future rewards you’ll receive, where $r_t$ is the reward at time $t$ and $T$ is the final timestep (if one exists; otherwise the sum continues to infinity, $\infty$).

$$
\begin{aligned}
G^{\text{Sum}}_t &= \sum_{i=1}^{T-t} r_{t+i} \\
&= r_{t+1} + r_{t+2} + r_{t+3} + \cdots + r_T
\end{aligned}
$$

(Note: we write $G^{\text{Sum}}_t$ above, since we define the general $G_t$ below.)
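
To make this concrete, here is a minimal sketch in Python (not part of the original tutorial) that computes $G^{\text{Sum}}_t$ from a list of rewards collected over an episode; the `simple_return` helper and the example reward values are illustrative assumptions.

```python
def simple_return(rewards: list[float], t: int) -> float:
    """Simple return G_t^Sum: the sum of rewards received after timestep t.

    Assumes rewards[i] holds r_{i+1}, i.e. rewards = [r_1, r_2, ..., r_T],
    so G_t = r_{t+1} + r_{t+2} + ... + r_T = sum(rewards[t:]).
    """
    return sum(rewards[t:])


# Example: a 4-step episode with rewards r_1, r_2, r_3, r_4
episode_rewards = [0.0, 0.0, 1.0, 5.0]

print(simple_return(episode_rewards, t=0))  # G_0 = 0 + 0 + 1 + 5 = 6.0
print(simple_return(episode_rewards, t=2))  # G_2 = 1 + 5 = 6.0
```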