Est. time to complete: 1 hour 45 mins


1. Learning from Experience

When we are faced with a task, we rarely have a full model of how our actions will affect the state of the environment or what will get us rewards.

If we had a crystal ball and could see the future, we could plan ahead perfectly, knowing exactly how the environment would change as we take certain actions in certain states. However, we usually don’t have this!

For example, if we're designing an agent to control an autonomous car, we can't perfectly predict future states of the environment. This is because we don't have a perfect model of how other cars and pedestrians will move or react to our movements.

How the state evolves in response to our actions is defined by the state transition function. This can be stochastic (the next state is drawn from a probability distribution) or deterministic (the next state follows with certainty from the current state and action). In the autonomous driving example, this is simply how the state evolves over time - how other road users move in response to us. For this example, it’s hard to imagine what this function would look like. The state transition function for chess is much simpler - the board changes deterministically from how it was before your move to how it looks after your move.
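To make this concrete, here is a minimal sketch contrasting a deterministic transition function with a stochastic one (the function names and toy dynamics are invented for illustration, not from any particular library):

```python
import random

# Deterministic transition: the next state follows with certainty from the
# current state and action - like a chess board after a legal move.
def deterministic_transition(state: int, action: int) -> int:
    return state + action  # toy dynamics: the action simply shifts the state

# Stochastic transition: the next state is sampled from a probability
# distribution - like road users reacting unpredictably to our movement.
def stochastic_transition(state: int, action: int) -> int:
    intended = state + action
    if random.random() < 0.8:
        return intended                       # 80%: action works as intended
    return intended + random.choice([-1, 1])  # 20%: environment adds noise

print(deterministic_transition(0, 1))  # always prints 1
print(stochastic_transition(0, 1))     # usually 1, occasionally 0 or 2
```

In RL, we typically can’t call either function directly - we only observe the next state the environment hands us after we act.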

When we don’t know the state transition function, we can’t plan ahead perfectly and work out what the optimal action is in each situation, because we don’t know what the outcomes of our actions will be.

However, what we can do is interact with the environment and learn from our experience in it. By taking actions in the environment, and seeing their results (how the state changes and any rewards we get), we can improve our future action selection.
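A rough sketch of what this interaction loop looks like in code (the LineWorld environment and its dynamics are invented here for illustration - they follow a common reset/step pattern, but aren’t from the course or any library):

```python
import random

# A toy environment: states 0..4 on a line, reward 1 for reaching state 4.
# The agent never sees these dynamics directly - it only observes the
# outcomes (state, action, reward, next state) of its own interactions.
class LineWorld:
    def reset(self) -> int:
        self.state = 0
        return self.state

    def step(self, action: int):
        # action is -1 (left) or +1 (right); movement is slightly noisy
        move = action if random.random() < 0.9 else -action
        self.state = max(0, min(4, self.state + move))
        reward = 1.0 if self.state == 4 else 0.0
        done = self.state == 4
        return self.state, reward, done

env = LineWorld()
experience = []  # transitions the agent can later learn from
state = env.reset()
done = False
while not done:
    action = random.choice([-1, 1])  # act (here: randomly)
    next_state, reward, done = env.step(action)
    experience.append((state, action, reward, next_state))  # record the result
    state = next_state

print(f"Collected {len(experience)} transitions of experience")
```

The key point is that the experience is gathered purely by acting and observing - the agent never reads the transition function itself. Algorithms like the one introduced next learn from exactly this kind of data.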

Training & Testing in RL (compared with Machine Learning)

2. Temporal-Difference Learning

Temporal-Difference (TD) Learning is the first Reinforcement Learning algorithm we’re introducing.