"*I thought AlphaGo was based on probability calculation, and that it was merely a machine. But when I saw this move, I changed my mind. Surely, AlphaGo is creative. This move really was creative and beautiful."

*****—— Lee Sedol, 18-time Go world champion

<aside> 💡 I've written this assuming a smart reader who is comfortable with standard mathematical notation and is familiar with introductory probability. I mention neural networks and deep learning several times, but only as sidenotes; the core content should be accessible with no knowledge of deep learning.

</aside>

The ancient game of Go was widely considered to be the most challenging classic game for AI to play. In 2016, DeepMind's novel algorithm named AlphaGo beat 18-time Go world champion Lee Sedol 4-1. In 2017, DeepMind published AlphaGo Zero, an algorithm trained by only playing games against itself, that beat AlphaGo 100-0. How do these algorithms work, and what accounts for their success? One important part of their success is deep learning. However, this doesn't completely explain it. How do they take actions? How do they learn? The use of deep learning models in AlphaGo and AlphaGo Zero fits into a much broader and more general paradigm in AI called reinforcement learning (RL). Reinforcement learning is the *framework* AlphaGo and AlphaGo Zero use to learn and improve. In this post, I will explain some of the basic ideas behind RL, and give an overview of some different approaches in the field.

**Table of Contents**

Reinforcement learning is a paradigm where **agents** learn through interaction with their **environment.** Below is a visual representation of this framework. The agent first receives information about the current state of the environment, appropriately called the **state**. It then decides on and takes some **action**, which affects the environment. So, the agent then receives an updated state, along with a **reward**, and this loop continues until a **terminal state** is reached.

<aside>
🔑 I like defining terminology in context, as opposed to giving a long list of words at the beginning, so key terms that are central to the reinforcement learning vocabulary are given in **blue.**

</aside>

The standard reinforcement learning paradigm.

Here's an example we can use throughout this post to understand some of the concepts central to RL. Let's say you want to teach a robot to toss a ball back and forth with you — you want a robot partner to play catch with you.

- What might be the actions available to the agent? What is the environment? What are the states and rewards? Think of your own answers to these questions before looking at my suggestions.

Thinking about this agent-environment loop is interesting, but there's not very much we can do with this idea until we formalize it mathematically. Formalisms help us in several ways: they enable us to reason mathematically about objects and relationships, they make communication easier by providing a standardized set of tools and concepts, and they allow us to truthfully reason about complex systems, which makes problem-solving easier. In reinforcement learning, this agent-environment loop is formalized as a **Markov decision process (MDP).** MDPs are made of four parts:

- $\mathcal{S}$ is the
**state space,**with an individual state at time $t$ denoted as $s_t$ - $\mathcal{A}$ is the
**action space,**with an individual action at time $t$ denoted as $a_t$ - $P_a(s', s) = \Pr(s_{t+1} = s' \enspace|\enspace s_t = s, a_t = a)$ are the
**transition probabilities** - $r_t(s', s, a)$ is the
**reward**given to the agent for taking action $a$ at time $t$ when in state $s$, and ending up in state $s'$

<aside>
📖 That's what an MDP *is,* but why is it called that? Well, 'Markov' is a reference to Markov chains. The first sentence of the Markov chain Wikipedia page describes it better than I can, so *"A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event."* The idea is that the current state captures all relevant information about the probabilities of different future states.

</aside>

To understand this setup, some pseudo-code representing a general MDP may be helpful. If you aren't a programmer, feel free to skip the code and go straight to **Policy Optimization.** This is just a restatement of the MDP formalization, for people who learn better from code.

```
import environment
import agent
# create the environment, which returns an initial state and reward
state, reward = environment.initialize()
while True:
# the agent receives the current state, and chooses an action
action = agent.policy(state)
# the environment is updated, taking into account the agent's action
state, reward = environment.step(action)
if state == None: # the environment class returns `None` as the ending state
break
```

This formulation of the problem as an MDP is a deeply central part of reinforcement learning; almost everything you will encounter in RL is based on this!

Given this setup, a natural goal for the agent is to figure out which actions lead to the greatest reward. So, how might we formulate this mathematically? Well, our goal is to figure out the best actions to take. We could define a fixed list of actions to take, but we want our agent to be responsive to its environment. Remember that the agent will receive an observation for every time-step; we should use that information! So, we'll define a **policy function**, which takes in a state observation from the environment, and outputs an action. Formally, this is $\mu: S \rightarrow A$ . This means that given a state $s_t$, the action the agent takes at that time will be $a_t = \mu(s_t).$ We can now characterize the optimal policy $\mu^*$ as

$$ \mu^* = \argmax_\mu \sum_{t=0}^T \mathbb{E}[r_{t}(s_{t+1}, s_t, a_t) \enspace|\enspace \mu] $$

In words, the optimal policy $\mu^*$ is the policy that maximizes our expected cumulative reward. Hopefully this intuitively makes sense; we want our algorithm to maximize its reward, so the best policy is the one that does this across time. And we are trying to do this generally, not just for a single run, so we maximize the *expectation* of the cumulative reward. We're now most of the way towards setting up the reinforcement learning framework. I'll give a three important extensions to what we've discussed so far, and then you'll know the initial setup for a large section of reinforcement learning!