The standard reinforcement learning paradigm.
Here's an example we can use throughout this post to ground some of the concepts central to RL: suppose you want to teach a robot to play catch, tossing a ball back and forth with you.
Thinking about this agent-environment loop is interesting, but there's not much we can do with the idea until we formalize it mathematically. Formalisms help us in several ways: they let us reason mathematically about objects and relationships, they make communication easier by providing a standardized set of tools and concepts, and they let us reason reliably about complex systems, which makes problem-solving easier. In reinforcement learning, this agent-environment loop is formalized as a Markov decision process (MDP). MDPs are made of four parts:
- a set of states $S$ the environment can be in,
- a set of actions $A$ the agent can take,
- transition dynamics, which give the probability of ending up in state $s_{t+1}$ when the agent takes action $a_t$ in state $s_t$, and
- a reward function $r_t(s_{t+1}, s_t, a_t)$, which scores each transition.
<aside> 📖 That's what an MDP is, but why is it called that? Well, 'Markov' is a reference to Markov chains. The first sentence of the Markov chain Wikipedia page describes it better than I can: "A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event." The idea is that the current state captures all relevant information about the probabilities of different future states. (There's a tiny code sketch of this idea right after this aside.)
</aside>
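To make the Markov property concrete, here's a tiny, made-up weather Markov chain (the states and transition probabilities are invented purely for illustration). Notice that sampling the next state uses nothing but the current state, never the history of how we got there.

import random

# made-up transition probabilities: P(next state | current state)
transitions = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def next_state(current):
    # the distribution over next states depends only on `current`,
    # not on the earlier history -- that's the Markov property
    states = list(transitions[current].keys())
    probs = list(transitions[current].values())
    return random.choices(states, weights=probs)[0]

state = "sunny"
for _ in range(5):
    state = next_state(state)
    print(state)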
To understand this setup, some pseudo-code representing a general MDP may be helpful. If you aren't a programmer, feel free to skip the code and go straight to Policy Optimization. This is just a restatement of the MDP formalization, for people who learn better from code.
import environment
import agent

# create the environment, which returns an initial state and reward
state, reward = environment.initialize()

while True:
    # the agent receives the current state, and chooses an action
    action = agent.policy(state)
    # the environment is updated, taking into account the agent's action
    state, reward = environment.step(action)
    if state is None:  # the environment returns `None` as the ending state
        break
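The `environment` and `agent` modules above are just stand-ins. If you'd like something you can actually run, here's one minimal, hypothetical way to fill them in: a toy environment whose state is a counter that ends the episode after ten steps, and an agent that acts at random. `ToyEnvironment` and `RandomAgent` are invented for this sketch, not part of any real RL library.

import random

class ToyEnvironment:
    """A made-up environment: the state is a counter, and the episode
    ends after 10 steps. Reward is 1 when the action matches the
    parity of the counter, 0 otherwise."""

    def initialize(self):
        self.counter = 0
        return self.counter, 0.0  # initial state and reward

    def step(self, action):
        reward = 1.0 if action == self.counter % 2 else 0.0
        self.counter += 1
        if self.counter >= 10:
            return None, reward  # `None` signals the end of the episode
        return self.counter, reward

class RandomAgent:
    """An agent whose policy ignores the state and picks an action at random."""

    def policy(self, state):
        return random.choice([0, 1])

environment = ToyEnvironment()
agent = RandomAgent()

state, reward = environment.initialize()
total_reward = reward
while True:
    action = agent.policy(state)
    state, reward = environment.step(action)
    total_reward += reward
    if state is None:
        break

print("cumulative reward:", total_reward)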
This formulation of the problem as an MDP is central to reinforcement learning; almost everything you will encounter in RL is based on it!
Given this setup, a natural goal for the agent is to figure out which actions lead to the greatest reward. So, how might we formulate this mathematically? Well, our goal is to figure out the best actions to take. We could define a fixed list of actions, but we want our agent to be responsive to its environment. Remember that the agent receives an observation at every time-step; we should use that information! So, we'll define a policy function, which takes in a state observation from the environment and outputs an action. Formally, this is $\mu: S \rightarrow A$. This means that given a state $s_t$, the action the agent takes at that time will be $a_t = \mu(s_t)$. We can now characterize the optimal policy $\mu^*$ as
$$ \mu^* = \argmax_\mu \sum_{t=0}^T \mathbb{E}[r_{t}(s_{t+1}, s_t, a_t) \enspace|\enspace \mu] $$
In words, the optimal policy $\mu^*$ is the policy that maximizes our expected cumulative reward. Hopefully this makes intuitive sense: we want our algorithm to maximize its reward, so the best policy is the one that does this across time. And we are trying to do this in general, not just for a single run, so we maximize the expectation of the cumulative reward. We're now most of the way towards setting up the reinforcement learning framework. I'll give three important extensions to what we've discussed so far, and then you'll know the initial setup for a large portion of reinforcement learning!
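Before moving on to those extensions, here's a quick sketch of what "maximize expected cumulative reward" can look like in code. The environment dynamics, the two candidate policies, and the helpers `run_episode` and `expected_return` are all invented for illustration: we estimate each policy's expected return by averaging over many rollouts and keep the better one. Real policy optimization (coming up) does something much smarter than comparing a handful of hand-written candidates, but the objective it pursues is exactly this one.

import random

def run_episode(policy, horizon=10):
    """Roll out one episode of a toy, stochastic environment and
    return the cumulative reward sum over t of r_t."""
    state = 0
    total = 0.0
    for _ in range(horizon):
        action = policy(state)
        # made-up stochastic dynamics: the action nudges the state,
        # plus some random drift
        next_state = state + action + random.choice([-1, 0, 1])
        # made-up reward: staying close to zero is good
        reward = -abs(next_state)
        total += reward
        state = next_state
    return total

def expected_return(policy, episodes=1000):
    """Monte Carlo estimate of the expected cumulative reward under `policy`."""
    return sum(run_episode(policy) for _ in range(episodes)) / episodes

# two hand-written candidate policies (functions from state to action)
def do_nothing(state):
    return 0

def push_back(state):
    return -1 if state > 0 else (1 if state < 0 else 0)

candidates = {"do_nothing": do_nothing, "push_back": push_back}
best = max(candidates, key=lambda name: expected_return(candidates[name]))
print("better policy (by estimated expected return):", best)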