January 16, 2021

# 2. Background: Reinforcement Learning

## A. Preliminary

$$\pi=\Psi\left({s}\right)=\left\lbrace\, p\left({a_{i}|s}\right) \,\bigg\vert\, \forall a_{i} \in \Delta_{\pi} \wedge \sum_{i} p\left({a_{i}|s}\right)=1 \right\rbrace \tag{1}$$
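Eq. (1) says a stochastic policy is just a probability distribution over the action set. A minimal sketch (the helper name and softmax parameterization are illustrative, not from the source):

```python
import numpy as np

def make_policy(num_actions, rng=None):
    """Return p(a_i|s) as a vector satisfying the constraint in Eq. (1):
    every entry is non-negative and the entries sum to 1."""
    rng = rng or np.random.default_rng(0)
    logits = rng.normal(size=num_actions)
    return np.exp(logits) / np.exp(logits).sum()  # softmax normalization

pi = make_policy(4)
assert np.isclose(pi.sum(), 1.0)  # sum_i p(a_i|s) = 1
```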

## B. Bellman Equation

$$V_{\pi}\left({s}\right) = \sum_{a}\pi\left({s,a}\right)\sum_{s^{\prime}}p\left({s^{\prime}|s,a}\right)\left({\mathbb{W}_{s\rightarrow s^{\prime}|a} + \gamma V_{\pi}\left({s^{\prime}}\right)}\right) \tag{2}$$

$$Q_{\pi}\left({s,a}\right)=\sum_{s^{\prime}}p\left({s^{\prime}|s,a}\right)\left({\mathbb{W}_{s\rightarrow s^{\prime}|a} + \gamma \sum_{a^{\prime}} \pi\left({s^{\prime},a^{\prime}}\right) Q_{\pi}\left({s^{\prime},a^{\prime}}\right)}\right) \tag{3}$$

$\gamma$ is the discount factor, a constant in $[0, 1)$ that controls how strongly the current value estimate depends on future values, i.e., how "farsighted" the agent is.
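Because Eq. (2) expresses $V_\pi$ in terms of itself, it can be solved by fixed-point iteration (iterative policy evaluation). A sketch on a made-up 2-state, 2-action MDP, where `W[s, a, s']` plays the role of $\mathbb{W}_{s\rightarrow s'|a}$ and all numbers are illustrative:

```python
import numpy as np

num_s, num_a = 2, 2
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])  # P[s, a, s'] = p(s'|s, a)
W = np.ones((num_s, num_a, num_s))        # reward for s -> s' under a (toy: all 1)
pi = np.full((num_s, num_a), 0.5)         # pi[s, a] = pi(s, a), uniform policy
gamma = 0.9

V = np.zeros(num_s)
for _ in range(1000):
    # Eq. (2): V(s) = sum_a pi(s,a) sum_s' p(s'|s,a) (W + gamma V(s'))
    V = np.einsum('sa,sap,sap->s', pi, P, W + gamma * V)
```

With every transition paying reward 1, the fixed point is $V(s) = 1/(1-\gamma) = 10$ for both states, which the iteration converges to.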

## C. RL Methods

### 1) Monte-Carlo Method:

The Monte-Carlo (MC) method rests on two assumptions:

1. the number of episodes is large;
2. every state and action is visited many times.
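Under these assumptions, $V_\pi(s)$ can be estimated by averaging the returns observed after visiting $s$. A first-visit MC sketch on a made-up deterministic 2-state chain (state 0 moves to state 1 with reward 1, then the episode terminates with reward 0; with $\gamma = 0.5$ the true values are $V(0)=1$, $V(1)=0$):

```python
gamma = 0.5
returns = {0: [], 1: []}

def run_episode():
    # (state, reward-on-leaving-state) pairs for this fixed toy chain
    return [(0, 1.0), (1, 0.0)]

for _ in range(10_000):          # assumption 1: many episodes
    episode = run_episode()
    G = 0.0
    visited = set()
    for t in reversed(range(len(episode))):
        s, r = episode[t]
        G = r + gamma * G        # accumulate discounted return backwards
        if s not in visited:     # first-visit: record G once per episode
            visited.add(s)
            returns[s].append(G)

V = {s: sum(g) / len(g) for s, g in returns.items()}
# V[0] → 1.0, V[1] → 0.0
```

The chain here is deterministic, so assumption 2 (every state visited many times) holds trivially; in a real MDP it is what justifies replacing the expectation with the empirical average.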