Definitions:
$$ \pi_\theta (a|s) = P(A_t=a|S_t=s; \theta) $$
$$ \tau = (s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, \dots, s_T, a_T, r_{T+1}) \\ p_\theta(\tau) = p(s_0) \prod_{t=0}^{T} \pi_\theta(a_t|s_t)\, p(s_{t+1}|s_t, a_t) $$
$$ R(\tau) = \sum_{t=0}^T r_{t+1} $$
$$ J(\theta) = \mathbb E_{\tau \sim p_\theta} [R(\tau)] = \sum_{\tau} p_\theta (\tau) R(\tau) $$
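As a concrete illustration of these definitions, here is a minimal sketch that samples a trajectory $\tau$ from a tabular softmax policy and computes $R(\tau)$. The toy 2-state/2-action MDP, the reward $r = \mathbb{1}[s = a]$, and all names are assumptions for illustration only, not part of the setup above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, T = 2, 2, 5
theta = rng.normal(size=(n_states, n_actions))   # policy parameters

def pi(theta, s):
    # pi_theta(a|s): softmax over the logits theta[s]
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def sample_trajectory(theta):
    # Roll out tau = (s_0, a_0, r_1, s_1, a_1, r_2, ..., s_T, a_T, r_{T+1})
    s = rng.integers(n_states)                    # s_0 ~ p(s_0), uniform here
    states, actions, rewards = [], [], []
    for t in range(T + 1):
        a = rng.choice(n_actions, p=pi(theta, s))
        r = float(s == a)                         # toy reward r_{t+1}
        states.append(s); actions.append(a); rewards.append(r)
        s = rng.integers(n_states)                # toy dynamics p(s_{t+1}|s_t, a_t)
    return states, actions, rewards

states, actions, rewards = sample_trajectory(theta)
R_tau = sum(rewards)                              # R(tau) = sum_{t=0}^{T} r_{t+1}
```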
Theorem:
For any differentiable policy $\pi_\theta(a|s)$, the gradient of its objective function $J(\theta)$ with respect to the policy parameters $\theta$ can be written as:
$$ \begin{aligned} \nabla_\theta J(\theta) & = \mathbb E_{\tau \sim p_\theta}[\nabla_\theta \log p_\theta (\tau)\, R(\tau)] \\ & = \mathbb E_{\tau \sim p_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta (a_t|s_t)\, R(\tau)\right] \end{aligned} $$
Why do we need the second form? Because it contains no environment-dynamics term $p(s_{t+1}|s_t, a_t)$, it applies in the model-free setting.
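Here is a sketch of the Monte Carlo estimator implied by form two, reusing `pi`, `sample_trajectory`, and `theta` from the sketch above (all of which are illustrative assumptions). Note that the dynamics $p(s_{t+1}|s_t, a_t)$ never appear: only $\nabla_\theta \log \pi_\theta(a_t|s_t)$ is needed.

```python
def policy_gradient_estimate(theta, n_episodes=1000):
    # Monte Carlo estimate of E[ sum_t grad log pi_theta(a_t|s_t) * R(tau) ]
    grad = np.zeros_like(theta)
    for _ in range(n_episodes):
        states, actions, rewards = sample_trajectory(theta)
        R_tau = sum(rewards)
        for s, a in zip(states, actions):
            # For a softmax policy: grad_{theta[s]} log pi(a|s) = onehot(a) - pi(.|s)
            score = -pi(theta, s)
            score[a] += 1.0
            grad[s] += score * R_tau
    return grad / n_episodes
```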
Proof:
$$ \begin{aligned} \nabla_\theta J(\theta) &= \nabla_\theta \mathbb E_{\tau \sim p_\theta} [R(\tau)] \\ & = \nabla_\theta \sum_{\tau} p_\theta (\tau) R(\tau) \\ & = \sum_{\tau} \nabla_\theta p_\theta (\tau)\, R(\tau) \end{aligned} $$
Since we cannot sample directly from $\nabla_\theta p_\theta(\tau)$ (it is not a probability distribution), we use the log-derivative trick:
$$ \nabla_\theta p_\theta(x) = p_\theta(x) \nabla_\theta \log p_\theta(x) $$
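Before continuing the proof, a quick numerical sanity check of this identity (a sketch; the Gaussian example and all names are assumptions). For $x \sim \mathcal N(\mu, 1)$ and $f(x) = x^2$, $\mathbb E[f(x)] = \mu^2 + 1$, so the true gradient is $2\mu$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n = 1.5, 200_000
x = rng.normal(mu, 1.0, size=n)
score = x - mu                    # grad_mu log N(x; mu, 1) = (x - mu) / sigma^2
est = np.mean(x ** 2 * score)     # Monte Carlo estimate of E[f(x) grad log p(x)]
print(est, 2 * mu)                # estimate should be close to 2 * mu = 3.0
```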
Therefore, applying the trick to $p_\theta(\tau)$:
$$ \begin{aligned} \nabla_\theta J(\theta) & = \sum_{\tau} (\nabla_\theta p_\theta (\tau)) R(\tau) \\ & = \sum_{\tau} p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) R(\tau) \\ & = \mathbb E_{\tau \sim p_\theta}[\nabla_\theta \log p_\theta(\tau)\, R(\tau)] \end{aligned} $$
This is form one. Expanding $\log p_\theta(\tau)$ using the trajectory distribution defined above,
$$ \log p_\theta(\tau) = \log p(s_0) + \sum_{t=0}^{T} \left[ \log \pi_\theta(a_t|s_t) + \log p(s_{t+1}|s_t, a_t) \right] $$
The initial-state term $\log p(s_0)$ and the dynamics terms $\log p(s_{t+1}|s_t, a_t)$ do not depend on $\theta$, so their gradients vanish, leaving
$$ \nabla_\theta J(\theta) = \mathbb E_{\tau \sim p_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\, R(\tau) \right] $$
which is form two.
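Putting it together, a minimal REINFORCE-style ascent loop built on the estimator sketched earlier (step size and episode counts are arbitrary assumptions; no baseline or variance reduction is used):

```python
lr = 0.1
for step in range(50):
    theta += lr * policy_gradient_estimate(theta, n_episodes=200)

# In the toy MDP the optimal policy picks a = s, so the average return
# should approach T + 1 = 6 as training progresses.
avg_R = np.mean([sum(sample_trajectory(theta)[2]) for _ in range(500)])
print(f"average return after training: {avg_R:.2f}")
```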