Alright, let's keep going! After mastering Value-Based Learning methods, represented by Q-Learning, we're now embarking on a brand new chapter. We will shift from learning a "map" to learning a "compass," directly optimizing the agent's decision-making process.
Introduction to this Chapter:
Hello, brave explorer! Previously, our approach was to first learn a detailed "value map" (Q-table) and then choose paths based on the scores on that map. This method has been very effective for many problems, but it has also encountered bottlenecks 🚧: when an action is a continuous quantity (e.g., 37.5 degrees), we cannot calculate a Q-value for an infinite number of actions.
To solve these problems, we need a completely new way of thinking: instead of learning values, we directly learn the policy itself! This is the core of Policy-Based methods. In this chapter, we will explore its most fundamental and core idea – Policy Gradient.
The first step in policy-based learning is to represent our policy π with a set of parameters θ, i.e., a parameterized policy π(a|s, θ).
- π(a|s, θ): Typically a neural network 🧠.
- s: The current state, which is the network's input.
- a: An action the agent can take; the network outputs a probability for each action.
- θ: The weights and biases of the neural network.
Our goal is to adjust the parameters θ to find an optimal policy π* that maximizes the expected total return J(θ) under that policy.
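To make this concrete, here is a minimal sketch of such a policy network, assuming a discrete action space and PyTorch; the class name, layer sizes, and dimensions are illustrative choices, not something fixed by the text:

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """A tiny parameterized policy π(a|s, θ) for a discrete action space.
    θ corresponds to the weights and biases of the two linear layers."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Softmax turns raw scores into a probability for each action,
        # so the output is a valid distribution π(·|s, θ).
        return torch.softmax(self.net(state), dim=-1)

# Usage: query π(a|s, θ) for one state and sample an action from it.
policy = PolicyNetwork(state_dim=4, n_actions=2)
state = torch.rand(4)                              # a placeholder state s
probs = policy(state)                              # π(a|s, θ) for every action a
action = torch.distributions.Categorical(probs).sample()
```

Note that the network outputs probabilities rather than values: instead of asking "how good is each action?", we directly ask "with what probability should I take each action?"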
J(θ) = E_{τ~π_θ}[ G(τ) ]
- J(θ): The Performance Measure, i.e., the expected total return of policy π_θ.
- τ: A complete trajectory, (s_0, a_0, r_1), (s_1, a_1, r_2), ...
- E_{τ~π_θ}[...]: The expectation taken over the countless trajectories generated by following policy π_θ.