Alright, let's keep going! After mastering value-based learning methods, exemplified by Q-Learning, we're now embarking on a brand-new chapter. We will shift from learning a "map" to learning a "compass," directly optimizing the agent's decision-making process.



📖 Chapter 4: Paving a New Path – Policy Gradient (PG) and REINFORCE

Introduction to this Chapter:

Hello, brave explorer! Previously, our approach was to first learn a detailed "value map" (Q-table) and then choose paths based on the scores on that map. This method has been very effective for many problems, but it has also encountered bottlenecks 🚧:

  1. Continuous Action Spaces: When actions are continuous (e.g., a robot arm that needs to rotate by 37.5 degrees), we cannot compute and compare a Q-value for every one of infinitely many actions.
  2. Stochastic Policies: In some games (like "Rock-Paper-Scissors"), the optimal policy is itself stochastic. Value-based methods, which greedily pick the action with the highest Q-value, usually yield only deterministic policies.

To solve these problems, we need a completely new way of thinking: instead of learning values, we directly learn the policy itself! This is the core of Policy-Based methods. In this chapter, we will explore its most fundamental and core idea – Policy Gradient.


Section 1: The New Paradigm – Direct Policy Parameterization

The first step in policy-based learning is to represent our policy π directly with a set of parameters θ, giving a parameterized policy π(a|s, θ): the probability of taking action a in state s under parameters θ. In practice, θ is typically the set of weights of a neural network.
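To make this concrete, here is a minimal sketch of a parameterized policy for a discrete action space. It uses a small PyTorch network; the class name, layer sizes, and method names are illustrative assumptions, not something defined in this chapter:

```python
import torch
import torch.nn as nn

class SoftmaxPolicy(nn.Module):
    """π(a|s, θ): the network weights play the role of the parameters θ."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        # Map a state to a probability distribution over actions.
        logits = self.net(state)
        return torch.softmax(logits, dim=-1)

    def sample_action(self, state):
        # Sampling keeps the policy stochastic, something a greedy
        # argmax over Q-values cannot express.
        probs = self.forward(state)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        return action.item(), dist.log_prob(action)
```

Because the output is a probability distribution rather than a single best action, this representation handles stochastic policies naturally, and the same idea extends to continuous actions by outputting, say, the mean and standard deviation of a Gaussian.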

Our goal is to adjust the parameters θ to find an optimal policy π* that maximizes the expected total return J(θ) under that policy.

J(θ) = E_{τ~π_θ}[ G(τ) ]

where τ is a trajectory (a complete episode of states, actions, and rewards) generated by running the policy π_θ, and G(τ) is the total return accumulated along that trajectory.
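Since J(θ) is an expectation over trajectories, the most direct way to estimate it is Monte Carlo sampling: run the current policy for several episodes and average the returns. Below is a rough sketch under the assumption of a Gymnasium-style environment `env` and the `SoftmaxPolicy` sketch above (both illustrative, not part of the original text):

```python
import torch

def estimate_return(policy, env, n_episodes=20, gamma=0.99):
    """Monte Carlo estimate of J(θ) = E_{τ~π_θ}[G(τ)]."""
    returns = []
    for _ in range(n_episodes):
        state, _ = env.reset()
        done, G, discount = False, 0.0, 1.0
        while not done:
            state_t = torch.as_tensor(state, dtype=torch.float32)
            action, _ = policy.sample_action(state_t)
            state, reward, terminated, truncated, _ = env.step(action)
            G += discount * reward          # accumulate the (discounted) return G(τ)
            discount *= gamma
            done = terminated or truncated
        returns.append(G)
    # The average return over sampled trajectories approximates J(θ).
    return sum(returns) / len(returns)
```

This only tells us how good the current θ is; the question the next section answers is how to compute a gradient of this objective so we can improve θ.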


Section 2: How to Optimize? – The Policy Gradient Theorem