Alright, let's keep going! After mastering value-based learning methods, exemplified by Q-Learning, we're now embarking on a brand-new chapter. We will shift from learning a "map" to learning a "compass," directly optimizing the agent's decision-making process.



📖 Chapter 4: Paving a New Path – Policy Gradient (PG) and REINFORCE

Introduction to this Chapter:

Hello, brave explorer! Previously, our approach was to first learn a detailed "value map" (Q-table) and then choose paths based on the scores on that map. This method has been very effective for many problems, but it has also encountered bottlenecks 🚧:

  1. Continuous Action Spaces: When actions are continuous (e.g., a robot arm that needs to rotate by 37.5 degrees), we cannot compute and compare a Q-value for every one of infinitely many actions.
  2. Stochastic Policies: In some games (like "Rock-Paper-Scissors"), the optimal policy is itself stochastic. Value-based methods, which greedily pick the action with the highest Q-value, usually yield only deterministic policies.

To solve these problems, we need a completely new way of thinking: instead of learning values, we directly learn the policy itself! This is the core of Policy-Based methods. In this chapter, we will explore its most fundamental and core idea – Policy Gradient.


Section 1: The New Paradigm – Direct Policy Parameterization

The first step in policy-based learning is to represent our policy π directly with a set of parameters θ, giving a parameterized policy π(a|s, θ): the probability of taking action a in state s under parameters θ. In practice, θ is typically the set of weights of a neural network.
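To make this concrete, here is a minimal sketch of a parameterized policy for a discrete action space. It uses a small PyTorch network; the class name, layer sizes, and method names are illustrative assumptions, not something defined in this chapter:

```python
import torch
import torch.nn as nn

class SoftmaxPolicy(nn.Module):
    """π(a|s, θ): the network weights play the role of the parameters θ."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        # Map a state to a probability distribution over actions.
        logits = self.net(state)
        return torch.softmax(logits, dim=-1)

    def sample_action(self, state):
        # Sampling keeps the policy stochastic, something a greedy
        # argmax over Q-values cannot express.
        probs = self.forward(state)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        return action.item(), dist.log_prob(action)
```

Because the output is a probability distribution rather than a single best action, this representation handles stochastic policies naturally, and the same idea extends to continuous actions by outputting, say, the mean and standard deviation of a Gaussian.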

Our goal is to adjust the parameters θ to find an optimal policy π* that maximizes the expected total return J(θ) under that policy.

J(θ) = E_{τ~π_θ}[ G(τ) ]

where τ is a trajectory (a complete episode of states, actions, and rewards) generated by running the policy π_θ, and G(τ) is the total return accumulated along that trajectory.
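Since J(θ) is an expectation over trajectories, the most direct way to estimate it is Monte Carlo sampling: run the current policy for several episodes and average the returns. Below is a rough sketch under the assumption of a Gymnasium-style environment `env` and the `SoftmaxPolicy` sketch above (both illustrative, not part of the original text):

```python
import torch

def estimate_return(policy, env, n_episodes=20, gamma=0.99):
    """Monte Carlo estimate of J(θ) = E_{τ~π_θ}[G(τ)]."""
    returns = []
    for _ in range(n_episodes):
        state, _ = env.reset()
        done, G, discount = False, 0.0, 1.0
        while not done:
            state_t = torch.as_tensor(state, dtype=torch.float32)
            action, _ = policy.sample_action(state_t)
            state, reward, terminated, truncated, _ = env.step(action)
            G += discount * reward          # accumulate the (discounted) return G(τ)
            discount *= gamma
            done = terminated or truncated
        returns.append(G)
    # The average return over sampled trajectories approximates J(θ).
    return sum(returns) / len(returns)
```

This only tells us how good the current θ is; the question the next section answers is how to compute a gradient of this objective so we can improve θ.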


Section 2: How to Optimize? – The Policy Gradient Theorem