Alright, let's keep going! In the previous chapter, we mastered the powerful weapon of directly optimizing policies – the policy gradient – but also experienced its "side effect" of high variance. Now, we are about to undergo a magnificent upgrade, building a stronger, more stable learning framework by merging the wisdom of Value-Based Learning and Policy-Based Learning.



📖 Chapter 5: A Powerful Combination – Actor-Critic, the Baseline, and the Advantage Function

Introduction to this Chapter:

Hello, brave explorer! In the REINFORCE algorithm from the previous chapter, we were like a strict coach who, only after an entire game had ended, would broadly praise or criticize every player (every action) based on the final score (the total return G_t). While this approach was fair (unbiased), it was extremely inefficient (high variance), because even a brilliant move during the game could be drowned out by an eventual defeat.

To solve this problem, a brilliant idea emerged: can we introduce a professional "scout" or "commentator" during the game, who immediately provides timely and professional feedback after each player makes a move?

This is the core idea behind the Actor-Critic (AC) architecture. It's no longer a "solo performance," but a brilliant play featuring two protagonists.


Section 1: The Debut of Two Protagonists – Actor-Critic Architecture

The Actor-Critic framework divides the agent into two cooperating components:

| Role | Icon | Identity and Responsibilities |
| --- | --- | --- |
| Actor | 🎭 | **Policy Function** `π(a\|s, θ)`. This is the "player," responsible for selecting and executing an action in each state according to the current policy. It only "acts" and does not evaluate itself. |
| Critic | 🧐 | **Value Function**. This is the "scout," responsible for observing the Actor's actions and the environment's feedback, then "evaluating" or "scoring" the Actor's actions. It only "comments" and does not act directly. |
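
To make the two roles concrete, here is a minimal sketch in PyTorch (an illustrative assumption; the chapter itself does not prescribe any library, and the layer sizes and names are arbitrary): the Actor maps a state to action probabilities, while the Critic maps a state to a single score.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network π(a|s, θ): maps a state to a probability distribution over actions."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Softmax turns raw scores into action probabilities π(a|s, θ)
        return torch.softmax(self.net(state), dim=-1)

class Critic(nn.Module):
    """Value network: maps a state to a single scalar evaluation."""
    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # The Critic never chooses an action; it only outputs a score for the state
        return self.net(state).squeeze(-1)
```

Notice the division of labor: the Actor's output is a distribution that actions can be sampled from, while the Critic's output is just a number. Neither component is useful on its own; the loop below shows how their outputs are combined.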

AC Collaboration Process 🤝:

  1. Actor Acts 🎭: In state S_t, the Actor selects and executes action A_t according to its policy π(a|s, θ).
  2. Environment Feedback 🌍: The environment transitions to the new state S_{t+1} and provides an immediate reward R_{t+1}.
  3. Critic Evaluates 🧐: The Critic observes this transition process (S_t, A_t, R_{t+1}, S_{t+1}), and then provides a high-quality evaluation signal. This signal tells the Actor "how good" its action A_t was.
  4. Both Learn 🧠: The Critic uses the new feedback to improve its own value estimate, while the Actor uses the Critic's evaluation signal to adjust its policy parameters θ, making well-rated actions more likely and poorly-rated ones less likely (a code sketch of this loop follows below).
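
Putting the four steps together, here is a self-contained sketch of a single Actor-Critic interaction step. It is again written in PyTorch as an illustrative assumption: `dummy_env_step`, the dimensions, and the learning rates are placeholders, and the one-step TD error is used as the Critic's evaluation signal, one common choice that Section 2 motivates.

```python
import torch
import torch.nn as nn

# --- illustrative sizes and hyperparameters (assumptions, not from the text) ---
state_dim, n_actions, gamma = 4, 2, 0.99

# Actor: π(a|s, θ)   Critic: V(s, w)
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def dummy_env_step(state, action):
    """Placeholder for a real environment: returns (next_state, reward, done)."""
    return torch.randn(state_dim), torch.rand(()).item(), False

state = torch.randn(state_dim)  # S_t (placeholder initial state)

# 1. Actor acts: sample A_t from π(a|S_t, θ)
probs = torch.softmax(actor(state), dim=-1)
dist = torch.distributions.Categorical(probs)
action = dist.sample()

# 2. Environment feedback: S_{t+1} and R_{t+1}
next_state, reward, done = dummy_env_step(state, action.item())

# 3. Critic evaluates: one common signal is the TD error
#    δ = R_{t+1} + γ V(S_{t+1}) - V(S_t)
v_s = critic(state).squeeze(-1)
with torch.no_grad():
    v_next = 0.0 if done else critic(next_state).squeeze(-1)
    td_target = reward + gamma * v_next
td_error = td_target - v_s

# 4a. Critic learns: move V(S_t) toward the TD target (squared-error loss)
critic_loss = td_error.pow(2)
critic_opt.zero_grad()
critic_loss.backward()
critic_opt.step()

# 4b. Actor learns: weight log π(A_t|S_t, θ) by the (detached) evaluation signal,
#     so well-rated actions become more likely under the policy
actor_loss = -td_error.detach() * dist.log_prob(action)
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()
```

Note the `.detach()` on the TD error: the Critic's score is treated as a fixed teaching signal for the Actor, so gradients from the Actor's loss do not flow back into the Critic.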

Section 2: The Key to Taming the Tiger – Baseline and Advantage Function

So, what kind of evaluation signal should the Critic provide to most effectively help the Actor learn and solve the high variance problem of REINFORCE?