Alright, let's keep going! In the previous chapter, we mastered the powerful weapon of directly optimizing policies – the policy gradient – but also experienced its "side effect" of high variance. Now, we are about to undergo a magnificent upgrade, building a stronger, more stable learning framework by merging the wisdom of Value-Based Learning and Policy-Based Learning.
Introduction to this Chapter:
Hello, brave explorer! In the REINFORCE algorithm from the previous chapter, we were like a strict coach who, only after an entire game has ended, broadly praises or criticizes every player (every action) based on the final score (the total return G_t). While this approach is fair (unbiased), it is extremely inefficient (high variance), because a brilliant moment during the game can be drowned out by the final defeat.
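As a quick reminder of why the feedback is so delayed, REINFORCE's learning signal is the Monte Carlo return G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + …, which can only be computed once the whole episode is over. A minimal sketch (the function name `discounted_returns` is our own, not from any library):

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = R_{t+1} + gamma * G_{t+1} for every step t.

    REINFORCE can only do this after the entire episode has finished,
    which is exactly why its feedback is delayed and high-variance:
    every G_t mixes in the randomness of all later steps.
    """
    returns = [0.0] * len(rewards)
    g = 0.0
    # Walk backwards so each G_t reuses the already-computed G_{t+1}.
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns
```

Note that an early "brilliant moment" (a reward at step 0) and a late collapse are blended into the same number G_0, so the coach cannot tell them apart.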
To solve this problem, a brilliant idea emerged: can we introduce a professional "scout" or "commentator" during the game, who immediately provides timely and professional feedback after each player makes a move?
This is the core idea behind the Actor-Critic (AC) architecture. It's no longer a "solo performance," but a brilliant play featuring two protagonists.
The Actor-Critic framework divides the agent into two cooperating components:
| Role | Icon | Identity and Responsibilities |
|---|---|---|
| Actor | 🎭 | **Policy Function** `π(a\|s, θ)`. This is the "player," responsible for selecting and executing actions in the environment according to the current state. It only "acts" and does not evaluate itself. |
| Critic | 🧐 | **Value Function**. This is the "scout," responsible for observing the Actor's actions and the environment's feedback, then "evaluating" or "scoring" the Actor's actions. It only "comments" and does not act directly. |
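To make the division of labor concrete, here is a bare-bones sketch of the two components for a tiny discrete problem. All class and method names here are illustrative, not a standard API: the Actor is a softmax policy over action preferences, and the Critic is a table of state values.

```python
import math
import random

class Actor:
    """The 'player': a softmax policy pi(a|s, theta) over discrete actions."""

    def __init__(self, n_states, n_actions):
        # theta[s][a] is the preference for action a in state s.
        self.theta = [[0.0] * n_actions for _ in range(n_states)]

    def probs(self, s):
        """Return pi(.|s): softmax of the preferences for state s."""
        prefs = self.theta[s]
        m = max(prefs)  # subtract the max for numerical stability
        exps = [math.exp(p - m) for p in prefs]
        z = sum(exps)
        return [e / z for e in exps]

    def act(self, s):
        """Sample an action from pi(.|s). The Actor acts; it never scores."""
        n_actions = len(self.theta[s])
        return random.choices(range(n_actions), weights=self.probs(s))[0]

class Critic:
    """The 'scout': a state-value estimate V(s). It scores; it never acts."""

    def __init__(self, n_states):
        self.v = [0.0] * n_states

    def value(self, s):
        return self.v[s]
```

The key design point is the strict separation: the Actor owns θ and the action choice, while the Critic owns the value estimate used for feedback.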
AC Collaboration Process 🤝:

1. 🎭 In state S_t, the Actor selects and executes action A_t according to its policy π(a|s, θ).
2. 🌍 The environment transitions to the new state S_{t+1} and provides an immediate reward R_{t+1}.
3. 🧐 The Critic observes this transition (S_t, A_t, R_{t+1}, S_{t+1}) and provides a high-quality evaluation signal. This signal tells the Actor "how good" its action A_t was.
4. 🧠 The Actor uses the Critic's evaluation signal to update its policy parameters θ. If the evaluation is positive, the probability of that action increases; otherwise, it decreases.

So, what kind of evaluation signal should the Critic provide to most effectively help the Actor learn and solve the high-variance problem of REINFORCE?
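The four-step loop can be sketched as runnable code. Everything here is a toy stand-in of our own making: a two-state environment, a tabular policy, and, crucially, the Critic's evaluation signal left as a placeholder (just the raw reward for now), since choosing that signal well is exactly the question we turn to next.

```python
import random

def env_step(s, a):
    """Hypothetical two-state environment: action 1 is always better."""
    return (s + 1) % 2, (1.0 if a == 1 else 0.0)

def run_episode(actor_probs, steps=5, lr=0.1):
    """One Actor-Critic interaction loop, following the four steps above.

    actor_probs[s] holds the action probabilities pi(.|s) as a plain list.
    The Critic's evaluation is a placeholder here (the raw reward R_{t+1});
    a better, lower-variance signal is what this chapter develops next.
    """
    s = 0
    for _ in range(steps):
        # 1. The Actor samples an action A_t from pi(.|S_t).
        a = random.choices([0, 1], weights=actor_probs[s])[0]
        # 2. The environment returns S_{t+1} and the reward R_{t+1}.
        s_next, r = env_step(s, a)
        # 3. The Critic scores the transition (placeholder signal).
        evaluation = r  # TODO: replace with a lower-variance signal
        # 4. The Actor nudges pi toward positively-scored actions,
        #    then renormalizes so the probabilities still sum to 1.
        actor_probs[s][a] += lr * evaluation
        total = sum(actor_probs[s])
        actor_probs[s] = [p / total for p in actor_probs[s]]
        s = s_next
    return actor_probs
```

Even with the crude placeholder, the loop illustrates the timing difference from REINFORCE: the Actor receives feedback after every single step, not once per episode.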