Okay, let's continue! In the previous chapter, we mastered how to evaluate the value of a given policy (solving the Prediction Problem). Now, we will shift our focus to the Control Problem: how to find an optimal policy that allows the agent to achieve the maximum reward.
Introduction to this Chapter:
Hello, brave explorer! In practice, we are not satisfied with knowing "how good this maze is"; we want to know "should I go left or right at the fork in the road?" To directly guide actions, we need the action-value function Q(s, a).
The Q-function tells us the long-term value of taking action a in state s. We simply choose the action with the highest Q-value in each state to find the optimal policy.
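As a quick illustration, here is a minimal sketch of "act greedily with respect to Q" using a tiny Q-table; the states, actions, and numbers are made up purely for this example:

```python
import numpy as np

# Hypothetical 2-state, 2-action Q-table (actions: 0 = left, 1 = right).
# The values are invented for illustration only.
Q = np.array([
    [0.2, 0.8],   # state 0: going right looks more valuable
    [0.5, 0.1],   # state 1: going left looks more valuable
])

def greedy_action(Q, state):
    """Pick the action with the highest Q-value in the given state."""
    return int(np.argmax(Q[state]))

print(greedy_action(Q, 0))  # -> 1 (right)
print(greedy_action(Q, 1))  # -> 0 (left)
```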
In this chapter, we will focus on two of the most basic and powerful Q-value learning algorithms: Sarsa and Q-Learning. Both belong to the family of TD learning but employ distinctly different update strategies.
Sarsa (State-Action-Reward-State'-Action') is an On-Policy TD control algorithm.
👣 Core Idea:
Sarsa's update is based on the trajectory the agent actually executes. It corrects the Q-value of the current action using the next action that "will be taken" under the current policy.
"I'm updating my review of the pizza place, taking into account the value of the ice cream shop I actually** will go to."**
📝 Update Rule (a code sketch follows the term breakdown below):
Q(S_t, A_t) ← Q(S_t, A_t) + α * [R_{t+1} + γ * Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]
- Q(S_t, A_t): 🤔 The old Q-value for the current state-action pair.
- R_{t+1} + γ * Q(S_{t+1}, A_{t+1}): 🎯 The TD target. A_{t+1} is the next action we actually select according to the policy in the next state S_{t+1}.
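To make the update rule concrete, here is a minimal sketch of a single tabular Sarsa update; the function name, the toy 2×2 Q-table, and the values of α and γ are illustrative assumptions rather than anything prescribed above:

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One Sarsa step: nudge Q[s, a] toward the TD target r + gamma * Q[s', a'].

    a_next is the action the current policy actually chooses in s_next,
    which is exactly what makes this an on-policy update.
    """
    td_target = r + gamma * Q[s_next, a_next]  # R_{t+1} + γ * Q(S_{t+1}, A_{t+1})
    td_error = td_target - Q[s, a]             # gap between target and old estimate
    Q[s, a] += alpha * td_error                # move a fraction α toward the target
    return Q

# Tiny usage example with a made-up 2-state, 2-action Q-table.
Q = np.zeros((2, 2))
Q = sarsa_update(Q, s=0, a=1, r=1.0, s_next=1, a_next=0)
print(Q[0, 1])  # 0.1
```

Note that Q-Learning would score the same transition differently, bootstrapping from max_a Q(S_{t+1}, a) instead of from the action actually taken.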
Sarsa is "on-policy," meaning: