Okay, let's continue! In the previous chapter, we mastered how to evaluate the value of a given policy (solving the Prediction Problem). Now, we will shift our focus to the Control Problem: how to find an optimal policy that maximizes the agent's cumulative reward.



📖 Chapter 3: The Wisdom of Choice – Control through Sarsa and Q-Learning

Introduction to this Chapter:

Hello, brave explorer! In practice, we are not satisfied with knowing "how good this maze is"; we want to know "should I go left or right at the fork in the road." To directly guide actions, we need the action-value function Q(s, a).

The Q-function tells us the long-term value of taking action a in state s and then following the policy afterwards. Once we have good Q-value estimates, extracting a policy is easy: in each state, simply choose the action with the highest Q-value.
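For instance, with a tabular Q-function this "pick the best action" step is just an argmax over a lookup table. The sketch below is a minimal illustration; the dict-of-dicts Q-table layout and the names `Q`, `state`, and `greedy_action` are assumptions made here for clarity, not a fixed convention:

```python
# A minimal sketch of greedy action selection from a tabular Q-function.
# The Q-table layout (state -> {action: value}) is an illustrative assumption.

def greedy_action(Q, state):
    """Return the action with the highest Q-value in the given state."""
    actions = Q[state]                    # e.g. {"left": 0.4, "right": 0.9}
    return max(actions, key=actions.get)  # argmax over the available actions

# Toy example: at a fork, "right" has the higher estimated value.
Q = {"fork": {"left": 0.4, "right": 0.9}}
print(greedy_action(Q, "fork"))  # -> "right"
```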

In this chapter, we will focus on two of the most basic and powerful Q-value learning algorithms: Sarsa and Q-Learning. Both are temporal-difference (TD) learning methods, but they employ distinctly different update strategies.


Section 1: The Down-to-Earth Realist – Sarsa Algorithm

Sarsa (State-Action-Reward-State'-Action') is an On-Policy TD control algorithm.

👣 Core Idea:

Sarsa's update is based on the trajectory the agent actually executes: it corrects the Q-value of the current state-action pair using the next action the agent will actually take.

"I'm updating my review of the pizza place, taking into account the value of the ice cream shop I actually** will go to."**

Sarsa Algorithm Update Formula 📝

Q(S_t, A_t) ← Q(S_t, A_t) + α * [R_{t+1} + γQ(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]
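To make the formula concrete, here is a minimal sketch of a single Sarsa update on a tabular Q-function. The Q-table layout and the parameter names `alpha` and `gamma` (matching α and γ above) are illustrative assumptions:

```python
# A minimal sketch of one Sarsa update step on a tabular Q-function.
# Assumes Q is a dict mapping (state, action) pairs to values; this layout
# is an illustrative choice, not the only possible one.

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """Apply Q(S,A) <- Q(S,A) + alpha * [R + gamma * Q(S',A') - Q(S,A)]."""
    td_target = r + gamma * Q[(s_next, a_next)]  # uses the action actually taken next
    td_error = td_target - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return Q

# Toy example: update the pizza-place estimate after actually heading
# to the ice cream shop next.
Q = {("pizza", "eat"): 0.5, ("ice_cream", "eat"): 0.8}
sarsa_update(Q, "pizza", "eat", r=1.0, s_next="ice_cream", a_next="eat")
print(Q[("pizza", "eat")])  # nudged toward the TD target
```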

Characteristics of Sarsa: On-Policy 🔗

Sarsa is "on-policy," meaning: