Introduction to This Chapter:
Hello, brave explorer! Welcome to the fascinating world of Reinforcement Learning (RL). Imagine teaching a puppy 🐶 to play frisbee 🥏. You wouldn’t hand it a thick instruction manual saying, “When the frisbee is at this angle, jump at a 75-degree angle.” Instead, you give it a treat when it does well (e.g., catching the frisbee ✅) and nothing when it fails (e.g., missing ❌). Over time, the puppy “learns” the trick of catching frisbees through a cycle of trial-and-error with rewards.

Reinforcement learning is exactly this: a powerful field that enables computers to autonomously learn how to make optimal decisions in specific environments, just like the puppy. In this chapter, we’ll explore the basic “characters” of this world and the “physical laws” 🌌 that govern everything, along with in-depth analyses of those laws.
In any RL game, several fixed “characters” interact. Understanding them is the foundation for understanding everything else.
| Character | Icon | Description | Game Analogy (using Super Mario) |
|---|---|---|---|
| Agent | 🤖 | The learner and decision-maker. Our protagonist, which we aim to train to become smarter. | Mario: the hero on screen, jumping and eating mushrooms. |
| Environment | 🌍 | The external world where the agent exists. It defines the game’s rules and boundaries. | The entire game level, including bricks, pipes, enemies, and the goal flag. |
| State (S) | 📍 | A snapshot of the environment at a moment. It contains all the information the agent needs to decide. | The current game screen: Mario’s position, whether he’s “big,” enemy locations, etc. |
| Action (A) | 🕹️ | The set of operations the agent can perform. | Mario’s possible moves: left, right, jump. |
| Reward (R) | 💎 | Immediate feedback from the environment on the agent’s action. A direct measure of an action’s quality. | +100 points (collecting a coin), -1 life (touching an enemy). |
| Policy (π) | 🧠 | The agent’s “brain” or behavioral guidelines. Defines which action the agent chooses in a given state. | A player’s “operation habits.” For example, an aggressive player jumps on enemies when spotted. |
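To make the Policy row concrete, here is a minimal Python sketch of a policy π as a plain state-to-action function. Everything in it (the state dictionary, the action names) is hypothetical, invented purely for illustration:

```python
import random

# Hypothetical action set A for our Mario-style example.
ACTIONS = ["left", "right", "jump"]

def policy(state: dict) -> str:
    """A toy policy π: maps a state S to an action A.

    `state` is an assumed snapshot of the screen, e.g.
    {"enemy_nearby": True, "mario_is_big": False}.
    """
    if state.get("enemy_nearby"):
        return "jump"  # the "aggressive player" habit from the table
    return random.choice(ACTIONS)  # otherwise pick a move at random
```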
Interaction Loop 🔄: The game proceeds in this cycle (a minimal code sketch follows the list):

1. The agent 🤖 observes the environment 🌍 in its current state S_t.
2. The agent chooses an action A_t based on its policy 🧠 π.
3. Upon receiving A_t, the environment 🌍 transitions to a new state S_{t+1}.
4. The environment emits a reward R_{t+1} 💎 as feedback.
5. The agent 🤖 receives the new state and reward, then repeats step 1 to start a new decision cycle.
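Here is a self-contained Python sketch of the loop above. The `ToyEnv` class and its `reset`/`step` interface are hypothetical (the interface loosely mirrors the common Gymnasium convention), chosen only to show the five steps in code:

```python
import random

class ToyEnv:
    """A hypothetical 1-D world: reach position +5 to win; -5 ends the episode."""
    def reset(self):
        self.pos = 0
        return self.pos  # initial state S_0

    def step(self, action):
        self.pos += 1 if action == "right" else -1
        reward = 1.0 if self.pos == 5 else 0.0  # R_{t+1}
        done = self.pos in (5, -5)              # episode over?
        return self.pos, reward, done           # S_{t+1}, R_{t+1}, done

def policy(state):
    """A toy policy π: mostly move right, occasionally explore left."""
    return "right" if random.random() < 0.8 else "left"

env = ToyEnv()
state = env.reset()                          # step 1: observe S_t
done = False
while not done:
    action = policy(state)                   # step 2: choose A_t from π
    state, reward, done = env.step(action)   # steps 3-4: get S_{t+1}, R_{t+1}
    # step 5: loop back with the new state and repeat
```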
The agent’s goal is to maximize the future discounted return:

G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... = Σ_{k=0}^{∞} γ^k R_{t+k+1}

- G_t: 📈 The total return starting from time step t.
- R_{t+k+1}: 💰 The immediate reward received k steps in the future.
- γ: ⏳ The discount factor, with 0 ≤ γ ≤ 1; the smaller it is, the less the agent values distant rewards.
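As a quick numeric illustration of the formula (the reward values and γ below are made up), a finite-horizon version of G_t is just a weighted sum:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_t = Σ_k γ^k · R_{t+k+1} for a finite reward sequence."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# Hypothetical rewards R_{t+1}, R_{t+2}, R_{t+3} = 1, 0, 10:
print(discounted_return([1.0, 0.0, 10.0]))  # 1 + 0.9·0 + 0.81·10 = 9.1
```

Because γ < 1, the reward of 10 received three steps out contributes only 8.1 to G_t, while the same reward received immediately would contribute the full 10: the discount factor makes the agent prefer sooner rewards over later ones.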