Hello, brave explorer! Welcome to this exciting journey into reinforcement learning. Before diving into the details of each chapter, please accept this carefully crafted "learning map" we've prepared for you. It will act as a guide, showing you the full landscape of the knowledge continent we're about to explore, as well as the intrinsic connections between its various "attractions" (core algorithms).
Our journey will unfold along two main paths, [Value Learning] and [Policy Learning], and we will ultimately witness their magnificent integration.
- The interaction loop between Agent and Environment.
- Reward vs. Return: short-term vs. long-term objectives.
- Policy vs. Value Function: two core tools for guiding actions.
- Bellman Equation: the cosmic principle describing the intrinsic consistency of value functions, and the theoretical cornerstone for all our subsequent explorations.
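As a small preview (using standard notation, not a formula taken from a later chapter), the consistency that the Bellman equation expresses for the action-value function of a policy \(\pi\) looks like this:

```latex
Q^{\pi}(s, a) = \mathbb{E}\left[\, R_{t+1} + \gamma \, Q^{\pi}(S_{t+1}, A_{t+1}) \;\middle|\; S_t = s,\ A_t = a \,\right]
```

In words: the value of taking action \(a\) in state \(s\) must equal the expected immediate reward plus the discounted value of whatever comes next. Every algorithm on this map exploits this self-consistency in some way.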
The first main path, [Value Learning]: score every state-action pair (that is, learn Q(s, a)), then determine the optimal path based on the scores on the map.

Classic Tabular Methods → Deep Networks
- Monte Carlo (MC): "final exam"-style learning based on complete episodes.
- Temporal Difference (TD): "quiz"-style learning based on single-step updates, and the core of all subsequent algorithms.
- Sarsa (On-Policy): down-to-earth, learning from the actions actually taken, emphasizing safety and stability.
- Q-Learning (Off-Policy): ambitious, learning from the best possible next action, emphasizing efficiency and optimality.
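To make the Sarsa/Q-Learning contrast concrete before the detailed chapters, here is a minimal sketch on a hypothetical five-state corridor (this toy MDP, its reward of +1 at the rightmost state, and all hyperparameters are illustrative assumptions, not examples from later chapters). The only line that differs between the two algorithms is how the TD target is formed: Sarsa bootstraps from the action actually taken next, Q-Learning from the best possible next action.

```python
import random

# Hypothetical toy MDP for illustration: states 0..4 in a corridor,
# actions 0 = left, 1 = right, reward +1 on reaching state 4 (terminal).
N_STATES, ACTIONS = 5, (0, 1)
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

def step(s, a):
    """Move left/right inside the corridor; the episode ends at state 4."""
    s2 = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0), s2 == N_STATES - 1

def eps_greedy(Q, s):
    """Behaviour policy: mostly greedy, occasionally random (exploration)."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def train(off_policy, episodes=500):
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        s, done = 0, False
        a = eps_greedy(Q, s)
        while not done:
            s2, r, done = step(s, a)
            a2 = eps_greedy(Q, s2)  # the action we will actually take next
            if off_policy:
                # Q-Learning target: the best possible next action
                target = r + (0.0 if done else GAMMA * max(Q[(s2, b)] for b in ACTIONS))
            else:
                # Sarsa target: the action actually taken next
                target = r + (0.0 if done else GAMMA * Q[(s2, a2)])
            Q[(s, a)] += ALPHA * (target - Q[(s, a)])
            s, a = s2, a2
    return Q

random.seed(0)
q_sarsa = train(off_policy=False)
q_qlearn = train(off_policy=True)
# Both variants should come to prefer "right" over "left" in the start state.
print(q_sarsa[(0, 1)] > q_sarsa[(0, 0)], q_qlearn[(0, 1)] > q_qlearn[(0, 0)])
```

Note how small the difference is in code, and how large it is in spirit: Sarsa evaluates the (exploratory) policy it really follows, while Q-Learning evaluates the greedy policy regardless of the exploration actually performed.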