Hello, brave explorer! Welcome to this exciting journey into reinforcement learning. Before diving into the details of each chapter, please accept this carefully crafted "learning map" we've prepared for you. It will act as a guide, showing you the full landscape of the knowledge continent we're about to explore, as well as the intrinsic connections between its various "attractions" (core algorithms).
Our journey will unfold along two main paths, [Value Learning] and [Policy Learning], and we will ultimately witness their magnificent integration.
- The interaction loop between Agent and Environment.
- Reward vs. Return: short-term vs. long-term objectives.
- Policy vs. Value Function: two core tools for guiding actions.
- Bellman Equation: the cosmic principle describing the intrinsic consistency of value functions, and the theoretical cornerstone for all our subsequent explorations.
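As a small preview (using standard notation, not a formula taken from a later chapter), the consistency that the Bellman equation expresses for the action-value function of a policy \(\pi\) looks like this:

```latex
Q^{\pi}(s, a) = \mathbb{E}\left[\, R_{t+1} + \gamma \, Q^{\pi}(S_{t+1}, A_{t+1}) \;\middle|\; S_t = s,\ A_t = a \,\right]
```

In words: the value of taking action \(a\) in state \(s\) must equal the expected immediate reward plus the discounted value of whatever comes next. Every algorithm on this map exploits this self-consistency in some way.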
The first main path, [Value Learning]: score every state-action pair (that is, learn Q(s, a)), then determine the optimal path based on the scores on the map.

Classic Tabular Methods → Deep Networks
- Monte Carlo (MC): "final exam"-style learning based on complete episodes.
- Temporal Difference (TD): "quiz"-style learning based on single-step updates, and the core of all subsequent algorithms.
- Sarsa (On-Policy): down-to-earth, learning from the actions actually taken, emphasizing safety and stability.
- Q-Learning (Off-Policy): ambitious, learning from the best possible next action, emphasizing efficiency and optimality.
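To make the Sarsa/Q-Learning contrast concrete before the detailed chapters, here is a minimal sketch on a hypothetical five-state corridor (this toy MDP, its reward of +1 at the rightmost state, and all hyperparameters are illustrative assumptions, not examples from later chapters). The only line that differs between the two algorithms is how the TD target is formed: Sarsa bootstraps from the action actually taken next, Q-Learning from the best possible next action.

```python
import random

# Hypothetical toy MDP for illustration: states 0..4 in a corridor,
# actions 0 = left, 1 = right, reward +1 on reaching state 4 (terminal).
N_STATES, ACTIONS = 5, (0, 1)
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

def step(s, a):
    """Move left/right inside the corridor; the episode ends at state 4."""
    s2 = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0), s2 == N_STATES - 1

def eps_greedy(Q, s):
    """Behaviour policy: mostly greedy, occasionally random (exploration)."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def train(off_policy, episodes=500):
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        s, done = 0, False
        a = eps_greedy(Q, s)
        while not done:
            s2, r, done = step(s, a)
            a2 = eps_greedy(Q, s2)  # the action we will actually take next
            if off_policy:
                # Q-Learning target: the best possible next action
                target = r + (0.0 if done else GAMMA * max(Q[(s2, b)] for b in ACTIONS))
            else:
                # Sarsa target: the action actually taken next
                target = r + (0.0 if done else GAMMA * Q[(s2, a2)])
            Q[(s, a)] += ALPHA * (target - Q[(s, a)])
            s, a = s2, a2
    return Q

random.seed(0)
q_sarsa = train(off_policy=False)
q_qlearn = train(off_policy=True)
# Both variants should come to prefer "right" over "left" in the start state.
print(q_sarsa[(0, 1)] > q_sarsa[(0, 0)], q_qlearn[(0, 1)] > q_qlearn[(0, 0)])
```

Note how small the difference is in code, and how large it is in spirit: Sarsa evaluates the (exploratory) policy it really follows, while Q-Learning evaluates the greedy policy regardless of the exploration actually performed.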