Introduction to this Chapter:
Hello, brave explorer! At the end of the previous chapter, we ran into the high wall of "the curse of dimensionality": traditional tabular Q-Learning breaks down when faced with massive state spaces, such as Atari games where the input is raw screen pixels. In 2013, the DeepMind team (later acquired by Google) dropped a bombshell: the Deep Q-Network (DQN). It successfully combined deep convolutional neural networks (CNNs) with Q-Learning, enabling agents to learn to play a wide range of Atari games just by observing the game screen, and even to surpass professional human players in many of them.
The advent of DQN marked the official beginning of the era of Deep Reinforcement Learning (DRL). In this chapter, we will delve into the two core secrets behind DQN's success.
🧠 DQN's first core idea is to replace the Q-Table with a deep neural network. This network is called the Q-network and is written as Q(s, a; w):

- s: The state (e.g., the pixel data of the game screen).
- a: The action (e.g., move left, move right).
- w: The weights and biases of the neural network. Our goal is to learn this set of optimal parameters w.

Now, the Q-Learning update no longer rewrites a single cell in a table; instead, it updates the network's parameters w through gradient descent.
Loss Function: We want the Q-network's prediction Q(s, a; w) to be as close as possible to the TD target y. Therefore, we can define a Mean Squared Error (MSE) loss function:

L(w) = E[ (y - Q(s, a; w))² ]

where the TD target y is:

y = R + γ * max_{a'} Q(s', a'; w)
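To make this loss concrete, here is a minimal sketch of one gradient-descent step, assuming the hypothetical PyTorch QNetwork from the previous sketch, an optimizer such as torch.optim.Adam over its parameters, and a batch of transitions (s, a, r, s', done) already collected from the game; all variable names are illustrative. Note that, exactly as in the formula above, the same network produces both the prediction and the target y.

```python
import torch
import torch.nn.functional as F

def dqn_loss_step(q_net, optimizer, batch, gamma=0.99):
    """One gradient step on L(w) = E[(y - Q(s, a; w))^2].

    batch: tensors (states, actions, rewards, next_states, dones) with shapes
           (B, 4, 84, 84), (B,), (B,), (B, 4, 84, 84), (B,).
    Naive version: the TD target y is computed with the *same* network,
    which is the source of the "moving target" problem discussed next.
    """
    states, actions, rewards, next_states, dones = batch

    # Q(s, a; w): the Q-values of the actions actually taken.
    q_pred = q_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)

    # TD target y = R + gamma * max_a' Q(s', a'; w); y = R at terminal states.
    with torch.no_grad():
        max_next_q = q_net(next_states).max(dim=1).values
        y = rewards + gamma * (1.0 - dones.float()) * max_next_q

    # Mean Squared Error between prediction and target, then a step on w.
    loss = F.mse_loss(q_pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```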
Seems simple, right? But it hides a crisis! 💥

If we train directly with this approach, we will find that the network struggles to converge and may even collapse. This is mainly due to two problems:

1. Correlated samples: Consecutive game frames are highly correlated, which violates the independent and identically distributed (i.i.d.) assumption that gradient-based training relies on, making learning unstable.
2. Moving target: The network used to compute the TD target y is the same Q-network we are currently updating. This means that with every update step, the target y itself also changes. It is like chasing a moving target, which makes training hard to stabilize.

✨ To solve the two problems mentioned above, DQN introduced two pioneering techniques: