Introduction to Reinforcement Learning. Part 3: Q-Learning with Neural Networks, Algorithm DQN


Markel Sanz Ausin
5 min read · Apr 3, 2020

In part 2, we saw how the Q-Learning algorithm works really well when the environment is simple and the function Q(s, a) can be represented using a table or a matrix of values. However, when there are billions of possible unique states and hundreds of available actions for each of them, the table becomes too big and tabular methods become impractical. The Deep Q-Networks (DQN) algorithm was invented by Mnih et al. [1] to solve this. It combines the Q-Learning algorithm with deep neural networks (DNNs). As is well known in the field of AI, DNNs are great non-linear function approximators, so they are used to approximate the Q-function, replacing the need for a table to store the Q-values.

In reality, this algorithm uses two DNNs to stabilize the learning process. The first one is the main neural network, represented by the weight vector θ, and it is used to estimate the Q-values for the current state s and action a: Q(s, a; θ). The second one is the target neural network, parametrized by the weight vector θ´. It has the exact same architecture as the main network, but it is used to estimate the Q-values of the next state s´ and next action a´: Q(s´, a´; θ´). All the learning takes place in the main network. The target network is frozen (its parameters are left unchanged) for a number of iterations (usually around 10,000), and then the weights of the main network are copied into the target network, transferring the learned knowledge from one to the other. This makes the estimates produced by the target network more accurate after the copying has occurred.
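
As a minimal sketch of that two-network setup (the layer sizes, copy interval, and TensorFlow/Keras calls below are illustrative assumptions, not the article's exact code):

```python
import tensorflow as tf

# Two networks with identical architectures: the main network is trained,
# while the target network only receives periodic copies of the main weights.
def build_q_network(num_states=4, num_actions=2):
    return tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(num_states,)),
        tf.keras.layers.Dense(num_actions),  # one Q-value per action, no activation
    ])

main_nn = build_q_network()
target_nn = build_q_network()

copy_period = 10_000  # how often (in steps) to sync the target network
for step in range(1, 50_001):
    # ... interact with the environment and train main_nn here ...
    if step % copy_period == 0:
        # Transfer the learned knowledge from the main network to the target network.
        target_nn.set_weights(main_nn.get_weights())
```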

Bellman’s Equation for DQN

Equation 1: Bellman’s Equation for the DQN algorithm.
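
The equation appears as an image in the original post; reconstructed in standard notation, the DQN version of Bellman's equation reads:

$$Q(s, a; \theta) = r + \gamma \max_{a'} Q(s', a'; \theta')$$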

Bellman's equation now takes this shape, where the Q functions are parametrized by the network weights θ and θ´. In order to train a neural network, we need a loss (or cost) function, which, in the case of the DQN algorithm, is defined as the squared difference between the two sides of the Bellman equation.

Equation 2: Loss function for the DQN algorithm.
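
This equation is also shown as an image in the original post; it is the squared temporal-difference error between the two sides of Equation 1:

$$L(\theta) = \Big( r + \gamma \max_{a'} Q(s', a'; \theta') - Q(s, a; \theta) \Big)^2$$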

This is the function we will minimize using gradient descent, and its gradient can be computed automatically by a Deep Learning library such as TensorFlow or PyTorch.
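
As a tiny, hedged illustration of that automatic differentiation in TensorFlow (the scalar values below are placeholders, not the article's code):

```python
import tensorflow as tf

# Toy example: TensorFlow differentiates the squared error for us.
q_pred = tf.Variable(0.5)    # stands in for Q(s, a; theta)
target = tf.constant(1.0)    # stands in for r + gamma * max_a' Q(s', a'; theta')

with tf.GradientTape() as tape:
    loss = tf.square(target - q_pred)

grad = tape.gradient(loss, q_pred)  # d(loss)/d(q_pred) = -2 * (target - q_pred)
print(grad.numpy())                 # prints -1.0
```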

The Code: Solving the CartPole Problem with TensorFlow

You can run the TensorFlow code yourself in this link (or a PyTorch version in this link), or keep reading to see the code without running it. As the code is a little longer than in the previous parts, I will only show the most important pieces here. The entire source code is available by following the links above.

Here is the CartPole environment. I am using OpenAI Gym to run and visualize it. The goal is to move the cart left and right in order to keep the pole in a vertical position. If the pole tilts more than a fixed threshold (12 degrees in CartPole-v0) from the vertical axis, the episode ends and we start over. Video 1 shows several episodes played in this environment by taking random actions.

Video 1: Random gameplay on CartPole environment.
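
In code, that random-action loop looks roughly like this, assuming the classic Gym API the article was written against (env.step returning four values):

```python
import gym

env = gym.make("CartPole-v0")
for episode in range(5):
    state = env.reset()
    done = False
    episode_reward = 0
    while not done:
        env.render()                         # draw the cart and the pole
        action = env.action_space.sample()   # choose left or right at random
        state, reward, done, info = env.step(action)
        episode_reward += reward
    print(f"Episode {episode}: reward = {episode_reward}")
env.close()
```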

To implement the DQN algorithm, we will start by creating the main (main_nn) and target (target_nn) DNNs. The target network will be a copy of the main one, but with its own copy of the weights. We will also need an optimizer and a loss function.

Algorithm 1. The architecture of the Deep Neural Networks.
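
The gist for Algorithm 1 is not reproduced in this text version; a sketch consistent with the description (the layer sizes, learning rate, and environment setup are my assumptions) could look like this:

```python
import gym
import numpy as np
import tensorflow as tf

env = gym.make("CartPole-v0")
num_features = env.observation_space.shape[0]  # 4 state variables
num_actions = env.action_space.n               # 2 actions: push left or right

class DQN(tf.keras.Model):
    """Dense neural network that maps a state to one Q-value per action."""
    def __init__(self):
        super(DQN, self).__init__()
        self.dense1 = tf.keras.layers.Dense(32, activation="relu")
        self.dense2 = tf.keras.layers.Dense(32, activation="relu")
        self.dense3 = tf.keras.layers.Dense(num_actions)  # no activation: raw Q-values

    def call(self, x):
        x = self.dense1(x)
        x = self.dense2(x)
        return self.dense3(x)

main_nn = DQN()
target_nn = DQN()

optimizer = tf.keras.optimizers.Adam(1e-4)
mse = tf.keras.losses.MeanSquaredError()
```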

Next, we will create the experience replay buffer, where we will store the agent's experience and from which we will sample batches later for training.

Algorithm 2. Experience Replay Buffer.
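
A sketch of such a buffer, implemented here as a deque of (state, action, reward, next_state, done) tuples with uniform random sampling (the article's exact class may differ):

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Experience replay buffer storing (state, action, reward, next_state, done) tuples."""
    def __init__(self, size):
        self.buffer = deque(maxlen=size)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def __len__(self):
        return len(self.buffer)

    def sample(self, num_samples):
        """Return a uniformly sampled batch as NumPy arrays."""
        states, actions, rewards, next_states, dones = [], [], [], [], []
        for state, action, reward, next_state, done in random.sample(self.buffer, num_samples):
            states.append(np.array(state, copy=False))
            actions.append(action)
            rewards.append(reward)
            next_states.append(np.array(next_state, copy=False))
            dones.append(done)
        return (np.array(states), np.array(actions),
                np.array(rewards, dtype=np.float32),
                np.array(next_states), np.array(dones, dtype=np.float32))
```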

We will also write helper functions to run the ε-greedy policy and to train the main network using the data stored in the buffer.

Algorithm 3. Functions for the ε-greedy policy and for training the neural network.
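
A sketch of these helpers under the assumptions above (it reuses env, main_nn, target_nn, num_actions, optimizer, and mse from the architecture sketch, plus the discount factor defined with the hyper-parameters below):

```python
def select_epsilon_greedy_action(state, epsilon):
    """With probability epsilon take a random action; otherwise act greedily."""
    if tf.random.uniform((1,)) < epsilon:
        return env.action_space.sample()              # explore
    else:
        return tf.argmax(main_nn(state)[0]).numpy()   # exploit

@tf.function
def train_step(states, actions, rewards, next_states, dones):
    """One gradient step of the main network on a batch from the replay buffer."""
    # Bellman target: r + gamma * max_a' Q(s', a'; theta'), using the target network.
    max_next_qs = tf.reduce_max(target_nn(next_states), axis=-1)
    target = rewards + (1.0 - dones) * discount * max_next_qs
    with tf.GradientTape() as tape:
        qs = main_nn(states)
        # Keep only the Q-value of the action that was actually taken.
        action_masks = tf.one_hot(actions, num_actions)
        masked_qs = tf.reduce_sum(action_masks * qs, axis=-1)
        loss = mse(target, masked_qs)
    grads = tape.gradient(loss, main_nn.trainable_variables)
    optimizer.apply_gradients(zip(grads, main_nn.trainable_variables))
    return loss
```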

We will also define the necessary hyper-parameters and train the neural network. We will play one episode at a time using the ε-greedy policy, store the data in the experience replay buffer, and train the main network after each step. Once every 2000 steps, we will copy the weights from the main network into the target network. We will also decay the value of epsilon (ε) over time, so that the agent starts with high exploration and gradually shifts toward exploitation. We will see how the algorithm improves episode after episode.

Algorithm 4. Main loop.
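
A sketch of that training loop, reusing the pieces above; the hyper-parameter values are illustrative, while the 2000-step copy interval and the ε decay follow the description:

```python
# Hyper-parameters (illustrative values).
num_episodes = 1000
epsilon = 1.0
batch_size = 32
discount = 0.99
buffer = ReplayBuffer(100000)
cur_frame = 0

last_100_ep_rewards = []
for episode in range(num_episodes + 1):
    state = env.reset()
    ep_reward, done = 0, False
    while not done:
        state_in = tf.expand_dims(state, axis=0)
        action = select_epsilon_greedy_action(state_in, epsilon)
        next_state, reward, done, info = env.step(action)
        ep_reward += reward
        # Save the transition to the experience replay buffer.
        buffer.add(state, action, reward, next_state, done)
        state = next_state
        cur_frame += 1
        # Copy the main network's weights into the target network every 2000 steps.
        if cur_frame % 2000 == 0:
            target_nn.set_weights(main_nn.get_weights())
        # Train the main network on a random batch once the buffer is large enough.
        if len(buffer) >= batch_size:
            states, actions, rewards, next_states, dones = buffer.sample(batch_size)
            train_step(states, actions, rewards, next_states, dones)

    # Decay epsilon: high exploration at first, mostly exploitation later.
    epsilon = max(epsilon - 0.001, 0.05)

    # Keep a running window of the last 100 episode rewards.
    if len(last_100_ep_rewards) == 100:
        last_100_ep_rewards = last_100_ep_rewards[1:]
    last_100_ep_rewards.append(ep_reward)

    if episode % 50 == 0:
        print(f"Episode {episode}/{num_episodes}. Epsilon: {epsilon:.2f}. "
              f"Reward in last 100 episodes: {np.mean(last_100_ep_rewards):.1f}")
```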

This is the result that will be displayed:

Episode 0/1000. Epsilon: 0.99. Reward in last 100 episodes: 14.0
Episode 50/1000. Epsilon: 0.94. Reward in last 100 episodes: 22.2
Episode 100/1000. Epsilon: 0.89. Reward in last 100 episodes: 23.3
Episode 150/1000. Epsilon: 0.84. Reward in last 100 episodes: 23.4
Episode 200/1000. Epsilon: 0.79. Reward in last 100 episodes: 24.9
Episode 250/1000. Epsilon: 0.74. Reward in last 100 episodes: 30.4
Episode 300/1000. Epsilon: 0.69. Reward in last 100 episodes: 38.4
Episode 350/1000. Epsilon: 0.64. Reward in last 100 episodes: 51.4
Episode 400/1000. Epsilon: 0.59. Reward in last 100 episodes: 68.2
Episode 450/1000. Epsilon: 0.54. Reward in last 100 episodes: 82.4
Episode 500/1000. Epsilon: 0.49. Reward in last 100 episodes: 102.1
Episode 550/1000. Epsilon: 0.44. Reward in last 100 episodes: 129.7
Episode 600/1000. Epsilon: 0.39. Reward in last 100 episodes: 151.7
Episode 650/1000. Epsilon: 0.34. Reward in last 100 episodes: 173.0
Episode 700/1000. Epsilon: 0.29. Reward in last 100 episodes: 187.3
Episode 750/1000. Epsilon: 0.24. Reward in last 100 episodes: 190.9
Episode 800/1000. Epsilon: 0.19. Reward in last 100 episodes: 194.6
Episode 850/1000. Epsilon: 0.14. Reward in last 100 episodes: 195.9
Episode 900/1000. Epsilon: 0.09. Reward in last 100 episodes: 197.9
Episode 950/1000. Epsilon: 0.05. Reward in last 100 episodes: 200.0
Episode 1000/1000. Epsilon: 0.05. Reward in last 100 episodes: 200.0

Now that the agent has learned to maximize the reward for the CartPole environment, we will make the agent interact with the environment one more time, to visualize the result and see that it is now able to keep the pole balanced for 200 frames.

Video 2. Result of visualizing the trained DQN agent interacting with the CartPole environment.
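
One way to produce that final run, reusing the names from the sketches above (rendering assumes a local display):

```python
# Let the trained agent play one greedy episode (no exploration).
state = env.reset()
done = False
ep_reward = 0
while not done:
    env.render()
    state_in = tf.expand_dims(state, axis=0)
    action = tf.argmax(main_nn(state_in)[0]).numpy()  # always pick the best action
    state, reward, done, info = env.step(action)
    ep_reward += reward
print(f"Reward of the trained agent: {ep_reward}")
env.close()
```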

You can run the TensorFlow code yourself in this link (or a PyTorch version in this link).

References:

[1] Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533.

[2] Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.

Entire series of Introduction to Reinforcement Learning:

  1. Part 1: Multi-Armed Bandit Problem
  2. Part 2: Q-Learning
  3. Part 3: Q-Learning with Neural Networks, algorithm DQN
  4. Part 4: Double DQN and Dueling DQN
  5. Part 5: Policy gradient algorithms

My GitHub repository with common Deep Reinforcement Learning algorithms (in development): https://github.com/markelsanz14/independent-rl-agents
