Introduction to Reinforcement Learning. Part 2: Q-Learning

In part 1, we described the multi-armed bandit problem and introduced concepts such as the state, the action, and the reward. However, the multi-armed bandit problem does not capture the full reinforcement learning (RL) problem. In a bandit problem the state never changes: in part 1 we faced the same 5 arms the whole time, and their reward probabilities did not change. In the full RL problem, by contrast, the agent's actions move it from one state to another, so it must consider how each action affects the states and rewards it will encounter later. Figure 1 shows this complete interaction cycle between the agent and the environment.

Figure 1: The complete RL cycle.

Value Function

In order to quantify the reward the agent will collect in the long term from each state, we introduce a value function V(s). This function outputs an estimate of the total reward the agent will receive from state s until the end of the episode. If we can estimate this value accurately, we can choose the action that leads the agent to the state with the highest value.
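A standard way to write this down, assuming rewards are discounted by a factor γ between 0 and 1 (the exact symbols here are chosen for illustration), is as the expected sum of discounted future rewards:

V(s) = E[ r₁ + γ·r₂ + γ²·r₃ + … ]

where r₁, r₂, r₃, … are the rewards the agent collects after leaving state s. The closer γ is to 1, the more weight the agent places on rewards that lie far in the future.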

Q-Learning — Solving the RL Problem

To solve the RL problem, the agent needs to learn to take the best action in each of the states it may encounter. To that end, the Q-learning algorithm learns how much long-term reward it will collect for each state-action pair (s, a). This is called an action-value function, which Q-learning represents as the function Q(s, a): the return the agent will receive if it takes action a in state s and then follows the policy indicated by Q until the end of the episode. For instance, if in state s the agent has two actions a1 and a2 available, Q provides two values, one for each action. If Q(s, a1) = 1 and Q(s, a2) = 4, the agent knows that a2 will collect a higher reward, so a2 is the action it should take.
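As a minimal sketch of this idea in Python (the dictionary and variable names are illustrative, not taken from the article's code):

# Hypothetical Q-values for a single state s with two available actions
q_values_s = {"a1": 1.0, "a2": 4.0}

# The greedy choice is simply the action with the highest Q-value
best_action = max(q_values_s, key=q_values_s.get)
print(best_action)  # prints "a2"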

Example: Grid-World

In this example, we have designed a simple environment in which the agent will start in the initial state s_i, and it must choose to move left or right at each step. If it reaches the leftmost state, the episode ends and the agent receives a reward of -5. On the other hand, if it reaches the rightmost state, the episode ends and the agent receives a reward of +5. The agent must learn to avoid the state with the reward of -5 and to move towards the state with the reward of +5. If the policy it learns is the policy with the highest possible return, we will say it has found the optimal policy.
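To make the idea of return concrete, suppose the chain has seven states with s_i in the middle and that future rewards are discounted by γ = 0.9 (both assumptions are consistent with the Q-values printed at the end of this post). Under the optimal policy the agent moves right three times, collecting rewards of 0, 0 and +5 along the way, for a discounted return of 0.9² × 5 ≈ 4.05; always moving left instead would yield 0.9² × (−5) ≈ −4.05.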


Bellman’s Equation

Q-learning rests on Bellman’s equation, which says that the value of taking action a in state s equals the immediate reward plus the discounted value of the best action available in the next state s':

Q(s, a) = r + γ · max_{a'} Q(s', a')

Equation 1. Bellman’s equation.
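Q-learning turns Bellman’s equation into an update rule that is applied after every single step the agent takes. The standard tabular form, with a learning rate α (shown here in its textbook version), is:

Q(s, a) ← Q(s, a) + α · [ r + γ · max_{a'} Q(s', a') − Q(s, a) ]

With α = 1 the update simply overwrites Q(s, a) with r + γ · max_{a'} Q(s', a'), which is enough for a small deterministic environment like the Grid-World above.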

The Code: Solving the Example

You can run the code yourself at this link, or you can keep reading to see the code without running it.

Algorithm 1. Defining the Grid-World.
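As a rough sketch of what this definition can look like in Python (the names and constants are illustrative; the seven-state layout and γ = 0.9 are inferred from the Q-values printed below):

import random

N_STATES = 7             # states 0..6; states 0 and 6 are terminal
LEFT, RIGHT = 0, 1       # the two available actions
START_STATE = 3          # assumed initial state s_i, in the middle of the chain
GAMMA = 0.9              # discount factor, consistent with the printed Q-values
EPSILON = 0.1            # exploration rate for the ε-greedy policy (assumed)
N_EPISODES = 10000       # number of training episodes (assumed)

# Q-table: one [left, right] pair of values per state, initialised to zero
q_values = [[0.0, 0.0] for _ in range(N_STATES)]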
Algorithm 2. Function for the ε-greedy policy.
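A sketch of the ε-greedy policy, reusing the names defined above: with probability ε the agent explores by picking a random action, and otherwise it exploits the action with the highest Q-value.

def epsilon_greedy(q_values, state, epsilon):
    # Explore: pick left or right uniformly at random
    if random.random() < epsilon:
        return random.choice([LEFT, RIGHT])
    # Exploit: pick the action with the highest Q-value, breaking ties at random
    if q_values[state][LEFT] == q_values[state][RIGHT]:
        return random.choice([LEFT, RIGHT])
    return LEFT if q_values[state][LEFT] > q_values[state][RIGHT] else RIGHT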
Algorithm 3. Function to simulate taking an action in the environment.
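A sketch of that step function: given the current state and the chosen action, it returns the next state, the reward, and whether the episode has ended, with −5 at the leftmost state and +5 at the rightmost.

def step(state, action):
    # Move one position left or right along the chain
    next_state = state - 1 if action == LEFT else state + 1
    # Reaching either end of the chain ends the episode
    if next_state == 0:
        return next_state, -5, True
    if next_state == N_STATES - 1:
        return next_state, 5, True
    # Every other transition gives no reward and the episode continues
    return next_state, 0, False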
Algorithm 4. Main loop.
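Finally, a sketch of the main loop, tying the pieces above together with the update from Equation 1 (a learning rate of 1 is assumed, which is enough for this small deterministic environment):

for episode in range(N_EPISODES):
    state = START_STATE
    done = False
    while not done:
        action = epsilon_greedy(q_values, state, EPSILON)
        next_state, reward, done = step(state, action)
        # Q-learning update: immediate reward plus discounted best future value
        best_next = 0.0 if done else max(q_values[next_state])
        q_values[state][action] = reward + GAMMA * best_next
        state = next_state

print("Q-values are:", [[round(q, 2) for q in pair] for pair in q_values])
for s in range(N_STATES):
    best = "left" if q_values[s][LEFT] >= q_values[s][RIGHT] else "right"
    print("Best action for state", s, "is", best)

Running this loop prints the learned Q-values and the greedy (best) action for each state: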
Q-values are: [[0.0, 0.0], [-5, 3.28], [2.95, 3.64], [3.28, 4.05], [3.64, 4.5], [4.05, 5], [0.0, 0.0]] 
Best action for state 0 is left
Best action for state 1 is right
Best action for state 2 is right
Best action for state 3 is right
Best action for state 4 is right
Best action for state 5 is right
Best action for state 6 is left
