Introduction to Reinforcement Learning. Part 2: Q-Learning

Q-Learning

Markel Sanz Ausin
5 min read · Mar 29, 2020

In part 1, we described the multi-armed bandit problem and introduced several concepts such as the state, action, reward, and so on. However, the multi-armed bandit problem does not represent the entire reinforcement learning (RL) problem. In bandit problems, the state is always the same: as we saw in part 1, we had the same 5 arms the whole time, and their reward probabilities never changed.

In the complete RL problem, the state changes every time we take an action. This is how we can represent it. The agent observes the state the environment is in, represented by the letter s. It then chooses which action to take, represented by the letter a. After executing that action, the environment provides a reward, represented by the letter r, and transitions to a new state, represented as s’. This is the main RL cycle, and it can be observed in Figure 1.

Figure 1: The complete RL cycle.

As a result of this cycle, the action the agent chooses at each time-step must not merely maximize the immediate reward at that step. It must be the action that leads to the highest possible sum of future rewards (in other words, the return) until the end of the episode. This cycle produces a sequence of states, actions and rewards, from the beginning of the episode until the end: s1, a1, r1; s2, a2, r2; …; sT, aT, rT, where T indicates the last step in the episode.

Value Function

In order to quantify the reward the agent will get in the long term from each state, we introduce a value function V(s). This function outputs an estimate of the total reward the agent will receive from state s until the end of the episode. If we can estimate this value correctly, we can choose the action that leads the agent to the state with the highest value.
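Using the notation from the sequence above, one informal way to write this is

V(s_t) = E[ r_t + r_{t+1} + … + r_T ]

that is, the expected sum of the rewards collected from time-step t, when the agent is in state s_t, until the end of the episode.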

Q-Learning — Solving the RL Problem

To solve the RL problem, the agent needs to learn to take the best action in each of the possible states it encounters. For that, the Q-learning algorithm learns how much long-term reward it will get for each state-action pair (s, a). We call this the action-value function, and the algorithm represents it as Q(s, a), which outputs the return the agent will receive when taking action a from state s and then following the policy indicated by Q until the end of the episode. For instance, if in state s we have actions a1 and a2 available, the function Q will provide two Q-values, one for each action. If Q(s, a1) = 1 and Q(s, a2) = 4, the agent knows that action a2 is better and will collect a higher reward, so that is the action it should take.
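As a tiny illustration of this greedy choice (the names here are only for this example, not part of the algorithm's code):

q_s = {"a1": 1, "a2": 4}             # Q-values for the two actions available in state s.
best_action = max(q_s, key=q_s.get)  # "a2", the action with the highest Q-value.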

Example: Grid-World

In this example, we have designed a simple environment in which the agent will start in the initial state s_i, and it must choose to move left or right at each step. If it reaches the leftmost state, the episode ends and the agent receives a reward of -5. On the other hand, if it reaches the rightmost state, the episode ends and the agent receives a reward of +5. The agent must learn to avoid the state with the reward of -5 and to move towards the state with the reward of +5. If the policy it learns is the policy with the highest possible return, we will say it has found the optimal policy.

The Q-Learning algorithm uses Bellman’s equation to solve this. This equation is used to learn the Q-values in a recursive manner.

Bellman’s Equation

Equation 1. Bellman’s equation.

The intuition behind this equation is the following. The Q-value for state s and action a, Q(s, a), must be equal to the immediate reward r obtained as a result of that action, plus the Q-value of the best possible next action a’ taken from the next state s’, multiplied by a discount factor γ, which takes a value in the range γ ∈ (0, 1]. This value γ decides how much weight we give to short-term versus long-term rewards, and it is a hyper-parameter we need to choose.
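Putting that description into a formula, using the same notation as above:

Q(s, a) = r + γ · max_a’ Q(s’, a’)

where the maximum is taken over all actions a’ available in the next state s’.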

The code, solving the example

You can run the code yourself in this link, or you can keep reading to see the code without running it.

Let’s start by defining our environment. The rewards are 0 for all states, except for the leftmost and rightmost states, which have rewards of -5 and +5, respectively. We will also create a list which defines whether a state is terminal or not. Finally, we will create a list called Q_values, where we will store the Q-values for each state-action pair.

Algorithm 1. Defining the Grid-World.
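The exact listing is in the linked notebook, but a minimal sketch consistent with the description above and with the printed output (7 states, two actions per state; the names rewards and terminal are illustrative, only Q_values is named in the text) could look like this:

# Reward obtained when entering each state: -5 for the leftmost state,
# +5 for the rightmost state, and 0 everywhere else.
rewards = [-5, 0, 0, 0, 0, 0, 5]

# The leftmost and rightmost states are terminal: the episode ends there.
terminal = [True, False, False, False, False, False, True]

# One Q-value per (state, action) pair; action 0 means "left" and action 1 means "right".
Q_values = [[0.0, 0.0] for _ in range(len(rewards))]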

We will now create a function that chooses an action for a given state using the ε-greedy policy.

Algorithm 2. Function for the ε-greedy policy.
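A sketch of such a function, assuming an exploration rate ε of 0.3 (the actual value used in the notebook may differ):

import random

def choose_action(state, epsilon=0.3):
    # With probability epsilon, explore: pick a random action (0 = left, 1 = right).
    if random.random() < epsilon:
        return random.randint(0, 1)
    # Otherwise, exploit: pick the action with the highest current Q-value.
    return Q_values[state].index(max(Q_values[state]))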

We will also create a function to simulate our environment. It will receive a state and action as inputs, and it will return the reward r and the next state s’.

Algorithm 3. Function to simulate taking an action in the environment.
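A sketch of this step function, assuming action 0 moves the agent one state to the left and action 1 moves it one state to the right:

def take_action(state, action):
    # Moving left decreases the state index, moving right increases it.
    next_state = state - 1 if action == 0 else state + 1
    # The reward depends on the state the agent lands in.
    reward = rewards[next_state]
    return reward, next_state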

Finally, we will choose some hyper-parameters and run the main loop, which learns the Q-values using the Q-learning algorithm (i.e., Bellman’s equation).

Algorithm 4. Main loop.
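A sketch of this main loop, assuming the discount factor of 0.9 mentioned below, a starting state in the middle of the grid, and an arbitrary number of episodes:

gamma = 0.9          # Discount factor.
num_episodes = 1000  # Number of training episodes.

for episode in range(num_episodes):
    state = 3  # Start each episode from the initial state s_i in the middle of the grid.
    while not terminal[state]:
        action = choose_action(state)
        reward, next_state = take_action(state, action)
        # Bellman's equation: immediate reward plus the discounted Q-value of the
        # best action in the next state (terminal states keep a value of 0).
        Q_values[state][action] = reward + gamma * max(Q_values[next_state])
        state = next_state

print("Q-values are:", Q_values)
for s in range(len(Q_values)):
    best = "left" if Q_values[s][0] >= Q_values[s][1] else "right"
    print("Best action for state", s, "is", best)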

At the end, we observe the learned Q-values and print the best action for each state. After executing the code, we see that the agent has learned exactly what we wanted: the optimal policy, which is to always move to the right. Note that the Q-values are discounted by a discount factor of 0.9.

Q-values are: [[0.0, 0.0], [-5, 3.28], [2.95, 3.64], [3.28, 4.05], [3.64, 4.5], [4.05, 5], [0.0, 0.0]] 
Best action for state 0 is left
Best action for state 1 is right
Best action for state 2 is right
Best action for state 3 is right
Best action for state 4 is right
Best action for state 5 is right
Best action for state 6 is left

If you wish to run the code yourself, you can do so in this link.
