Introduction to Reinforcement Learning. Part 1: Multi-Armed Bandit Problem


Markel Sanz Ausin
5 min read · Mar 22, 2020

Let’s talk about Reinforcement Learning (RL). This is an Artificial Intelligence (AI) technique in which an agent interacts with an environment, choosing one of the actions the environment makes available in each state, with the goal of collecting as much reward as possible as a result of those actions.

At first, the agent knows nothing about the environment, so it takes actions randomly. But if a certain action results in a positive reward, the agent must learn to select that action more frequently. On the other hand, if an action leads to a negative reward, the agent must learn to avoid that action in the future. This way, the agent will learn to choose the actions that maximize the sum of all the encountered rewards (also known as the return).

The Multi-Armed Bandit Problem

A slot machine with two arms to pull.
Image source

The multi-armed bandit problem can be seen as gambling on slot machines inside a casino. If we have N different slot machines, and each machine gives us a positive reward with some probability p (different for each machine) and a negative reward with probability (1 − p), can we build an agent that maximizes the collected reward by learning to play the machine with the highest success probability? The idea behind the multi-armed bandit problem is exactly the same: we have a bandit with N arms, each arm has a different probability of success (of producing the positive reward), and the goal is to create an agent that maximizes those rewards.

For this article, we will use an example with N = 5 arms (a 5-armed bandit). These are the success probabilities for each of the five available arms: [0.1, 0.3, 0.05, 0.55, 0.4]. As we can observe, the best action would be to pull the fourth arm every time. However, the agent does not have this information at its disposal, so it must pull each arm a few times to gather information and learn which one is the best. As it collects information, it will start to make better decisions and collect higher rewards over time.

ε-greedy policy

This is the policy that decides which actions the agent takes; it guides its behavior. ε-greedy is a policy in which the agent almost always takes the action it currently believes is best, given the information it possesses. However, every once in a while, with probability ε, the agent takes an action completely at random. This way, if the agent obtained a positive reward after its very first action, it will not get stuck selecting that action forever under the belief that it is the best; with probability ε it will explore other actions.

The value ε is a hyper-parameter that we need to choose, and it is the tool we use to balance the exploration vs. exploitation problem. Exploration consists of trying each available action a few times to see which one is best, even though we will not collect many positive rewards while doing so. Exploitation consists of maximizing the rewards by selecting the best-known action every time. Balancing exploration and exploitation is crucial: if we only explore two of the five available actions, we won’t know whether the other three bring better rewards; but if we spend all our time exploring the environment, we will never use that knowledge to select the best action and collect as much reward as possible.

The code

We will build an agent that solves this simple multi-armed bandit problem. You can run the code yourself, step by step, by following this link, or you can keep reading to see the code without running it.

Let’s begin. We are going to write a function that allows us to choose one of the five available arms, and returns a reward according to the probabilities mentioned before.

Algorithm 1. Function to pull one of the bandit arms.
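
The embedded listing from the original post is not reproduced here, so the following is a minimal sketch of what this function could look like. It assumes a reward of 1 when the arm pays out and 0 otherwise (an assumption that is consistent with the totals reported at the end of the article), and the names pull_bandit_arm and probs are illustrative.

import numpy as np

def pull_bandit_arm(probs, arm_index):
    # probs is the list of per-arm success probabilities.
    # Assumption: a successful pull returns a reward of 1, a failed pull returns 0.
    if np.random.random() < probs[arm_index]:
        return 1
    return 0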

Now, we will create the function that implements our policy, which decides which action the agent must take. With probability epsilon, the agent will take an action at random; otherwise, it will take the action it believes brings the highest average reward.

Algorithm 2. Function that implements the ε-greedy policy.
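
Again, the embedded listing is not shown here, so this is one possible sketch of the ε-greedy choice just described. The names choose_action and avg_rewards are illustrative; avg_rewards is assumed to hold the average reward observed so far for each arm.

import numpy as np

def choose_action(epsilon, avg_rewards):
    # Explore: with probability epsilon, pick one of the arms uniformly at random.
    if np.random.random() < epsilon:
        return np.random.randint(len(avg_rewards))
    # Exploit: otherwise, pick the arm with the highest average reward so far.
    return int(np.argmax(avg_rewards))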

Finally, let’s define the probabilities of the different arms as mentioned before, assign values to the hyper-parameter epsilon and to the number of iterations we want to run this algorithm for (the number of actions to take), and create three lists where we will keep track of the information collected after each action and the rewards collected for each arm.
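
The exact values used in the original notebook are not shown in the text (other than the arm probabilities and the 1000 episodes mentioned below), so the sketch below fills them in with plausible choices; ε = 0.1 is an assumption, and the three lists track per-arm pull counts, per-arm cumulative reward, and every reward received.

probs = [0.1, 0.3, 0.05, 0.55, 0.4]  # success probability of each arm
epsilon = 0.1                        # exploration rate (assumed value; try others)
num_iterations = 1000                # number of actions the agent will take

counts = [0] * len(probs)            # how many times each arm has been pulled
rewards_per_arm = [0] * len(probs)   # cumulative reward collected from each arm
all_rewards = []                     # every reward received, in order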

We run the algorithm by calling the functions, and we store the resulting rewards.

Algorithm 3. Runs the main loop and prints the total reward.
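
Here is a sketch of that main loop, reusing pull_bandit_arm and choose_action from the earlier sketches and the variables defined above; the printed lines mirror the output shown below, although the exact numbers will vary from run to run.

import numpy as np

for _ in range(num_iterations):
    # Current estimate of each arm's value: average reward observed so far.
    avg_rewards = [rewards_per_arm[i] / counts[i] if counts[i] > 0 else 0.0
                   for i in range(len(probs))]
    action = choose_action(epsilon, avg_rewards)
    reward = pull_bandit_arm(probs, action)
    # Record what happened.
    counts[action] += 1
    rewards_per_arm[action] += reward
    all_rewards.append(reward)

avg_rewards = [rewards_per_arm[i] / counts[i] if counts[i] > 0 else 0.0
               for i in range(len(probs))]
best = int(np.argmax(avg_rewards))
print("Average reward for bandits is", [round(r, 2) for r in avg_rewards])
print("Best bandit is", best, "with an average observed reward of", round(avg_rewards[best], 2))
print("Total observed reward in the", num_iterations, "episodes has been", sum(all_rewards))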

After running it, this will be the printed result:

Average reward for bandits is [0.19, 0.31, 0.04, 0.55, 0.40]
Best bandit is 3 with an average observed reward of 0.55
Total observed reward in the 1000 episodes has been 493

Note that the list with the average rewards produced by each arm looks very similar to the original probabilities we defined. This means that, with 1000 iterations, we have approximately recovered the success probability p of each arm, and the agent now knows that the fourth arm (index 3 in the Python list) is the best one. In the process, we have obtained 493 positive rewards. That is a pretty high value, considering that the best we could have expected is around 550 positive rewards, since the success probability of the fourth arm is 0.55 and we take 1000 actions. Feel free to change the value of epsilon and the number of iterations to see how they affect the total reward.

If you wish to run the code step by step, you can do it by following this link.
