On entering casinos, there would be multiple machines that could make us win more money and some machines that can make us bankrupt. How nice would it be if we knew the working of the machines so that we could leverage the maximum benefit out of it? Dont worry! The multi-arm bandit is one such machine from which we can get the maximum benefit. Instead of relying on random chance, we go for a systematic approach by simply pulling random levers. Let’s try to understand what it is and the different strategies to solve it. We would also implement these algorithms in Python.

- To understand the multi-arm bandit problem
- How reinforcement learning techniques can be used to solve the multi-arm bandit scenario?
- The greedy algorithm, epsilon algorithm, and UCB 1.
- The advantages and disadvantages of these algorithms.

This article was published as a part of the Data Science Blogathon.

Reinforcement Learning is a subset of machine learning. In reinforcement learning, an agent learns through interaction with its environment by taking actions and receiving rewards based on the outcomes. The objective of a reinforcement learning algorithm is to devise a policy that maximizes the cumulative reward over time while avoiding negative outcomes. We can see how reinforcement learning is transforming the entire world with ChatGPT, online chess games, etc. With its ability to learn through trial and error, this technique can help create more efficient and effective decision-making processes, making it an exciting area of research.

The multi-arm bandit (MAB) problem is a classic reinforcement learning problem that involves balancing exploration (trying out different actions to gather information) and exploitation (choosing the action with the highest estimated reward).

The problem is defined as follows:

- There is a gambler who can choose one of K’s arms at each time step.
- Each action has an unknown reward distribution with mean reward μ_i for arm i.
- The goal of the gambler is to maximize the total reward over T time steps by balancing exploration and exploitation.
- The gambler does not know the true mean rewards μ_i but can estimate them based on the rewards received from each arm.

The multi-arm bandit problem’s challenge is finding a strategy that balances exploration and exploitation effectively, as too much exploration may result in suboptimal rewards. At the same time, too much exploitation may prevent the gambler from discovering a better arm. The solution to this problem lies in developing algorithms that estimate the mean rewards of each arm and choose actions that efficiently balance exploration and exploitation.

There are mainly three simple algorithms to maximize our total reward

- Greedy Algorithm
- Epsilon Greedy
- UCB 1

Let us see each algorithm in detail and understand its Python code.

As the name suggests, this algorithm picks the available greedy arm. Now let’s see how to choose the greedy arm. Now we have n-arms in front of us. The greedy arm picks the arm which returns the best average reward till then. But we would have no idea about the distribution of the arms or the average reward of the arms. So, let’s pick each arm once so that we have some idea of how each arm returns the rewards. Then, we greedily choose the arm with a higher average reward than the previous time step. We will repeat this process infinite (a very large number) times to get the best arm.

I hope you would have got some idea of how this algorithm works. Now let’s see the python code to understand how this approach finds the best arm.

Let’s import the required libraries and define a class structure to solve this greedy approach.

**Python Code:**

`import numpy as np import matplotlib.pyplot as plt`

`class Greedy: def __init__(self,n_arm,reward_func): self.n_arm = n_arm # Number of arms self.arm_avg = np.zeros(n_arm) # The average arm reward self.arm_pick = np.ones(n_arm) # Number of times the arm has been picked self.reward_func = reward_func # The distribution of all arms self.its = 0 # Total number of iterations`

`# Initially we pick all the arms once def initialise(self): self.arm_avg = np.array([reward() for reward in self.reward_func]) def update(self,its): self.its += its for i in range(its): great_arm = self.best_arm() #Choose the best arm reward = self.reward_func[great_arm]() # Pull the lever num = (self.arm_avg[great_arm] * self.arm_pick[great_arm] + reward) denom = (self.arm_pick[great_arm] + 1.0) self.arm_avg[great_arm] = num / denom #Update the arm's average reward self.arm_pick[great_arm] += 1 # Increment the number of times the arm has been picked def best_arm(self): return np.argmax(self.arm_avg) # Function to compute best arm`

Let us consider six arms, with the following probability distributions for each arm where each index in the below array represents one arm.

```
functions = [
lambda : np.random.randn(),
lambda : np.random.randn()+2,
lambda : np.random.randn()+3,
lambda : np.random.randn()+4,
lambda : np.random.randn()+5,
lambda : np.random.randint(0,10)
]
```

Now let us create an object for the above class and find the best arm. Before that looking at the above arms, we can clearly see that the function second last arm, i.e, that arm with normal distribution with mean = 5 and std = 1, is the best arm. So, we expect our function to return 4 as the result as it is in the fourth index of the array. Let us verify if our algorithm gives us the same.

```
obj = Greedy(6,functions)
obj.initialise()
obj.update(100)
print(obj.best_arm())
```

The greedy approach may converge to a suboptimal solution early on, as it only chooses the arm with the highest estimated reward and does not explore other arms. This may result in missing out on better arms with higher expected rewards. The greedy approach does not explore other arms, meaning it may not have enough information to estimate its expected rewards accurately. This can lead to suboptimal decisions and reduced cumulative rewards over time. The greedy approach is sensitive to the initialization of the estimated rewards, which may lead to suboptimal decisions in the early stages.

In this approach, we introduce small randomness in the algorithm. We introduce this by adding a factor called epsilon. This algorithm is the same as the greedy algorithm, but we generate a random number. If the generated number is greater than epsilon, then we exploit the arm with the highest arm average until that time stamp. Else we explore a new arm(which could be the greedy arm).

I hope this small, precise explanation has helped you understand how the algorithm works. Now let us jump into the code to understand the nitty-gritty details of the algorithm further.

Let us first define the class.

```
class e_Greedy:
# Define the constructor
def __init__(self,n_arm,reward_func,eps=0.1):
self.n_arm = n_arm
self.arm_avg = np.zeros(n_arm)
self.arm_pick = np.zeros(n_arm)
self.eps = eps # Epsilon value
self.its = 0
self.reward_func = reward_func
self.rews = []
def update(self,its):
self.its += its
for i in range(its):
rand_eps = np.random.uniform(0,1,1)[0] # We generate a random numver
if rand_eps < self.eps:
arm = np.random.choice(self.n_arm,1)[0] # Choose a random arm
else:
arm = self.best_arm() # Choose the greedy arm
reward = self.reward_func[arm]()
num = (self.arm_avg[great_arm] * self.arm_pick[great_arm] + reward)
denom = (self.arm_pick[great_arm] + 1.0)
self.arm_avg[great_arm] = num / denom #Update the arm's average reward
self.arm_pick[arm] += 1
self.rews.append(reward)
def best_arm(self):
return np.argmax(self.arm_avg)
def plotter(self):
# A function to print the plot
rews = np.cumsum(self.rews).astype(float)
for i in range(len(rews)):
rews[i] = rews[i]/(i+1.0)
plt.plot(range(1,len(rews)+1),rews)
def get_arm_avg(self):
return self.arm_avg
```

Let’s create the object and verify if our output is right!

```
functions = [
lambda : np.random.randn(),
lambda : np.random.randn()+2,
lambda : np.random.randn()+3,
lambda : np.random.randn()+4,
lambda : np.random.randn()+5,
lambda : np.random.randint(0,10)
]
obj = e_Greedy(6,functions,0.1)
obj.update(1000)
print(obj.best_arm())
obj.plotter()
```

We can see that our cumulative average reward tends to 5, which is our expected answer.

UCB (Upper Confidence Bound) and epsilon-greedy are both algorithms used in multi-armed bandit problems, where the goal is to balance exploration (trying out different actions to gather information) and exploitation (choosing the action with the highest estimated reward).

UCB is often considered to be better than epsilon-greedy because it uses a more sophisticated exploration strategy. While epsilon-greedy explores randomly with a fixed probability (epsilon), UCB balances exploration and exploitation based on the uncertainty of each action’s estimated reward. It chooses actions with a high upper confidence bound, which considers both the estimated reward and the number of times it has been tried. This makes UCB more efficient regarding the number of trials needed to converge to the optimal action, especially when the number of arms is large or when there is a significant difference in the expected rewards.

We calculate the UCB and then select the arm based on the maximum value of the confidence bound.

Now let us dive into the code and understand how exactly the UCB works.

```
class UCB:
def __init__(self,n_arm,reward_func,c=1):
self.n_arm = n_arm
self.arm_avg = np.zeros(n_arm)
self.arm_pick = np.ones(n_arm)
self.c = c
self.its = 0
self.reward_func = reward_func
self.rews = []
def initialise(self):
self.arm_avg = np.array([i() for i in self.reward_func])
def best_arm(self):
# We see that the way we pick our arm has alone changed
return np.argmax(self.arm_avg + self.c * (np.array([np.log(self.its)])/self.arm_pick)**0.5)
def update(self,its):
self.its += its
for i in range(its):
arm = self.best_arm()
reward = self.reward_func[arm]()#np.random.randint(0,10,(1,))[0]
num = ( self.arm_avg[arm] * self.arm_pick[arm] + reward )
denom = (self.arm_pick[arm] + 1.0)
self.arm_avg[arm] = num / denom
self.arm_pick[arm] += 1
self.rews.append(reward)
def plotter(self):
rews = np.cumsum(self.rews).astype(float)
for i in range(len(rews)):
rews[i] = rews[i]/(i+1.0)
plt.plot(range(1,len(rews)+1),rews)
#import csv
```

Lets now initialize the object and see how this performs

```
obj = UCB(6,functions,2.5)
obj.initialise()
obj.update(1000)
obj.plotter()
```

We can observe that UCB has found the best arm earlier than the Epsilon-Greedy algorithm.

In this article, we understood the different ways to tackle the multi-arm bandit problem. We also understood the Object Oriented way of coding the multi-arm bandit problem. We also understood how each algorithm has a tradeoff between exploration and exploitation and understood their drawbacks. We understood how to maximize the average reward which we get at the end of the trials.

- A greedy algorithm seldom explores, as we greedily choose the arm with a higher average reward than the previous time step.
- Epsilon greedy algorithm explores with a probability UCB of Epsilon.
- UCB1 picks the less certain arm.

**The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.**

Lorem ipsum dolor sit amet, consectetur adipiscing elit,