The K-Armed Bandit Problem
The K-Armed bandit problem is a classic reinforcement learning scenario in which our bandit can choose from K different types of arm (weapon) to carry out a robbery. Each weapon in the set of K arms yields a reward drawn from its own unknown probability distribution every time the bandit commits a robbery. Our program needs to help the bandit agent carry out n bank robberies as optimally as possible to maximize the total reward.
We apply an epsilon-greedy strategy for our bandit: use the highest-value weapon in the k-armed arsenal a (1 - Epsilon) fraction of the time (the Greedy Move), but mix in Random Moves an Epsilon fraction of the time to discover potentially higher-value weapons. After each robbery we use the observed reward to update the Action Value of the chosen weapon, so a new weapon may become the Greedy Move for subsequent robberies.
We will use NumPy's np.argmax() function in our Python program to find the current highest-value weapon at each iteration.
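As a quick illustration (with hypothetical values), np.argmax() returns the index of the largest element; when several elements tie for the maximum, it returns the first such index:

```python
import numpy as np

values = np.array([0.2, 0.9, 0.9, 0.1])
best = np.argmax(values)  # ties resolve to the first occurrence
print(best)  # → 1
```

This first-index tie-breaking means that with all-zero initial values, a purely greedy agent would always start with arm 0.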
Python Program for the K-Armed Bandit Problem
import numpy as np

class EpsilonGreedy:
    def __init__(self, k_arms, epsilon):
        self.k_arms = k_arms
        self.epsilon = epsilon
        self.counts = np.zeros(k_arms)  # Number of times each arm has been pulled
        self.values = np.zeros(k_arms)  # Estimated value of each arm

    def select_arm(self):
        if np.random.rand() < self.epsilon:
            print("Selecting a random arm between 0 and k_arms - 1")
            return np.random.randint(0, self.k_arms)
        else:
            max_value = np.argmax(self.values)
            print("Selecting max-value arm", max_value)
            return max_value

    def update(self, chosen_arm, reward):
        self.counts[chosen_arm] += 1
        c = self.counts[chosen_arm]
        value = self.values[chosen_arm]
        # Incremental sample average: shift the old estimate toward the new reward
        updated_value = ((c - 1) / c) * value + (1 / c) * reward
        self.values[chosen_arm] = updated_value
        # print(chosen_arm, "has been selected", c, "times")
        # print("Current value for", chosen_arm, "is", updated_value)
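The update rule in update() is the standard incremental sample average, Q_new = ((c-1)/c) * Q_old + (1/c) * R, which is algebraically the same as recomputing the mean of all rewards seen for that arm. A quick sketch (with made-up rewards) confirming the two agree:

```python
import numpy as np

rewards = [1.0, 0.0, 2.0, 0.5]
value, count = 0.0, 0
for r in rewards:
    count += 1
    value = ((count - 1) / count) * value + (1 / count) * r

print(value, np.mean(rewards))  # both → 0.875
```

The incremental form is preferred because it needs only the running count and current estimate, not the full reward history.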
k_arms = 10    # Ten weapon options
epsilon = 0.1  # Pick a random weapon on 10% of trials
n_trials = 1000
rewards = np.random.randn(k_arms, n_trials)  # One pre-drawn reward per arm per trial

agent = EpsilonGreedy(k_arms, epsilon)
total_reward = 0
for t in range(n_trials):
    arm = agent.select_arm()
    reward = rewards[arm, t]
    agent.update(arm, reward)
    total_reward += reward
print("Total Reward", total_reward)
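One limitation of the demo above is that np.random.randn(k_arms, n_trials) gives every weapon the same expected reward (zero), so exploration cannot actually pay off. A minimal sketch, assuming each arm instead has its own hidden true mean (the hypothetical run_bandit helper below is not part of the program above), of how you might compare different Epsilon settings:

```python
import numpy as np

def run_bandit(epsilon, k_arms=10, n_trials=1000, seed=0):
    rng = np.random.default_rng(seed)
    q_true = rng.normal(0, 1, k_arms)        # hidden true value of each arm
    counts = np.zeros(k_arms)
    values = np.zeros(k_arms)
    total = 0.0
    for _ in range(n_trials):
        if rng.random() < epsilon:
            arm = int(rng.integers(k_arms))  # Random Move: explore
        else:
            arm = int(np.argmax(values))     # Greedy Move: exploit
        reward = q_true[arm] + rng.normal()  # noisy reward around the true mean
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
        total += reward
    return total

for eps in (0.0, 0.1, 0.5):
    print("epsilon =", eps, "total reward =", run_bandit(eps))
```

Typically a small amount of exploration (e.g. epsilon = 0.1) beats both pure greed (epsilon = 0.0, which can lock onto a mediocre weapon) and heavy exploration (epsilon = 0.5, which wastes too many robberies on random weapons), though any single run depends on the random seed.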