Multi-Armed Bandit


class genrl.agents.bandits.multiarmed.base.MABAgent(bandit: genrl.core.bandit.MultiArmedBandit)[source]

Bases: genrl.core.bandit.BanditAgent

Base Class for Contextual Bandit solving Policy

  • bandit (MultiArmedBandit type object) – The Bandit to solve
  • requires_init_run – Indicates whether initialisation of Q values is required

Get the history of actions taken for contexts

Returns:List of context, actions pairs
Return type:list

Get the number of times each action has been taken

Returns:Numpy array with count for each action
Return type:numpy.ndarray

Get the current regret

Returns:The current regret
Return type:float

Get the history of regrets incurred for each step

Returns:List of regrets
Return type:list

Get the history of rewards received for each step

Returns:List of rewards
Return type:list
select_action(context: int) → int[source]

Select an action

This method needs to be implemented in the specific policy.

Parameters:context (int) – the context to select action for
Returns:Selected action
Return type:int
update_params(context: int, action: int, reward: Union[int, float]) → None[source]

Update parameters for the policy

This method needs to be implemented in the specific policy.

  • context (int) – context for which action is taken
  • action (int) – action taken for the step
  • reward (int or float) – reward obtained for the step
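
The two methods above form the contract that every concrete policy fills in. The following is a minimal, standalone sketch of that contract. It does not import genrl, and the class name `UniformRandomPolicy` and its attributes are hypothetical stand-ins for illustration, not part of the library.

```python
import random


class UniformRandomPolicy:
    """Hypothetical stand-in for a policy; not a genrl class."""

    def __init__(self, n_arms, seed=0):
        self.n_arms = n_arms
        self.counts = [0] * n_arms   # times each action was taken
        self.reward_hist = []        # reward received at each step
        self._rng = random.Random(seed)

    def select_action(self, context):
        # A concrete policy decides here; this one ignores the context
        # and picks an arm uniformly at random.
        return self._rng.randrange(self.n_arms)

    def update_params(self, context, action, reward):
        # A concrete policy updates its estimates here.
        self.counts[action] += 1
        self.reward_hist.append(reward)


policy = UniformRandomPolicy(n_arms=3)
for _ in range(10):
    arm = policy.select_action(context=0)
    policy.update_params(context=0, action=arm, reward=1.0)
```

The loop at the end shows the intended call order: select an action for a context, observe a reward, then pass all three back to `update_params`.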

Bayesian Bandit

class genrl.agents.bandits.multiarmed.bayesian.BayesianUCBMABAgent(bandit: genrl.core.bandit.MultiArmedBandit, alpha: float = 1.0, beta: float = 1.0, confidence: float = 3.0)[source]

Bases: genrl.agents.bandits.multiarmed.base.MABAgent

Multi-Armed Bandit Solver with Bayesian Upper Confidence Bound based Action Selection Strategy.

Refer to Section 2.7 of Reinforcement Learning: An Introduction.

  • bandit (MultiArmedBandit type object) – The Bandit to solve
  • alpha (float) – alpha value for beta distribution
  • beta (float) – beta value for beta distribution
  • confidence (float) – Confidence level which controls degree of exploration

alpha parameter of beta distribution associated with the policy


beta parameter of beta distribution associated with the policy


Confidence level which weights the exploration term


Q values for all the actions for alpha, beta and confidence

select_action(context: int) → int[source]

Select an action according to Bayesian upper confidence bound

Takes the action that maximises a weighted sum of the Q values and a beta distribution parameterized by alpha and beta, weighted by the confidence level, for each action.

  • context (int) – the context to select action for
  • t (int) – timestep to choose action for

Returns:Selected action
Return type:int
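
As an illustration of the selection rule described above, here is a small sketch in plain Python. It assumes the score for each arm is the mean of its beta distribution plus the confidence level times the distribution's standard deviation; that exact form is an assumption, and the function name `bayesian_ucb_select` is hypothetical.

```python
import math


def bayesian_ucb_select(alpha, beta, confidence=3.0):
    """Pick the arm with the highest posterior mean plus weighted spread."""
    scores = []
    for a, b in zip(alpha, beta):
        mean = a / (a + b)                              # Beta mean
        var = (a * b) / ((a + b) ** 2 * (a + b + 1.0))  # Beta variance
        # Assumed scoring rule: mean + confidence * standard deviation
        scores.append(mean + confidence * math.sqrt(var))
    return max(range(len(scores)), key=scores.__getitem__)
```

With equal means, the arm with fewer observations has the wider distribution and is preferred, which is what drives exploration in this sketch.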


update_params(context: int, action: int, reward: float) → None[source]

Update parameters for the policy

Updates the regret as the difference between max Q value and that of the action. Updates the Q values according to the reward received in this step.

  • context (int) – context for which action is taken
  • action (int) – action taken for the step
  • reward (float) – reward obtained for the step

Bernoulli Bandit

class genrl.agents.bandits.multiarmed.bernoulli_mab.BernoulliMAB(bandits: int = 1, arms: int = 5, reward_probs: numpy.ndarray = None, context_type: str = 'tensor')[source]

Bases: genrl.core.bandit.MultiArmedBandit

Contextual Bandit with categorical context and Bernoulli reward distribution

  • bandits (int) – Number of bandits
  • arms (int) – Number of arms in each bandit
  • reward_probs (numpy.ndarray) – Probabilities of getting rewards
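
A Bernoulli reward can be sketched with the standard library alone. The indexing of `reward_probs` by bandit (context) and then arm is an assumption about the layout, and `bernoulli_reward` is a hypothetical helper, not a genrl function.

```python
import random


def bernoulli_reward(reward_probs, context, arm, rng=random):
    """Return 1 with probability reward_probs[context][arm], else 0."""
    return 1 if rng.random() < reward_probs[context][arm] else 0


probs = [[0.0, 1.0, 0.5]]  # one bandit (context) with three arms
```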

Epsilon Greedy

class genrl.agents.bandits.multiarmed.epsgreedy.EpsGreedyMABAgent(bandit: genrl.core.bandit.MultiArmedBandit, eps: float = 0.05)[source]

Bases: genrl.agents.bandits.multiarmed.base.MABAgent

Contextual Bandit Policy with Epsilon Greedy Action Selection Strategy.

Refer to Section 2.3 of Reinforcement Learning: An Introduction.

  • bandit (MultiArmedBandit type object) – The Bandit to solve
  • eps (float) – Probability with which a random action is to be selected.

Exploration constant


Q values assigned by the policy to all actions

select_action(context: int) → int[source]

Select an action according to epsilon greedy strategy

With probability eps, a random action is selected instead of the action that is optimal under the current Q values, to encourage exploration.

Parameters:context (int) – the context to select action for
Returns:Selected action
Return type:int
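
The selection rule can be sketched as follows. This is a standalone illustration with a hypothetical function name, `eps_greedy_select`; it operates on a plain list of Q values rather than on a genrl agent.

```python
import random


def eps_greedy_select(q_values, eps=0.05, rng=random):
    """Explore with probability eps, otherwise exploit the best estimate."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))  # explore: uniform random arm
    # exploit: index of the largest Q value
    return max(range(len(q_values)), key=q_values.__getitem__)
```

Setting `eps=0.0` gives a purely greedy policy, while `eps=1.0` gives a uniformly random one.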
update_params(context: int, action: int, reward: float) → None[source]

Update parameters for the policy

Updates the regret as the difference between max Q value and that of the action. Updates the Q values according to the reward received in this step.

  • context (int) – context for which action is taken
  • action (int) – action taken for the step
  • reward (float) – reward obtained for the step
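
A sketch of such an update, assuming the Q value is a running sample mean of rewards and that the regret is measured before the estimate changes. Both choices are assumptions, and `update_estimate` is a hypothetical name.

```python
def update_estimate(q_values, counts, action, reward):
    """Update one arm's running-mean estimate and return the step regret."""
    regret = max(q_values) - q_values[action]  # measured before the update
    counts[action] += 1
    # Incremental sample mean: Q <- Q + (reward - Q) / N
    q_values[action] += (reward - q_values[action]) / counts[action]
    return regret


q, n = [0.0, 0.5], [0, 1]
step_regret = update_estimate(q, n, action=0, reward=1.0)
```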


Gaussian Bandit

class genrl.agents.bandits.multiarmed.gaussian_mab.GaussianMAB(bandits: int = 10, arms: int = 5, reward_means: numpy.ndarray = None, context_type: str = 'tensor')[source]

Bases: genrl.core.bandit.MultiArmedBandit

Contextual Bandit with categorical context and Gaussian reward distribution

  • bandits (int) – Number of bandits
  • arms (int) – Number of arms in each bandit
  • reward_means (numpy.ndarray) – Mean of gaussian distribution for each reward
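
A Gaussian reward draw can be sketched in the same way. The unit standard deviation default and the indexing of `reward_means` by bandit (context) and then arm are assumptions, and `gaussian_reward` is a hypothetical helper.

```python
import random


def gaussian_reward(reward_means, context, arm, std=1.0, rng=random):
    """Draw a reward from a normal distribution centred on the arm's mean."""
    return rng.gauss(reward_means[context][arm], std)
```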


Gradient Bandit

class genrl.agents.bandits.multiarmed.gradient.GradientMABAgent(bandit: genrl.core.bandit.MultiArmedBandit, alpha: float = 0.1, temp: float = 0.01)[source]

Bases: genrl.agents.bandits.multiarmed.base.MABAgent

Multi-Armed Bandit Solver with Softmax Action Selection Strategy.

Refer to Section 2.8 of Reinforcement Learning: An Introduction.

  • bandit (MultiArmedBandit type object) – The Bandit to solve
  • alpha (float) – The step size parameter for gradient based update
  • temp (float) – Temperature for softmax distribution over Q values of actions

Step size parameter for gradient based update of policy


History of probability values assigned to each action for each timestep


Q values assigned by the policy to all actions

select_action(context: int) → int[source]

Select an action according to the softmax action selection strategy

Action is sampled from softmax distribution computed over the Q values for all actions

Parameters:context (int) – the context to select action for
Returns:Selected action
Return type:int
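
Softmax sampling over Q values can be sketched as below. Dividing the Q values by the temperature before exponentiating is an assumption about how `temp` is applied, and the function names are hypothetical.

```python
import math
import random


def softmax(q_values, temp=0.01):
    """Softmax over q_values / temp, shifted by the max for stability."""
    scaled = [q / temp for q in q_values]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_select(q_values, temp=0.01, rng=random):
    """Sample an arm index in proportion to its softmax probability."""
    probs = softmax(q_values, temp)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

A small temperature concentrates almost all probability on the highest Q value, while a large one moves the distribution towards uniform.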

Temperature for softmax distribution over Q values of actions

update_params(context: int, action: int, reward: float) → None[source]

Update parameters for the policy

Updates the regret as the difference between max Q value and that of the action. Updates the Q values through a gradient ascent step

  • context (int) – context for which action is taken
  • action (int) – action taken for the step
  • reward (float) – reward obtained for the step
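
For reference, the gradient bandit update from Section 2.8 of Reinforcement Learning: An Introduction can be sketched as follows. Whether genrl uses the same baseline is not stated here, so the `baseline` argument and the function name `gradient_update` are assumptions.

```python
def gradient_update(prefs, probs, action, reward, baseline, alpha=0.1):
    """One gradient-ascent step on the action preferences, in place."""
    advantage = reward - baseline
    for a in range(len(prefs)):
        indicator = 1.0 if a == action else 0.0
        # Raise the chosen action's preference and lower the others
        # when the reward beats the baseline (and the reverse otherwise).
        prefs[a] += alpha * advantage * (indicator - probs[a])
    return prefs
```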

Thompson Sampling

class genrl.agents.bandits.multiarmed.thompson.ThompsonSamplingMABAgent(bandit: genrl.core.bandit.MultiArmedBandit, alpha: float = 1.0, beta: float = 1.0)[source]

Bases: genrl.agents.bandits.multiarmed.base.MABAgent

Multi-Armed Bandit Solver with Thompson Sampling based Action Selection Strategy.

  • bandit (MultiArmedBandit type object) – The Bandit to solve
  • alpha (float) – alpha value for beta distribution
  • beta (float) – beta value for beta distribution

alpha parameter of beta distribution associated with the policy


beta parameter of beta distribution associated with the policy


Q values for all the actions for alpha and beta

select_action(context: int) → int[source]

Select an action according to Thompson Sampling

Samples are taken from beta distribution parameterized by alpha and beta for each action. The action with the highest sample is selected.

Parameters:context (int) – the context to select action for
Returns:Selected action
Return type:int
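
The sampling step can be sketched with `random.betavariate` from the standard library. The function name `thompson_select` is hypothetical, and the per-arm parameters are passed as plain lists.

```python
import random


def thompson_select(alpha, beta, rng=random):
    """Draw one Beta(alpha, beta) sample per arm and pick the largest."""
    samples = [rng.betavariate(a, b) for a, b in zip(alpha, beta)]
    return max(range(len(samples)), key=samples.__getitem__)
```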
update_params(context: int, action: int, reward: float) → None[source]

Update parameters for the policy

Updates the regret as the difference between max Q value and that of the action. Updates the alpha value of the beta distribution by adding the reward, while the beta value is updated by adding 1 - reward. Updates the count of the action taken.

  • context (int) – context for which action is taken
  • action (int) – action taken for the step
  • reward (float) – reward obtained for the step
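
The parameter update described above amounts to the following sketch, which assumes rewards lie in [0, 1]. The function name `thompson_update` is hypothetical.

```python
def thompson_update(alpha, beta, counts, action, reward):
    """Fold one reward in [0, 1] into the chosen arm's Beta parameters."""
    alpha[action] += reward        # evidence of success
    beta[action] += 1.0 - reward   # evidence of failure
    counts[action] += 1            # times this action was taken


a, b, n = [1.0, 1.0], [1.0, 1.0], [0, 0]
thompson_update(a, b, n, action=1, reward=1.0)
```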

Upper Confidence Bound

class genrl.agents.bandits.multiarmed.ucb.UCBMABAgent(bandit: genrl.core.bandit.MultiArmedBandit, confidence: float = 1.0)[source]

Bases: genrl.agents.bandits.multiarmed.base.MABAgent

Multi-Armed Bandit Solver with Upper Confidence Bound based Action Selection Strategy.

Refer to Section 2.7 of Reinforcement Learning: An Introduction.

  • bandit (MultiArmedBandit type object) – The Bandit to solve
  • confidence (float) – Confidence level which controls degree of exploration

Confidence level which weights the exploration term


Q values assigned by the policy to all actions

select_action(context: int) → int[source]

Select an action according to upper confidence bound action selection

Takes the action that maximises a weighted sum of the Q value for the action and an exploration encouragement term controlled by the confidence level.

Parameters:context (int) – the context to select action for
Returns:Selected action
Return type:int
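
The UCB rule from Section 2.7 of Reinforcement Learning: An Introduction can be sketched as below. The explicit timestep argument `t`, the choice to try every untried arm first, and the function name `ucb_select` are assumptions made for this illustration.

```python
import math


def ucb_select(q_values, counts, t, confidence=1.0):
    """Pick the arm maximising Q + confidence * sqrt(ln(t) / N)."""
    # Any arm never tried is selected first, so N > 0 below.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [
        q + confidence * math.sqrt(math.log(t) / n)
        for q, n in zip(q_values, counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

The exploration term shrinks as an arm's count grows, so rarely tried arms are revisited even when their current Q value is lower.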
update_params(context: int, action: int, reward: float) → None[source]

Update parameters for the policy

Updates the regret as the difference between max Q value and that of the action. Updates the Q values according to the reward received in this step.

  • context (int) – context for which action is taken
  • action (int) – action taken for the step
  • reward (float) – reward obtained for the step