Multi-Armed Bandit

Base

class genrl.agents.bandits.multiarmed.base.MABAgent(bandit: genrl.core.bandit.MultiArmedBandit)[source]

Bases: genrl.core.bandit.BanditAgent

Base class for contextual multi-armed bandit solving policies

Parameters:
  • bandit (MultiArmedBandit type object) – The Bandit to solve
  • requires_init_run – Indicates whether initialisation of Q values is required
action_hist

Get the history of actions taken for contexts

Returns:List of (context, action) pairs
Return type:list
counts

Get the number of times each action has been taken

Returns:Numpy array with count for each action
Return type:numpy.ndarray
regret

Get the current regret

Returns:The current regret
Return type:float
regret_hist

Get the history of regrets incurred for each step

Returns:List of regrets
Return type:list
reward_hist

Get the history of rewards received for each step

Returns:List of rewards
Return type:list
select_action(context: int) → int[source]

Select an action

This method needs to be implemented in the specific policy.

Parameters:context (int) – the context to select action for
Returns:Selected action
Return type:int
update_params(context: int, action: int, reward: Union[int, float]) → None[source]

Update parameters for the policy

This method needs to be implemented in the specific policy.

Parameters:
  • context (int) – context for which action is taken
  • action (int) – action taken for the step
  • reward (int or float) – reward obtained for the step
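
A typical interaction with any MABAgent subclass follows a select/step/update loop. The sketch below is illustrative only: it uses the EpsGreedyMABAgent and BernoulliMAB classes documented later in this section and assumes the bandit exposes reset() and step(action) methods returning a context and a (context, reward) pair respectively; those methods are not documented here.

    # Minimal sketch of the agent/bandit loop; the bandit's reset()/step() API is assumed.
    from genrl.agents.bandits.multiarmed.bernoulli_mab import BernoulliMAB
    from genrl.agents.bandits.multiarmed.epsgreedy import EpsGreedyMABAgent

    bandit = BernoulliMAB(bandits=1, arms=5)
    agent = EpsGreedyMABAgent(bandit, eps=0.05)

    context = bandit.reset()                           # assumed API
    for _ in range(1000):
        action = agent.select_action(context)          # documented above
        context, reward = bandit.step(action)          # assumed return order
        agent.update_params(context, action, reward)   # documented above

    print(agent.regret, agent.counts)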

Bayesian Bandit

class genrl.agents.bandits.multiarmed.bayesian.BayesianUCBMABAgent(bandit: genrl.core.bandit.MultiArmedBandit, alpha: float = 1.0, beta: float = 1.0, confidence: float = 3.0)[source]

Bases: genrl.agents.bandits.multiarmed.base.MABAgent

Multi-Armed Bandit Solver with Bayesian Upper Confidence Bound based Action Selection Strategy.

Refer to Section 2.7 of Reinforcement Learning: An Introduction.

Parameters:
  • bandit (MultiArmedBandit type object) – The Bandit to solve
  • alpha (float) – alpha value for the beta distribution
  • beta (float) – beta value for the beta distribution
  • confidence (float) – Confidence level which controls the degree of exploration
a

alpha parameter of beta distribution associated with the policy

Type:numpy.ndarray
b

beta parameter of beta distribution associated with the policy

Type:numpy.ndarray
confidence

Confidence level which weights the exploration term

Type:float
quality

Q values for all the actions, computed from alpha, beta and the confidence level

Type:numpy.ndarray
select_action(context: int) → int[source]

Select an action according to bayesian upper confidence bound

Takes the action that maximises a weighted sum of the Q value and a term derived from the beta distribution parameterized by alpha and beta, weighted by the confidence level, for each action

Parameters:context (int) – the context to select action for
Returns:Selected action
Return type:int

update_params(context: int, action: int, reward: float) → None[source]

Update parameters for the policy

Updates the regret as the difference between the max Q value and that of the action taken. Updates the Q values according to the reward received in this step.

Parameters:
  • context (int) – context for which action is taken
  • action (int) – action taken for the step
  • reward (float) – reward obtained for the step
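
As a rough illustration of the selection rule described above (a plain NumPy sketch of a Bayesian UCB rule, not GenRL's internal implementation), each action can be scored by its Q value plus a confidence-weighted uncertainty term taken from the Beta(alpha, beta) posterior:

    import numpy as np

    def bayesian_ucb_select(quality, a, b, confidence=3.0):
        # Posterior standard deviation of Beta(a, b) for each arm.
        std = np.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
        # Exploit high Q values while exploring arms with uncertain posteriors.
        return int(np.argmax(quality + confidence * std))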

Bernoulli Bandit

class genrl.agents.bandits.multiarmed.bernoulli_mab.BernoulliMAB(bandits: int = 1, arms: int = 5, reward_probs: numpy.ndarray = None, context_type: str = 'tensor')[source]

Bases: genrl.core.bandit.MultiArmedBandit

Contextual Bandit with categorical context and Bernoulli reward distribution

Parameters:
  • bandits (int) – Number of bandits
  • arms (int) – Number of arms in each bandit
  • reward_probs (numpy.ndarray) – Probabilities of getting rewards
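
A minimal construction example using the documented parameters; the shape assumed here for reward_probs, (bandits, arms), is an assumption and is not stated above:

    import numpy as np
    from genrl.agents.bandits.multiarmed.bernoulli_mab import BernoulliMAB

    # One bandit (context) with 3 arms; one success probability per arm (assumed shape).
    bandit = BernoulliMAB(bandits=1, arms=3, reward_probs=np.array([[0.2, 0.5, 0.8]]))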

Epsilon Greedy

class genrl.agents.bandits.multiarmed.epsgreedy.EpsGreedyMABAgent(bandit: genrl.core.bandit.MultiArmedBandit, eps: float = 0.05)[source]

Bases: genrl.agents.bandits.multiarmed.base.MABAgent

Contextual Bandit Policy with Epsilon Greedy Action Selection Strategy.

Refer to Section 2.3 of Reinforcement Learning: An Introduction.

Parameters:
  • bandit (MultiArmedBandit type object) – The Bandit to solve
  • eps (float) – Probability with which a random action is to be selected.
eps

Exploration constant

Type:float
quality

Q values assigned by the policy to all actions

Type:numpy.ndarray
select_action(context: int) → int[source]

Select an action according to the epsilon greedy strategy

With probability epsilon, a random action is selected instead of the action that is optimal according to the current Q values, in order to encourage exploration.

Parameters:context (int) – the context to select action for
Returns:Selected action
Return type:int
update_params(context: int, action: int, reward: float) → None[source]

Update parameters for the policy

Updates the regret as the difference between the max Q value and that of the action taken. Updates the Q values according to the reward received in this step.

Parameters:
  • context (int) – context for which action is taken
  • action (int) – action taken for the step
  • reward (float) – reward obtained for the step
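
The strategy above can be sketched in a few lines of NumPy. This is an illustration of epsilon-greedy selection with a sample-average Q update, not a reproduction of GenRL's internal code:

    import numpy as np

    rng = np.random.default_rng(0)

    def eps_greedy_select(quality, eps=0.05):
        # Explore a random arm with probability eps, otherwise act greedily.
        if rng.random() < eps:
            return int(rng.integers(len(quality)))
        return int(np.argmax(quality))

    def sample_average_update(quality, counts, action, reward):
        # Incremental mean: Q <- Q + (reward - Q) / n.
        counts[action] += 1
        quality[action] += (reward - quality[action]) / counts[action]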

Gaussian

class genrl.agents.bandits.multiarmed.gaussian_mab.GaussianMAB(bandits: int = 10, arms: int = 5, reward_means: numpy.ndarray = None, context_type: str = 'tensor')[source]

Bases: genrl.core.bandit.MultiArmedBandit

Contextual Bandit with categorical context and Gaussian reward distribution

Parameters:
  • bandits (int) – Number of bandits
  • arms (int) – Number of arms in each bandit
  • reward_means (numpy.ndarray) – Means of the Gaussian reward distributions
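
A minimal construction example using the documented parameters:

    from genrl.agents.bandits.multiarmed.gaussian_mab import GaussianMAB

    # 10 bandits (contexts), each with 5 arms giving Gaussian-distributed rewards.
    bandit = GaussianMAB(bandits=10, arms=5)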

Gradient

class genrl.agents.bandits.multiarmed.gradient.GradientMABAgent(bandit: genrl.core.bandit.MultiArmedBandit, alpha: float = 0.1, temp: float = 0.01)[source]

Bases: genrl.agents.bandits.multiarmed.base.MABAgent

Multi-Armed Bandit Solver with Softmax Action Selection Strategy.

Refer to Section 2.8 of Reinforcement Learning: An Introduction.

Parameters:
  • bandit (MultiArmedBandit type object) – The Bandit to solve
  • alpha (float) – The step size parameter for gradient based update
  • temp (float) – Temperature for softmax distribution over Q values of actions
alpha

Step size parameter for gradient based update of policy

Type:float
probability_hist

History of probability values assigned to each action for each timestep

Type:numpy.ndarray
quality

Q values assigned by the policy to all actions

Type:numpy.ndarray
select_action(context: int) → int[source]

Select an action according to the softmax action selection strategy

The action is sampled from a softmax distribution computed over the Q values of all actions

Parameters:context (int) – the context to select action for
Returns:Selected action
Return type:int
temp

Temperature for softmax distribution over Q values of actions

Type:float
update_params(context: int, action: int, reward: float) → None[source]

Update parameters for the policy

Updates the regret as the difference between the max Q value and that of the action taken. Updates the Q values through a gradient ascent step.

Parameters:
  • context (int) – context for which action is taken
  • action (int) – action taken for the step
  • reward (float) – reward obtained for the step
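
The softmax selection and gradient-style update described above correspond to the gradient bandit algorithm of Section 2.8 of Reinforcement Learning: An Introduction. The sketch below illustrates that algorithm in NumPy; GenRL's exact update (for example its choice of baseline) may differ:

    import numpy as np

    rng = np.random.default_rng(0)

    def softmax(prefs, temp=0.01):
        z = np.exp((prefs - prefs.max()) / temp)
        return z / z.sum()

    def softmax_select(prefs, temp=0.01):
        # Sample an action from the softmax distribution over preferences.
        probs = softmax(prefs, temp)
        return int(rng.choice(len(prefs), p=probs))

    def gradient_update(prefs, action, reward, baseline, alpha=0.1, temp=0.01):
        # Gradient ascent on expected reward: raise the chosen action's preference
        # when the reward beats the baseline, lower the others.
        probs = softmax(prefs, temp)
        prefs -= alpha * (reward - baseline) * probs
        prefs[action] += alpha * (reward - baseline)
        return prefs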

Thompson Sampling

class genrl.agents.bandits.multiarmed.thompson.ThompsonSamplingMABAgent(bandit: genrl.core.bandit.MultiArmedBandit, alpha: float = 1.0, beta: float = 1.0)[source]

Bases: genrl.agents.bandits.multiarmed.base.MABAgent

Multi-Armed Bandit Solver with Thompson Sampling based Action Selection Strategy.

Parameters:
  • bandit (MultiArmedBandit type object) – The Bandit to solve
  • alpha (float) – alpha value for the beta distribution
  • beta (float) – beta value for the beta distribution
a

alpha parameter of beta distribution associated with the policy

Type:numpy.ndarray
b

beta parameter of beta distribution associated with the policy

Type:numpy.ndarray
quality

Q values for all the actions, computed from alpha and beta

Type:numpy.ndarray
select_action(context: int) → int[source]

Select an action according to Thompson Sampling

Samples are taken from beta distribution parameterized by alpha and beta for each action. The action with the highest sample is selected.

Parameters:context (int) – the context to select action for
Returns:Selected action
Return type:int
update_params(context: int, action: int, reward: float) → None[source]

Update parameters for the policy

Updates the regret as the difference between the max Q value and that of the action taken. Updates the alpha value of the beta distribution by adding the reward, while the beta value is updated by adding 1 - reward. Updates the count of the action taken.

Parameters:
  • context (int) – context for which action is taken
  • action (int) – action taken for the step
  • reward (float) – reward obtained for the step
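
The rule described above can be sketched directly with NumPy's beta sampler. This is an illustration of Thompson Sampling with a Beta-Bernoulli posterior, not GenRL's internal code:

    import numpy as np

    rng = np.random.default_rng(0)

    def thompson_select(a, b):
        # Draw one posterior sample per arm and play the arm with the largest sample.
        return int(np.argmax(rng.beta(a, b)))

    def thompson_update(a, b, action, reward):
        # Conjugate Beta-Bernoulli update, as described above.
        a[action] += reward
        b[action] += 1 - reward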

Upper Confidence Bound

class genrl.agents.bandits.multiarmed.ucb.UCBMABAgent(bandit: genrl.core.bandit.MultiArmedBandit, confidence: float = 1.0)[source]

Bases: genrl.agents.bandits.multiarmed.base.MABAgent

Multi-Armed Bandit Solver with Upper Confidence Bound based Action Selection Strategy.

Refer to Section 2.7 of Reinforcement Learning: An Introduction.

Parameters:
  • bandit (MultiArmedBandit type object) – The Bandit to solve
  • confidence (float) – Confidence level which controls the degree of exploration
confidence

Confidence level which weights the exploration term

Type:float
quality

Q values assigned by the policy to all actions

Type:numpy.ndarray
select_action(context: int) → int[source]

Select an action according to upper confidence bound action selection

Takes the action that maximises a weighted sum of the Q value for the action and an exploration bonus term controlled by the confidence level.

Parameters:context (int) – the context to select action for
Returns:Selected action
Return type:int
update_params(context: int, action: int, reward: float) → None[source]

Update parameters for the policy

Updates the regret as the difference between the max Q value and that of the action taken. Updates the Q values according to the reward received in this step.

Parameters:
  • context (int) – context for which action is taken
  • action (int) – action taken for the step
  • reward (float) – reward obtained for the step
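
The selection rule described above matches the familiar UCB form of Q value plus an exploration bonus. A rough NumPy sketch (illustrative of a UCB1-style rule, not GenRL's exact formula):

    import numpy as np

    def ucb_select(quality, counts, t, confidence=1.0):
        # Try every arm once before applying the bound.
        if np.any(counts == 0):
            return int(np.argmax(counts == 0))
        # Q value plus a confidence-weighted exploration bonus that shrinks
        # as an arm is pulled more often.
        bonus = confidence * np.sqrt(np.log(t) / counts)
        return int(np.argmax(quality + bonus))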