MultiArmed Bandit¶
Base¶

class
genrl.agents.bandits.multiarmed.base.
MABAgent
(bandit: genrl.core.bandit.MultiArmedBandit)[source]¶ Bases:
genrl.core.bandit.BanditAgent
Base class for a multi-armed bandit solving policy
Parameters:  bandit (MultiArmedBandit type object) – The bandit to solve
 requires_init_run (bool) – Indicates whether initialisation of Q values is required

action_hist
¶ Get the history of actions taken for contexts
Returns: List of (context, action) pairs Return type: list

counts
¶ Get the number of times each action has been taken
Returns: Numpy array with count for each action Return type: numpy.ndarray

regret
¶ Get the current regret
Returns: The current regret Return type: float

regret_hist
¶ Get the history of regrets incurred for each step
Returns: List of regrets Return type: list

reward_hist
¶ Get the history of rewards received for each step
Returns: List of rewards Return type: list

select_action
(context: int) → int[source]¶ Select an action
This method needs to be implemented in the specific policy.
Parameters: context (int) – the context to select action for Returns: Selected action Return type: int

update_params
(context: int, action: int, reward: Union[int, float]) → None[source]¶ Update parameters for the policy
This method needs to be implemented in the specific policy.
Parameters:  context (int) – context for which action is taken
 action (int) – action taken for the step
 reward (int or float) – reward obtained for the step
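As a minimal illustration of this interface (a plain NumPy sketch, not the genrl implementation — the `GreedyAgent` class and its internals here are hypothetical), a simple greedy policy implementing `select_action` and `update_params` could look like:

```python
import numpy as np

class GreedyAgent:
    """Minimal agent following the MABAgent-style interface:
    select_action(context) -> action, update_params(context, action, reward)."""

    def __init__(self, bandits: int, arms: int):
        self.quality = np.zeros((bandits, arms))  # Q value estimates per context/arm
        self.counts = np.zeros((bandits, arms))   # times each action was taken

    def select_action(self, context: int) -> int:
        # Greedy: pick the arm with the highest current Q value
        return int(np.argmax(self.quality[context]))

    def update_params(self, context: int, action: int, reward: float) -> None:
        # Incremental-mean update of the Q value for the taken action
        self.counts[context, action] += 1
        n = self.counts[context, action]
        self.quality[context, action] += (reward - self.quality[context, action]) / n

agent = GreedyAgent(bandits=1, arms=3)
agent.update_params(0, 2, 1.0)
agent.update_params(0, 2, 0.0)
print(agent.quality[0, 2])  # running mean of rewards for arm 2 -> 0.5
```

The subclasses below replace the greedy `select_action` with exploration-aware strategies while keeping this same interface.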
Bayesian Bandit¶

class
genrl.agents.bandits.multiarmed.bayesian.
BayesianUCBMABAgent
(bandit: genrl.core.bandit.MultiArmedBandit, alpha: float = 1.0, beta: float = 1.0, confidence: float = 3.0)[source]¶ Bases:
genrl.agents.bandits.multiarmed.base.MABAgent
MultiArmed Bandit Solver with Bayesian Upper Confidence Bound based Action Selection Strategy.
Refer to Section 2.7 of Reinforcement Learning: An Introduction.
Parameters:  bandit (MultiArmedBandit type object) – The Bandit to solve
 alpha (float) – alpha value for the beta distribution
 beta (float) – beta value for the beta distribution
 confidence (float) – Confidence level which controls the degree of exploration

a
¶ alpha parameter of beta distribution associated with the policy
Type: numpy.ndarray

b
¶ beta parameter of beta distribution associated with the policy
Type: numpy.ndarray

confidence
¶ Confidence level which weights the exploration term
Type: float

quality
Q values for all the actions, computed from alpha, beta and the confidence level
Type: numpy.ndarray

select_action
(context: int) → int[source]¶ Select an action according to Bayesian upper confidence bound
Takes the action that maximises the sum of the Q value and an exploration term derived from a beta distribution parameterized by alpha and beta, weighted by the confidence level, for each action
Parameters: context (int) – the context to select action for
Returns: Selected action
Return type: int

update_params
(context: int, action: int, reward: float) → None[source]¶ Update parameters for the policy
Updates the regret as the difference between the max Q value and that of the action. Updates the Q values according to the reward received in this step
Parameters:  context (int) – context for which action is taken
 action (int) – action taken for the step
 reward (float) – reward obtained for the step
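The selection rule can be sketched as follows (an illustrative NumPy snippet, not the genrl code; it assumes the exploration term is the posterior standard deviation of each arm's Beta distribution, scaled by the confidence level):

```python
import numpy as np

def bayesian_ucb_action(quality, a, b, confidence):
    """Pick the arm maximising Q + confidence * posterior std deviation.

    a, b are per-arm Beta posterior parameters; the std of Beta(a, b) is
    sqrt(a*b / ((a+b)**2 * (a+b+1))).
    """
    std = np.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return int(np.argmax(quality + confidence * std))

# Arm 0 is well explored, arm 1 is uncertain, so the bonus favours arm 1
quality = np.array([0.5, 0.45])
a = np.array([50.0, 2.0])
b = np.array([50.0, 2.0])
print(bayesian_ucb_action(quality, a, b, confidence=3.0))  # -> 1
```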
Bernoulli Bandit¶

class
genrl.agents.bandits.multiarmed.bernoulli_mab.
BernoulliMAB
(bandits: int = 1, arms: int = 5, reward_probs: numpy.ndarray = None, context_type: str = 'tensor')[source]¶ Bases:
genrl.core.bandit.MultiArmedBandit
Contextual Bandit with categorical context and Bernoulli reward distribution
Parameters:  bandits (int) – Number of bandits
 arms (int) – Number of arms in each bandit
 reward_probs (numpy.ndarray) – Probabilities of getting rewards
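The reward mechanism amounts to a coin flip per arm (a self-contained NumPy sketch of the idea, not the genrl implementation; `bernoulli_step` is an illustrative name):

```python
import numpy as np

rng = np.random.default_rng(0)

def bernoulli_step(reward_probs, bandit: int, arm: int) -> int:
    """Draw a 0/1 reward: 1 with probability reward_probs[bandit, arm]."""
    return int(rng.random() < reward_probs[bandit, arm])

reward_probs = np.array([[0.1, 0.9]])  # one bandit, two arms
rewards = [bernoulli_step(reward_probs, 0, 1) for _ in range(1000)]
print(sum(rewards) / 1000)  # empirical mean, close to 0.9
```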
Epsilon Greedy¶

class
genrl.agents.bandits.multiarmed.epsgreedy.
EpsGreedyMABAgent
(bandit: genrl.core.bandit.MultiArmedBandit, eps: float = 0.05)[source]¶ Bases:
genrl.agents.bandits.multiarmed.base.MABAgent
Contextual Bandit Policy with Epsilon Greedy Action Selection Strategy.
Refer to Section 2.3 of Reinforcement Learning: An Introduction.
Parameters:  bandit (MultiArmedBandit type object) – The Bandit to solve
 eps (float) – Probability with which a random action is to be selected.

eps
¶ Exploration constant
Type: float

quality
¶ Q values assigned by the policy to all actions
Type: numpy.ndarray

select_action
(context: int) → int[source]¶ Select an action according to the epsilon greedy strategy
A random action is selected with probability epsilon instead of the optimal action according to the current Q values, to encourage exploration by the policy.
Parameters: context (int) – the context to select action for Returns: Selected action Return type: int

update_params
(context: int, action: int, reward: float) → None[source]¶ Update parameters for the policy
Updates the regret as the difference between the max Q value and that of the action. Updates the Q values according to the reward received in this step.
Parameters:  context (int) – context for which action is taken
 action (int) – action taken for the step
 reward (float) – reward obtained for the step
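The strategy is small enough to sketch in full (illustrative NumPy code, not the genrl implementation):

```python
import numpy as np

rng = np.random.default_rng(42)

def eps_greedy_action(quality, eps: float) -> int:
    """With probability eps pick a uniformly random arm; otherwise greedy."""
    if rng.random() < eps:
        return int(rng.integers(len(quality)))
    return int(np.argmax(quality))

quality = np.array([0.2, 0.8, 0.5])
actions = [eps_greedy_action(quality, eps=0.05) for _ in range(1000)]
print(actions.count(1) / 1000)  # mostly arm 1, the current best estimate
```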
Gaussian¶

class
genrl.agents.bandits.multiarmed.gaussian_mab.
GaussianMAB
(bandits: int = 10, arms: int = 5, reward_means: numpy.ndarray = None, context_type: str = 'tensor')[source]¶ Bases:
genrl.core.bandit.MultiArmedBandit
Contextual Bandit with categorical context and Gaussian reward distribution
Parameters:  bandits (int) – Number of bandits
 arms (int) – Number of arms in each bandit
 reward_means (numpy.ndarray) – Mean of the Gaussian reward distribution for each arm
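Rewards here are drawn from a normal distribution per arm (an illustrative NumPy sketch, not the genrl code; unit variance is an assumption):

```python
import numpy as np

rng = np.random.default_rng(1)

def gaussian_reward(reward_means, bandit: int, arm: int) -> float:
    """Draw a reward from N(reward_means[bandit, arm], 1)."""
    return float(rng.normal(reward_means[bandit, arm], 1.0))

reward_means = np.array([[0.0, 2.0]])  # one bandit, two arms
samples = [gaussian_reward(reward_means, 0, 1) for _ in range(2000)]
print(np.mean(samples))  # empirical mean, close to 2.0
```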
Gradient¶

class
genrl.agents.bandits.multiarmed.gradient.
GradientMABAgent
(bandit: genrl.core.bandit.MultiArmedBandit, alpha: float = 0.1, temp: float = 0.01)[source]¶ Bases:
genrl.agents.bandits.multiarmed.base.MABAgent
MultiArmed Bandit Solver with Softmax Action Selection Strategy.
Refer to Section 2.8 of Reinforcement Learning: An Introduction.
Parameters:  bandit (MultiArmedBandit type object) – The Bandit to solve
 alpha (float) – The step size parameter for gradient based update
 temp (float) – Temperature for softmax distribution over Q values of actions

alpha
¶ Step size parameter for gradient based update of policy
Type: float

probability_hist
History of probability values assigned to each action for each timestep
Type: numpy.ndarray

quality
¶ Q values assigned by the policy to all actions
Type: numpy.ndarray

select_action
(context: int) → int[source]¶ Select an action according to the softmax action selection strategy
Action is sampled from softmax distribution computed over the Q values for all actions
Parameters: context (int) – the context to select action for Returns: Selected action Return type: int

temp
¶ Temperature for softmax distribution over Q values of actions
Type: float

update_params
(context: int, action: int, reward: float) → None[source]¶ Update parameters for the policy
Updates the regret as the difference between the max Q value and that of the action. Updates the Q values through a gradient ascent step
Parameters:  context (int) – context for which action is taken
 action (int) – action taken for the step
 reward (float) – reward obtained for the step
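One gradient-bandit step in the style of Section 2.8 of Reinforcement Learning: An Introduction can be sketched as follows (illustrative NumPy code, not the genrl implementation; the use of a reward baseline and the exact update form here are assumptions from the textbook algorithm):

```python
import numpy as np

def softmax(x, temp=1.0):
    z = (x - x.max()) / temp   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def gradient_update(prefs, action, reward, baseline, alpha):
    """One gradient-bandit step: raise the preference of the taken action
    in proportion to (reward - baseline) and lower the others."""
    probs = softmax(prefs)
    onehot = np.eye(len(prefs))[action]
    return prefs + alpha * (reward - baseline) * (onehot - probs)

prefs = np.zeros(3)
# A reward of 1.0 above a 0.0 baseline for arm 2 raises its preference
prefs = gradient_update(prefs, action=2, reward=1.0, baseline=0.0, alpha=0.1)
probs = softmax(prefs)
print(probs)  # arm 2 now has the highest selection probability
```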
Thompson Sampling¶

class
genrl.agents.bandits.multiarmed.thompson.
ThompsonSamplingMABAgent
(bandit: genrl.core.bandit.MultiArmedBandit, alpha: float = 1.0, beta: float = 1.0)[source]¶ Bases:
genrl.agents.bandits.multiarmed.base.MABAgent
MultiArmed Bandit Solver with Thompson Sampling based Action Selection Strategy.
Parameters:  bandit (MultiArmedBandit type object) – The Bandit to solve
 alpha (float) – alpha value for the beta distribution
 beta (float) – beta value for the beta distribution

a
¶ alpha parameter of beta distribution associated with the policy
Type: numpy.ndarray

b
¶ beta parameter of beta distribution associated with the policy
Type: numpy.ndarray

quality
Q values for all the actions, computed from alpha and beta
Type: numpy.ndarray

select_action
(context: int) → int[source]¶ Select an action according to Thompson Sampling
Samples are taken from beta distribution parameterized by alpha and beta for each action. The action with the highest sample is selected.
Parameters: context (int) – the context to select action for Returns: Selected action Return type: int

update_params
(context: int, action: int, reward: float) → None[source]¶ Update parameters for the policy
Updates the regret as the difference between the max Q value and that of the action. Updates the alpha value of the beta distribution by adding the reward, while the beta value is updated by adding 1 − reward. Updates the count of the action taken.
Parameters:  context (int) – context for which action is taken
 action (int) – action taken for the step
 reward (float) – reward obtained for the step
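For Bernoulli rewards, the sample-and-update loop can be sketched as follows (illustrative NumPy code, not the genrl implementation; function names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def thompson_action(a, b):
    """Sample from each arm's Beta(a, b) posterior and pick the best sample."""
    return int(np.argmax(rng.beta(a, b)))

def thompson_update(a, b, action, reward):
    """Bernoulli-reward update: success raises alpha, failure raises beta."""
    a[action] += reward
    b[action] += 1 - reward
    return a, b

a = np.ones(2)  # uniform Beta(1, 1) priors
b = np.ones(2)
for _ in range(500):
    act = thompson_action(a, b)
    reward = int(rng.random() < [0.2, 0.8][act])  # arm 1 pays off more often
    a, b = thompson_update(a, b, act, reward)
print(a / (a + b))  # posterior means; arm 1's should be near 0.8
```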
Upper Confidence Bound¶

class
genrl.agents.bandits.multiarmed.ucb.
UCBMABAgent
(bandit: genrl.core.bandit.MultiArmedBandit, confidence: float = 1.0)[source]¶ Bases:
genrl.agents.bandits.multiarmed.base.MABAgent
MultiArmed Bandit Solver with Upper Confidence Bound based Action Selection Strategy.
Refer to Section 2.7 of Reinforcement Learning: An Introduction.
Parameters:  bandit (MultiArmedBandit type object) – The Bandit to solve
 confidence (float) – Confidence level which controls the degree of exploration

confidence
¶ Confidence level which weights the exploration term
Type: float

quality
Q values assigned by the policy to all actions
Type: numpy.ndarray

select_action
(context: int) → int[source]¶ Select an action according to upper confidence bound action selection
Takes the action that maximises the sum of the Q value for the action and an exploration encouragement term controlled by the confidence level.
Parameters: context (int) – the context to select action for Returns: Selected action Return type: int

update_params
(context: int, action: int, reward: float) → None[source]¶ Update parameters for the policy
Updates the regret as the difference between the max Q value and that of the action. Updates the Q values according to the reward received in this step.
Parameters:  context (int) – context for which action is taken
 action (int) – action taken for the step
 reward (float) – reward obtained for the step
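The selection rule can be sketched with the standard UCB1 bonus (illustrative NumPy code, not the genrl implementation; the sqrt(ln t / n) bonus form is an assumption from the textbook algorithm):

```python
import numpy as np

def ucb_action(quality, counts, t, confidence):
    """Pick the arm maximising Q + c * sqrt(ln t / n); untried arms first."""
    if np.any(counts == 0):
        return int(np.argmin(counts))  # try each arm at least once
    bonus = confidence * np.sqrt(np.log(t) / counts)
    return int(np.argmax(quality + bonus))

quality = np.array([0.5, 0.45])
counts = np.array([90.0, 10.0])
# The rarely tried arm 1 gets the larger exploration bonus
print(ucb_action(quality, counts, t=100, confidence=1.0))  # -> 1
```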