Multi-Armed Bandit
Base
class genrl.agents.bandits.multiarmed.base.MABAgent(bandit: genrl.core.bandit.MultiArmedBandit)
    Bases: genrl.core.bandit.BanditAgent
    Base class for multi-armed bandit solving policies.
    Parameters: - bandit (MultiArmedBandit type object) – The bandit to solve
    - requires_init_run – Indicates whether initialisation of Q values is required
    action_hist
        Get the history of actions taken for contexts.
        Returns: List of (context, action) pairs
        Return type: list
    counts
        Get the number of times each action has been taken.
        Returns: Numpy array with the count for each action
        Return type: numpy.ndarray
    regret
        Get the current regret.
        Returns: The current regret
        Return type: float
    regret_hist
        Get the history of regrets incurred at each step.
        Returns: List of regrets
        Return type: list
    reward_hist
        Get the history of rewards received at each step.
        Returns: List of rewards
        Return type: list
    select_action(context: int) → int
        Select an action. This method needs to be implemented in the specific policy.
        Parameters: context (int) – the context to select an action for
        Returns: Selected action
        Return type: int
    update_params(context: int, action: int, reward: Union[int, float]) → None
        Update parameters for the policy. This method needs to be implemented in the specific policy.
        Parameters: - context (int) – context for which the action is taken
        - action (int) – action taken for the step
        - reward (int or float) – reward obtained for the step
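Since select_action and update_params are abstract, a concrete policy subclasses MABAgent and implements both. A minimal sketch of a purely greedy policy follows; it assumes the bandit object exposes bandits and arms counts, which is an assumption (only the constructor signature is shown above):

    import numpy as np
    from genrl.agents.bandits.multiarmed.base import MABAgent

    class GreedyMABAgent(MABAgent):
        """Illustrative policy: always plays the arm with the highest mean reward."""

        def __init__(self, bandit):
            super().__init__(bandit)
            # Per-context running mean reward and pull count for each arm
            # (assumes the bandit exposes `bandits` and `arms` attributes)
            self._q = np.zeros((bandit.bandits, bandit.arms))
            self._n = np.zeros((bandit.bandits, bandit.arms))

        def select_action(self, context: int) -> int:
            return int(np.argmax(self._q[context]))

        def update_params(self, context: int, action: int, reward: float) -> None:
            self._n[context, action] += 1
            # Incremental mean update: Q <- Q + (r - Q) / n
            self._q[context, action] += (reward - self._q[context, action]) / self._n[context, action]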
Bayesian Bandit
class genrl.agents.bandits.multiarmed.bayesian.BayesianUCBMABAgent(bandit: genrl.core.bandit.MultiArmedBandit, alpha: float = 1.0, beta: float = 1.0, confidence: float = 3.0)
    Bases: genrl.agents.bandits.multiarmed.base.MABAgent
    Multi-Armed Bandit Solver with Bayesian Upper Confidence Bound based Action Selection Strategy.
    Refer to Section 2.7 of Reinforcement Learning: An Introduction.
    Parameters: - bandit (MultiArmedBandit type object) – The bandit to solve
    - alpha (float) – alpha value for the beta distribution
    - beta (float) – beta value for the beta distribution
    - confidence (float) – Confidence level which controls the degree of exploration
    a
        alpha parameter of the beta distribution associated with the policy
        Type: numpy.ndarray
    b
        beta parameter of the beta distribution associated with the policy
        Type: numpy.ndarray
    confidence
        Confidence level which weights the exploration term
        Type: float
    quality
        Q values for all the actions, computed from alpha, beta and confidence
        Type: numpy.ndarray
    select_action(context: int) → int
        Select an action according to the Bayesian upper confidence bound.
        Takes the action that maximises a weighted sum of the Q values and an uncertainty term from the beta distribution parameterized by alpha and beta, weighted by the confidence level.
        Parameters: context (int) – the context to select an action for
        Returns: Selected action
        Return type: int
    update_params(context: int, action: int, reward: float) → None
        Update parameters for the policy.
        Updates the regret as the difference between the max Q value and that of the action taken, and updates the Q values according to the reward received in this step.
        Parameters: - context (int) – context for which the action is taken
        - action (int) – action taken for the step
        - reward (float) – reward obtained for the step
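The selection rule can be illustrated standalone with NumPy: score each arm by the posterior mean of a Beta(a, b) distribution plus a confidence-weighted posterior standard deviation. This is a generic Bayesian UCB sketch with hypothetical numbers, not necessarily the exact expression the library uses internally:

    import numpy as np

    # Hypothetical per-arm Beta posterior parameters after a few Bernoulli pulls
    a = np.array([4.0, 2.0, 7.0])  # prior alpha + observed successes
    b = np.array([3.0, 6.0, 2.0])  # prior beta + observed failures
    confidence = 3.0

    mean = a / (a + b)                                   # posterior mean reward
    std = np.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))  # posterior standard deviation
    action = int(np.argmax(mean + confidence * std))     # Bayesian UCB choice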
Bernoulli Bandit
class genrl.agents.bandits.multiarmed.bernoulli_mab.BernoulliMAB(bandits: int = 1, arms: int = 5, reward_probs: numpy.ndarray = None, context_type: str = 'tensor')
    Bases: genrl.core.bandit.MultiArmedBandit
    Contextual bandit with categorical context and Bernoulli reward distribution.
    Parameters: - bandits (int) – Number of bandits
    - arms (int) – Number of arms in each bandit
    - reward_probs (numpy.ndarray) – Probabilities of getting rewards
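Putting the pieces together using only the signatures documented above (the step that obtains a reward is elided, since the bandit's interaction API is not shown here):

    from genrl.agents.bandits.multiarmed.bayesian import BayesianUCBMABAgent
    from genrl.agents.bandits.multiarmed.bernoulli_mab import BernoulliMAB

    bandit = BernoulliMAB(bandits=1, arms=5)  # default Bernoulli reward probabilities
    agent = BayesianUCBMABAgent(bandit, alpha=1.0, beta=1.0, confidence=3.0)

    action = agent.select_action(context=0)   # pick an arm for context 0
    # ... obtain a reward for `action` from the bandit, then:
    agent.update_params(context=0, action=action, reward=1)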
Epsilon Greedy
class genrl.agents.bandits.multiarmed.epsgreedy.EpsGreedyMABAgent(bandit: genrl.core.bandit.MultiArmedBandit, eps: float = 0.05)
    Bases: genrl.agents.bandits.multiarmed.base.MABAgent
    Contextual bandit policy with an epsilon greedy action selection strategy.
    Refer to Section 2.3 of Reinforcement Learning: An Introduction.
    Parameters: - bandit (MultiArmedBandit type object) – The bandit to solve
    - eps (float) – Probability with which a random action is selected
    eps
        Exploration constant
        Type: float
    quality
        Q values assigned by the policy to all actions
        Type: numpy.ndarray
    select_action(context: int) → int
        Select an action according to the epsilon greedy strategy.
        A random action is selected with probability epsilon instead of the action that is optimal under the current Q values, to encourage exploration.
        Parameters: context (int) – the context to select an action for
        Returns: Selected action
        Return type: int
    update_params(context: int, action: int, reward: float) → None
        Update parameters for the policy.
        Updates the regret as the difference between the max Q value and that of the action taken, and updates the Q values according to the reward received in this step.
        Parameters: - context (int) – context for which the action is taken
        - action (int) – action taken for the step
        - reward (float) – reward obtained for the step
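The selection rule itself is a few lines of NumPy; a standalone sketch of epsilon greedy selection over an array of Q values:

    import numpy as np

    def eps_greedy(quality: np.ndarray, eps: float = 0.05) -> int:
        """With probability eps pick a uniformly random arm, else the greedy arm."""
        if np.random.random() < eps:
            return int(np.random.randint(len(quality)))
        return int(np.argmax(quality))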
Gaussian
class genrl.agents.bandits.multiarmed.gaussian_mab.GaussianMAB(bandits: int = 10, arms: int = 5, reward_means: numpy.ndarray = None, context_type: str = 'tensor')
    Bases: genrl.core.bandit.MultiArmedBandit
    Contextual bandit with categorical context and Gaussian reward distribution.
    Parameters: - bandits (int) – Number of bandits
    - arms (int) – Number of arms in each bandit
    - reward_means (numpy.ndarray) – Mean of the Gaussian reward distribution for each arm
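Construction mirrors BernoulliMAB; when reward_means is left as None, the arm means are presumably generated internally (an assumption, since only the signature is shown above):

    from genrl.agents.bandits.multiarmed.epsgreedy import EpsGreedyMABAgent
    from genrl.agents.bandits.multiarmed.gaussian_mab import GaussianMAB

    # Ten 5-armed bandits with internally generated Gaussian reward means
    bandit = GaussianMAB(bandits=10, arms=5)
    agent = EpsGreedyMABAgent(bandit, eps=0.05)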
Gradient
class genrl.agents.bandits.multiarmed.gradient.GradientMABAgent(bandit: genrl.core.bandit.MultiArmedBandit, alpha: float = 0.1, temp: float = 0.01)
    Bases: genrl.agents.bandits.multiarmed.base.MABAgent
    Multi-Armed Bandit Solver with Softmax Action Selection Strategy.
    Refer to Section 2.8 of Reinforcement Learning: An Introduction.
    Parameters: - bandit (MultiArmedBandit type object) – The bandit to solve
    - alpha (float) – The step size parameter for the gradient based update
    - temp (float) – Temperature for the softmax distribution over the Q values of actions
    alpha
        Step size parameter for the gradient based update of the policy
        Type: float
    probability_hist
        History of probability values assigned to each action at each timestep
        Type: numpy.ndarray
    quality
        Q values assigned by the policy to all actions
        Type: numpy.ndarray
    select_action(context: int) → int
        Select an action according to the softmax action selection strategy.
        The action is sampled from a softmax distribution computed over the Q values for all actions.
        Parameters: context (int) – the context to select an action for
        Returns: Selected action
        Return type: int
    temp
        Temperature for the softmax distribution over the Q values of actions
        Type: float
    update_params(context: int, action: int, reward: float) → None
        Update parameters for the policy.
        Updates the regret as the difference between the max Q value and that of the action taken, and updates the Q values through a gradient ascent step.
        Parameters: - context (int) – context for which the action is taken
        - action (int) – action taken for the step
        - reward (float) – reward obtained for the step
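A standalone sketch of both halves, assuming the standard gradient bandit update from Section 2.8 of Sutton & Barto with a reward baseline (the library's exact update rule is not shown above and may differ):

    import numpy as np

    def softmax(quality: np.ndarray, temp: float = 0.01) -> np.ndarray:
        z = (quality - quality.max()) / temp  # shift for numerical stability
        p = np.exp(z)
        return p / p.sum()

    def gradient_step(quality, action, reward, baseline, alpha=0.1, temp=0.01):
        """One gradient ascent step on the action preferences."""
        probs = softmax(quality, temp)
        quality = quality - alpha * (reward - baseline) * probs  # push every arm down ...
        quality[action] += alpha * (reward - baseline)           # ... and the taken arm back up
        return quality

The net effect is the textbook rule: the taken arm's preference moves by alpha * (reward - baseline) * (1 - probs[action]), every other arm's by -alpha * (reward - baseline) * probs[arm].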
Thompson Sampling
class genrl.agents.bandits.multiarmed.thompson.ThompsonSamplingMABAgent(bandit: genrl.core.bandit.MultiArmedBandit, alpha: float = 1.0, beta: float = 1.0)
    Bases: genrl.agents.bandits.multiarmed.base.MABAgent
    Multi-Armed Bandit Solver with Thompson Sampling based Action Selection Strategy.
    Parameters: - bandit (MultiArmedBandit type object) – The bandit to solve
    - alpha (float) – alpha value for the beta distribution
    - beta (float) – beta value for the beta distribution
    a
        alpha parameter of the beta distribution associated with the policy
        Type: numpy.ndarray
    b
        beta parameter of the beta distribution associated with the policy
        Type: numpy.ndarray
    quality
        Q values for all the actions, computed from alpha and beta
        Type: numpy.ndarray
    select_action(context: int) → int
        Select an action according to Thompson sampling.
        Samples are drawn from the beta distribution parameterized by alpha and beta for each action; the action with the highest sample is selected.
        Parameters: context (int) – the context to select an action for
        Returns: Selected action
        Return type: int
    update_params(context: int, action: int, reward: float) → None
        Update parameters for the policy.
        Updates the regret as the difference between the max Q value and that of the action taken. Updates the alpha value of the beta distribution by adding the reward, while the beta value is updated by adding 1 - reward. Also updates the count of the action taken.
        Parameters: - context (int) – context for which the action is taken
        - action (int) – action taken for the step
        - reward (float) – reward obtained for the step
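The whole algorithm fits in a few lines of NumPy; a standalone sketch with hypothetical posterior parameters, following the update rule described above:

    import numpy as np

    # Hypothetical Beta posteriors over each arm's success probability
    a = np.array([4.0, 2.0, 7.0])     # prior alpha + summed rewards
    b = np.array([3.0, 6.0, 2.0])     # prior beta + summed (1 - reward)

    samples = np.random.beta(a, b)    # one draw per arm from its posterior
    action = int(np.argmax(samples))  # play the arm with the largest draw

    reward = 1                        # observed Bernoulli reward for `action`
    a[action] += reward               # posterior update, as described above
    b[action] += 1 - reward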
Upper Confidence Bound
class genrl.agents.bandits.multiarmed.ucb.UCBMABAgent(bandit: genrl.core.bandit.MultiArmedBandit, confidence: float = 1.0)
    Bases: genrl.agents.bandits.multiarmed.base.MABAgent
    Multi-Armed Bandit Solver with Upper Confidence Bound based Action Selection Strategy.
    Refer to Section 2.7 of Reinforcement Learning: An Introduction.
    Parameters: - bandit (MultiArmedBandit type object) – The bandit to solve
    - confidence (float) – Confidence level which controls the degree of exploration
    confidence
        Confidence level which weights the exploration term
        Type: float
    quality
        Q values assigned by the policy to all actions
        Type: numpy.ndarray
    select_action(context: int) → int
        Select an action according to upper confidence bound action selection.
        Takes the action that maximises a weighted sum of the Q values for the action and an exploration term controlled by the confidence level.
        Parameters: context (int) – the context to select an action for
        Returns: Selected action
        Return type: int
    update_params(context: int, action: int, reward: float) → None
        Update parameters for the policy.
        Updates the regret as the difference between the max Q value and that of the action taken, and updates the Q values according to the reward received in this step.
        Parameters: - context (int) – context for which the action is taken
        - action (int) – action taken for the step
        - reward (float) – reward obtained for the step
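A standalone sketch of the selection rule using the classic UCB1 bonus (the exact exploration term the library uses is not shown above; this is the textbook form):

    import numpy as np

    def ucb_action(quality: np.ndarray, counts: np.ndarray, t: int, confidence: float = 1.0) -> int:
        """Argmax of Q plus a confidence bonus that shrinks as an arm is pulled more.

        `t` is the 1-indexed timestep; unpulled arms are clamped to one pull
        to keep the bonus finite in this sketch.
        """
        bonus = confidence * np.sqrt(2 * np.log(t) / np.maximum(counts, 1))
        return int(np.argmax(quality + bonus))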