Multi-Armed Bandit

Base

class genrl.agents.bandits.multiarmed.base.MABAgent(bandit: genrl.core.bandit.MultiArmedBandit)[source]

Bases: genrl.core.bandit.BanditAgent

Base class for contextual multi-armed bandit solving policies

Parameters:
  • bandit (MultiArmedBandit type object) – The Bandit to solve
  • requires_init_run – Indicates whether initialisation of Q values is required
action_hist

Get the history of actions taken for contexts

Returns:List of (context, action) pairs
Return type:list
counts

Get the number of times each action has been taken

Returns:Numpy array with count for each action
Return type:numpy.ndarray
regret

Get the current regret

Returns:The current regret
Return type:float
regret_hist

Get the history of regrets incurred for each step

Returns:List of regrets
Return type:list
reward_hist

Get the history of rewards received for each step

Returns:List of rewards
Return type:list
select_action(context: int) → int[source]

Select an action

This method needs to be implemented in the specific policy.

Parameters:context (int) – the context to select action for
Returns:Selected action
Return type:int
update_params(context: int, action: int, reward: Union[int, float]) → None[source]

Update parameters for the policy

This method needs to be implemented in the specific policy.

Parameters:
  • context (int) – context for which action is taken
  • action (int) – action taken for the step
  • reward (int or float) – reward obtained for the step
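
A typical interaction with any MABAgent subclass follows a select/step/update loop. The sketch below is illustrative only: it uses the EpsGreedyMABAgent and BernoulliMAB classes documented later in this section and assumes the bandit exposes reset() and step(action) methods returning a context and a (context, reward) pair respectively; those methods are not documented here.

    # Minimal sketch of the agent/bandit loop; the bandit's reset()/step() API is assumed.
    from genrl.agents.bandits.multiarmed.bernoulli_mab import BernoulliMAB
    from genrl.agents.bandits.multiarmed.epsgreedy import EpsGreedyMABAgent

    bandit = BernoulliMAB(bandits=1, arms=5)
    agent = EpsGreedyMABAgent(bandit, eps=0.05)

    context = bandit.reset()                           # assumed API
    for _ in range(1000):
        action = agent.select_action(context)          # documented above
        context, reward = bandit.step(action)          # assumed return order
        agent.update_params(context, action, reward)   # documented above

    print(agent.regret, agent.counts)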

Bayesian Bandit

class genrl.agents.bandits.multiarmed.bayesian.BayesianUCBMABAgent(bandit: genrl.core.bandit.MultiArmedBandit, alpha: float = 1.0, beta: float = 1.0, confidence: float = 3.0)[source]

Bases: genrl.agents.bandits.multiarmed.base.MABAgent

Multi-Armed Bandit Solver with Bayesian Upper Confidence Bound based Action Selection Strategy.

Refer to Section 2.7 of Reinforcement Learning: An Introduction.

Parameters:
  • bandit (MultiArmedBandit type object) – The Bandit to solve
  • alpha (float) – alpha value for the beta distribution
  • beta (float) – beta value for the beta distribution
  • confidence (float) – Confidence level which controls the degree of exploration
a

alpha parameter of beta distribution associated with the policy

Type:numpy.ndarray
b

beta parameter of beta distribution associated with the policy

Type:numpy.ndarray
confidence

Confidence level which weights the exploration term

Type:float
quality

Q values for all the actions, computed from alpha, beta and the confidence level

Type:numpy.ndarray
select_action(context: int) → int[source]

Select an action according to bayesian upper confidence bound

Takes the action that maximises a weighted sum of the Q value and a term derived from the beta distribution parameterized by alpha and beta, weighted by the confidence level, for each action

Parameters:context (int) – the context to select action for
Returns:Selected action
Return type:int

update_params(context: int, action: int, reward: float) → None[source]

Update parameters for the policy

Updates the regret as the difference between the max Q value and that of the action taken. Updates the Q values according to the reward received in this step.

Parameters:
  • context (int) – context for which action is taken
  • action (int) – action taken for the step
  • reward (float) – reward obtained for the step
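
As a rough illustration of the selection rule described above (a plain NumPy sketch of a Bayesian UCB rule, not GenRL's internal implementation), each action can be scored by its Q value plus a confidence-weighted uncertainty term taken from the Beta(alpha, beta) posterior:

    import numpy as np

    def bayesian_ucb_select(quality, a, b, confidence=3.0):
        # Posterior standard deviation of Beta(a, b) for each arm.
        std = np.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
        # Exploit high Q values while exploring arms with uncertain posteriors.
        return int(np.argmax(quality + confidence * std))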

Bernoulli Bandit

class genrl.agents.bandits.multiarmed.bernoulli_mab.BernoulliMAB(bandits: int = 1, arms: int = 5, reward_probs: numpy.ndarray = None, context_type: str = 'tensor')[source]

Bases: genrl.core.bandit.MultiArmedBandit

Contextual Bandit with categorical context and Bernoulli reward distribution

Parameters:
  • bandits (int) – Number of bandits
  • arms (int) – Number of arms in each bandit
  • reward_probs (numpy.ndarray) – Probabilities of getting rewards
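
A minimal construction example using the documented parameters; the shape assumed here for reward_probs, (bandits, arms), is an assumption and is not stated above:

    import numpy as np
    from genrl.agents.bandits.multiarmed.bernoulli_mab import BernoulliMAB

    # One bandit (context) with 3 arms; one success probability per arm (assumed shape).
    bandit = BernoulliMAB(bandits=1, arms=3, reward_probs=np.array([[0.2, 0.5, 0.8]]))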

Epsilon Greedy

class genrl.agents.bandits.multiarmed.epsgreedy.EpsGreedyMABAgent(bandit: genrl.core.bandit.MultiArmedBandit, eps: float = 0.05)[source]

Bases: genrl.agents.bandits.multiarmed.base.MABAgent

Contextual Bandit Policy with Epsilon Greedy Action Selection Strategy.

Refer to Section 2.3 of Reinforcement Learning: An Introduction.

Parameters:
  • bandit (MultiArmedBandit type object) – The Bandit to solve
  • eps (float) – Probability with which a random action is to be selected.
eps

Exploration constant

Type:float
quality

Q values assigned by the policy to all actions

Type:numpy.ndarray
select_action(context: int) → int[source]

Select an action according to the epsilon greedy strategy

With probability epsilon, a random action is selected instead of the action that is optimal according to the current Q values, in order to encourage exploration.

Parameters:context (int) – the context to select action for
Returns:Selected action
Return type:int
update_params(context: int, action: int, reward: float) → None[source]

Update parameters for the policy

Updates the regret as the difference between the max Q value and that of the action taken. Updates the Q values according to the reward received in this step.

Parameters:
  • context (int) – context for which action is taken
  • action (int) – action taken for the step
  • reward (float) – reward obtained for the step
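
The strategy above can be sketched in a few lines of NumPy. This is an illustration of epsilon-greedy selection with a sample-average Q update, not a reproduction of GenRL's internal code:

    import numpy as np

    rng = np.random.default_rng(0)

    def eps_greedy_select(quality, eps=0.05):
        # Explore a random arm with probability eps, otherwise act greedily.
        if rng.random() < eps:
            return int(rng.integers(len(quality)))
        return int(np.argmax(quality))

    def sample_average_update(quality, counts, action, reward):
        # Incremental mean: Q <- Q + (reward - Q) / n.
        counts[action] += 1
        quality[action] += (reward - quality[action]) / counts[action]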

Gaussian

class genrl.agents.bandits.multiarmed.gaussian_mab.GaussianMAB(bandits: int = 10, arms: int = 5, reward_means: numpy.ndarray = None, context_type: str = 'tensor')[source]

Bases: genrl.core.bandit.MultiArmedBandit

Contextual Bandit with categorical context and Gaussian reward distribution

Parameters:
  • bandits (int) – Number of bandits
  • arms (int) – Number of arms in each bandit
  • reward_means (numpy.ndarray) – Means of the Gaussian reward distributions
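
A minimal construction example using the documented parameters:

    from genrl.agents.bandits.multiarmed.gaussian_mab import GaussianMAB

    # 10 bandits (contexts), each with 5 arms giving Gaussian-distributed rewards.
    bandit = GaussianMAB(bandits=10, arms=5)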

Gradient

class genrl.agents.bandits.multiarmed.gradient.GradientMABAgent(bandit: genrl.core.bandit.MultiArmedBandit, alpha: float = 0.1, temp: float = 0.01)[source]

Bases: genrl.agents.bandits.multiarmed.base.MABAgent

Multi-Armed Bandit Solver with Softmax Action Selection Strategy.

Refer to Section 2.8 of Reinforcement Learning: An Introduction.

Parameters:
  • bandit (MultiArmedBandit type object) – The Bandit to solve
  • alpha (float) – The step size parameter for gradient based update
  • temp (float) – Temperature for softmax distribution over Q values of actions
alpha

Step size parameter for gradient based update of policy

Type:float
probability_hist

History of probability values assigned to each action for each timestep

Type:numpy.ndarray
quality

Q values assigned by the policy to all actions

Type:numpy.ndarray
select_action(context: int) → int[source]

Select an action according to the softmax action selection strategy

The action is sampled from a softmax distribution computed over the Q values of all actions

Parameters:context (int) – the context to select action for
Returns:Selected action
Return type:int
temp

Temperature for softmax distribution over Q values of actions

Type:float
update_params(context: int, action: int, reward: float) → None[source]

Update parameters for the policy

Updates the regret as the difference between the max Q value and that of the action taken. Updates the Q values through a gradient ascent step.

Parameters:
  • context (int) – context for which action is taken
  • action (int) – action taken for the step
  • reward (float) – reward obtained for the step
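
The softmax selection and gradient-style update described above correspond to the gradient bandit algorithm of Section 2.8 of Reinforcement Learning: An Introduction. The sketch below illustrates that algorithm in NumPy; GenRL's exact update (for example its choice of baseline) may differ:

    import numpy as np

    rng = np.random.default_rng(0)

    def softmax(prefs, temp=0.01):
        z = np.exp((prefs - prefs.max()) / temp)
        return z / z.sum()

    def softmax_select(prefs, temp=0.01):
        # Sample an action from the softmax distribution over preferences.
        probs = softmax(prefs, temp)
        return int(rng.choice(len(prefs), p=probs))

    def gradient_update(prefs, action, reward, baseline, alpha=0.1, temp=0.01):
        # Gradient ascent on expected reward: raise the chosen action's preference
        # when the reward beats the baseline, lower the others.
        probs = softmax(prefs, temp)
        prefs -= alpha * (reward - baseline) * probs
        prefs[action] += alpha * (reward - baseline)
        return prefs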

Thompson Sampling

class genrl.agents.bandits.multiarmed.thompson.ThompsonSamplingMABAgent(bandit: genrl.core.bandit.MultiArmedBandit, alpha: float = 1.0, beta: float = 1.0)[source]

Bases: genrl.agents.bandits.multiarmed.base.MABAgent

Multi-Armed Bandit Solver with Thompson Sampling based Action Selection Strategy.

Parameters:
  • bandit (MultiArmedBandit type object) – The Bandit to solve
  • alpha (float) – alpha value for the beta distribution
  • beta (float) – beta value for the beta distribution
a

alpha parameter of beta distribution associated with the policy

Type:numpy.ndarray
b

beta parameter of beta distribution associated with the policy

Type:numpy.ndarray
quality

Q values for all the actions, computed from alpha and beta

Type:numpy.ndarray
select_action(context: int) → int[source]

Select an action according to Thompson Sampling

Samples are taken from beta distribution parameterized by alpha and beta for each action. The action with the highest sample is selected.

Parameters:context (int) – the context to select action for
Returns:Selected action
Return type:int
update_params(context: int, action: int, reward: float) → None[source]

Update parameters for the policy

Updates the regret as the difference between the max Q value and that of the action taken. Updates the alpha value of the beta distribution by adding the reward, while the beta value is updated by adding 1 - reward. Updates the count of the action taken.

Parameters:
  • context (int) – context for which action is taken
  • action (int) – action taken for the step
  • reward (float) – reward obtained for the step
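
The rule described above can be sketched directly with NumPy's beta sampler. This is an illustration of Thompson Sampling with a Beta-Bernoulli posterior, not GenRL's internal code:

    import numpy as np

    rng = np.random.default_rng(0)

    def thompson_select(a, b):
        # Draw one posterior sample per arm and play the arm with the largest sample.
        return int(np.argmax(rng.beta(a, b)))

    def thompson_update(a, b, action, reward):
        # Conjugate Beta-Bernoulli update, as described above.
        a[action] += reward
        b[action] += 1 - reward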

Upper Confidence Bound

class genrl.agents.bandits.multiarmed.ucb.UCBMABAgent(bandit: genrl.core.bandit.MultiArmedBandit, confidence: float = 1.0)[source]

Bases: genrl.agents.bandits.multiarmed.base.MABAgent

Multi-Armed Bandit Solver with Upper Confidence Bound based Action Selection Strategy.

Refer to Section 2.7 of Reinforcement Learning: An Introduction.

Parameters:
  • bandit (MultiArmedBandit type object) – The Bandit to solve
  • confidence (float) – Confidence level which controls the degree of exploration
confidence

Confidence level which weights the exploration term

Type:float
quality

Q values assigned by the policy to all actions

Type:numpy.ndarray
select_action(context: int) → int[source]

Select an action according to upper confidence bound action selection

Takes the action that maximises a weighted sum of the Q value for the action and an exploration bonus term controlled by the confidence level.

Parameters:context (int) – the context to select action for
Returns:Selected action
Return type:int
update_params(context: int, action: int, reward: float) → None[source]

Update parameters for the policy

Updates the regret as the difference between the max Q value and that of the action taken. Updates the Q values according to the reward received in this step.

Parameters:
  • context (int) – context for which action is taken
  • action (int) – action taken for the step
  • reward (float) – reward obtained for the step
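
The selection rule described above matches the familiar UCB form of Q value plus an exploration bonus. A rough NumPy sketch (illustrative of a UCB1-style rule, not GenRL's exact formula):

    import numpy as np

    def ucb_select(quality, counts, t, confidence=1.0):
        # Try every arm once before applying the bound.
        if np.any(counts == 0):
            return int(np.argmax(counts == 0))
        # Q value plus a confidence-weighted exploration bonus that shrinks
        # as an arm is pulled more often.
        bonus = confidence * np.sqrt(np.log(t) / counts)
        return int(np.argmax(quality + bonus))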