Contextual Bandits Overview¶

Problem Setting¶

To get some background on the basic multi armed bandit problem, we recommend that you go through the Multi Armed Bandit Overview first. The contextual bandit (CB) problem varies from the basic case in that at each timestep, a context vector $$x \in \mathbb{R}^d$$ is presented to the agent. The agent must then decide on an action $$a \in \mathcal{A}$$ to take based on that context. After the action is taken, the reward $$r \in \mathbb{R}$$ for only that action is revealed to the agent (a feature of all reinforcement learning problems). The aim of the agent remains the same - minimising regret and thus finding an optimal policy.

Here you still have the problem of exploration vs exploitation, but the agent also needs to find some relation between the context and reward.

A Simple Example¶

Lets consider the simplest case of the CB problem. Instead of having only one $$k$$-armed bandit that needs to be solved, say we have $$m$$ different $$k$$-armed Bernoulli bandits. At each timestep, the context presented is the number of the bandit for which an action needs to be selected: $$i \in \mathbb{I}$$ where $$0 < i \le m$$

Although real life CB problems usually have much higher dimensional contexts, such a toy problem can be usefull for testing and debugging agents.

To instantiate a Bernoulli bandit with $$m =10$$ and $$k = 5$$ (10 different 5-armed bandits) -

from genrl.bandit import BernoulliMAB

bandit = BernoulliMAB(bandits=10, arms=5, context_type="int")


Note that this is using the same BernoulliMAB as in the simple bandit case except that instead of the bandits argument defaulting to 1, we are explicitly saying we want multiple bandits (a contexutal case)

Suppose you want to solve this bandit with a UCB based policy.

from genrl.bandit import UCBMABAgent

agent = UCBMABAgent(bandit)
context = bandit.reset()

action = agent.select_action(context)
new_context, reward = bandit.step(action)


To train the agent, you an set up a loop which calls the update_params method on the agent whenever you want to agent to learn from actions it has taken. For convinience it is highly recommended to use the MABTrainer in such cases.

Data based Conextual Bandits¶

Lets consider a more realistic class of CB problem. I real life, you the CB setting is usually used to model recommendation or classification problems. Here, instead of getting an integer as the context, you will get a $$d$$-dimensional feature vector $$\mathbf{x} \in \mathbb{R}^d$$. This is also different from regular classification since you only get the reward $$r \in \mathbb{R}$$ for the action you have taken.

While tabular solutions can work well for integer contexts (see the implentation of any genrl.bandit.MABAgent for details), when you have a high dimensional vector, the agent should be able to infer the complex relation between the contexts and rewards. This can be done by modelling a conditional distribution over rewards for each action given the context.

$P(r | a, \mathbf{x})$

There are many ways to do this. For a detailed explanation and comparison of contextual bandit methods you can refer to this paper.

The following are the agents implemented in genrl

You can find the tutorials for most of these in Bandit Tutorials.

All the methods which use neural networks, provide an option to train and evaluate with dropout, have a decaying learning rate and a limit for gradient clipping. The sizes of hidden layers for the networks can also be specified. Refer to docs of the specific agents to see how to use these options.

Individual agents will have other method specific paramters to control behavior. Although default values have been provided, it may be neccessary to tune these for individual use cases.

The following bandits based on datasets are implemented in genrl

For each bandit, while instatiating an object you can either specify a path to the data file or pass download=True as an argument to download the data directly.

Data based Bandit Example¶

For this example, we’ll model the Statlog dataset as a bandit problem. You can read more about the bandit in the Statlog docs. In brief we have the number of arms as $$k = 7$$ and dimension of context vector as $$d = 9$$. The agent will get a reward $$r =1$$ if it selects the correct arm else $$r = 0$$.

from genrl.bandit import StatlogDataBandit

context = bandit.reset()


Suppose you want to solve this bandit with a Greedy neural network based policy.

from genrl.bandit import NeuralLinearPosteriorAgent

agent = NeuralLinearPosteriorAgent(bandit)
context = bandit.reset()

action = agent.select_action(context)
new_context, reward = bandit.step(action)


To train the agent, we highly reccomend using the DCBTrainer. You can refer to the implementation of the train function to get an idea of how to implemente your own training loop.

from genrl.bandit import DCBTrainer

trainer = DCBTrainer(agent, bandit)
trainer.train(timesteps=5000, batch_size=32)