# Contextual Bandits Overview

## Problem Setting

For background on the basic multi-armed bandit problem, we recommend going through the Multi Armed Bandit Overview first. The contextual bandit (CB) problem differs from the basic case in that at each timestep, a context vector \(x \in \mathbb{R}^d\) is presented to the agent. The agent must then decide on an action \(a \in \mathcal{A}\) to take based on that context. After the action is taken, the reward \(r \in \mathbb{R}\) for only that action is revealed to the agent (a feature of all reinforcement learning problems). The aim of the agent remains the same: minimising regret and thus finding an optimal policy.

Here you still have the problem of exploration vs exploitation, but the agent also needs to find some relation between the context and reward.
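Regret can be formalised in several ways. One common choice, given here as an assumption since the text above does not fix a definition, compares the expected reward of the best action for each context with that of the action actually chosen, over \(T\) timesteps (the symbols \(T\), \(x_t\) and \(a_t\) are introduced only for this illustration):

\[
R_T = \sum_{t=1}^{T} \left( \max_{a \in \mathcal{A}} \mathbb{E}[r \mid x_t, a] - \mathbb{E}[r \mid x_t, a_t] \right)
\]

Under this definition, a policy that always picks the best action for every context has zero regret.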

## A Simple Example

Let's consider the simplest case of the CB problem. Instead of having only one \(k\)-armed bandit that needs to be solved, say we have \(m\) different \(k\)-armed Bernoulli bandits. At each timestep, the context presented is the number of the bandit for which an action needs to be selected: \(i \in \mathbb{I}\) where \(0 < i \le m\).

Although real life CB problems usually have much higher dimensional contexts, such a toy problem can be useful for testing and debugging agents.

To instantiate a Bernoulli bandit with \(m = 10\) and \(k = 5\) (10 different 5-armed bandits):

```
from genrl.bandit import BernoulliMAB
bandit = BernoulliMAB(bandits=10, arms=5, context_type="int")
```

Note that this uses the same `BernoulliMAB` as in the simple bandit case, except that instead of letting the `bandits` argument default to `1`, we explicitly ask for multiple bandits (the contextual case).

Suppose you want to solve this bandit with a UCB based policy.

```
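from genrl.bandit import UCBMABAgent

agent = UCBMABAgent(bandit)
# Get the initial context (here, the number of the bandit to act on)
context = bandit.reset()
# Choose an arm for that context
action = agent.select_action(context)
# Take the action; only the reward for the chosen arm is returned
new_context, reward = bandit.step(action)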
```

To train the agent, you can set up a loop which calls the `update_params` method on the agent whenever you want the agent to learn from the actions it has taken. For convenience, it is highly recommended to use the `MABTrainer` in such cases.
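To make the idea of a tabular, per-context training loop concrete, here is a minimal NumPy sketch of UCB1 run independently for each integer context. It does not use `genrl`. The names `probs`, `select_action` and `update` are hypothetical and chosen only for illustration, and the arm probabilities are randomly generated stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 10, 5  # 10 bandits (contexts), 5 arms each
# Hypothetical ground truth: success probability of each arm of each bandit.
probs = rng.uniform(0.1, 0.9, size=(m, k))

counts = np.zeros((m, k))  # number of pulls per (context, arm)
values = np.zeros((m, k))  # running mean reward per (context, arm)


def select_action(context):
    """UCB1 choice for one integer context; untried arms are pulled first."""
    untried = np.flatnonzero(counts[context] == 0)
    if untried.size > 0:
        return int(untried[0])
    n_context = counts[context].sum()
    bonus = np.sqrt(2.0 * np.log(n_context) / counts[context])
    return int(np.argmax(values[context] + bonus))


def update(context, action, reward):
    """Incremental mean update for the arm that was pulled."""
    counts[context, action] += 1
    values[context, action] += (reward - values[context, action]) / counts[context, action]


timesteps = 20000
total_reward = 0.0
for _ in range(timesteps):
    context = int(rng.integers(m))  # which bandit is presented
    action = select_action(context)
    reward = float(rng.random() < probs[context, action])  # Bernoulli reward
    update(context, action, reward)
    total_reward += reward

greedy_arms = values.argmax(axis=1)  # estimated best arm for each context
```

Keeping one row of statistics per context is what makes this a tabular solution: nothing learned for one bandit is shared with another.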

## Data based Contextual Bandits

Let's consider a more realistic class of CB problem. In real life, the CB setting is usually used to model recommendation or classification problems. Here, instead of getting an integer as the context, you will get a \(d\)-dimensional feature vector \(\mathbf{x} \in \mathbb{R}^d\). This also differs from regular classification since you only get the reward \(r \in \mathbb{R}\) for the action you have taken.

While tabular solutions can work well for integer contexts (see the implementation of any `genrl.bandit.MABAgent` for details), when you have a high dimensional vector, the agent should be able to infer the complex relation between the contexts and rewards. This can be done by modelling a conditional distribution over rewards for each action given the context.
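As one illustration of modelling a reward distribution per action given the context, the sketch below keeps a Gaussian posterior over linear weights for each action and selects actions by Thompson sampling. It is a simplified stand-in written in plain NumPy, not the `genrl` implementation. The unit noise variance, the standard normal prior, and the synthetic linear environment `true_weights` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 3  # context dimension and number of actions
# Hypothetical environment: expected reward is linear in the context.
true_weights = rng.normal(size=(k, d))

# Bayesian linear regression per action, assuming a N(0, I) prior on the
# weights and unit observation noise.
precision = np.stack([np.eye(d) for _ in range(k)])  # I + sum of x x^T
b = np.zeros((k, d))                                 # sum of r * x


def select_action(x):
    """Sample weights from each action's posterior and act greedily on them."""
    sampled_rewards = np.empty(k)
    for a in range(k):
        mean = np.linalg.solve(precision[a], b[a])
        # With precision = L L^T, solving L^T u = z gives u ~ N(0, precision^-1).
        chol = np.linalg.cholesky(precision[a])
        w = mean + np.linalg.solve(chol.T, rng.normal(size=d))
        sampled_rewards[a] = w @ x
    return int(np.argmax(sampled_rewards))


def update(x, a, r):
    """Posterior update for the chosen action only (partial feedback)."""
    precision[a] += np.outer(x, x)
    b[a] += r * x


timesteps = 2000
total_reward = 0.0
for _ in range(timesteps):
    x = rng.normal(size=d)
    a = select_action(x)
    r = true_weights[a] @ x + rng.normal()  # noisy linear reward
    update(x, a, r)
    total_reward += r

posterior_means = np.stack(
    [np.linalg.solve(precision[a], b[a]) for a in range(k)]
)
```

Sampling from the posterior, rather than always using its mean, is what provides exploration here: actions with uncertain estimates are occasionally sampled as the best and therefore tried.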

There are many ways to do this. For a detailed explanation and comparison of contextual bandit methods you can refer to this paper.

The following are the agents implemented in `genrl`:

- Linear Posterior Inference
- Neural Network based Linear
- Variational
- Neural Network based Epsilon Greedy
- Bootstrap
- Parameter noise Sampling

You can find the tutorials for most of these in Bandit Tutorials.

All the methods which use neural networks provide options to train and evaluate with dropout, to use a decaying learning rate, and to set a limit for gradient clipping. The sizes of hidden layers for the networks can also be specified. Refer to the docs of the specific agents to see how to use these options.

Individual agents will have other method specific parameters to control behavior. Although default values have been provided, it may be necessary to tune these for individual use cases.

The following bandits based on datasets are implemented in `genrl`:

- Adult Census Income Dataset
- US Census Dataset
- Forest Covertype Dataset
- MAGIC Gamma Telescope dataset
- Mushroom Dataset
- Statlog Space Shuttle Dataset

For each bandit, while instantiating an object you can either specify a path to the data file or pass `download=True` as an argument to download the data directly.

## Data based Bandit Example

For this example, we’ll model the Statlog dataset as a bandit problem. You can read more about the bandit in the Statlog docs. In brief, the number of arms is \(k = 7\) and the dimension of the context vector is \(d = 9\). The agent gets a reward \(r = 1\) if it selects the correct arm, and \(r = 0\) otherwise.
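The way a labelled dataset turns into a bandit can be sketched in a few lines: each row's features are the context, each class is an arm, and the reward is 1 only when the chosen arm matches the label. The `ToyDataBandit` class below is hypothetical and runs on synthetic data with the same \(k\) and \(d\). It is not `genrl`'s `StatlogDataBandit`, whose internals may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 9, 7  # rows, context dimension, number of arms
# Synthetic stand-in for the dataset: random features and class labels.
features = rng.normal(size=(n, d))
labels = rng.integers(k, size=n)


class ToyDataBandit:
    """Hypothetical wrapper that serves dataset rows as bandit contexts."""

    def __init__(self, features, labels):
        self.features = features
        self.labels = labels
        self.idx = 0

    def reset(self):
        """Go back to the first row and return its features as the context."""
        self.idx = 0
        return self.features[self.idx]

    def step(self, action):
        """Reward is 1 if the chosen arm equals the current label, else 0."""
        reward = int(action == self.labels[self.idx])
        self.idx = (self.idx + 1) % len(self.labels)
        return self.features[self.idx], reward


toy_bandit = ToyDataBandit(features, labels)
context = toy_bandit.reset()
new_context, reward = toy_bandit.step(int(labels[0]))  # correct arm for row 0
```

The label itself is never shown to the agent; it only sees whether the arm it picked was rewarded. With `genrl`, the Statlog bandit is instantiated as follows.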

```
from genrl.bandit import StatlogDataBandit
bandit = StatlogDataBandit(download=True)
context = bandit.reset()
```

Suppose you want to solve this bandit with a neural network based linear posterior policy.

```
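from genrl.bandit import NeuralLinearPosteriorAgent

agent = NeuralLinearPosteriorAgent(bandit)
# Get the first context vector from the dataset
context = bandit.reset()
# Choose an arm for that context
action = agent.select_action(context)
# Take the action; the reward is 1 for the correct arm and 0 otherwise
new_context, reward = bandit.step(action)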
```

To train the agent, we highly recommend using the `DCBTrainer`. You can refer to the implementation of the `train` function to get an idea of how to implement your own training loop.

```
from genrl.bandit import DCBTrainer
trainer = DCBTrainer(agent, bandit)
trainer.train(timesteps=5000, batch_size=32)
```

## Further material about bandits

- Deep Contextual Multi-armed Bandits, Collier and Llorens, 2018
- Deep Bayesian Bandits Showdown, Riquelme et al., 2018
- A Contextual Bandit Bake-off, Bietti et al., 2020