For an introduction to the Contextual Bandit problem, refer to Contextual Bandits Overview.

The bootstrap agent trains multiple neural-network-based models simultaneously. A separate transition database is maintained for each model, and every observed transition is added to each database with some probability. At each timestep, the model used to select an action is chosen at random from the set of models.

Initialising the models with different random weights promotes exploration of the loss landscape, which may have multiple local optima.
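The bookkeeping described above can be illustrated with a small, self-contained sketch. This is not genrl's implementation: the class `BootstrapSketch`, its parameter `add_prob`, and the toy environment are hypothetical names introduced here for illustration. To keep it short, each "model" is a table of per-arm mean reward estimates rather than a neural network, and the context is ignored.

```python
import random


class BootstrapSketch:
    """Toy bootstrap agent (hypothetical, not genrl's API).

    Per-arm mean reward estimates stand in for neural networks,
    and the context is ignored for brevity.
    """

    def __init__(self, n_models, n_arms, add_prob=0.5, seed=0):
        self.rng = random.Random(seed)
        self.n_arms = n_arms
        self.add_prob = add_prob
        # One transition database per model.
        self.datasets = [[] for _ in range(n_models)]
        # Different random starting estimates per model,
        # analogous to different random initial weights.
        self.estimates = [
            [self.rng.random() for _ in range(n_arms)] for _ in range(n_models)
        ]

    def select_action(self):
        # Choose one model uniformly at random, then act greedily on it.
        model = self.rng.randrange(len(self.datasets))
        values = self.estimates[model]
        return max(range(self.n_arms), key=values.__getitem__)

    def update(self, action, reward):
        # Offer the transition to every model; each one stores it
        # independently with probability add_prob.
        for model, data in enumerate(self.datasets):
            if self.rng.random() < self.add_prob:
                data.append((action, reward))
                self._refit(model, action)

    def _refit(self, model, action):
        # "Training" here is just recomputing the mean reward for the arm
        # from this model's own database.
        rewards = [r for a, r in self.datasets[model] if a == action]
        self.estimates[model][action] = sum(rewards) / len(rewards)


# Toy Bernoulli environment with made-up success probabilities per arm.
true_means = [0.2, 0.5, 0.8]
env_rng = random.Random(1)
agent = BootstrapSketch(n_models=10, n_arms=3)
for _ in range(2000):
    action = agent.select_action()
    reward = 1.0 if env_rng.random() < true_means[action] else 0.0
    agent.update(action, reward)

# Each model ends up with its own random subset of the transitions.
sizes = [len(data) for data in agent.datasets]
```

Because every model sees a different subset of the data and starts from different estimates, the models can disagree about the best arm, and sampling a model at random turns that disagreement into exploration.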

The following example creates a bootstrap-based agent in genrl with 10 models, each with a single hidden layer of 128 neurons and using dropout during training:

from genrl.bandit import BootstrapNeuralAgent, DCBTrainer

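# `bandit` is assumed to be a contextual bandit instance created beforehand.
# device="cuda" requires a CUDA-capable GPU.
agent = BootstrapNeuralAgent(bandit, hidden_dims=[128], n=10, dropout_p=0.5, device="cuda")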

trainer = DCBTrainer(agent, bandit)

Refer to the BootstrapNeuralAgent and DCBTrainer docs for more details.