Linear Posterior Inference
For an introduction to the Contextual Bandit problem, refer to Contextual Bandits Overview.
In this agent we assume a linear relationship between the context and the reward distribution, of the form

\(r = \beta^T \mathbf{x} + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)\)
We can utilise Bayesian linear regression to find the parameters \(\beta\) and \(\sigma\). Since our agent is continually learning, the parameters of the model are updated according to the (\(\mathbf{x}\), \(a\), \(r\)) transitions it observes.
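As a concrete illustration, the following is a minimal sketch of the conjugate update for a single arm under a Normal-Inverse-Gamma prior, assuming one such model is kept per arm. The class and variable names here are illustrative, not genrl internals:

import numpy as np

class BayesLinReg:
    """Conjugate Bayesian linear regression for one arm (illustrative)."""

    def __init__(self, dim, lambda_prior=0.5, a0=2.0, b0=2.0):
        self.mu = np.zeros(dim)                      # posterior mean of beta
        self.precision = lambda_prior * np.eye(dim)  # posterior precision of beta
        self.a = a0                                  # inverse gamma shape for sigma^2
        self.b = b0                                  # inverse gamma scale for sigma^2

    def update(self, X, r):
        # Standard Normal-Inverse-Gamma posterior update for a batch of
        # (context, reward) observations X (n x dim) and r (n,).
        new_precision = self.precision + X.T @ X
        new_mu = np.linalg.solve(new_precision, self.precision @ self.mu + X.T @ r)
        self.a += 0.5 * len(r)
        self.b += 0.5 * (r @ r + self.mu @ self.precision @ self.mu
                         - new_mu @ new_precision @ new_mu)
        self.mu, self.precision = new_mu, new_precision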
For more complex non-linear relations, we can use a neural network to transform the context into a learned embedding space. The above method can then be applied to this latent embedding to model the reward.
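As a rough sketch of this neural-linear idea (the encoder architecture and dimensions below are arbitrary assumptions, not genrl's defaults):

import torch
import torch.nn as nn

context_dim, latent_dim = 16, 8  # example dimensions, chosen arbitrarily

# A small encoder maps the raw context to a latent embedding; the Bayesian
# linear regression sketched above is then run on the embedding rather than
# the raw context.
encoder = nn.Sequential(
    nn.Linear(context_dim, 64),
    nn.ReLU(),
    nn.Linear(64, latent_dim),
)

def embed(context):
    # Treat the embedding as the "context" seen by the linear posterior model.
    with torch.no_grad():
        return encoder(context).numpy()

In practice the encoder itself is periodically retrained on the observed rewards, so the embedding space improves alongside the posterior.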
An example of using a neural network based linear posterior agent in genrl:
from genrl.bandit import CovertypeDataBandit, NeuralLinearPosteriorAgent, DCBTrainer

# Any data-based contextual bandit can be used here; CovertypeDataBandit is
# one example shipped with genrl.
bandit = CovertypeDataBandit(download=True)

agent = NeuralLinearPosteriorAgent(bandit, lambda_prior=0.5, a0=2, b0=2, device="cuda")

trainer = DCBTrainer(agent, bandit)
trainer.train()
Note that the priors here are used to parameterise the initial distribution over \(\beta\) and \(\sigma\). More specifically, lambda_prior is used to parameterise a Gaussian distribution over \(\beta\), while a0 and b0 are the parameters of an inverse gamma distribution over \(\sigma^2\). These are updated over the course of exploring a bandit. More details can be found in Section 3 of this paper.
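To make the role of these priors concrete, here is a minimal sketch of Thompson sampling on top of the per-arm BayesLinReg models sketched earlier (the function names are illustrative):

import numpy as np

def sample_reward_estimate(model, context):
    # Sample sigma^2 from its inverse gamma posterior, then beta from the
    # conditional Gaussian posterior, and score the context under the sample.
    sigma2 = 1.0 / np.random.gamma(model.a, 1.0 / model.b)
    beta = np.random.multivariate_normal(model.mu,
                                         sigma2 * np.linalg.inv(model.precision))
    return context @ beta

def select_action(models, context):
    # Act greedily with respect to the sampled parameters (Thompson sampling).
    return int(np.argmax([sample_reward_estimate(m, context) for m in models]))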
All hyperparameters can be tuned to the specific use case to improve training efficiency and speed up convergence.
Refer to the LinearPosteriorAgent, NeuralLinearPosteriorAgent and DCBTrainer docs for more details.