Welcome to GenRL’s documentation!¶
Features¶
- Unified Trainer and Logging class: code reusability and high-level UI
- Ready-made algorithm implementations: implementations of popular RL algorithms
- Extensive benchmarking
- Environment implementations
- Heavy encapsulation, useful for new algorithms
Contents¶
Installation¶
PyPI Package¶
GenRL is compatible with Python 3.6 or later and also depends on pytorch and openai-gym. The easiest way to install GenRL is with pip, Python’s preferred package installer.
$ pip install genrl
Note that GenRL is an active project and routinely publishes new releases. In order to upgrade GenRL to the latest version, use pip as follows.
$ pip install -U genrl
From Source¶
If you intend to install the latest unreleased version of the library (i.e. from source), you can simply do:
$ git clone https://github.com/SforAiDl/genrl.git
$ cd genrl
$ python setup.py install
About¶
Introduction¶
Reinforcement Learning has taken massive leaps forward in extending current AI research. The DeepMind paper “Playing Atari with Deep Reinforcement Learning” (Mnih et al., with David Silver among the authors) can be considered one of the seminal papers in establishing a completely new landscape of Reinforcement Learning research. With applications in robotics, healthcare and numerous other domains, RL has become a primary mechanism for modelling sequential decision making in AI.
Yet current libraries and resources in Reinforcement Learning are either very limited, messy, or scattered. OpenAI’s Spinning Up is a great resource for getting started with Deep Reinforcement Learning, but it does not cover more basic concepts in Reinforcement Learning, e.g. Multi Armed Bandits. garage is a great resource for reproducing and evaluating RL algorithms, but it is not designed to introduce a newcomer to RL.
With GenRL, our goal is threefold:
- To educate the user about Reinforcement Learning.
- To provide easy-to-understand implementations of state-of-the-art Reinforcement Learning algorithms.
- To provide utilities for developing and evaluating new RL algorithms, so that, in a sense, any new RL algorithm can be implemented in less than 200 lines.
Policies and Values¶
Modern research on Reinforcement Learning is largely based on Markov Decision Processes. Policy and value functions are among the core parts of such a problem formulation, and so policies and values form one of the core parts of our library.
Trainers and Loggers¶
Trainers¶
Most current algorithms follow a standard training procedure. Classifying algorithms as On-Policy or Off-Policy, we provide high-level APIs through Trainers, which can be coupled with Agents and Environments for seamless training.
Let's take the example of an On-Policy algorithm, Proximal Policy Optimization. In our Agent, we make sure to define three methods: collect_rollouts, get_traj_loss and finally update_policy.
The OnPolicyTrainer simply calls these functions, so high-level usage only requires defining these three methods.
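The division of labour between agent and trainer can be sketched as follows. This is a minimal, self-contained illustration and not GenRL's actual OnPolicyTrainer: only the three method names come from the text above, while ToyAgent, ToyOnPolicyTrainer, their signatures and the fake rollout data are assumptions made for the example.

```python
import random


class ToyAgent:
    """Hypothetical agent exposing the three methods named above."""

    def __init__(self, rollout_size=8):
        self.rollout_size = rollout_size
        self.rewards = []
        self.updates = 0

    def collect_rollouts(self):
        # Stand-in for environment interaction: store random rewards.
        self.rewards = [random.random() for _ in range(self.rollout_size)]

    def get_traj_loss(self):
        # Stand-in for a loss computed from the collected rollout.
        return -sum(self.rewards) / len(self.rewards)

    def update_policy(self):
        # Stand-in for a gradient step on the policy.
        self.updates += 1


class ToyOnPolicyTrainer:
    """Hypothetical trainer: it only orchestrates the agent's methods."""

    def __init__(self, agent, epochs=5):
        self.agent = agent
        self.epochs = epochs

    def train(self):
        losses = []
        for _ in range(self.epochs):
            self.agent.collect_rollouts()
            losses.append(self.agent.get_traj_loss())
            self.agent.update_policy()
        return losses


agent = ToyAgent()
losses = ToyOnPolicyTrainer(agent, epochs=5).train()
```

The point of the design is that the trainer never needs to know which algorithm it is running: any agent that provides these three methods can be dropped in.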
Loggers¶
At the moment, we support three different types of Loggers: HumanOutputFormat, TensorboardLogger and CSVLogger. Any of these loggers can be initialized easily through the top-level Logger class by specifying the individual formats in which logging should be performed.
logger = Logger(logdir='logs/', formats=['stdout', 'tensorboard'])
After this, the logger can log data supplied as dictionaries. For example:
logger.write({"logger":0})
Note: The Tensorboard logger requires an extra x-axis parameter, as it plots data rather than just showing it in a tabular format.
Agent Encapsulation¶
WIP
Environments¶
Wrappers
Tutorials¶
Bandit Tutorials¶
Multi Armed Bandit Overview¶
Training an EpsilonGreedy agent on a Bernoulli Multi Armed Bandit¶
The multi-armed bandit is one of the most basic problems in RL. Think of it like this: you have ‘n’ levers in front of you, and each of these levers will give you a different reward. To formalise the problem, the reward is described by a reward function, i.e. the probability of getting a reward when a lever is pulled.
Suppose you try out one of the levers and get a positive reward. What do you do next? Should you keep pulling that lever every time, or consider that one of the other levers might give a better reward? This is the exploration-exploitation dilemma.
Exploitation - Utilise the information you have gathered so far to make the best decision. In this case, after one try you know a lever is giving you a positive reward, and you just exploit it further. Since you do not care about other arms if you keep exploiting, this is known as the greedy action.
Exploration - You explore the untried levers in an attempt to discover one with a higher payout than the one you currently have some knowledge about. This means exploring all your options without worrying about short-term rewards, in the hope of finding a lever with a bigger reward in the long run.
You have to use an algorithm which correctly trades off exploration and exploitation. We do not want a ‘greedy’ algorithm which only exploits and never explores, because it is very likely to converge to a suboptimal policy. We do not want an algorithm that keeps exploring either, as this would lead to suboptimal rewards in spite of knowing the best action to take. In this case, the optimal policy is to always pull the lever with the highest reward, but at the beginning we do not know the probability distribution of the rewards.
So we want a policy which explores actively at the beginning, building up an estimate of the reward values (defined as quality) of all the actions, and then exploits that estimate from then on.
A Bernoulli Multi-Armed Bandit has multiple arms, each with a different Bernoulli distribution over its reward. Basically, each arm has an associated probability of giving a reward when it is pulled. Our aim is to find the arm with the highest probability, thus giving us the maximum return.
Notation:
\(Q_t(a)\): Estimated quality of action ‘a’ at timestep ‘t’.
\(q(a)\): True value of action ‘a’.
We want our estimate \(Q_t(a)\) to be as close to the true value \(q(a)\) as possible, so we can make the correct decision.
Let the action with the maximum quality be \(a^*\):
\[q^* = q(a^*)\]
Our goal is to find this \(q^*\).
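The ‘regret function’ is defined as the sum of ‘regret’ accumulated over all timesteps. This regret is the cost of exploring instead of choosing the optimal arm. Mathematically, with \(a_t\) the action taken at timestep ‘t’, it can be written as:
\[L = E\left[\sum_{t=0}^{T} \left(q^* - q(a_t)\right)\right]\]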
Some policies which are effective at exploring are:
1. Epsilon Greedy
2. Gradient Algorithm
3. UCB (Upper Confidence Bound)
4. Bayesian
5. Thompson Sampling
Epsilon Greedy is the most basic exploratory policy, and it follows a simple principle to balance exploration and exploitation. It ‘exploits’ the current knowledge of the bandit most of the time, i.e. takes the action with the largest estimated quality. But with a small probability epsilon, it instead explores a random action. The value of epsilon signifies how much you want the agent to explore: the higher the value, the more it explores. But remember that you do not want an agent to explore too much once it has a confident estimate of the reward function, so the value of epsilon should be neither too high nor too low!
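To make the rule concrete, here is a minimal standard-library sketch of Epsilon Greedy on a single Bernoulli bandit. It is an illustration written for this tutorial, not the code of EpsGreedyMABAgent: the function name run_eps_greedy, its arguments and the incremental-mean update are assumptions chosen for the example.

```python
import random


def run_eps_greedy(reward_probs, eps=0.05, timesteps=10000, seed=0):
    """Run epsilon-greedy on one Bernoulli bandit; return estimates and regret."""
    rng = random.Random(seed)
    n_arms = len(reward_probs)
    quality = [0.0] * n_arms   # Q_t(a): running mean reward per arm
    counts = [0] * n_arms      # number of pulls per arm
    best = max(reward_probs)   # q^*, used only to measure regret
    regret = 0.0
    for _ in range(timesteps):
        if rng.random() < eps:
            action = rng.randrange(n_arms)  # explore: uniformly random arm
        else:
            # exploit: arm with the largest current estimate
            action = max(range(n_arms), key=lambda a: quality[a])
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        counts[action] += 1
        # Incremental mean update of the quality estimate.
        quality[action] += (reward - quality[action]) / counts[action]
        # Expected regret of this pull: q^* - q(a_t).
        regret += best - reward_probs[action]
    return quality, counts, regret


quality, counts, regret = run_eps_greedy([0.1, 0.5, 0.9], eps=0.1)
```

Rerunning with different values of eps is a quick way to see the trade-off described above: a larger eps keeps paying for exploration long after the estimates have settled, while eps = 0 can lock on to whichever arm happened to pay out first.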
For the bandit, you can set the number of bandits, the number of arms, and the reward probabilities of each of these arms separately.
Code to train an Epsilon Greedy agent on a Bernoulli Multi-Armed Bandit:
import gym
import numpy as np
from genrl.bandit import BernoulliMAB, EpsGreedyMABAgent, MABTrainer
# Sizes chosen to match the other bandit tutorials.
bandits = 10
arms = 5
reward_probs = np.random.random(size=(bandits, arms))
bandit = BernoulliMAB(bandits, arms, reward_probs, context_type="int")
agent = EpsGreedyMABAgent(bandit, eps=0.05)
trainer = MABTrainer(agent, bandit)
trainer.train(timesteps=10000)
More details can be found in the docs for BernoulliMAB, EpsGreedyMABAgent, MABTrainer.
You can also refer to the book “Reinforcement Learning: An Introduction”, Chapter 2 for further information on bandits.
Contextual Bandits Overview¶
Problem Setting¶
To get some background on the basic multi armed bandit problem, we recommend that you go through the Multi Armed Bandit Overview first. The contextual bandit (CB) problem differs from the basic case in that at each timestep, a context vector \(x \in \mathbb{R}^d\) is presented to the agent. The agent must then decide on an action \(a \in \mathcal{A}\) to take based on that context. After the action is taken, the reward \(r \in \mathbb{R}\) for only that action is revealed to the agent (a feature of all reinforcement learning problems). The aim of the agent remains the same: minimising regret and thus finding an optimal policy.
Here you still have the problem of exploration vs exploitation, but the agent also needs to find some relation between the context and reward.
A Simple Example¶
Let's consider the simplest case of the CB problem. Instead of having only one \(k\)-armed bandit that needs to be solved, say we have \(m\) different \(k\)-armed Bernoulli bandits. At each timestep, the context presented is the number of the bandit for which an action needs to be selected: \(i \in \mathbb{I}\) where \(0 < i \le m\).
Although real-life CB problems usually have much higher dimensional contexts, such a toy problem can be useful for testing and debugging agents.
To instantiate a Bernoulli bandit with \(m = 10\) and \(k = 5\) (10 different 5-armed bandits):
from genrl.bandit import BernoulliMAB
bandit = BernoulliMAB(bandits=10, arms=5, context_type="int")
Note that this is using the same BernoulliMAB as in the simple bandit case, except that instead of letting the bandits argument default to 1, we are explicitly saying we want multiple bandits (a contextual case).
Suppose you want to solve this bandit with a UCB based policy.
from genrl.bandit import UCBMABAgent
agent = UCBMABAgent(bandit)
context = bandit.reset()
action = agent.select_action(context)
new_context, reward = bandit.step(action)
To train the agent, you can set up a loop which calls the update_params method on the agent whenever you want the agent to learn from the actions it has taken. For convenience, it is highly recommended to use the MABTrainer in such cases.
Data based Contextual Bandits¶
Let's consider a more realistic class of CB problem. In real life, the CB setting is usually used to model recommendation or classification problems. Here, instead of getting an integer as the context, you will get a \(d\)-dimensional feature vector \(\mathbf{x} \in \mathbb{R}^d\). This also differs from regular classification, since you only get the reward \(r \in \mathbb{R}\) for the action you have taken.
While tabular solutions can work well for integer contexts (see the implementation of any genrl.bandit.MABAgent for details), when you have a high-dimensional vector, the agent should be able to infer the complex relation between contexts and rewards. This can be done by modelling a conditional distribution over rewards for each action given the context.
There are many ways to do this. For a detailed explanation and comparison of contextual bandit methods you can refer to this paper.
The following are the agents implemented in genrl:
- Linear Posterior Inference
- Neural Network based Linear
- Variational
- Neural Network based Epsilon Greedy
- Bootstrap
- Parameter Noise Sampling
You can find the tutorials for most of these in Bandit Tutorials.
All the methods which use neural networks provide options to train and evaluate with dropout, to use a decaying learning rate and to set a limit for gradient clipping. The sizes of hidden layers for the networks can also be specified. Refer to the docs of the specific agents to see how to use these options.
Individual agents will have other method-specific parameters to control behavior. Although default values have been provided, it may be necessary to tune these for individual use cases.
The following bandits based on datasets are implemented in genrl:
- Adult Census Income Dataset
- US Census Dataset
- Forest Covertype Dataset
- MAGIC Gamma Telescope Dataset
- Mushroom Dataset
- Statlog Space Shuttle Dataset
For each bandit, while instantiating an object you can either specify a path to the data file or pass download=True as an argument to download the data directly.
Data based Bandit Example¶
For this example, we’ll model the Statlog dataset as a bandit problem. You can read more about the bandit in the Statlog docs. In brief, we have the number of arms as \(k = 7\) and the dimension of the context vector as \(d = 9\). The agent will get a reward \(r = 1\) if it selects the correct arm, else \(r = 0\).
from genrl.bandit import StatlogDataBandit
bandit = StatlogDataBandit(download=True)
context = bandit.reset()
Suppose you want to solve this bandit with a neural network based linear posterior policy.
from genrl.bandit import NeuralLinearPosteriorAgent
agent = NeuralLinearPosteriorAgent(bandit)
context = bandit.reset()
action = agent.select_action(context)
new_context, reward = bandit.step(action)
To train the agent, we highly recommend using the DCBTrainer. You can refer to the implementation of the train function to get an idea of how to implement your own training loop.
from genrl.bandit import DCBTrainer
trainer = DCBTrainer(agent, bandit)
trainer.train(timesteps=5000, batch_size=32)
Further material about bandits¶
- Deep Contextual Multi-armed Bandits, Collier and Llorens, 2018
- Deep Bayesian Bandits Showdown, Riquelme et al., 2018
- A Contextual Bandit Bake-off, Bietti et al., 2020
UCB¶
Training a UCB algorithm on a Bernoulli Multi-Armed Bandit¶
For an introduction to Multi Armed Bandits, refer to Multi Armed Bandit Overview.
The UCB algorithm follows a basic principle: ‘Optimism in the face of uncertainty’. What this means is that we favour actions whose reward we are uncertain about. We quantify the uncertainty of taking an action by calculating an upper bound on the quality (reward) of that action. We then select the greedy action with respect to this upper bound.
Hoeffding’s inequality:
\[P\left[q(a) > Q_t(a) + U_t(a)\right] \le e^{-2 N_t(a) U_t(a)^2},\]
where \(q(a)\) is the true quality of that action,
\(Q_t(a)\) is the estimate of the quality of action ‘a’ at time ‘t’,
\(U_t(a)\) is the upper bound for uncertainty for that action at time ‘t’,
\(N_t(a)\) is the number of times action ‘a’ has been selected.
Action taken: \(a = \operatorname{argmax}_a \left(Q_t(a) + U_t(a)\right)\)
Fixing the right-hand side of the inequality to a small probability \(p\) and solving for the bound gives \(U_t(a) = \sqrt{\frac{-\log p}{2 N_t(a)}}\). As we can see, the less an action has been tried, the greater its uncertainty (due to \(N_t(a)\) being in the denominator), which leads to that action having a higher chance of being explored. Also, theoretically, as \(N_t(a)\) goes to infinity, the uncertainty decreases to 0, giving us the true value of the quality of that action: \(q(a)\). This allows us to ‘exploit’ the greedy action \(a^*\) from then on.
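The selection rule can be sketched in a few lines of standard-library Python. This is an illustrative sketch rather than the UCBMABAgent implementation: the function run_ucb is hypothetical, and the bonus used here is the common UCB1 form \(U_t(a) = c \sqrt{2 \log t / N_t(a)}\) with a scaling constant c, which is an assumption; the exact expression used by the library may differ.

```python
import math
import random


def run_ucb(reward_probs, confidence=1.0, timesteps=5000, seed=0):
    """Run a UCB1-style policy on one Bernoulli bandit."""
    rng = random.Random(seed)
    n_arms = len(reward_probs)
    quality = [0.0] * n_arms   # Q_t(a)
    counts = [0] * n_arms      # N_t(a)
    for t in range(1, timesteps + 1):
        if t <= n_arms:
            # Pull every arm once so that N_t(a) > 0 in the bonus below.
            action = t - 1
        else:
            # Greedy action with respect to the upper bound Q_t(a) + U_t(a).
            action = max(
                range(n_arms),
                key=lambda a: quality[a]
                + confidence * math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        counts[action] += 1
        quality[action] += (reward - quality[action]) / counts[action]
    return quality, counts


quality, counts = run_ucb([0.2, 0.5, 0.8])
```

Because the bonus shrinks as an arm's count grows, rarely tried arms keep getting revisited until their upper bound falls below the estimate of the best arm.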
Code to train a UCB agent on a Bernoulli Multi-Armed Bandit:
import gym
import numpy as np
from genrl.bandit import BernoulliMAB, MABTrainer, UCBMABAgent
bandits = 10
arms = 5
reward_probs = np.random.random(size=(bandits, arms))
bandit = BernoulliMAB(bandits, arms, reward_probs, context_type="int")
agent = UCBMABAgent(bandit, confidence=1.0)
trainer = MABTrainer(agent, bandit)
trainer.train(timesteps=10000)
More details can be found in the docs for BernoulliMAB, UCBMABAgent and MABTrainer.
Thompson Sampling¶
Using Thompson Sampling on a Bernoulli Multi-Armed Bandit¶
For an introduction to Multi Armed Bandits, refer to Multi Armed Bandit Overview.
Thompson Sampling is one of the best methods for solving the Bernoulli multi-armed bandit problem. It is a ‘sample-based probability matching’ method.
We initially assume a prior distribution over the quality of each of the arms. We can model this prior using a Beta distribution, parametrised by alpha (\(\alpha\)) and beta (\(\beta\)):
\[PDF(x) = \frac{x^{\alpha - 1} (1 - x)^{\beta - 1}}{B(\alpha, \beta)}\]
Let’s just think of the denominator as some normalising constant, and focus on the numerator for now. We initialise \(\alpha\) = \(\beta\) = 1. This will result in a uniform distribution over the values (0, 1), making all the values of quality from 0 to 1 equally probable, so this is a fair initial assumption. Now think of \(\alpha\) as the number of times we get the reward ‘1’ and \(\beta\) as the number of times we get ‘0’, for a particular arm. As our agent interacts with the environment and gets a reward for pulling any arm, we will update our prior for that arm using Bayes Theorem. What this does is that it gives a posterior distribution over the quality, according to the rewards we have seen so far.
At each timestep, we sample the quality \(Q_t(a)\) for each arm from its posterior and select the arm whose sample has the highest value. The more an action is tried out, the narrower the distribution over its quality becomes, meaning we have a confident estimate of its quality (q(a)). If an action has not been tried out that often, it will have a wider distribution (high variance), meaning we are uncertain about our estimate of its quality (q(a)). This wider variance of an arm with an uncertain estimate creates opportunities for it to be picked during sampling.
As we experience more successes for a particular arm, the value of \(\alpha\) for that arm increases, and similarly, the more failures we experience, the more the value of \(\beta\) increases. The higher the value of one of the parameters compared to the other, the more skewed the distribution is in one direction. For example, if \(\alpha\) = 100 and \(\beta\) = 50, we have seen considerably more successes than failures for this arm, and so our estimate for its quality should be >0.5. This will be reflected in the posterior of this arm, i.e. the mean of the distribution, given by \(\frac{\alpha}{\alpha + \beta}\), will be approximately 0.67, which is >0.5 as we expected.
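The sample-then-update loop described above can be sketched with the standard library alone. This is a minimal illustration, not the ThompsonSamplingMABAgent source: the function run_thompson and its arguments are hypothetical, and it handles a single bandit rather than several contexts.

```python
import random


def run_thompson(reward_probs, alpha=1.0, beta=1.0, timesteps=5000, seed=0):
    """Thompson Sampling with a Beta posterior per arm of a Bernoulli bandit."""
    rng = random.Random(seed)
    n_arms = len(reward_probs)
    a = [alpha] * n_arms   # prior pseudo-count of successes (reward 1) per arm
    b = [beta] * n_arms    # prior pseudo-count of failures (reward 0) per arm
    for _ in range(timesteps):
        # Sample a plausible quality for every arm from its posterior.
        samples = [rng.betavariate(a[i], b[i]) for i in range(n_arms)]
        action = max(range(n_arms), key=lambda i: samples[i])
        # Bayesian update for a Bernoulli reward: increment alpha or beta.
        if rng.random() < reward_probs[action]:
            a[action] += 1.0
        else:
            b[action] += 1.0
    means = [a[i] / (a[i] + b[i]) for i in range(n_arms)]
    return a, b, means


a, b, means = run_thompson([0.2, 0.5, 0.8])
```

With a Beta prior and Bernoulli rewards the posterior stays a Beta distribution, which is why the whole update reduces to incrementing one of two counters.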
Code to use Thompson Sampling on a Bernoulli Multi-Armed Bandit:
import gym
import numpy as np
from genrl.bandit import BernoulliMAB, MABTrainer, ThompsonSamplingMABAgent
bandits = 10
arms = 5
alpha = 1.0
beta = 1.0
reward_probs = np.random.random(size=(bandits, arms))
bandit = BernoulliMAB(bandits, arms, reward_probs, context_type="int")
agent = ThompsonSamplingMABAgent(bandit, alpha, beta)
trainer = MABTrainer(agent, bandit)
trainer.train(timesteps=10000)
More details can be found in the docs for BernoulliMAB, ThompsonSamplingMABAgent and MABTrainer.
Bayesian¶
Using Bayesian Method on a Bernoulli Multi-Armed Bandit¶
For an introduction to Multi Armed Bandits, refer to Multi Armed Bandit Overview.
This method is also based on the principle of ‘Optimism in the face of uncertainty’, like UCB. We initially assume a prior distribution over the quality of each of the arms. We can model this prior using a Beta distribution, parametrised by alpha (\(\alpha\)) and beta (\(\beta\)):
\[PDF(x) = \frac{x^{\alpha - 1} (1 - x)^{\beta - 1}}{B(\alpha, \beta)}\]
Let’s just think of the denominator as some normalising constant, and focus on the numerator for now. We initialise \(\alpha\) = \(\beta\) = 1. This will result in a uniform distribution over the values (0, 1), making all the values of quality from 0 to 1 equally probable, so this is a fair initial assumption. Now think of \(\alpha\) as the number of times we get the reward ‘1’ and \(\beta\) as the number of times we get ‘0’, for a particular arm. As our agent interacts with the environment and gets a reward for pulling any arm, we will update our prior for that arm using Bayes Theorem. What this does is that it gives a posterior distribution over the quality, according to the rewards we have seen so far.
This is quite similar to Thompson Sampling. What is different here is that we explicitly try to calculate the uncertainty of a particular action by calculating the standard deviation (\(\sigma\)) of its posterior. We add this standard deviation to the mean of the posterior, giving us an upper bound on the quality of that arm. At each timestep we select a greedy action based on this upper bound.
As we try out an action more and more, the standard deviation of the posterior decreases, corresponding to a decrease in the uncertainty of that action, which is exactly what we want. If an action has not been tried that often, it will have a wider posterior, meaning higher chances of it getting selected based on its upper bound.
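A minimal sketch of this rule follows, using the closed-form mean and variance of the Beta distribution. It is not the BayesianUCBMABAgent source: the helper beta_mean_std, the function run_bayesian_ucb and the scale multiplier on the standard deviation are assumptions for illustration (scale = 1.0 reproduces the "mean plus one standard deviation" bound described above).

```python
import math
import random


def beta_mean_std(a, b):
    """Mean and standard deviation of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return mean, math.sqrt(var)


def run_bayesian_ucb(reward_probs, alpha=1.0, beta=1.0, scale=1.0,
                     timesteps=5000, seed=0):
    """Pick the arm with the largest posterior mean + scale * posterior std."""
    rng = random.Random(seed)
    n_arms = len(reward_probs)
    a = [alpha] * n_arms
    b = [beta] * n_arms

    def upper_bound(i):
        mean, std = beta_mean_std(a[i], b[i])
        return mean + scale * std

    for _ in range(timesteps):
        action = max(range(n_arms), key=upper_bound)
        # Same Beta-Bernoulli posterior update as in Thompson Sampling.
        if rng.random() < reward_probs[action]:
            a[action] += 1.0
        else:
            b[action] += 1.0
    return a, b


a, b = run_bayesian_ucb([0.2, 0.5, 0.8])
```

Comparing beta_mean_std(1, 1) with beta_mean_std(10, 10) shows the effect described above: the mean stays at 0.5 while the standard deviation shrinks as more pulls are observed.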
Code to use the Bayesian method on a Bernoulli Multi-Armed Bandit:
import gym
import numpy as np
from genrl.bandit import BayesianUCBMABAgent, BernoulliMAB, MABTrainer
bandits = 10
arms = 5
alpha = 1.0
beta = 1.0
reward_probs = np.random.random(size=(bandits, arms))
bandit = BernoulliMAB(bandits, arms, reward_probs, context_type="int")
agent = BayesianUCBMABAgent(bandit, alpha, beta)
trainer = MABTrainer(agent, bandit)
trainer.train(timesteps=10000)
More details can be found in the docs for BernoulliMAB, BayesianUCBMABAgent and MABTrainer.
Gradients¶
Using Gradient Method on a Bernoulli Multi-Armed Bandit¶
For an introduction to Multi Armed Bandits, refer to Multi Armed Bandit Overview.
This method is different from the others. In other methods, we explicitly attempt to estimate the ‘value’ of taking an action (its quality), whereas in this method we approach the problem in a different way. Here, instead of estimating how good an action is through its quality, we only care about its preference of being selected compared to other actions. We denote this preference by \(H_t(a)\). The larger the preference of an action ‘a’, the higher the chances of it being selected, but this preference has no interpretation in terms of the reward for that action. Only the relative preference is important.
The action probabilities are related to these action preferences \(H_t(a)\) by a softmax function. The probability of taking action \(a_j\) is given by:
\[\pi_t(a_j) = \frac{e^{H_t(a_j)}}{\sum_{i=1}^{A} e^{H_t(a_i)}}\]
where, A is the total number of actions and \(\pi_t(a)\) is the probability of taking action ‘a’ at timestep ‘t’.
We initialise the preferences for all the actions to be 0, meaning \(\pi_t(a) = \frac{1}{A}\) for all actions.
After computing \(\pi_t(a)\) for all actions at each timestep, the action is sampled using this probability. Then that action is performed and based on the reward we get, we update our preferences.
The update rule basically performs stochastic gradient ascent:
\(H_{t+1}(a_t) = H_t(a_t) + \alpha (R_t - \bar{R_t})(1 - \pi_t(a_t))\), for \(a_t\): action taken at time ‘t’
\(H_{t+1}(a) = H_t(a) - \alpha (R_t - \bar{R_t})(\pi_t(a))\) for the rest of the actions
where \(\alpha\) is the step size, \(R_t\) is the reward obtained at time ‘t’ and \(\bar{R_t}\) is the mean reward obtained up to time t. If the current reward is larger than the mean reward, we increase our preference for the action taken at time ‘t’. If it is lower than the mean reward, we decrease our preference for that action. The preferences for the rest of the actions are updated in the opposite direction.
For a more detailed mathematical analysis and derivation of the update rule, refer to chapter 2 of Sutton & Barto.
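The softmax policy and the preference update can be sketched as below. This is a standard-library illustration of the gradient bandit algorithm as described in Sutton & Barto, not the GradientMABAgent source: the names softmax and run_gradient_bandit are hypothetical, the baseline is taken as the running mean including the current reward, and no temperature parameter is modelled.

```python
import math
import random


def softmax(prefs):
    """Numerically stable softmax over a list of preferences H_t(a)."""
    m = max(prefs)
    exps = [math.exp(h - m) for h in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def run_gradient_bandit(reward_probs, alpha=0.1, timesteps=5000, seed=0):
    rng = random.Random(seed)
    n_arms = len(reward_probs)
    prefs = [0.0] * n_arms   # H_t(a), initialised to 0 so pi_t(a) = 1/A
    mean_reward = 0.0        # running mean reward, used as the baseline
    for t in range(1, timesteps + 1):
        probs = softmax(prefs)
        action = rng.choices(range(n_arms), weights=probs)[0]
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        mean_reward += (reward - mean_reward) / t
        advantage = reward - mean_reward
        for a in range(n_arms):
            if a == action:
                prefs[a] += alpha * advantage * (1.0 - probs[a])
            else:
                prefs[a] -= alpha * advantage * probs[a]
    return prefs, softmax(prefs)


prefs, probs = run_gradient_bandit([0.2, 0.5, 0.8])
```

Note that the updates across all actions sum to zero at every step, so only the differences between preferences change, which matches the remark that only the relative preference matters.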
Code to use the Gradient method on a Bernoulli Multi-Armed Bandit:
import gym
import numpy as np
from genrl.bandit import BernoulliMAB, GradientMABAgent, MABTrainer
bandits = 10
arms = 5
reward_probs = np.random.random(size=(bandits, arms))
bandit = BernoulliMAB(bandits, arms, reward_probs, context_type="int")
agent = GradientMABAgent(bandit, alpha=0.1, temp=0.01)
trainer = MABTrainer(agent, bandit)
trainer.train(timesteps=10000)
More details can be found in the docs for BernoulliMAB, GradientMABAgent and MABTrainer.
Linear Posterior Inference¶
For an introduction to the Contextual Bandit problem, refer to Contextual Bandits Overview.
In this agent we assume a linear relationship between context and reward distribution of the form
\[r = \mathbf{x}^T \beta + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)\]
We can utilise Bayesian linear regression to find the parameters \(\beta\) and \(\sigma\). Since our agent is continually learning, the parameters of the model will be updated according to the (\(\mathbf{x}\), \(a\), \(r\)) transitions it observes.
For more complex non linear relations, we can make use of neural networks to transform the context into a learned embedding space. The above method can then be used on this latent embedding to model the reward.
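To give a feel for the posterior update, here is a deliberately simplified sketch for a single action and a scalar context, so that no matrix algebra is needed. It assumes the model \(r = x \beta + \epsilon\) with Gaussian noise, a zero-mean Gaussian prior on \(\beta\) with precision lambda_prior (scaled by \(\sigma^2\)) and an inverse gamma prior on \(\sigma^2\) with parameters a0 and b0. The class ScalarLinearPosterior is hypothetical and is not how LinearPosteriorAgent is implemented; a real agent would keep one such posterior per action over vector contexts.

```python
import random


class ScalarLinearPosterior:
    """Normal-inverse-gamma posterior for r = x * beta + noise (scalar x)."""

    def __init__(self, lambda_prior=0.5, a0=2.0, b0=2.0):
        self.lambda_prior = lambda_prior
        self.a0 = a0
        self.b0 = b0
        # Sufficient statistics of the observed (x, r) pairs.
        self.sum_xx = 0.0
        self.sum_xr = 0.0
        self.sum_rr = 0.0
        self.n = 0

    def update(self, x, r):
        self.sum_xx += x * x
        self.sum_xr += x * r
        self.sum_rr += r * r
        self.n += 1

    def posterior(self):
        # Conjugate update with prior mean 0 for beta.
        precision = self.lambda_prior + self.sum_xx
        mean = self.sum_xr / precision
        a = self.a0 + self.n / 2.0
        b = self.b0 + 0.5 * (self.sum_rr - mean * mean * precision)
        return mean, precision, a, b


rng = random.Random(0)
post = ScalarLinearPosterior(lambda_prior=0.5, a0=2.0, b0=2.0)
for _ in range(500):
    x = rng.uniform(-1.0, 1.0)
    r = 2.0 * x + rng.gauss(0.0, 0.1)  # true beta = 2.0, noise std = 0.1
    post.update(x, r)
mean, precision, a, b = post.posterior()
```

To act, an agent in this family would typically sample \(\sigma^2\) and \(\beta\) from these posteriors for every action and pick the action with the highest predicted reward, in the spirit of Thompson Sampling.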
An example of using a neural network based linear posterior agent in genrl:
from genrl.bandit import NeuralLinearPosteriorAgent, DCBTrainer
agent = NeuralLinearPosteriorAgent(bandit, lambda_prior=0.5, a0=2, b0=2, device="cuda")
trainer = DCBTrainer(agent, bandit)
trainer.train()
Note that the priors here are used to parameterise the initial distribution over \(\beta\) and \(\sigma\). More specifically, lambda_prior is used to parameterise a Gaussian distribution for \(\beta\), while a0 and b0 are parameters of an inverse gamma distribution over \(\sigma^2\). These are updated over the course of exploring a bandit. More details can be found in Section 3 of this paper.
All hyperparameters can be tuned for individual use cases to improve training efficiency and achieve convergence faster.
Refer to the LinearPosteriorAgent, NeuralLinearPosteriorAgent and DCBTrainer docs for more details.
Variational Inference¶
For an introduction to the Contextual Bandit problem, refer to Contextual Bandits Overview.
In this method, we try to find a distribution \(P_{\theta}(r \mid \mathbf{x}, a)\) by minimising the KL divergence with the true distribution. For the model we take a neural network where each weight is modelled by an independent Gaussian, also known as a Bayesian Neural Net.
An example of using a variational inference based agent in genrl with a Bayesian net with one hidden layer of 128 neurons and a standard deviation of 0.1 for all the weights:
from genrl.bandit import VariationalAgent, DCBTrainer
agent = VariationalAgent(bandit, hidden_dims=[128], noise_std=0.1, device="cuda")
trainer = DCBTrainer(agent, bandit)
trainer.train()
Refer to the VariationalAgent and DCBTrainer docs for more details.
Bootstrap¶
For an introduction to the Contextual Bandit problem, refer to Contextual Bandits Overview.
In the bootstrap agent multiple different neural network based models are trained simultaneously. Different transition databases are maintained for each model and every time we observe a transition it is added to each dataset with some probability. At each timestep, the model used to select an action is chosen randomly from the set of models.
By having multiple different models initialised with different random weights, we promote the exploration of the loss landscape which may have multiple different local optima.
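The bookkeeping behind the bootstrap idea can be sketched without any neural networks. In the toy below, each "model" is just the mean reward per action computed from its own dataset and the context is ignored; both are simplifications made for illustration. The class BootstrapSketch, its add_prob parameter and its method signatures are hypothetical and do not reflect the BootstrapNeuralAgent API.

```python
import random


class BootstrapSketch:
    """Hypothetical bootstrap agent with one transition dataset per model."""

    def __init__(self, n_models=10, n_actions=3, add_prob=0.8, seed=0):
        self.rng = random.Random(seed)
        self.n_actions = n_actions
        self.add_prob = add_prob
        self.datasets = [[] for _ in range(n_models)]

    def update_db(self, context, action, reward):
        # Add the transition to each dataset independently with some probability.
        for data in self.datasets:
            if self.rng.random() < self.add_prob:
                data.append((context, action, reward))

    def select_action(self, context):
        # Choose one model at random, then act greedily under that model.
        data = self.rng.choice(self.datasets)
        totals = [0.0] * self.n_actions
        counts = [0] * self.n_actions
        for _, action, reward in data:
            totals[action] += reward
            counts[action] += 1
        # Toy "model": mean reward per action; untried actions score 1.0.
        scores = [
            totals[a] / counts[a] if counts[a] else 1.0
            for a in range(self.n_actions)
        ]
        return max(range(self.n_actions), key=lambda a: scores[a])


agent = BootstrapSketch(n_models=10, n_actions=3, add_prob=0.8)
for step in range(200):
    action = agent.select_action(context=None)
    reward = 1.0 if action == 2 else 0.0  # arm 2 is the rewarding arm here
    agent.update_db(None, action, reward)
sizes = [len(d) for d in agent.datasets]
```

Because every model sees a slightly different subset of the data, the models disagree where data is scarce, and picking one at random turns that disagreement into exploration.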
An example of using a bootstrap based agent in genrl with 10 models, each with a hidden layer of 128 neurons, which also uses dropout for training:
from genrl.bandit import BootstrapNeuralAgent, DCBTrainer
agent = BootstrapNeuralAgent(bandit, hidden_dims=[128], n=10, dropout_p=0.5, device="cuda")
trainer = DCBTrainer(agent, bandit)
trainer.train()
Refer to the BootstrapNeuralAgent and DCBTrainer docs for more details.
Parameter Noise Sampling¶
For an introduction to the Contextual Bandit problem, refer to Contextual Bandits Overview.
One of the ways to improve exploration of our algorithms is to introduce noise into the weights of the neural network while selecting actions. This does not affect the gradients but will have a similar effect to epsilon greedy exploration.
The noise distribution is regularly updated during training to keep the KL divergence of the prediction and noise predictions within certain limits.
An example of using a noise sampling based agent in genrl with noise standard deviation as 0.1, KL divergence limit as 0.1 and batch size for updating the noise distribution as 128:
from genrl.bandit import NeuralNoiseSamplingAgent, DCBTrainer
agent = NeuralNoiseSamplingAgent(bandit, hidden_dims=[128], noise_std_dev=0.1, eps=0.1, noise_update_batch_size=128, device="cuda")
trainer = DCBTrainer(agent, bandit)
trainer.train()
Refer to the NeuralNoiseSamplingAgent and DCBTrainer docs for more details.
Adding a new Data Bandit¶
The bandit submodule, like all of genrl, has been designed to be easily extensible for custom additions. This tutorial will show how to create a dataset based bandit which will work with the rest of genrl.bandit.
For this tutorial, we will use the Wine dataset, which is a simple dataset often used for testing classifiers. It has 178 examples, each with 14 features: the first gives the cultivar of the wine (the class we need to assign each wine sample to, which can be one of three) and the rest give the properties of the wine itself. Formulated as a bandit problem, we have a bandit with 3 arms and a 13-dimensional context. The agent will get a reward of 1 if it selects the correct arm, else 0.
To start off, let's import the necessary modules, specify the data URL and make a class which inherits from genrl.utils.data_bandits.base.DataBasedBandit:
from typing import Tuple

import pandas as pd
import torch

from genrl.utils.data_bandits.base import DataBasedBandit
from genrl.utils.data_bandits.utils import download_data

URL = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"

class WineDataBandit(DataBasedBandit):
    # Method bodies are filled in over the rest of this tutorial.
    def __init__(self, **kwargs):
        ...
    def reset(self) -> torch.Tensor:
        ...
    def _compute_reward(self, action: int) -> Tuple[int, int]:
        ...
    def _get_context(self) -> torch.Tensor:
        ...
We will need to implement __init__, reset, _compute_reward and _get_context to make the class functional.
For dataset based bandits, we can generally load the data into memory during initialisation. This can be held in some tabular form (numpy.array, torch.Tensor or pandas.DataFrame) alongside an index into its rows. When reset, the bandit would set its index to 0 and reshuffle the rows of the table. For stepping, the bandit can compute rewards from the current row of the table, as given by the index, and then increment the index to move to the next row.
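The index-and-shuffle pattern just described can be sketched independently of genrl. The class below is a hypothetical stand-in that keeps rows in a plain Python list; its name, its (label, features) row layout and the wrap-around at the end of the table are assumptions for illustration, not behaviour of DataBasedBandit.

```python
import random


class ToyDataBandit:
    """Hypothetical stand-in showing the index-and-shuffle pattern."""

    def __init__(self, rows, seed=0):
        # Each row is (label, features); the label is the correct arm.
        self._rows = list(rows)
        self._rng = random.Random(seed)
        self.idx = 0

    def reset(self):
        # Reshuffle the table and return the context of the first row.
        self._rng.shuffle(self._rows)
        self.idx = 0
        return self._rows[self.idx][1]

    def step(self, action):
        # Reward is computed from the current row, then the index advances.
        label, _ = self._rows[self.idx]
        reward = int(action == label)
        self.idx = (self.idx + 1) % len(self._rows)  # wrap-around: an assumption
        return self._rows[self.idx][1], reward


rows = [(0, [0.1, 0.2]), (1, [0.3, 0.4]), (2, [0.5, 0.6])]
bandit = ToyDataBandit(rows)
context = bandit.reset()
correct_arm = bandit._rows[bandit.idx][0]
new_context, reward = bandit.step(correct_arm)
```

The WineDataBandit built in the rest of this tutorial follows the same shape, with the table held in a pandas.DataFrame and the stepping logic inherited from the base class.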
Let's start with __init__. Here we need to download the data if specified and load it into memory. Many utility functions are available in genrl.utils.data_bandits.utils, including download_data to download data from a URL as well as functions to fetch data from memory.
For most cases, you can load the data into a pandas.DataFrame. You also need to specify the n_actions, context_dim and len here.
def __init__(self, **kwargs):
    super(WineDataBandit, self).__init__(**kwargs)
    path = kwargs.get("path", "./data/Wine/")
    download = kwargs.get("download", None)
    force_download = kwargs.get("force_download", None)
    url = kwargs.get("url", URL)
    if download:
        path = download_data(path, url, force_download)
    self._df = pd.read_csv(path, header=None)
    self.n_actions = len(self._df[0].unique())
    self.context_dim = self._df.shape[1] - 1
    self.len = len(self._df)
The reset method will shuffle the indices of the data and return the counting index to 0. You must have a call to _reset here (which is implemented in the base class) to reset any metrics, counters, etc.
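def reset(self) -> torch.Tensor:
    self._reset()
    # Shuffle in place so that _compute_reward and _get_context, which read
    # self._df, see the reshuffled rows.
    self._df = self._df.sample(frac=1).reset_index(drop=True)
    return self._get_context()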
The new bandit does not explicitly need to implement the step method since this is already implemented in the base class. We do, however, need to implement _compute_reward and _get_context, which step uses.
In _compute_reward, we need to figure out whether the given action corresponds to the correct label for this index and return the reward accordingly. This method also returns the maximum possible reward in the current context, which is used to compute regret.
def _compute_reward(self, action: int) -> Tuple[int, int]:
    label = self._df.iloc[self.idx, 0]
    # Labels in the data are 1-based while actions are 0-based.
    r = int(label == (action + 1))
    return r, 1
The _get_context method should return a 13-dimensional torch.Tensor (in this case) corresponding to the context for the current index.
def _get_context(self) -> torch.Tensor:
    return torch.tensor(
        self._df.iloc[self.idx, 1:].values,
        device=self.device,
        dtype=torch.float,
    )
Once you are done with the above, you can use the WineDataBandit class like you would any other bandit from genrl.utils.data_bandits. You can use it with any of the cb_agents as well as train on it with genrl.bandit.DCBTrainer.
Adding a new Deep Contextual Bandit Agent¶
The bandit submodule, like all of genrl, has been designed to be easily extensible for custom additions. This tutorial shows how to create a deep contextual bandit agent that works with the rest of genrl.bandit.
For the purpose of this tutorial we will consider a simple neural-network-based agent. Although this is a simplistic agent, implementing an agent of any complexity involves the following steps.
To start off, let's import the necessary modules and make a class which inherits from genrl.agents.bandits.contextual.base.DCBAgent.
from typing import Optional

import torch

from genrl.agents.bandits.contextual.base import DCBAgent
from genrl.agents.bandits.contextual.common import NeuralBanditModel, TransitionDB
from genrl.utils.data_bandits.base import DataBasedBandit


class NeuralAgent(DCBAgent):
    """Deep contextual bandit agent based on a neural network."""

    def __init__(self, bandit: DataBasedBandit, **kwargs):
        ...

    def select_action(self, context: torch.Tensor) -> int:
        ...

    def update_db(self, context: torch.Tensor, action: int, reward: int):
        ...

    def update_params(
        self,
        action: Optional[int] = None,
        batch_size: int = 512,
        train_epochs: int = 20,
    ):
        ...
We will need to implement __init__, select_action, update_db and update_params to make the class functional.
Let's start off with __init__. Here we will need to initialise some required parameters (init_pulls, eval_with_dropout, t and update_count) along with our transition database and the neural network. For the neural network, you can use the NeuralBanditModel class. It packages together many of the functionalities a neural network might require. Refer to the docs for more details.
def __init__(self, bandit: DataBasedBandit, **kwargs):
    super(NeuralAgent, self).__init__(bandit, **kwargs)
    self.model = (
        NeuralBanditModel(
            context_dim=self.context_dim,
            n_actions=self.n_actions,
            **kwargs
        )
        .to(torch.float)
        .to(self.device)
    )
    self.eval_with_dropout = kwargs.get("eval_with_dropout", False)
    self.db = TransitionDB(self.device)
    self.t = 0
    self.update_count = 0
For the select_action function, the agent passes the context vector through the neural network to produce logits for each action and then selects the action with the highest logit value. Note that it must also increment the timestep and, initially, take every action at least init_pulls times.
def select_action(self, context: torch.Tensor) -> int:
    """Selects action for a given context"""
    self.model.use_dropout = self.eval_with_dropout
    self.t += 1
    if self.t < self.n_actions * self.init_pulls:
        return torch.tensor(
            self.t % self.n_actions, device=self.device, dtype=torch.int
        )
    results = self.model(context)
    action = torch.argmax(results["pred_rewards"]).to(torch.int)
    return action
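The control flow of select_action can be tried out without torch or GenRL. The toy class below is only a sketch: GreedyWithWarmup is a hypothetical name, and the predicted rewards are passed in as a plain list instead of coming from a neural network.

```python
from typing import List


class GreedyWithWarmup:
    """Toy stand-in for the action-selection logic (no GenRL or torch needed)."""

    def __init__(self, n_actions: int, init_pulls: int):
        self.n_actions = n_actions
        self.init_pulls = init_pulls
        self.t = 0

    def select_action(self, predicted_rewards: List[float]) -> int:
        self.t += 1
        # Warm-up phase: cycle through the actions in round-robin order.
        if self.t < self.n_actions * self.init_pulls:
            return self.t % self.n_actions
        # Afterwards: pick the action with the highest predicted reward.
        return max(range(self.n_actions), key=lambda a: predicted_rewards[a])
```

With 3 actions and init_pulls of 2, the first five calls cycle through the actions and every later call is greedy.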
For updating the database we can use the add method of the TransitionDB class.
def update_db(self, context: torch.Tensor, action: int, reward: int):
    """Updates transition database."""
    self.db.add(context, action, reward)
In update_params we need to train the model on the observations seen so far. Since the NeuralBanditModel class already has a train_model function, we just need to call that. However, if you are writing your own model, this is where the updates to the parameters would happen.
def update_params(
    self,
    action: Optional[int] = None,
    batch_size: int = 512,
    train_epochs: int = 20,
):
    """Update parameters of the agent."""
    self.update_count += 1
    self.model.train_model(self.db, train_epochs, batch_size)
Note that some of these functions have unused arguments. The signatures were chosen this way to ensure generality over all classes of algorithms.
Once you are done with the above, you can use the NeuralAgent
class
like you would any other agent from genrl.bandit
. You can use it
with any of the bandits as well as training it with
genrl.bandit.DCBTrainer.
Classical¶
Q-Learning using GenRL¶
What is Q-Learning?¶
Q-Learning is one of the stepping stones for many reinforcement learning algorithms like DQN. AlphaGo is also one of the famous examples that uses Q-Learning at its heart.
Essentially, an RL agent takes an action on the environment, collects rewards and updates its policy, and over time gets better at collecting higher rewards.
In Q-Learning, we generally maintain a “Q-table” of Q-values by mapping them to a (state, action) pair.
A natural question is: what are these Q-values? A Q-value is nothing but the “quality” of an action taken from a particular state. The higher the Q-value, the higher the chances of getting a better reward.
The Q-table is often initialized with random values or with zeros, and as the agent collects rewards by performing actions on the environment, we update this Q-table at the \(i\)th step using the following formulation:
\[Q(s, a) \leftarrow (1 - \alpha) \, Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') \right]\]
Here \(\alpha\) is the learning rate in ML terms, \(\gamma\) is the discount factor for the rewards, \(r\) is the reward received, and \(s'\) is the state reached after taking action \(a\) from state \(s\).
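As a rough illustration of the tabular update described above, the sketch below blends the old Q-value with the reward plus the discounted best Q-value of the next state. It is a standalone toy, not GenRL's QLearning class; the function name q_learning_update and the table layout (a dict of per-state lists) are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Hashable, List


def q_learning_update(
    q_table: Dict[Hashable, List[float]],
    state: Hashable,
    action: int,
    reward: float,
    next_state: Hashable,
    lr: float = 0.1,
    gamma: float = 0.6,
) -> float:
    """Apply one tabular Q-Learning update in place and return the new Q-value."""
    # Bootstrapped target: reward plus discounted best Q-value of the next state.
    target = reward + gamma * max(q_table[next_state])
    q_table[state][action] = (1 - lr) * q_table[state][action] + lr * target
    return q_table[state][action]


# Q-table with 4 actions per state, initialised to zeros.
q_table = defaultdict(lambda: [0.0] * 4)
new_value = q_learning_update(q_table, state=0, action=2, reward=1.0, next_state=1)
```

Because the target takes the maximum over next actions regardless of which action the agent actually takes next, this update is off-policy.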
FrozenLake-v0 environment¶
To demonstrate how easy it is to train a Q-Learning agent in GenRL, we use a very simple gym environment.
Description of the environment (from the documentation):
“The agent controls the movement of a character in a grid world. Some tiles of the grid are walkable, and others lead to the agent falling into the water. Additionally, the movement direction of the agent is uncertain and only partially depends on the chosen direction. The agent is rewarded for finding a walkable path to a goal tile.
Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you’ll fall into the freezing water. At this time, there’s an international frisbee shortage, so it’s absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won’t always move in the direction you intend.
The surface is described using a grid like the following:
SFFF (S: starting point, safe)
FHFH (F: frozen surface, safe)
FFFH (H: hole, fall to your doom)
HFFG (G: goal, where the frisbee is located)
The episode ends when you reach the goal or fall in a hole. You receive a reward of 1 if you reach the goal, and zero otherwise.”
Code¶
Let’s import all the useful stuff first.
import gym
from genrl import QLearning # for the agent
from genrl.classical.common import Trainer # for training the agent
Now that we have imported everything we need, let’s go ahead and define the environment, the agent and an object of the Trainer class.
env = gym.make("FrozenLake-v0")
agent = QLearning(env, gamma=0.6, lr=0.1, epsilon=0.1)
trainer = Trainer(
    agent,
    env,
    model="tabular",
    n_episodes=3000,
    start_steps=100,
    evaluate_frequency=100,
)
Great, so far so good! Training now only requires calling the train method of the trainer.
trainer.train()
trainer.evaluate()
That’s it! You have successfully trained a Q-Learning agent. You can now go ahead and play with your own environments using GenRL!
SARSA using GenRL¶
What is SARSA?¶
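SARSA is an acronym for State-Action-Reward-State-Action. It is an on-policy TD control method. Our aim is to estimate the Q-value, or utility value, of each state-action pair using the TD update rule given below.
\[Q(s, a) \leftarrow (1 - \alpha) \, Q(s, a) + \alpha \left[ r + \gamma Q(s', a') \right]\]
Here \(\alpha\) is the learning rate, \(\gamma\) is the discount factor, \(r\) is the reward received, \(s'\) is the state reached after taking action \(a\) from state \(s\), and \(a'\) is the action the current policy selects in \(s'\).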
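The on-policy nature of SARSA shows up in a single line: the target uses the Q-value of the action actually chosen in the next state rather than the best one. The sketch below is a standalone toy, not GenRL's SARSA class; the name sarsa_update and the dict-based table are assumptions for illustration.

```python
from typing import Dict, Hashable, List


def sarsa_update(
    q_table: Dict[Hashable, List[float]],
    state: Hashable,
    action: int,
    reward: float,
    next_state: Hashable,
    next_action: int,
    lr: float = 0.1,
    gamma: float = 0.6,
) -> float:
    """Apply one tabular SARSA update in place and return the new Q-value."""
    # On-policy target: uses the next action the policy actually selected.
    target = reward + gamma * q_table[next_state][next_action]
    q_table[state][action] = (1 - lr) * q_table[state][action] + lr * target
    return q_table[state][action]


q_table = {0: [0.0, 0.0], 1: [0.5, 2.0]}
# The next action is 0 (Q = 0.5), even though action 1 has the larger Q-value.
new_value = sarsa_update(q_table, 0, 1, 1.0, 1, 0)
```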
FrozenLake-v0 environment¶
To demonstrate how easy it is to train a SARSA agent in GenRL, we use a very simple gym environment.
Description of the environment (from the documentation):
“The agent controls the movement of a character in a grid world. Some tiles of the grid are walkable, and others lead to the agent falling into the water. Additionally, the movement direction of the agent is uncertain and only partially depends on the chosen direction. The agent is rewarded for finding a walkable path to a goal tile.
Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you’ll fall into the freezing water. At this time, there’s an international frisbee shortage, so it’s absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won’t always move in the direction you intend.
The surface is described using a grid like the following:
SFFF (S: starting point, safe)
FHFH (F: frozen surface, safe)
FFFH (H: hole, fall to your doom)
HFFG (G: goal, where the frisbee is located)
The episode ends when you reach the goal or fall in a hole. You receive a reward of 1 if you reach the goal, and zero otherwise.”
Code¶
Let’s import all the useful stuff first.
import gym
from genrl import SARSA # for the agent
from genrl.classical.common import Trainer # for training the agent
Now that we have imported everything we need, let’s go ahead and define the environment, the agent and an object of the Trainer class.
env = gym.make("FrozenLake-v0")
agent = SARSA(env, gamma=0.6, lr=0.1, epsilon=0.1)
trainer = Trainer(
    agent,
    env,
    model="tabular",
    n_episodes=3000,
    start_steps=100,
    evaluate_frequency=100,
)
Great, so far so good! Training now only requires calling the train method of the trainer.
trainer.train()
trainer.evaluate()
That’s it! You have successfully trained a SARSA agent. You can now go ahead and play with your own environments using GenRL!
Deep RL Tutorials¶
Deep Reinforcement Learning Background¶
Background¶
The goal of Reinforcement Learning Algorithms is to maximize reward. This is usually achieved by having a policy \(\pi_{\theta}\) perform optimal behavior. Let’s denote this optimal policy by \(\pi_{\theta}^{*}\). For ease, we define the Reinforcement Learning problem as a Markov Decision Process.
Markov Decision Process¶
A Markov Decision Process (MDP) is defined by \((S, A, r, P_{a})\) where,
 \(S\) is a set of States.
 \(A\) is a set of Actions.
 \(r : S \rightarrow \mathbb{R}\) is a reward function.
 \(P_{a}(s, s')\) is the transition probability that action \(a\) in state \(s\) leads to state \(s'\).
Often we define two functions, a policy function \(\pi_{\theta}(s,a)\) and a value function \(V_{\pi_{\theta}}(s)\).
Policy Function¶
The policy is the agent’s strategy, and our goal is to make it optimal. The optimal policy is usually denoted by \(\pi_{\theta}^{*}\). There are usually 2 types of policies:
Stochastic Policy¶
The Policy Function defines a probability distribution over actions given states, i.e. the likelihood of every action when an agent is in a particular state. Formally,
\[\pi_{\theta}(s, a) = \mathbb{P}[a \mid s, \theta]\]
Deterministic Policy¶
The Policy Function maps from States directly to Actions.
Value Function¶
The Value Function is defined as the expected return obtained when we follow a policy \(\pi\) starting from state \(s\). Usually two types of value function are defined: the State Value Function and the State Action Value Function.
State Value Function¶
The State Value Function is defined as the expected return starting from only State s.
State Action Value Function¶
The Action Value Function is defined as the expected return starting from a state s and taking an action a.
The Action Value Function is also known as the Quality Function as it would denote how good a particular action is for a state s.
Approximators¶
Neural Networks are often used as approximators for Policy and Value Functions. In such a case, we say these are parameterised by \(\theta\), for example \(\pi_{\theta}\).
Objective¶
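The objective is to choose/learn a policy that will maximize a cumulative function of the rewards received at each step, typically the discounted reward over a potentially infinite horizon. We formulate this cumulative function as
\[\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right]\]
where \(\gamma\) is the discount factor, \(r_{t}\) is the reward received at step \(t\), and we choose an action according to our policy, \(a_{t} = \pi_{\theta}(s_{t})\).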
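For a single finite episode, the discounted cumulative reward can be computed with a short backward pass over the rewards. This is a minimal sketch; the function name discounted_return and the default gamma are assumptions for illustration, not GenRL API.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float = 0.99) -> float:
    """Sum of gamma**t * rewards[t] over one episode."""
    total = 0.0
    # Work backwards so each step needs only one multiply and one add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

For example, three rewards of 1 with gamma = 0.5 give 1 + 0.5 + 0.25 = 1.75.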
Vanilla Policy Gradient¶
For background on Deep RL, its core definitions and problem formulations, refer to Deep RL Background.
Objective¶
The objective is to choose/learn a policy that will maximize a cumulative function of the rewards received at each step, typically the discounted reward over a potentially infinite horizon. We formulate this cumulative function as
\[\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right]\]
where \(\gamma\) is the discount factor, \(r_{t}\) is the reward received at step \(t\), and we choose the action \(a_{t} = \pi_{\theta}(s_{t})\).
Algorithm Details¶
Collect Experience¶
To make our agent learn, we first need to collect some experience in an online fashion. For this we make use of the collect_rollouts
method. This method is defined in the OnPolicyAgent
Base Class.
For the update, we need to compute advantages from this experience, so we store our experience in a Rollout Buffer.
Action Selection¶
Note: We sample a stochastic action from the distribution over the action space by passing False as an argument to select_action.
For practical purposes, we assume that we are working with a finite-horizon MDP.
Update Equations¶
Let \(\pi_{\theta}\) denote a policy with parameters \(\theta\), and \(J(\pi_{\theta})\) denote the expected finite-horizon undiscounted return of the policy.
At each update timestep, we get value and log probabilities:
Now that we have the log probabilities, we calculate the gradient of \(J(\pi_{\theta})\) as:
where \(\tau\) is the trajectory.
We then update the policy parameters via stochastic gradient ascent:
The key idea underlying vanilla policy gradients is to push up the probabilities of actions that lead to higher return, and push down the probabilities of actions that lead to lower return, until you arrive at the optimal policy.
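The "push up, push down" idea can be seen on a toy softmax policy over raw logits for a single state. The sketch below is not GenRL's VPG implementation, which uses a neural-network policy and a rollout buffer. The names softmax and policy_gradient_step are hypothetical, and the step simply ascends the gradient of return times log-probability.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    shifted = [x - max(logits) for x in logits]
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(
    logits: List[float], action: int, ret: float, lr: float = 0.1
) -> List[float]:
    """One gradient-ascent step on ret * log pi(action) for a softmax policy."""
    probs = softmax(logits)
    # d/d logit_k of log pi(action) = 1[k == action] - pi(k)
    grad_log_prob = [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]
    return [x + lr * ret * g for x, g in zip(logits, grad_log_prob)]


logits = [0.0, 0.0]
before = softmax(logits)[1]
logits = policy_gradient_step(logits, action=1, ret=2.0)
after = softmax(logits)[1]
```

A positive return raises the probability of the sampled action; a negative return lowers it.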
Training through the API¶
import gym
from genrl.agents import VPG
from genrl.trainers import OnPolicyTrainer
from genrl.environments import VectorEnv
env = VectorEnv("CartPole-v0")
agent = VPG('mlp', env)
trainer = OnPolicyTrainer(agent, env, log_mode=['stdout'])
trainer.train()
timestep Episode loss mean_reward
0 0 9.1853 22.3825
20480 10 24.5517 80.3137
40960 20 24.4992 117.7011
61440 30 22.578 121.543
81920 40 20.423 114.7339
102400 50 21.7225 128.4013
122880 60 21.0566 116.034
143360 70 21.628 115.0562
163840 80 23.1384 133.4202
184320 90 23.2824 133.4202
204800 100 26.3477 147.87
225280 110 26.7198 139.7952
245760 120 30.0402 184.5045
266240 130 30.293 178.8646
286720 140 29.4063 162.5397
307200 150 30.9759 183.6771
327680 160 30.6517 186.1818
348160 170 31.7742 184.5045
368640 180 30.4608 186.1818
389120 190 30.2635 186.1818
Advantage Actor Critic¶
For background on Deep RL, its core definitions and problem formulations refer to Deep RL Background
Objective¶
The objective is to maximize the discounted cumulative reward function:
This comprises two parts in the Advantage Actor Critic algorithm:
 To choose/learn a policy that increases the probability of taking an action whose expected return is higher than the value of the state, and decreases the probability of taking an action whose expected return is lower than the value of the state. The Advantage is computed as: \(A(s, a) = Q(s, a) - V(s)\)
 To learn a State Action Value Function (the Critic) that estimates the future cumulative rewards given the current state and action. This function helps the policy evaluate potential state-action pairs.
where we choose the action \(a_{t} = \pi_{\theta}(s_{t})\).
Algorithm Details¶
Action Selection and Values¶
ac here is an object of the ActorCritic class, which defines two methods: get_value and get_action. They return the value approximation from the Critic and the action from the Actor, respectively.
Note: We sample a stochastic action from the distribution over the action space by passing False as an argument to select_action.
For practical purposes, we assume that we are working with a finite-horizon MDP.
Collect Experience¶
To make our agent learn, we first need to collect some experience in an online fashion. For this we make use of the collect_rollouts
method. This method is defined in the OnPolicyAgent
Base Class.
For the update, we need to compute advantages from this experience, so we store our experience in a Rollout Buffer.
Compute discounted Returns and Advantages¶
Next we can compute the advantages and the actual discounted returns for each state. This can be done by calling compute_returns_and_advantage. Note that this implementation of the rollout buffer is borrowed from Stable Baselines.
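To show what such a computation involves, here is a simplified sketch for a single finished episode, using plain Monte Carlo returns and advantage = return - value. It is not the library's compute_returns_and_advantage, which may use a different estimator; the function name returns_and_advantages is an assumption for this example.

```python
from typing import List, Tuple


def returns_and_advantages(
    rewards: List[float], values: List[float], gamma: float = 0.99
) -> Tuple[List[float], List[float]]:
    """Discounted returns for one finished episode and advantages = return - value."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Accumulate the discounted return from the last step backwards.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    advantages = [g - v for g, v in zip(returns, values)]
    return returns, advantages


returns, advantages = returns_and_advantages(
    [1.0, 1.0, 1.0], [2.0, 1.0, 0.5], gamma=0.5
)
```

A positive advantage means the observed return beat the critic's estimate for that state, so the corresponding action should be made more likely.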
Update Equations¶
Let \(\pi_{\theta}\) denote a policy with parameters \(\theta\), and \(J(\pi_{\theta})\) denote the expected finite-horizon undiscounted return of the policy.
At each update timestep, we get value and log probabilities:
Now that we have the log probabilities, we calculate the gradient of \(J(\pi_{\theta})\) as:
where \(\tau\) is the trajectory.
We then update the policy parameters via stochastic gradient ascent:
The key idea underlying Advantage Actor Critic Algorithm is to push up the probabilities of actions that lead to higher return than the expected return of that state, and push down the probabilities of actions that lead to lower return than the expected return of that state, until you arrive at the optimal policy.
Training through the API¶
import gym
from genrl.agents import A2C
from genrl.trainers import OnPolicyTrainer
from genrl.environments import VectorEnv
env = VectorEnv("CartPole-v0")
agent = A2C('mlp', env)
trainer = OnPolicyTrainer(agent, env, log_mode=['stdout'])
trainer.train()
Proximal Policy Optimization¶
For background on Deep RL, its core definitions and problem formulations refer to Deep RL Background
Objective¶
The objective is to maximize the discounted cumulative reward function:
The Proximal Policy Optimization algorithm is very similar to the Advantage Actor Critic algorithm, except that we multiply the advantages by the ratio between the probability of the actions at update time and at experience-collection time (computed from the stored log probabilities). This helps establish a trust region, so that the new policy does not move too far from the old one, while still taking gradient-ascent steps in the direction of actions that result in positive advantages.
where we choose the action \(a_{t} = \pi_{\theta}(s_{t})\).
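One widely used way to turn this ratio into an objective is the clipped surrogate from the PPO paper. The sketch below assumes that variant; the name ppo_clip_objective and the clip_param default of 0.2 are assumptions, not values taken from GenRL.

```python
import math
from typing import List


def ppo_clip_objective(
    new_log_probs: List[float],
    old_log_probs: List[float],
    advantages: List[float],
    clip_param: float = 0.2,
) -> float:
    """Mean clipped surrogate objective (to be maximised)."""
    terms = []
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio recovered from the two log probabilities.
        ratio = math.exp(new_lp - old_lp)
        clipped_ratio = min(max(ratio, 1.0 - clip_param), 1.0 + clip_param)
        # Taking the minimum removes any incentive to move the ratio outside the clip range.
        terms.append(min(ratio * adv, clipped_ratio * adv))
    return sum(terms) / len(terms)
```

With a ratio of 1.8 and a positive advantage of 1, the objective is capped at 1.2, so pushing the policy further in that direction earns nothing extra.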
Algorithm Details¶
Action Selection and Values¶
ac here is an object of the ActorCritic class, which defines two methods: get_value and get_action. They return the value approximation from the Critic and the action from the Actor, respectively.
Note: We sample a stochastic action from the distribution over the action space by passing False as an argument to select_action.
For practical purposes, we assume that we are working with a finite-horizon MDP.
Collect Experience¶
To make our agent learn, we first need to collect some experience in an online fashion. For this we make use of the collect_rollouts
method. This method is defined in the OnPolicyAgent
Base Class.
For the update, we need to compute advantages from this experience, so we store our experience in a Rollout Buffer.
Compute discounted Returns and Advantages¶
Next we can compute the advantages and the actual discounted returns for each state. This can be done by calling compute_returns_and_advantage. Note that this implementation of the rollout buffer is borrowed from Stable Baselines.
Update Equations¶
Let \(\pi_{\theta}\) denote a policy with parameters \(\theta\), and \(J(\pi_{\theta})\) denote the expected finite-horizon undiscounted return of the policy.
At each update timestep, we get value and log probabilities:
In the case of PPO our loss function is:
where \(\tau\) is the trajectory.
We then update the policy parameters via stochastic gradient ascent:
Training through the API¶
import gym
from genrl.agents import PPO1
from genrl.trainers import OnPolicyTrainer
from genrl.environments import VectorEnv
env = VectorEnv("CartPole-v0")
agent = PPO1('mlp', env)
trainer = OnPolicyTrainer(agent, env, log_mode=['stdout'])
trainer.train()
Deep Q-Networks (DQN)¶
For background on Deep RL, its core definitions and problem formulations refer to Deep RL Background
Objective¶
DQN uses the concept of Q-learning. When the state space is too large, a large number of epochs is required to explore and update the Q-value of every state even once. Hence, we make use of function approximators. DQN uses a neural network as a function approximator, and the objective is to get as close to the Bellman Expectation of the Q-value function as possible. This is done by minimising the loss function, which is defined as
Unlike regular Q-learning, DQNs need more stability while updating, so we often use a second neural network, which we call our target model.
Algorithm Details¶
Epsilon-Greedy Action Selection¶
We choose the greedy action with a probability of \(1 - \epsilon\) and the rest of the time, we sample the action randomly. During evaluation, we use only greedy actions to judge how well the agent performs.
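A minimal standalone sketch of this selection rule is shown below. The function name epsilon_greedy and the optional rng argument are assumptions for illustration; GenRL's own agents operate on tensors.

```python
import random
from typing import List, Optional


def epsilon_greedy(
    q_values: List[float], epsilon: float, rng: Optional[random.Random] = None
) -> int:
    """Pick a random action with probability epsilon, else the greedy one."""
    rng = rng or random
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Setting epsilon to 0 gives the purely greedy behaviour used during evaluation.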
Experience Replay¶
Whenever an experience is played through (during the training loop), the experience is stored in what we call a Replay Buffer.
def log(self, timestep: int) -> None:
    """Helper function to log

    Sends useful parameters to the logger.

    Args:
        timestep (int): Current timestep of training
    """
    self.logger.write(
        {
            "timestep": timestep,
            "Episode": self.episodes,
            **self.agent.get_logging_params(),
            "Episode Reward": safe_mean(self.training_rewards),
The transitions are later sampled in batches from the replay buffer for updating the network.
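The store-then-sample pattern can be sketched with a bounded deque. This is an illustrative toy, not GenRL's buffer class; the ReplayBuffer name, the Transition alias and the uniform sampling are assumptions for this example.

```python
import random
from collections import deque
from typing import Deque, List, Tuple

# (state, action, reward, next_state, done)
Transition = Tuple[int, int, float, int, bool]


class ReplayBuffer:
    """Fixed-capacity buffer that drops the oldest transitions first."""

    def __init__(self, capacity: int):
        self.memory: Deque[Transition] = deque(maxlen=capacity)

    def push(self, transition: Transition) -> None:
        self.memory.append(transition)

    def sample(self, batch_size: int) -> List[Transition]:
        # Uniform sampling without replacement breaks temporal correlation.
        return random.sample(list(self.memory), batch_size)

    def __len__(self) -> int:
        return len(self.memory)
```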
Update Q-value Network¶
Once our Replay Buffer has enough experiences, we start updating the Q-value networks in the following code according to the above objective.
for timestep in range(0, self.max_timesteps, self.env.n_envs):
    self.agent.update_params_before_select_action(timestep)
    action = self.get_action(state, timestep)
    next_state, reward, done, info = self.env.step(action)
    if self.render:
        self.env.render()
    # true_dones contains the "true" value of the dones (game over statuses). It is set
    # to False when the environment is not actually done but instead reaches the max
    # episode length.
    true_dones = [info[i]["done"] for i in range(self.env.n_envs)]
    self.buffer.push((state, action, reward, next_state, true_dones))
    state = next_state.detach().clone()
    if self.check_game_over_status(done):
        self.noise_reset()
        if self.episodes % self.log_interval == 0:
            self.log(timestep)
        if self.episodes == self.epochs:
            break
    if timestep >= self.start_update and timestep % self.update_interval == 0:
        self.agent.update_params(self.update_interval)
    if (
        timestep >= self.start_update
        and self.save_interval != 0
        and timestep % self.save_interval == 0
    ):
        self.save(timestep)
self.env.close()
self.logger.close()
The function get_q_values calculates the Q-values of the states sampled from the replay buffer. The get_target_q_values function computes the Q-values of the same states as calculated by the target network. The update_params function calculates the MSE loss between the Q-values and the target Q-values and updates the network using Stochastic Gradient Descent.
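The arithmetic behind these three functions can be sketched with plain lists. This is only an illustration under the usual DQN assumptions (a bootstrapped target from the target network, with no bootstrap on terminal transitions); td_targets and mse_loss are hypothetical names, not the GenRL functions mentioned above.

```python
from typing import List


def td_targets(
    rewards: List[float],
    dones: List[bool],
    next_q_target: List[List[float]],
    gamma: float = 0.99,
) -> List[float]:
    """r + gamma * max_a' Q_target(s', a'), with no bootstrap on terminal steps."""
    return [
        r + gamma * (0.0 if done else max(q_next))
        for r, done, q_next in zip(rewards, dones, next_q_target)
    ]


def mse_loss(predictions: List[float], targets: List[float]) -> float:
    """Mean squared error between predicted and target Q-values."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


targets = td_targets([1.0, 0.0], [False, True], [[0.5, 2.0], [3.0, 1.0]], gamma=0.9)
loss = mse_loss([2.0, 0.5], targets)
```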
Training through the API¶
from genrl.agents import DQN
from genrl.environments import VectorEnv
from genrl.trainers import OffPolicyTrainer
env = VectorEnv("CartPole-v0")
agent = DQN("mlp", env)
trainer = OffPolicyTrainer(agent, env, max_timesteps=20000)
trainer.train()
trainer.evaluate()
Variants of DQN¶
Some of the other variants of DQN that we have implemented in the repo are:
 Double DQN
 Dueling DQN
 Prioritized Replay DQN
 Noisy DQN
 Categorical DQN
For some extensions of the DQN (like DoubleDQN), we have provided the methods in a file under genrl/agents/dqn/utils.py.
class DuelingDQN(DQN):
    def __init__(self, *args, **kwargs):
        super(DuelingDQN, self).__init__(*args, **kwargs)
        self.dqn_type = "dueling"  # You can choose "noisy" for NoisyDQN and "categorical" for CategoricalDQN
        self._create_model()

    def get_target_q_values(self, *args):
        return ddqn_q_target(self, *args)
The above two snippets define the same class. You can find similar APIs for the other variants in the genrl/deep/agents/dqn folder.
Double Deep Q-Network¶
Objective¶
Double DQN builds upon the notion of Double Q-Learning and extends it to Deep Q-Networks. We use function approximators to predict the Q-values of the states, and a function approximator is always corrupted by some noise. When we maximise over the values of state-action pairs while calculating the target for the TD update, the maximum is taken over the true values plus the noise. In expectation, the maximum of the noisy values is at least as large as the maximum of the true values:
\[\mathbb{E}[\max(X_1, X_2)] \geq \max(\mathbb{E}[X_1], \mathbb{E}[X_2])\]
where \(X_1\) and \(X_2\) are two random variables. This leads to overestimation of the values of state-action pairs and consequently to suboptimal action selection. This overestimation is bound to propagate and grow over the course of multiple updates, because the same approximator is used both to select the maximising action and to estimate its Q-value.
This problem can be solved by decoupling the action selection and the value estimation, using two separate function approximators (and hence different noise distributions) for the two purposes, which is what Double DQN does. The loss function is defined as:
Algorithm Details¶
Epsilon-Greedy Action Selection¶
The action exploration is stochastic: the greedy action is chosen with a probability of \(1 - \epsilon\) and the rest of the time, we sample the action randomly. During evaluation, we use only greedy actions to judge how well the agent performs.
Experience Replay¶
Every transition occurring during training is stored in a separate Replay Buffer.
def log(self, timestep: int) -> None:
    """Helper function to log

    Sends useful parameters to the logger.

    Args:
        timestep (int): Current timestep of training
    """
    self.logger.write(
        {
            "timestep": timestep,
            "Episode": self.episodes,
            **self.agent.get_logging_params(),
            "Episode Reward": safe_mean(self.training_rewards),
The transitions are later sampled in batches from the replay buffer for updating the network.
Update the Q-Network¶
Double DQN decouples the selection of the action from the evaluation of the Q-values while calculating the target value for the update. The loss function for a time step t is defined as:
The only thing that differs with DoubleDQN is the get_target_q_values function as shown below.
import torch

from genrl.agents import DQN
from genrl.trainers import OffPolicyTrainer


class DoubleDQN(DQN):
    def __init__(self, *args, **kwargs):
        super(DoubleDQN, self).__init__(*args, **kwargs)
        self._create_model()

    def get_target_q_values(self, next_states, rewards, dones):
        # Select the best next actions with the online network ...
        next_q_value_dist = self.model(next_states)
        next_best_actions = torch.argmax(next_q_value_dist, dim=1).unsqueeze(1)
        rewards, dones = rewards.unsqueeze(1), dones.unsqueeze(1)
        # ... and evaluate them with the target network.
        next_q_target_value_dist = self.target_model(next_states)
        max_next_q_target_values = next_q_target_value_dist.gather(2, next_best_actions)
        target_q_values = rewards + self.gamma * torch.mul(
            max_next_q_target_values, (1 - dones)
        )
        return target_q_values
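The decoupling is easier to see on a single transition with plain lists. The sketch below is a simplified illustration and not the GenRL method above; the name double_dqn_target is an assumption, and the Q-values are passed in directly instead of being produced by networks.

```python
from typing import List


def double_dqn_target(
    reward: float,
    done: bool,
    next_q_online: List[float],
    next_q_target: List[float],
    gamma: float = 0.99,
) -> float:
    """Select the next action with the online values, evaluate it with the target values."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]
```

In the usage below, a plain DQN target would use the largest target value (5.0), whereas the Double DQN target uses the target value of the action preferred by the online network.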
Training through the API¶
from genrl.agents import DoubleDQN
from genrl.environments import VectorEnv
from genrl.trainers import OffPolicyTrainer
env = VectorEnv("CartPole-v0")
agent = DoubleDQN("mlp", env)
trainer = OffPolicyTrainer(agent, env, max_timesteps=20000)
trainer.train()
trainer.evaluate()
timestep Episode value_loss epsilon Episode Reward
24 0.0 0 0.9766 0
720 25.0 0 0.5184 26.96
1168 50.0 0.49 0.1646 18.6
3248 75.0 4.1546 0.0326 74.88
7512 100.0 7.3164 0.0102 166.36
12424 125.0 12.3175 0.01 200.0
Evaluated for 10 episodes, Mean Reward: 200.0, Std Deviation for the Reward: 0.0
Dueling Deep Q-Network¶
Objective¶
The main objective of DQN is to learn a function approximator for the Q-function using a neural network. This is done by training the approximator to get as close to the Bellman Expectation of the Q-value function as possible by minimising the loss which is defined as:
Dueling Deep Q-Network modifies the architecture of a simple DQN into one better suited for model-free RL.
Algorithm Details¶
Network architecture¶
The Dueling DQN architecture splits the single stream of fully connected layers in a normal DQN into two separate streams: one representing the value function and the other representing the advantage function.
The advantage for a state-action pair represents how beneficial it is to take an action over others when in a particular state. The dueling architecture can learn which states are or are not valuable without having to learn the effect of each action for each state. This is useful in instances when taking any action would not affect the environment in any significant way.
Another layer combines the value stream and the advantage stream to get the Q-values.
Combining the value and the advantage streams¶
 Value Function : \(V(s; \theta, \beta)\)
 Advantage Function : \(A(s, a; \theta, \alpha)\)
where \(\theta\) denotes the parameters of the underlying convolutional layers, whereas \(\alpha\) and \(\beta\) are the parameters of the two separate streams of fully connected layers.
The two streams cannot simply be added (\(Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + A(s, a; \theta, \alpha)\)) to get the Q-values because:
 \(Q(s, a; \theta, \alpha, \beta)\) is only a parameterized estimate of the true Q-function
 It would be wrong to assume that \(V(s; \theta, \beta)\) and \(A(s, a; \theta, \alpha)\) are reasonable estimates of the value and the advantage functions
To address these concerns, we train in a way that forces the expected value of the advantage function to be zero (the expectation of the advantage is always zero).
Thus, the combining module combines the value and advantage streams to get the Q-values in the following fashion:
\[Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left( A(s, a; \theta, \alpha) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a'; \theta, \alpha) \right)\]
where \(|\mathcal{A}|\) is the number of actions.
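One common way to combine the two streams, used in the original dueling architecture paper, is to subtract the mean advantage before adding the state value. The sketch below assumes that variant and works on plain floats; combine_streams is a hypothetical name, not a GenRL function.

```python
from typing import List


def combine_streams(value: float, advantages: List[float]) -> List[float]:
    """Q(s, a) = V(s) + (A(s, a) - mean over actions of A(s, .))."""
    mean_advantage = sum(advantages) / len(advantages)
    return [value + (a - mean_advantage) for a in advantages]
```

After this combination the Q-values average to the state value, so the centred advantages have zero mean by construction.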
Epsilon-Greedy Action Selection¶
Similar to a normal DQN, the action exploration is stochastic: the greedy action is chosen with a probability of \(1 - \epsilon\) and the rest of the time, we sample the action randomly. During evaluation, we use only greedy actions to judge how well the agent performs.
Experience Replay¶
Every transition occurring during training is stored in a separate Replay Buffer.
def log(self, timestep: int) -> None:
    """Helper function to log

    Sends useful parameters to the logger.

    Args:
        timestep (int): Current timestep of training
    """
    self.logger.write(
        {
            "timestep": timestep,
            "Episode": self.episodes,
            **self.agent.get_logging_params(),
            "Episode Reward": safe_mean(self.training_rewards),
The transitions are later sampled in batches from the replay buffer for updating the network
Update the Q Network¶
Once enough transitions are stored in the replay buffer, we start updating the Q-values according to the given objective. The loss function is defined in a fashion similar to a DQN. This allows any new improvements in DQN training techniques, such as Double DQN or NoisyNet DQN, to be readily adapted to the dueling architecture.
for timestep in range(0, self.max_timesteps, self.env.n_envs):
    self.agent.update_params_before_select_action(timestep)
    action = self.get_action(state, timestep)
    next_state, reward, done, info = self.env.step(action)
    if self.render:
        self.env.render()
    # true_dones contains the "true" value of the dones (game over statuses). It is set
    # to False when the environment is not actually done but instead reaches the max
    # episode length.
    true_dones = [info[i]["done"] for i in range(self.env.n_envs)]
    self.buffer.push((state, action, reward, next_state, true_dones))
    state = next_state.detach().clone()
    if self.check_game_over_status(done):
        self.noise_reset()
        if self.episodes % self.log_interval == 0:
            self.log(timestep)
        if self.episodes == self.epochs:
            break
    if timestep >= self.start_update and timestep % self.update_interval == 0:
        self.agent.update_params(self.update_interval)
    if (
        timestep >= self.start_update
        and self.save_interval != 0
        and timestep % self.save_interval == 0
    ):
        self.save(timestep)
self.env.close()
self.logger.close()
Training through the API¶
from genrl.agents import DuelingDQN
from genrl.environments import VectorEnv
from genrl.trainers import OffPolicyTrainer
env = VectorEnv("CartPole-v0")
agent = DuelingDQN("mlp", env)
trainer = OffPolicyTrainer(agent, env, max_timesteps=20000)
trainer.train()
trainer.evaluate()
Deep Q Networks with Noisy Nets¶
Objective¶
NoisyNet DQN is a variant of DQN which uses fully connected layers with noisy parameters to drive exploration. Thus, the parametrised action-value function can now be seen as a random variable. The new loss function which needs to be minimised is defined as:
where \(\zeta\) is a set of learnable parameters for the noise.
Algorithm Details¶
Action Selection¶
Action selection is no longer epsilon-greedy, since exploration is driven by the noise in the neural network layers. Actions are selected greedily.
Noisy Parameters¶
A noisy parameter \(\theta\) is defined as:
\[\theta := \mu + \Sigma \odot \epsilon\]
where \(\Sigma\) and \(\mu\) are vectors of trainable parameters, \(\epsilon\) is a vector of zero-mean noise and \(\odot\) denotes element-wise multiplication. Hence, the loss function is now defined with respect to \(\Sigma\) and \(\mu\), and the optimization takes place with respect to \(\Sigma\) and \(\mu\). \(\epsilon\) is sampled from factorised Gaussian noise.
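The following sketch illustrates how a noisy weight matrix could be assembled from factorised Gaussian noise. It assumes the scaling function \(f(x) = \mathrm{sgn}(x)\sqrt{|x|}\) from the NoisyNet paper; the helper names (`f`, `factorised_noise`, `noisy_weights`) are hypothetical, and `sigma` plays the role of \(\Sigma\) above. It is not the layer GenRL ships.

```python
import math
import random

def f(x):
    # Scaling applied to each noise sample: sgn(x) * sqrt(|x|)
    return math.copysign(math.sqrt(abs(x)), x)

def factorised_noise(n_in, n_out, rng=random):
    # One noise vector per input and per output instead of one sample per weight
    eps_in = [f(rng.gauss(0.0, 1.0)) for _ in range(n_in)]
    eps_out = [f(rng.gauss(0.0, 1.0)) for _ in range(n_out)]
    # The outer product yields an (n_out x n_in) noise matrix
    return [[eo * ei for ei in eps_in] for eo in eps_out]

def noisy_weights(mu, sigma, eps):
    # theta = mu + sigma * eps, applied element-wise
    return [[m + s * e for m, s, e in zip(mr, sr, er)]
            for mr, sr, er in zip(mu, sigma, eps)]

mu = [[0.5, -0.5], [1.0, 0.0]]
sigma_zero = [[0.0, 0.0], [0.0, 0.0]]
eps = factorised_noise(n_in=2, n_out=2)
theta = noisy_weights(mu, sigma_zero, eps)  # zero sigma: no noise is injected
```

With `sigma` set to zero the layer reduces to an ordinary deterministic layer, which shows that the amount of exploration is itself learned through `sigma`.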
Experience Replay¶
Every transition occurring during training is stored in a separate Replay Buffer.
def log(self, timestep: int) -> None:
    """Helper function to log

    Sends useful parameters to the logger.

    Args:
        timestep (int): Current timestep of training
    """
    self.logger.write(
        {
            "timestep": timestep,
            "Episode": self.episodes,
            **self.agent.get_logging_params(),
            "Episode Reward": safe_mean(self.training_rewards),

The transitions are later sampled in batches from the replay buffer to update the network.
Update the Q-Network¶
Once enough transitions are stored in the replay buffer, we start updating the Q-values according to the given objective. The loss function is defined in a fashion similar to a DQN. This allows any new improvements in DQN training techniques, such as Double DQN or NoisyNet DQN, to be readily adapted to the dueling architecture.
for timestep in range(0, self.max_timesteps, self.env.n_envs):
    self.agent.update_params_before_select_action(timestep)

    action = self.get_action(state, timestep)
    next_state, reward, done, info = self.env.step(action)

    if self.render:
        self.env.render()

    # true_dones contains the "true" value of the dones (game over statuses). It is set
    # to False when the environment is not actually done but instead reaches the max
    # episode length.
    true_dones = [info[i]["done"] for i in range(self.env.n_envs)]
    self.buffer.push((state, action, reward, next_state, true_dones))

    state = next_state.detach().clone()

    if self.check_game_over_status(done):
        self.noise_reset()

        if self.episodes % self.log_interval == 0:
            self.log(timestep)

        if self.episodes == self.epochs:
            break

    if timestep >= self.start_update and timestep % self.update_interval == 0:
        self.agent.update_params(self.update_interval)

    if (
        timestep >= self.start_update
        and self.save_interval != 0
        and timestep % self.save_interval == 0
    ):
        self.save(timestep)

self.env.close()
self.logger.close()

Training through the API¶
from genrl.agents import NoisyDQN
from genrl.environments import VectorEnv
from genrl.trainers import OffPolicyTrainer
env = VectorEnv("CartPole-v0")
agent = NoisyDQN("mlp", env)
trainer = OffPolicyTrainer(agent, env, max_timesteps=20000)
trainer.train()
trainer.evaluate()
Prioritized Deep Q-Networks¶
Objective¶
The main motivation behind using prioritized experience replay over uniformly sampled experience replay stems from the fact that an agent may be able to learn more from some transitions than others. In uniformly sampled experience replay, some transitions which might not be very useful for the agent or that might be redundant will be replayed with the same frequency as those having more learning potential. Prioritized experience replay solves this problem by replaying more useful transitions more frequently.
The loss function for prioritized DQN is defined as
Algorithm Details¶
Epsilon-Greedy Action Selection¶
Action exploration is stochastic: the greedy action is chosen with a probability of \(1 - \epsilon\), and the rest of the time the action is sampled randomly. During evaluation, only greedy actions are used to judge how well the agent performs.
Prioritized Experience Replay¶
The replay buffer is no longer sampled uniformly; it is sampled according to the priority of each transition. Transitions with greater scope for learning are assigned higher priorities. The priority of a transition is decided using its TD-error, since the magnitude of the TD-error can be interpreted as how unexpected the transition is.
The transition with the maximum TD-error is given the maximum priority. Every new transition is given the highest priority to ensure that each transition is considered at least once.
Sampling transitions greedily has some disadvantages: transitions with a low TD-error on the first replay might never be sampled again, the chance of overfitting is higher because only a small set of high-priority transitions is replayed over and over, and the scheme is sensitive to noise spikes. To tackle these problems, instead of sampling transitions greedily every time, we use a stochastic approach wherein each transition is assigned a probability with which it is sampled. The sampling probability is defined as
\[P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}\]
where \(p_i > 0\) is the priority of transition \(i\). \(\alpha\) determines the amount of prioritization done. The priority of the transition can be defined in the following two ways:
 \(p_i = |\delta_i| + \epsilon\)
 \(p_i = \frac{1}{rank(i)}\)
where \(\delta_i\) is the TD-error of transition \(i\), \(\epsilon\) is a small positive constant that ensures the sampling probability is not zero for any transition, and \(rank(i)\) is the rank of the transition when the replay buffer is sorted with respect to priorities.
We also use importance sampling (IS) weights to correct the bias introduced by prioritized experience replay.
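The proportional variant and the IS correction can be sketched as follows. The exponent `beta` and the normalisation of the weights by their maximum follow the prioritized experience replay paper and are assumptions here, since this page does not state them; the function names are hypothetical and this is not GenRL's buffer code.

```python
def sampling_probabilities(td_errors, alpha=0.6, eps=1e-6):
    # Proportional priorities: p_i = |delta_i| + eps, then P(i) = p_i^alpha / sum_k p_k^alpha
    priorities = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]

def importance_weights(probs, beta=0.4):
    # w_i = (1 / (N * P(i)))^beta, scaled so that the largest weight is 1
    n = len(probs)
    weights = [(1.0 / (n * p)) ** beta for p in probs]
    max_w = max(weights)
    return [w / max_w for w in weights]

probs = sampling_probabilities([2.0, 0.5, 0.0])
weights = importance_weights(probs)
```

Transitions that are sampled more often receive smaller weights, which counteracts the over-representation that prioritization introduces.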
Update the Q-value Networks¶
The importance sampling weights can be folded into the Q-learning update by using \(w_i\delta_i\) instead of \(\delta_i\). Once the Replay Buffer has enough experiences, we start updating the Q-value networks in the following code according to the above objective.
for timestep in range(0, self.max_timesteps, self.env.n_envs):
    self.agent.update_params_before_select_action(timestep)

    action = self.get_action(state, timestep)
    next_state, reward, done, info = self.env.step(action)

    if self.render:
        self.env.render()

    # true_dones contains the "true" value of the dones (game over statuses). It is set
    # to False when the environment is not actually done but instead reaches the max
    # episode length.
    true_dones = [info[i]["done"] for i in range(self.env.n_envs)]
    self.buffer.push((state, action, reward, next_state, true_dones))

    state = next_state.detach().clone()

    if self.check_game_over_status(done):
        self.noise_reset()

        if self.episodes % self.log_interval == 0:
            self.log(timestep)

        if self.episodes == self.epochs:
            break

    if timestep >= self.start_update and timestep % self.update_interval == 0:
        self.agent.update_params(self.update_interval)

    if (
        timestep >= self.start_update
        and self.save_interval != 0
        and timestep % self.save_interval == 0
    ):
        self.save(timestep)

self.env.close()
self.logger.close()

Training through the API¶
from genrl.agents import PrioritizedReplayDQN
from genrl.environments import VectorEnv
from genrl.trainers import OffPolicyTrainer
env = VectorEnv("CartPole-v0")
agent = PrioritizedReplayDQN("mlp", env)
trainer = OffPolicyTrainer(agent, env, max_timesteps=20000)
trainer.train()
trainer.evaluate()
Deep Deterministic Policy Gradients¶
Objective¶
Deep Deterministic Policy Gradients (DDPG) is a model-free actor-critic algorithm which deals with continuous action spaces. One simple approach to dealing with continuous action spaces is to discretize the action space. However, this gives rise to several problems, the most significant being that the size of the action space increases exponentially with the number of degrees of freedom. DDPG builds on Deterministic Policy Gradients to learn deterministic policies in high-dimensional continuous action spaces.
Algorithm Details¶
Deterministic Policy Gradient¶
With continuous action spaces, using a Q-learning-like approach (greedy policy improvement) to learn deterministic policies is not feasible, since it involves selecting the action with the maximum action value at every step, and it is not possible to check the action value of every possible action in a continuous action space.
This problem can be solved by considering the fact that a policy can be improved by moving it in the direction of increasing action-value function:
Action Selection¶
To ensure sufficient exploration, noise is added to the action selected using the current policy. The noise is sampled from a noise process \(\mathcal{N}\):
\[a_t = \mu(s_t \mid \theta^\mu) + \mathcal{N}_t\]
\(\mathcal{N}\) can be chosen to suit the environment (e.g. an Ornstein-Uhlenbeck process, Gaussian noise, etc.).
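As an illustration of one such noise process, here is a small sketch of a discretised Ornstein-Uhlenbeck process together with a clipped noisy action. The class name, its default parameters (`theta`, `sigma`, `dt`) and the helper `noisy_action` are assumptions made for this example and do not describe GenRL's own noise classes.

```python
import random

class OrnsteinUhlenbeckNoise:
    """Mean-reverting noise: dx = theta * (mean - x) * dt + sigma * sqrt(dt) * N(0, 1)."""

    def __init__(self, mean=0.0, theta=0.15, sigma=0.2, dt=1e-2, x0=None, seed=None):
        self.mean, self.theta, self.sigma, self.dt = mean, theta, sigma, dt
        self.x0 = mean if x0 is None else x0
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        # Restart the process, e.g. at the beginning of an episode
        self.x = self.x0

    def __call__(self):
        drift = self.theta * (self.mean - self.x) * self.dt
        diffusion = self.sigma * (self.dt ** 0.5) * self.rng.gauss(0.0, 1.0)
        self.x += drift + diffusion
        return self.x

def noisy_action(policy_action, noise, low, high):
    # Add exploration noise, then clip to the valid action range
    return max(low, min(high, policy_action + noise()))

noise = OrnsteinUhlenbeckNoise(sigma=0.0, x0=1.0)
first = noise()  # with sigma = 0 the state only drifts towards the mean
action = noisy_action(0.5, OrnsteinUhlenbeckNoise(seed=0), -1.0, 1.0)
```

Because successive samples are correlated, this kind of noise tends to push the agent in one direction for several steps, which can help in control tasks with momentum.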
def select_action(
    self, state: torch.Tensor, deterministic: bool = True
) -> torch.Tensor:
    """Select action given state

    Deterministic Action Selection with Noise

    Args:
        state (:obj:`torch.Tensor`): Current state of the environment
        deterministic (bool): Should the policy be deterministic or stochastic

    Returns:
        action (:obj:`torch.Tensor`): Action taken by the agent
    """
    action, _ = self.ac.get_action(state, deterministic)
    action = action.detach()

    # add noise to output from policy network
    if self.noise is not None:
        action += self.noise()

    return torch.clamp(
        action, self.env.action_space.low[0], self.env.action_space.high[0]
    )

Experience Replay¶
Similar to DQNs, DDPG, being an off-policy algorithm, makes use of Replay Buffers. Whenever a transition \((s_t, a_t, r_t, s_{t+1})\) is encountered, it is stored in the replay buffer. Batches of these transitions are sampled while updating the network parameters. This helps break the strong correlation between updates that would be present if transitions were trained on and discarded immediately after being encountered, and it also helps avoid rapidly forgetting rare transitions that may be useful later on.
def log(self, timestep: int) -> None:
    """Helper function to log

    Sends useful parameters to the logger.

    Args:
        timestep (int): Current timestep of training
    """
    self.logger.write(
        {
            "timestep": timestep,
            "Episode": self.episodes,
            **self.agent.get_logging_params(),
            "Episode Reward": safe_mean(self.training_rewards),

Update the Value and Policy Networks¶
DDPG makes use of target networks for the actor (policy) and the critic (value) networks to stabilise training. The Q-network is updated using TD-learning updates. The target and the loss function for the same are defined as:
Building on Deterministic Policy Gradients, the gradient of the policy can be determined using the action-value function as
The target networks are updated at regular intervals.
for timestep in range(0, self.max_timesteps, self.env.n_envs):
    self.agent.update_params_before_select_action(timestep)

    action = self.get_action(state, timestep)
    next_state, reward, done, info = self.env.step(action)

    if self.render:
        self.env.render()

    # true_dones contains the "true" value of the dones (game over statuses). It is set
    # to False when the environment is not actually done but instead reaches the max
    # episode length.
    true_dones = [info[i]["done"] for i in range(self.env.n_envs)]
    self.buffer.push((state, action, reward, next_state, true_dones))

    state = next_state.detach().clone()

    if self.check_game_over_status(done):
        self.noise_reset()

        if self.episodes % self.log_interval == 0:
            self.log(timestep)

        if self.episodes == self.epochs:
            break

    if timestep >= self.start_update and timestep % self.update_interval == 0:
        self.agent.update_params(self.update_interval)

    if (
        timestep >= self.start_update
        and self.save_interval != 0
        and timestep % self.save_interval == 0
    ):
        self.save(timestep)

self.env.close()
self.logger.close()

Training through the API¶
from genrl.agents import DDPG
from genrl.environments import VectorEnv
from genrl.trainers import OffPolicyTrainer
env = VectorEnv("MountainCarContinuous-v0")
agent = DDPG("mlp", env)
trainer = OffPolicyTrainer(agent, env, max_timesteps=20000)
trainer.train()
trainer.evaluate()
Twin Delayed DDPG¶
Objective¶
Similar to Deep Q-Networks, the problem of overestimation of the state values, occurring due to noisy function approximators and the use of the same function approximator for action selection and value estimation, also persists in actor-critic algorithms with continuous action spaces. Double DQN, the solution for this problem in Deep Q-Networks, is not effective in actor-critic algorithms due to the slow rate of change of the policy. Twin Delayed DDPG (TD3) uses Clipped Double Q-Learning to address this problem. TD3 uses two Q-function approximators, and the loss function for each is given by
Algorithm Details¶
Clipped Double Q-Learning¶
Double DQNs are not effective in actor-critic algorithms due to the slow change in the policy, and the original Double Q-Learning (although somewhat effective) does not completely solve the problem of overestimation. To tackle this, TD3 uses Clipped Double Q-Learning. Clipped Double Q-Learning proposes to upper-bound the less biased critic network by the more biased one, so that no additional overestimation can be introduced. Although this may introduce underestimation, it is not much of a concern since underestimation errors don’t propagate through policy updates. The target function calculated using Clipped Double Q-Learning for the updates can be written as
Both of the critic networks are updated using the loss functions mentioned above.
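In scalar form, the clipped target takes the smaller of the two target-critic estimates for the next state. The sketch below assumes the usual one-step TD target with discount `gamma`; the function name and signature are hypothetical and chosen only for illustration.

```python
def clipped_double_q_target(reward, done, next_q1, next_q2, gamma=0.99):
    # Use the more pessimistic of the two target-critic estimates
    min_q = min(next_q1, next_q2)
    # No bootstrapping from terminal states
    return reward + gamma * (1.0 - float(done)) * min_q

y = clipped_double_q_target(reward=1.0, done=False, next_q1=10.0, next_q2=8.0, gamma=0.5)
y_terminal = clipped_double_q_target(reward=1.0, done=True, next_q1=10.0, next_q2=8.0)
```

Both critics regress towards the same target `y`, which is why an overestimate in one critic cannot inflate the target on its own.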
Experience Replay¶
TD3, being an off-policy algorithm, makes use of a Replay Buffer. Whenever a transition \((s_t, a_t, r_t, s_{t+1})\) is encountered, it is stored in the replay buffer. Batches of these transitions are sampled while updating the network parameters. This helps break the strong correlation between updates that would be present if transitions were trained on and discarded immediately after being encountered, and it also helps avoid rapidly forgetting rare transitions that may be useful later on.
def log(self, timestep: int) -> None:
    """Helper function to log

    Sends useful parameters to the logger.

    Args:
        timestep (int): Current timestep of training
    """
    self.logger.write(
        {
            "timestep": timestep,
            "Episode": self.episodes,
            **self.agent.get_logging_params(),
            "Episode Reward": safe_mean(self.training_rewards),

Target Policy Smoothing Regularization¶
TD3 adds noise to the target action to reduce the variance induced by function approximation error. This acts as a form of regularization which smooths the changes in the action values along changes in action.
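A scalar sketch of the smoothing step is shown below. It assumes clipped Gaussian noise as in the TD3 paper; the parameter names `noise_std` and `noise_clip` and their default values are illustrative and are not taken from GenRL.

```python
import random

def clip(x, low, high):
    return max(low, min(high, x))

def smoothed_target_action(target_action, low, high,
                           noise_std=0.2, noise_clip=0.5, rng=random):
    # Clipped Gaussian noise keeps the perturbed action close to the original
    noise = clip(rng.gauss(0.0, noise_std), -noise_clip, noise_clip)
    # The perturbed action must still lie inside the valid action range
    return clip(target_action + noise, low, high)

a = smoothed_target_action(0.9, low=-1.0, high=1.0)
```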
Delayed Policy updates¶
TD3 uses target networks, similar to DDPG and DQNs, for the two critics and the actor to stabilise learning. Apart from this, it also updates the policy network at a lower frequency than the Q-networks to avoid divergent behaviour of the policy. The paper recommends one policy update for every two Q-function updates.
def update_params(self, update_interval: int) -> None:
    """Update parameters of the model

    Args:
        update_interval (int): Interval between successive updates of the target model
    """
    for timestep in range(update_interval):
        batch = self.sample_from_buffer()

        value_loss = self.get_q_loss(batch)

        self.optimizer_value.zero_grad()
        value_loss.backward()
        self.optimizer_value.step()

        # Delayed Update
        if timestep % self.policy_frequency == 0:
            policy_loss = self.get_p_loss(batch.states)

            self.optimizer_policy.zero_grad()
            policy_loss.backward()
            self.optimizer_policy.step()

            self.logs["policy_loss"].append(policy_loss.item())

        self.logs["value_loss"].append(value_loss.item())

Training through the API¶
from genrl.agents import TD3
from genrl.environments import VectorEnv
from genrl.trainers import OffPolicyTrainer
env = VectorEnv("MountainCarContinuous-v0")
agent = TD3("mlp", env)
trainer = OffPolicyTrainer(agent, env, max_timesteps=4000)
trainer.train()
trainer.evaluate()
Soft Actor-Critic¶
Objective¶
Deep Reinforcement Learning algorithms suffer from two main problems: high sample complexity (large amounts of data are needed) and brittleness with respect to learning rates, exploration constants and other hyperparameters. Algorithms such as DDPG and Twin Delayed DDPG are used to tackle the challenge of high sample complexity in actor-critic frameworks with continuous action spaces. However, they still suffer from brittle stability with respect to their hyperparameters. Soft Actor-Critic introduces an actor-critic framework for continuous action spaces wherein the standard objective of reinforcement learning, i.e. maximizing the expected cumulative reward, is augmented with an additional objective of entropy maximization, which provides a substantial improvement in exploration and robustness. The objective can be mathematically represented as
where \(\alpha\), also known as the temperature parameter, determines the relative importance of the entropy term against the reward and thus controls the stochasticity of the optimal policy, and \(\mathcal{H}\) represents the entropy function. The entropy of a random variable \(x\) following a probability distribution \(P\) is defined as
\[\mathcal{H}(P) = \mathbb{E}_{x \sim P}[-\log P(x)]\]
Algorithm Details¶
Soft Actor-Critic is mostly used in two variants, depending on whether the temperature \(\alpha\) is kept constant throughout the learning process or is learned as a parameter over the course of learning. GenRL uses the latter.
Action-Value Networks¶
SAC learns a policy \(\pi_\theta\) and two Q-functions \(Q_{\phi_1}, Q_{\phi_2}\) and their target networks concurrently. The two Q-functions are learned in a fashion similar to TD3, where a common target is used for both Q-functions and Clipped Double Q-learning is used to train the networks. However, unlike TD3, the next-state actions used in the target are calculated using the current policy. Since the optimisation objective also involves maximising the entropy, the new Q-value can be expressed as
Thus, the action-value for one state-action pair can be approximated as
where \(\tilde{a}'\) (action taken in next state) is sampled from the policy.
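To make the entropy-augmented target concrete, here is a scalar sketch that combines Clipped Double Q-learning with the entropy bonus \(-\alpha \log \pi(\tilde{a}' \mid s')\). The function name and arguments are hypothetical, and the values of `alpha` and `gamma` are arbitrary; it is a sketch of the commonly used soft target and not a copy of GenRL's code.

```python
def soft_q_target(reward, done, next_q1, next_q2, next_log_prob,
                  alpha=0.2, gamma=0.99):
    # Pessimistic estimate from the two target critics
    min_q = min(next_q1, next_q2)
    # Entropy bonus: subtracting alpha * log pi rewards unlikely (high-entropy) actions
    soft_value = min_q - alpha * next_log_prob
    return reward + gamma * (1.0 - float(done)) * soft_value

y = soft_q_target(reward=1.0, done=False, next_q1=4.0, next_q2=3.0,
                  next_log_prob=-2.0, alpha=0.5, gamma=0.5)
```

Here `next_log_prob` is the log-probability of the next action under the current policy, so the next action must be freshly sampled from the policy and not read from the replay buffer.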
Experience Replay¶
SAC also uses a Replay Buffer like other off-policy algorithms. Whenever a transition \((s_t, a_t, r_t, s_{t+1})\) is encountered, it is stored in the replay buffer. Batches of these transitions are sampled while updating the network parameters. This helps break the strong correlation between updates that would be present if transitions were trained on and discarded immediately after being encountered, and it also helps avoid rapidly forgetting rare transitions that may be useful later on.
def log(self, timestep: int) -> None:
    """Helper function to log

    Sends useful parameters to the logger.

    Args:
        timestep (int): Current timestep of training
    """
    self.logger.write(
        {
            "timestep": timestep,
            "Episode": self.episodes,
            **self.agent.get_logging_params(),
            "Episode Reward": safe_mean(self.training_rewards),

Q-Network Optimisation¶
Just like TD3, SAC uses Clipped Double Q-Learning to calculate the target values for the Q-value network
where \(\tilde{a}'\) is sampled from the policy. The loss function can then be defined as
Action Selection and Policy Optimisation¶
The main aim of policy optimisation is to maximise the value function, which in this case can be defined as
In SAC, a reparameterisation trick is used to sample actions from the policy to ensure that sampling from the policy is a differentiable process. The policy is now parameterised as
The maximisation objective is now defined as
Training through the API¶
from genrl.agents import SAC
from genrl.environments import VectorEnv
from genrl.trainers import OffPolicyTrainer
env = VectorEnv("MountainCarContinuous-v0")
agent = SAC("mlp", env)
trainer = OffPolicyTrainer(agent, env, max_timesteps=4000)
trainer.train()
trainer.evaluate()
Categorical Deep Q-Networks¶
Objective¶
The main objective of Categorical Deep Q-Networks is to learn the distribution of Q-values, unlike other variants of Deep Q-Networks where the goal is to approximate the expectations of the Q-values as closely as possible. In complicated environments, the Q-values can be stochastic, and in that case simply learning the expectation of the Q-values will not capture all the information needed (e.g. the variance of the distribution) to make an optimal decision.
Distributional Bellman¶
The Bellman equation can be adapted to this form as
where \(Z(s, a)\) (the value distribution) and \(R(s, a)\) (the reward distribution) are now probability distributions. The equality or similarity of two distributions can be effectively evaluated using the Kullback-Leibler (KL) divergence or the cross-entropy loss.
The transition operator \(P^\pi : \mathcal{Z} \rightarrow \mathcal{Z}\) and the Bellman operator \(\mathcal{T} : \mathcal{Z} \rightarrow \mathcal{Z}\) can be defined as
Algorithm Details¶
Parametric Distribution¶
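Categorical DQN uses a discrete distribution parameterized by a set of supports/atoms (discrete values) to model the value distribution. This set of atoms is determined as
\[\{z_i = V_{MIN} + i \Delta z : 0 \leq i < N\}, \quad \Delta z := \frac{V_{MAX} - V_{MIN}}{N - 1}\]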
where \(N \in \mathbb{N}\) and \(V_{MAX}, V_{MIN} \in \mathbb{R}\) are the distribution parameters. The probability of each atom is modeled as
Action Selection¶
GenRL uses greedy action selection for categorical DQN wherein the action with the highest Q-values for all discrete regions is selected.
def categorical_greedy_action(agent: DQN, state: torch.Tensor) -> torch.Tensor:
    """Greedy action selection for Categorical DQN

    Args:
        agent (:obj:`DQN`): The agent
        state (:obj:`torch.Tensor`): Current state of the environment

    Returns:
        action (:obj:`torch.Tensor`): Action taken by the agent
    """
    q_value_dist = agent.model(state.unsqueeze(0)).detach()  # .numpy()
    # We need to scale and discretise the Q-value distribution obtained above
    q_value_dist = q_value_dist * torch.linspace(
        agent.v_min, agent.v_max, agent.num_atoms
    )
    # Then we find the action with the highest Q-values for all discrete regions
    # Current shape of the q_value_dist is [1, n_envs, action_dim, num_atoms]
    # So we take the sum of all the individual atom q_values and then take argmax
    # along action dim to get the optimal action. Since batch_size is 1 for this
    # function, we squeeze the first dimension out.
    action = torch.argmax(q_value_dist.sum(-1), axis=-1).squeeze(0)
    return action
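The reduction from a value distribution to a single Q-value per action can also be followed in plain Python. The sketch below assumes that atom probabilities are obtained with a softmax over per-atom logits, as in the Categorical DQN paper; all helper names are hypothetical and it does not use GenRL or PyTorch.

```python
import math

def make_support(v_min, v_max, num_atoms):
    # Evenly spaced atoms z_i = v_min + i * delta_z
    delta_z = (v_max - v_min) / (num_atoms - 1)
    return [v_min + i * delta_z for i in range(num_atoms)]

def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_q(probs, support):
    # Q(s, a) = sum_i p_i(s, a) * z_i
    return sum(p * z for p, z in zip(probs, support))

support = make_support(-1.0, 1.0, 3)
# Action 0 has a uniform distribution, action 1 puts all mass on the top atom
probs_per_action = [softmax([0.0, 0.0, 0.0]), [0.0, 0.0, 1.0]]
q_values = [expected_q(p, support) for p in probs_per_action]
greedy = max(range(len(q_values)), key=lambda a: q_values[a])
```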

Experience Replay¶
Categorical DQN, like other DQNs and off-policy algorithms, uses a Replay Buffer. Whenever a transition \((s_t, a_t, r_t, s_{t+1})\) is encountered, it is stored in the replay buffer. Batches of these transitions are sampled while updating the network parameters. This helps break the strong correlation between updates that would be present if transitions were trained on and discarded immediately after being encountered, and it also helps avoid rapidly forgetting rare transitions that may be useful later on.
def log(self, timestep: int) -> None:
    """Helper function to log

    Sends useful parameters to the logger.

    Args:
        timestep (int): Current timestep of training
    """
    self.logger.write(
        {
            "timestep": timestep,
            "Episode": self.episodes,
            **self.agent.get_logging_params(),
            "Episode Reward": safe_mean(self.training_rewards),

Projected Bellman Update¶
The sample Bellman update \(\hat{\mathcal{T}}Z_\theta\) is projected onto the support of \(Z_\theta\) for the update as shown in the figure below. The Bellman update for each atom \(j\) can be calculated as
and then its probability \(p_j(x', \pi(x'))\) is distributed to the neighbours of the update. Here, \((x, a, r, x')\) is a sample transition. The \(i^{th}\) component of the projected update is calculated as
The loss is calculated using KL divergence (cross-entropy loss). This is also known as the categorical algorithm.
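For a single transition, the projection can be written as an explicit loop, which may be easier to follow than a vectorised version. This is a sketch under stated assumptions: `project_distribution` is a hypothetical helper, `gamma` is passed explicitly, and the case where the shifted atom lands exactly on a support point is handled separately so that no probability mass is lost.

```python
import math

def project_distribution(next_probs, reward, done, v_min, v_max, gamma=0.99):
    """Project the shifted distribution r + gamma * z_j back onto the fixed atoms."""
    n = len(next_probs)
    delta_z = (v_max - v_min) / (n - 1)
    projected = [0.0] * n
    for j, p in enumerate(next_probs):
        z_j = v_min + j * delta_z
        # Bellman update of atom j, clamped to the support range
        tz = reward + (0.0 if done else gamma * z_j)
        tz = min(v_max, max(v_min, tz))
        b = (tz - v_min) / delta_z  # fractional index of the updated atom
        lower, upper = int(math.floor(b)), int(math.ceil(b))
        if lower == upper:
            projected[lower] += p
        else:
            # Split the mass between the two neighbouring atoms
            projected[lower] += p * (upper - b)
            projected[upper] += p * (b - lower)
    return projected

projected = project_distribution([0.0, 1.0, 0.0], reward=0.5, done=False,
                                 v_min=-1.0, v_max=1.0, gamma=1.0)
terminal = project_distribution([0.2, 0.3, 0.5], reward=1.0, done=True,
                                v_min=-1.0, v_max=1.0)
```

In the first call the middle atom (value 0) is shifted to 0.5, which lies halfway between two atoms, so its mass is split equally between them.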
def categorical_q_target(
    agent: DQN,
    next_states: torch.Tensor,
    rewards: torch.Tensor,
    dones: torch.Tensor,
):
    """Projected Distribution of Q-values

    Helper function for Categorical/Distributional DQN

    Args:
        agent (:obj:`DQN`): The agent
        next_states (:obj:`torch.Tensor`): Next states being encountered by the agent
        rewards (:obj:`torch.Tensor`): Rewards received by the agent
        dones (:obj:`torch.Tensor`): Game over status of each environment

    Returns:
        target_q_values (object): Projected Q-value Distribution or Target Q Values
    """
    delta_z = float(agent.v_max - agent.v_min) / (agent.num_atoms - 1)
    support = torch.linspace(agent.v_min, agent.v_max, agent.num_atoms)

    next_q_value_dist = agent.target_model(next_states) * support
    next_actions = (
        torch.argmax(next_q_value_dist.sum(-1), axis=-1).unsqueeze(-1).unsqueeze(-1)
    )

    next_actions = next_actions.expand(
        agent.batch_size, agent.env.n_envs, 1, agent.num_atoms
    )
    next_q_values = next_q_value_dist.gather(2, next_actions).squeeze(2)

    rewards = rewards.unsqueeze(-1).expand_as(next_q_values)
    dones = dones.unsqueeze(-1).expand_as(next_q_values)

    # Refer to the paper in section 4 for notation
    Tz = rewards + (1 - dones) * 0.99 * support
    Tz = Tz.clamp(min=agent.v_min, max=agent.v_max)
    bz = (Tz - agent.v_min) / delta_z
    l = bz.floor().long()
    u = bz.ceil().long()

    offset = (
        torch.linspace(
            0,
            (agent.batch_size * agent.env.n_envs - 1) * agent.num_atoms,
            agent.batch_size * agent.env.n_envs,
        )
        .long()
        .view(agent.batch_size, agent.env.n_envs, 1)
        .expand(agent.batch_size, agent.env.n_envs, agent.num_atoms)
    )

    target_q_values = torch.zeros(next_q_values.size())
    target_q_values.view(-1).index_add_(
        0,
        (l + offset).view(-1),
        (next_q_values * (u.float() - bz)).view(-1),
    )
    target_q_values.view(-1).index_add_(
        0,
        (u + offset).view(-1),
        (next_q_values * (bz - l.float())).view(-1),
    )
    return target_q_values

Training through the API¶
from genrl.agents import CategoricalDQN
from genrl.environments import VectorEnv
from genrl.trainers import OffPolicyTrainer
env = VectorEnv("CartPole-v0")
agent = CategoricalDQN("mlp", env)
trainer = OffPolicyTrainer(agent, env, max_timesteps=20000)
trainer.train()
trainer.evaluate()
Custom Policy Networks¶
GenRL provides default policies for images (CNNPolicy) and for other types of inputs (MlpPolicy). Sometimes these default policies may be insufficient for your problem, or you may want more control over the policy definition, and hence require a custom policy.
The following tutorial runs through the steps to use a custom policy depending on your problem.
Import the required libraries (e.g. torch, torch.nn) and, from GenRL, the algorithm (e.g. VPG), the trainer (e.g. OnPolicyTrainer) and the policy to be modified (e.g. MlpPolicy).
# The necessary imports
import torch
import torch.nn as nn
from genrl.agents import VPG
from genrl.core.policies import MlpPolicy
from genrl.environments import VectorEnv
from genrl.trainers import OnPolicyTrainer
Then define a custom_policy class that derives from the policy to be modified (in this case, the MlpPolicy).
# Define a custom MLP Policy
class custom_policy(MlpPolicy):
    def __init__(self, state_dim, action_dim, hidden, **kwargs):
        super().__init__(state_dim, action_dim, hidden)
        self.action_dim = action_dim
        self.state_dim = state_dim
The above class modifies the MlpPolicy to have the desired number of hidden layers in the MLP neural network that parametrizes the policy. This is done by passing the variable hidden explicitly (default: hidden = (32, 32)). The state_dim and action_dim variables stand for the dimensions of the state space and the action space, and are required to construct the neural network with the proper input and output shapes for your policy, given the environment.
In some cases, you may also want to redefine the policy completely and not just customize an existing policy. This can be done by creating a new custom policy class that inherits from the BasePolicy class.
The BasePolicy class is a basic implementation of a general policy, with a forward and a get_action method. The forward method maps the input state to the action probabilities, and the get_action method selects an action from the given action probabilities (for both continuous and discrete action spaces).
Say you want to parametrize your policy by a neural network containing LSTM layers followed by MLP layers. This can be done as follows:
# Define a custom LSTM policy from the BasePolicy class
# (BasePolicy and mlp must also be imported from GenRL; their import
# lines are not part of the import block shown in this tutorial)
class custom_policy(BasePolicy):
    def __init__(self, state_dim, action_dim, hidden,
                 discrete=True, layer_size=512, layers=1, **kwargs):
        super(custom_policy, self).__init__(state_dim,
                                            action_dim,
                                            hidden,
                                            discrete,
                                            **kwargs)
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.layer_size = layer_size
        self.lstm = nn.LSTM(self.state_dim, layer_size, layers)
        self.fc = mlp([layer_size] + list(hidden) + [action_dim],
                      sac=self.sac)  # the mlp layers

    def forward(self, state):
        state, h = self.lstm(state.unsqueeze(0))
        state = state.view(-1, self.layer_size)
        action = self.fc(state)
        return action
Finally, it’s time to train the custom policy. Define the environment to be trained on (CartPole-v0 in this case), and the state_dim and action_dim variables.
# Initialize an environment
env = VectorEnv("CartPole-v0", 1)
# Initialize the custom Policy
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
policy = custom_policy(state_dim=state_dim, action_dim=action_dim,
                       hidden=(64, 64))
Then the algorithm is initialised with the custom policy defined above, and the OnPolicyTrainer trains it with logging for better inference.
algo = VPG(policy, env)
# Initialize the trainer and start training
trainer = OnPolicyTrainer(algo, env, log_mode=["csv"],
                          logdir="./logs", epochs=100)
trainer.train()
Using A2C¶
Using A2C on “CartPole-v0”¶
import gym
from genrl.agents import A2C
from genrl.trainers import OnPolicyTrainer
from genrl.environments import VectorEnv
env = VectorEnv("CartPole-v0")
agent = A2C('mlp', env, gamma=0.9, lr_policy=0.01, lr_value=0.1, policy_layers=(32,32), value_layers=(32, 32),rollout_size=2048)
trainer = OnPolicyTrainer(agent, env, log_mode=['stdout', 'tensorboard'], log_key="Episode")
trainer.train()
Using A2C on Atari env - “Pong-v0”¶
env = VectorEnv("Pong-v0", env_type="atari")
agent = A2C('cnn', env, gamma=0.99, lr_policy=0.01, lr_value=0.1, policy_layers=(32,32), value_layers=(32, 32), rollout_size=2048)
trainer = OnPolicyTrainer(agent, env, log_mode=['stdout', 'tensorboard'], log_key="timestep")
trainer.train()
More details can be found in the docs for A2C and OnPolicyTrainer.
Vanilla Policy Gradient (VPG)¶
If you want to explore policy gradient algorithms in RL, there is a high chance you have heard of PPO, DDPG, etc., but understanding them can be tricky if you’re just starting.
VPG is arguably one of the easiest policy gradient algorithms to understand, while still performing reasonably well.
Let’s understand policy gradients at a high level. Classical algorithms such as Q-learning and Monte Carlo methods optimise the outputs of the agent’s action-value function, which are then used to determine the optimal policy. Policy gradient methods instead go directly for the kill shot: they optimise the thing we want to use at the end, i.e. the policy.
That explains the “Policy” part of Policy Gradient. The “Gradient” part comes from the fact that we optimise the policy by gradient ascent (unlike the more familiar gradient descent, here we want to increase the values, hence ascent). That explains the name, but how does it actually work?
For that, have a look at the following pseudocode (source: OpenAI).
For a more fundamental understanding, this Spinning Up article is a good resource.
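To make the gradient ascent idea concrete, here is a minimal sketch on a toy two-action problem. It is not GenRL's VPG implementation: the problem, the function names (softmax, train_bandit_policy) and the hyperparameters are all assumptions chosen for illustration. The policy is a softmax over two logits, and each sampled action nudges the logits in the direction that makes rewarded actions more likely.

```python
import math
import random


def softmax(logits):
    """Turn a list of logits into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit_policy(steps=2000, lr=0.1, seed=0):
    """Gradient ascent on a two-action softmax policy (hypothetical toy problem)."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]  # policy parameters: one logit per action
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choices([0, 1], weights=probs)[0]  # sample from the policy
        reward = 1.0 if action == 1 else 0.0  # action 1 is the better one
        for k in range(2):
            # Derivative of log pi(action) with respect to theta[k] for a softmax policy
            grad_log_prob = (1.0 if k == action else 0.0) - probs[k]
            theta[k] += lr * reward * grad_log_prob  # ascent: step up the gradient
    return softmax(theta)


final_probs = train_bandit_policy()
print(final_probs)
```

After training, almost all of the probability mass should sit on the rewarded action, which is the same effect VPG aims for on a full environment.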
Now that we have a high-level understanding of how VPG works, let’s jump into the code to see it in action.
This is a very minimal way to run a VPG agent with GenRL:
VPG agent on a Cartpole Environment¶
import gym # OpenAI Gym
from genrl.agents import VPG
from genrl.trainers import OnPolicyTrainer
from genrl.environments import VectorEnv
env = VectorEnv("CartPole-v1")
agent = VPG('mlp', env)
trainer = OnPolicyTrainer(agent, env, epochs=200)
trainer.train()
This will run a VPG agent which interacts with the CartPole-v1 gym environment.
Let’s understand the output of running this (your individual values may differ):
timestep  Episode  loss     mean_reward
0         0        8.022    19.8835
20480     10       25.969   75.2941
40960     20       29.2478  144.2254
61440     30       25.5711  129.6203
81920     40       19.8718  96.6038
102400    50       19.2585  106.9452
122880    60       17.7781  99.9024
143360    70       23.6839  121.543
163840    80       24.4362  129.2114
184320    90       28.1183  156.3359
204800    100      26.6074  155.1515
225280    110      27.2012  178.8646
245760    120      26.4612  164.498
266240    130      22.8618  148.4058
286720    140      23.465   153.4082
307200    150      21.9764  151.1439
327680    160      22.445   151.1439
348160    170      22.9925  155.7414
368640    180      22.6605  165.1613
389120    190      23.4676  177.316
 timestep: The number of time units the agent has interacted with the environment since the start of training.
 Episode: One complete rollout of the agent; put simply, one complete run until the agent ends up winning or losing.
 loss: The loss encountered in that episode.
 mean_reward: The mean reward accumulated in that episode.
Now if you look closely, the agent does not converge to the maximum reward even if you increase the epochs to, say, 5000. This is because during training the agent behaves according to a stochastic policy. When picking an action for a given state, the policy doesn’t simply take the one with the maximum return; it samples an action from a probability distribution. In other words, the policy isn’t a lookup table: it is a function which outputs a probability distribution over the actions, and we sample from that distribution to choose an action.
So even if the agent has figured out the optimal policy, it does not take the optimal action at every step; there is an inherent stochasticity to it.
If we want the agent to make full use of the learnt policy, we can add the following line of code after training:
trainer.evaluate(render=True)
This will not only make the agent follow a deterministic policy, and thus help you achieve the maximum reward attainable from the learnt policy, but also let you watch your agent perform by passing render=True.
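The difference between sampling from the policy and acting deterministically can be sketched in a few lines. This is only an illustration of the idea, not how GenRL implements evaluation; the function name select_action, its arguments and the example probabilities are assumptions.

```python
import random


def select_action(probs, deterministic=False, rng=random):
    """Pick an action index from a list of action probabilities."""
    if deterministic:
        # Greedy: always take the most probable action
        return max(range(len(probs)), key=lambda a: probs[a])
    # Stochastic: sample an action according to its probability
    return rng.choices(range(len(probs)), weights=probs)[0]


probs = [0.1, 0.7, 0.2]  # hypothetical policy output for one state
rng = random.Random(0)
sampled = [select_action(probs, rng=rng) for _ in range(1000)]
greedy = [select_action(probs, deterministic=True) for _ in range(1000)]
print(sorted(set(sampled)), sorted(set(greedy)))
```

The sampled list contains a mix of actions even though action 1 is clearly preferred, while the greedy list contains only action 1. That is why the training rewards stay below what the learnt policy can achieve.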
For more information on the VPG implementation and the various hyperparameters available, have a look at the official GenRL docs.
Some more implementations
VPG agent on an Atari Environment¶
env = VectorEnv("Pong-v0", env_type="atari")
agent = VPG('cnn', env)
trainer = OnPolicyTrainer(agent, env, epochs=200)
trainer.train()
Saving and Loading Weights and Hyperparameters with GenRL¶
We often want to checkpoint the model during training. GenRL can save your hyperparameters and weights using TOML and the PyTorch state_dict format, respectively.
The following sample code saves checkpoints:
import gym
import shutil
from genrl.agents import VPG
from genrl.environments.suite import VectorEnv
from genrl.core import NormalActionNoise
from genrl.trainers import OnPolicyTrainer
env = VectorEnv("CartPole-v0", 2)
algo = VPG("mlp", env, batch_size=5, replay_size=100)
trainer = OnPolicyTrainer(
    algo,
    env,
    log_mode=["stdout"],
    logdir="./logs",
    save_interval=100,
    epochs=100,
    evaluate_episodes=2,
)
trainer.train()
trainer.evaluate()
shutil.rmtree("./logs")
If you have saved weights and hyperparameters files that you want to load into the model, change your trainer as below:
trainer = OnPolicyTrainer(
    algo,
    env,
    log_mode=["stdout"],
    logdir="./logs",
    save_interval=100,
    epochs=100,
    evaluate_episodes=2,
    load_weights="./checkpoints/VPG_CartPolev0/1log0.pt",
    load_hyperparams="./checkpoints/VPG_CartPolev0/1log0.toml",
)
Agents¶
A2C¶
genrl.agents.deep.a2c.a2c module¶

class genrl.agents.deep.a2c.a2c.A2C(*args, noise: Any = None, noise_std: float = 0.1, value_coeff: float = 0.5, entropy_coeff: float = 0.01, **kwargs)¶
Bases: genrl.agents.deep.base.onpolicy.OnPolicyAgent
Advantage Actor Critic algorithm (A2C)
The synchronous version of A3C
Paper: https://arxiv.org/abs/1602.01783

network
¶ The network type of the Q-value function. Supported types: [“cnn”, “mlp”]
Type: str

env
¶ The environment that the agent is supposed to act on
Type: Environment

create_model
¶ Whether the model of the algo should be created when initialised
Type: bool

batch_size
¶ Mini batch size for loading experiences
Type: int

gamma
¶ The discount factor for rewards
Type: float

layers
¶ Layers in the Neural Network of the Q-value function
Type: tuple of int
Sizes of shared layers in Actor Critic, if used
Type: tuple of int

lr_policy
¶ Learning rate for the policy/actor
Type: float

lr_value
¶ Learning rate for the critic
Type: float

rollout_size
¶ Capacity of the Replay Buffer
Type: int

buffer_type
¶ Choose the type of Buffer: [“rollout”]
Type: str

noise
¶ Action Noise function added to aid in exploration
Type: ActionNoise

noise_std
¶ Standard deviation of the action noise distribution
Type: float

value_coeff
¶ Ratio of magnitude of value updates to policy updates
Type: float

entropy_coeff
¶ Ratio of magnitude of entropy updates to policy updates
Type: float

seed
¶ Seed for randomness
Type: int

render
¶ Should the env be rendered during training?
Type: bool

device
¶ Hardware being used for training. Options: [“cuda” -> GPU, “cpu” -> CPU]
Type: str

evaluate_actions(states: torch.Tensor, actions: torch.Tensor)¶
Evaluates actions taken by actor
Actions taken by the actor and their respective states are analysed to get log probabilities and values from the critic
Parameters:
 states (torch.Tensor) – States encountered in rollout
 actions (torch.Tensor) – Actions taken in response to respective states
Returns:
 values (torch.Tensor): Values of states encountered during the rollout
 log_probs (torch.Tensor): Log of action probabilities given a state

get_hyperparams() → Dict[str, Any]¶
Get relevant hyperparameters to save
Returns:
 hyperparams (dict): Hyperparameters to be saved
 weights (torch.Tensor): Neural network weights

get_logging_params() → Dict[str, Any]¶
Gets relevant parameters for logging
Returns:
 logs (dict): Logging parameters for monitoring training

get_traj_loss(values: torch.Tensor, dones: torch.Tensor) → None¶
Get loss from trajectory traversed by agent during rollouts
Computes the returns and advantages needed for calculating loss
Parameters:
 values (torch.Tensor) – Values of states encountered during the rollout
 dones (list of bool) – Game over statuses of each environment

select_action(state: torch.Tensor, deterministic: bool = False) → torch.Tensor¶
Select action given state
Action Selection for On Policy Agents with Actor Critic
Parameters:
 state (torch.Tensor) – Current state of the environment
 deterministic (bool) – Should the policy be deterministic or stochastic
Returns:
 action (torch.Tensor): Action taken by the agent
 value (torch.Tensor): Value of given state
 log_prob (torch.Tensor): Log probability of selected action
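
As a rough illustration of what value_coeff and entropy_coeff control, the sketch below combines the three loss terms in the way that is common for actor-critic methods. The function name actor_critic_loss and the exact combination are assumptions; GenRL's own loss computation is not shown in this reference and may differ in detail.

```python
def actor_critic_loss(policy_loss, value_loss, entropy,
                      value_coeff=0.5, entropy_coeff=0.01):
    """Combine the three terms into one scalar loss (common convention, assumed here)."""
    # Entropy is subtracted so that minimising the loss rewards exploration
    return policy_loss + value_coeff * value_loss - entropy_coeff * entropy


# Hypothetical per-batch values for the three terms
total = actor_critic_loss(policy_loss=1.2, value_loss=0.8, entropy=0.5)
print(total)
```

A larger value_coeff makes the critic's error dominate the update, while a larger entropy_coeff pushes the policy towards more uniform action probabilities.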

DDPG¶
genrl.agents.deep.ddpg.ddpg module¶

class genrl.agents.deep.ddpg.ddpg.DDPG(*args, noise: genrl.core.noise.ActionNoise = None, noise_std: float = 0.2, **kwargs)¶
Bases: genrl.agents.deep.base.offpolicy.OffPolicyAgentAC
Deep Deterministic Policy Gradient Algorithm
Paper: https://arxiv.org/abs/1509.02971

network
¶ The network type of the Q-value function. Supported types: [“cnn”, “mlp”]
Type: str

env
¶ The environment that the agent is supposed to act on
Type: Environment

create_model
¶ Whether the model of the algo should be created when initialised
Type: bool

batch_size
¶ Mini batch size for loading experiences
Type: int

gamma
¶ The discount factor for rewards
Type: float

layers
¶ Layers in the Neural Network of the Q-value function
Type: tuple of int
Sizes of shared layers in Actor Critic, if used
Type: tuple of int

lr_policy
¶ Learning rate for the policy/actor
Type: float

lr_value
¶ Learning rate for the critic
Type: float

replay_size
¶ Capacity of the Replay Buffer
Type: int

buffer_type
¶ Choose the type of Buffer: [“push”, “prioritized”]
Type: str

polyak
¶ Target model update parameter (1 for hard update)
Type: float

noise
¶ Action Noise function added to aid in exploration
Type: ActionNoise

noise_std
¶ Standard deviation of the action noise distribution
Type: float

seed
¶ Seed for randomness
Type: int

render
¶ Should the env be rendered during training?
Type: bool

device
¶ Hardware being used for training. Options: [“cuda” -> GPU, “cpu” -> CPU]
Type: str

get_hyperparams() → Dict[str, Any]¶
Get relevant hyperparameters to save
Returns:
 hyperparams (dict): Hyperparameters to be saved
 weights (torch.Tensor): Neural Network weights
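
The noise and noise_std attributes describe exploration noise added to the deterministic action. A minimal sketch of that idea with Gaussian noise is shown below. The helper noisy_action and the action bounds low and high are assumptions made for illustration; GenRL's ActionNoise classes have their own interface, which is not reproduced here.

```python
import random


def noisy_action(action, noise_std=0.2, low=-1.0, high=1.0, rng=random):
    """Add zero-mean Gaussian noise to each action dimension, then clip to the bounds."""
    noisy = [a + rng.gauss(0.0, noise_std) for a in action]
    return [min(max(a, low), high) for a in noisy]


rng = random.Random(0)
action = [0.5, -0.9]  # hypothetical output of a deterministic policy
explored = noisy_action(action, noise_std=0.2, rng=rng)
print(explored)
```

With noise_std set to 0 the action is returned unchanged, which corresponds to acting purely on the learnt policy.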

DQN¶
genrl.agents.deep.dqn.base module¶

class genrl.agents.deep.dqn.base.DQN(*args, max_epsilon: float = 1.0, min_epsilon: float = 0.01, epsilon_decay: int = 500, **kwargs)¶
Bases: genrl.agents.deep.base.offpolicy.OffPolicyAgent
Base DQN Class
Paper: https://arxiv.org/abs/1312.5602

network
¶ The network type of the Q-value function. Supported types: [“cnn”, “mlp”]
Type: str

env
¶ The environment that the agent is supposed to act on
Type: Environment

create_model
¶ Whether the model of the algo should be created when initialised
Type: bool

batch_size
¶ Mini batch size for loading experiences
Type: int

gamma
¶ The discount factor for rewards
Type: float

value_layers
¶ Layers in the Neural Network of the Q-value function
Type: tuple of int

lr_value
¶ Learning rate for the Q-value function
Type: float

replay_size
¶ Capacity of the Replay Buffer
Type: int

buffer_type
¶ Choose the type of Buffer: [“push”, “prioritized”]
Type: str

max_epsilon
¶ Maximum epsilon for exploration
Type: float

min_epsilon
¶ Minimum epsilon for exploration
Type: float

epsilon_decay
¶ Rate of decay of epsilon (in order to decrease exploration with time)
Type: int

seed
¶ Seed for randomness
Type: int

render
¶ Should the env be rendered during training?
Type: bool

device
¶ Hardware being used for training. Options: [“cuda” -> GPU, “cpu” -> CPU]
Type: str

calculate_epsilon_by_frame() → float¶
Helper function to calculate epsilon after every timestep
Exponentially decays exploration rate from max epsilon to min epsilon. The greater the value of epsilon_decay, the slower the decrease in epsilon.
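
One common way to write such an exponential schedule is sketched below. The formula and the function name epsilon_by_frame are assumptions consistent with the description above (decay from max epsilon to min epsilon, slower for larger epsilon_decay); the library's exact expression is not shown in this reference.

```python
import math


def epsilon_by_frame(timestep, max_epsilon=1.0, min_epsilon=0.01, epsilon_decay=500):
    """Exponential decay from max_epsilon towards min_epsilon (assumed formula)."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-timestep / epsilon_decay)


# Epsilon at the start, after one decay constant, and late in training
schedule = [epsilon_by_frame(t) for t in (0, 500, 5000)]
print(schedule)
```

The schedule starts at max_epsilon, falls quickly at first, and flattens out just above min_epsilon.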

get_greedy_action(state: torch.Tensor) → torch.Tensor¶
Greedy action selection
Parameters:
 state (torch.Tensor) – Current state of the environment
Returns:
 action (torch.Tensor): Action taken by the agent

get_hyperparams() → Dict[str, Any]¶
Get relevant hyperparameters to save
Returns:
 hyperparams (dict): Hyperparameters to be saved
 weights (torch.Tensor): Neural network weights

get_logging_params() → Dict[str, Any]¶
Gets relevant parameters for logging
Returns:
 logs (dict): Logging parameters for monitoring training

get_q_values(states: torch.Tensor, actions: torch.Tensor) → torch.Tensor¶
Get Q values corresponding to specific states and actions
Parameters:
 states (torch.Tensor) – States for which Q-values need to be found
 actions (torch.Tensor) – Actions taken at respective states
Returns:
 q_values (torch.Tensor): Q values for the given states and actions

get_target_q_values(next_states: torch.Tensor, rewards: List[float], dones: List[bool]) → torch.Tensor¶
Get target Q values for the DQN
Parameters:
 next_states (torch.Tensor) – Next states for which target Q-values need to be found
 rewards (list) – Rewards at each timestep for each environment
 dones (list) – Game over status for each environment
Returns:
 target_q_values (torch.Tensor): Target Q values for the DQN

load_weights(weights) → None¶
Load weights for the agent from pretrained model
Parameters:
 weights (torch.Tensor) – Neural net weights

select_action(state: torch.Tensor, deterministic: bool = False) → torch.Tensor¶
Select action given state
Epsilon-greedy action selection
Parameters:
 state (torch.Tensor) – Current state of the environment
 deterministic (bool) – Should the policy be deterministic or stochastic
Returns:
 action (torch.Tensor): Action taken by the agent
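
Epsilon-greedy selection itself is simple enough to sketch in plain Python. The function below is an illustration only: its name, arguments and the example Q-values are assumptions, and the real method works on tensors and reads epsilon from the agent.

```python
import random


def epsilon_greedy(q_values, epsilon, deterministic=False, rng=random):
    """Return an action index given per-action Q-value estimates."""
    if not deterministic and rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform random action
    # Exploit: action with the highest estimated Q-value
    return max(range(len(q_values)), key=lambda a: q_values[a])


q_values = [0.2, 1.5, -0.3]  # hypothetical Q-values for one state
greedy_action = epsilon_greedy(q_values, epsilon=1.0, deterministic=True)
print(greedy_action)
```

With deterministic=True the exploration branch is skipped entirely, so the highest-valued action is always returned regardless of epsilon.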

update_params(update_interval: int) → None¶
Update parameters of the model
Parameters:
 update_interval (int) – Interval between successive updates of the target model

genrl.agents.deep.dqn.categorical module¶

class genrl.agents.deep.dqn.categorical.CategoricalDQN(*args, noisy_layers: Tuple = (32, 128), num_atoms: int = 51, v_min: int = -10, v_max: int = 10, **kwargs)¶
Bases: genrl.agents.deep.dqn.base.DQN
Categorical DQN Algorithm
Paper: https://arxiv.org/pdf/1707.06887.pdf

network
¶ The network type of the Q-value function. Supported types: [“cnn”, “mlp”]
Type: str

env
¶ The environment that the agent is supposed to act on
Type: Environment

create_model
¶ Whether the model of the algo should be created when initialised
Type: bool

batch_size
¶ Mini batch size for loading experiences
Type: int

gamma
¶ The discount factor for rewards
Type: float

layers
¶ Layers in the Neural Network of the Q-value function
Type: tuple of int

lr_value
¶ Learning rate for the Q-value function
Type: float

replay_size
¶ Capacity of the Replay Buffer
Type: int

buffer_type
¶ Choose the type of Buffer: [“push”, “prioritized”]
Type: str

max_epsilon
¶ Maximum epsilon for exploration
Type: float

min_epsilon
¶ Minimum epsilon for exploration
Type: float

epsilon_decay
¶ Rate of decay of epsilon (in order to decrease exploration with time)
Type: int

noisy_layers
¶ Noisy layers in the Neural Network of the Q-value function
Type: tuple of int

num_atoms
¶ Number of atoms used in the discrete distribution
Type: int

v_min
¶ Lower bound of value distribution
Type: int

v_max
¶ Upper bound of value distribution
Type: int

seed
¶ Seed for randomness
Type: int

render
¶ Should the env be rendered during training?
Type: bool

device
¶ Hardware being used for training. Options: [“cuda” -> GPU, “cpu” -> CPU]
Type: str

get_greedy_action(state: torch.Tensor) → torch.Tensor¶
Greedy action selection
Parameters:
 state (torch.Tensor) – Current state of the environment
Returns:
 action (torch.Tensor): Action taken by the agent

get_q_loss(batch: collections.namedtuple)¶
Categorical DQN loss function to calculate the loss of the Q-function
Parameters:
 batch (collections.namedtuple of torch.Tensor) – Batch of experiences
Returns:
 loss (torch.Tensor): Calculated loss of the Q-function

get_q_values(states: torch.Tensor, actions: torch.Tensor)¶
Get Q values corresponding to specific states and actions
Parameters:
 states (torch.Tensor) – States for which Q-values need to be found
 actions (torch.Tensor) – Actions taken at respective states
Returns:
 q_values (torch.Tensor): Q values for the given states and actions

get_target_q_values(next_states: torch.Tensor, rewards: torch.Tensor, dones: torch.Tensor)¶
Projected Distribution of Q-values
Helper function for Categorical/Distributional DQN
Parameters:
 next_states (torch.Tensor) – Next states being encountered by the agent
 rewards (torch.Tensor) – Rewards received by the agent
 dones (torch.Tensor) – Game over status of each environment
Returns:
 target_q_values (object): Projected Q-value Distribution or Target Q Values
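
To see how num_atoms, v_min and v_max fit together, the sketch below builds the fixed support of the categorical distribution and recovers a scalar Q-value as its expectation. It assumes a symmetric support from -10 to 10; the helper names support_atoms and expected_q are hypothetical and are not part of the GenRL API.

```python
def support_atoms(num_atoms=51, v_min=-10.0, v_max=10.0):
    """Evenly spaced return values that the distribution places mass on."""
    delta_z = (v_max - v_min) / (num_atoms - 1)
    return [v_min + i * delta_z for i in range(num_atoms)]


def expected_q(probabilities, atoms):
    """Q-value as the expectation of the categorical return distribution."""
    return sum(p * z for p, z in zip(probabilities, atoms))


atoms = support_atoms()
uniform = [1.0 / len(atoms)] * len(atoms)  # hypothetical network output
q_uniform = expected_q(uniform, atoms)
print(len(atoms), q_uniform)
```

The network outputs one probability per atom for every action, and greedy action selection compares the resulting expectations.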

genrl.agents.deep.dqn.double module¶

class genrl.agents.deep.dqn.double.DoubleDQN(*args, **kwargs)¶
Bases: genrl.agents.deep.dqn.base.DQN
Double DQN Class
Paper: https://arxiv.org/abs/1509.06461

network
¶ The network type of the Q-value function. Supported types: [“cnn”, “mlp”]
Type: str

env
¶ The environment that the agent is supposed to act on
Type: Environment

batch_size
¶ Mini batch size for loading experiences
Type: int

gamma
¶ The discount factor for rewards
Type: float

layers
¶ Layers in the Neural Network of the Q-value function
Type: tuple of int

lr_value
¶ Learning rate for the Q-value function
Type: float

replay_size
¶ Capacity of the Replay Buffer
Type: int

buffer_type
¶ Choose the type of Buffer: [“push”, “prioritized”]
Type: str

max_epsilon
¶ Maximum epsilon for exploration
Type: float

min_epsilon
¶ Minimum epsilon for exploration
Type: float

epsilon_decay
¶ Rate of decay of epsilon (in order to decrease exploration with time)
Type: int

seed
¶ Seed for randomness
Type: int

render
¶ Should the env be rendered during training?
Type: bool

device
¶ Hardware being used for training. Options: [“cuda” -> GPU, “cpu” -> CPU]
Type: str

get_target_q_values(next_states: torch.Tensor, rewards: torch.Tensor, dones: torch.Tensor) → torch.Tensor¶
Get target Q values for the DQN
Parameters:
 next_states (torch.Tensor) – Next states for which target Q-values need to be found
 rewards (list) – Rewards at each timestep for each environment
 dones (list) – Game over status for each environment
Returns:
 target_q_values (torch.Tensor): Target Q values for the DQN

genrl.agents.deep.dqn.dueling module¶

class genrl.agents.deep.dqn.dueling.DuelingDQN(*args, **kwargs)¶
Bases: genrl.agents.deep.dqn.base.DQN
Dueling DQN class
Paper: https://arxiv.org/abs/1511.06581

network
¶ The network type of the Q-value function. Supported types: [“cnn”, “mlp”]
Type: str

env
¶ The environment that the agent is supposed to act on
Type: Environment

batch_size
¶ Mini batch size for loading experiences
Type: int

gamma
¶ The discount factor for rewards
Type: float

layers
¶ Layers in the Neural Network of the Q-value function
Type: tuple of int

lr_value
¶ Learning rate for the Q-value function
Type: float

replay_size
¶ Capacity of the Replay Buffer
Type: int

buffer_type
¶ Choose the type of Buffer: [“push”, “prioritized”]
Type: str

max_epsilon
¶ Maximum epsilon for exploration
Type: float

min_epsilon
¶ Minimum epsilon for exploration
Type: float

epsilon_decay
¶ Rate of decay of epsilon (in order to decrease exploration with time)
Type: int

seed
¶ Seed for randomness
Type: int

render
¶ Should the env be rendered during training?
Type: bool

device
¶ Hardware being used for training. Options: [“cuda” -> GPU, “cpu” -> CPU]
Type: str

genrl.agents.deep.dqn.noisy module¶

class genrl.agents.deep.dqn.noisy.NoisyDQN(*args, noisy_layers: Tuple = (128, 128), **kwargs)¶
Bases: genrl.agents.deep.dqn.base.DQN
Noisy DQN Algorithm
Paper: https://arxiv.org/abs/1706.10295

network
¶ The network type of the Q-value function. Supported types: [“cnn”, “mlp”]
Type: str

env
¶ The environment that the agent is supposed to act on
Type: Environment

batch_size
¶ Mini batch size for loading experiences
Type: int

gamma
¶ The discount factor for rewards
Type: float

layers
¶ Layers in the Neural Network of the Q-value function
Type: tuple of int

lr_value
¶ Learning rate for the Q-value function
Type: float

replay_size
¶ Capacity of the Replay Buffer
Type: int

buffer_type
¶ Choose the type of Buffer: [“push”, “prioritized”]
Type: str

max_epsilon
¶ Maximum epsilon for exploration
Type: float

min_epsilon
¶ Minimum epsilon for exploration
Type: float

epsilon_decay
¶ Rate of decay of epsilon (in order to decrease exploration with time)
Type: int

noisy_layers
¶ Noisy layers in the Neural Network of the Q-value function
Type: tuple of int

seed
¶ Seed for randomness
Type: int

render
¶ Should the env be rendered during training?
Type: bool

device
¶ Hardware being used for training. Options: [“cuda” -> GPU, “cpu” -> CPU]
Type: str

genrl.agents.deep.dqn.prioritized module¶

class genrl.agents.deep.dqn.prioritized.PrioritizedReplayDQN(*args, alpha: float = 0.6, beta: float = 0.4, **kwargs)¶
Bases: genrl.agents.deep.dqn.base.DQN
Prioritized Replay DQN Class
Paper: https://arxiv.org/abs/1511.05952

network
¶ The network type of the Q-value function. Supported types: [“cnn”, “mlp”]
Type: str

env
¶ The environment that the agent is supposed to act on
Type: Environment

batch_size
¶ Mini batch size for loading experiences
Type: int

gamma
¶ The discount factor for rewards
Type: float

layers
¶ Layers in the Neural Network of the Q-value function
Type: tuple of int

lr_value
¶ Learning rate for the Q-value function
Type: float

replay_size
¶ Capacity of the Replay Buffer
Type: int

buffer_type
¶ Choose the type of Buffer: [“push”, “prioritized”]
Type: str

max_epsilon
¶ Maximum epsilon for exploration
Type: float

min_epsilon
¶ Minimum epsilon for exploration
Type: float

epsilon_decay
¶ Rate of decay of epsilon (in order to decrease exploration with time)
Type: int

alpha
¶ Prioritization constant
Type: float

beta
¶ Importance Sampling bias
Type: float

seed
¶ Seed for randomness
Type: int

render
¶ Should the env be rendered during training?
Type: bool

device
¶ Hardware being used for training. Options: [“cuda” -> GPU, “cpu” -> CPU]
Type: str
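
The roles of alpha and beta can be sketched with the formulas from the prioritized replay paper linked above: alpha shapes the sampling probabilities and beta controls how strongly importance-sampling weights correct for the resulting bias. The function names and example priorities below are assumptions; this is not GenRL's buffer implementation.

```python
def sampling_probabilities(priorities, alpha=0.6):
    """P(i) = p_i^alpha / sum_k p_k^alpha; alpha=0 gives uniform sampling."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def importance_weights(probs, beta=0.4):
    """w_i = (N * P(i))^(-beta), normalised by the largest weight."""
    n = len(probs)
    weights = [(n * p) ** (-beta) for p in probs]
    max_w = max(weights)
    return [w / max_w for w in weights]


priorities = [1.0, 2.0, 4.0]  # hypothetical TD-error based priorities
probs = sampling_probabilities(priorities)
weights = importance_weights(probs)
print(probs, weights)
```

Transitions with higher priority are sampled more often but receive smaller weights, so their larger sampling frequency does not dominate the gradient.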

genrl.agents.deep.dqn.utils module¶

genrl.agents.deep.dqn.utils.categorical_greedy_action(agent: genrl.agents.deep.dqn.base.DQN, state: torch.Tensor) → torch.Tensor¶
Greedy action selection for Categorical DQN
Parameters:
 agent (DQN) – The agent
 state (torch.Tensor) – Current state of the environment
Returns:
 action (torch.Tensor): Action taken by the agent

genrl.agents.deep.dqn.utils.categorical_q_loss(agent: genrl.agents.deep.dqn.base.DQN, batch: collections.namedtuple)¶
Categorical DQN loss function to calculate the loss of the Q-function
Parameters:
 agent (DQN) – The agent
 batch (collections.namedtuple of torch.Tensor) – Batch of experiences
Returns:
 loss (torch.Tensor): Calculated loss of the Q-function

genrl.agents.deep.dqn.utils.categorical_q_target(agent: genrl.agents.deep.dqn.base.DQN, next_states: torch.Tensor, rewards: torch.Tensor, dones: torch.Tensor)¶
Projected Distribution of Q-values
Helper function for Categorical/Distributional DQN
Parameters:
 agent (DQN) – The agent
 next_states (torch.Tensor) – Next states being encountered by the agent
 rewards (torch.Tensor) – Rewards received by the agent
 dones (torch.Tensor) – Game over status of each environment
Returns:
 target_q_values (object): Projected Q-value Distribution or Target Q Values

genrl.agents.deep.dqn.utils.categorical_q_values(agent: genrl.agents.deep.dqn.base.DQN, states: torch.Tensor, actions: torch.Tensor)¶
Get Q values given state for a Categorical DQN
Parameters:
 agent (DQN) – The agent
 states (torch.Tensor) – States being replayed
 actions (torch.Tensor) – Actions being replayed
Returns:
 q_values (torch.Tensor): Q values for the given states and actions

genrl.agents.deep.dqn.utils.ddqn_q_target(agent: genrl.agents.deep.dqn.base.DQN, next_states: torch.Tensor, rewards: torch.Tensor, dones: torch.Tensor) → torch.Tensor¶
Double Q-learning target
Can be used to replace the get_target_q_values method of the Base DQN class in any DQN algorithm
Parameters:
 agent (DQN) – The agent
 next_states (torch.Tensor) – Next states being encountered by the agent
 rewards (torch.Tensor) – Rewards received by the agent
 dones (torch.Tensor) – Game over status of each environment
Returns:
 target_q_values (torch.Tensor): Target Q values using Double Q-learning
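
The Double Q-learning target can be written out for a single transition as below. This is a sketch under simplifying assumptions (plain lists instead of tensors, one transition instead of a batch); the function name double_q_target and the example Q-values are hypothetical.

```python
def double_q_target(reward, done, next_q_online, next_q_target, gamma=0.99):
    """Double Q-learning target for a single transition."""
    # The online network chooses the action ...
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    # ... and the target network evaluates it
    bootstrap = next_q_target[best_action]
    return reward + gamma * (0.0 if done else 1.0) * bootstrap


target = double_q_target(
    reward=1.0,
    done=False,
    next_q_online=[0.5, 2.0],  # hypothetical online-network Q-values
    next_q_target=[3.0, 1.0],  # hypothetical target-network Q-values
)
print(target)
```

Note that the target uses the value 1.0 rather than the target network's own maximum of 3.0. Decoupling action selection from evaluation in this way is what reduces the overestimation seen in standard DQN.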
PPO1¶
genrl.agents.deep.ppo1.ppo1 module¶

class genrl.agents.deep.ppo1.ppo1.PPO1(*args, clip_param: float = 0.2, value_coeff: float = 0.5, entropy_coeff: float = 0.01, **kwargs)¶
Bases: genrl.agents.deep.base.onpolicy.OnPolicyAgent
Proximal Policy Optimization algorithm (Clipped policy).
Paper: https://arxiv.org/abs/1707.06347

network
¶ The network type of the Q-value function. Supported types: [“cnn”, “mlp”]
Type: str

env
¶ The environment that the agent is supposed to act on
Type: Environment

create_model
¶ Whether the model of the algo should be created when initialised
Type: bool

batch_size
¶ Mini batch size for loading experiences
Type: int

gamma
¶ The discount factor for rewards
Type: float

layers
¶ Layers in the Neural Network of the Q-value function
Type: tuple of int
Sizes of shared layers in Actor Critic, if used
Type: tuple of int

lr_policy
¶ Learning rate for the policy/actor
Type: float

lr_value
¶ Learning rate for the Q-value function
Type: float

rollout_size
¶ Capacity of the Rollout Buffer
Type: int

buffer_type
¶ Choose the type of Buffer: [“rollout”]
Type: str

clip_param
¶ Epsilon for clipping policy loss
Type: float

value_coeff
¶ Ratio of magnitude of value updates to policy updates
Type: float

entropy_coeff
¶ Ratio of magnitude of entropy updates to policy updates
Type: float

seed
¶ Seed for randomness
Type: int

render
¶ Should the env be rendered during training?
Type: bool

device
¶ Hardware being used for training. Options: [“cuda” -> GPU, “cpu” -> CPU]
Type: str

evaluate_actions(states: torch.Tensor, actions: torch.Tensor)¶
Evaluates actions taken by actor
Actions taken by the actor and their respective states are analysed to get log probabilities and values from the critic
Parameters:
 states (torch.Tensor) – States encountered in rollout
 actions (torch.Tensor) – Actions taken in response to respective states
Returns:
 values (torch.Tensor): Values of states encountered during the rollout
 log_probs (torch.Tensor): Log of action probabilities given a state

get_hyperparams() → Dict[str, Any]¶
Get relevant hyperparameters to save
Returns:
 hyperparams (dict): Hyperparameters to be saved
 weights (torch.Tensor): Neural network weights

get_logging_params() → Dict[str, Any]¶
Gets relevant parameters for logging
Returns:
 logs (dict): Logging parameters for monitoring training

get_traj_loss(values, dones)¶
Get loss from trajectory traversed by agent during rollouts
Computes the returns and advantages needed for calculating loss
Parameters:
 values (torch.Tensor) – Values of states encountered during the rollout
 dones (list of bool) – Game over statuses of each environment

select_action(state: torch.Tensor, deterministic: bool = False) → torch.Tensor¶
Select action given state
Action Selection for On Policy Agents with Actor Critic
Parameters:
 state (np.ndarray) – Current state of the environment
 deterministic (bool) – Should the policy be deterministic or stochastic
Returns:
 action (np.ndarray): Action taken by the agent
 value (torch.Tensor): Value of given state
 log_prob (torch.Tensor): Log probability of selected action
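
The effect of clip_param is easiest to see on a single sample of the clipped surrogate objective from the PPO paper linked above. The sketch below uses plain floats and a hypothetical function name; it is not GenRL's loss code.

```python
def clipped_surrogate(ratio, advantage, clip_param=0.2):
    """PPO clipped objective for a single sample (to be maximised)."""
    clipped_ratio = min(max(ratio, 1.0 - clip_param), 1.0 + clip_param)
    return min(ratio * advantage, clipped_ratio * advantage)


# ratio = new_policy_prob / old_policy_prob for the sampled action (hypothetical values)
objective_up = clipped_surrogate(ratio=1.5, advantage=2.0)
objective_down = clipped_surrogate(ratio=0.5, advantage=-1.0)
print(objective_up, objective_down)
```

In both cases the ratio has moved further than clip_param allows, so the objective stops rewarding additional movement in that direction, which keeps each policy update small.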

VPG¶
genrl.agents.deep.vpg.vpg module¶

class genrl.agents.deep.vpg.vpg.VPG(*args, **kwargs)¶
Bases: genrl.agents.deep.base.onpolicy.OnPolicyAgent
Vanilla Policy Gradient algorithm
 network (str): The network type of the Q-value function. Supported types: [“cnn”, “mlp”]
 env (Environment): The environment that the agent is supposed to act on
 create_model (bool): Whether the model of the algo should be created when initialised
 batch_size (int): Mini batch size for loading experiences
 gamma (float): The discount factor for rewards
 layers (tuple of int): Layers in the Neural Network of the Q-value function
 lr_policy (float): Learning rate for the policy/actor
 lr_value (float): Learning rate for the Q-value function
 rollout_size (int): Capacity of the Rollout Buffer
 buffer_type (str): Choose the type of Buffer: [“rollout”]
 seed (int): Seed for randomness
 render (bool): Should the env be rendered during training?
 device (str): Hardware being used for training. Options: [“cuda” -> GPU, “cpu” -> CPU]

get_hyperparams() → Dict[str, Any]¶
Get relevant hyperparameters to save
Returns:
 hyperparams (dict): Hyperparameters to be saved
 weights (torch.Tensor): Neural network weights

get_log_probs(states: torch.Tensor, actions: torch.Tensor)¶
Get log probabilities of action values
Actions taken by the actor and their respective states are analysed to get log probabilities
Parameters:
 states (torch.Tensor) – States encountered in rollout
 actions (torch.Tensor) – Actions taken in response to respective states
Returns:
 log_probs (torch.Tensor): Log of action probabilities given a state

get_logging_params() → Dict[str, Any]¶
Gets relevant parameters for logging
Returns:
 logs (dict): Logging parameters for monitoring training

get_traj_loss(values, dones)¶
Get loss from trajectory traversed by agent during rollouts
Computes the returns and advantages needed for calculating loss
Parameters:
 values (torch.Tensor) – Values of states encountered during the rollout
 dones (list of bool) – Game over statuses of each environment
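
The returns and advantages mentioned above can be sketched for a single environment as follows. This is an illustration under stated assumptions: the helper discounted_returns is hypothetical, the advantage is taken as return minus value, and the values are all zero to mirror an agent without a critic. GenRL's rollout buffer may compute these quantities differently.

```python
def discounted_returns(rewards, dones, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1}, resetting at episode boundaries."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        if dones[t]:
            running = 0.0  # nothing carries over from the next episode
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


rewards = [1.0, 1.0, 1.0]
dones = [False, False, True]
values = [0.0, 0.0, 0.0]  # assumed: no critic, so value estimates are zero
returns = discounted_returns(rewards, dones, gamma=0.5)
advantages = [g - v for g, v in zip(returns, values)]
print(returns, advantages)
```

Earlier steps accumulate more discounted future reward, so their returns are larger than those of steps close to the end of the episode.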

select_action(state: torch.Tensor, deterministic: bool = False) → torch.Tensor¶
Select action given state
Action Selection for Vanilla Policy Gradient
Parameters:
 state (np.ndarray) – Current state of the environment
 deterministic (bool) – Should the policy be deterministic or stochastic
Returns:
 action (np.ndarray): Action taken by the agent
 value (torch.Tensor): Value of given state. In VPG, there is no critic to find the value, so we set this to a default of 0 for convenience
 log_prob (torch.Tensor): Log probability of selected action

TD3¶
genrl.agents.deep.td3.td3 module¶

class genrl.agents.deep.td3.td3.TD3(*args, policy_frequency: int = 2, noise: genrl.core.noise.ActionNoise = None, noise_std: float = 0.2, **kwargs)¶
Bases: genrl.agents.deep.base.offpolicy.OffPolicyAgentAC
Twin Delayed DDPG Algorithm
Paper: https://arxiv.org/abs/1802.09477

network
¶ The network type of the Q-value function. Supported types: [“cnn”, “mlp”]
Type: str

env
¶ The environment that the agent is supposed to act on
Type: Environment

create_model
¶ Whether the model of the algo should be created when initialised
Type: bool

batch_size
¶ Mini batch size for loading experiences
Type: int

gamma
¶ The discount factor for rewards
Type: float

policy_layers
¶ Neural network layer dimensions for the policy
Type: tuple of int

value_layers
¶ Neural network layer dimensions for the critics
Type: tuple of int
Sizes of shared layers in Actor Critic if using
Type: tuple of int

lr_policy
¶ Learning rate for the policy/actor
Type: float

lr_value
¶ Learning rate for the critic
Type: float

replay_size
¶ Capacity of the Replay Buffer
Type: int

buffer_type
¶ Choose the type of Buffer: [“push”, “prioritized”]
Type: str

polyak
¶ Target model update parameter (1 for hard update)
Type: float

policy_frequency
¶ Frequency of policy updates in comparison to critic updates
Type: int

noise
¶ Action Noise function added to aid in exploration
Type: ActionNoise

noise_std
¶ Standard deviation of the action noise distribution
Type: float

seed
¶ Seed for randomness
Type: int

render
¶ Should the env be rendered during training?
Type: bool

device
¶ Hardware being used for training. Options: [“cuda” -> GPU, “cpu” -> CPU]
Type: str

get_hyperparams
() → Dict[str, Any][source]¶ Get relevant hyperparameters to save
Returns: Hyperparameters to be saved
weights (torch.Tensor): Neural network weights
Return type: hyperparams (dict)
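To make the role of `policy_frequency` concrete, here is a minimal sketch of the two ideas that give TD3 its name: delayed actor updates and a target built from the smaller of two critic estimates. The helper names are hypothetical, and which step within each cycle triggers the actor update is an assumption; GenRL's own implementation may differ.

```python
from typing import List


def td3_update_schedule(num_updates: int, policy_frequency: int = 2) -> List[List[str]]:
    """List which networks would be updated at each gradient step.

    Assumption: the critic is updated every step and the actor only on
    steps that are a multiple of `policy_frequency`.
    """
    schedule = []
    for step in range(num_updates):
        updated = ["critic"]
        if step % policy_frequency == 0:
            updated.append("actor")
        schedule.append(updated)
    return schedule


def twin_q_target(
    reward: float, done: bool, q1_next: float, q2_next: float, gamma: float = 0.99
) -> float:
    """Clipped double-Q target: bootstrap from the smaller critic estimate."""
    return reward + gamma * (1.0 - float(done)) * min(q1_next, q2_next)
```

Taking the minimum of the two critics counteracts the overestimation bias that a single critic tends to accumulate.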

SAC¶
genrl.agents.deep.sac.sac module¶

class
genrl.agents.deep.sac.sac.
SAC
(*args, alpha: float = 0.01, polyak: float = 0.995, entropy_tuning: bool = True, **kwargs)[source]¶ Bases:
genrl.agents.deep.base.offpolicy.OffPolicyAgentAC
Soft Actor Critic algorithm (SAC)
Paper: https://arxiv.org/abs/1812.05905

network
¶ The network type of the Q-value function. Supported types: [“cnn”, “mlp”]
Type: str

env
¶ The environment that the agent is supposed to act on
Type: Environment

create_model
¶ Whether the model of the algo should be created when initialised
Type: bool

batch_size
¶ Mini batch size for loading experiences
Type: int

gamma
¶ The discount factor for rewards
Type: float

policy_layers
¶ Neural network layer dimensions for the policy
Type: tuple of int

value_layers
¶ Neural network layer dimensions for the critics
Type: tuple of int
Sizes of shared layers in Actor Critic if using
Type: tuple of int

lr_policy
¶ Learning rate for the policy/actor
Type: float

lr_value
¶ Learning rate for the critic
Type: float

replay_size
¶ Capacity of the Replay Buffer
Type: int

buffer_type
¶ Choose the type of Buffer: [“push”, “prioritized”]
Type: str

alpha
¶ Entropy factor
Type: float

polyak
¶ Target model update parameter (1 for hard update)
Type: float

entropy_tuning
¶ True if entropy tuning should be done, False otherwise
Type: bool

seed
¶ Seed for randomness
Type: int

render
¶ Should the env be rendered during training?
Type: bool

device
¶ Hardware being used for training. Options: [“cuda” -> GPU, “cpu” -> CPU]
Type: str

get_hyperparams
() → Dict[str, Any][source]¶ Get relevant hyperparameters to save
Returns: Hyperparameters to be saved
weights (torch.Tensor): Neural network weights
Return type: hyperparams (dict)

get_logging_params
() → Dict[str, Any][source]¶ Gets relevant parameters for logging
Returns: Logging parameters for monitoring training Return type: logs ( dict
)

get_p_loss
(states: torch.Tensor) → torch.Tensor[source]¶ Function to get the Policy loss
Parameters: states (torch.Tensor) – States for which Q-values need to be found
Returns: Calculated policy loss
Return type: loss (torch.Tensor)

get_target_q_values
(next_states: torch.Tensor, rewards: List[float], dones: List[bool]) → torch.Tensor[source]¶ Get target Q values for the SAC
Parameters:  next_states (
torch.Tensor
) – Next states for which target Q-values need to be found  rewards (
list
) – Rewards at each timestep for each environment  dones (
list
) – Game over status for each environment
Returns: Target Q values for the SAC
Return type: target_q_values (
torch.Tensor
)
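For orientation, the soft target commonly used in SAC can be written for a single transition as below. This is a sketch under stated assumptions and not GenRL's code: the function name and scalar arguments are hypothetical, and the real method works on batched tensors.

```python
def soft_q_target(
    reward: float,
    done: bool,
    q1_next: float,
    q2_next: float,
    next_log_prob: float,
    gamma: float = 0.99,
    alpha: float = 0.01,
) -> float:
    """Entropy-regularised target for one transition.

    Uses the smaller of two target-critic estimates and subtracts the
    entropy term alpha * log pi(a'|s') before bootstrapping.
    """
    soft_value = min(q1_next, q2_next) - alpha * next_log_prob
    return reward + gamma * (1.0 - float(done)) * soft_value
```

Because log-probabilities are usually negative, the entropy term raises the target for states where the policy is still uncertain, which is what keeps SAC exploring.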

select_action
(state: torch.Tensor, deterministic: bool = False) → torch.Tensor[source]¶ Select action given state
Action Selection
Parameters:  state (
np.ndarray
) – Current state of the environment  deterministic (bool) – Should the policy be deterministic or stochastic
Returns: Action taken by the agent
Return type: action (
np.ndarray
)

QLearning¶
genrl.agents.classical.qlearning.qlearning module¶

class
genrl.agents.classical.qlearning.qlearning.
QLearning
(env: gym.core.Env, epsilon: float = 0.9, gamma: float = 0.95, lr: float = 0.01)[source]¶ Bases:
object
QLearning Algorithm.
Paper https://link.springer.com/article/10.1007/BF00992698

env
¶ Environment with which agent interacts.
Type: gym.Env

epsilon
¶ exploration coefficient for epsilon-greedy exploration.
Type: float, optional

gamma
¶ discount factor.
Type: float, optional

lr
¶ learning rate for optimizer.
Type: float, optional

get_action
(state: numpy.ndarray, explore: bool = True) → numpy.ndarray[source]¶ Epsilon-greedy selection of an action in the explore phase.
Parameters:  state (np.ndarray) – Environment state.
 explore (bool, optional) – True if exploration is required. False if not.
Returns: action.
Return type: np.ndarray
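A tabular sketch of epsilon-greedy selection and the Q-learning update may help here. It is illustrative only: the function names are hypothetical, and treating `epsilon` as the probability of taking a random action is an assumption about how the parameter is used.

```python
import random
from typing import Dict, List


def epsilon_greedy(q_row: List[float], epsilon: float, rng: random.Random) -> int:
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=q_row.__getitem__)


def q_learning_update(
    q_table: Dict[int, List[float]],
    state: int,
    action: int,
    reward: float,
    next_state: int,
    gamma: float = 0.95,
    lr: float = 0.01,
) -> None:
    """Off-policy TD update: bootstrap from the best next action."""
    td_target = reward + gamma * max(q_table[next_state])
    q_table[state][action] += lr * (td_target - q_table[state][action])
```

The update bootstraps from the maximum over next actions regardless of which action the behaviour policy takes next, which is what makes Q-learning off-policy.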

SARSA¶
genrl.agents.classical.sarsa.sarsa module¶

class
genrl.agents.classical.sarsa.sarsa.
SARSA
(env: gym.core.Env, epsilon: float = 0.9, lmbda: float = 0.9, gamma: float = 0.95, lr: float = 0.01)[source]¶ Bases:
object
SARSA Algorithm.
Paper http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.17.2539&rep=rep1&type=pdf

env
¶ Environment with which agent interacts.
Type: gym.Env

epsilon
¶ exploration coefficient for epsilon-greedy exploration.
Type: float, optional

gamma
¶ discount factor.
Type: float, optional

lr
¶ learning rate for optimizer.
Type: float, optional

get_action
(state: numpy.ndarray, explore: bool = True) → numpy.ndarray[source]¶ Epsilon-greedy selection of an action in the explore phase.
Parameters:  state (np.ndarray) – Environment state.
 explore (bool, optional) – True if exploration is required. False if not.
Returns: action.
Return type: np.ndarray
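The one-step SARSA update is sketched below on a tabular Q function. The function name is hypothetical, and the sketch deliberately ignores the `lmbda` argument in the class signature, which presumably controls eligibility traces; that reading is an assumption.

```python
from typing import Dict, List


def sarsa_update(
    q_table: Dict[int, List[float]],
    state: int,
    action: int,
    reward: float,
    next_state: int,
    next_action: int,
    gamma: float = 0.95,
    lr: float = 0.01,
) -> None:
    """On-policy TD update: bootstrap from the action actually taken next."""
    td_target = reward + gamma * q_table[next_state][next_action]
    q_table[state][action] += lr * (td_target - q_table[state][action])
```

Unlike Q-learning, the target uses the next action chosen by the current policy, so exploratory actions influence the learned values.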

Contextual Bandit¶
Base¶

class
genrl.agents.bandits.contextual.base.
DCBAgent
(bandit: genrl.core.bandit.Bandit, device: str = 'cpu', **kwargs)[source]¶ Bases:
genrl.core.bandit.BanditAgent
Base class for deep contextual bandit solving agents
Parameters:  bandit (gennav.deep.bandit.data_bandits.DataBasedBandit) – The bandit to solve
 device (str) – Device to use for tensor operations. “cpu” for cpu or “cuda” for cuda. Defaults to “cpu”.

bandit
¶ The bandit to solve
Type: gennav.deep.bandit.data_bandits.DataBasedBandit

device
¶ Device to use for tensor operations.
Type: torch.device

select_action
(context: torch.Tensor) → int[source]¶ Select an action based on given context
Parameters: context (torch.Tensor) – The context vector to select action for
Note
This method needs to be implemented in the specific agent.
Returns: The action to take Return type: int

update_parameters
(action: Optional[int] = None, batch_size: Optional[int] = None, train_epochs: Optional[int] = None) → None[source]¶ Update parameters of the agent.
Parameters:  action (Optional[int], optional) – Action to update the parameters for. Defaults to None.
 batch_size (Optional[int], optional) – Size of batch to update parameters with. Defaults to None.
 train_epochs (Optional[int], optional) – Epochs to train neural network for. Defaults to None.
Note
This method needs to be implemented in the specific agent.
Bootstrap Neural¶

class
genrl.agents.bandits.contextual.bootstrap_neural.
BootstrapNeuralAgent
(bandit: genrl.utils.data_bandits.base.DataBasedBandit, **kwargs)[source]¶ Bases:
genrl.agents.bandits.contextual.base.DCBAgent
Bootstrapped ensemble agent for deep contextual bandits.
Parameters:  bandit (DataBasedBandit) – The bandit to solve
 init_pulls (int, optional) – Number of times to select each action initially. Defaults to 3.
 hidden_dims (List[int], optional) – Dimensions of hidden layers of network. Defaults to [50, 50].
 init_lr (float, optional) – Initial learning rate. Defaults to 0.1.
 lr_decay (float, optional) – Decay rate for learning rate. Defaults to 0.5.
 lr_reset (bool, optional) – Whether to reset learning rate every train interval. Defaults to True.
 max_grad_norm (float, optional) – Maximum norm of gradients for gradient clipping. Defaults to 0.5.
 dropout_p (Optional[float], optional) – Probability for dropout. Defaults to None which implies dropout is not to be used.
 eval_with_dropout (bool, optional) – Whether or not to use dropout at inference. Defaults to False.
 n (int, optional) – Number of models in ensemble. Defaults to 10.
 add_prob (float, optional) – Probability of adding a transition to a database. Defaults to 0.95.
 device (str) – Device to use for tensor operations. “cpu” for cpu or “cuda” for cuda. Defaults to “cpu”.

select_action
(context: torch.Tensor) → int[source]¶ Select an action based on given context.
Selects an action by computing a forward pass through a randomly selected network from the ensemble.
Parameters: context (torch.Tensor) – The context vector to select action for. Returns: The action to take. Return type: int

update_db
(context: torch.Tensor, action: int, reward: int)[source]¶ Updates transition database with given transition
The transition is added to each database with a certain probability.
Parameters:  context (torch.Tensor) – Context received
 action (int) – Action taken
 reward (int) – Reward received
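The probabilistic insertion described for `update_db` can be illustrated with a small sketch. Plain lists stand in for the per-model transition databases, and the helper name `add_to_ensemble` is hypothetical; only the `add_prob` idea comes from the documentation above.

```python
import random
from typing import List, Optional, Tuple

Transition = Tuple[object, int, float]  # (context, action, reward)


def add_to_ensemble(
    databases: List[List[Transition]],
    transition: Transition,
    add_prob: float = 0.95,
    rng: Optional[random.Random] = None,
) -> None:
    """Add the transition to each database independently with prob. add_prob."""
    rng = rng or random.Random()
    for db in databases:
        if rng.random() < add_prob:
            db.append(transition)
```

Because each model sees a slightly different subset of the data, the ensemble members disagree where data is scarce, and sampling one member per decision turns that disagreement into exploration.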

update_params
(action: Optional[int] = None, batch_size: int = 512, train_epochs: int = 20)[source]¶ Update parameters of the agent.
Trains each neural network in the ensemble.
Parameters:  action (Optional[int], optional) – Action to update the parameters for. Not applicable in this agent. Defaults to None.
 batch_size (int, optional) – Size of batch to update parameters with. Defaults to 512
 train_epochs (int, optional) – Epochs to train neural network for. Defaults to 20
Fixed¶

class
genrl.agents.bandits.contextual.fixed.
FixedAgent
(bandit: genrl.utils.data_bandits.base.DataBasedBandit, p: List[float] = None, device: str = 'cpu')[source]¶
Linear Posterior¶

class
genrl.agents.bandits.contextual.linpos.
LinearPosteriorAgent
(bandit: genrl.utils.data_bandits.base.DataBasedBandit, **kwargs)[source]¶ Bases:
genrl.agents.bandits.contextual.base.DCBAgent
Deep contextual bandit agent using bayesian regression for posterior inference.
Parameters:  bandit (DataBasedBandit) – The bandit to solve
 init_pulls (int, optional) – Number of times to select each action initially. Defaults to 3.
 lambda_prior (float, optional) – Gaussian prior for linear model. Defaults to 0.25.
 a0 (float, optional) – Inverse gamma prior for noise. Defaults to 6.0.
 b0 (float, optional) – Inverse gamma prior for noise. Defaults to 6.0.
 device (str) – Device to use for tensor operations. “cpu” for cpu or “cuda” for cuda. Defaults to “cpu”.

select_action
(context: torch.Tensor) → int[source]¶ Select an action based on given context.
Selecting action with highest predicted reward computed through betas sampled from posterior.
Parameters: context (torch.Tensor) – The context vector to select action for. Returns: The action to take. Return type: int

update_db
(context: torch.Tensor, action: int, reward: int)[source]¶ Updates transition database with given transition
Parameters:  context (torch.Tensor) – Context received
 action (int) – Action taken
 reward (int) – Reward received

update_params
(action: int, batch_size: int = 512, train_epochs: Optional[int] = None)[source]¶ Update parameters of the agent.
Updates the posterior over beta through Bayesian regression.
Parameters:  action (int) – Action to update the parameters for.
 batch_size (int, optional) – Size of batch to update parameters with. Defaults to 512
 train_epochs (Optional[int], optional) – Epochs to train neural network for. Not applicable in this agent. Defaults to None
Neural Greedy¶

class
genrl.agents.bandits.contextual.neural_greedy.
NeuralGreedyAgent
(bandit: genrl.utils.data_bandits.base.DataBasedBandit, **kwargs)[source]¶ Bases:
genrl.agents.bandits.contextual.base.DCBAgent
Deep contextual bandit agent using epsilon greedy with a neural network.
Parameters:  bandit (DataBasedBandit) – The bandit to solve
 init_pulls (int, optional) – Number of times to select each action initially. Defaults to 3.
 hidden_dims (List[int], optional) – Dimensions of hidden layers of network. Defaults to [50, 50].
 init_lr (float, optional) – Initial learning rate. Defaults to 0.1.
 lr_decay (float, optional) – Decay rate for learning rate. Defaults to 0.5.
 lr_reset (bool, optional) – Whether to reset learning rate every train interval. Defaults to True.
 max_grad_norm (float, optional) – Maximum norm of gradients for gradient clipping. Defaults to 0.5.
 dropout_p (Optional[float], optional) – Probability for dropout. Defaults to None which implies dropout is not to be used.
 eval_with_dropout (bool, optional) – Whether or not to use dropout at inference. Defaults to False.
 epsilon (float, optional) – Probability of selecting a random action. Defaults to 0.0.
 device (str) – Device to use for tensor operations. “cpu” for cpu or “cuda” for cuda. Defaults to “cpu”.

select_action
(context: torch.Tensor) → int[source]¶ Select an action based on given context.
Selects an action by computing a forward pass through the network, with an epsilon probability of selecting a random action.
Parameters: context (torch.Tensor) – The context vector to select action for. Returns: The action to take. Return type: int

update_db
(context: torch.Tensor, action: int, reward: int)[source]¶ Updates transition database with given transition
Parameters:  context (torch.Tensor) – Context received
 action (int) – Action taken
 reward (int) – Reward received

update_params
(action: Optional[int] = None, batch_size: int = 512, train_epochs: int = 20)[source]¶ Update parameters of the agent.
Trains neural network.
Parameters:  action (Optional[int], optional) – Action to update the parameters for. Not applicable in this agent. Defaults to None.
 batch_size (int, optional) – Size of batch to update parameters with. Defaults to 512
 train_epochs (int, optional) – Epochs to train neural network for. Defaults to 20
Neural Linear Posterior¶

class
genrl.agents.bandits.contextual.neural_linpos.
NeuralLinearPosteriorAgent
(bandit: genrl.utils.data_bandits.base.DataBasedBandit, **kwargs)[source]¶ Bases:
genrl.agents.bandits.contextual.base.DCBAgent
Deep contextual bandit agent using Bayesian regression for posterior inference.
A neural network is used to transform the context vector into a latent representation on which Bayesian regression is performed.
Parameters:  bandit (DataBasedBandit) – The bandit to solve
 init_pulls (int, optional) – Number of times to select each action initially. Defaults to 3.
 hidden_dims (List[int], optional) – Dimensions of hidden layers of network. Defaults to [50, 50].
 init_lr (float, optional) – Initial learning rate. Defaults to 0.1.
 lr_decay (float, optional) – Decay rate for learning rate. Defaults to 0.5.
 lr_reset (bool, optional) – Whether to reset learning rate every train interval. Defaults to True.
 max_grad_norm (float, optional) – Maximum norm of gradients for gradient clipping. Defaults to 0.5.
 dropout_p (Optional[float], optional) – Probability for dropout. Defaults to None which implies dropout is not to be used.
 eval_with_dropout (bool, optional) – Whether or not to use dropout at inference. Defaults to False.
 nn_update_ratio (int, optional) – . Defaults to 2.
 lambda_prior (float, optional) – Gaussian prior for linear model. Defaults to 0.25.
 a0 (float, optional) – Inverse gamma prior for noise. Defaults to 3.0.
 b0 (float, optional) – Inverse gamma prior for noise. Defaults to 3.0.
 device (str) – Device to use for tensor operations. “cpu” for cpu or “cuda” for cuda. Defaults to “cpu”.

select_action
(context: torch.Tensor) → int[source]¶ Select an action based on given context.
Selects an action by computing a forward pass through network to output a representation of the context on which bayesian linear regression is performed to select an action.
Parameters: context (torch.Tensor) – The context vector to select action for. Returns: The action to take. Return type: int

update_db
(context: torch.Tensor, action: int, reward: int)[source]¶ Updates transition database with given transition
Updates latent context and predicted rewards separately.
Parameters:  context (torch.Tensor) – Context received
 action (int) – Action taken
 reward (int) – Reward received

update_params
(action: int, batch_size: int = 512, train_epochs: int = 20)[source]¶ Update parameters of the agent.
Trains neural network and updates bayesian regression parameters.
Parameters:  action (int) – Action to update the parameters for.
 batch_size (int, optional) – Size of batch to update parameters with. Defaults to 512
 train_epochs (int, optional) – Epochs to train neural network for. Defaults to 20
Neural Noise Sampling¶

class
genrl.agents.bandits.contextual.neural_noise_sampling.
NeuralNoiseSamplingAgent
(bandit: genrl.utils.data_bandits.base.DataBasedBandit, **kwargs)[source]¶ Bases:
genrl.agents.bandits.contextual.base.DCBAgent
Deep contextual bandit agent with noise sampling for neural network parameters.
Parameters:  bandit (DataBasedBandit) – The bandit to solve
 init_pulls (int, optional) – Number of times to select each action initially. Defaults to 3.
 hidden_dims (List[int], optional) – Dimensions of hidden layers of network. Defaults to [50, 50].
 init_lr (float, optional) – Initial learning rate. Defaults to 0.1.
 lr_decay (float, optional) – Decay rate for learning rate. Defaults to 0.5.
 lr_reset (bool, optional) – Whether to reset learning rate every train interval. Defaults to True.
 max_grad_norm (float, optional) – Maximum norm of gradients for gradient clipping. Defaults to 0.5.
 dropout_p (Optional[float], optional) – Probability for dropout. Defaults to None which implies dropout is not to be used.
 eval_with_dropout (bool, optional) – Whether or not to use dropout at inference. Defaults to False.
 noise_std_dev (float, optional) – Standard deviation of sampled noise. Defaults to 0.05.
 eps (float, optional) – Small constant for bounding KL divergence of noise. Defaults to 0.1.
 noise_update_batch_size (int, optional) – Batch size for updating noise parameters. Defaults to 256.
 device (str) – Device to use for tensor operations. “cpu” for cpu or “cuda” for cuda. Defaults to “cpu”.

select_action
(context: torch.Tensor) → int[source]¶ Select an action based on given context.
Selects an action by adding noise to the neural network parameters and then computing a forward pass with the context vector as input.
Parameters: context (torch.Tensor) – The context vector to select action for. Returns: The action to take Return type: int

update_db
(context: torch.Tensor, action: int, reward: int)[source]¶ Updates transition database with given transition
Parameters:  context (torch.Tensor) – Context received
 action (int) – Action taken
 reward (int) – Reward received

update_params
(action: Optional[int] = None, batch_size: int = 512, train_epochs: int = 20)[source]¶ Update parameters of the agent.
Trains each neural network in the ensemble.
Parameters:  action (Optional[int], optional) – Action to update the parameters for. Not applicable in this agent. Defaults to None.
 batch_size (int, optional) – Size of batch to update parameters with. Defaults to 512
 train_epochs (int, optional) – Epochs to train neural network for. Defaults to 20
Variational¶

class
genrl.agents.bandits.contextual.variational.
VariationalAgent
(bandit: genrl.utils.data_bandits.base.DataBasedBandit, **kwargs)[source]¶ Bases:
genrl.agents.bandits.contextual.base.DCBAgent
Deep contextual bandit agent using variational inference.
Parameters:  bandit (DataBasedBandit) – The bandit to solve
 init_pulls (int, optional) – Number of times to select each action initially. Defaults to 3.
 hidden_dims (List[int], optional) – Dimensions of hidden layers of network. Defaults to [50, 50].
 init_lr (float, optional) – Initial learning rate. Defaults to 0.1.
 lr_decay (float, optional) – Decay rate for learning rate. Defaults to 0.5.
 lr_reset (bool, optional) – Whether to reset learning rate every train interval. Defaults to True.
 max_grad_norm (float, optional) – Maximum norm of gradients for gradient clipping. Defaults to 0.5.
 dropout_p (Optional[float], optional) – Probability for dropout. Defaults to None which implies dropout is not to be used.
 eval_with_dropout (bool, optional) – Whether or not to use dropout at inference. Defaults to False.
 noise_std (float, optional) – Standard deviation of noise in bayesian neural network. Defaults to 0.1.
 device (str) – Device to use for tensor operations. “cpu” for cpu or “cuda” for cuda. Defaults to “cpu”.

select_action
(context: torch.Tensor) → int[source]¶ Select an action based on given context.
Selects an action by computing a forward pass through the bayesian neural network.
Parameters: context (torch.Tensor) – The context vector to select action for. Returns: The action to take. Return type: int

update_db
(context: torch.Tensor, action: int, reward: int)[source]¶ Updates transition database with given transition
Parameters:  context (torch.Tensor) – Context received
 action (int) – Action taken
 reward (int) – Reward received

update_params
(action: int, batch_size: int = 512, train_epochs: int = 20)[source]¶ Update parameters of the agent.
Trains each neural network in the ensemble.
Parameters:  action (Optional[int], optional) – Action to update the parameters for. Not applicable in this agent. Defaults to None.
 batch_size (int, optional) – Size of batch to update parameters with. Defaults to 512
 train_epochs (int, optional) – Epochs to train neural network for. Defaults to 20
MultiArmed Bandit¶
Base¶

class
genrl.agents.bandits.multiarmed.base.
MABAgent
(bandit: genrl.core.bandit.MultiArmedBandit)[source]¶ Bases:
genrl.core.bandit.BanditAgent
Base Class for Contextual Bandit solving Policy
Parameters:  bandit (MultiArmedBandit type object) – The Bandit to solve
 requires_init_run – Indicates if initialisation of Q values is required

action_hist
¶ Get the history of actions taken for contexts
Returns: List of context, actions pairs Return type: list

counts
¶ Get the number of times each action has been taken
Returns: Numpy array with count for each action Return type: numpy.ndarray

regret
¶ Get the current regret
Returns: The current regret Return type: float

regret_hist
¶ Get the history of regrets incurred for each step
Returns: List of regrets Return type: list

reward_hist
¶ Get the history of rewards received for each step
Returns: List of rewards Return type: list

select_action
(context: int) → int[source]¶ Select an action
This method needs to be implemented in the specific policy.
Parameters: context (int) – the context to select action for Returns: Selected action Return type: int

update_params
(context: int, action: int, reward: Union[int, float]) → None[source]¶ Update parameters for the policy
This method needs to be implemented in the specific policy.
Parameters:  context (int) – context for which action is taken
 action (int) – action taken for the step
 reward (int or float) – reward obtained for the step
Bayesian Bandit¶

class
genrl.agents.bandits.multiarmed.bayesian.
BayesianUCBMABAgent
(bandit: genrl.core.bandit.MultiArmedBandit, alpha: float = 1.0, beta: float = 1.0, confidence: float = 3.0)[source]¶ Bases:
genrl.agents.bandits.multiarmed.base.MABAgent
MultiArmed Bandit Solver with Bayesian Upper Confidence Bound based Action Selection Strategy.
Refer to Section 2.7 of Reinforcement Learning: An Introduction.
Parameters:  bandit (MultiArmedBandit type object) – The Bandit to solve
 alpha (float) – alpha value for beta distribution
 beta (float) – beta value for beta distribution
 c (float) – Confidence level which controls degree of exploration

a
¶ alpha parameter of beta distribution associated with the policy
Type: numpy.ndarray

b
¶ beta parameter of beta distribution associated with the policy
Type: numpy.ndarray

confidence
¶ Confidence level which weights the exploration term
Type: float

quality
¶ Q values for all the actions for alpha, beta and c
Type: numpy.ndarray

select_action
(context: int) → int[source]¶ Select an action according to bayesian upper confidence bound
Take action that maximises a weighted sum of the Q values and a beta distribution parameterized by alpha and beta and weighted by c for each action
Parameters:  context (int) – the context to select action for
 t (int) – timestep to choose action for
Returns: Selected action
Return type: int

update_params
(context: int, action: int, reward: float) → None[source]¶ Update parameters for the policy
Updates the regret as the difference between max Q value and that of the action. Updates the Q values according to the reward received in this step
Parameters:  context (int) – context for which action is taken
 action (int) – action taken for the step
 reward (float) – reward obtained for the step
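One common way to turn a Beta posterior into an upper confidence score is the posterior mean plus a multiple of the posterior standard deviation. The sketch below assumes that form; the function name is hypothetical and GenRL's exact scoring rule may differ.

```python
import math
from typing import List


def bayesian_ucb_scores(
    a: List[float], b: List[float], confidence: float = 3.0
) -> List[float]:
    """Score each arm as Beta mean + confidence * Beta standard deviation."""
    scores = []
    for alpha, beta in zip(a, b):
        total = alpha + beta
        mean = alpha / total
        # Variance of a Beta(alpha, beta) distribution.
        variance = alpha * beta / (total * total * (total + 1.0))
        scores.append(mean + confidence * math.sqrt(variance))
    return scores
```

The agent would then pick the arm with the highest score; arms with few observations keep a wide posterior and therefore a large exploration bonus.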
Bernoulli Bandit¶

class
genrl.agents.bandits.multiarmed.bernoulli_mab.
BernoulliMAB
(bandits: int = 1, arms: int = 5, reward_probs: numpy.ndarray = None, context_type: str = 'tensor')[source]¶ Bases:
genrl.core.bandit.MultiArmedBandit
Contextual Bandit with categorical context and Bernoulli reward distribution
Parameters:  bandits (int) – Number of bandits
 arms (int) – Number of arms in each bandit
 reward_probs (numpy.ndarray) – Probabilities of getting rewards
Epsilon Greedy¶

class
genrl.agents.bandits.multiarmed.epsgreedy.
EpsGreedyMABAgent
(bandit: genrl.core.bandit.MultiArmedBandit, eps: float = 0.05)[source]¶ Bases:
genrl.agents.bandits.multiarmed.base.MABAgent
Contextual Bandit Policy with Epsilon Greedy Action Selection Strategy.
Refer to Section 2.3 of Reinforcement Learning: An Introduction.
Parameters:  bandit (MultiArmedBandit type object) – The Bandit to solve
 eps (float) – Probability with which a random action is to be selected.

eps
¶ Exploration constant
Type: float

quality
¶ Q values assigned by the policy to all actions
Type: numpy.ndarray

select_action
(context: int) → int[source]¶ Select an action according to epsilon greedy strategy
With probability epsilon a random action is selected instead of the action that is optimal under the current Q values, which encourages exploration.
Parameters: context (int) – the context to select action for Returns: Selected action Return type: int

update_params
(context: int, action: int, reward: float) → None[source]¶ Update parameters for the policy
Updates the regret as the difference between max Q value and that of the action. Updates the Q values according to the reward received in this step.
Parameters:  context (int) – context for which action is taken
 action (int) – action taken for the step
 reward (float) – reward obtained for the step
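The update described above can be sketched with an incremental sample average, which is the standard choice in Section 2.4 of Reinforcement Learning: An Introduction. The helper names and the use of plain lists are assumptions made for illustration, and the context dimension is omitted.

```python
from typing import List


def step_regret(quality: List[float], action: int) -> float:
    """Regret estimate: best current Q value minus that of the chosen action."""
    return max(quality) - quality[action]


def update_quality(
    quality: List[float], counts: List[int], action: int, reward: float
) -> None:
    """Move the Q value of `action` towards the running mean of its rewards."""
    counts[action] += 1
    quality[action] += (reward - quality[action]) / counts[action]
```

The incremental form avoids storing every past reward while giving exactly the same value as averaging them.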
Gaussian¶

class
genrl.agents.bandits.multiarmed.gaussian_mab.
GaussianMAB
(bandits: int = 10, arms: int = 5, reward_means: numpy.ndarray = None, context_type: str = 'tensor')[source]¶ Bases:
genrl.core.bandit.MultiArmedBandit
Contextual Bandit with categorical context and Gaussian reward distribution
Parameters:  bandits (int) – Number of bandits
 arms (int) – Number of arms in each bandit
 reward_means (numpy.ndarray) – Mean of gaussian distribution for each reward
Gradient¶

class
genrl.agents.bandits.multiarmed.gradient.
GradientMABAgent
(bandit: genrl.core.bandit.MultiArmedBandit, alpha: float = 0.1, temp: float = 0.01)[source]¶ Bases:
genrl.agents.bandits.multiarmed.base.MABAgent
MultiArmed Bandit Solver with Softmax Action Selection Strategy.
Refer to Section 2.8 of Reinforcement Learning: An Introduction.
Parameters:  bandit (MultiArmedBandit type object) – The Bandit to solve
 alpha (float) – The step size parameter for gradient based update
 temp (float) – Temperature for softmax distribution over Q values of actions

alpha
¶ Step size parameter for gradient based update of policy
Type: float

probability_hist
¶ History of probability values assigned to each action for each timestep
Type: numpy.ndarray

quality
¶ Q values assigned by the policy to all actions
Type: numpy.ndarray

select_action
(context: int) → int[source]¶ Select an action according to the softmax action selection strategy
Action is sampled from softmax distribution computed over the Q values for all actions
Parameters: context (int) – the context to select action for Returns: Selected action Return type: int

temp
¶ Temperature for softmax distribution over Q values of actions
Type: float

update_params
(context: int, action: int, reward: float) → None[source]¶ Update parameters for the policy
Updates the regret as the difference between max Q value and that of the action. Updates the Q values through a gradient ascent step
Parameters:  context (int) – context for which action is taken
 action (int) – action taken for the step
 reward (float) – reward obtained for the step
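A temperature-scaled softmax over Q values can be sketched as follows. Dividing the Q values by `temp` is an assumption about how the temperature enters, and the function name is hypothetical.

```python
import math
from typing import List


def softmax_probabilities(quality: List[float], temp: float = 0.01) -> List[float]:
    """Turn Q values into action probabilities with a temperature."""
    scaled = [q / temp for q in quality]
    # Subtract the maximum before exponentiating for numerical stability.
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

A small temperature concentrates probability on the best action, while a large one approaches uniform sampling; an action can then be drawn with `random.choices(range(len(probs)), weights=probs)`.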
Thompson Sampling¶

class
genrl.agents.bandits.multiarmed.thompson.
ThompsonSamplingMABAgent
(bandit: genrl.core.bandit.MultiArmedBandit, alpha: float = 1.0, beta: float = 1.0)[source]¶ Bases:
genrl.agents.bandits.multiarmed.base.MABAgent
MultiArmed Bandit Solver with Thompson Sampling based Action Selection Strategy.
Parameters:  bandit (MultiArmedBandit type object) – The Bandit to solve
 a (float) – alpha value for beta distribution
 b (float) – beta value for beta distribution

a
¶ alpha parameter of beta distribution associated with the policy
Type: numpy.ndarray

b
¶ beta parameter of beta distribution associated with the policy
Type: numpy.ndarray

quality
¶ Q values for all the actions for alpha, beta and c
Type: numpy.ndarray

select_action
(context: int) → int[source]¶ Select an action according to Thompson Sampling
Samples are taken from beta distribution parameterized by alpha and beta for each action. The action with the highest sample is selected.
Parameters: context (int) – the context to select action for Returns: Selected action Return type: int

update_params
(context: int, action: int, reward: float) → None[source]¶ Update parameters for the policy
Updates the regret as the difference between max Q value and that of the action. Updates the alpha value of beta distribution by adding the reward while the beta value is updated by adding 1 - reward. Updates the count of the action taken.
Parameters:  context (int) – context for which action is taken
 action (int) – action taken for the step
 reward (float) – reward obtained for the step
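The selection and update rules described above map directly onto a few lines of standard-library Python. The function names are hypothetical, the context dimension is omitted, and rewards are assumed to lie in [0, 1] as for a Bernoulli bandit.

```python
import random
from typing import List


def thompson_select(a: List[float], b: List[float], rng: random.Random) -> int:
    """Sample each arm's Beta(a, b) posterior and pick the largest sample."""
    samples = [rng.betavariate(alpha, beta) for alpha, beta in zip(a, b)]
    return max(range(len(samples)), key=samples.__getitem__)


def thompson_update(a: List[float], b: List[float], action: int, reward: float) -> None:
    """Add the reward to alpha and (1 - reward) to beta for the chosen arm."""
    a[action] += reward
    b[action] += 1.0 - reward
```

With binary rewards, alpha counts successes and beta counts failures, so the posterior narrows around each arm's empirical success rate as it is pulled.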
Upper Confidence Bound¶

class
genrl.agents.bandits.multiarmed.ucb.
UCBMABAgent
(bandit: genrl.core.bandit.MultiArmedBandit, confidence: float = 1.0)[source]¶ Bases:
genrl.agents.bandits.multiarmed.base.MABAgent
MultiArmed Bandit Solver with Upper Confidence Bound based Action Selection Strategy.
Refer to Section 2.7 of Reinforcement Learning: An Introduction.
Parameters:  bandit (MultiArmedBandit type object) – The Bandit to solve
 c (float) – Confidence level which controls degree of exploration

confidence
¶ Confidence level which weights the exploration term
Type: float

quality
¶ q values assigned by the policy to all actions
Type: numpy.ndarray

select_action
(context: int) → int[source]¶ Select an action according to upper confidence bound action selection
Take action that maximises a weighted sum of the Q values for the action and an exploration encouragement term controlled by c.
Parameters: context (int) – the context to select action for Returns: Selected action Return type: int

update_params
(context: int, action: int, reward: float) → None[source]¶ Update parameters for the policy
Updates the regret as the difference between max Q value and that of the action. Updates the Q values according to the reward received in this step.
Parameters:  context (int) – context for which action is taken
 action (int) – action taken for the step
 reward (float) – reward obtained for the step
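The rule from Section 2.7 of Reinforcement Learning: An Introduction scores each action as Q + c * sqrt(ln t / N). A minimal sketch follows; the function name is hypothetical, and trying every untried arm first is an assumption about how zero counts are handled.

```python
import math
from typing import List


def ucb_select(
    quality: List[float], counts: List[int], t: int, confidence: float = 1.0
) -> int:
    """Pick the action maximising Q + confidence * sqrt(ln t / N)."""
    # An untried action has an unbounded bonus, so try it first.
    for action, n in enumerate(counts):
        if n == 0:
            return action
    scores = [
        q + confidence * math.sqrt(math.log(t) / n)
        for q, n in zip(quality, counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

The bonus shrinks as an action's count grows, so rarely tried actions are revisited until their estimates are trustworthy.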
Environments¶
Environments¶
Subpackages¶
Vectorized Environments¶
Submodules¶
genrl.environments.vec_env.monitor module¶

class
genrl.environments.vec_env.monitor.
VecMonitor
(venv: genrl.environments.vec_env.vector_envs.VecEnv, history_length: int = 0, info_keys: Tuple = ())[source]¶ Bases:
genrl.environments.vec_env.wrappers.VecEnvWrapper
Monitor class for VecEnvs. Saves important variables into the info dictionary
Parameters:  venv (object) – Vectorized Environment
 history_length (int) – Length of history for episode rewards and episode lengths
 info_keys (tuple or list) – Important variables to save
genrl.environments.vec_env.normalize module¶

class
genrl.environments.vec_env.normalize.
VecNormalize
(venv: genrl.environments.vec_env.vector_envs.VecEnv, norm_obs: bool = True, norm_reward: bool = True, clip_reward: float = 20.0)[source]¶ Bases:
genrl.environments.vec_env.wrappers.VecEnvWrapper
Wrapper to implement Normalization of observations and rewards for VecEnvs
Parameters:  venv (Vectorized Environment) – The Vectorized environment
 n_envs (int) – Number of environments in VecEnv
 norm_obs (bool) – True if observations should be normalized, else False
 norm_reward (bool) – True if rewards should be normalized, else False
 clip_reward (float) – Maximum absolute value for rewards
genrl.environments.vec_env.utils module¶

class
genrl.environments.vec_env.utils.
RunningMeanStd
(epsilon: float = 0.0001, shape: Tuple = ())[source]¶ Bases:
object
Utility class to compute a running mean and variance
Parameters:  epsilon (float) – Small number to prevent division by zero for calculations
 shape (Tuple) – Shape of the RMS object
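A running mean and variance can be maintained batch by batch without storing past data. The sketch below is a scalar, pure-Python version of the usual parallel-variance update. The class name is hypothetical, and using epsilon as the initial count is an assumption about how the "prevent division by zero" parameter is applied.

```python
class RunningStats:
    """Scalar running mean/variance, updated one batch at a time."""

    def __init__(self, epsilon=1e-4):
        self.mean = 0.0
        self.var = 1.0
        self.count = epsilon  # assumption: epsilon seeds the count

    def update(self, batch):
        batch_count = len(batch)
        batch_mean = sum(batch) / batch_count
        batch_var = sum((x - batch_mean) ** 2 for x in batch) / batch_count

        delta = batch_mean - self.mean
        total = self.count + batch_count
        # Combine the sums of squared deviations of the two groups.
        m2 = (
            self.var * self.count
            + batch_var * batch_count
            + delta ** 2 * self.count * batch_count / total
        )
        self.mean += delta * batch_count / total
        self.var = m2 / total
        self.count = total


stats = RunningStats()
stats.update([1.0, 2.0, 3.0])
stats.update([4.0, 5.0])
```

The statistics after both updates are close to those of the full sequence 1..5 (mean 3, population variance 2), up to the small bias introduced by epsilon.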
genrl.environments.vec_env.vector_envs module¶

class
genrl.environments.vec_env.vector_envs.
SerialVecEnv
(*args, **kwargs)[source]¶ Bases:
genrl.environments.vec_env.vector_envs.VecEnv
Constructs a wrapper for serial execution through envs.

class
genrl.environments.vec_env.vector_envs.
SubProcessVecEnv
(*args, **kwargs)[source]¶ Bases:
genrl.environments.vec_env.vector_envs.VecEnv
Constructs a wrapper for parallel execution through envs.

class
genrl.environments.vec_env.vector_envs.
VecEnv
(envs: List[T], n_envs: int = 2)[source]¶ Bases:
abc.ABC
Base class for multiple environments.
Parameters:  envs (list of Gym Environments) – Gym environments to be vectorised
 n_envs (int) – Number of environments

action_shape
¶

action_spaces
¶

n_envs
¶

obs_shape
¶

observation_spaces
¶

genrl.environments.vec_env.vector_envs.
worker
(parent_conn: multiprocessing.context.BaseContext.Pipe, child_conn: multiprocessing.context.BaseContext.Pipe, env: gym.core.Env)[source]¶ Worker function to facilitate multiprocessing
Parameters:  parent_conn (Multiprocessing Pipe Connection) – Parent connection of Pipe
 child_conn (Multiprocessing Pipe Connection) – Child connection of Pipe
 env (Gym Environment) – Gym environment we need multiprocessing for
genrl.environments.vec_env.wrappers module¶
Module contents¶
Submodules¶
genrl.environments.action_wrappers module¶

class
genrl.environments.action_wrappers.
ClipAction
(env: Union[gym.core.Env, genrl.environments.vec_env.vector_envs.VecEnv])[source]¶ Bases:
gym.core.ActionWrapper
Action Wrapper to clip actions
Parameters: env (object) – The environment whose actions need to be clipped

class
genrl.environments.action_wrappers.
RescaleAction
(env: Union[gym.core.Env, genrl.environments.vec_env.vector_envs.VecEnv], low: int, high: int)[source]¶ Bases:
gym.core.ActionWrapper
Action Wrapper to rescale actions
Parameters:  env (object) – The environment whose actions need to be rescaled
 low (int) – Lower limit of action
 high (int) – Upper limit of action
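The two action wrappers can be pictured as element-wise operations on an action vector. The sketch below is only an illustration under assumptions: the helper names are hypothetical, and the linear mapping from [low, high] onto the environment's bounds is one plausible reading of "rescale", not a statement of what GenRL does.

```python
def clip_action(action, low, high):
    """Clamp each action component into [low_i, high_i]."""
    return [min(max(a, lo), hi) for a, lo, hi in zip(action, low, high)]


def rescale_action(action, low, high, env_low, env_high):
    """Linearly map each component from [low, high] to [env_low_i, env_high_i]."""
    return [
        el + (a - low) * (eh - el) / (high - low)
        for a, el, eh in zip(action, env_low, env_high)
    ]
```

For instance, an agent that outputs actions in [-1, 1] could be mapped onto an environment that expects values in [0, 10].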
genrl.environments.atari_preprocessing module¶

class
genrl.environments.atari_preprocessing.
AtariPreprocessing
(env: gym.core.Env, frameskip: Union[Tuple, int] = (2, 5), grayscale: bool = True, screen_size: int = 84)[source]¶ Bases:
gym.core.Wrapper
Implementation for Image preprocessing for Gym Atari environments. Implements: 1) Frameskip 2) Grayscale 3) Downsampling to square image
Parameters:  env (Gym Environment) – Atari environment
 frameskip (tuple or int) – Number of steps between actions. E.g. frameskip=4 will mean 1 action will be taken for every 4 frames. It’ll be a tuple if non-deterministic, and a random number will be chosen from (2, 5)
 grayscale (boolean) – Whether or not the output should be converted to grayscale
 screen_size (int) – Size of the output screen (square output)
genrl.environments.atari_wrappers module¶

class
genrl.environments.atari_wrappers.
FireReset
(env: gym.core.Env)[source]¶ Bases:
gym.core.Wrapper
Some Atari environments do not actually do anything until a specific action (the fire action) is taken, so we make it take the action before starting the training process
Parameters: env (Gym Environment) – Atari environment

class
genrl.environments.atari_wrappers.
NoopReset
(env: gym.core.Env, max_noops: int = 30)[source]¶ Bases:
gym.core.Wrapper
Some Atari environments always reset to the same state, so we take a random number of empty (no-op) actions to introduce some stochasticity.
Parameters:  env (Gym Environment) – Atari environment
 max_noops (int) – Maximum number of Noops to be taken
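The idea behind a no-op reset can be shown with a toy environment. Everything below is a hypothetical sketch: CountingEnv stands in for a real Atari environment, and treating action 0 as the no-op and drawing between 1 and max_noops steps are assumptions.

```python
import random


class CountingEnv:
    """Hypothetical environment whose observation is the step count."""

    def __init__(self):
        self.steps = 0

    def reset(self):
        self.steps = 0
        return self.steps

    def step(self, action):
        self.steps += 1
        return self.steps, 0.0, False, {}


def noop_reset(env, max_noops=30, noop_action=0, rng=random):
    """Reset, then take a random number of no-op steps."""
    obs = env.reset()
    for _ in range(rng.randint(1, max_noops)):
        obs, _, done, _ = env.step(noop_action)
        if done:  # if the episode ends during the no-ops, start over
            obs = env.reset()
    return obs
```

Because the number of no-ops varies between resets, two episodes of a deterministic environment no longer start from exactly the same state.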
genrl.environments.base_wrapper module¶

class
genrl.environments.base_wrapper.
BaseWrapper
(env: Any, batch_size: int = None)[source]¶ Bases:
abc.ABC
Base class for all wrappers

batch_size
¶ The number of batches trained per update

close
() → None[source]¶ Closes environment and performs any other cleanup
Must be overridden by subclasses

genrl.environments.frame_stack module¶

class
genrl.environments.frame_stack.
FrameStack
(env: gym.core.Env, framestack: int = 4, compress: bool = True)[source]¶ Bases:
gym.core.Wrapper
Wrapper to efficiently stack the last few (4 by default) observations of the agent
Parameters:  env (Gym Environment) – Environment to be wrapped
 framestack (int) – Number of frames to be stacked
 compress (bool) – True if we want to use LZ4 compression to conserve memory usage
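Frame stacking amounts to keeping a fixed-length window over the most recent observations. The sketch below uses a deque for that window. The class name is hypothetical, filling the window with copies of the first frame on reset is an assumption, and LZ4 compression is left out.

```python
from collections import deque


class FrameStacker:
    """Keeps the most recent `framestack` frames."""

    def __init__(self, framestack=4):
        self.frames = deque(maxlen=framestack)

    def reset(self, first_frame):
        # Assumption: the window starts out filled with the first frame.
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return list(self.frames)

    def step(self, frame):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(frame)
        return list(self.frames)
```

Strings stand in for image arrays here. In practice each frame would be an array and the stacked result would be concatenated along a channel axis.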

class
genrl.environments.frame_stack.
LazyFrames
(frames: List[T], compress: bool = False)[source]¶ Bases:
object
Efficient data structure to save each frame only once. Can use LZ4 compression to optimize memory usage.
Parameters:  frames (collections.deque) – List of frames that need to be converted to a LazyFrames data structure
 compress (boolean) – True if we want to use LZ4 compression to conserve memory usage

shape
¶ Returns dimensions of other object
genrl.environments.gym_wrapper module¶

class
genrl.environments.gym_wrapper.
GymWrapper
(env: gym.core.Env)[source]¶ Bases:
gym.core.Wrapper
Wrapper class for all Gym Environments
Parameters:  env (string) – Gym environment name
 n_envs (None, int) – Number of environments. None if not vectorised
 parallel (boolean) – If vectorised, whether environments should be run serially or in parallel

action_shape
¶

obs_shape
¶

render
(mode: str = 'human') → None[source]¶ Renders all envs in a tiled format similar to baselines.
Parameters: mode (string) – Can either be ‘human’ or ‘rgb_array’. Displays tiled images in ‘human’ and returns tiled images in ‘rgb_array’
genrl.environments.suite module¶

genrl.environments.suite.
AtariEnv
(env_id: str, wrapper_list: List[T] = [<class 'genrl.environments.atari_preprocessing.AtariPreprocessing'>, <class 'genrl.environments.atari_wrappers.NoopReset'>, <class 'genrl.environments.atari_wrappers.FireReset'>, <class 'genrl.environments.time_limit.AtariTimeLimit'>, <class 'genrl.environments.frame_stack.FrameStack'>]) → gym.core.Env[source]¶ Function to apply wrappers for all Atari envs by Trainer class
Parameters:  env (string) – Environment Name
 wrapper_list (list or tuple) – List of wrappers to use
Returns: Gym Atari Environment
Return type: object

genrl.environments.suite.
GymEnv
(env_id: str) → gym.core.Env[source]¶ Function to apply wrappers for all regular Gym envs by Trainer class
Parameters: env (string) – Environment Name
Returns: Gym Environment
Return type: object

genrl.environments.suite.
VectorEnv
(env_id: str, n_envs: int = 2, parallel: int = False, env_type: str = 'gym') → genrl.environments.vec_env.vector_envs.VecEnv[source]¶ Chooses the kind of Vector Environment that is required
Parameters:  env_id (string) – Gym environment to be vectorised
 n_envs (int) – Number of environments
 parallel (bool) – True if we want environments to run in parallel subprocesses, False if we want environments to run serially one after the other
 env_type (string) – Type of environment. Currently, we support [“gym”, “atari”]
Returns: Vector Environment
Return type: object
genrl.environments.time_limit module¶

class
genrl.environments.time_limit.
AtariTimeLimit
(env, max_episode_len=None)[source]¶ Bases:
gym.core.Wrapper

reset
(**kwargs)[source]¶ Resets the state of the environment and returns an initial observation.
Returns: the initial observation. Return type: observation (object)

step
(action)[source]¶ Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.
Accepts an action and returns a tuple (observation, reward, done, info).
Parameters: action (object) – an action provided by the agent
Returns:  observation (object) – agent’s observation of the current environment
 reward (float) – amount of reward returned after previous action
 done (bool) – whether the episode has ended, in which case further step() calls will return undefined results
 info (dict) – contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)


class
genrl.environments.time_limit.
TimeLimit
(env, max_episode_len=None)[source]¶ Bases:
gym.core.Wrapper

reset
(**kwargs)[source]¶ Resets the state of the environment and returns an initial observation.
Returns: the initial observation. Return type: observation (object)

step
(action)[source]¶ Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.
Accepts an action and returns a tuple (observation, reward, done, info).
Parameters: action (object) – an action provided by the agent
Returns:  observation (object) – agent’s observation of the current environment
 reward (float) – amount of reward returned after previous action
 done (bool) – whether the episode has ended, in which case further step() calls will return undefined results
 info (dict) – contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)
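The reset/step contract above, combined with a max_episode_len argument, suggests a wrapper that ends an episode after a fixed number of steps. The following is a sketch of that idea only. EndlessEnv, TimeLimitSketch and the time_limit_reached info key are hypothetical names, and the wrapper's exact behaviour in GenRL is not described in this section.

```python
class EndlessEnv:
    """Hypothetical environment that never terminates on its own."""

    def reset(self):
        return 0

    def step(self, action):
        return 0, 1.0, False, {}


class TimeLimitSketch:
    """Forces done=True once max_episode_len steps have elapsed."""

    def __init__(self, env, max_episode_len=None):
        self.env = env
        self.max_episode_len = max_episode_len
        self.elapsed = 0

    def reset(self):
        self.elapsed = 0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.elapsed += 1
        if self.max_episode_len is not None and self.elapsed >= self.max_episode_len:
            done = True
            info = dict(info, time_limit_reached=True)  # hypothetical key
        return obs, reward, done, info


env = TimeLimitSketch(EndlessEnv(), max_episode_len=3)
env.reset()
steps, done = 0, False
while not done:
    _, _, done, _ = env.step(action=0)
    steps += 1
```

The loop follows the documented contract: once done is returned, the caller is responsible for calling reset() before stepping again.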

Module contents¶
Core¶
ActorCritic¶

class
genrl.core.actor_critic.
CNNActorCritic
(framestack: int, action_dim: gym.spaces.space.Space, policy_layers: Tuple = (256, ), value_layers: Tuple = (256, ), val_type: str = 'V', discrete: bool = True, *args, **kwargs)[source]¶ Bases:
genrl.core.base.BaseActorCritic
CNN Actor Critic
Parameters:  framestack (int) – Number of previous frames to stack together
 action_dim (int) – Action dimensions of the environment
 fc_layers (tuple or list) – Sizes of hidden layers
 val_type (str) – Specifies type of value function: “V” for V(s), “Qs” for Q(s), “Qsa” for Q(s,a)
 discrete (bool) – True if action space is discrete, else False

get_action
(state: torch.Tensor, deterministic: bool = False) → torch.Tensor[source]¶ Get action from the Actor based on input
Parameters:  state (Tensor) – The state being passed as input to the Actor
 deterministic (boolean) – True if the action space is deterministic, else False
Returns: action

class
genrl.core.actor_critic.
MlpActorCritic
(state_dim: gym.spaces.space.Space, action_dim: gym.spaces.space.Space, shared_layers: None, policy_layers: Tuple = (32, 32), value_layers: Tuple = (32, 32), val_type: str = 'V', discrete: bool = True, **kwargs)[source]¶ Bases:
genrl.core.base.BaseActorCritic
MLP Actor Critic

state_dim
¶ State dimensions of the environment
Type: int

action_dim
¶ Action space dimensions of the environment
Type: int

policy_layers
¶ Hidden layers in the policy MLP
Type: list or tuple

value_layers
¶ Hidden layers in the value MLP
Type: list or tuple

val_type
¶ Value type of the critic network
Type: str

discrete
¶ True if the action space is discrete, else False
Type: bool

sac
¶ True if a SAC-like network is needed, else False
Type: bool

activation
¶ Activation function to be used. Can be either “tanh” or “relu”
Type: str

Bases:
genrl.core.base.BaseActorCritic
MLP Shared Actor Critic
State dimensions of the environment
Type: int
Action space dimensions of the environment
Type: int
Hidden layers in the shared MLP
Type: list or tuple
Hidden layers in the policy MLP
Type: list or tuple
Hidden layers in the value MLP
Type: list or tuple
Value type of the critic network
Type: str
True if the action space is discrete, else False
Type: bool
True if a SAC-like network is needed, else False
Type: bool
Activation function to be used. Can be either “tanh” or “relu”
Type: str
Get Actions from the actor
Parameters:  state (torch.Tensor) – The state(s) being passed to the critics
 deterministic (bool) – True if the action space is deterministic, else False
Returns:  action (list) – List of actions as estimated by the critic
 distribution – The distribution from which the action was sampled (None if deterministic)
Extract features from the state, which is then an input to get_action and get_value
Parameters: state (torch.Tensor) – The state(s) being passed
Returns: The feature(s) extracted from the state
Return type: features (torch.Tensor)
Get Values from the Critic
Parameters: state (torch.Tensor) – The state(s) being passed to the critics
Returns: List of values as estimated by the critic
Return type: values (list)
Bases:
genrl.core.actor_critic.MlpSingleActorTwoCritic
MLP Actor Critic
State dimensions of the environment
Type: int
Action space dimensions of the environment
Type: int
Hidden layers in the shared MLP
Type: list or tuple
Hidden layers in the policy MLP
Type: list or tuple
Hidden layers in the value MLP
Type: list or tuple
Value type of the critic network
Type: str
True if the action space is discrete, else False
Type: bool
Number of critics in the architecture
Type: int
True if a SAC-like network is needed, else False
Type: bool
Activation function to be used. Can be either “tanh” or “relu”
Type: str
Get Actions from the actor
Parameters:  state (torch.Tensor) – The state(s) being passed to the critics
 deterministic (bool) – True if the action space is deterministic, else False
Returns:  action (list) – List of actions as estimated by the critic
 distribution – The distribution from which the action was sampled (None if deterministic)
Extract features from the state, which is then an input to get_action and get_value
Parameters: state (torch.Tensor) – The state(s) being passed
Returns: The feature(s) extracted from the state
Return type: features (torch.Tensor)
Get Values from both the Critics
Parameters:  state (torch.Tensor) – The state(s) being passed to the critics
 mode (str) – What values should be returned. Types: “both” –> Both values will be returned; “min” –> The minimum of both values will be returned; “first” –> The value from the first critic only will be returned
Returns: List of values as estimated by each individual critic
Return type: values (list)

class
genrl.core.actor_critic.
MlpSingleActorTwoCritic
(state_dim: gym.spaces.space.Space, action_dim: gym.spaces.space.Space, policy_layers: Tuple = (32, 32), value_layers: Tuple = (32, 32), val_type: str = 'V', discrete: bool = True, num_critics: int = 2, **kwargs)[source]¶ Bases:
genrl.core.base.BaseActorCritic
MLP Actor Critic

state_dim
¶ State dimensions of the environment
Type: int

action_dim
¶ Action space dimensions of the environment
Type: int

policy_layers
¶ Hidden layers in the policy MLP
Type: list or tuple

value_layers
¶ Hidden layers in the value MLP
Type: list or tuple

val_type
¶ Value type of the critic network
Type: str

discrete
¶ True if the action space is discrete, else False
Type: bool

num_critics
¶ Number of critics in the architecture
Type: int

sac
¶ True if a SAC-like network is needed, else False
Type: bool

activation
¶ Activation function to be used. Can be either “tanh” or “relu”
Type: str

forward
(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

get_action
(state: torch.Tensor, deterministic: bool = False)[source]¶ Get Actions from the actor
Parameters:  state (torch.Tensor) – The state(s) being passed to the critics
 deterministic (bool) – True if the action space is deterministic, else False
Returns:  action (list) – List of actions as estimated by the critic
 distribution – The distribution from which the action was sampled (None if deterministic)

get_value
(state: torch.Tensor, mode='first') → torch.Tensor[source]¶ Get Values from the Critic
Parameters:  state (torch.Tensor) – The state(s) being passed to the critics
 mode (str) – What values should be returned. Types: “both” –> Both values will be returned; “min” –> The minimum of both values will be returned; “first” –> The value from the first critic only will be returned
Returns: List of values as estimated by each individual critic
Return type: values (list)
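The three mode values can be read as three ways of combining the outputs of two critics. The sketch below illustrates that reading on plain lists of floats. The function name is hypothetical, and the real method operates on torch.Tensor values inside the network class.

```python
def combine_critic_values(first, second, mode="first"):
    """Combine per-state value estimates from two critics."""
    if mode == "both":
        return [first, second]
    if mode == "min":
        # Element-wise minimum, as used by clipped double-Q style methods.
        return [min(a, b) for a, b in zip(first, second)]
    if mode == "first":
        return first
    raise ValueError("mode must be 'both', 'min' or 'first'")
```

Taking the minimum of two estimates is a common way to counter over-estimation of values, which is why a "min" mode is useful alongside the raw outputs.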

Base¶

class
genrl.core.base.
BaseActorCritic
[source]¶ Bases:
torch.nn.modules.module.Module
Basic implementation of a general Actor Critic

get_action
(state: torch.Tensor, deterministic: bool = False) → torch.Tensor[source]¶ Get action from the Actor based on input
Parameters:  state (Tensor) – The state being passed as input to the Actor
 deterministic (boolean) – True if the action space is deterministic, else False
Returns: action


class
genrl.core.base.
BasePolicy
(state_dim: int, action_dim: int, hidden: Tuple, discrete: bool, **kwargs)[source]¶ Bases:
torch.nn.modules.module.Module
Basic implementation of a general Policy
Parameters:  state_dim (int) – State dimensions of the environment
 action_dim (int) – Action dimensions of the environment
 hidden (tuple or list) – Sizes of hidden layers
 discrete (bool) – True if action space is discrete, else False

forward
(state: torch.Tensor) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]¶ Defines the computation performed at every call.
Parameters: state (Tensor) – The state being passed as input to the policy

get_action
(state: torch.Tensor, deterministic: bool = False) → torch.Tensor[source]¶ Get action from policy based on input
Parameters:  state (Tensor) – The state being passed as input to the policy
 deterministic (boolean) – True if the action space is deterministic, else False
Returns: action

class
genrl.core.base.
BaseValue
(state_dim: int, action_dim: int)[source]¶ Bases:
torch.nn.modules.module.Module
Basic implementation of a general Value function
Buffers¶

class
genrl.core.buffers.
PrioritizedBuffer
(capacity: int, alpha: float = 0.6, beta: float = 0.4)[source]¶ Bases:
object
Implements the Prioritized Experience Replay Mechanism
Parameters:  capacity (int) – Size of the replay buffer
 alpha (float) – Level of prioritization

pos
¶

push
(inp: Tuple) → None[source]¶ Adds new experience to buffer
Parameters: inp (tuple) – Tuple containing state, action, reward, next_state and done
Returns: None

sample
(batch_size: int, beta: float = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]¶ Returns randomly sampled memories from replay memory along with their respective indices and weights
Parameters:  batch_size (int) – Number of samples per batch
 beta (float) – Bias exponent used to correct Importance Sampling (IS) weights
Returns: Tuple containing states, actions, next_states, rewards, dones, indices and weights

update_priorities
(batch_indices: Tuple, batch_priorities: Tuple) → None[source]¶ Updates list of priorities with new order of priorities
Parameters:  batch_indices (list or tuple) – List of indices of batch
 batch_priorities (list or tuple) – List of priorities of the batch at the specific indices
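To make the roles of alpha and beta concrete, the sketch below samples indices in proportion to priority ** alpha and computes importance-sampling weights (N * P(i)) ** (-beta). It is a standalone illustration of the general technique. The function name is hypothetical, and normalising by the largest weight in the batch is an assumption, not a description of GenRL's code.

```python
import random


def sample_prioritized(priorities, batch_size, alpha=0.6, beta=0.4, rng=random):
    """Return sampled indices and their normalised importance weights."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    probs = [s / total for s in scaled]

    # Sample with replacement, proportionally to the scaled priorities.
    indices = rng.choices(range(len(priorities)), weights=probs, k=batch_size)

    n = len(priorities)
    weights = [(n * probs[i]) ** (-beta) for i in indices]
    max_weight = max(weights)
    return indices, [w / max_weight for w in weights]
```

With alpha = 0 every transition is equally likely. With beta = 1 the weights fully compensate for the non-uniform sampling.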

class
genrl.core.buffers.
PrioritizedReplayBufferSamples
(states, actions, rewards, next_states, dones, indices, weights)[source]¶ Bases:
tuple

actions
¶ Alias for field number 1

dones
¶ Alias for field number 4

indices
¶ Alias for field number 5

next_states
¶ Alias for field number 3

rewards
¶ Alias for field number 2

states
¶ Alias for field number 0

weights
¶ Alias for field number 6


class
genrl.core.buffers.
ReplayBuffer
(capacity: int)[source]¶ Bases:
object
Implements the basic Experience Replay Mechanism
Parameters: capacity (int) – Size of the replay buffer 
push
(inp: Tuple) → None[source]¶ Adds new experience to buffer
Parameters: inp (tuple) – Tuple containing state, action, reward, next_state and done Returns: None

sample
(batch_size: int) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]¶ Returns randomly sampled experiences from replay memory
Parameters: batch_size (int) – Number of samples per batch
Returns: Tuple composed of state, action, reward, next_state and done
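The push/sample interface can be sketched with a bounded deque. This is a simplified stand-in with a hypothetical class name: it returns plain tuples, whereas the documented method returns torch.Tensor objects.

```python
import random
from collections import deque


class SimpleReplayBuffer:
    """Fixed-capacity buffer that discards the oldest experience first."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def push(self, inp):
        # inp is (state, action, reward, next_state, done)
        self.memory.append(inp)

    def sample(self, batch_size, rng=random):
        batch = rng.sample(list(self.memory), batch_size)
        # Transpose a list of transitions into one tuple per field.
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.memory)


buffer = SimpleReplayBuffer(capacity=3)
for step in range(5):
    buffer.push((step, 0, 1.0, step + 1, False))
```

After five pushes into a buffer of capacity three, only the transitions for steps 2, 3 and 4 remain.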

Noise¶

class
genrl.core.noise.
ActionNoise
(mean: float, std: float)[source]¶ Bases:
abc.ABC
Base class for Action Noise
Parameters:  mean (float) – Mean of noise distribution
 std (float) – Standard deviation of noise distribution

mean
¶ Returns mean of noise distribution

std
¶ Returns standard deviation of noise distribution

class
genrl.core.noise.
NoisyLinear
(in_features: int, out_features: int, std_init: float = 0.4)[source]¶ Bases:
torch.nn.modules.module.Module
Noisy Linear Layer Class
Class to represent a Noisy Linear class (noisy version of nn.Linear)

in_features
¶ Input dimensions
Type: int

out_features
¶ Output dimensions
Type: int

std_init
¶ Weight initialisation constant
Type: float

forward
(state: torch.Tensor) → torch.Tensor[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class
genrl.core.noise.
NormalActionNoise
(mean: float, std: float)[source]¶ Bases:
genrl.core.noise.ActionNoise
Normal implementation of Action Noise
Parameters:  mean (float) – Mean of noise distribution
 std (float) – Standard deviation of noise distribution

class
genrl.core.noise.
OrnsteinUhlenbeckActionNoise
(mean: float, std: float, theta: float = 0.15, dt: float = 0.01, initial_noise: torch.Tensor = None)[source]¶ Bases:
genrl.core.noise.ActionNoise
Ornstein Uhlenbeck implementation of Action Noise
Parameters:  mean (float) – Mean of noise distribution
 std (float) – Standard deviation of noise distribution
 theta (float) – Parameter used to solve the Ornstein Uhlenbeck process
 dt (float) – Small parameter used to solve the Ornstein Uhlenbeck process
 initial_noise (torch.Tensor) – Initial noise distribution
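The theta and dt parameters correspond to the usual Euler discretisation of the Ornstein-Uhlenbeck process. The sketch below shows that update for a scalar. The class name and the rng argument are hypothetical, and the real class works with torch.Tensor noise.

```python
import math
import random


class OUNoiseSketch:
    """Scalar Ornstein-Uhlenbeck noise using an Euler-Maruyama step."""

    def __init__(self, mean, std, theta=0.15, dt=0.01, initial_noise=None, rng=None):
        self.mean, self.std, self.theta, self.dt = mean, std, theta, dt
        self.noise = 0.0 if initial_noise is None else initial_noise
        self.rng = rng or random.Random()

    def __call__(self):
        # Drift pulls the noise back towards the mean at rate theta.
        drift = self.theta * (self.mean - self.noise) * self.dt
        # Diffusion adds Gaussian noise scaled by std * sqrt(dt).
        diffusion = self.std * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
        self.noise += drift + diffusion
        return self.noise
```

Unlike independent Gaussian noise, successive samples are correlated, which gives smoother exploration in continuous control tasks.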
Policies¶

class
genrl.core.policies.
CNNPolicy
(framestack: int, action_dim: int, hidden: Tuple = (32, 32), discrete: bool = True, *args, **kwargs)[source]¶ Bases:
genrl.core.base.BasePolicy
CNN Policy
Parameters:  framestack (int) – Number of previous frames to stack together
 action_dim (int) – Action dimensions of the environment
 fc_layers (tuple or list) – Sizes of hidden layers
 discrete (bool) – True if action space is discrete, else False
 channels (list or tuple) – Channel sizes for cnn layers

class
genrl.core.policies.
MlpPolicy
(state_dim: int, action_dim: int, hidden: Tuple = (32, 32), discrete: bool = True, *args, **kwargs)[source]¶ Bases:
genrl.core.base.BasePolicy
MLP Policy
Parameters:  state_dim (int) – State dimensions of the environment
 action_dim (int) – Action dimensions of the environment
 hidden (tuple or list) – Sizes of hidden layers
 discrete (bool) – True if action space is discrete, else False
RolloutStorage¶

class
genrl.core.rollout_storage.
BaseBuffer
(buffer_size: int, env: Union[gym.core.Env, genrl.environments.vec_env.vector_envs.VecEnv], device: Union[torch.device, str] = 'cpu')[source]¶ Bases:
object
Base class that represents a buffer (rollout or replay)
Parameters:  buffer_size – (int) Max number of elements in the buffer
 env – (Environment) The environment being trained on
 device – (Union[torch.device, str]) PyTorch device to which the values will be converted
 n_envs – (int) Number of parallel environments
sample
(batch_size: int)[source]¶ Parameters: batch_size – (int) Number of elements to sample
Returns: (Union[RolloutBufferSamples, ReplayBufferSamples])

static
swap_and_flatten
(arr: numpy.ndarray) → numpy.ndarray[source]¶ Swap and then flatten axes 0 (buffer_size) and 1 (n_envs) to convert shape from [n_steps, n_envs, …] (where … is the shape of the features) to [n_steps * n_envs, …] (which maintains the order)
Parameters: arr – (np.ndarray)
Returns: (np.ndarray)
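The effect of swapping axes 0 and 1 before flattening is that all steps of one environment stay contiguous. The sketch below shows this on nested Python lists rather than NumPy arrays. The function name is hypothetical, and the ordering shown assumes the operation is equivalent to a swapaxes followed by a reshape.

```python
def swap_and_flatten_lists(arr):
    """Turn [n_steps][n_envs] nested lists into a flat list grouped by env."""
    n_steps = len(arr)
    n_envs = len(arr[0])
    return [arr[step][env] for env in range(n_envs) for step in range(n_steps)]


rollout = [
    ["s0e0", "s0e1"],
    ["s1e0", "s1e1"],
    ["s2e0", "s2e1"],
]
flat = swap_and_flatten_lists(rollout)
```

Here the three steps of environment 0 come first, followed by the three steps of environment 1.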

to_torch
(array: numpy.ndarray, copy: bool = True) → torch.Tensor[source]¶ Convert a numpy array to a PyTorch tensor. Note: it copies the data by default
Parameters:  array – (np.ndarray)
 copy – (bool) Whether to copy or not the data (may be useful to avoid changing things by reference)
Returns: (torch.Tensor)


class
genrl.core.rollout_storage.
ReplayBufferSamples
(observations, actions, next_observations, dones, rewards)[source]¶ Bases:
tuple

actions
¶ Alias for field number 1

dones
¶ Alias for field number 3

next_observations
¶ Alias for field number 2

observations
¶ Alias for field number 0

rewards
¶ Alias for field number 4


class
genrl.core.rollout_storage.
RolloutBuffer
(buffer_size: int, env: Union[gym.core.Env, genrl.environments.vec_env.vector_envs.VecEnv], device: Union[torch.device, str] = 'cpu', gae_lambda: float = 1, gamma: float = 0.99)[source]¶ Bases:
genrl.core.rollout_storage.BaseBuffer
Rollout buffer used in on-policy algorithms like A2C/PPO.
Parameters:  buffer_size – (int) Max number of elements in the buffer
 env – (Environment) The environment being trained on
 device – (torch.device)
 gae_lambda – (float) Factor for trade-off of bias vs variance for Generalized Advantage Estimator. Equivalent to classic advantage when set to 1.
 gamma – (float) Discount factor
 n_envs – (int) Number of parallel environments
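The interaction between gamma and gae_lambda is easiest to see in the backward recursion of Generalized Advantage Estimation. The sketch below works on plain lists for a single environment. The function name and the convention that dones[t] marks the end of the episode at step t are assumptions, and this is not GenRL's own routine.

```python
def compute_gae(rewards, values, dones, last_value, gamma=0.99, gae_lambda=1.0):
    """Return (advantages, returns) computed backwards over one rollout."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        # No bootstrapping across an episode boundary.
        non_terminal = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * next_value * non_terminal - values[t]
        gae = delta + gamma * gae_lambda * non_terminal * gae
        advantages[t] = gae
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

With gae_lambda = 1 the advantage reduces to the discounted return minus the value estimate. Smaller values reduce variance at the cost of more bias.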

add
(obs: None._VariableFunctions.zeros, action: None._VariableFunctions.zeros, reward: None._VariableFunctions.zeros, done: None._VariableFunctions.zeros, value: torch.Tensor, log_prob: torch.Tensor) → None[source]¶ Parameters:  obs – (torch.zeros) Observation
 action – (torch.zeros) Action
 reward – (torch.zeros)
 done – (torch.zeros) End of episode signal.
 value – (torch.Tensor) estimated value of the current state following the current policy.
 log_prob – (torch.Tensor) log probability of the action following the current policy.

class
genrl.core.rollout_storage.
RolloutBufferSamples
(observations, actions, old_values, old_log_prob, advantages, returns)[source]¶ Bases:
tuple

actions
¶ Alias for field number 1

advantages
¶ Alias for field number 4

observations
¶ Alias for field number 0

old_log_prob
¶ Alias for field number 3

old_values
¶ Alias for field number 2

returns
¶ Alias for field number 5


class
genrl.core.rollout_storage.
RolloutReturn
(episode_reward, episode_timesteps, n_episodes, continue_training)[source]¶ Bases:
tuple

continue_training
¶ Alias for field number 3

episode_reward
¶ Alias for field number 0

episode_timesteps
¶ Alias for field number 1

n_episodes
¶ Alias for field number 2

Values¶

class
genrl.core.values.
CnnCategoricalValue
(*args, **kwargs)[source]¶ Bases:
genrl.core.values.CnnNoisyValue
Class for Categorical DQN’s CNN Q-Value function

framestack
¶ No. of frames being passed into the Q-value function
Type: int

action_dim
¶ Action space dimensions
Type: int

fc_layers
¶ Fully connected layer dimensions
Type: tuple

noisy_layers
¶ Noisy layer dimensions
Type: tuple

num_atoms
¶ Number of atoms used to discretise the Categorical DQN value distribution
Type: int


class
genrl.core.values.
CnnDuelingValue
(*args, **kwargs)[source]¶ Bases:
genrl.core.values.CnnValue
Class for Dueling DQN’s CNN Q-Value function

framestack
¶ No. of frames being passed into the Q-value function
Type: int

action_dim
¶ Action space dimensions
Type: int

fc_layers
¶ Hidden layer dimensions
Type: tuple


class
genrl.core.values.
CnnNoisyValue
(*args, **kwargs)[source]¶ Bases:
genrl.core.values.CnnValue
,genrl.core.values.MlpNoisyValue
Class for Noisy DQN’s CNN Q-Value function

state_dim
¶ Number of previous frames to stack together
Type: int

action_dim
¶ Action space dimensions
Type: int

fc_layers
¶ Fully connected layer dimensions
Type: tuple

noisy_layers
¶ Noisy layer dimensions
Type: tuple

num_atoms
¶ Number of atoms used to discretise the Categorical DQN value distribution
Type: int


class
genrl.core.values.
CnnValue
(*args, **kwargs)[source]¶ Bases:
genrl.core.values.MlpValue
CNN Value Function class
Parameters:  framestack (int) – Number of previous frames to stack together
 action_dim (int) – Action dimension of environment
 val_type (string) – Specifies type of value function: “V” for V(s), “Qs” for Q(s), “Qsa” for Q(s,a)
 fc_layers (tuple or list) – Sizes of hidden layers

class
genrl.core.values.
MlpCategoricalValue
(*args, **kwargs)[source]¶ Bases:
genrl.core.values.MlpNoisyValue
Class for Categorical DQN’s MLP Q-Value function

state_dim
¶ Observation space dimensions
Type: int

action_dim
¶ Action space dimensions
Type: int

fc_layers
¶ Fully connected layer dimensions
Type: tuple

noisy_layers
¶ Noisy layer dimensions
Type: tuple

num_atoms
¶ Number of atoms used to discretise the Categorical DQN value distribution
Type: int


class
genrl.core.values.
MlpDuelingValue
(*args, **kwargs)[source]¶ Bases:
genrl.core.values.MlpValue
Class for Dueling DQN’s MLP Q-Value function

state_dim
¶ Observation space dimensions
Type: int

action_dim
¶ Action space dimensions
Type: int
Hidden layer dimensions
Type: tuple
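For context, a dueling Q-value function typically combines a scalar state value with per-action advantages. The sketch below shows the standard mean-subtracted aggregation on plain floats. It illustrates the general Dueling DQN technique under that assumption, with a hypothetical function name, and is not code from this class.

```python
def dueling_q_values(state_value, advantages):
    """Q(s, a) = V(s) + A(s, a) - mean over actions of A(s, a)."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + a - mean_advantage for a in advantages]
```

Subtracting the mean advantage keeps the decomposition identifiable, since otherwise a constant could be shifted freely between the value and advantage streams.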


class
genrl.core.values.
MlpNoisyValue
(*args, noisy_layers: Tuple = (128, 512), **kwargs)[source]¶ Bases:
genrl.core.values.MlpValue

class
genrl.core.values.
MlpValue
(state_dim: int, action_dim: int = None, val_type: str = 'V', fc_layers: Tuple = (32, 32), **kwargs)[source]¶ Bases:
genrl.core.base.BaseValue
MLP Value Function class
Parameters:  state_dim (int) – State dimensions of environment
 action_dim (int) – Action dimensions of environment
 val_type (string) – Specifies type of value function: “V” for V(s), “Qs” for Q(s), “Qsa” for Q(s,a)
 hidden (tuple or list) – Sizes of hidden layers
Utilities¶
Logger¶

class
genrl.utils.logger.
CSVLogger
(logdir: str)[source]¶ Bases:
object
CSV Logging class
Parameters: logdir (string) – Directory to save log at

class
genrl.utils.logger.
HumanOutputFormat
(logdir: str)[source]¶ Bases:
object
Output from a log file in a human readable format
Parameters: logdir (string) – Directory at which log is present 
max_key_len
(kvs: Dict[str, Any]) → None[source]¶ Finds max key length
Parameters: kvs (dict) – Entries to be logged

round
(num: float) → float[source]¶ Returns a rounded float value depending on self.maxlen
Parameters: num (float) – Value to round


class
genrl.utils.logger.
Logger
(logdir: str = None, formats: List[str] = ['csv'])[source]¶ Bases:
object
Logger class to log important information
Parameters:  logdir (string) – Directory to save log at
 formats (list) – Formatting of each log [‘csv’, ‘stdout’, ‘tensorboard’]

formats
¶ Return save format(s)

logdir
¶ Return log directory
Utilities¶

genrl.utils.utils.
cnn
(channels: Tuple = (4, 16, 32), kernel_sizes: Tuple = (8, 4), strides: Tuple = (4, 2), **kwargs) → Tuple[source]¶ Generates a CNN model given input dimensions, channels, kernel_sizes and strides
Parameters:  channels (tuple) – Input and output channels before and after each convolution
 kernel_sizes (tuple) – Kernel sizes for each convolution
 strides (tuple) – Strides for each convolution
 in_size (int) – Input dimensions (assuming square input)
Returns: Convolutional Neural Network with convolutional layers and activation layers
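The interaction between `in_size`, `kernel_sizes` and `strides` can be checked by hand with the standard output-size formula for an unpadded convolution. The sketch below is not GenRL code: the helper `conv_output_size` is hypothetical, the 84-pixel input is an assumed example value, and zero padding and no dilation are assumed.

```python
from typing import Sequence


def conv_output_size(in_size: int, kernel_sizes: Sequence[int], strides: Sequence[int]) -> int:
    """Side length of a square feature map after a stack of unpadded convolutions."""
    size = in_size
    for kernel, stride in zip(kernel_sizes, strides):
        # Standard formula for a convolution with no padding and no dilation
        size = (size - kernel) // stride + 1
    return size


side = conv_output_size(84, kernel_sizes=(8, 4), strides=(4, 2))
flat_features = 32 * side * side  # 32 is the last entry of the default channels tuple
print(side, flat_features)  # 9 2592
```

With the default arguments, an 84x84 input would shrink to 20x20 after the first convolution and to 9x9 after the second, giving 2592 features once flattened.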

genrl.utils.utils.
get_env_properties
(env: Union[gym.core.Env, genrl.environments.vec_env.vector_envs.VecEnv], network: Union[str, Any] = 'mlp') → Tuple[int][source]¶ Finds important properties of environment
Parameters:  env (Gym Environment) – Environment that the agent is interacting with
 network (str) – Type of network architecture, e.g. “mlp”, “cnn”
Returns: State space dimensions, action space dimensions, discreteness of action space and action limit (highest action value)
Return type: int, float, …; int, float, …; bool; int, float, …

genrl.utils.utils.
get_model
(type_: str, name_: str) → Union[source]¶  Eg. “mlp” or “cnn”)
Parameters: type_ (string)
Returns: Required class, e.g. MlpActorCritic

genrl.utils.utils.
mlp
(sizes: Tuple, activation: str = 'relu', sac: bool = False)[source]¶ Generates an MLP model given sizes of each layer
Parameters:  sizes (tuple or list) – Sizes of hidden layers
 sac (bool) – True if Soft Actor Critic is being used, else False
Returns: Neural Network with fully connected linear layers and activation layers
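One way to picture what `sizes` describes is to pair consecutive entries into linear layers. The sketch below is illustrative only: the helpers `mlp_layer_shapes` and `mlp_param_count` are hypothetical, and it assumes that `sizes` lists every layer width from input to output.

```python
from typing import List, Sequence, Tuple


def mlp_layer_shapes(sizes: Sequence[int]) -> List[Tuple[int, int]]:
    """Pair consecutive sizes into (in_features, out_features) for each linear layer."""
    return [(sizes[i], sizes[i + 1]) for i in range(len(sizes) - 1)]


def mlp_param_count(sizes: Sequence[int]) -> int:
    """Count weights and biases over all linear layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in mlp_layer_shapes(sizes))


print(mlp_layer_shapes((4, 32, 32, 2)))  # [(4, 32), (32, 32), (32, 2)]
print(mlp_param_count((4, 32, 32, 2)))   # 1282
```

An activation would sit between each pair of linear layers; whether one follows the final layer depends on the implementation and is not shown here.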

genrl.utils.utils.
noisy_mlp
(fc_layers: List[int], noisy_layers: List[int], activation='relu')[source]¶ Noisy MLP generating helper function
Parameters:  fc_layers (list of int) – List of fully connected layers
 noisy_layers (list of int) – List of noisy layers
 activation (str) – Activation function to be used. [“tanh”, “relu”]
Returns: Noisy MLP model
Models¶

class
genrl.utils.models.
TabularModel
(s_dim: int, a_dim: int)[source]¶ Bases:
object
Sample-based tabular model class for deterministic, discrete environments
Parameters:  s_dim (int) – environment state dimension
 a_dim (int) – environment action dimension

add
(state: numpy.ndarray, action: numpy.ndarray, reward: float, next_state: numpy.ndarray) → None[source]¶ Add transition to model
Parameters:  state (float array) – state
 action (int) – action
 reward (int) – reward
 next_state (float array) – next state

is_empty
() → bool[source]¶ Check if the model has been updated or not
Returns: True if model not updated yet Return type: bool
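The idea of a sample-based model for a deterministic environment can be sketched with a plain dictionary. The class `DictTabularModel` below is a hypothetical stand-in, not GenRL's `TabularModel`: it uses integer states instead of arrays, and its `step` method is an assumption added to show how stored transitions might be replayed.

```python
from typing import Dict, Tuple


class DictTabularModel:
    """Hypothetical stand-in for a sample-based tabular model of a deterministic environment."""

    def __init__(self) -> None:
        # Maps (state, action) to the single observed (reward, next_state)
        self.transitions: Dict[Tuple[int, int], Tuple[float, int]] = {}

    def add(self, state: int, action: int, reward: float, next_state: int) -> None:
        # Deterministic dynamics: a later sample simply overwrites the earlier one
        self.transitions[(state, action)] = (reward, next_state)

    def step(self, state: int, action: int) -> Tuple[float, int]:
        # Replay a stored transition, e.g. during planning
        return self.transitions[(state, action)]

    def is_empty(self) -> bool:
        return not self.transitions


model = DictTabularModel()
print(model.is_empty())  # True
model.add(state=0, action=1, reward=1.0, next_state=2)
print(model.step(0, 1))  # (1.0, 2)
```

Because the environment is assumed deterministic, one stored sample per state-action pair is enough to reproduce the observed dynamics.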
Trainers¶
On-Policy Trainer¶
On Policy Trainer Class
Trainer class for all the On Policy Agents: A2C, PPO1 and VPG

genrl.trainers.OnPolicyTrainer.
agent
¶ Agent algorithm object
Type: object

genrl.trainers.OnPolicyTrainer.
env
¶ Environment
Type: object

genrl.trainers.OnPolicyTrainer.
log_mode
¶ List of different kinds of logging. Supported: [“csv”, “stdout”, “tensorboard”]
Type: list of str

genrl.trainers.OnPolicyTrainer.
log_key
¶ Key plotted on x_axis. Supported: [“timestep”, “episode”]
Type: str

genrl.trainers.OnPolicyTrainer.
log_interval
¶ Timesteps between successive logging of parameters onto the console
Type: int

genrl.trainers.OnPolicyTrainer.
logdir
¶ Directory where log files should be saved.
Type: str

genrl.trainers.OnPolicyTrainer.
epochs
¶ Total number of epochs to train for
Type: int

genrl.trainers.OnPolicyTrainer.
max_timesteps
¶ Maximum limit of timesteps to train for
Type: int

genrl.trainers.OnPolicyTrainer.
off_policy
¶ True if the agent is an off policy agent, False if it is on policy
Type: bool

genrl.trainers.OnPolicyTrainer.
save_interval
¶ Timesteps between successive saves of the agent’s important hyperparameters
Type: int

genrl.trainers.OnPolicyTrainer.
save_model
¶ Directory where the checkpoints of agent parameters should be saved
Type: str

genrl.trainers.OnPolicyTrainer.
run_num
¶ A run number allotted to the save of parameters
Type: int

genrl.trainers.OnPolicyTrainer.
load_model
¶ File to load saved parameter checkpoint from
Type: str

genrl.trainers.OnPolicyTrainer.
render
¶ True if environment is to be rendered during training, else False
Type: bool

genrl.trainers.OnPolicyTrainer.
evaluate_episodes
¶ Number of episodes to evaluate for
Type: int

genrl.trainers.OnPolicyTrainer.
seed
¶ Set seed for reproducibility
Type: int

genrl.trainers.OnPolicyTrainer.
n_envs
¶ Number of environments
Off-Policy Trainer¶
Off Policy Trainer Class
Trainer class for all the Off Policy Agents: DQN (all variants), DDPG, TD3 and SAC

genrl.trainers.OffPolicyTrainer.
agent
¶ Agent algorithm object
Type: object

genrl.trainers.OffPolicyTrainer.
env
¶ Environment
Type: object

genrl.trainers.OffPolicyTrainer.
buffer
¶ Replay Buffer object
Type: object

genrl.trainers.OffPolicyTrainer.
max_ep_len
¶ Maximum Episode length for training
Type: int

genrl.trainers.OffPolicyTrainer.
max_timesteps
¶ Maximum limit of timesteps to train for
Type: int

genrl.trainers.OffPolicyTrainer.
warmup_steps
¶ Number of warmup steps. (random actions are taken to add randomness to training)
Type: int

genrl.trainers.OffPolicyTrainer.
start_update
¶ Timesteps after which the agent networks should start updating
Type: int

genrl.trainers.OffPolicyTrainer.
update_interval
¶ Timesteps between target network updates
Type: int

genrl.trainers.OffPolicyTrainer.
log_mode
¶ List of different kinds of logging. Supported: [“csv”, “stdout”, “tensorboard”]
Type: list of str

genrl.trainers.OffPolicyTrainer.
log_key
¶ Key plotted on x_axis. Supported: [“timestep”, “episode”]
Type: str

genrl.trainers.OffPolicyTrainer.
log_interval
¶ Timesteps between successive logging of parameters onto the console
Type: int

genrl.trainers.OffPolicyTrainer.
logdir
¶ Directory where log files should be saved.
Type: str

genrl.trainers.OffPolicyTrainer.
epochs
¶ Total number of epochs to train for
Type: int

genrl.trainers.OffPolicyTrainer.
off_policy
¶ True if the agent is an off policy agent, False if it is on policy
Type: bool

genrl.trainers.OffPolicyTrainer.
save_interval
¶ Timesteps between successive saves of the agent’s important hyperparameters
Type: int

genrl.trainers.OffPolicyTrainer.
save_model
¶ Directory where the checkpoints of agent parameters should be saved
Type: str

genrl.trainers.OffPolicyTrainer.
run_num
¶ A run number allotted to the save of parameters
Type: int

genrl.trainers.OffPolicyTrainer.
load_model
¶ File to load saved parameter checkpoint from
Type: str

genrl.trainers.OffPolicyTrainer.
render
¶ True if environment is to be rendered during training, else False
Type: bool

genrl.trainers.OffPolicyTrainer.
evaluate_episodes
¶ Number of episodes to evaluate for
Type: int

genrl.trainers.OffPolicyTrainer.
seed
¶ Set seed for reproducibility
Type: int

genrl.trainers.OffPolicyTrainer.
n_envs
¶ Number of environments
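The timing attributes above (`warmup_steps`, `start_update`, `update_interval`) can be read as a simple per-timestep schedule. The sketch below is one assumed interpretation, not the trainer's actual loop: the function `step_plan` is hypothetical, and it treats `update_interval` as the spacing between updates once `start_update` has been reached.

```python
def step_plan(timestep: int, warmup_steps: int, start_update: int, update_interval: int) -> dict:
    """Decide what happens at one timestep (an assumed reading of the attributes)."""
    return {
        # Before warmup ends, actions are sampled at random
        "random_action": timestep < warmup_steps,
        # Updates begin at start_update and then recur every update_interval steps
        "update": timestep >= start_update and timestep % update_interval == 0,
    }


for t in (0, 500, 1000, 1003, 1050):
    print(t, step_plan(t, warmup_steps=1000, start_update=1000, update_interval=50))
```

Keeping the decision in a pure function like this makes the schedule easy to check before committing to a long training run.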
Classical Trainer¶
Global trainer class for classical RL algorithms
Parameters:  agent (object) – Algorithm object to train
 env (Gym environment) – standard gym environment to train on
 mode (str) – mode of value function update [‘learn’, ‘plan’, ‘dyna’]
 model (str) – model to use for planning [‘tabular’]
 n_episodes (int) – number of training episodes
 plan_n_steps (int) – number of planning steps per environment interaction
 start_steps (int) – number of initial exploration timesteps
 seed (int) – seed for random number generator
 render (bool) – render gym environment
Deep Contextual Bandit Trainer¶
Bandit Trainer Class
Parameters:  agent (genrl.deep.bandit.dcb_agents.DCBAgent) – Agent to train.
 bandit (genrl.deep.bandit.data_bandits.DataBasedBandit) – Bandit to train agent on.
 logdir (str) – Path to directory to store logs in.
 log_mode (List[str]) – List of modes for logging.
Multi Armed Bandit Trainer¶
Bandit Trainer Class
Parameters:  agent (genrl.deep.bandit.dcb_agents.DCBAgent) – Agent to train.
 bandit (genrl.deep.bandit.data_bandits.DataBasedBandit) – Bandit to train agent on.
 logdir (str) – Path to directory to store logs in.
 log_mode (List[str]) – List of modes for logging.
Base Trainer¶
Base Trainer Class
To be inherited for specific use cases

genrl.trainers.Trainer.
agent
¶ Agent algorithm object
Type: object

genrl.trainers.Trainer.
env
¶ Environment
Type: object

genrl.trainers.Trainer.
log_mode
¶ List of different kinds of logging. Supported: [“csv”, “stdout”, “tensorboard”]
Type: list of str

genrl.trainers.Trainer.
log_key
¶ Key plotted on x_axis. Supported: [“timestep”, “episode”]
Type: str

genrl.trainers.Trainer.
log_interval
¶ Timesteps between successive logging of parameters onto the console
Type: int

genrl.trainers.Trainer.
logdir
¶ Directory where log files should be saved.
Type: str

genrl.trainers.Trainer.
epochs
¶ Total number of epochs to train for
Type: int

genrl.trainers.Trainer.
max_timesteps
¶ Maximum limit of timesteps to train for
Type: int

genrl.trainers.Trainer.
off_policy
¶ True if the agent is an off policy agent, False if it is on policy
Type: bool

genrl.trainers.Trainer.
save_interval
¶ Timesteps between successive saves of the agent’s important hyperparameters
Type: int

genrl.trainers.Trainer.
save_model
¶ Directory where the checkpoints of agent parameters should be saved
Type: str

genrl.trainers.Trainer.
run_num
¶ A run number allotted to the save of parameters
Type: int

genrl.trainers.Trainer.
load_weights
¶ Weights file
Type: str

genrl.trainers.Trainer.
load_hyperparams
¶ File to load hyperparameters
Type: str

genrl.trainers.Trainer.
render
¶ True if environment is to be rendered during training, else False
Type: bool

genrl.trainers.Trainer.
evaluate_episodes
¶ Number of episodes to evaluate for
Type: int

genrl.trainers.Trainer.
seed
¶ Set seed for reproducibility
Type: int

genrl.trainers.Trainer.
n_envs
¶ Number of environments
Common¶
Classical Common¶
genrl.classical.common.models¶
genrl.classical.common.trainer¶
genrl.classical.common.values¶
Bandit Common¶
genrl.bandit.core¶
genrl.bandit.trainer¶
genrl.bandit.agents.cb_agents.common.base_model¶

class
genrl.agents.bandits.contextual.common.base_model.
Model
(layer, **kwargs)[source]¶ Bases:
torch.nn.modules.module.Module
,abc.ABC
Bayesian Neural Network used in Deep Contextual Bandit Models.
Parameters:  context_dim (int) – Length of context vector.
 hidden_dims (List[int], optional) – Dimensions of hidden layers of network.
 n_actions (int) – Number of actions that can be selected. Taken as length of output vector for network to predict.
 init_lr (float, optional) – Initial learning rate.
 max_grad_norm (float, optional) – Maximum norm of gradients for gradient clipping.
 lr_decay (float, optional) – Decay rate for learning rate.
 lr_reset (bool, optional) – Whether to reset learning rate every train interval. Defaults to False.
 dropout_p (Optional[float], optional) – Probability for dropout. Defaults to None which implies dropout is not to be used.
 noise_std (float) – Standard deviation of noise used in the network. Defaults to 0.1

use_dropout
¶ Indicates whether or not dropout should be used in forward pass.
Type: int

forward
(context: torch.Tensor, **kwargs) → Dict[str, torch.Tensor][source]¶ Computes forward pass through the network.
Parameters: context (torch.Tensor) – The context vector to perform forward pass on.
Returns: Dictionary of outputs
Return type: Dict[str, torch.Tensor]

train_model
(db: genrl.agents.bandits.contextual.common.transition.TransitionDB, epochs: int, batch_size: int)[source]¶ Trains the network on a given database for given epochs and batch_size.
Parameters:  db (TransitionDB) – The database of transitions to train on.
 epochs (int) – Number of gradient steps to take.
 batch_size (int) – The size of each batch to perform gradient descent on.
genrl.bandit.agents.cb_agents.common.bayesian¶

class
genrl.agents.bandits.contextual.common.bayesian.
BayesianLinear
(in_features: int, out_features: int, bias: bool = True)[source]¶ Bases:
torch.nn.modules.module.Module
Linear Layer for Bayesian Neural Networks.
Parameters:  in_features (int) – size of each input sample
 out_features (int) – size of each output sample
 bias (bool, optional) – Whether to use an additive bias. Defaults to True.

forward
(x: torch.Tensor, kl: bool = True, frozen: bool = False) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]¶ Apply linear transformation to input.
The weight and bias are sampled for each forward pass from a normal distribution. The KL divergence of the sampled weight and bias can also be computed if specified.
Parameters:  x (torch.Tensor) – Input to be transformed
 kl (bool, optional) – Whether to compute the KL divergence. Defaults to True.
 frozen (bool, optional) – Whether to freeze current parameters. Defaults to False.
Returns: The transformed input and optionally the computed KL divergence value.
Return type: Tuple[torch.Tensor, Optional[torch.Tensor]]
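The two ingredients described above, sampling a weight from a normal distribution and computing a KL term, can be sketched for a single scalar weight. This is not the layer's implementation: the functions are hypothetical, the standard normal prior is an assumption, and the closed-form KL between two Gaussians is used here in place of whatever estimate the layer computes from its sampled parameters.

```python
import math
import random


def sample_weight(mu: float, sigma: float, rng: random.Random) -> float:
    """Draw w ~ N(mu, sigma^2) as mu + sigma * eps with eps ~ N(0, 1)."""
    return mu + sigma * rng.gauss(0.0, 1.0)


def kl_normal(mu: float, sigma: float, prior_mu: float = 0.0, prior_sigma: float = 1.0) -> float:
    """Closed-form KL( N(mu, sigma^2) || N(prior_mu, prior_sigma^2) )."""
    return (
        math.log(prior_sigma / sigma)
        + (sigma ** 2 + (mu - prior_mu) ** 2) / (2 * prior_sigma ** 2)
        - 0.5
    )


rng = random.Random(0)
w = sample_weight(mu=0.2, sigma=0.1, rng=rng)
print(w, kl_normal(mu=0.2, sigma=0.1))
```

Writing the sample as `mu + sigma * eps` keeps the randomness in `eps`, which is what allows gradients to reach `mu` and `sigma` when the same idea is applied to tensors.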

class
genrl.agents.bandits.contextual.common.bayesian.
BayesianNNBanditModel
(**kwargs)[source]¶ Bases:
genrl.agents.bandits.contextual.common.base_model.Model
Bayesian Neural Network used in Deep Contextual Bandit Models.
Parameters:  context_dim (int) – Length of context vector.
 hidden_dims (List[int], optional) – Dimensions of hidden layers of network.
 n_actions (int) – Number of actions that can be selected. Taken as length of output vector for network to predict.
 init_lr (float, optional) – Initial learning rate.
 max_grad_norm (float, optional) – Maximum norm of gradients for gradient clipping.
 lr_decay (float, optional) – Decay rate for learning rate.
 lr_reset (bool, optional) – Whether to reset learning rate every train interval. Defaults to False.
 dropout_p (Optional[float], optional) – Probability for dropout. Defaults to None which implies dropout is not to be used.
 noise_std (float) – Standard deviation of noise used in the network. Defaults to 0.1

use_dropout
¶ Indicates whether or not dropout should be used in forward pass.
Type: int
genrl.bandit.agents.cb_agents.common.neural¶

class
genrl.agents.bandits.contextual.common.neural.
NeuralBanditModel
(**kwargs)[source]¶ Bases:
genrl.agents.bandits.contextual.common.base_model.Model
Neural Network used in Deep Contextual Bandit Models.
Parameters:  context_dim (int) – Length of context vector.
 hidden_dims (List[int], optional) – Dimensions of hidden layers of network.
 n_actions (int) – Number of actions that can be selected. Taken as length of output vector for network to predict.
 init_lr (float, optional) – Initial learning rate.
 max_grad_norm (float, optional) – Maximum norm of gradients for gradient clipping.
 lr_decay (float, optional) – Decay rate for learning rate.
 lr_reset (bool, optional) – Whether to reset learning rate every train interval. Defaults to False.
 dropout_p (Optional[float], optional) – Probability for dropout. Defaults to None which implies dropout is not to be used.

use_dropout
¶ Indicates whether or not dropout should be used in forward pass.
Type: bool
genrl.bandit.agents.cb_agents.common.transition¶

class
genrl.agents.bandits.contextual.common.transition.
TransitionDB
(device: Union[str, torch.device] = 'cpu')[source]¶ Bases:
object
Database for storing (context, action, reward) transitions.
Parameters: device (str) – Device to use for tensor operations. “cpu” for cpu or “cuda” for cuda. Defaults to “cpu”. 
db
¶ Dictionary containing list of transitions.
Type: dict

db_size
¶ Number of transitions stored in database.
Type: int

device
¶ Device to use for tensor operations.
Type: torch.device

add
(context: torch.Tensor, action: int, reward: int)[source]¶ Add (context, action, reward) transition to database
Parameters:  context (torch.Tensor) – Context received
 action (int) – Action taken
 reward (int) – Reward received

get_data
(batch_size: Optional[int] = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]¶ Get a batch of transitions from database
Parameters: batch_size (Union[int, None], optional) – Size of batch required. Defaults to None which implies all transitions in the database are to be included in batch.
Returns: Tuple of stacked contexts, actions, rewards tensors.
Return type: Tuple[torch.Tensor, torch.Tensor, torch.Tensor]

get_data_for_action
(action: int, batch_size: Optional[int] = None) → Tuple[torch.Tensor, torch.Tensor][source]¶ Get a batch of transitions from database for a given action.
Parameters:  action (int) – The action to sample transitions for.
 batch_size (Union[int, None], optional) – Size of batch required. Defaults to None which implies all transitions in the database are to be included in batch.
Returns: Tuple of stacked contexts and rewards tensors.
Return type: Tuple[torch.Tensor, torch.Tensor]
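The behaviour of `add`, `get_data` and `get_data_for_action` can be sketched with plain Python lists. The class `ListTransitionDB` below is a hypothetical stand-in, not GenRL's `TransitionDB`: it returns lists rather than stacked tensors, ignores the device argument, and assumes batches are sampled without replacement.

```python
import random
from typing import List, Optional, Sequence, Tuple


class ListTransitionDB:
    """Hypothetical list-backed store of (context, action, reward) transitions."""

    def __init__(self) -> None:
        self.db = {"contexts": [], "actions": [], "rewards": []}
        self.db_size = 0

    def add(self, context: Sequence[float], action: int, reward: float) -> None:
        self.db["contexts"].append(list(context))
        self.db["actions"].append(action)
        self.db["rewards"].append(reward)
        self.db_size += 1

    def _sample(self, indices: List[int], batch_size: Optional[int]) -> List[int]:
        # batch_size=None means "use every candidate transition"
        if batch_size is None or batch_size >= len(indices):
            return indices
        return random.sample(indices, batch_size)

    def get_data(self, batch_size: Optional[int] = None) -> Tuple[list, list, list]:
        idx = self._sample(list(range(self.db_size)), batch_size)
        return (
            [self.db["contexts"][i] for i in idx],
            [self.db["actions"][i] for i in idx],
            [self.db["rewards"][i] for i in idx],
        )

    def get_data_for_action(self, action: int, batch_size: Optional[int] = None) -> Tuple[list, list]:
        # Keep only the transitions in which the given action was taken
        candidates = [i for i, a in enumerate(self.db["actions"]) if a == action]
        idx = self._sample(candidates, batch_size)
        return [self.db["contexts"][i] for i in idx], [self.db["rewards"][i] for i in idx]


db = ListTransitionDB()
db.add([0.1, 0.9], action=0, reward=1)
db.add([0.7, 0.3], action=1, reward=0)
db.add([0.2, 0.8], action=0, reward=1)
print(db.get_data_for_action(0))  # ([[0.1, 0.9], [0.2, 0.8]], [1, 1])
```

Filtering by action before sampling mirrors the per-action view that some contextual bandit agents need when they fit a separate model for each arm.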
