Welcome to GenRL’s documentation!

Features

  • Unified Trainer and Logging class: code reusability and high-level UI
  • Ready-made algorithm implementations: implementations of popular RL algorithms, ready to use out of the box
  • Extensive Benchmarking
  • Environment implementations
  • Heavy Encapsulation useful for new algorithms

Contents

Installation

PyPI Package

GenRL is compatible with Python 3.6 or later and also depends on PyTorch and OpenAI Gym. The easiest way to install GenRL is with pip, Python’s preferred package installer.

$ pip install genrl

Note that GenRL is an active project and routinely publishes new releases. In order to upgrade GenRL to the latest version, use pip as follows.

$ pip install -U genrl

From Source

If you intend to install the latest unreleased version of the library (i.e. from source), you can simply do:

$ git clone https://github.com/SforAiDl/genrl.git
$ cd genrl
$ python setup.py install

About

Introduction

Reinforcement Learning has taken massive leaps forward in extending current AI research. DeepMind’s paper on playing Atari with Deep Reinforcement Learning (Mnih et al., 2013) can be considered one of the seminal papers in establishing a completely new landscape of Reinforcement Learning research. With applications in Robotics, Healthcare and numerous other domains, RL has become the prime mechanism for modelling sequential decision making through AI.

Yet, current libraries and resources in Reinforcement Learning are either very limited, messy or scattered. OpenAI’s Spinning Up is a great resource for getting started with Deep Reinforcement Learning, but it does not cover more basic concepts in Reinforcement Learning such as Multi Armed Bandits. garage is a great resource for reproducing and evaluating RL algorithms, but it does not introduce a newcomer to RL.

With GenRL, our goal is three-fold:

  • To educate the user about Reinforcement Learning.
  • To provide easy-to-understand implementations of State of the Art Reinforcement Learning algorithms.
  • To provide utilities for developing and evaluating new RL algorithms; in a sense, to be able to implement any new RL algorithm in less than 200 lines.

Policies and Values

Modern research on Reinforcement Learning is largely based on Markov Decision Processes. Policy and Value Functions are among the core parts of such a problem formulation, and so policies and values form one of the core parts of our library.

Trainers and Loggers

Trainers

Most current algorithms follow a standard training procedure. Considering the split between On-Policy and Off-Policy algorithms, we provide high-level APIs through Trainers, which can be coupled with Agents and Environments for seamless training.

Let’s take the example of an On-Policy algorithm, Proximal Policy Optimization. In our Agent, we make sure to define three methods: collect_rollouts, get_traj_loss and finally update_policy.

The OnPolicyTrainer simply calls these functions, so defining just those three methods is enough to get high-level training.
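For instance, using the PPO1 agent from GenRL (mirroring the PPO example later in these docs), the whole training workflow reduces to:

from genrl import PPO1
from genrl.deep.common import OnPolicyTrainer
from genrl.environments import VectorEnv

env = VectorEnv("CartPole-v0")
agent = PPO1('mlp', env)

# The trainer repeatedly calls collect_rollouts, get_traj_loss and update_policy
trainer = OnPolicyTrainer(agent, env, log_mode=['stdout'])
trainer.train()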

Loggers

At the moment, we support three different types of Loggers: HumanOutputFormat, TensorboardLogger and CSVLogger. Any of these loggers can be initialized easily through the top level Logger class by specifying the individual formats in which logging should be performed.

logger = Logger(logdir='logs/', formats=['stdout', 'tensorboard'])

After this, the logger can log data simply by being given dictionaries of key-value pairs. For example:

logger.write({"logger":0})

Note: The Tensorboard logger requires an extra x-axis parameter, since it plots the data rather than just showing it in a tabular format.
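For example, when logging to Tensorboard you can include the x-axis value (such as the timestep) in the dictionary and point the logger at it through a log key. The keyword below is an assumption based on the log_key argument the trainers use later in these docs; check the Logger docs for the exact signature.

logger = Logger(logdir='logs/', formats=['stdout', 'tensorboard'])
# "timestep" acts as the x-axis for the Tensorboard plots (assumed keyword)
logger.write({"timestep": 100, "loss": 0.5}, log_key="timestep")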

Agent Encapsulation

WIP

Environments

Wrappers

Tutorials

Bandit Tutorials

Multi Armed Bandit Overview

Training an EpsilonGreedy agent on a Bernoulli Multi Armed Bandit

The multi-armed bandit is one of the most basic problems in RL. Think of it like this: you have ‘n’ levers in front of you, and each of these levers will give you a different reward. For the purpose of formalising the problem, the reward is described by a reward function, i.e. the probability of getting a reward when a lever is pulled.

Suppose you try out one of the levers and get a positive reward. What do you do next? Should you just keep pulling that lever every time, or consider that one of the other levers might give a better reward? This is the exploration-exploitation dilemma.

Exploitation - Utilise the information you have gathered so far to make the best decision. In this case, after one try you know a lever is giving you a positive reward and you just keep exploiting it. Since you do not care about the other arms if you keep exploiting, this is known as the greedy action.

Exploration - You explore the untried levers in an attempt to maybe discover another one which has a higher payout than the one you currently have some knowledge about. This is exploring all your options without worrying about the short-term rewards, in hope of finding a lever with a bigger reward, in the long run.

You have to use an algorithm which correctly trades off exploration and exploitation. We do not want a ‘greedy’ algorithm which only exploits and does not explore at all, because there is a very high chance it will converge to a sub-optimal policy. We do not want an algorithm that keeps exploring either, as this would lead to sub-optimal rewards in spite of knowing the best action to take. In this case, the optimal policy is to always pull the lever with the highest reward, but at the beginning we do not know the probability distribution of the rewards.

So, we want a policy which explores actively at the beginning, building up an estimate of the reward values (defined as quality) of all the actions, and then exploits that knowledge from then onwards.

A Bernoulli Multi-Armed Bandit has multiple arms, each with a different Bernoulli distribution over its reward. Basically, each arm has a probability associated with it, which is the probability of getting a reward if that arm is pulled. Our aim is to find the arm which has the highest probability, thus giving us the maximum return.
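To make this concrete, here is a tiny standalone simulation (plain NumPy, independent of genrl) of pulling the arms of a Bernoulli bandit while maintaining empirical quality estimates:

import numpy as np

rng = np.random.default_rng(0)
probs = np.array([0.2, 0.5, 0.75])   # true (unknown) reward probabilities per arm

counts = np.zeros(3)                 # N(a): number of pulls per arm
values = np.zeros(3)                 # Q(a): empirical estimate of each arm's quality

for _ in range(1000):
    arm = rng.integers(3)                        # pull arms uniformly at random
    reward = float(rng.random() < probs[arm])    # Bernoulli reward
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]   # incremental mean

print(values)   # the estimates approach [0.2, 0.5, 0.75]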

Notation:

\(Q_t(a)\): Estimated quality of action ‘a’ at timestep ‘t’.

\(q(a)\): True value of action ‘a’.

We want our estimate \(Q_t(a)\) to be as close to the true value \(q(a)\) as possible, so we can make the correct decision.

Let the action with the maximum quality be \(a^*\):

\[q^* = q(a^*)\]

Our goal is to find this \(q^*\).

The ‘regret function’ is defined as the sum of ‘regret’ accumulated over all timesteps. This regret is the cost of exploring instead of choosing the optimal arm. Mathematically it can be written as:

\[L = E[\sum_{t=0}^T q^* - Q_t(a)]\]

Some policies which are effective at exploring are:

  1. Epsilon Greedy
  2. Gradient Algorithm
  3. UCB (Upper Confidence Bound)
  4. Bayesian
  5. Thompson Sampling

Epsilon Greedy is the most basic exploratory policy; it follows a simple principle to balance exploration and exploitation. Most of the time it ‘exploits’ the current knowledge of the bandit, i.e. takes the action with the largest estimated quality. But with a small probability epsilon, it explores by taking a random action. The value of epsilon signifies how much you want the agent to explore: the higher the value, the more it explores. But remember that you do not want the agent to keep exploring heavily even after it has a confident estimate of the reward function, so the value of epsilon should be neither too high nor too low!

For the bandit, you can set the number of bandits, the number of arms, and also the reward probabilities of each of these arms separately.

Code to train an Epsilon Greedy agent on a Bernoulli Multi-Armed Bandit:

import gym
import numpy as np

from genrl.bandit import BernoulliMAB, EpsGreedyMABAgent, MABTrainer

bandits = 1
arms = 5

reward_probs = np.random.random(size=(bandits, arms))
bandit = BernoulliMAB(bandits=bandits, arms=arms, reward_probs=reward_probs, context_type="int")
agent = EpsGreedyMABAgent(bandit, eps=0.05)

trainer = MABTrainer(agent, bandit)
trainer.train(timesteps=10000)

More details can be found in the docs for BernoulliMAB, EpsGreedyMABAgent, MABTrainer.

You can also refer to the book “Reinforcement Learning: An Introduction”, Chapter 2 for further information on bandits.

Contextual Bandits Overview

Problem Setting

To get some background on the basic multi armed bandit problem, we recommend that you go through the Multi Armed Bandit Overview first. The contextual bandit (CB) problem varies from the basic case in that at each timestep, a context vector \(x \in \mathbb{R}^d\) is presented to the agent. The agent must then decide on an action \(a \in \mathcal{A}\) to take based on that context. After the action is taken, the reward \(r \in \mathbb{R}\) for only that action is revealed to the agent (a feature of all reinforcement learning problems). The aim of the agent remains the same - minimising regret and thus finding an optimal policy.

Here you still have the problem of exploration vs exploitation, but the agent also needs to find some relation between the context and reward.

A Simple Example

Lets consider the simplest case of the CB problem. Instead of having only one \(k\)-armed bandit that needs to be solved, say we have \(m\) different \(k\)-armed Bernoulli bandits. At each timestep, the context presented is the number of the bandit for which an action needs to be selected: \(i \in \mathbb{I}\) where \(0 < i \le m\)

Although real-life CB problems usually have much higher dimensional contexts, such a toy problem can be useful for testing and debugging agents.

To instantiate a Bernoulli bandit with \(m =10\) and \(k = 5\) (10 different 5-armed bandits) -

from genrl.bandit import BernoulliMAB

bandit = BernoulliMAB(bandits=10, arms=5, context_type="int")

Note that this is the same BernoulliMAB as in the simple bandit case, except that instead of letting the bandits argument default to 1, we are explicitly saying we want multiple bandits (a contextual case).

Suppose you want to solve this bandit with a UCB based policy.

from genrl.bandit import UCBMABAgent

agent = UCBMABAgent(bandit)
context = bandit.reset()

action = agent.select_action(context)
new_context, reward = bandit.step(action)

To train the agent, you can set up a loop which calls the update_params method on the agent whenever you want the agent to learn from the actions it has taken. For convenience, it is highly recommended to use the MABTrainer in such cases.
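Using the trainer, the whole loop collapses to the same pattern as the other MAB examples in these docs:

from genrl.bandit import MABTrainer

trainer = MABTrainer(agent, bandit)
trainer.train(timesteps=10000)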

Data based Contextual Bandits

Let’s consider a more realistic class of CB problem. In real life, the CB setting is usually used to model recommendation or classification problems. Here, instead of getting an integer as the context, you will get a \(d\)-dimensional feature vector \(\mathbf{x} \in \mathbb{R}^d\). This is also different from regular classification since you only get the reward \(r \in \mathbb{R}\) for the action you have taken.

While tabular solutions can work well for integer contexts (see the implementation of any genrl.bandit.MABAgent for details), when you have a high dimensional vector the agent must be able to infer the complex relation between contexts and rewards. This can be done by modelling a conditional distribution over rewards for each action given the context.

\[P(r | a, \mathbf{x})\]

There are many ways to do this. For a detailed explanation and comparison of contextual bandit methods you can refer to this paper.

The following agents are implemented in genrl.

You can find the tutorials for most of these in Bandit Tutorials.

All the methods which use neural networks provide options to train and evaluate with dropout, use a decaying learning rate and clip gradients. The sizes of the hidden layers of the networks can also be specified. Refer to the docs of the specific agents to see how to use these options.

Individual agents will have other method-specific parameters to control behavior. Although default values have been provided, it may be necessary to tune these for individual use cases.
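As an illustration, a bootstrap agent with two hidden layers and dropout could be configured as below. The parameter names mirror the examples later in these docs; other agents may use slightly different names, so check the relevant docs.

from genrl.bandit import BootstrapNeuralAgent

# hidden_dims sets the sizes of the hidden layers of each network,
# dropout_p enables dropout during training
agent = BootstrapNeuralAgent(bandit, hidden_dims=[128, 64], dropout_p=0.3)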

The following dataset-based bandits are implemented in genrl.

For each bandit, while instantiating an object you can either specify a path to the data file or pass download=True as an argument to download the data directly.

Data based Bandit Example

For this example, we’ll model the Statlog dataset as a bandit problem. You can read more about the bandit in the Statlog docs. In brief, we have \(k = 7\) arms and a context vector of dimension \(d = 9\). The agent gets a reward \(r = 1\) if it selects the correct arm, else \(r = 0\).

from genrl.bandit import StatlogDataBandit

bandit = StatlogDataBandit(download=True)
context = bandit.reset()

Suppose you want to solve this bandit with a neural network based policy, for example the neural linear posterior agent.

from genrl.bandit import NeuralLinearPosteriorAgent

agent = NeuralLinearPosteriorAgent(bandit)
context = bandit.reset()

action = agent.select_action(context)
new_context, reward = bandit.step(action)

To train the agent, we highly recommend using the DCBTrainer. You can refer to the implementation of its train function to get an idea of how to implement your own training loop.

from genrl.bandit import DCBTrainer

trainer = DCBTrainer(agent, bandit)
trainer.train(timesteps=5000, batch_size=32)
Further material about bandits
  1. Deep Contextual Multi-armed Bandits, Collier and Llorens, 2018
  2. Deep Bayesian Bandits Showdown, Riquelme et al, 2018
  3. A Contextual Bandit Bake-off, Bietti et al, 2020

UCB

Training a UCB algorithm on a Bernoulli Multi-Armed Bandit

For an introduction to Multi Armed Bandits, refer to Multi Armed Bandit Overview

The UCB algorithm follows a basic principle: ‘optimism in the face of uncertainty’. We give the benefit of the doubt to actions we are still uncertain about by calculating an upper confidence bound on the quality (reward) of each action, and then selecting the greedy action with respect to this upper bound.

Hoeffding’s inequality:

\[P[q(a) > Q_t(a) + U_t(a)] \le e ^ {-2 N_t(a) U_t(a)^2}\]

where

\(q(a)\) is the true quality of that action,

\(Q_t(a)\) is the estimate of the quality of action ‘a’ at time ‘t’,

\(U_t(a)\) is the upper bound on the uncertainty for that action at time ‘t’,

\(N_t(a)\) is the number of times action ‘a’ has been selected.

Setting the right-hand side to a small probability that decays with time, \(t^{-4}\), and solving for \(U_t(a)\):

\[e ^ {-2 N_t(a) U_t(a)^2} = t^{-4}\]
\[U_t(a) = \sqrt{\frac{2 \log t}{N_t(a)}}\]

Action taken: \(a_t = \operatorname{argmax}_a \left( Q_t(a) + U_t(a) \right)\)

As we can see, the less an action has been tried, the greater the uncertainty (due to \(N_t(a)\) being in the denominator), which gives that action a higher chance of being explored. Also, theoretically, as \(N_t(a)\) goes to infinity, the uncertainty decreases to 0, giving us the true value of the quality of that action: \(q(a)\). This allows us to ‘exploit’ the greedy action \(a^*\) from then on.

Code to train a UCB agent on a Bernoulli Multi-Armed Bandit:

import gym
import numpy as np

from genrl.bandit import BernoulliMAB, MABTrainer, UCBMABAgent

bandits = 10
arms = 5

reward_probs = np.random.random(size=(bandits, arms))
bandit = BernoulliMAB(bandits, arms, reward_probs, context_type="int")
agent = UCBMABAgent(bandit, confidence=1.0)

trainer = MABTrainer(agent, bandit)
trainer.train(timesteps=10000)

More details can be found in the docs for BernoulliMAB, UCBMABAgent and MABTrainer.

Thompson Sampling

Using Thompson Sampling on a Bernoulli Multi-Armed Bandit

For an introduction to Multi Armed Bandits, refer to Multi Armed Bandit Overview

Thompson Sampling is one of the best methods for solving the Bernoulli multi-armed bandits problem. It is a ‘sample-based probability matching’ method.

We start by assuming a prior distribution over the quality of each of the arms. We can model this prior using a Beta distribution, parametrised by alpha (\(\alpha\)) and beta (\(\beta\)).

\[PDF = \frac{x^{\alpha - 1} (1-x)^{\beta -1}}{B(\alpha, \beta)}\]

Let’s just think of the denominator as a normalising constant and focus on the numerator for now. We initialise \(\alpha\) = \(\beta\) = 1. This results in a uniform distribution over the values (0, 1), making all values of quality from 0 to 1 equally probable, so it is a fair initial assumption. Now think of \(\alpha\) as the number of times we get the reward ‘1’ and \(\beta\) as the number of times we get ‘0’ for a particular arm. As our agent interacts with the environment and gets a reward for pulling an arm, we update our prior for that arm using Bayes’ Theorem. This gives a posterior distribution over the quality, according to the rewards we have seen so far.

At each timestep, we sample the quality \(Q_t(a)\) for each arm from its posterior and select the arm whose sample has the highest value. The more an action is tried out, the narrower the distribution over its quality becomes, meaning we have a confident estimate of its quality \(q(a)\). If an action has not been tried out that often, it will have a wider distribution (higher variance), meaning we are uncertain about our estimate of its quality. This wider variance of an arm with an uncertain estimate creates opportunities for it to be picked during sampling.

As we experience more successes for a particular arm, the value of \(\alpha\) for that arm increases, and similarly, the more failures we experience, the more \(\beta\) increases. The higher the value of one parameter compared to the other, the more skewed the distribution is in that direction. For example, if \(\alpha\) = 100 and \(\beta\) = 50, we have seen considerably more successes than failures for this arm and so our estimate of its quality should be greater than 0.5. This is reflected in the posterior of this arm: the mean of the distribution, given by \(\frac{\alpha}{\alpha + \beta}\), is 0.66, which is greater than 0.5 as expected.
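The mechanism is easy to see in a few lines of plain NumPy (independent of genrl): keep a Beta(\(\alpha\), \(\beta\)) posterior per arm, sample a quality from each, act greedily on the samples and update the counts of the pulled arm.

import numpy as np

rng = np.random.default_rng(0)
probs = np.array([0.2, 0.5, 0.75])      # true reward probabilities (unknown to the agent)
alpha = np.ones(3)                      # 1 + number of successes per arm
beta = np.ones(3)                       # 1 + number of failures per arm

for _ in range(1000):
    samples = rng.beta(alpha, beta)              # one quality sample per arm
    arm = int(np.argmax(samples))                # act greedily w.r.t. the samples
    reward = float(rng.random() < probs[arm])    # Bernoulli reward
    alpha[arm] += reward                         # update the pulled arm's posterior
    beta[arm] += 1.0 - reward

print(alpha / (alpha + beta))                    # posterior means approach the true probabilities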

Code to use Thompson Sampling on a Bernoulli Multi-Armed Bandit:

import gym
import numpy as np

from genrl.bandit import BernoulliMAB, MABTrainer, ThompsonSamplingMABAgent

bandits = 10
arms = 5
alpha = 1.0
beta = 1.0

reward_probs = np.random.random(size=(bandits, arms))
bandit = BernoulliMAB(bandits, arms, reward_probs, context_type="int")
agent = ThompsonSamplingMABAgent(bandit, alpha, beta)

trainer = MABTrainer(agent, bandit)
trainer.train(timesteps=10000)

More details can be found in the docs for BernoulliMAB, ThompsonSamplingMABAgent and MABTrainer.

Bayesian

Using Bayesian Method on a Bernoulli Multi-Armed Bandit

For an introduction to Multi Armed Bandits, refer to Multi Armed Bandit Overview

This method is also based on the principle of ‘Optimism in the face of uncertainty’, like UCB. We start by assuming a prior distribution over the quality of each of the arms. We can model this prior using a Beta distribution, parametrised by alpha (\(\alpha\)) and beta (\(\beta\)).

\[PDF = \frac{x^{\alpha - 1} (1-x)^{\beta -1}}{B(\alpha, \beta)}\]

Let’s just think of the denominator as a normalising constant and focus on the numerator for now. We initialise \(\alpha\) = \(\beta\) = 1. This results in a uniform distribution over the values (0, 1), making all values of quality from 0 to 1 equally probable, so it is a fair initial assumption. Now think of \(\alpha\) as the number of times we get the reward ‘1’ and \(\beta\) as the number of times we get ‘0’ for a particular arm. As our agent interacts with the environment and gets a reward for pulling an arm, we update our prior for that arm using Bayes’ Theorem. This gives a posterior distribution over the quality, according to the rewards we have seen so far.

This is quite similar to Thompson Sampling, but the difference is that here we explicitly calculate the uncertainty of a particular action as the standard deviation (\(\sigma\)) of its posterior. We add this standard deviation to the mean of the posterior, giving us an upper bound on the quality of that arm. At each timestep we select the greedy action based on this upper bound.

\[a_t = argmax(q_t(a) + \sigma_{q_t})\]

As we try an action more and more, the standard deviation of its posterior decreases, corresponding to a decrease in the uncertainty of that action, which is exactly what we want. If an action has not been tried that often, it will have a wider posterior, meaning a higher chance of it getting selected based on its upper bound.
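In NumPy terms (again independent of genrl), the upper bound for each arm is just the Beta posterior mean plus its standard deviation:

import numpy as np

alpha = np.array([100.0, 3.0, 1.0])   # 1 + observed successes per arm
beta = np.array([50.0, 3.0, 1.0])     # 1 + observed failures per arm

mean = alpha / (alpha + beta)
std = np.sqrt(alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1)))

ucb = mean + std                      # upper bound on each arm's quality
arm = int(np.argmax(ucb))             # greedy action with respect to the upper bound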

Code to use Bayesian method on a Bernoulli Multi-Armed Bandit:

import gym
import numpy as np

from genrl.bandit import BayesianUCBMABAgent, BernoulliMAB, MABTrainer

bandits = 10
arms = 5
alpha = 1.0
beta = 1.0

reward_probs = np.random.random(size=(bandits, arms))
bandit = BernoulliMAB(bandits, arms, reward_probs, context_type="int")
agent = BayesianUCBMABAgent(bandit, alpha, beta)

trainer = MABTrainer(agent, bandit)
trainer.train(timesteps=10000)

More details can be found in the docs for BernoulliMAB, BayesianUCBMABAgent and MABTrainer.

Gradients

Using Gradient Method on a Bernoulli Multi-Armed Bandit

For an introduction to Multi Armed Bandits, refer to Multi Armed Bandit Overview

This method is different from the others. In the other methods, we explicitly attempt to estimate the ‘value’ of taking an action (its quality), whereas here we approach the problem differently. Instead of estimating how good an action is through its quality, we only care about its preference of being selected compared to the other actions. We denote this preference by \(H_t(a)\). The larger the preference of an action ‘a’, the higher its chances of being selected, but this preference has no interpretation in terms of the reward for that action. Only the relative preference is important.

The action probabilities are related to these action preferences \(H_t(a)\) by a softmax function. The probability of taking action \(a_j\) is given by:

\[P(a_j) = \frac{e^{H_t(a_j)}}{\sum_{i=1}^A e^{H_t(a_i)}} = \pi_t(a_j)\]

where, A is the total number of actions and \(\pi_t(a)\) is the probability of taking action ‘a’ at timestep ‘t’.

We initialise the preferences for all the actions to be 0, meaning \(\pi_t(a) = \frac{1}{A}\) for all actions.

After computing \(\pi_t(a)\) for all actions at each timestep, the action is sampled using this probability. Then that action is performed and based on the reward we get, we update our preferences.

The update rule basically performs stochastic gradient ascent:

\(H_{t+1}(a_t) = H_t(a_t) + \alpha (R_t - \bar{R_t})(1-\pi_t(a_t))\), for \(a_t\): action taken at time ‘t’

\(H_{t+1}(a) = H_t(a) - \alpha (R_t - \bar{R_t})(\pi_t(a))\) for rest of the actions

where \(\alpha\) is the step size, \(R_t\) is the reward obtained at time ‘t’ and \(\bar{R_t}\) is the mean reward obtained up to time ‘t’. If the current reward is larger than the mean reward, we increase our preference for the action taken at time ‘t’; if it is lower, we decrease our preference for that action. The preferences for the rest of the actions are updated in the opposite direction.
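A compact NumPy sketch of these updates (independent of genrl) looks like this:

import numpy as np

rng = np.random.default_rng(0)
probs = np.array([0.2, 0.5, 0.75])   # true reward probabilities
H = np.zeros(3)                      # action preferences, initialised to 0
mean_reward, alpha = 0.0, 0.1

for t in range(1, 1001):
    pi = np.exp(H) / np.exp(H).sum()          # softmax over preferences
    a = rng.choice(3, p=pi)                   # sample an action
    r = float(rng.random() < probs[a])        # Bernoulli reward
    mean_reward += (r - mean_reward) / t      # running mean of rewards (baseline)

    # Gradient ascent on the preferences: every action is pushed down by alpha*(r - mean)*pi,
    # and the taken action gets an extra +alpha*(r - mean), i.e. a net (1 - pi[a]) factor.
    H -= alpha * (r - mean_reward) * pi
    H[a] += alpha * (r - mean_reward)

print(np.exp(H) / np.exp(H).sum())            # probability mass concentrates on the best arm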

For a more detailed mathematical analysis and derivation of the update rule, refer to chapter 2 of Sutton & Barto.

Code to use the Gradient method on a Bernoulli Multi-Armed Bandit:

import gym
import numpy as np

from genrl.bandit import BernoulliMAB, GradientMABAgent, MABTrainer

bandits = 10
arms = 5

reward_probs = np.random.random(size=(bandits, arms))
bandit = BernoulliMAB(bandits, arms, reward_probs, context_type="int")
agent = GradientMABAgent(bandit, alpha=0.1, temp=0.01)

trainer = MABTrainer(agent, bandit)
trainer.train(timesteps=10000)

More details can be found in the docs for BernoulliMAB, GradientMABAgent and MABTrainer.

Linear Posterior Inference

For an introduction to the Contextual Bandit problem, refer to Contextual Bandits Overview.

In this agent we assume a linear relationship between context and reward distribution of the form

\[Y = X^T \beta + \epsilon \ \ \text{where} \ \epsilon \sim \mathcal{N}(0, \sigma^2)\]

We can use Bayesian linear regression to find the parameters \(\beta\) and \(\sigma\). Since our agent is continually learning, the parameters of the model keep being updated according to the (\(\mathbf{x}\), \(a\), \(r\)) transitions it observes.

For more complex non linear relations, we can make use of neural networks to transform the context into a learned embedding space. The above method can then be used on this latent embedding to model the reward.

An example of using a neural network based linear posterior agent in genrl -

from genrl.bandit import NeuralLinearPosteriorAgent, DCBTrainer

agent = NeuralLinearPosteriorAgent(bandit, lambda_prior=0.5, a0=2, b0=2, device="cuda")

trainer = DCBTrainer(agent, bandit)
trainer.train()

Note that the priors here are used to parameterise the initial distributions over \(\beta\) and \(\sigma\). More specifically, lambda_prior is used to parameterise a Gaussian distribution for \(\beta\), while a0 and b0 are parameters of an inverse gamma distribution over \(\sigma^2\). These are updated over the course of exploring a bandit. More details can be found in Section 3 of this paper.

All hyperparameters can be tuned for individual use cases to improve training efficiency and achieve convergence faster.

Refer to the LinearPosteriorAgent, NeuralLinearPosteriorAgent and DCBTrainer docs for more details.

Variational Inference

For an introduction to the Contextual Bandit problem, refer to Contextual Bandits Overview.

In this method, we try to find a distribution \(P_{\theta}(r | \mathbf{x}, a)\) by minimising its KL divergence with the true distribution. For the model we use a neural network where each weight is modelled by an independent Gaussian, also known as a Bayesian Neural Net.

An example of using a variational inference based agent in genrl, with a Bayesian net with a hidden layer of 128 neurons and a standard deviation of 0.1 for all the weights -

from genrl.bandit import VariationalAgent, DCBTrainer

agent = VariationalAgent(bandit, hidden_dims=[128], noise_std=0.1, device="cuda")

trainer = DCBTrainer(agent, bandit)
trainer.train()

Refer to the VariationalAgent, and DCBTrainer docs for more details.

Bootstrap

For an introduction to the Contextual Bandit problem, refer to Contextual Bandits Overview.

In the bootstrap agent, multiple different neural network based models are trained simultaneously. A separate transition database is maintained for each model, and every time we observe a transition it is added to each dataset with some probability. At each timestep, the model used to select an action is chosen randomly from the set of models.

By having multiple different models initialised with different random weights, we promote the exploration of the loss landscape which may have multiple different local optima.
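Conceptually, the bookkeeping looks something like the following sketch (an illustration of the idea only, not genrl’s actual implementation):

import random

n_models = 10
add_prob = 0.8                        # chance of adding a transition to each model's dataset
datasets = [[] for _ in range(n_models)]

def record_transition(context, action, reward):
    # each model accumulates its own bootstrapped subset of the data
    for db in datasets:
        if random.random() < add_prob:
            db.append((context, action, reward))

def select_model():
    # a randomly chosen model is used to pick the next action
    return random.randrange(n_models)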

An example of using a bootstrap based agent in genrl with 10 models, each with a hidden layer of 128 neurons, which also use dropout during training -

from genrl.bandit import BootstrapNeuralAgent, DCBTrainer

agent = BootstrapNeuralAgent(bandit, hidden_dims=[128], n=10, dropout_p=0.5, device="cuda")

trainer = DCBTrainer(agent, bandit)
trainer.train()

Refer to the BootstrapNeuralAgent and DCBTrainer docs for more details.

Parameter Noise Sampling

For an introduction to the Contextual Bandit problem, refer to Contextual Bandits Overview.

One of the ways to improve exploration of our algorithms is to introduce noise into the weights of the neural network while selecting actions. This does not affect the gradients but will have a similar effect to epsilon greedy exploration.

The noise distribution is regularly updated during training to keep the KL divergence of the prediction and noise predictions within certain limits.

An example of using a noise sampling based agent in genrl with noise standard deviation as 0.1, KL divergence limit as 0.1 and batch size for updating the noise distribution as 128 -

from genrl.bandit import NeuralNoiseSamplingAgent, DCBTrainer

agent = NeuralNoiseSamplingAgent(bandit, hidden_dims=[128], noise_std_dev=0.1, eps=0.1, noise_update_batch_size=128, device="cuda")

trainer = DCBTrainer(agent, bandit)
trainer.train()

Refer to the NeuralNoiseSamplingAgent, and DCBTrainer docs for more details.

Adding a new Data Bandit

The bandit submodule, like all of genrl, has been designed to be easily extensible for custom additions. This tutorial will show how to create a dataset based bandit which will work with the rest of genrl.bandit.

For this tutorial, we will use the Wine dataset, a simple dataset often used for testing classifiers. It has 178 examples, each with 14 features: the first gives the cultivar of the wine (the class we need to predict, which can be one of three) and the rest give the properties of the wine itself. Formulated as a bandit problem, we have a bandit with 3 arms and a 13-dimensional context. The agent will get a reward of 1 if it correctly selects the arm, else 0.

To start off with, let’s import the necessary modules, specify the data URL and make a class which inherits from genrl.utils.data_bandits.base.DataBasedBandit

from typing import Tuple

import pandas as pd
import torch

from genrl.utils.data_bandits.base import DataBasedBandit
from genrl.utils.data_bandits.utils import download_data


URL = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"

class WineDataBandit(DataBasedBandit):
    def __init__(self, **kwargs):
        ...

    def reset(self) -> torch.Tensor:
        ...

    def _compute_reward(self, action: int) -> Tuple[int, int]:
        ...

    def _get_context(self) -> torch.Tensor:
        ...

We will need to implement __init__, reset, _compute_reward and _get_context to make the class functional.

For dataset based bandits, we can generally load the data into memory during initialisation. This can be in some tabular form (numpy.array, torch.Tensor or pandas.DataFrame), alongside an index into it. On reset, the bandit sets its index to 0 and reshuffles the rows of the table. For stepping, the bandit computes rewards from the current row of the table as given by the index and then increments the index to move to the next row.

Let’s start with __init__. Here we need to download the data if specified and load it into memory. Many utility functions are available in genrl.utils.data_bandits.utils, including download_data to download data from a URL, as well as functions to fetch data from memory.

For most cases, you can load the data into a pandas.DataFrame. You also need to specify the n_actions, context_dim and len here.

def __init__(self, **kwargs):
    super(WineDataBandit, self).__init__(**kwargs)

    path = kwargs.get("path", "./data/Wine/")
    download = kwargs.get("download", None)
    force_download = kwargs.get("force_download", None)
    url = kwargs.get("url", URL)

    if download:
        path = download_data(path, url, force_download)

    self._df = pd.read_csv(path, header=None)
    self.n_actions = len(self._df[0].unique())
    self.context_dim = self._df.shape[1] - 1
    self.len = len(self._df)

The reset method shuffles the rows of the data and sets the counting index back to 0. You must call _reset here (it is implemented in the base class) to reset any metrics, counters etc.

def reset(self) -> torch.Tensor:
    self._reset()
    self._df = self._df.sample(frac=1).reset_index(drop=True)
    return self._get_context()

The new bandit does not explicitly need to implement the step method since this is already implemented in the base class. We do, however, need to implement _compute_reward and _get_context, which step uses.

In _compute_reward, we need to figure out whether the given action corresponds to the correct label for the current index and return the reward appropriately. This method also returns the maximum possible reward in the current context, which is used to compute regret.

def _compute_reward(self, action: int) -> Tuple[int, int]:
    label = self._df.iloc[self.idx, 0]
    r = int(label == (action + 1))
    return r, 1

The _get_context method should return a 13-dimensional torch.Tensor (in this case) corresponding to the context for the current index.

def _get_context(self) -> torch.Tensor:
    return torch.tensor(
        self._df.iloc[self.idx, 1:].values,
        device=self.device,
        dtype=torch.float,
    )

Once you are done with the above, you can use the WineDataBandit class like you would any other bandit from genrl.utils.data_bandits. You can use it with any of the cb_agents, as well as train on it with genrl.bandit.DCBTrainer.
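For example, following the same pattern as the earlier data-based bandit example (the choice of agent here is purely illustrative):

from genrl.bandit import DCBTrainer, NeuralLinearPosteriorAgent

bandit = WineDataBandit(download=True)
agent = NeuralLinearPosteriorAgent(bandit)

trainer = DCBTrainer(agent, bandit)
trainer.train(timesteps=5000, batch_size=32)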

Adding a new Deep Contextual Bandit Agent

The bandit submodule, like all of genrl, has been designed to be easily extensible for custom additions. This tutorial will show how to create a deep contextual bandit agent which will work with the rest of genrl.bandit.

For the purpose of this tutorial we will consider a simple neural network based agent. Although this is a simplistic agent, implementing an agent of any level of complexity will involve the same steps.

To start off with, let’s import the necessary modules and make a class which inherits from genrl.agents.bandits.contextual.base.DCBAgent

from typing import Optional

import torch

from genrl.agents.bandits.contextual.base import DCBAgent
from genrl.agents.bandits.contextual.common import NeuralBanditModel, TransitionDB
from genrl.utils.data_bandits.base import DataBasedBandit

class NeuralAgent(DCBAgent):
    """Deep contextual bandit agent based on a neural network."""

    def __init__(self, bandit: DataBasedBandit, **kwargs):
        ...

    def select_action(self, context: torch.Tensor) -> int:
        ...

    def update_db(self, context: torch.Tensor, action: int, reward: int):
        ...

    def update_params(
        self,
        action: Optional[int] = None,
        batch_size: int = 512,
        train_epochs: int = 20,
    ):
        ...

We will need to implement __init__, select_action, update_db and update_params to make the class functional.

Let’s start off with __init__. Here we will need to initialise some required parameters (init_pulls, eval_with_dropout, t and update_count) along with our transition database and the neural network. For the neural network, you can use the NeuralBanditModel class. It packages together many of the functionalities a neural network agent requires. Refer to the docs for more details.

def __init__(self, bandit: DataBasedBandit, **kwargs):
    super(NeuralAgent, self).__init__(bandit, **kwargs)
    self.model = (
        NeuralBanditModel(
            context_dim=self.context_dim,
            n_actions=self.n_actions,
            **kwargs
        )
        .to(torch.float)
        .to(self.device)
    )
    self.eval_with_dropout = kwargs.get("eval_with_dropout", False)
    self.db = TransitionDB(self.device)
    self.t = 0
    self.update_count = 0

For the select_action function, the agent passes the context vector through the neural network to produce logits for each action, then selects the action with the highest logit value. Note that it must also increment the timestep and take every action at least init_pulls times initially.

def select_action(self, context: torch.Tensor) -> int:
    """Selects action for a given context"""
    self.model.use_dropout = self.eval_with_dropout
    self.t += 1
    if self.t < self.n_actions * self.init_pulls:
        return torch.tensor(
            self.t % self.n_actions, device=self.device, dtype=torch.int
        )

    results = self.model(context)
    action = torch.argmax(results["pred_rewards"]).to(torch.int)
    return action

For updating the database we can use the add method of the TransitionDB class.

def update_db(self, context: torch.Tensor, action: int, reward: int):
    """Updates transition database."""
    self.db.add(context, action, reward)

In update_params we need to train the model on the observations seen so far. Since the NeuralBanditModel class already has a train_model function, we just need to call that. However, if you are writing your own model, this is where the updates to the parameters would happen.

def update_params(
    self,
    action: Optional[int] = None,
    batch_size: int = 512,
    train_epochs: int = 20,
):
    """Update parameters of the agent."""
    self.update_count += 1
    self.model.train_model(self.db, train_epochs, batch_size)

Note that some of these functions have unused arguments. The signatures have been chosen this way to ensure generality over all classes of algorithms.

Once you are done with the above, you can use the NeuralAgent class like you would any other agent from genrl.bandit. You can use it with any of the bandits as well as training it with genrl.bandit.DCBTrainer.
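For instance, training it on the Statlog bandit from the earlier tutorial could look like the sketch below; any model keyword arguments (e.g. hidden_dims) are simply forwarded to NeuralBanditModel.

from genrl.bandit import DCBTrainer, StatlogDataBandit

bandit = StatlogDataBandit(download=True)
agent = NeuralAgent(bandit, hidden_dims=[128])   # kwargs are passed on to NeuralBanditModel

trainer = DCBTrainer(agent, bandit)
trainer.train(timesteps=5000, batch_size=32)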

Classical

Q-Learning using GenRL

What is Q-Learning?

Q-Learning is one of the stepping stones for many reinforcement learning algorithms, most famously DQN, which achieved human-level performance on Atari games with Q-Learning at its heart.

Essentially, an RL agent takes an action in the environment, collects a reward and updates its policy, and over time gets better at collecting higher rewards.

In Q-Learning, we generally maintain a “Q-table” of Q-values by mapping them to a (state, action) pair.

A natural question is: what are these Q-values? A Q-value is nothing but the “Quality” of an action taken from a particular state. The higher the Q-value, the better the expected reward from taking that action.

The Q-Table is often initialized with random values or with zeros, and as the agent collects rewards by performing actions in the environment, we update this Q-Table at the \(i\)-th step using the following formulation -

\[Q_{i}(s, a) = (1 - \alpha)\, Q_{i-1}(s, a) + \alpha \left( r + \gamma \max_{a'} Q_{i-1}(s', a') \right)\]

Here \(\alpha\) is the learning rate in ML terms, \(\gamma\) is the discount factor for the rewards, \(r\) is the reward received for the transition, and \(s'\) is the state reached after taking action \(a\) from state \(s\).
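The update itself is a one-liner on a Q-table. A minimal NumPy illustration (independent of genrl) for a single transition (s, a, r, s'):

import numpy as np

n_states, n_actions = 16, 4           # e.g. FrozenLake's 4x4 grid
Q = np.zeros((n_states, n_actions))   # the Q-table
alpha, gamma = 0.1, 0.6

def q_update(s, a, r, s_next):
    # Q_i(s, a) = (1 - alpha) * Q_{i-1}(s, a) + alpha * (r + gamma * max_a' Q_{i-1}(s', a'))
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * Q[s_next].max())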

FrozenLake-v0 environment

So to demonstrate how easy it is to train a Q-Learning approach in GenRL, we are taking a very simple gym environment.

Description of the environment (from the documentation) -

“The agent controls the movement of a character in a grid world. Some tiles of the grid are walkable, and others lead to the agent falling into the water. Additionally, the movement direction of the agent is uncertain and only partially depends on the chosen direction. The agent is rewarded for finding a walkable path to a goal tile.

Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you’ll fall into the freezing water. At this time, there’s an international frisbee shortage, so it’s absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won’t always move in the direction you intend.

The surface is described using a grid like the following:

SFFF       (S: starting point, safe)
FHFH       (F: frozen surface, safe)
FFFH       (H: hole, fall to your doom)
HFFG       (G: goal, where the frisbee is located)

The episode ends when you reach the goal or fall in a hole. You receive a reward of 1 if you reach the goal, and zero otherwise.”

Code

Let’s import all the useful stuff first.

import gym
from genrl import QLearning                             # for the agent
from genrl.classical.common import Trainer              # for training the agent

Now that we have imported all the necessary stuff, let’s go ahead and define the environment, the agent and an object of the Trainer class.

env = gym.make("FrozenLake-v0")
agent = QLearning(env, gamma=0.6, lr=0.1, epsilon=0.1)
trainer = Trainer(
    agent,
    env,
    model="tabular",
    n_episodes=3000,
    start_steps=100,
    evaluate_frequency=100,
)

Great, so far so good! Training is now just a matter of calling the train method of the trainer.

trainer.train()
trainer.evaluate()

That’s it! You have successfully trained a Q-Learning agent. You can now go ahead and play with your own environments using GenRL!

SARSA using GenRL

What is SARSA?

SARSA is an acronym for State-Action-Reward-State-Action. It is an on-policy TD control method. Our aim is to estimate the Q-value (utility value) of each state-action pair using the TD update rule given below.

\[Q(S_{t}, A_{t}) \leftarrow Q(S_{t}, A_{t}) + \alpha \left[ R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_{t}, A_{t}) \right]\]
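The difference from Q-Learning is that the TD target uses the action actually taken in the next state rather than the greedy one. A minimal NumPy illustration (independent of genrl):

import numpy as np

n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.6

def sarsa_update(s, a, r, s_next, a_next):
    # Q(S_t, A_t) += alpha * [R_{t+1} + gamma * Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
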
FrozenLake-v0 environment

So to demonstrate how easy it is to train a SARSA approach in GenRL, we are taking a very simple gym environment.

Description of the environment (from the documentation) -

“The agent controls the movement of a character in a grid world. Some tiles of the grid are walkable, and others lead to the agent falling into the water. Additionally, the movement direction of the agent is uncertain and only partially depends on the chosen direction. The agent is rewarded for finding a walkable path to a goal tile.

Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you’ll fall into the freezing water. At this time, there’s an international frisbee shortage, so it’s absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won’t always move in the direction you intend.

The surface is described using a grid like the following:

SFFF       (S: starting point, safe)
FHFH       (F: frozen surface, safe)
FFFH       (H: hole, fall to your doom)
HFFG       (G: goal, where the frisbee is located)

The episode ends when you reach the goal or fall in a hole. You receive a reward of 1 if you reach the goal, and zero otherwise.”

Code

Let’s import all the useful stuff first.

import gym
from genrl import SARSA                                 # for the agent
from genrl.classical.common import Trainer              # for training the agent

Now that we have imported all the necessary stuff, let’s go ahead and define the environment, the agent and an object of the Trainer class.

env = gym.make("FrozenLake-v0")
agent = SARSA(env, gamma=0.6, lr=0.1, epsilon=0.1)
trainer = Trainer(
    agent,
    env,
    model="tabular",
    n_episodes=3000,
    start_steps=100,
    evaluate_frequency=100,
)

Great, so far so good! Training is now just a matter of calling the train method of the trainer.

trainer.train()
trainer.evaluate()

That’s it! You have successfully trained a SARSA agent. You can now go ahead and play with your own environments using GenRL!

Deep RL Tutorials

Deep Reinforcement Learning Background

Background

The goal of Reinforcement Learning algorithms is to maximize reward. This is usually achieved by learning a policy \(\pi_{\theta}\) that performs optimal behavior. Let’s denote this optimal policy by \(\pi_{\theta}^{*}\). For ease, we define the Reinforcement Learning problem as a Markov Decision Process.

Markov Decision Process

A Markov Decision Process (MDP) is defined by \((S, A, r, P_{a})\) where,

  • \(S\) is a set of States.
  • \(A\) is a set of Actions.
  • \(r : S \rightarrow \mathbb{R}\) is a reward function.
  • \(P_{a}(s, s')\) is the transition probability that action \(a\) in state \(s\) leads to state \(s'\).

Often we define two functions, a policy function \(\pi_{\theta}(s,a)\) and a value function \(V_{\pi_{\theta}}(s)\).

Policy Function

The policy is the agent’s strategy, and our goal is to make it optimal. The optimal policy is usually denoted by \(\pi_{\theta}^{*}\). There are usually two types of policies:

Stochastic Policy

The Policy Function defines a probability distribution over actions given states, i.e. the likelihood of every action when the agent is in a particular state. Formally,

\[\pi : S \times A \rightarrow [0,1]\]
\[a \sim \pi(a|s)\]
Deterministic Policy

The Policy Function maps from States directly to Actions.

\[\pi : S \rightarrow A\]
\[a = \pi(s)\]
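As a quick PyTorch illustration of the two (not tied to GenRL’s internals), a stochastic policy outputs a distribution to sample from, while a deterministic one outputs the action directly:

import torch
import torch.nn as nn
from torch.distributions import Categorical

state = torch.randn(4)                    # e.g. a CartPole observation

# Stochastic policy: logits -> distribution over actions -> sample
stochastic_pi = nn.Linear(4, 2)
dist = Categorical(logits=stochastic_pi(state))
action = dist.sample()                    # a ~ pi(a|s)

# Deterministic policy: state -> action (continuous-action example)
deterministic_pi = nn.Sequential(nn.Linear(4, 1), nn.Tanh())
action_det = deterministic_pi(state)      # a = pi(s)
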
Value Function

The Value Function is defined as the expected return obtained when we follow a policy \(\pi\) starting from state \(s\). Usually two types of value functions are defined: the State Value Function and the State Action Value Function.

State Value Function

The State Value Function is defined as the expected return starting from state \(s\) only.

\[V^{\pi}(s) = E_{\pi}\left[ R_{t} \mid s_t = s \right]\]
State Action Value Function

The Action Value Function is defined as the expected return starting from a state \(s\) and taking an action \(a\).

\[Q^{\pi}(s,a) = E_{\pi}\left[ R_{t} \mid s_t = s, a_t = a \right]\]

The Action Value Function is also known as the Quality Function as it would denote how good a particular action is for a state s.

Approximators

Neural Networks are often used as approximators for Policy and Value Functions. In such a case, we say these are parameterised by \(\theta\). For e.g. \(\pi_{\theta}\).

Objective

The objective is to choose/learn a policy that will maximize a cumulative function of rewards received at each step, typically the discounted reward over a potentially infinite horizon. We formulate this cumulative function as

\[E\left[{\sum_{t=0}^{\infty}{\gamma^{t} r_{t}}}\right]\]

where we choose an action according to our policy, \(a_{t} = \pi_{\theta}(s_{t})\).

Vanilla Policy Gradient

For background on Deep RL, its core definitions and problem formulations refer to Deep RL Background

Objective

The objective is to choose/learn a policy that will maximize a cumulative function of rewards received at each step, typically the discounted reward over a potentially infinite horizon. We formulate this cumulative function as

\[E\left[{\sum_{t=0}^{\infty}{\gamma^{t} r_{t}}}\right]\]

where we choose the action \(a_{t} = \pi_{\theta}(s_{t})\).

Algorithm Details
Collect Experience

To make our agent learn, we first need to collect some experience in an online fashion. For this we make use of the collect_rollouts method. This method is defined in the OnPolicyAgent Base Class.

For the update, we need to compute advantages from this experience, so we store our experience in a Rollout Buffer.

Action Selection

Note: We sample a stochastic action from the distribution on the action space by providing False as an argument to select_action.

For practical purposes we would assume that we are working with a finite horizon MDP.

Update Equations

Let \(\pi_{\theta}\) denote a policy with parameters \(\theta\), and \(J(\pi_{\theta})\) denote the expected finite-horizon undiscounted return of the policy.

At each update timestep, we get the values and log probabilities of the actions taken under the current policy.

Now, that we have the log probabilities we calculate the gradient of \(J(\pi_{\theta})\) as:

\[\nabla_{\theta} J(\pi_{\theta}) = E_{\tau \sim \pi_{\theta}}\left[{ \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t|s_t) }\right],\]

where \(\tau\) is the trajectory.

We then update the policy parameters via stochastic gradient ascent:

\[\theta_{k+1} = \theta_k + \alpha \nabla_{\theta} J(\pi_{\theta_k})\]

The key idea underlying vanilla policy gradients is to push up the probabilities of actions that lead to higher return, and push down the probabilities of actions that lead to lower return, until you arrive at the optimal policy.
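In practice, an automatic differentiation framework computes this gradient for you by minimising the negative of the surrogate objective. A schematic PyTorch snippet (not GenRL’s exact implementation):

import torch

def vpg_loss(log_probs: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    # log_probs: log pi_theta(a_t | s_t) for the collected rollout, shape (T,)
    # returns:   the corresponding returns, shape (T,)
    # Minimising this loss is gradient ascent on J(pi_theta)
    return -(log_probs * returns).mean()

# optimizer.zero_grad(); vpg_loss(log_probs, returns).backward(); optimizer.step()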

Training through the API
import gym

from genrl import VPG
from genrl.deep.common import OnPolicyTrainer
from genrl.environments import VectorEnv

env = VectorEnv("CartPole-v0")
agent = VPG('mlp', env)
trainer = OnPolicyTrainer(agent, env, log_mode=['stdout'])
trainer.train()
timestep         Episode          loss             mean_reward
0                0                9.1853           22.3825
20480            10               24.5517          80.3137
40960            20               24.4992          117.7011
61440            30               22.578           121.543
81920            40               20.423           114.7339
102400           50               21.7225          128.4013
122880           60               21.0566          116.034
143360           70               21.628           115.0562
163840           80               23.1384          133.4202
184320           90               23.2824          133.4202
204800           100              26.3477          147.87
225280           110              26.7198          139.7952
245760           120              30.0402          184.5045
266240           130              30.293           178.8646
286720           140              29.4063          162.5397
307200           150              30.9759          183.6771
327680           160              30.6517          186.1818
348160           170              31.7742          184.5045
368640           180              30.4608          186.1818
389120           190              30.2635          186.1818

Advantage Actor Critic

For background on Deep RL, its core definitions and problem formulations refer to Deep RL Background

Objective

The objective is to maximize the discounted cumulative reward function:

\[E\left[{\sum_{t=0}^{\infty}{\gamma^{t} r_{t}}}\right]\]

This comprises two parts in the Advantage Actor Critic algorithm:

  1. To choose/learn a policy that will increase the probability of taking an action that has a higher expected return than the value of the state, and decrease the probability of taking an action that has a lower expected return than the value of the state. The Advantage is computed as:
\[A(s,a) = Q(s,a) - V(s)\]
  2. To learn a State Action Value Function (the Critic) that estimates the future cumulative rewards given the current state and action. This function helps the policy evaluate potential state-action pairs.

where we choose the action \(a_{t} = \pi_{\theta}(s_{t})\).

Algorithm Details
Action Selection and Values

ac here is an object of the ActorCritic class, which defines two methods: get_value and get_action; these return the value approximation from the Critic and the action from the Actor respectively.

Note: We sample a stochastic action from the distribution on the action space by providing False as an argument to select_action.

For practical purposes we would assume that we are working with a finite horizon MDP.

Collect Experience

To make our agent learn, we first need to collect some experience in an online fashion. For this we make use of the collect_rollouts method. This method is defined in the OnPolicyAgent Base Class.

For the update, we need to compute advantages from this experience, so we store our experience in a Rollout Buffer.

Compute discounted Returns and Advantages

Next we can compute the advantages and the actual discounted returns for each state. This can be done very easily by simply calling compute_returns_and_advantage. Note this implementation of the rollout buffer is borrowed from Stable Baselines.

Update Equations

Let \(\pi_{\theta}\) denote a policy with parameters \(\theta\), and \(J(\pi_{\theta})\) denote the expected finite-horizon undiscounted return of the policy.

At each update timestep, we get the values and log probabilities of the actions taken under the current policy.

Now, that we have the log probabilities we calculate the gradient of \(J(\pi_{\theta})\) as:

\[\nabla_{\theta} J(\pi_{\theta}) = E_{\tau \sim \pi_{\theta}}\left[{ \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t|s_t) A^{\pi_{\theta}}(s_t,a_t) }\right],\]

where \(\tau\) is the trajectory.

We then update the policy parameters via stochastic gradient ascent:

\[\theta_{k+1} = \theta_k + \alpha \nabla_{\theta} J(\pi_{\theta_k})\]

The key idea underlying Advantage Actor Critic Algorithm is to push up the probabilities of actions that lead to higher return than the expected return of that state, and push down the probabilities of actions that lead to lower return than the expected return of that state, until you arrive at the optimal policy.

Training through the API
import gym

from genrl import A2C
from genrl.deep.common import OnPolicyTrainer
from genrl.environments import VectorEnv

env = VectorEnv("CartPole-v0")
agent = A2C('mlp', env)
trainer = OnPolicyTrainer(agent, env, log_mode=['stdout'])
trainer.train()

Proximal Policy Optimization

For background on Deep RL, its core definitions and problem formulations refer to Deep RL Background

Objective

The objective is to maximize the discounted cumulative reward function:

\[E\left[{\sum_{t=0}^{\infty}{\gamma^{t} r_{t}}}\right]\]

The Proximal Policy Optimization algorithm is very similar to the Advantage Actor Critic algorithm, except that we multiply the advantages by the ratio of the probability of the action under the current policy to its probability under the old policy (the one used at experience collection time). This helps establish a trust region so that the new policy does not move too far away from the old policy, while still taking gradient ascent steps in the direction of actions which result in positive advantages.

where we choose the action \(a_{t} = \pi_{\theta}(s_{t})\).

Algorithm Details
Action Selection and Values

ac here is an object of the ActorCritic class, which defines two methods: get_value and get_action; these return the value approximation from the Critic and the action from the Actor respectively.

Note: We sample a stochastic action from the distribution on the action space by providing False as an argument to select_action.

For practical purposes we would assume that we are working with a finite horizon MDP.

Collect Experience

To make our agent learn, we first need to collect some experience in an online fashion. For this we make use of the collect_rollouts method. This method is defined in the OnPolicyAgent Base Class.

For the update, we need to compute advantages from this experience, so we store our experience in a Rollout Buffer.

Compute discounted Returns and Advantages

Next we can compute the advantages and the actual discounted returns for each state. This can be done very easily by simply calling compute_returns_and_advantage. Note this implementation of the rollout buffer is borrowed from Stable Baselines.

Update Equations

Let \(\pi_{\theta}\) denote a policy with parameters \(\theta\), and \(J(\pi_{\theta})\) denote the expected finite-horizon undiscounted return of the policy.

At each update timestep, we get the values and log probabilities of the actions taken under the current policy.

In the case of PPO our loss function is:

\[L(s,a,\theta_k,\theta) = \min\left( \frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)} A^{\pi_{\theta_k}}(s,a), \;\; \text{clip}\left(\frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)}, 1 - \epsilon, 1+\epsilon \right) A^{\pi_{\theta_k}}(s,a) \right),\]

where \(\epsilon\) is a small clipping hyperparameter which determines how far the new policy is allowed to move from the old one.

We then update the policy parameters via stochastic gradient ascent:

\[\theta_{k+1} = \theta_k + \alpha \nabla_{\theta} J(\pi_{\theta_k})\]
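Schematically, the clipped surrogate loss can be written in a few lines of PyTorch (not GenRL’s exact implementation):

import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    # ratio = pi_theta(a|s) / pi_theta_k(a|s), computed from log probabilities
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # minimising the negative of min(...) performs the clipped policy update
    return -torch.min(unclipped, clipped).mean()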
Training through the API
import gym

from genrl import PPO1
from genrl.deep.common import OnPolicyTrainer
from genrl.environments import VectorEnv

env = VectorEnv("CartPole-v0")
agent = PPO1('mlp', env)
trainer = OnPolicyTrainer(agent, env, log_mode=['stdout'])
trainer.train()

Custom Policy Networks

GenRL provides default policies for images (CNNPolicy) and for other types of inputs (MlpPolicy). Sometimes, these default policies may be insufficient for your problem, or you may want more control over the policy definition, and hence require a custom policy.

The following code tutorial runs through the steps to use a custom policy depending on your problem.

Import the required libraries (e.g. torch, torch.nn) and, from GenRL, the algorithm (e.g. VPG), the trainer (e.g. OnPolicyTrainer) and the policy to be modified (e.g. MlpPolicy).

# The necessary imports
import torch
import torch.nn as nn

from genrl import VPG
from genrl.core.policies import MlpPolicy
from genrl.environments import VectorEnv
from genrl.trainers import OnPolicyTrainer

Then define a custom_policy class that derives from the policy to be modified (in this case, the MlpPolicy)

# Define a custom MLP Policy
class custom_policy(MlpPolicy):
    def __init__(self, state_dim, action_dim, hidden, **kwargs):
        super().__init__(state_dim, action_dim, hidden)
        self.action_dim = action_dim
        self.state_dim = state_dim

The above class modifies the MlpPolicy to have the desired number of hidden layers in the MLP neural network that parametrizes the policy. This is done by passing the variable hidden explicitly (default: hidden = (32, 32)). The state_dim and action_dim variables stand for the dimensions of the state space and the action space, and are required to construct the neural network with the proper input and output shapes for your policy, given the environment.

In some cases, you may also want to redefine the policy completely rather than just customize an existing policy. This can be done by creating a new custom policy class that inherits from the BasePolicy class. The BasePolicy class is a basic implementation of a general policy, with a forward and a get_action method. The forward method maps the input state to action probabilities, and the get_action method selects an action from the given action probabilities (for both continuous and discrete action spaces).

Say you want to parametrize your policy by a neural network containing LSTM layers followed by MLP layers. This can be done as follows:

# Define a custom LSTM policy from the BasePolicy class
# (BasePolicy and the mlp helper are part of GenRL and also need to be imported;
#  their exact module paths may differ between versions)
class custom_policy(BasePolicy):
    def __init__(self, state_dim, action_dim, hidden,
                 discrete=True, layer_size=512, layers=1, **kwargs):
        super(custom_policy, self).__init__(state_dim,
                                            action_dim,
                                            hidden,
                                            discrete,
                                            **kwargs)
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.layer_size = layer_size
        self.lstm = nn.LSTM(self.state_dim, layer_size, layers)
        self.fc = mlp([layer_size] + list(hidden) + [action_dim],
                      sac=self.sac)  # the mlp layers

    def forward(self, state):
        state, h = self.lstm(state.unsqueeze(0))
        state = state.view(-1, self.layer_size)
        action = self.fc(state)
        return action

Finally, it’s time to train the custom policy. Define the environment to be trained on (CartPole-v0 in this case), and the state_dim and action_dim variables.

# Initialize an environment
env = VectorEnv("CartPole-v0", 1)

# Initialize the custom Policy
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
policy = custom_policy(state_dim=state_dim, action_dim=action_dim,
                       hidden=(64, 64))

Then the algorithm is initialised with the custom policy, and the OnPolicyTrainer trains it with logging enabled for easier monitoring.

algo = VPG(policy, env)

# Initialize the trainer and start training 
trainer = OnPolicyTrainer(algo, env, log_mode=["csv"],
                          logdir="./logs", epochs=100)
trainer.train()

Using A2C

Using A2C on “CartPole-v0”

import gym

from genrl import A2C
from genrl.deep.common import OnPolicyTrainer
from genrl.environments import VectorEnv

env = VectorEnv("CartPole-v0")
agent = A2C('mlp', env, gamma=0.9, lr_policy=0.01, lr_value=0.1, policy_layers=(32, 32), value_layers=(32, 32), rollout_size=2048)
trainer = OnPolicyTrainer(agent, env, log_mode=['stdout', 'tensorboard'], log_key="Episode")
trainer.train()

Using A2C on the Atari environment “Pong-v0”

env = VectorEnv("Pong-v0", env_type="atari")
agent = A2C('cnn', env, gamma=0.99, lr_policy=0.01, lr_value=0.1, policy_layers=(32, 32), value_layers=(32, 32), rollout_size=2048)
trainer = OnPolicyTrainer(agent, env, log_mode=['stdout', 'tensorboard'], log_key="timestep")
trainer.train()

More details can be found in the docs for A2C and OnPolicyTrainer.

Vanilla Policy Gradient (VPG)

If you wanted to explore Policy Gradient algorithms in RL, there is a high chance you would’ve heard of PPO, DDPG, etc. but understanding them can be tricky if you’re just starting.

VPG is arguably one of the easiest to understand policy gradient algorithms while still performing to a good enough level.

Let’s understand policy gradients at a high level. Unlike classical algorithms such as Q-Learning or Monte Carlo methods, where you optimise the outputs of the action-value function of the agent and then use them to determine the optimal policy, in policy gradient methods we go, as one would like to say, directly for the kill shot: we optimise the thing we want to use at the end, i.e. the policy itself.

That explains the “Policy” part of Policy Gradient, so what about “Gradient”? The gradient comes from the fact that we optimise the policy by gradient ascent (unlike the more familiar gradient descent, here we want to increase the objective, hence ascent). So that explains the name, but how does it actually work?

For that, have a look at the following pseudocode (source: OpenAI)

Pseudo Code
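(The pseudocode figure is not reproduced here; as a rough paraphrase of the standard VPG loop it describes:)

# Sketch of the VPG training loop (paraphrased, not GenRL's exact code):
# repeat for each epoch:
#     1. run the current policy in the environment to collect a batch of trajectories
#     2. compute the discounted returns (and advantages) for every timestep
#     3. estimate the policy gradient as the mean of grad(log pi(a|s)) * advantage
#     4. take one gradient-ascent step on the policy parameters
#     5. (optionally) refit a value-function baseline to the observed returns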

For a more fundamental understanding, this Spinning Up article is a good resource.

Now that we have a high-level understanding of how VPG works, let’s jump into the code to see it in action.
The following is a very minimal way to run a VPG agent with GenRL.

VPG agent on a Cartpole Environment

import gym  # OpenAI Gym

from genrl import VPG
from genrl.deep.common import OnPolicyTrainer
from genrl.environments import VectorEnv

env = VectorEnv("CartPole-v1")
agent = VPG('mlp', env)
trainer = OnPolicyTrainer(agent, env, epochs=200)
trainer.train()

This will run a VPG agent which will interact with the CartPole-v1 gym environment.
Let’s understand the output of running this (your individual values may differ):

timestep         Episode          loss             mean_reward
0                0                8.022            19.8835
20480            10               25.969           75.2941
40960            20               29.2478          144.2254
61440            30               25.5711          129.6203
81920            40               19.8718          96.6038
102400           50               19.2585          106.9452
122880           60               17.7781          99.9024
143360           70               23.6839          121.543
163840           80               24.4362          129.2114
184320           90               28.1183          156.3359
204800           100              26.6074          155.1515
225280           110              27.2012          178.8646
245760           120              26.4612          164.498
266240           130              22.8618          148.4058
286720           140              23.465           153.4082
307200           150              21.9764          151.1439
327680           160              22.445           151.1439
348160           170              22.9925          155.7414
368640           180              22.6605          165.1613
389120           190              23.4676          177.316

timestep: The number of environment steps the agent has taken since the start of training
Episode: One complete rollout of the agent; put simply, one full run until the agent wins or loses
loss: The loss computed for that episode
mean_reward: The mean reward accumulated in that episode

Now, if you look closely, the agent will not converge to the maximum reward even if you increase the number of epochs to, say, 5000. This is because during training the agent behaves according to a stochastic policy: when picking an action for a given state, it does not simply take the one with the maximum expected return, but samples an action from a probability distribution. In other words, the policy isn’t a lookup table; it is a function that outputs a probability distribution over actions, from which we sample when choosing an action.
So even if the agent has figured out the optimal policy, it does not take the most optimal action at every step; there is an inherent stochasticity to it.
If we want the agent to make full use of the learnt policy, we can add the following line of code after the training:

trainer.evaluate(render=True)

This will not only make the agent follow a deterministic policy, and thus achieve the maximum reward attainable from the learnt policy, but also let you watch the agent perform, since we pass render=True.
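To make the stochastic/deterministic distinction concrete, here is a tiny illustrative snippet (the probabilities are made up):

import torch
from torch.distributions import Categorical

probs = torch.tensor([0.1, 0.7, 0.2])            # action probabilities from a policy
stochastic_action = Categorical(probs).sample()  # training: sampled, may not be the most likely action
deterministic_action = torch.argmax(probs)       # evaluation: always the most likely action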

For more information on the VPG implementation and the various hyperparameters available, have a look at the official GenRL docs here.

Some more implementations

VPG agent on an Atari Environment

env = VectorEnv("Pong-v0", env_type = "atari")
agent = VPG('cnn', env)
trainer = OnPolicyTrainer(agent, env, epochs=200)
trainer.train()

Agents

A2C

genrl.agents.deep.a2c.a2c module

class genrl.agents.deep.a2c.a2c.A2C(*args, noise: Any = None, noise_std: float = 0.1, value_coeff: float = 0.5, entropy_coeff: float = 0.01, **kwargs)[source]

Bases: genrl.agents.deep.base.onpolicy.OnPolicyAgent

Advantage Actor Critic algorithm (A2C)

The synchronous version of A3C Paper: https://arxiv.org/abs/1602.01783

network

The network type of the Q-value function. Supported types: [“cnn”, “mlp”]

Type:str
env

The environment that the agent is supposed to act on

Type:Environment
create_model

Whether the model of the algo should be created when initialised

Type:bool
batch_size

Mini batch size for loading experiences

Type:int
gamma

The discount factor for rewards

Type:float
layers

Layers in the Neural Network of the Q-value function

Type:tuple of int
lr_policy

Learning rate for the policy/actor

Type:float
lr_value

Learning rate for the critic

Type:float
rollout_size

Capacity of the Rollout Buffer

Type:int
buffer_type

Choose the type of Buffer: [“rollout”]

Type:str
noise

Action Noise function added to aid in exploration

Type:ActionNoise
noise_std

Standard deviation of the action noise distribution

Type:float
value_coeff

Ratio of magnitude of value updates to policy updates

Type:float
entropy_coeff

Ratio of magnitude of entropy updates to policy updates

Type:float
seed

Seed for randomness

Type:int
render

Should the env be rendered during training?

Type:bool
device

Hardware being used for training. Options: [“cuda” -> GPU, “cpu” -> CPU]

Type:str
empty_logs()[source]

Empties logs

evaluate_actions(states: torch.Tensor, actions: torch.Tensor)[source]

Evaluates actions taken by actor

Actions taken by actor and their respective states are analysed to get log probabilities and values from critics

Parameters:
  • states (torch.Tensor) – States encountered in rollout
  • actions (torch.Tensor) – Actions taken in response to respective states
Returns:

Values of states encountered during the rollout log_probs (torch.Tensor): Log of action probabilities given a state

Return type:

values (torch.Tensor)

get_hyperparams() → Dict[str, Any][source]

Get relevant hyperparameters to save

Returns:Hyperparameters to be saved
Return type:hyperparams (dict)
get_logging_params() → Dict[str, Any][source]

Gets relevant parameters for logging

Returns:Logging parameters for monitoring training
Return type:logs (dict)
get_traj_loss(values: torch.Tensor, dones: torch.Tensor) → None[source]

Get loss from trajectory traversed by agent during rollouts

Computes the returns and advantages needed for calculating loss

Parameters:
  • values (torch.Tensor) – Values of states encountered during the rollout
  • dones (list of bool) – Game over statuses of each environment
load_weights(weights) → None[source]

Load weights for the agent from pretrained model

Parameters:weights (dict) – Dictionary of different neural net weights
select_action(state: numpy.ndarray, deterministic: bool = False) → numpy.ndarray[source]

Select action given state

Action Selection for On Policy Agents with Actor Critic

Parameters:
  • state (np.ndarray) – Current state of the environment
  • deterministic (bool) – Should the policy be deterministic or stochastic
Returns:

Action taken by the agent value (torch.Tensor): Value of given state log_prob (torch.Tensor): Log probability of selected action

Return type:

action (np.ndarray)

update_params() → None[source]

Updates the A2C network

Function to update the A2C actor-critic architecture

DDPG

genrl.agents.deep.ddpg.ddpg module

class genrl.agents.deep.ddpg.ddpg.DDPG(*args, noise: genrl.core.noise.ActionNoise = None, noise_std: float = 0.2, **kwargs)[source]

Bases: genrl.agents.deep.base.offpolicy.OffPolicyAgentAC

Deep Deterministic Policy Gradient Algorithm

Paper: https://arxiv.org/abs/1509.02971

network

The network type of the Q-value function. Supported types: [“cnn”, “mlp”]

Type:str
env

The environment that the agent is supposed to act on

Type:Environment
create_model

Whether the model of the algo should be created when initialised

Type:bool
batch_size

Mini batch size for loading experiences

Type:int
gamma

The discount factor for rewards

Type:float
layers

Layers in the Neural Network of the Q-value function

Type:tuple of int
lr_policy

Learning rate for the policy/actor

Type:float
lr_value

Learning rate for the critic

Type:float
replay_size

Capacity of the Replay Buffer

Type:int
buffer_type

Choose the type of Buffer: [“push”, “prioritized”]

Type:str
polyak

Target model update parameter (1 for hard update)

Type:float
noise

Action Noise function added to aid in exploration

Type:ActionNoise
noise_std

Standard deviation of the action noise distribution

Type:float
seed

Seed for randomness

Type:int
render

Should the env be rendered during training?

Type:bool
device

Hardware being used for training. Options: [“cuda” -> GPU, “cpu” -> CPU]

Type:str
empty_logs()[source]

Empties logs

get_hyperparams() → Dict[str, Any][source]

Get relevant hyperparameters to save

Returns:Hyperparameters to be saved
Return type:hyperparams (dict)
get_logging_params() → Dict[str, Any][source]

Gets relevant parameters for logging

Returns:Logging parameters for monitoring training
Return type:logs (dict)
update_params(update_interval: int) → None[source]

Update parameters of the model

Parameters:update_interval (int) – Interval between successive updates of the target model

DQN

genrl.agents.deep.dqn.base module

class genrl.agents.deep.dqn.base.DQN(*args, max_epsilon: float = 1.0, min_epsilon: float = 0.01, epsilon_decay: int = 1000, **kwargs)[source]

Bases: genrl.agents.deep.base.offpolicy.OffPolicyAgent

Base DQN Class

Paper: https://arxiv.org/abs/1312.5602

network

The network type of the Q-value function. Supported types: [“cnn”, “mlp”]

Type:str
env

The environment that the agent is supposed to act on

Type:Environment
create_model

Whether the model of the algo should be created when initialised

Type:bool
batch_size

Mini batch size for loading experiences

Type:int
gamma

The discount factor for rewards

Type:float
value_layers

Layers in the Neural Network of the Q-value function

Type:tuple of int
lr_value

Learning rate for the Q-value function

Type:float
replay_size

Capacity of the Replay Buffer

Type:int
buffer_type

Choose the type of Buffer: [“push”, “prioritized”]

Type:str
max_epsilon

Maximum epsilon for exploration

Type:float
min_epsilon

Minimum epsilon for exploration

Type:float
epsilon_decay

Rate of decay of epsilon (in order to decrease exploration with time)

Type:int
seed

Seed for randomness

Type:int
render

Should the env be rendered during training?

Type:bool
device

Hardware being used for training. Options: [“cuda” -> GPU, “cpu” -> CPU]

Type:str
calculate_epsilon_by_frame() → float[source]

Helper function to calculate epsilon after every timestep

Exponentially decays the exploration rate from max epsilon to min epsilon. The greater the value of epsilon_decay, the slower the decrease in epsilon.
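A common exponential-decay schedule consistent with this description is sketched below (illustrative; GenRL's exact expression may differ):

import math

def epsilon_by_frame(timestep, max_epsilon=1.0, min_epsilon=0.01, epsilon_decay=1000):
    # Decays from max_epsilon towards min_epsilon; a larger epsilon_decay means slower decay.
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-timestep / epsilon_decay)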

empty_logs() → None[source]

Empties logs

get_greedy_action(state: torch.Tensor) → numpy.ndarray[source]

Greedy action selection

Parameters:state (np.ndarray) – Current state of the environment
Returns:Action taken by the agent
Return type:action (np.ndarray)
get_hyperparams() → Dict[str, Any][source]

Get relevant hyperparameters to save

Returns:Hyperparameters to be saved
Return type:hyperparams (dict)
get_logging_params() → Dict[str, Any][source]

Gets relevant parameters for logging

Returns:Logging parameters for monitoring training
Return type:logs (dict)
get_q_values(states: torch.Tensor, actions: torch.Tensor) → torch.Tensor[source]

Get Q values corresponding to specific states and actions

Parameters:
  • states (torch.Tensor) – States for which Q-values need to be found
  • actions (torch.Tensor) – Actions taken at respective states
Returns:

Q values for the given states and actions

Return type:

q_values (torch.Tensor)

get_target_q_values(next_states: torch.Tensor, rewards: List[float], dones: List[bool]) → torch.Tensor[source]

Get target Q values for the DQN

Parameters:
  • next_states (torch.Tensor) – Next states for which target Q-values need to be found
  • rewards (list) – Rewards at each timestep for each environment
  • dones (list) – Game over status for each environment
Returns:

Target Q values for the DQN

Return type:

target_q_values (torch.Tensor)

load_weights(weights) → None[source]

Load weights for the agent from pretrained model

Parameters:weights (Dict) – Dictionary of different neural net weights
select_action(state: numpy.ndarray, deterministic: bool = False) → numpy.ndarray[source]

Select action given state

Epsilon-greedy action-selection

Parameters:
  • state (np.ndarray) – Current state of the environment
  • deterministic (bool) – Should the policy be deterministic or stochastic
Returns:

Action taken by the agent

Return type:

action (np.ndarray)

update_params(update_interval: int) → None[source]

Update parameters of the model

Parameters:update_interval (int) – Interval between successive updates of the target model
update_params_before_select_action(timestep: int) → None[source]

Update necessary parameters before selecting an action

This updates the epsilon (exploration rate) of the agent every timestep

Parameters:timestep (int) – Timestep of training
update_target_model() → None[source]

Function to update the target Q model

Updates the target model with the training model’s weights when called

genrl.agents.deep.dqn.categorical module

class genrl.agents.deep.dqn.categorical.CategoricalDQN(*args, noisy_layers: Tuple = (32, 128), num_atoms: int = 51, v_min: int = -10, v_max: int = 10, **kwargs)[source]

Bases: genrl.agents.deep.dqn.base.DQN

Categorical DQN Algorithm

Paper: https://arxiv.org/pdf/1707.06887.pdf

network

The network type of the Q-value function. Supported types: [“cnn”, “mlp”]

Type:str
env

The environment that the agent is supposed to act on

Type:Environment
create_model

Whether the model of the algo should be created when initialised

Type:bool
batch_size

Mini batch size for loading experiences

Type:int
gamma

The discount factor for rewards

Type:float
layers

Layers in the Neural Network of the Q-value function

Type:tuple of int
lr_value

Learning rate for the Q-value function

Type:float
replay_size

Capacity of the Replay Buffer

Type:int
buffer_type

Choose the type of Buffer: [“push”, “prioritized”]

Type:str
max_epsilon

Maximum epsilon for exploration

Type:float
min_epsilon

Minimum epsilon for exploration

Type:float
epsilon_decay

Rate of decay of epsilon (in order to decrease exploration with time)

Type:int
noisy_layers

Noisy layers in the Neural Network of the Q-value function

Type:tuple of int
num_atoms

Number of atoms used in the discrete distribution

Type:int
v_min

Lower bound of value distribution

Type:int
v_max

Upper bound of value distribution

Type:int
seed

Seed for randomness

Type:int
render

Should the env be rendered during training?

Type:bool
device

Hardware being used for training. Options: [“cuda” -> GPU, “cpu” -> CPU]

Type:str
get_greedy_action(state: torch.Tensor) → numpy.ndarray[source]

Greedy action selection

Parameters:state (np.ndarray) – Current state of the environment
Returns:Action taken by the agent
Return type:action (np.ndarray)
get_q_loss(batch: collections.namedtuple)[source]

Categorical DQN loss function to calculate the loss of the Q-function

Parameters:batch (collections.namedtuple of torch.Tensor) – Batch of experiences
Returns:Calculated loss of the Q-function
Return type:loss (torch.Tensor)
get_q_values(states: torch.Tensor, actions: torch.Tensor)[source]

Get Q values corresponding to specific states and actions

Parameters:
  • states (torch.Tensor) – States for which Q-values need to be found
  • actions (torch.Tensor) – Actions taken at respective states
Returns:

Q values for the given states and actions

Return type:

q_values (torch.Tensor)

get_target_q_values(next_states: numpy.ndarray, rewards: List[float], dones: List[bool])[source]

Projected Distribution of Q-values

Helper function for Categorical/Distributional DQN

Parameters:
  • next_states (torch.Tensor) – Next states being encountered by the agent
  • rewards (torch.Tensor) – Rewards received by the agent
  • dones (torch.Tensor) – Game over status of each environment
Returns:

Projected Q-value Distribution or Target Q Values

Return type:

target_q_values (object)

genrl.agents.deep.dqn.double module

class genrl.agents.deep.dqn.double.DoubleDQN(*args, **kwargs)[source]

Bases: genrl.agents.deep.dqn.base.DQN

Double DQN Class

Paper: https://arxiv.org/abs/1509.06461

network

The network type of the Q-value function. Supported types: [“cnn”, “mlp”]

Type:str
env

The environment that the agent is supposed to act on

Type:Environment
batch_size

Mini batch size for loading experiences

Type:int
gamma

The discount factor for rewards

Type:float
layers

Layers in the Neural Network of the Q-value function

Type:tuple of int
lr_value

Learning rate for the Q-value function

Type:float
replay_size

Capacity of the Replay Buffer

Type:int
buffer_type

Choose the type of Buffer: [“push”, “prioritized”]

Type:str
max_epsilon

Maximum epsilon for exploration

Type:float
min_epsilon

Minimum epsilon for exploration

Type:float
epsilon_decay

Rate of decay of epsilon (in order to decrease exploration with time)

Type:int
seed

Seed for randomness

Type:int
render

Should the env be rendered during training?

Type:bool
device

Hardware being used for training. Options: [“cuda” -> GPU, “cpu” -> CPU]

Type:str
get_target_q_values(next_states: torch.Tensor, rewards: torch.Tensor, dones: torch.Tensor) → torch.Tensor[source]

Get target Q values for the DQN

Parameters:
  • next_states (torch.Tensor) – Next states for which target Q-values need to be found
  • rewards (list) – Rewards at each timestep for each environment
  • dones (list) – Game over status for each environment
Returns:

Target Q values for the DQN

Return type:

target_q_values (torch.Tensor)

genrl.agents.deep.dqn.dueling module

class genrl.agents.deep.dqn.dueling.DuelingDQN(*args, **kwargs)[source]

Bases: genrl.agents.deep.dqn.base.DQN

Dueling DQN class

Paper: https://arxiv.org/abs/1511.06581

network

The network type of the Q-value function. Supported types: [“cnn”, “mlp”]

Type:str
env

The environment that the agent is supposed to act on

Type:Environment
batch_size

Mini batch size for loading experiences

Type:int
gamma

The discount factor for rewards

Type:float
layers

Layers in the Neural Network of the Q-value function

Type:tuple of int
lr_value

Learning rate for the Q-value function

Type:float
replay_size

Capacity of the Replay Buffer

Type:int
buffer_type

Choose the type of Buffer: [“push”, “prioritized”]

Type:str
max_epsilon

Maximum epsilon for exploration

Type:float
min_epsilon

Minimum epsilon for exploration

Type:float
epsilon_decay

Rate of decay of epsilon (in order to decrease exploration with time)

Type:int
seed

Seed for randomness

Type:int
render

Should the env be rendered during training?

Type:bool
device

Hardware being used for training. Options: [“cuda” -> GPU, “cpu” -> CPU]

Type:str

genrl.agents.deep.dqn.noisy module

class genrl.agents.deep.dqn.noisy.NoisyDQN(*args, noisy_layers: Tuple = (128, 128), **kwargs)[source]

Bases: genrl.agents.deep.dqn.base.DQN

Noisy DQN Algorithm

Paper: https://arxiv.org/abs/1706.10295

network

The network type of the Q-value function. Supported types: [“cnn”, “mlp”]

Type:str
env

The environment that the agent is supposed to act on

Type:Environment
batch_size

Mini batch size for loading experiences

Type:int
gamma

The discount factor for rewards

Type:float
layers

Layers in the Neural Network of the Q-value function

Type:tuple of int
lr_value

Learning rate for the Q-value function

Type:float
replay_size

Capacity of the Replay Buffer

Type:int
buffer_type

Choose the type of Buffer: [“push”, “prioritized”]

Type:str
max_epsilon

Maximum epsilon for exploration

Type:float
min_epsilon

Minimum epsilon for exploration

Type:float
epsilon_decay

Rate of decay of epsilon (in order to decrease exploration with time)

Type:int
noisy_layers

Noisy layers in the Neural Network of the Q-value function

Type:tuple of int
seed

Seed for randomness

Type:int
render

Should the env be rendered during training?

Type:bool
device

Hardware being used for training. Options: [“cuda” -> GPU, “cpu” -> CPU]

Type:str

genrl.agents.deep.dqn.prioritized module

class genrl.agents.deep.dqn.prioritized.PrioritizedReplayDQN(*args, alpha: float = 0.6, beta: float = 0.4, **kwargs)[source]

Bases: genrl.agents.deep.dqn.base.DQN

Prioritized Replay DQN Class

Paper: https://arxiv.org/abs/1511.05952

network

The network type of the Q-value function. Supported types: [“cnn”, “mlp”]

Type:str
env

The environment that the agent is supposed to act on

Type:Environment
batch_size

Mini batch size for loading experiences

Type:int
gamma

The discount factor for rewards

Type:float
layers

Layers in the Neural Network of the Q-value function

Type:tuple of int
lr_value

Learning rate for the Q-value function

Type:float
replay_size

Capacity of the Replay Buffer

Type:int
buffer_type

Choose the type of Buffer: [“push”, “prioritized”]

Type:str
max_epsilon

Maximum epsilon for exploration

Type:float
min_epsilon

Minimum epsilon for exploration

Type:float
epsilon_decay

Rate of decay of epsilon (in order to decrease exploration with time)

Type:int
alpha

Prioritization constant

Type:float
beta

Importance Sampling bias

Type:float
seed

Seed for randomness

Type:int
render

Should the env be rendered during training?

Type:bool
device

Hardware being used for training. Options: [“cuda” -> GPU, “cpu” -> CPU]

Type:str
get_q_loss(batch: collections.namedtuple) → torch.Tensor[source]

Normal Function to calculate the loss of the Q-function

Parameters:batch (collections.namedtuple of torch.Tensor) – Batch of experiences
Returns:Calculated loss of the Q-function
Return type:loss (torch.Tensor)
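For reference, the prioritized replay scheme from the paper cited above samples transition \(i\) with probability proportional to its priority and corrects the induced bias with importance-sampling weights, where \(\alpha\) and \(\beta\) are the attributes listed above:

\[P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}, \qquad w_i = \left(\frac{1}{N \cdot P(i)}\right)^{\beta}\]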

genrl.agents.deep.dqn.utils module

genrl.agents.deep.dqn.utils.categorical_greedy_action(agent: genrl.agents.deep.dqn.base.DQN, state: torch.Tensor) → numpy.ndarray[source]

Greedy action selection for Categorical DQN

Parameters:
  • agent (DQN) – The agent
  • state (np.ndarray) – Current state of the environment
Returns:

Action taken by the agent

Return type:

action (np.ndarray)

genrl.agents.deep.dqn.utils.categorical_q_loss(agent: genrl.agents.deep.dqn.base.DQN, batch: collections.namedtuple)[source]

Categorical DQN loss function to calculate the loss of the Q-function

Parameters:
  • agent (DQN) – The agent
  • batch (collections.namedtuple of torch.Tensor) – Batch of experiences
Returns:

Calculated loss of the Q-function

Return type:

loss (torch.Tensor)

genrl.agents.deep.dqn.utils.categorical_q_target(agent: genrl.agents.deep.dqn.base.DQN, next_states: numpy.ndarray, rewards: List[float], dones: List[bool])[source]

Projected Distribution of Q-values

Helper function for Categorical/Distributional DQN

Parameters:
  • agent (DQN) – The agent
  • next_states (torch.Tensor) – Next states being encountered by the agent
  • rewards (torch.Tensor) – Rewards received by the agent
  • dones (torch.Tensor) – Game over status of each environment
Returns:

Projected Q-value Distribution or Target Q Values

Return type:

target_q_values (object)

genrl.agents.deep.dqn.utils.categorical_q_values(agent: genrl.agents.deep.dqn.base.DQN, states: torch.Tensor, actions: torch.Tensor)[source]

Get Q values given state for a Categorical DQN

Parameters:
  • agent (DQN) – The agent
  • states (torch.Tensor) – States being replayed
  • actions (torch.Tensor) – Actions being replayed
Returns:

Q values for the given states and actions

Return type:

q_values (torch.Tensor)

genrl.agents.deep.dqn.utils.ddqn_q_target(agent: genrl.agents.deep.dqn.base.DQN, next_states: torch.Tensor, rewards: torch.Tensor, dones: torch.Tensor) → torch.Tensor[source]

Double Q-learning target

Can be used to replace the get_target_values method of the Base DQN class in any DQN algorithm

Parameters:
  • agent (DQN) – The agent
  • next_states (torch.Tensor) – Next states being encountered by the agent
  • rewards (torch.Tensor) – Rewards received by the agent
  • dones (torch.Tensor) – Game over status of each environment
Returns:

Target Q values using Double Q-learning

Return type:

target_q_values (torch.Tensor)
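A minimal sketch of this target computation is shown below (illustrative; online_q and target_q stand for the training and target Q-networks and are assumptions, not GenRL identifiers):

import torch

def double_dqn_target(online_q, target_q, next_states, rewards, dones, gamma=0.99):
    # The online network chooses the next action, the target network evaluates it.
    next_actions = online_q(next_states).argmax(dim=1, keepdim=True)
    next_values = target_q(next_states).gather(1, next_actions).squeeze(1)
    return rewards + gamma * next_values * (1.0 - dones)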

genrl.agents.deep.dqn.utils.prioritized_q_loss(agent: genrl.agents.deep.dqn.base.DQN, batch: collections.namedtuple)[source]

Function to calculate the loss of the Q-function

Parameters:
  • agent (DQN) – The agent
  • batch (collections.namedtuple of torch.Tensor) – Batch of experiences
Returns:Calculated loss of the Q-function
Return type:loss (torch.Tensor)

PPO1

genrl.agents.deep.ppo1.ppo1 module

class genrl.agents.deep.ppo1.ppo1.PPO1(*args, clip_param: float = 0.2, value_coeff: float = 0.5, entropy_coeff: float = 0.01, **kwargs)[source]

Bases: genrl.agents.deep.base.onpolicy.OnPolicyAgent

Proximal Policy Optimization algorithm (Clipped policy).

Paper: https://arxiv.org/abs/1707.06347

network

The network type of the Q-value function. Supported types: [“cnn”, “mlp”]

Type:str
env

The environment that the agent is supposed to act on

Type:Environment
create_model

Whether the model of the algo should be created when initialised

Type:bool
batch_size

Mini batch size for loading experiences

Type:int
gamma

The discount factor for rewards

Type:float
layers

Layers in the Neural Network of the Q-value function

Type:tuple of int
lr_policy

Learning rate for the policy/actor

Type:float
lr_value

Learning rate for the Q-value function

Type:float
rollout_size

Capacity of the Rollout Buffer

Type:int
buffer_type

Choose the type of Buffer: [“rollout”]

Type:str
clip_param

Epsilon for clipping policy loss

Type:float
value_coeff

Ratio of magnitude of value updates to policy updates

Type:float
entropy_coeff

Ratio of magnitude of entropy updates to policy updates

Type:float
seed

Seed for randomness

Type:int
render

Should the env be rendered during training?

Type:bool
device

Hardware being used for training. Options: [“cuda” -> GPU, “cpu” -> CPU]

Type:str
empty_logs()[source]

Empties logs

evaluate_actions(states: torch.Tensor, actions: torch.Tensor)[source]

Evaluates actions taken by actor

Actions taken by actor and their respective states are analysed to get log probabilities and values from critics

Parameters:
  • states (torch.Tensor) – States encountered in rollout
  • actions (torch.Tensor) – Actions taken in response to respective states
Returns:

Values of states encountered during the rollout log_probs (torch.Tensor): Log of action probabilities given a state

Return type:

values (torch.Tensor)

get_hyperparams() → Dict[str, Any][source]

Get relevant hyperparameters to save

Returns:Hyperparameters to be saved
Return type:hyperparams (dict)
get_logging_params() → Dict[str, Any][source]

Gets relevant parameters for logging

Returns:Logging parameters for monitoring training
Return type:logs (dict)
get_traj_loss(values, dones)[source]

Get loss from trajectory traversed by agent during rollouts

Computes the returns and advantages needed for calculating loss

Parameters:
  • values (torch.Tensor) – Values of states encountered during the rollout
  • dones (list of bool) – Game over statuses of each environment
load_weights(weights) → None[source]

Load weights for the agent from pretrained model

Parameters:weights (dict) – Dictionary of different neural net weights
select_action(state: numpy.ndarray, deterministic: bool = False) → numpy.ndarray[source]

Select action given state

Action Selection for On Policy Agents with Actor Critic

Parameters:
  • state (np.ndarray) – Current state of the environment
  • deterministic (bool) – Should the policy be deterministic or stochastic
Returns:

Action taken by the agent value (torch.Tensor): Value of given state log_prob (torch.Tensor): Log probability of selected action

Return type:

action (np.ndarray)

update_params()[source]

Updates the PPO1 network

Function to update the PPO1 actor-critic architecture

VPG

genrl.agents.deep.vpg.vpg module

class genrl.agents.deep.vpg.vpg.VPG(*args, **kwargs)[source]

Bases: genrl.agents.deep.base.onpolicy.OnPolicyAgent

Vanilla Policy Gradient algorithm

Paper https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf

network (str): The network type of the Q-value function. Supported types: [“cnn”, “mlp”]
env (Environment): The environment that the agent is supposed to act on
create_model (bool): Whether the model of the algo should be created when initialised
batch_size (int): Mini batch size for loading experiences
gamma (float): The discount factor for rewards
layers (tuple of int): Layers in the Neural Network of the Q-value function
lr_policy (float): Learning rate for the policy/actor
lr_value (float): Learning rate for the Q-value function
rollout_size (int): Capacity of the Rollout Buffer
buffer_type (str): Choose the type of Buffer: [“rollout”]
seed (int): Seed for randomness
render (bool): Should the env be rendered during training?
device (str): Hardware being used for training. Options: [“cuda” -> GPU, “cpu” -> CPU]
empty_logs()[source]

Empties logs

get_hyperparams() → Dict[str, Any][source]

Get relevant hyperparameters to save

Returns:Hyperparameters to be saved
Return type:hyperparams (dict)
get_log_probs(states: torch.Tensor, actions: torch.Tensor)[source]

Get log probabilities of action values

Actions taken by actor and their respective states are analysed to get log probabilities

Parameters:
  • states (torch.Tensor) – States encountered in rollout
  • actions (torch.Tensor) – Actions taken in response to respective states
Returns:

Log of action probabilities given a state

Return type:

log_probs (torch.Tensor)

get_logging_params() → Dict[str, Any][source]

Gets relevant parameters for logging

Returns:Logging parameters for monitoring training
Return type:logs (dict)
get_traj_loss(values, dones)[source]

Get loss from trajectory traversed by agent during rollouts

Computes the returns and advantages needed for calculating loss

Parameters:
  • values (torch.Tensor) – Values of states encountered during the rollout
  • dones (list of bool) – Game over statuses of each environment
load_weights(weights) → None[source]

Load weights for the agent from pretrained model

Parameters:weights (dict) – Dictionary of different neural net weights
select_action(state: numpy.ndarray, deterministic: bool = False) → numpy.ndarray[source]

Select action given state

Action Selection for Vanilla Policy Gradient

Parameters:
  • state (np.ndarray) – Current state of the environment
  • deterministic (bool) – Should the policy be deterministic or stochastic
Returns:

Action taken by the agent
value (torch.Tensor): Value of the given state. In VPG there is no critic to find the value, so we set this to a default 0 for convenience
log_prob (torch.Tensor): Log probability of the selected action

Return type:

action (np.ndarray)

update_params() → None[source]

Updates the VPG network

Function to update the VPG policy network

TD3

genrl.agents.deep.td3.td3 module

class genrl.agents.deep.td3.td3.TD3(*args, policy_frequency: int = 2, noise: genrl.core.noise.ActionNoise = None, noise_std: float = 0.2, **kwargs)[source]

Bases: genrl.agents.deep.base.offpolicy.OffPolicyAgentAC

Twin Delayed DDPG Algorithm

Paper: https://arxiv.org/abs/1802.09477

network

The network type of the Q-value function. Supported types: [“cnn”, “mlp”]

Type:str
env

The environment that the agent is supposed to act on

Type:Environment
create_model

Whether the model of the algo should be created when initialised

Type:bool
batch_size

Mini batch size for loading experiences

Type:int
gamma

The discount factor for rewards

Type:float
policy_layers

Neural network layer dimensions for the policy

Type:tuple of int
value_layers

Neural network layer dimensions for the critics

Type:tuple of int
lr_policy

Learning rate for the policy/actor

Type:float
lr_value

Learning rate for the critic

Type:float
replay_size

Capacity of the Replay Buffer

Type:int
buffer_type

Choose the type of Buffer: [“push”, “prioritized”]

Type:str
polyak

Target model update parameter (1 for hard update)

Type:float
policy_frequency

Frequency of policy updates in comparison to critic updates

Type:int
noise

Action Noise function added to aid in exploration

Type:ActionNoise
noise_std

Standard deviation of the action noise distribution

Type:float
seed

Seed for randomness

Type:int
render

Should the env be rendered during training?

Type:bool
device

Hardware being used for training. Options: [“cuda” -> GPU, “cpu” -> CPU]

Type:str
empty_logs()[source]

Empties logs

get_hyperparams() → Dict[str, Any][source]

Get relevant hyperparameters to save

Returns:Hyperparameters to be saved
Return type:hyperparams (dict)
get_logging_params() → Dict[str, Any][source]

Gets relevant parameters for logging

Returns:Logging parameters for monitoring training
Return type:logs (dict)
update_params(update_interval: int) → None[source]

Update parameters of the model

Parameters:update_interval (int) – Interval between successive updates of the target model

SAC

genrl.agents.deep.sac.sac module

class genrl.agents.deep.sac.sac.SAC(*args, alpha: float = 0.01, polyak: float = 0.995, entropy_tuning: bool = True, **kwargs)[source]

Bases: genrl.agents.deep.base.offpolicy.OffPolicyAgentAC

Soft Actor Critic algorithm (SAC)

Paper: https://arxiv.org/abs/1812.05905

network

The network type of the Q-value function. Supported types: [“cnn”, “mlp”]

Type:str
env

The environment that the agent is supposed to act on

Type:Environment
create_model

Whether the model of the algo should be created when initialised

Type:bool
batch_size

Mini batch size for loading experiences

Type:int
gamma

The discount factor for rewards

Type:float
policy_layers

Neural network layer dimensions for the policy

Type:tuple of int
value_layers

Neural network layer dimensions for the critics

Type:tuple of int
lr_policy

Learning rate for the policy/actor

Type:float
lr_value

Learning rate for the critic

Type:float
replay_size

Capacity of the Replay Buffer

Type:int
buffer_type

Choose the type of Buffer: [“push”, “prioritized”]

Type:str
alpha

Entropy factor

Type:float
polyak

Target model update parameter (1 for hard update)

Type:float
entropy_tuning

True if entropy tuning should be done, False otherwise

Type:bool
seed

Seed for randomness

Type:int
render

Should the env be rendered during training?

Type:bool
device

Hardware being used for training. Options: [“cuda” -> GPU, “cpu” -> CPU]

Type:str
empty_logs()[source]

Empties logs

get_alpha_loss(log_probs)[source]

Calculate Entropy Loss

Parameters:log_probs (float) – Log probs
get_hyperparams() → Dict[str, Any][source]

Get relevant hyperparameters to save

Returns:Hyperparameters to be saved
Return type:hyperparams (dict)
get_logging_params() → Dict[str, Any][source]

Gets relevant parameters for logging

Returns:Logging parameters for monitoring training
Return type:logs (dict)
get_p_loss(states: torch.Tensor) → torch.Tensor[source]

Function to get the Policy loss

Parameters:states (torch.Tensor) – States for which Q-values need to be found
Returns:Calculated policy loss
Return type:loss (torch.Tensor)
get_target_q_values(next_states: torch.Tensor, rewards: List[float], dones: List[bool]) → torch.Tensor[source]

Get target Q values for the SAC

Parameters:
  • next_states (torch.Tensor) – Next states for which target Q-values need to be found
  • rewards (list) – Rewards at each timestep for each environment
  • dones (list) – Game over status for each environment
Returns:

Target Q values for the SAC

Return type:

target_q_values (torch.Tensor)

select_action(state: numpy.ndarray, deterministic: bool = False) → numpy.ndarray[source]

Select action given state

Action Selection

Parameters:
  • state (np.ndarray) – Current state of the environment
  • deterministic (bool) – Should the policy be deterministic or stochastic
Returns:

Action taken by the agent

Return type:

action (np.ndarray)

update_params(update_interval: int) → None[source]

Update parameters of the model

Parameters:update_interval (int) – Interval between successive updates of the target model
update_target_model() → None[source]

Function to update the target Q model

Updates the target model with the training model’s weights when called

Q-Learning

genrl.agents.classical.qlearning.qlearning module

class genrl.agents.classical.qlearning.qlearning.QLearning(env: gym.core.Env, epsilon: float = 0.9, gamma: float = 0.95, lr: float = 0.01)[source]

Bases: object

Q-Learning Algorithm.

Paper- https://link.springer.com/article/10.1007/BF00992698

env

Environment with which agent interacts.

Type:gym.Env
epsilon

exploration coefficient for epsilon-greedy exploration.

Type:float, optional
gamma

discount factor.

Type:float, optional
lr

learning rate for optimizer.

Type:float, optional
get_action(state: numpy.ndarray, explore: bool = True) → numpy.ndarray[source]

Epsilon-greedy selection of an action in the explore phase.

Parameters:
  • state (np.ndarray) – Environment state.
  • explore (bool, optional) – True if exploration is required. False if not.
Returns:

action.

Return type:

np.ndarray

get_hyperparams() → Dict[str, Any][source]
update(transition: Tuple) → None[source]

Update the Q table.

Parameters:transition (Tuple) – transition 4-tuple used to update Q-table. In the form (state, action, reward, next_state)
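For reference, the tabular update this transition feeds is the classic Q-learning rule, sketched below (illustrative; Q is assumed to be a NumPy table indexed by state and action, and lr/gamma match the constructor defaults above):

import numpy as np

def q_learning_update(Q, transition, lr=0.01, gamma=0.95):
    state, action, reward, next_state = transition
    td_target = reward + gamma * np.max(Q[next_state])        # bootstrap from the best next action
    Q[state, action] += lr * (td_target - Q[state, action])   # move the estimate towards the TD target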

SARSA

genrl.agents.classical.sarsa.sarsa module

class genrl.agents.classical.sarsa.sarsa.SARSA(env: gym.core.Env, epsilon: float = 0.9, lmbda: float = 0.9, gamma: float = 0.95, lr: float = 0.01)[source]

Bases: object

SARSA Algorithm.

Paper- http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.17.2539&rep=rep1&type=pdf

env

Environment with which agent interacts.

Type:gym.Env
epsilon

exploration coefficient for epsilon-greedy exploration.

Type:float, optional
gamma

discount factor.

Type:float, optional
lr

learning rate for optimizer.

Type:float, optional
get_action(state: numpy.ndarray, explore: bool = True) → numpy.ndarray[source]

Epsilon-greedy selection of an action in the explore phase.

Parameters:
  • state (np.ndarray) – Environment state.
  • explore (bool, optional) – True if exploration is required. False if not.
Returns:

action.

Return type:

np.ndarray

update(transition: Tuple) → None[source]

Update the Q table and e values

Parameters:transition (Tuple) – transition 4-tuple used to update Q-table. In the form (state, action, reward, next_state)

Contextual Bandit

Base

class genrl.agents.bandits.contextual.base.DCBAgent(bandit: genrl.core.bandit.Bandit, device: str = 'cpu', **kwargs)[source]

Bases: genrl.core.bandit.BanditAgent

Base class for deep contextual bandit solving agents

Parameters:
  • bandit (genrl.utils.data_bandits.base.DataBasedBandit) – The bandit to solve
  • device (str) – Device to use for tensor operations. “cpu” for cpu or “cuda” for cuda. Defaults to “cpu”.
bandit

The bandit to solve

Type:genrl.utils.data_bandits.base.DataBasedBandit
device

Device to use for tensor operations.

Type:torch.device
select_action(context: torch.Tensor) → int[source]

Select an action based on given context

Parameters:context (torch.Tensor) – The context vector to select action for

Note

This method needs to be implemented in the specific agent.

Returns:The action to take
Return type:int
update_parameters(action: Optional[int] = None, batch_size: Optional[int] = None, train_epochs: Optional[int] = None) → None[source]

Update parameters of the agent.

Parameters:
  • action (Optional[int], optional) – Action to update the parameters for. Defaults to None.
  • batch_size (Optional[int], optional) – Size of batch to update parameters with. Defaults to None.
  • train_epochs (Optional[int], optional) – Epochs to train neural network for. Defaults to None.

Note

This method needs to be implemented in the specific agent.
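As a minimal (hypothetical) example of the interface, a concrete agent only has to override these two methods; the class below picks arms uniformly at random and learns nothing:

import torch
from genrl.agents.bandits.contextual.base import DCBAgent

class RandomDCBAgent(DCBAgent):
    def select_action(self, context: torch.Tensor) -> int:
        # Assumes the bandit exposes an n_actions attribute (an assumption,
        # not confirmed by this page).
        return int(torch.randint(self.bandit.n_actions, (1,)).item())

    def update_parameters(self, action=None, batch_size=None, train_epochs=None) -> None:
        # A random agent has nothing to learn.
        pass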

Bootstrap Neural

class genrl.agents.bandits.contextual.bootstrap_neural.BootstrapNeuralAgent(bandit: genrl.utils.data_bandits.base.DataBasedBandit, **kwargs)[source]

Bases: genrl.agents.bandits.contextual.base.DCBAgent

Bootstrapped ensemble agent for deep contextual bandits.

Parameters:
  • bandit (DataBasedBandit) – The bandit to solve
  • init_pulls (int, optional) – Number of times to select each action initially. Defaults to 3.
  • hidden_dims (List[int], optional) – Dimensions of hidden layers of network. Defaults to [50, 50].
  • init_lr (float, optional) – Initial learning rate. Defaults to 0.1.
  • lr_decay (float, optional) – Decay rate for learning rate. Defaults to 0.5.
  • lr_reset (bool, optional) – Whether to reset learning rate every train interval. Defaults to True.
  • max_grad_norm (float, optional) – Maximum norm of gradients for gradient clipping. Defaults to 0.5.
  • dropout_p (Optional[float], optional) – Probability for dropout. Defaults to None which implies dropout is not to be used.
  • eval_with_dropout (bool, optional) – Whether or not to use dropout at inference. Defaults to False.
  • n (int, optional) – Number of models in ensemble. Defaults to 10.
  • add_prob (float, optional) – Probability of adding a transition to a database. Defaults to 0.95.
  • device (str) – Device to use for tensor operations. “cpu” for cpu or “cuda” for cuda. Defaults to “cpu”.
select_action(context: torch.Tensor) → int[source]

Select an action based on given context.

Selects an action by computing a forward pass through a randomly selected network from the ensemble.

Parameters:context (torch.Tensor) – The context vector to select action for.
Returns:The action to take.
Return type:int
update_db(context: torch.Tensor, action: int, reward: int)[source]

Updates transition database with given transition

The transition is added to each database with a certain probability.

Parameters:
  • context (torch.Tensor) – Context received
  • action (int) – Action taken
  • reward (int) – Reward received
update_params(action: Optional[int] = None, batch_size: int = 512, train_epochs: int = 20)[source]

Update parameters of the agent.

Trains each neural network in the ensemble.

Parameters:
  • action (Optional[int], optional) – Action to update the parameters for. Not applicable in this agent. Defaults to None.
  • batch_size (int, optional) – Size of batch to update parameters with. Defaults to 512
  • train_epochs (int, optional) – Epochs to train neural network for. Defaults to 20

Fixed

class genrl.agents.bandits.contextual.fixed.FixedAgent(bandit: genrl.utils.data_bandits.base.DataBasedBandit, p: List[float] = None, device: str = 'cpu')[source]

Bases: genrl.agents.bandits.contextual.base.DCBAgent

select_action(context: torch.Tensor) → int[source]

Select an action based on fixed probabilities.

Parameters:context (torch.Tensor) – The context vector to select action for. In this agent, context vector is not considered.
Returns:The action to take.
Return type:int
update_db(*args, **kwargs)[source]
update_params(*args, **kwargs)[source]

Linear Posterior

class genrl.agents.bandits.contextual.linpos.LinearPosteriorAgent(bandit: genrl.utils.data_bandits.base.DataBasedBandit, **kwargs)[source]

Bases: genrl.agents.bandits.contextual.base.DCBAgent

Deep contextual bandit agent using bayesian regression for posterior inference.

Parameters:
  • bandit (DataBasedBandit) – The bandit to solve
  • init_pulls (int, optional) – Number of times to select each action initially. Defaults to 3.
  • lambda_prior (float, optional) – Gaussian prior for linear model. Defaults to 0.25.
  • a0 (float, optional) – Inverse gamma prior for noise. Defaults to 6.0.
  • b0 (float, optional) – Inverse gamma prior for noise. Defaults to 6.0.
  • device (str) – Device to use for tensor operations. “cpu” for cpu or “cuda” for cuda. Defaults to “cpu”.
select_action(context: torch.Tensor) → int[source]

Select an action based on given context.

Selecting action with highest predicted reward computed through betas sampled from posterior.

Parameters:context (torch.Tensor) – The context vector to select action for.
Returns:The action to take.
Return type:int
update_db(context: torch.Tensor, action: int, reward: int)[source]

Updates transition database with given transition

Parameters:
  • context (torch.Tensor) – Context received
  • action (int) – Action taken
  • reward (int) – Reward received
update_params(action: int, batch_size: int = 512, train_epochs: Optional[int] = None)[source]

Update parameters of the agent.

Updated the posterior over beta though bayesian regression.

Parameters:
  • action (int) – Action to update the parameters for.
  • batch_size (int, optional) – Size of batch to update parameters with. Defaults to 512
  • train_epochs (Optional[int], optional) – Epochs to train neural network for. Not applicable in this agent. Defaults to None

Neural Greedy

class genrl.agents.bandits.contextual.neural_greedy.NeuralGreedyAgent(bandit: genrl.utils.data_bandits.base.DataBasedBandit, **kwargs)[source]

Bases: genrl.agents.bandits.contextual.base.DCBAgent

Deep contextual bandit agent using epsilon greedy with a neural network.

Parameters:
  • bandit (DataBasedBandit) – The bandit to solve
  • init_pulls (int, optional) – Number of times to select each action initially. Defaults to 3.
  • hidden_dims (List[int], optional) – Dimensions of hidden layers of network. Defaults to [50, 50].
  • init_lr (float, optional) – Initial learning rate. Defaults to 0.1.
  • lr_decay (float, optional) – Decay rate for learning rate. Defaults to 0.5.
  • lr_reset (bool, optional) – Whether to reset learning rate every train interval. Defaults to True.
  • max_grad_norm (float, optional) – Maximum norm of gradients for gradient clipping. Defaults to 0.5.
  • dropout_p (Optional[float], optional) – Probability for dropout. Defaults to None which implies dropout is not to be used.
  • eval_with_dropout (bool, optional) – Whether or not to use dropout at inference. Defaults to False.
  • epsilon (float, optional) – Probability of selecting a random action. Defaults to 0.0.
  • device (str) – Device to use for tensor operations. “cpu” for cpu or “cuda” for cuda. Defaults to “cpu”.
select_action(context: torch.Tensor) → int[source]

Select an action based on given context.

Selects an action by computing a forward pass through the network, with an epsilon probability of selecting a random action.

Parameters:context (torch.Tensor) – The context vector to select action for.
Returns:The action to take.
Return type:int
update_db(context: torch.Tensor, action: int, reward: int)[source]

Updates transition database with given transition

Parameters:
  • context (torch.Tensor) – Context received
  • action (int) – Action taken
  • reward (int) – Reward received
update_params(action: Optional[int] = None, batch_size: int = 512, train_epochs: int = 20)[source]

Update parameters of the agent.

Trains neural network.

Parameters:
  • action (Optional[int], optional) – Action to update the parameters for. Not applicable in this agent. Defaults to None.
  • batch_size (int, optional) – Size of batch to update parameters with. Defaults to 512
  • train_epochs (int, optional) – Epochs to train neural network for. Defaults to 20

Neural Linear Posterior

class genrl.agents.bandits.contextual.neural_linpos.NeuralLinearPosteriorAgent(bandit: genrl.utils.data_bandits.base.DataBasedBandit, **kwargs)[source]

Bases: genrl.agents.bandits.contextual.base.DCBAgent

Deep contextual bandit agent using bayesian regression for posterior inference

A neural network is used to transform the context vector into a latent representation on which bayesian regression is performed.

Parameters:
  • bandit (DataBasedBandit) – The bandit to solve
  • init_pulls (int, optional) – Number of times to select each action initially. Defaults to 3.
  • hidden_dims (List[int], optional) – Dimensions of hidden layers of network. Defaults to [50, 50].
  • init_lr (float, optional) – Initial learning rate. Defaults to 0.1.
  • lr_decay (float, optional) – Decay rate for learning rate. Defaults to 0.5.
  • lr_reset (bool, optional) – Whether to reset learning rate every train interval. Defaults to True.
  • max_grad_norm (float, optional) – Maximum norm of gradients for gradient clipping. Defaults to 0.5.
  • dropout_p (Optional[float], optional) – Probability for dropout. Defaults to None which implies dropout is not to be used.
  • eval_with_dropout (bool, optional) – Whether or not to use dropout at inference. Defaults to False.
  • nn_update_ratio (int, optional) – . Defaults to 2.
  • lambda_prior (float, optional) – Gaussian prior for linear model. Defaults to 0.25.
  • a0 (float, optional) – Inverse gamma prior for noise. Defaults to 3.0.
  • b0 (float, optional) – Inverse gamma prior for noise. Defaults to 3.0.
  • device (str) – Device to use for tensor operations. “cpu” for cpu or “cuda” for cuda. Defaults to “cpu”.
select_action(context: torch.Tensor) → int[source]

Select an action based on given context.

Selects an action by computing a forward pass through network to output a representation of the context on which bayesian linear regression is performed to select an action.

Parameters:context (torch.Tensor) – The context vector to select action for.
Returns:The action to take.
Return type:int
update_db(context: torch.Tensor, action: int, reward: int)[source]

Updates transition database with given transition

Updates latent context and predicted rewards separately.

Parameters:
  • context (torch.Tensor) – Context received
  • action (int) – Action taken
  • reward (int) – Reward received
update_params(action: int, batch_size: int = 512, train_epochs: int = 20)[source]

Update parameters of the agent.

Trains neural network and updates bayesian regression parameters.

Parameters:
  • action (int) – Action to update the parameters for.
  • batch_size (int, optional) – Size of batch to update parameters with. Defaults to 512
  • train_epochs (int, optional) – Epochs to train neural network for. Defaults to 20

Neural Noise Sampling

class genrl.agents.bandits.contextual.neural_noise_sampling.NeuralNoiseSamplingAgent(bandit: genrl.utils.data_bandits.base.DataBasedBandit, **kwargs)[source]

Bases: genrl.agents.bandits.contextual.base.DCBAgent

Deep contextual bandit agent with noise sampling for neural network parameters.

Parameters:
  • bandit (DataBasedBandit) – The bandit to solve
  • init_pulls (int, optional) – Number of times to select each action initially. Defaults to 3.
  • hidden_dims (List[int], optional) – Dimensions of hidden layers of network. Defaults to [50, 50].
  • init_lr (float, optional) – Initial learning rate. Defaults to 0.1.
  • lr_decay (float, optional) – Decay rate for learning rate. Defaults to 0.5.
  • lr_reset (bool, optional) – Whether to reset learning rate every train interval. Defaults to True.
  • max_grad_norm (float, optional) – Maximum norm of gradients for gradient clipping. Defaults to 0.5.
  • dropout_p (Optional[float], optional) – Probability for dropout. Defaults to None which implies dropout is not to be used.
  • eval_with_dropout (bool, optional) – Whether or not to use dropout at inference. Defaults to False.
  • noise_std_dev (float, optional) – Standard deviation of sampled noise. Defaults to 0.05.
  • eps (float, optional) – Small constant for bounding KL divergence of noise. Defaults to 0.1.
  • noise_update_batch_size (int, optional) – Batch size for updating noise parameters. Defaults to 256.
  • device (str) – Device to use for tensor operations. “cpu” for cpu or “cuda” for cuda. Defaults to “cpu”.
select_action(context: torch.Tensor) → int[source]

Select an action based on given context.

Selects an action by adding noise to the neural network parameters and then computing a forward pass with the context vector as input.

Parameters:context (torch.Tensor) – The context vector to select action for.
Returns:The action to take
Return type:int
update_db(context: torch.Tensor, action: int, reward: int)[source]

Updates transition database with given transition

Parameters:
  • context (torch.Tensor) – Context received
  • action (int) – Action taken
  • reward (int) – Reward received
update_params(action: Optional[int] = None, batch_size: int = 512, train_epochs: int = 20)[source]

Update parameters of the agent.

Trains the neural network.

Parameters:
  • action (Optional[int], optional) – Action to update the parameters for. Not applicable in this agent. Defaults to None.
  • batch_size (int, optional) – Size of batch to update parameters with. Defaults to 512
  • train_epochs (int, optional) – Epochs to train neural network for. Defaults to 20

Variational

class genrl.agents.bandits.contextual.variational.VariationalAgent(bandit: genrl.utils.data_bandits.base.DataBasedBandit, **kwargs)[source]

Bases: genrl.agents.bandits.contextual.base.DCBAgent

Deep contextual bandit agent using variational inference.

Parameters:
  • bandit (DataBasedBandit) – The bandit to solve
  • init_pulls (int, optional) – Number of times to select each action initially. Defaults to 3.
  • hidden_dims (List[int], optional) – Dimensions of hidden layers of network. Defaults to [50, 50].
  • init_lr (float, optional) – Initial learning rate. Defaults to 0.1.
  • lr_decay (float, optional) – Decay rate for learning rate. Defaults to 0.5.
  • lr_reset (bool, optional) – Whether to reset learning rate every train interval. Defaults to True.
  • max_grad_norm (float, optional) – Maximum norm of gradients for gradient clipping. Defaults to 0.5.
  • dropout_p (Optional[float], optional) – Probability for dropout. Defaults to None which implies dropout is not to be used.
  • eval_with_dropout (bool, optional) – Whether or not to use dropout at inference. Defaults to False.
  • noise_std (float, optional) – Standard deviation of noise in bayesian neural network. Defaults to 0.1.
  • device (str) – Device to use for tensor operations. “cpu” for cpu or “cuda” for cuda. Defaults to “cpu”.
select_action(context: torch.Tensor) → int[source]

Select an action based on given context.

Selects an action by computing a forward pass through the bayesian neural network.

Parameters:context (torch.Tensor) – The context vector to select action for.
Returns:The action to take.
Return type:int
update_db(context: torch.Tensor, action: int, reward: int)[source]

Updates transition database with given transition

Parameters:
  • context (torch.Tensor) – Context received
  • action (int) – Action taken
  • reward (int) – Reward received
update_params(action: int, batch_size: int = 512, train_epochs: int = 20)[source]

Update parameters of the agent.

Trains each neural network in the ensemble.

Parameters:
  • action (Optional[int], optional) – Action to update the parameters for. Not applicable in this agent. Defaults to None.
  • batch_size (int, optional) – Size of batch to update parameters with. Defaults to 512
  • train_epochs (int, optional) – Epochs to train neural network for. Defaults to 20

Multi-Armed Bandit

Base

class genrl.agents.bandits.multiarmed.base.MABAgent(bandit: genrl.core.bandit.MultiArmedBandit)[source]

Bases: genrl.core.bandit.BanditAgent

Base Class for Contextual Bandit solving Policy

Parameters:
  • bandit (MultiArmedBandit type object) – The Bandit to solve
  • requires_init_run – Indicates whether initialisation of Q values is required
action_hist

Get the history of actions taken for contexts

Returns:List of context, actions pairs
Return type:list
counts

Get the number of times each action has been taken

Returns:Numpy array with count for each action
Return type:numpy.ndarray
regret

Get the current regret

Returns:The current regret
Return type:float
regret_hist

Get the history of regrets incurred for each step

Returns:List of regrets
Return type:list
reward_hist

Get the history of rewards received for each step

Returns:List of rewards
Return type:list
select_action(context: int) → int[source]

Select an action

This method needs to be implemented in the specific policy.

Parameters:context (int) – the context to select action for
Returns:Selected action
Return type:int
update_params(context: int, action: int, reward: Union[int, float]) → None[source]

Update parameters for the policy

This method needs to be implemented in the specific policy.

Parameters:
  • context (int) – context for which action is taken
  • action (int) – action taken for the step
  • reward (int or float) – reward obtained for the step

Bayesian Bandit

class genrl.agents.bandits.multiarmed.bayesian.BayesianUCBMABAgent(bandit: genrl.core.bandit.MultiArmedBandit, alpha: float = 1.0, beta: float = 1.0, confidence: float = 3.0)[source]

Bases: genrl.agents.bandits.multiarmed.base.MABAgent

Multi-Armed Bandit Solver with Bayesian Upper Confidence Bound based Action Selection Strategy.

Refer to Section 2.7 of Reinforcement Learning: An Introduction.

Parameters:
  • bandit (MultiArmedBandit type object) – The Bandit to solve
  • alpha (float) – alpha value for beta distribution
  • beta (float) – beta value for beta distribution
  • c (float) – Confidence level which controls degree of exploration
a

alpha parameter of beta distribution associated with the policy

Type:numpy.ndarray
b

beta parameter of beta distribution associated with the policy

Type:numpy.ndarray
confidence

Confidence level which weights the exploration term

Type:float
quality

Q values for all the actions for alpha, beta and c

Type:numpy.ndarray
select_action(context: int) → int[source]

Select an action according to bayesian upper confidence bound

Takes the action that maximises a weighted sum of the Q values and a beta distribution parameterized by alpha and beta, weighted by c, for each action

Parameters:
  • context (int) – the context to select action for
  • t (int) – timestep to choose action for
Returns:

Selected action

Return type:

int

update_params(context: int, action: int, reward: float) → None[source]

Update parameters for the policy

Updates the regret as the difference between max Q value and that of the action. Updates the Q values according to the reward received in this step

Parameters:
  • context (int) – context for which action is taken
  • action (int) – action taken for the step
  • reward (float) – reward obtained for the step

Bernoulli Bandit

class genrl.agents.bandits.multiarmed.bernoulli_mab.BernoulliMAB(bandits: int = 1, arms: int = 5, reward_probs: numpy.ndarray = None, context_type: str = 'tensor')[source]

Bases: genrl.core.bandit.MultiArmedBandit

Contextual Bandit with categorical context and bernoulli reward distribution

Parameters:
  • bandits (int) – Number of bandits
  • arms (int) – Number of arms in each bandit
  • reward_probs (numpy.ndarray) – Probabilities of getting rewards

Epsilon Greedy

class genrl.agents.bandits.multiarmed.epsgreedy.EpsGreedyMABAgent(bandit: genrl.core.bandit.MultiArmedBandit, eps: float = 0.05)[source]

Bases: genrl.agents.bandits.multiarmed.base.MABAgent

Contextual Bandit Policy with Epsilon Greedy Action Selection Strategy.

Refer to Section 2.3 of Reinforcement Learning: An Introduction.

Parameters:
  • bandit (MultiArmedBandit type object) – The Bandit to solve
  • eps (float) – Probability with which a random action is to be selected.
eps

Exploration constant

Type:float
quality

Q values assigned by the policy to all actions

Type:numpy.ndarray
select_action(context: int) → int[source]

Select an action according to the epsilon greedy strategy

A random action is selected with probability epsilon; otherwise the action with the highest current Q value is chosen. This encourages exploration by the policy.

Parameters:context (int) – the context to select action for
Returns:Selected action
Return type:int
update_params(context: int, action: int, reward: float) → None[source]

Update parameters for the policy

Updates the regret as the difference between max Q value and that of the action. Updates the Q values according to the reward received in this step.

Parameters:
  • context (int) – context for which action is taken
  • action (int) – action taken for the step
  • reward (float) – reward obtained for the step
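
As a quick usage sketch, an epsilon-greedy agent can be run on a Bernoulli bandit as follows; the reset/step return values are an assumption about the MultiArmedBandit API rather than a guaranteed contract.

from genrl.agents.bandits.multiarmed.bernoulli_mab import BernoulliMAB
from genrl.agents.bandits.multiarmed.epsgreedy import EpsGreedyMABAgent

bandit = BernoulliMAB(bandits=1, arms=5)
agent = EpsGreedyMABAgent(bandit, eps=0.05)

context = bandit.reset()                           # assumed: initial context index
for _ in range(1000):
    action = agent.select_action(context)          # epsilon-greedy over current Q values
    context, reward = bandit.step(action)          # assumed: next context and 0/1 reward
    agent.update_params(context, action, reward)   # update Q values and regret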

Gaussian

class genrl.agents.bandits.multiarmed.gaussian_mab.GaussianMAB(bandits: int = 10, arms: int = 5, reward_means: numpy.ndarray = None, context_type: str = 'tensor')[source]

Bases: genrl.core.bandit.MultiArmedBandit

Contextual Bandit with categorical context and gaussian reward distribution

Parameters:
  • bandits (int) – Number of bandits
  • arms (int) – Number of arms in each bandit
  • reward_means (numpy.ndarray) – Mean of gaussian distribution for each reward

Gradient

class genrl.agents.bandits.multiarmed.gradient.GradientMABAgent(bandit: genrl.core.bandit.MultiArmedBandit, alpha: float = 0.1, temp: float = 0.01)[source]

Bases: genrl.agents.bandits.multiarmed.base.MABAgent

Multi-Armed Bandit Solver with Softmax Action Selection Strategy.

Refer to Section 2.8 of Reinforcement Learning: An Introduction.

Parameters:
  • bandit (MultiArmedBandit type object) – The Bandit to solve
  • alpha (float) – The step size parameter for gradient based update
  • temp (float) – Temperature for softmax distribution over Q values of actions
alpha

Step size parameter for gradient based update of policy

Type:float
probability_hist

History of probability values assigned to each action at each timestep

Type:numpy.ndarray
quality

Q values assigned by the policy to all actions

Type:numpy.ndarray
select_action(context: int) → int[source]

Select an action according to the softmax action selection strategy

Action is sampled from softmax distribution computed over the Q values for all actions

Parameters:context (int) – the context to select action for
Returns:Selected action
Return type:int
temp

Temperature for softmax distribution over Q values of actions

Type:float
update_params(context: int, action: int, reward: float) → None[source]

Update parameters for the policy

Updates the regret as the difference between max Q value and that of the action. Updates the Q values through a gradient ascent step

Parameters:
  • context (int) – context for which action is taken
  • action (int) – action taken for the step
  • reward (float) – reward obtained for the step

Thompson Sampling

class genrl.agents.bandits.multiarmed.thompson.ThompsonSamplingMABAgent(bandit: genrl.core.bandit.MultiArmedBandit, alpha: float = 1.0, beta: float = 1.0)[source]

Bases: genrl.agents.bandits.multiarmed.base.MABAgent

Multi-Armed Bandit Solver with Thompson Sampling based Action Selection Strategy.

Parameters:
  • bandit (MultiArmedBandit type object) – The Bandit to solve
  • a (float) – alpha value for beta distribution
  • b (float) – beta value for beta distribution
a

alpha parameter of beta distribution associated with the policy

Type:numpy.ndarray
b

beta parameter of beta distribution associated with the policy

Type:numpy.ndarray
quality

Q values for all the actions for alpha, beta and c

Type:numpy.ndarray
select_action(context: int) → int[source]

Select an action according to Thompson Sampling

Samples are taken from beta distribution parameterized by alpha and beta for each action. The action with the highest sample is selected.

Parameters:context (int) – the context to select action for
Returns:Selected action
Return type:int
update_params(context: int, action: int, reward: float) → None[source]

Update parameters for the policy

Updates the regret as the difference between max Q value and that of the action. Updates the alpha value of the beta distribution by adding the reward, while the beta value is updated by adding 1 - reward. Also updates the count of the action taken.

Parameters:
  • context (int) – context for which action is taken
  • action (int) – action taken for the step
  • reward (float) – reward obtained for the step

Upper Confidence Bound

class genrl.agents.bandits.multiarmed.ucb.UCBMABAgent(bandit: genrl.core.bandit.MultiArmedBandit, confidence: float = 1.0)[source]

Bases: genrl.agents.bandits.multiarmed.base.MABAgent

Multi-Armed Bandit Solver with Upper Confidence Bound based Action Selection Strategy.

Refer to Section 2.7 of Reinforcement Learning: An Introduction.

Parameters:
  • bandit (MultiArmedBandit type object) – The Bandit to solve
  • c (float) – Confidence level which controls degree of exploration
confidence

Confidence level which weights the exploration term

Type:float
quality

Q values assigned by the policy to all actions

Type:numpy.ndarray
select_action(context: int) → int[source]

Select an action according to upper confidence bound action selection

Take action that maximises a weighted sum of the Q values for the action and an exploration encouragement term controlled by c.

Parameters:context (int) – the context to select action for
Returns:Selected action
Return type:int
update_params(context: int, action: int, reward: float) → None[source]

Update parameters for the policy

Updates the regret as the difference between max Q value and that of the action. Updates the Q values according to the reward received in this step.

Parameters:
  • context (int) – context for which action is taken
  • action (int) – action taken for the step
  • reward (float) – reward obtained for the step

Environments

Environments

Subpackages

Vectorized Environments
Submodules
genrl.environments.vec_env.monitor module
class genrl.environments.vec_env.monitor.VecMonitor(venv: genrl.environments.vec_env.vector_envs.VecEnv, history_length: int = 0, info_keys: Tuple = ())[source]

Bases: genrl.environments.vec_env.wrappers.VecEnvWrapper

Monitor class for VecEnvs. Saves important variables into the info dictionary

Parameters:
  • venv (object) – Vectorized Environment
  • history_length (int) – Length of history for episode rewards and episode lengths
  • info_keys (tuple or list) – Important variables to save
reset() → numpy.ndarray[source]

Resets Vectorized Environment

Returns:Initial observations
Return type:Numpy Array
step(actions: numpy.ndarray) → Tuple[source]

Steps through all the environments and records important information

Parameters:actions (Numpy Array) – Actions to be taken for the Vectorized Environment
Returns:States, rewards, dones, infos
genrl.environments.vec_env.normalize module
class genrl.environments.vec_env.normalize.VecNormalize(venv: genrl.environments.vec_env.vector_envs.VecEnv, norm_obs: bool = True, norm_reward: bool = True, clip_reward: float = 20.0)[source]

Bases: genrl.environments.vec_env.wrappers.VecEnvWrapper

Wrapper to implement Normalization of observations and rewards for VecEnvs

Parameters:
  • venv (Vectorized Environment) – The Vectorized environment
  • n_envs (int) – Number of environments in VecEnv
  • norm_obs (bool) – True if observations should be normalized, else False
  • norm_reward (bool) – True if rewards should be normalized, else False
  • clip_reward (float) – Maximum absolute value for rewards
close()[source]

Close all individual environments in the Vectorized Environment

reset() → numpy.ndarray[source]

Resets Vectorized Environment

Returns:Initial observations
Return type:Numpy Array
step(actions: numpy.ndarray) → Tuple[source]

Steps through all the environments and normalizes the observations and rewards (if enabled)

Parameters:actions (Numpy Array) – Actions to be taken for the Vectorized Environment
Returns:States, rewards, dones, infos
genrl.environments.vec_env.utils module
class genrl.environments.vec_env.utils.RunningMeanStd(epsilon: float = 0.0001, shape: Tuple = ())[source]

Bases: object

Utility class to compute a running mean and variance

Parameters:
  • epsilon (float) – Small number to prevent division by zero for calculations
  • shape (Tuple) – Shape of the RMS object
update(batch: numpy.ndarray)[source]
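
The statistics are updated with the standard parallel mean/variance combination; the snippet below is a self-contained sketch of that rule (not the library's exact implementation).

import numpy as np

def update_running_stats(mean, var, count, batch):
    # Merge running statistics with a new batch of samples (parallel mean/variance update).
    batch_mean = batch.mean(axis=0)
    batch_var = batch.var(axis=0)
    batch_count = batch.shape[0]
    delta = batch_mean - mean
    total = count + batch_count
    new_mean = mean + delta * batch_count / total
    m2 = var * count + batch_var * batch_count + delta ** 2 * count * batch_count / total
    return new_mean, m2 / total, total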
genrl.environments.vec_env.vector_envs module
class genrl.environments.vec_env.vector_envs.SerialVecEnv(*args, **kwargs)[source]

Bases: genrl.environments.vec_env.vector_envs.VecEnv

Constructs a wrapper for serial execution through envs.

close()[source]

Closes all envs

get_spaces()[source]
images() → List[T][source]

Returns an array of images from each env render

render(mode='human')[source]

Renders all envs in a tiles format similar to baselines

Parameters:mode (string) – Can either be ‘human’ or ‘rgb_array’. Displays tiled images in ‘human’ and returns tiled images in ‘rgb_array’
reset() → numpy.ndarray[source]

Resets all envs

step(actions: numpy.ndarray) → Tuple[source]

Steps through all envs serially

Parameters:actions (Iterable of ints/floats) – Actions from the model
class genrl.environments.vec_env.vector_envs.SubProcessVecEnv(*args, **kwargs)[source]

Bases: genrl.environments.vec_env.vector_envs.VecEnv

Constructs a wrapper for parallel execution through envs.

close()[source]

Closes all environments and processes

get_spaces() → Tuple[source]

Returns state and action spaces of environments

reset() → numpy.ndarray[source]

Resets environments

Returns:States after environment reset
seed(seed: int = None)[source]

Sets seed for reproducibility

step(actions: numpy.ndarray) → Tuple[source]

Steps through all the environments in parallel via their worker processes

Parameters:actions (Iterable of ints/floats) – Actions from the model
class genrl.environments.vec_env.vector_envs.VecEnv(envs: List[T], n_envs: int = 2)[source]

Bases: abc.ABC

Base class for multiple environments.

Parameters:
  • env (Gym Environment) – Gym environment to be vectorised
  • n_envs (int) – Number of environments
action_shape
action_spaces
close()[source]
n_envs
obs_shape
observation_spaces
reset()[source]
sample() → List[T][source]

Return samples of actions from each environment

seed(seed: int)[source]

Set seed for reproducibility in all environments

step(actions)[source]
genrl.environments.vec_env.vector_envs.worker(parent_conn: multiprocessing.context.BaseContext.Pipe, child_conn: multiprocessing.context.BaseContext.Pipe, env: gym.core.Env)[source]

Worker class to facilitate multiprocessing

Parameters:
  • parent_conn (Multiprocessing Pipe Connection) – Parent connection of Pipe
  • child_conn (Multiprocessing Pipe Connection) – Child connection of Pipe
  • env (Gym Environment) – Gym environment we need multiprocessing for
genrl.environments.vec_env.wrappers module
class genrl.environments.vec_env.wrappers.VecEnvWrapper(venv)[source]

Bases: genrl.environments.vec_env.vector_envs.VecEnv

close()[source]
render(mode='human')[source]
reset()[source]
step(actions)[source]
Module contents

Submodules

genrl.environments.action_wrappers module

class genrl.environments.action_wrappers.ClipAction(env: Union[gym.core.Env, genrl.environments.vec_env.vector_envs.VecEnv])[source]

Bases: gym.core.ActionWrapper

Action Wrapper to clip actions

Parameters:env (object) – The environment whose actions need to be clipped
action(action: numpy.ndarray) → numpy.ndarray[source]
class genrl.environments.action_wrappers.RescaleAction(env: Union[gym.core.Env, genrl.environments.vec_env.vector_envs.VecEnv], low: int, high: int)[source]

Bases: gym.core.ActionWrapper

Action Wrapper to rescale actions

Parameters:
  • env (object) – The environment whose actions need to be rescaled
  • low (int) – Lower limit of action
  • high (int) – Upper limit of action
action(action: numpy.ndarray) → numpy.ndarray[source]

genrl.environments.atari_preprocessing module

class genrl.environments.atari_preprocessing.AtariPreprocessing(env: gym.core.Env, frameskip: Union[Tuple, int] = (2, 5), grayscale: bool = True, screen_size: int = 84)[source]

Bases: gym.core.Wrapper

Implementation for Image preprocessing for Gym Atari environments. Implements: 1) Frameskip 2) Grayscale 3) Downsampling to square image

Parameters:
  • env (Gym Environment) – Atari environment
  • frameskip (tuple or int) – Number of steps between actions. E.g. frameskip=4 will mean 1 action will be taken for every 4 frames. If a tuple is given, the frameskip is non-deterministic and a random number is chosen from the given range, e.g. (2, 5)
  • grayscale (boolean) – Whether or not the output should be converted to grayscale
  • screen_size (int) – Size of the output screen (square output)
reset() → numpy.ndarray[source]

Resets state of environment

Returns:Initial state
Return type:NumPy array
step(action: numpy.ndarray) → numpy.ndarray[source]

Step through Atari environment for given action

Parameters:action (NumPy array) – Action taken by agent
Returns:Current state, reward(for frameskip number of actions), done, info

genrl.environments.atari_wrappers module

class genrl.environments.atari_wrappers.FireReset(env: gym.core.Env)[source]

Bases: gym.core.Wrapper

Some Atari environments do not actually do anything until a specific action (the fire action) is taken, so we make it take the action before starting the training process

Parameters:env (Gym Environment) – Atari environment
reset() → numpy.ndarray[source]

Resets state of environment and takes the fire action so that the episode actually starts

Returns:Initial state
Return type:NumPy array
class genrl.environments.atari_wrappers.NoopReset(env: gym.core.Env, max_noops: int = 30)[source]

Bases: gym.core.Wrapper

Some Atari environments always reset to the same state, so we take a random number of empty (noop) actions to introduce some stochasticity.

Parameters:
  • env (Gym Environment) – Atari environment
  • max_noops (int) – Maximum number of Noops to be taken
reset() → numpy.ndarray[source]

Resets state of environment. Performs the noop action a random number of times to introduce stochasticity

Returns:Initial state
Return type:NumPy array
step(action: numpy.ndarray) → numpy.ndarray[source]

Step through underlying Atari environment for given action

Parameters:action (NumPy array) – Action taken by agent
Returns:Current state, reward(for frameskip number of actions), done, info

genrl.environments.base_wrapper module

class genrl.environments.base_wrapper.BaseWrapper(env: Any, batch_size: int = None)[source]

Bases: abc.ABC

Base class for all wrappers

batch_size

The number of batches trained per update

close() → None[source]

Closes environment and performs any other cleanup

Must be overridden by subclasses

render() → None[source]

Render the environment

reset() → None[source]

Resets state of environment

Must be overridden by subclasses

Returns:Initial state
seed(seed: int = None) → None[source]

Set seed for environment

step(action: numpy.ndarray) → None[source]

Step through the environment

Must be overridden by subclasses

genrl.environments.frame_stack module

class genrl.environments.frame_stack.FrameStack(env: gym.core.Env, framestack: int = 4, compress: bool = True)[source]

Bases: gym.core.Wrapper

Wrapper to efficiently stack the last few (4 by default) observations of the agent

Parameters:
  • env (Gym Environment) – Environment to be wrapped
  • framestack (int) – Number of frames to be stacked
  • compress (bool) – True if we want to use LZ4 compression to conserve memory usage
reset() → numpy.ndarray[source]

Resets environment

Returns:Initial state of environment
Return type:NumPy Array
step(action: numpy.ndarray) → numpy.ndarray[source]

Steps through environment

Parameters:action (NumPy Array) – Action taken by agent
Returns:Next state, reward, done, info
Return type:NumPy Array, float, boolean, dict
class genrl.environments.frame_stack.LazyFrames(frames: List[T], compress: bool = False)[source]

Bases: object

Efficient data structure to save each frame only once. Can use LZ4 compression to optimize memory usage.

Parameters:
  • frames (collections.deque) – List of frames that need to be converted to a LazyFrames data structure
  • compress (boolean) – True if we want to use LZ4 compression to conserve memory usage
shape

Returns dimensions of other object

genrl.environments.gym_wrapper module

class genrl.environments.gym_wrapper.GymWrapper(env: gym.core.Env)[source]

Bases: gym.core.Wrapper

Wrapper class for all Gym Environments

Parameters:
  • env (string) – Gym environment name
  • n_envs (None, int) – Number of environments. None if not vectorised
  • parallel (boolean) – If vectorised, whether environments should be run serially or in parallel
action_shape
close() → None[source]

Closes environment

obs_shape
render(mode: str = 'human') → None[source]

Renders all envs in a tiles format similar to baselines.

Parameters:mode (string) – Can either be ‘human’ or ‘rgb_array’. Displays tiled images in ‘human’ and returns tiled images in ‘rgb_array’
reset() → numpy.ndarray[source]

Resets environment

Returns:Initial state
sample() → numpy.ndarray[source]

Shortcut method to directly sample from environment’s action space

Returns:Random action from action space
Return type:NumPy Array
seed(seed: int = None) → None[source]

Set environment seed

Parameters:seed (int) – Value of seed
step(action: numpy.ndarray) → numpy.ndarray[source]

Steps the env through given action

Parameters:action (NumPy array) – Action taken by agent
Returns:Next observation, reward, game status and debugging info

genrl.environments.suite module

genrl.environments.suite.AtariEnv(env_id: str, wrapper_list: List[T] = [<class 'genrl.environments.atari_preprocessing.AtariPreprocessing'>, <class 'genrl.environments.atari_wrappers.NoopReset'>, <class 'genrl.environments.atari_wrappers.FireReset'>, <class 'genrl.environments.time_limit.AtariTimeLimit'>, <class 'genrl.environments.frame_stack.FrameStack'>]) → gym.core.Env[source]

Function to apply wrappers for all Atari envs by Trainer class

Parameters:
  • env (string) – Environment Name
  • wrapper_list (list or tuple) – List of wrappers to use
Returns:

Gym Atari Environment

Return type:

object

genrl.environments.suite.GymEnv(env_id: str) → gym.core.Env[source]

Function to apply wrappers for all regular Gym envs by Trainer class

Parameters:env (string) – Environment Name
Returns:Gym Environment
Return type:object
genrl.environments.suite.VectorEnv(env_id: str, n_envs: int = 2, parallel: int = False, env_type: str = 'gym') → genrl.environments.vec_env.vector_envs.VecEnv[source]

Chooses the kind of Vector Environment that is required

Parameters:
  • env_id (string) – Gym environment to be vectorised
  • n_envs (int) – Number of environments
  • parallel (bool) – True if environments should be run in parallel through subprocesses, False if they should run serially one after the other
  • env_type (string) – Type of environment. Currently, we support [“gym”, “atari”]
Returns:Vector Environment
Return type:object
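
For example, two serial CartPole environments can be created, stepped with random actions and closed as follows (a minimal sketch):

from genrl.environments import VectorEnv

env = VectorEnv("CartPole-v0", n_envs=2, parallel=False, env_type="gym")
states = env.reset()
# sample() draws one random action per environment from its action space
states, rewards, dones, infos = env.step(env.sample())
env.close()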

genrl.environments.time_limit module

class genrl.environments.time_limit.AtariTimeLimit(env, max_episode_len=None)[source]

Bases: gym.core.Wrapper

reset(**kwargs)[source]

Resets the state of the environment and returns an initial observation.

Returns:the initial observation.
Return type:observation (object)
step(action)[source]

Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.

Accepts an action and returns a tuple (observation, reward, done, info).

Parameters:action (object) – an action provided by the agent
Returns:
  • observation (object) – agent’s observation of the current environment
  • reward (float) – amount of reward returned after previous action
  • done (bool) – whether the episode has ended, in which case further step() calls will return undefined results
  • info (dict) – contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)
class genrl.environments.time_limit.TimeLimit(env, max_episode_len=None)[source]

Bases: gym.core.Wrapper

reset(**kwargs)[source]

Resets the state of the environment and returns an initial observation.

Returns:the initial observation.
Return type:observation (object)
step(action)[source]

Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.

Accepts an action and returns a tuple (observation, reward, done, info).

Parameters:action (object) – an action provided by the agent
Returns:
  • observation (object) – agent’s observation of the current environment
  • reward (float) – amount of reward returned after previous action
  • done (bool) – whether the episode has ended, in which case further step() calls will return undefined results
  • info (dict) – contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)

Module contents

Core

ActorCritic

class genrl.core.actor_critic.CNNActorCritic(framestack: int, action_dim: gym.spaces.space.Space, policy_layers: Tuple = (256, ), value_layers: Tuple = (256, ), val_type: str = 'V', discrete: bool = True, *args, **kwargs)[source]

Bases: genrl.core.base.BaseActorCritic

CNN Actor Critic

Parameters:
  • framestack (int) – Number of previous frames to stack together
  • action_dim (int) – Action dimensions of the environment
  • fc_layers (tuple or list) – Sizes of hidden layers
  • val_type (str) – Specifies type of value function: “V” for V(s), “Qs” for Q(s), “Qsa” for Q(s,a)
  • discrete (bool) – True if action space is discrete, else False
get_action(state: torch.Tensor, deterministic: bool = False) → torch.Tensor[source]

Get action from the Actor based on input

Parameters:
  • state (Tensor) – The state being passed as input to the Actor
  • deterministic (boolean) – True if the action space is deterministic, else False
Returns:action
get_value(inp: torch.Tensor) → torch.Tensor[source]

Get value from the Critic based on input

Parameters:inp (Tensor) – Input to the Critic
Returns:value
class genrl.core.actor_critic.MlpActorCritic(state_dim: gym.spaces.space.Space, action_dim: gym.spaces.space.Space, policy_layers: Tuple = (32, 32), value_layers: Tuple = (32, 32), val_type: str = 'V', discrete: bool = True, **kwargs)[source]

Bases: genrl.core.base.BaseActorCritic

MLP Actor Critic

state_dim

State dimensions of the environment

Type:int
action_dim

Action space dimensions of the environment

Type:int
hidden

Hidden layers in the MLP

Type:list or tuple
val_type

Value type of the critic network

Type:str
discrete

True if the action space is discrete, else False

Type:bool
sac

True if a SAC-like network is needed, else False

Type:bool
activation

Activation function to be used. Can be either “tanh” or “relu”

Type:str
class genrl.core.actor_critic.MlpSingleActorMultiCritic(state_dim: gym.spaces.space.Space, action_dim: gym.spaces.space.Space, policy_layers: Tuple = (32, 32), value_layers: Tuple = (32, 32), val_type: str = 'V', discrete: bool = True, num_critics: int = 2, **kwargs)[source]

Bases: genrl.core.base.BaseActorCritic

MLP Actor Critic

state_dim

State dimensions of the environment

Type:int
action_dim

Action space dimensions of the environment

Type:int
hidden

Hidden layers in the MLP

Type:list or tuple
val_type

Value type of the critic network

Type:str
discrete

True if the action space is discrete, else False

Type:bool
num_critics

Number of critics in the architecture

Type:int
sac

True if a SAC-like network is needed, else False

Type:bool
activation

Activation function to be used. Can be either “tanh” or “relu”

Type:str
forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

get_action(state: torch.Tensor, deterministic: bool = False)[source]

Get action from the Actor based on input

Parameters:
  • state (Tensor) – The state being passed as input to the Actor
  • deterministic (boolean) – True if the action space is deterministic, else False
Returns:action
get_value(state: torch.Tensor, mode='first') → torch.Tensor[source]

Get Values from the Critic

Parameters:
  • state (torch.Tensor) – The state(s) being passed to the critics
  • mode (str) – Which values should be returned. “both” –> both values are returned, “min” –> the minimum of the two values is returned, “first” –> only the value from the first critic is returned
Returns:List of values as estimated by each individual critic
Return type:values (list)
genrl.core.actor_critic.get_actor_critic_from_name(name_: str)[source]

Returns Actor Critic given the type of the Actor Critic

Parameters:ac_name (str) – Name of the policy needed
Returns:Actor Critic class to be used
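
A rough usage sketch of the MLP actor-critic for a CartPole-like task follows; the state and action dimensions are passed as plain integers and the return values are read off the documentation above, so treat the exact types as assumptions rather than the definitive API.

import torch
from genrl.core.actor_critic import MlpActorCritic

# CartPole-like: 4-dimensional observations, 2 discrete actions (illustrative values)
ac = MlpActorCritic(state_dim=4, action_dim=2,
                    policy_layers=(32, 32), value_layers=(32, 32),
                    val_type="V", discrete=True)
state = torch.randn(1, 4)
action = ac.get_action(state, deterministic=False)   # action from the actor
value = ac.get_value(state)                          # V(s) from the critic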

Base

class genrl.core.base.BaseActorCritic[source]

Bases: torch.nn.modules.module.Module

Basic implementation of a general Actor Critic

get_action(state: torch.Tensor, deterministic: bool = False) → torch.Tensor[source]

Get action from the Actor based on input

Parameters:
  • state (Tensor) – The state being passed as input to the Actor
  • deterministic (boolean) – True if the action space is deterministic, else False
Returns:action
get_value(state: torch.Tensor) → torch.Tensor[source]

Get value from the Critic based on input

Parameters:state (Tensor) – Input to the Critic
Returns:value
class genrl.core.base.BasePolicy(state_dim: int, action_dim: int, hidden: Tuple, discrete: bool, **kwargs)[source]

Bases: torch.nn.modules.module.Module

Basic implementation of a general Policy

Parameters:
  • state_dim (int) – State dimensions of the environment
  • action_dim (int) – Action dimensions of the environment
  • hidden (tuple or list) – Sizes of hidden layers
  • discrete (bool) – True if action space is discrete, else False
forward(state: torch.Tensor) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]

Defines the computation performed at every call.

Parameters:state (Tensor) – The state being passed as input to the policy
get_action(state: torch.Tensor, deterministic: bool = False) → torch.Tensor[source]

Get action from policy based on input

Parameters:
  • state (Tensor) – The state being passed as input to the policy
  • deterministic (boolean) – True if the action space is deterministic, else False
Returns:action
class genrl.core.base.BaseValue(state_dim: int, action_dim: int)[source]

Bases: torch.nn.modules.module.Module

Basic implementation of a general Value function

forward(state: torch.Tensor) → torch.Tensor[source]

Defines the computation performed at every call.

Parameters:state (Tensor) – Input to value function
get_value(state: torch.Tensor) → torch.Tensor[source]

Get value from value function based on input

Parameters:state (Tensor) – Input to value function
Returns:Value

Buffers

class genrl.core.buffers.PrioritizedBuffer(capacity: int, alpha: float = 0.6, beta: float = 0.4)[source]

Bases: object

Implements the Prioritized Experience Replay Mechanism

Parameters:
  • capacity (int) – Size of the replay buffer
  • alpha (float) – Level of prioritization
pos
push(inp: Tuple) → None[source]

Adds new experience to buffer

Parameters:inp (tuple) – Tuple containing state, action, reward, next_state and done
Returns:None
sample(batch_size: int, beta: float = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]
Returns randomly sampled memories from replay memory along with their respective indices and weights

Parameters:
  • batch_size (int) – Number of samples per batch
  • beta (float) – Bias exponent used to correct Importance Sampling (IS) weights
Returns:Tuple containing states, actions, next_states, rewards, dones, indices and weights

update_priorities(batch_indices: Tuple, batch_priorities: Tuple) → None[source]

Updates list of priorities with new order of priorities

Parameters:
  • batch_indices (list or tuple) – List of indices of batch
  • batch_priorities (list or tuple) – List of priorities of the batch at the specific indices
class genrl.core.buffers.PrioritizedReplayBufferSamples(states, actions, rewards, next_states, dones, indices, weights)[source]

Bases: tuple

actions

Alias for field number 1

dones

Alias for field number 4

indices

Alias for field number 5

next_states

Alias for field number 3

rewards

Alias for field number 2

states

Alias for field number 0

weights

Alias for field number 6

class genrl.core.buffers.PushReplayBuffer(capacity: int)[source]

Bases: object

Implements the basic Experience Replay Mechanism

Parameters:capacity (int) – Size of the replay buffer
push(inp: Tuple) → None[source]

Adds new experience to buffer

Parameters:inp (tuple) – Tuple containing state, action, reward, next_state and done
Returns:None
sample(batch_size: int) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]

Returns randomly sampled experiences from replay memory

Parameters:batch_size (int) – Number of samples per batch
Returns:Tuple containing state, action, reward, next_state and done
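
A minimal push/sample sketch for the basic experience replay buffer (the transition layout follows the documented push tuple; dummy values are used for illustration):

import numpy as np
from genrl.core.buffers import PushReplayBuffer

buffer = PushReplayBuffer(capacity=10000)
state, next_state = np.zeros(4), np.ones(4)
# transition tuple: (state, action, reward, next_state, done)
buffer.push((state, 0, 1.0, next_state, False))
batch = buffer.sample(batch_size=1)   # tuple of states, actions, rewards, next_states, dones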

class genrl.core.buffers.ReplayBuffer(size, env)[source]

Bases: object

extend(inp)[source]
push(inp)[source]
sample(batch_size)[source]
class genrl.core.buffers.ReplayBufferSamples(states, actions, rewards, next_states, dones)[source]

Bases: tuple

actions

Alias for field number 1

dones

Alias for field number 4

next_states

Alias for field number 3

rewards

Alias for field number 2

states

Alias for field number 0

Noise

class genrl.core.noise.ActionNoise(mean: float, std: float)[source]

Bases: abc.ABC

Base class for Action Noise

Parameters:
  • mean (float) – Mean of noise distribution
  • std (float) – Standard deviation of noise distribution
mean

Returns mean of noise distribution

std

Returns standard deviation of noise distribution

class genrl.core.noise.NoisyLinear(in_features: int, out_features: int, std_init: float = 0.4)[source]

Bases: torch.nn.modules.module.Module

Noisy Linear Layer Class

Class representing a noisy version of nn.Linear

in_features

Input dimensions

Type:int
out_features

Output dimensions

Type:int
std_init

Weight initialisation constant

Type:float
forward(state: torch.Tensor) → torch.Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

reset_noise() → None[source]

Reset noise components of layer

reset_parameters() → None[source]

Reset parameters of layer

class genrl.core.noise.NormalActionNoise(mean: float, std: float)[source]

Bases: genrl.core.noise.ActionNoise

Normal implementation of Action Noise

Parameters:
  • mean (float) – Mean of noise distribution
  • std (float) – Standard deviation of noise distribution
reset() → None[source]
class genrl.core.noise.OrnsteinUhlenbeckActionNoise(mean: float, std: float, theta: float = 0.15, dt: float = 0.01, initial_noise: numpy.ndarray = None)[source]

Bases: genrl.core.noise.ActionNoise

Ornstein Uhlenbeck implementation of Action Noise

Parameters:
  • mean (float) – Mean of noise distribution
  • std (float) – Standard deviation of noise distribution
  • theta (float) – Parameter used to solve the Ornstein Uhlenbeck process
  • dt (float) – Small parameter used to solve the Ornstein Uhlenbeck process
  • initial_noise (Numpy array) – Initial noise distribution
reset() → None[source]

Reset the initial noise value for the noise distribution sampling
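
In continuous-control agents the noise object is typically sampled and added to the deterministic action. A brief sketch, assuming the noise instance is callable (the usual convention for such classes, not confirmed by the documentation above):

import numpy as np
from genrl.core.noise import OrnsteinUhlenbeckActionNoise

noise = OrnsteinUhlenbeckActionNoise(mean=0.0, std=0.2, theta=0.15, dt=0.01)
action = np.array([0.5])
noisy_action = action + noise()   # assumed callable; adds temporally correlated noise
noise.reset()                     # reset the internal state between episodes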

Policies

class genrl.core.policies.CNNPolicy(framestack: int, action_dim: int, hidden: Tuple = (32, 32), discrete: bool = True, *args, **kwargs)[source]

Bases: genrl.core.base.BasePolicy

CNN Policy

Parameters:
  • framestack (int) – Number of previous frames to stack together
  • action_dim (int) – Action dimensions of the environment
  • fc_layers (tuple or list) – Sizes of hidden layers
  • discrete (bool) – True if action space is discrete, else False
  • channels (list or tuple) – Channel sizes for cnn layers
forward(state: numpy.ndarray) → numpy.ndarray[source]

Defines the computation performed at every call.

Parameters:state (Tensor) – The state being passed as input to the policy
class genrl.core.policies.MlpPolicy(state_dim: int, action_dim: int, hidden: Tuple = (32, 32), discrete: bool = True, *args, **kwargs)[source]

Bases: genrl.core.base.BasePolicy

MLP Policy

Parameters:
  • state_dim (int) – State dimensions of the environment
  • action_dim (int) – Action dimensions of the environment
  • hidden (tuple or list) – Sizes of hidden layers
  • discrete (bool) – True if action space is discrete, else False
genrl.core.policies.get_policy_from_name(name_: str)[source]

Returns policy given the name of the policy

Parameters:name (str) – Name of the policy needed
Returns:Policy Function to be used

RolloutStorage

class genrl.core.rollout_storage.BaseBuffer(buffer_size: int, env: Union[gym.core.Env, genrl.environments.vec_env.vector_envs.VecEnv], device: Union[torch.device, str] = 'cpu')[source]

Bases: object

Base class that represents a buffer (rollout or replay)

Parameters:
  • buffer_size – (int) Max number of elements in the buffer
  • env – (Environment) The environment being trained on
  • device – (Union[torch.device, str]) PyTorch device to which the values will be converted
  • n_envs – (int) Number of parallel environments
add(*args, **kwargs) → None[source]

Add elements to the buffer.

extend(*args, **kwargs) → None[source]

Add a new batch of transitions to the buffer

reset() → None[source]

Reset the buffer.

sample(batch_size: int)[source]
Parameters:batch_size – (int) Number of elements to sample
Returns:(Union[RolloutBufferSamples, ReplayBufferSamples])
size() → int[source]
Returns:(int) The current size of the buffer
static swap_and_flatten(arr: numpy.ndarray) → numpy.ndarray[source]

Swap and then flatten axes 0 (buffer_size) and 1 (n_envs) to convert shape from [n_steps, n_envs, …] (where … is the shape of the features) to [n_steps * n_envs, …] (which maintains the order) :param arr: (np.ndarray) :return: (np.ndarray)

to_torch(array: numpy.ndarray, copy: bool = True) → torch.Tensor[source]

Convert a numpy array to a PyTorch tensor. Note: it copies the data by default.

Parameters:
  • array – (np.ndarray) Array to convert
  • copy – (bool) Whether or not to copy the data (may be useful to avoid changing things by reference)
Returns:(torch.Tensor)
class genrl.core.rollout_storage.ReplayBufferSamples(observations, actions, next_observations, dones, rewards)[source]

Bases: tuple

actions

Alias for field number 1

dones

Alias for field number 3

next_observations

Alias for field number 2

observations

Alias for field number 0

rewards

Alias for field number 4

class genrl.core.rollout_storage.RolloutBuffer(buffer_size: int, env: Union[gym.core.Env, genrl.environments.vec_env.vector_envs.VecEnv], device: Union[torch.device, str] = 'cpu', gae_lambda: float = 1, gamma: float = 0.99)[source]

Bases: genrl.core.rollout_storage.BaseBuffer

Rollout buffer used in on-policy algorithms like A2C/PPO.

Parameters:
  • buffer_size – (int) Max number of elements in the buffer
  • env – (Environment) The environment being trained on
  • device – (torch.device) PyTorch device
  • gae_lambda – (float) Factor for trade-off of bias vs variance for Generalized Advantage Estimator. Equivalent to classic advantage when set to 1.
  • gamma – (float) Discount factor
  • n_envs – (int) Number of parallel environments
add(obs: numpy.ndarray, action: numpy.ndarray, reward: numpy.ndarray, done: numpy.ndarray, value: torch.Tensor, log_prob: torch.Tensor) → None[source]
Parameters:
  • obs – (np.ndarray) Observation
  • action – (np.ndarray) Action
  • reward – (np.ndarray)
  • done – (np.ndarray) End of episode signal.
  • value – (torch.Tensor) estimated value of the current state following the current policy.
  • log_prob – (torch.Tensor) log probability of the action following the current policy.
compute_returns_and_advantage(last_value: torch.Tensor, dones: numpy.ndarray, use_gae: bool = False) → None[source]

Post-processing step: compute the returns (sum of discounted rewards) and advantage (A(s) = R - V(s)). Adapted from Stable-Baselines PPO2.

Parameters:
  • last_value – (torch.Tensor)
  • dones – (np.ndarray)
  • use_gae – (bool) Whether to use Generalized Advantage Estimation or normal advantage for advantage computation
get(batch_size: Optional[int] = None) → Generator[genrl.core.rollout_storage.RolloutBufferSamples, None, None][source]
reset() → None[source]

Reset the buffer.
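
For reference, the advantage computation mentioned above can be written in a few lines. This is a self-contained NumPy sketch of Generalized Advantage Estimation for a single environment, not the library's internal implementation.

import numpy as np

def gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    # delta_t = r_t + gamma * V(s_{t+1}) * (1 - done_t) - V(s_t)
    # A_t = sum_k (gamma * lam)^k * delta_{t+k}, accumulated backwards in time
    rewards, values, dones = map(np.asarray, (rewards, values, dones))
    advantages = np.zeros_like(rewards, dtype=np.float64)
    running = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value * (1.0 - dones[t]) - values[t]
        running = delta + gamma * lam * (1.0 - dones[t]) * running
        advantages[t] = running
        next_value = values[t]
    returns = advantages + values   # returns used as the value-function target
    return advantages, returns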

class genrl.core.rollout_storage.RolloutBufferSamples(observations, actions, old_values, old_log_prob, advantages, returns)[source]

Bases: tuple

actions

Alias for field number 1

advantages

Alias for field number 4

observations

Alias for field number 0

old_log_prob

Alias for field number 3

old_values

Alias for field number 2

returns

Alias for field number 5

class genrl.core.rollout_storage.RolloutReturn(episode_reward, episode_timesteps, n_episodes, continue_training)[source]

Bases: tuple

continue_training

Alias for field number 3

episode_reward

Alias for field number 0

episode_timesteps

Alias for field number 1

n_episodes

Alias for field number 2

Values

class genrl.core.values.CnnCategoricalValue(*args, **kwargs)[source]

Bases: genrl.core.values.CnnNoisyValue

Class for Categorical DQN’s CNN Q-Value function

framestack

No. of frames being passed into the Q-value function

Type:int
action_dim

Action space dimensions

Type:int
fc_layers

Fully connected layer dimensions

Type:tuple
noisy_layers

Noisy layer dimensions

Type:tuple
num_atoms

Number of atoms used to discretise the Categorical DQN value distribution

Type:int
forward(state: torch.Tensor) → torch.Tensor[source]

Defines the computation performed at every call.

Parameters:state (Tensor) – Input to value function
class genrl.core.values.CnnDuelingValue(*args, **kwargs)[source]

Bases: genrl.core.values.CnnValue

Class for Dueling DQN’s CNN Q-Value function

framestack

No. of frames being passed into the Q-value function

Type:int
action_dim

Action space dimensions

Type:int
fc_layers

Hidden layer dimensions

Type:tuple
forward(inp: torch.Tensor) → torch.Tensor[source]

Defines the computation performed at every call.

Parameters:state (Tensor) – Input to value function
class genrl.core.values.CnnNoisyValue(*args, **kwargs)[source]

Bases: genrl.core.values.CnnValue, genrl.core.values.MlpNoisyValue

Class for Noisy DQN’s CNN Q-Value function

state_dim

Number of previous frames to stack together

Type:int
action_dim

Action space dimensions

Type:int
fc_layers

Fully connected layer dimensions

Type:tuple
noisy_layers

Noisy layer dimensions

Type:tuple
num_atoms

Number of atoms used to discretise the Categorical DQN value distribution

Type:int
forward(state: numpy.ndarray) → numpy.ndarray[source]

Defines the computation performed at every call.

Parameters:state (Tensor) – Input to value function
class genrl.core.values.CnnValue(*args, **kwargs)[source]

Bases: genrl.core.values.MlpValue

CNN Value Function class

Parameters:
  • framestack (int) – Number of previous frames to stack together
  • action_dim (int) – Action dimension of environment
  • val_type (string) – Specifies type of value function: “V” for V(s), “Qs” for Q(s), “Qsa” for Q(s,a)
  • fc_layers (tuple or list) – Sizes of hidden layers
forward(state: numpy.ndarray) → numpy.ndarray[source]

Defines the computation performed at every call.

Parameters:state (Tensor) – Input to value function
class genrl.core.values.MlpCategoricalValue(*args, **kwargs)[source]

Bases: genrl.core.values.MlpNoisyValue

Class for Categorical DQN’s MLP Q-Value function

state_dim

Observation space dimensions

Type:int
action_dim

Action space dimensions

Type:int
fc_layers

Fully connected layer dimensions

Type:tuple
noisy_layers

Noisy layer dimensions

Type:tuple
num_atoms

Number of atoms used to discretise the Categorical DQN value distribution

Type:int
forward(state: torch.Tensor) → torch.Tensor[source]

Defines the computation performed at every call.

Parameters:state (Tensor) – Input to value function
class genrl.core.values.MlpDuelingValue(*args, **kwargs)[source]

Bases: genrl.core.values.MlpValue

Class for Dueling DQN’s MLP Q-Value function

state_dim

Observation space dimensions

Type:int
action_dim

Action space dimensions

Type:int
hidden

Hidden layer dimensions

Type:tuple
forward(state: torch.Tensor) → torch.Tensor[source]

Defines the computation performed at every call.

Parameters:state (Tensor) – Input to value function
class genrl.core.values.MlpNoisyValue(*args, noisy_layers: Tuple = (128, 512), **kwargs)[source]

Bases: genrl.core.values.MlpValue

reset_noise() → None[source]

Resets noise for any Noisy layers in Value function

class genrl.core.values.MlpValue(state_dim: int, action_dim: int = None, val_type: str = 'V', fc_layers: Tuple = (32, 32), **kwargs)[source]

Bases: genrl.core.base.BaseValue

MLP Value Function class

Parameters:
  • state_dim (int) – State dimensions of environment
  • action_dim (int) – Action dimensions of environment
  • val_type (string) – Specifies type of value function: “V” for V(s), “Qs” for Q(s), “Qsa” for Q(s,a)
  • hidden (tuple or list) – Sizes of hidden layers
genrl.core.values.get_value_from_name(name_: str) → Union[Type[genrl.core.values.MlpValue], Type[genrl.core.values.CnnValue]][source]

Gets the value function given the name of the value function

Parameters:name (string) – Name of the value function needed
Returns:Value function

Utilities

Logger

class genrl.utils.logger.CSVLogger(logdir: str)[source]

Bases: object

CSV Logging class

Parameters:logdir (string) – Directory to save log at
close() → None[source]

Close the logger

write(kvs: Dict[str, Any], log_key) → None[source]

Add entry to logger

Parameters:kvs (dict) – Entries to be logged
class genrl.utils.logger.HumanOutputFormat(logdir: str)[source]

Bases: object

Output from a log file in a human readable format

Parameters:logdir (string) – Directory at which log is present
close() → None[source]
max_key_len(kvs: Dict[str, Any]) → None[source]

Finds max key length

Parameters:kvs (dict) – Entries to be logged
round(num: float) → float[source]

Returns a rounded float value depending on self.maxlen

Parameters:num (float) – Value to round
write(kvs: Dict[str, Any], log_key) → None[source]

Log the entry out in human readable format

Parameters:kvs (dict) – Entries to be logged
write_to_file(kvs: Dict[str, Any], file=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>) → None[source]

Log the entry out in human readable format

Parameters:
  • kvs (dict) – Entries to be logged
  • file (io.TextIOWrapper) – Name of file to write logs to
class genrl.utils.logger.Logger(logdir: str = None, formats: List[str] = ['csv'])[source]

Bases: object

Logger class to log important information

Parameters:
  • logdir (string) – Directory to save log at
  • formats (list) – Formatting of each log [‘csv’, ‘stdout’, ‘tensorboard’]
close() → None[source]

Close the logger

formats

Return save format(s)

logdir

Return log directory

write(kvs: Dict[str, Any], log_key: str = 'timestep') → None[source]

Add entry to logger

Parameters:
  • kvs (dict) – Entry to be logged
  • log_key (str) – Key plotted on the x axis
class genrl.utils.logger.TensorboardLogger(logdir: str)[source]

Bases: object

Tensorboard Logging class

Parameters:logdir (string) – Directory to save log at
close() → None[source]

Close the logger

write(kvs: Dict[str, Any], log_key: str = 'timestep') → None[source]

Add entry to logger

Parameters:
  • kvs (dict) – Entries to be logged
  • log_key (str) – Key plotted on x_axis
genrl.utils.logger.get_logger_by_name(name: str)[source]

Gets the logger given the type of logger

Parameters:name (string) – Name of the logger needed
Returns:Logger

Utilities

genrl.utils.utils.cnn(channels: Tuple = (4, 16, 32), kernel_sizes: Tuple = (8, 4), strides: Tuple = (4, 2), **kwargs) → Tuple[source]
Generates a CNN model given input dimensions, channels, kernel_sizes and strides

Parameters:
  • channels (tuple) – Input and output channels before and after each convolution
  • kernel_sizes (tuple) – Kernel sizes for each convolution
  • strides (tuple) – Strides for each convolution
  • in_size (int) – Input dimensions (assuming square input)
Returns:Convolutional Neural Network with convolutional layers and activation layers

genrl.utils.utils.get_env_properties(env: Union[gym.core.Env, genrl.environments.vec_env.vector_envs.VecEnv], network: Union[str, Any] = 'mlp') → Tuple[int][source]

Finds important properties of environment

Parameters:
  • env (Gym Environment) – Environment that the agent is interacting with
  • network (str) – Type of network architecture, e.g. “mlp”, “cnn”
Returns:State space dimensions, action space dimensions, discreteness of action space and action limit (highest action value)
Return type:int, float, …; int, float, …; bool; int, float, …
genrl.utils.utils.get_model(type_: str, name_: str) → Union[source]

Utility to get the class of required function

Parameters:
  • type_ (string) – “ac” for Actor Critic, “v” for Value, “p” for Policy
  • name_ (string) – Name of the specific structure of model, e.g. “mlp” or “cnn”
Returns:Required class, e.g. MlpActorCritic
genrl.utils.utils.mlp(sizes: Tuple, activation: str = 'relu', sac: bool = False)[source]

Generates an MLP model given sizes of each layer

Parameters:
  • sizes (tuple or list) – Sizes of hidden layers
  • activation (str) – Activation function to be used. [“tanh”, “relu”]
  • sac (bool) – True if Soft Actor Critic is being used, else False
Returns:Neural Network with fully-connected linear layers and activation layers
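
For instance, a small two-hidden-layer network mapping 4 inputs to 2 outputs can be built with (sizes are illustrative):

from genrl.utils.utils import mlp

net = mlp(sizes=(4, 32, 32, 2), activation="relu")   # fully-connected layers with activations in between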

genrl.utils.utils.noisy_mlp(fc_layers: List[int], noisy_layers: List[int], activation='relu')[source]

Noisy MLP generating helper function

Parameters:
  • fc_layers (list of int) – List of fully connected layers
  • noisy_layers (list of int) – List of noisy layers
  • activation (str) – Activation function to be used. [“tanh”, “relu”]
Returns:

Noisy MLP model

genrl.utils.utils.safe_mean(log: List[int])[source]

Returns the mean of the logs, or 0 if there are no elements in the logs

genrl.utils.utils.set_seeds(seed: int, env: Union[gym.core.Env, genrl.environments.vec_env.vector_envs.VecEnv] = None) → None[source]

Sets seeds for reproducibility

Parameters:
  • seed (int) – Seed Value
  • env (Gym Environment) – Optionally pass gym environment to set its seed

Models

class genrl.utils.models.TabularModel(s_dim: int, a_dim: int)[source]

Bases: object

Sample-based tabular model class for deterministic, discrete environments

Parameters:
  • s_dim (int) – environment state dimension
  • a_dim (int) – environment action dimension
add(state: numpy.ndarray, action: numpy.ndarray, reward: float, next_state: numpy.ndarray) → None[source]

Add transition to model

Parameters:
  • state (float array) – state
  • action (int) – action
  • reward (int) – reward
  • next_state (float array) – next state

is_empty() → bool[source]

Check if the model has been updated or not

Returns:True if model not updated yet
Return type:bool
sample() → Tuple[source]

Sample a state-action pair from the model

Returns:state and action
Return type:int, float, .. ; int, float, ..
step(state: numpy.ndarray, action: numpy.ndarray) → Tuple[source]

Return the consequence of taking the given action at the given state

Returns:reward and next state
Return type:int; int, float, ..
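
A short sketch of Dyna-style usage with plain integer states for a small discrete environment (values are illustrative):

from genrl.utils.models import TabularModel

model = TabularModel(s_dim=16, a_dim=4)              # e.g. a 4x4 gridworld with 4 actions
model.add(state=0, action=1, reward=0, next_state=4)
s, a = model.sample()                                # previously observed state-action pair
r, s_next = model.step(s, a)                         # replay the stored consequence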
genrl.utils.models.get_model_from_name(name_: str)[source]

get model object from name

Parameters:name (str) – name of the model [‘tabular’]
Returns:the model

Trainers

On-Policy Trainer

On Policy Trainer Class

Trainer class for all the On Policy Agents: A2C, PPO1 and VPG

genrl.trainers.OnPolicyTrainer.agent

Agent algorithm object

Type:object
genrl.trainers.OnPolicyTrainer.env

Environment

Type:object
genrl.trainers.OnPolicyTrainer.log_mode

List of different kinds of logging. Supported: [“csv”, “stdout”, “tensorboard”]

Type:list of str
genrl.trainers.OnPolicyTrainer.log_key

Key plotted on x_axis. Supported: [“timestep”, “episode”]

Type:str
genrl.trainers.OnPolicyTrainer.log_interval

Timesteps between successive logging of parameters onto the console

Type:int
genrl.trainers.OnPolicyTrainer.logdir

Directory where log files should be saved.

Type:str
genrl.trainers.OnPolicyTrainer.epochs

Total number of epochs to train for

Type:int
genrl.trainers.OnPolicyTrainer.max_timesteps

Maximum limit of timesteps to train for

Type:int
genrl.trainers.OnPolicyTrainer.off_policy

True if the agent is an off policy agent, False if it is on policy

Type:bool
genrl.trainers.OnPolicyTrainer.save_interval

Timesteps between successive saves of the agent’s important hyperparameters

Type:int
genrl.trainers.OnPolicyTrainer.save_model

Directory where the checkpoints of agent parameters should be saved

Type:str
genrl.trainers.OnPolicyTrainer.run_num

A run number allotted to the save of parameters

Type:int
genrl.trainers.OnPolicyTrainer.load_model

File to load saved parameter checkpoint from

Type:str
genrl.trainers.OnPolicyTrainer.render

True if environment is to be rendered during training, else False

Type:bool
genrl.trainers.OnPolicyTrainer.evaluate_episodes

Number of episodes to evaluate for

Type:int
genrl.trainers.OnPolicyTrainer.seed

Set seed for reproducibility

Type:int
genrl.trainers.OnPolicyTrainer.n_envs

Number of environments
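
Putting these attributes together, a typical on-policy training run might look like the sketch below. It assumes the PPO1 agent and VectorEnv wrapper provided by the library; the keyword arguments and their values are illustrative:

from genrl.agents import PPO1
from genrl.environments import VectorEnv
from genrl.trainers import OnPolicyTrainer

env = VectorEnv("CartPole-v0", n_envs=2)
agent = PPO1("mlp", env)

trainer = OnPolicyTrainer(
    agent,
    env,
    log_mode=["stdout", "tensorboard"],
    logdir="logs/",
    log_key="timestep",
    epochs=100,
)
trainer.train()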

Off-Policy Trainer

Off Policy Trainer Class

Trainer class for all the Off Policy Agents: DQN (all variants), DDPG, TD3 and SAC

genrl.trainers.OffPolicyTrainer.agent

Agent algorithm object

Type:object
genrl.trainers.OffPolicyTrainer.env

Environment

Type:object
genrl.trainers.OffPolicyTrainer.buffer

Replay Buffer object

Type:object
genrl.trainers.OffPolicyTrainer.max_ep_len

Maximum Episode length for training

Type:int
genrl.trainers.OffPolicyTrainer.warmup_steps

Number of warmup steps during which random actions are taken to encourage exploration

Type:int
genrl.trainers.OffPolicyTrainer.start_update

Timesteps after which the agent networks should start updating

Type:int
genrl.trainers.OffPolicyTrainer.update_interval

Timesteps between target network updates

Type:int
genrl.trainers.OffPolicyTrainer.log_mode

List of different kinds of logging. Supported: [“csv”, “stdout”, “tensorboard”]

Type:list of str
genrl.trainers.OffPolicyTrainer.log_key

Key plotted on x_axis. Supported: [“timestep”, “episode”]

Type:str
genrl.trainers.OffPolicyTrainer.log_interval

Timesteps between successive logging of parameters onto the console

Type:int
genrl.trainers.OffPolicyTrainer.logdir

Directory where log files should be saved.

Type:str
genrl.trainers.OffPolicyTrainer.epochs

Total number of epochs to train for

Type:int
genrl.trainers.OffPolicyTrainer.max_timesteps

Maximum limit of timesteps to train for

Type:int
genrl.trainers.OffPolicyTrainer.off_policy

True if the agent is an off policy agent, False if it is on policy

Type:bool
genrl.trainers.OffPolicyTrainer.save_interval

Timesteps between successive saves of the agent’s important hyperparameters

Type:int
genrl.trainers.OffPolicyTrainer.save_model

Directory where the checkpoints of agent parameters should be saved

Type:str
genrl.trainers.OffPolicyTrainer.run_num

A run number allotted to the save of parameters

Type:int
genrl.trainers.OffPolicyTrainer.load_model

File to load saved parameter checkpoint from

Type:str
genrl.trainers.OffPolicyTrainer.render

True if environment is to be rendered during training, else False

Type:bool
genrl.trainers.OffPolicyTrainer.evaluate_episodes

Number of episodes to evaluate for

Type:int
genrl.trainers.OffPolicyTrainer.seed

Set seed for reproducibility

Type:int
genrl.trainers.OffPolicyTrainer.n_envs

Number of environments
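
Analogously, an off-policy run with a replay buffer might be set up as in the sketch below, assuming the DQN agent and VectorEnv wrapper provided by the library; the keyword names follow the attribute list above and the values are illustrative:

from genrl.agents import DQN
from genrl.environments import VectorEnv
from genrl.trainers import OffPolicyTrainer

env = VectorEnv("CartPole-v0")
agent = DQN("mlp", env)

trainer = OffPolicyTrainer(
    agent,
    env,
    log_mode=["stdout"],
    max_ep_len=500,
    warmup_steps=1000,
    update_interval=50,
    epochs=100,
)
trainer.train()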

Classical Trainer

Global trainer class for classical RL algorithms

Parameters:
  • agent (object) – Algorithm object to train
  • env (Gym environment) – Standard gym environment to train on
  • mode (str) – Mode of value function update [‘learn’, ‘plan’, ‘dyna’]
  • model (str) – Model to use for planning [‘tabular’]
  • n_episodes (int) – Number of training episodes
  • plan_n_steps (int) – Number of planning steps per environment interaction
  • start_steps (int) – Number of initial exploration timesteps
  • seed (int) – Seed for random number generator
  • render (bool) – Render gym environment
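
A hypothetical usage sketch for the classical trainer in ‘dyna’ mode is shown below. The import paths and the train() call are assumptions based on the modules listed later in this documentation, and the agent and environment choices are only illustrative:

import gym
from genrl.agents import QLearning
from genrl.classical.common.trainer import Trainer

env = gym.make("FrozenLake-v0")
agent = QLearning(env)

# 'dyna' interleaves real interaction with planning steps on a tabular model
trainer = Trainer(
    agent,
    env,
    mode="dyna",
    model="tabular",
    n_episodes=100,
    plan_n_steps=3,
    start_steps=50,
    seed=42,
    render=False,
)
trainer.train()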

Deep Contextual Bandit Trainer

Bandit Trainer Class

Parameters:
  • agent (genrl.deep.bandit.dcb_agents.DCBAgent) – Agent to train.
  • bandit (genrl.deep.bandit.data_bandits.DataBasedBandit) – Bandit to train agent on.
  • logdir (str) – Path to directory to store logs in.
  • log_mode (List[str]) – List of modes for logging.
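
As a rough sketch, training a deep contextual bandit agent might look as follows. The agent and bandit class names (NeuralGreedyAgent, CovertypeDataBandit), the import path and the train() signature are assumptions and may differ from the actual API:

from genrl.bandit import CovertypeDataBandit, DCBTrainer, NeuralGreedyAgent

bandit = CovertypeDataBandit(download=True)   # hypothetical data-based bandit
agent = NeuralGreedyAgent(bandit)             # hypothetical DCBAgent subclass

trainer = DCBTrainer(agent, bandit, logdir="logs/", log_mode=["stdout"])
trainer.train(timesteps=1000)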

Multi Armed Bandit Trainer

Bandit Trainer Class

Parameters:
  • agent (genrl.deep.bandit.dcb_agents.DCBAgent) – Agent to train.
  • bandit (genrl.deep.bandit.data_bandits.DataBasedBandit) – Bandit to train agent on.
  • logdir (str) – Path to directory to store logs in.
  • log_mode (List[str]) – List of modes for logging.

Base Trainer

Base Trainer Class

To be inherited for specific use-cases

genrl.trainers.Trainer.agent

Agent algorithm object

Type:object
genrl.trainers.Trainer.env

Environment

Type:object
genrl.trainers.Trainer.log_mode

List of different kinds of logging. Supported: [“csv”, “stdout”, “tensorboard”]

Type:list of str
genrl.trainers.Trainer.log_key

Key plotted on x_axis. Supported: [“timestep”, “episode”]

Type:str
genrl.trainers.Trainer.log_interval

Timesteps between successive logging of parameters onto the console

Type:int
genrl.trainers.Trainer.logdir

Directory where log files should be saved.

Type:str
genrl.trainers.Trainer.epochs

Total number of epochs to train for

Type:int
genrl.trainers.Trainer.max_timesteps

Maximum limit of timesteps to train for

Type:int
genrl.trainers.Trainer.off_policy

True if the agent is an off policy agent, False if it is on policy

Type:bool
genrl.trainers.Trainer.save_interval

Timesteps between successive saves of the agent’s important hyperparameters

Type:int
genrl.trainers.Trainer.save_model

Directory where the checkpoints of agent parameters should be saved

Type:str
genrl.trainers.Trainer.run_num

A run number allotted to the save of parameters

Type:int
genrl.trainers.Trainer.load_model

File to load saved parameter checkpoint from

Type:str
genrl.trainers.Trainer.render

True if environment is to be rendered during training, else False

Type:bool
genrl.trainers.Trainer.evaluate_episodes

Number of episodes to evaluate for

Type:int
genrl.trainers.Trainer.seed

Set seed for reproducibility

Type:int
genrl.trainers.Trainer.n_envs

Number of environments

Common

Classical Common

genrl.classical.common.models

genrl.classical.common.trainer

genrl.classical.common.values

Bandit Common

genrl.bandit.core

genrl.bandit.trainer

genrl.bandit.agents.cb_agents.common.base_model

class genrl.agents.bandits.contextual.common.base_model.Model(layer, **kwargs)[source]

Bases: torch.nn.modules.module.Module, abc.ABC

Bayesian Neural Network used in Deep Contextual Bandit Models.

Parameters:
  • context_dim (int) – Length of context vector.
  • hidden_dims (List[int], optional) – Dimensions of hidden layers of network.
  • n_actions (int) – Number of actions that can be selected. Taken as length of output vector for network to predict.
  • init_lr (float, optional) – Initial learning rate.
  • max_grad_norm (float, optional) – Maximum norm of gradients for gradient clipping.
  • lr_decay (float, optional) – Decay rate for learning rate.
  • lr_reset (bool, optional) – Whether to reset learning rate every train interval. Defaults to False.
  • dropout_p (Optional[float], optional) – Probability for dropout. Defaults to None which implies dropout is not to be used.
  • noise_std (float) – Standard deviation of noise used in the network. Defaults to 0.1
use_dropout

Indicates whether or not dropout should be used in the forward pass.

Type:int
forward(context: torch.Tensor, **kwargs) → Dict[str, torch.Tensor][source]

Computes forward pass through the network.

Parameters:context (torch.Tensor) – The context vector to perform forward pass on.
Returns:Dictionary of outputs
Return type:Dict[str, torch.Tensor]
train_model(db: genrl.agents.bandits.contextual.common.transition.TransitionDB, epochs: int, batch_size: int)[source]

Trains the network on a given database for given epochs and batch_size.

Parameters:
  • db (TransitionDB) – The database of transitions to train on.
  • epochs (int) – Number of gradient steps to take.
  • batch_size (int) – The size of each batch to perform gradient descent on.

genrl.bandit.agents.cb_agents.common.bayesian

class genrl.agents.bandits.contextual.common.bayesian.BayesianLinear(in_features: int, out_features: int, bias: bool = True)[source]

Bases: torch.nn.modules.module.Module

Linear Layer for Bayesian Neural Networks.

Parameters:
  • in_features (int) – size of each input sample
  • out_features (int) – size of each output sample
  • bias (bool, optional) – Whether to use an additive bias. Defaults to True.
forward(x: torch.Tensor, kl: bool = True, frozen: bool = False) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]

Apply linear transformation to input.

The weight and bias are sampled for each forward pass from a normal distribution. The KL divergence of the sampled weight and bias can also be computed if specified.

Parameters:
  • x (torch.Tensor) – Input to be transformed
  • kl (bool, optional) – Whether to compute the KL divergence. Defaults to True.
  • frozen (bool, optional) – Whether to freeze current parameters. Defaults to False.
Returns:The transformed input and, optionally, the computed KL divergence value.
Return type:Tuple[torch.Tensor, Optional[torch.Tensor]]

reset_parameters() → None[source]

Resets weight and bias parameters of the layer.
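
A minimal sketch of the Bayesian layer in isolation, based on the forward() signature above (the shapes are arbitrary, and the behaviour of frozen=True is assumed to reuse the currently sampled parameters):

import torch
from genrl.agents.bandits.contextual.common.bayesian import BayesianLinear

layer = BayesianLinear(in_features=8, out_features=2)

x = torch.randn(4, 8)                              # batch of 4 contexts
out, kl = layer(x, kl=True)                        # weights and bias sampled for this pass
out_frozen, _ = layer(x, kl=False, frozen=True)    # assumed to reuse current parameters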

class genrl.agents.bandits.contextual.common.bayesian.BayesianNNBanditModel(**kwargs)[source]

Bases: genrl.agents.bandits.contextual.common.base_model.Model

Bayesian Neural Network used in Deep Contextual Bandit Models.

Parameters:
  • context_dim (int) – Length of context vector.
  • hidden_dims (List[int], optional) – Dimensions of hidden layers of network.
  • n_actions (int) – Number of actions that can be selected. Taken as length of output vector for network to predict.
  • init_lr (float, optional) – Initial learning rate.
  • max_grad_norm (float, optional) – Maximum norm of gradients for gradient clipping.
  • lr_decay (float, optional) – Decay rate for learning rate.
  • lr_reset (bool, optional) – Whether to reset learning rate every train interval. Defaults to False.
  • dropout_p (Optional[float], optional) – Probability for dropout. Defaults to None which implies dropout is not to be used.
  • noise_std (float) – Standard deviation of noise used in the network. Defaults to 0.1
use_dropout

Indicates whether or not dropout should be used in the forward pass.

Type:int
forward(context: torch.Tensor, kl: bool = True) → Dict[str, torch.Tensor][source]

Computes forward pass through the network.

Parameters:context (torch.Tensor) – The context vector to perform forward pass on.
Returns:Dictionary of outputs
Return type:Dict[str, torch.Tensor]

genrl.bandit.agents.cb_agents.common.neural

class genrl.agents.bandits.contextual.common.neural.NeuralBanditModel(**kwargs)[source]

Bases: genrl.agents.bandits.contextual.common.base_model.Model

Neural Network used in Deep Contextual Bandit Models.

Parameters:
  • context_dim (int) – Length of context vector.
  • hidden_dims (List[int], optional) – Dimensions of hidden layers of network.
  • n_actions (int) – Number of actions that can be selected. Taken as length of output vector for network to predict.
  • init_lr (float, optional) – Initial learning rate.
  • max_grad_norm (float, optional) – Maximum norm of gradients for gradient clipping.
  • lr_decay (float, optional) – Decay rate for learning rate.
  • lr_reset (bool, optional) – Whether to reset learning rate every train interval. Defaults to False.
  • dropout_p (Optional[float], optional) – Probability for dropout. Defaults to None which implies dropout is not to be used.
use_dropout

Indicates whether or not dropout should be used in the forward pass.

Type:bool
forward(context: torch.Tensor) → Dict[str, torch.Tensor][source]

Computes forward pass through the network.

Parameters:context (torch.Tensor) – The context vector to perform forward pass on.
Returns:Dictionary of outputs
Return type:Dict[str, torch.Tensor]
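
A sketch of constructing the neural bandit model directly, assuming the keyword arguments mirror the parameter list above and that the returned dictionary contains the network’s predictions (the values chosen are illustrative):

import torch
from genrl.agents.bandits.contextual.common.neural import NeuralBanditModel

model = NeuralBanditModel(
    context_dim=16,
    hidden_dims=[64, 64],
    n_actions=5,
    init_lr=1e-3,
    max_grad_norm=0.5,
    lr_decay=0.5,
    lr_reset=False,
    dropout_p=None,
)

context = torch.randn(1, 16)
outputs = model(context)   # Dict[str, torch.Tensor] of network outputs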

genrl.bandit.agents.cb_agents.common.transition

class genrl.agents.bandits.contextual.common.transition.TransitionDB(device: Union[str, torch.device] = 'cpu')[source]

Bases: object

Database for storing (context, action, reward) transitions.

Parameters:device (str) – Device to use for tensor operations. “cpu” for cpu or “cuda” for cuda. Defaults to “cpu”.
db

Dictionary containing list of transitions.

Type:dict
db_size

Number of transitions stored in database.

Type:int
device

Device to use for tensor operations.

Type:torch.device
add(context: torch.Tensor, action: int, reward: int)[source]

Add (context, action, reward) transition to database

Parameters:
  • context (torch.Tensor) – Context received
  • action (int) – Action taken
  • reward (int) – Reward received
get_data(batch_size: Optional[int] = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]

Get a batch of transition from database

Parameters:batch_size (Union[int, None], optional) – Size of batch required. Defaults to None which implies all transitions in the database are to be included in batch.
Returns:Tuple of stacked contexts, actions, and rewards tensors.
Return type:Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
get_data_for_action(action: int, batch_size: Optional[int] = None) → Tuple[torch.Tensor, torch.Tensor][source]

Get a batch of transition from database for a given action.

Parameters:
  • action (int) – The action to sample transitions for.
  • batch_size (Union[int, None], optional) – Size of batch required. Defaults to None which implies all transitions in the database are to be included in batch.
Returns:Tuple of stacked contexts and rewards tensors.
Return type:Tuple[torch.Tensor, torch.Tensor]
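
A short sketch of the transition database in use, based on the methods documented above (the context dimension and values are arbitrary):

import torch
from genrl.agents.bandits.contextual.common.transition import TransitionDB

db = TransitionDB(device="cpu")

# Store a few (context, action, reward) transitions
for _ in range(8):
    db.add(torch.randn(1, 16), action=2, reward=1)

contexts, actions, rewards = db.get_data(batch_size=4)
contexts_a, rewards_a = db.get_data_for_action(action=2, batch_size=4)

# A model can then be fit on the stored transitions, e.g. with
# BayesianNNBanditModel.train_model(db, epochs=10, batch_size=4).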