Vanilla Policy Gradient (VPG)

If you have explored Policy Gradient algorithms in RL, there is a high chance you have heard of PPO, DDPG, etc., but understanding them can be tricky if you’re just starting out.

VPG is arguably one of the easiest policy gradient algorithms to understand, while still performing reasonably well.

Let’s first understand policy gradient at a high level. In classical algorithms like Q-Learning and Monte Carlo, you optimise the outputs of the agent’s action-value function, which are then used to determine the optimal policy. In policy gradient, we go directly for the kill shot: we optimise the thing we actually want to use at the end, i.e. the policy.

That explains the “Policy” part of Policy Gradient. What about “Gradient”? It comes from the fact that we optimise the policy by gradient ascent (unlike the more popular gradient descent, here we want to increase the values, hence ascent). So that explains the name, but how does it actually work?
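
Before getting to the full algorithm, the "ascent" idea can be seen in isolation. The following is a minimal sketch, not GenRL code: it assumes a hypothetical two-action bandit with fixed rewards and a softmax policy over two logits, and it uses the exact gradient of the expected reward rather than a sampled estimate. All names and the step size are illustrative.

```python
import math

def softmax(logits):
    """Turn raw preferences (logits) into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_reward(logits, rewards):
    """J(theta): average reward when actions are drawn from the softmax policy."""
    return sum(p * r for p, r in zip(softmax(logits), rewards))

def policy_gradient(logits, rewards):
    """Exact gradient of J with respect to each logit: pi_j * (r_j - J)."""
    probs = softmax(logits)
    j = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - j) for p, r in zip(probs, rewards)]

rewards = [1.0, 0.0]   # hypothetical bandit: action 0 pays 1, action 1 pays 0
theta = [0.0, 0.0]     # start from a uniform policy
learning_rate = 0.5    # illustrative step size

for _ in range(200):
    grad = policy_gradient(theta, rewards)
    # Gradient ASCENT: move the parameters in the direction that increases J.
    theta = [t + learning_rate * g for t, g in zip(theta, grad)]

final_probs = softmax(theta)
print("probability of the better action:", final_probs[0])
```

The only difference from gradient descent is the plus sign in the update: the parameters move towards higher expected reward instead of lower loss. A real agent cannot compute this gradient exactly and has to estimate it from sampled trajectories.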

To see how the algorithm works, have a look at the following pseudocode (source: OpenAI):

Pseudocode

For a more fundamental understanding, this Spinning Up article is a good resource.
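
One step in the Spinning Up version of the VPG pseudocode is computing the rewards-to-go: for every timestep, the sum of rewards collected from that step until the end of the episode. These values (or advantages derived from them) weight the log-probabilities of the actions taken. A minimal sketch of that step follows; the function name and the `gamma` discount parameter are illustrative assumptions, not GenRL's API.

```python
def rewards_to_go(rewards, gamma=1.0):
    """Return, for each step t, the (discounted) sum of rewards from t onward."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each entry reuses the sum already computed for t + 1.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

episode_rewards = [1.0, 1.0, 1.0]  # CartPole gives +1 for every step survived
print(rewards_to_go(episode_rewards))       # [3.0, 2.0, 1.0]
print(rewards_to_go(episode_rewards, 0.5))  # [1.75, 1.5, 1.0]
```

Using rewards-to-go instead of the full episode return means an action is only credited with rewards that came after it, which reduces the variance of the gradient estimate.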

Now that we have a high-level understanding of how VPG works, let’s jump into the code to see it in action.
This is a very minimal way to run a VPG agent on GenRL:

VPG agent on a Cartpole Environment

import gym  # OpenAI Gym

from genrl.agents import VPG
from genrl.trainers import OnPolicyTrainer
from genrl.environments import VectorEnv

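env = VectorEnv("CartPole-v1")  # vectorised wrapper around the gym environment
agent = VPG('mlp', env)  # VPG agent with an MLP policy
trainer = OnPolicyTrainer(agent, env, epochs=200)
trainer.train()  # start training and print the training log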

This will run a VPG agent which will interact with the CartPole-v1 gym environment.
Let’s understand the output of running this (your individual values may differ):

timestep         Episode          loss             mean_reward
0                0                8.022            19.8835
20480            10               25.969           75.2941
40960            20               29.2478          144.2254
61440            30               25.5711          129.6203
81920            40               19.8718          96.6038
102400           50               19.2585          106.9452
122880           60               17.7781          99.9024
143360           70               23.6839          121.543
163840           80               24.4362          129.2114
184320           90               28.1183          156.3359
204800           100              26.6074          155.1515
225280           110              27.2012          178.8646
245760           120              26.4612          164.498
266240           130              22.8618          148.4058
286720           140              23.465           153.4082
307200           150              21.9764          151.1439
327680           160              22.445           151.1439
348160           170              22.9925          155.7414
368640           180              22.6605          165.1613
389120           190              23.4676          177.316

timestep: The number of time units (steps) the agent has interacted with the environment since the start of training
Episode: One complete rollout of the agent; put simply, one complete run until the agent ends up winning or losing
loss: The loss encountered in that episode
mean_reward: The mean reward accumulated in that episode

Now if you look closely, the agent will not converge to the maximum reward even if you increase the epochs to, say, 5000. This is because during training the agent behaves according to a stochastic policy. When you pick an action for a given state, the policy doesn’t simply take the one with the maximum return; rather, it samples an action from a probability distribution. In other words, the policy isn’t a lookup table: it’s a function which outputs a probability distribution over the actions, which we sample from when picking our action.
So even if the agent has figured out the optimal policy, it is not taking the most optimal action at every step; there is an inherent stochasticity to it.
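
The difference between sampling from the policy and acting greedily can be sketched in a few lines. This is an illustration only, not how GenRL implements it: the probabilities are a made-up policy output for a single state, and both helper functions are hypothetical.

```python
import random

def sample_action(probs, rng):
    """Stochastic policy: draw an action index according to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def greedy_action(probs):
    """Deterministic policy: always take the most probable action."""
    return max(range(len(probs)), key=lambda a: probs[a])

action_probs = [0.2, 0.7, 0.1]  # hypothetical policy output for one state
rng = random.Random(0)          # seeded so runs are repeatable

sampled = [sample_action(action_probs, rng) for _ in range(1000)]
counts = [sampled.count(a) for a in range(len(action_probs))]
print("sampled counts:", counts)
print("greedy action:", greedy_action(action_probs))
```

Sampling picks the less likely actions a fair share of the time, which is useful for exploration during training but costs reward once the policy is already good. The greedy version always returns the same action for the same state.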
If we want the agent to make full use of the learnt policy, we can add the following line of code after the training:

trainer.evaluate()


This will not only make the agent follow a deterministic policy, and thus help you achieve the maximum reward attainable from the learnt policy, but also allow you to see your agent perform by passing render=True.

For more information on the VPG implementation and the various hyperparameters available, have a look at the official GenRL docs.

Some more implementations

VPG agent on an Atari Environment

env = VectorEnv("Pong-v0", env_type="atari")  # Atari environment
agent = VPG('cnn', env)  # CNN policy for image observations
trainer = OnPolicyTrainer(agent, env, epochs=200)
trainer.train()