Proximal Policy Optimization

For background on Deep RL, its core definitions, and problem formulations, refer to Deep RL Background.

Objective

The objective is to maximize the expected discounted cumulative reward:

\[E\left[{\sum_{t=0}^{\infty}{\gamma^{t} r_{t}}}\right]\]
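
As a small standalone illustration (not part of the library), the discounted return of a finite reward sequence can be computed directly from this definition:

def discounted_return(rewards, gamma=0.99):
    # sum_t gamma^t * r_t for a finite list of rewards
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([1.0, 1.0, 1.0]))  # 1 + 0.99 + 0.9801 = 2.9701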

The Proximal Policy Optimization algorithm is very similar to the Advantage Actor Critic algorithm, except that we multiply the advantages by the ratio of the probability of each action under the current policy to its probability under the old policy used at experience collection time. This establishes a trust region: the update cannot move the new policy too far away from the old one, while still taking gradient ascent steps in the direction of actions with positive advantages.

Here \(\pi_{\theta_k}\) denotes the old policy used to collect experience and \(\pi_{\theta}\) the policy currently being optimized; actions are sampled stochastically, \(a_{t} \sim \pi_{\theta}(\cdot \mid s_{t})\).
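
As a toy illustration of the ratio (a sketch with made-up log probabilities, assuming PyTorch tensors), it is usually computed from the log probabilities stored at collection time and those recomputed at update time:

import torch

# log-probabilities of the same actions under the old policy (stored when the
# experience was collected) and under the current policy (recomputed at update time)
old_log_probs = torch.tensor([-1.2, -0.7, -2.3])
new_log_probs = torch.tensor([-1.0, -0.9, -2.3])

# pi_theta(a|s) / pi_theta_k(a|s); equals 1 wherever the two policies agree
ratio = torch.exp(new_log_probs - old_log_probs)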

Algorithm Details

Action Selection and Values

Here, ac is an object of the ActorCritic class, which defines two methods: get_value and get_action. As the names suggest, these return the value approximation from the Critic and an action from the Actor, respectively.

Note: We sample a stochastic action from the distribution over the action space by passing False as an argument to select_action.

For practical purposes, we assume that we are working with a finite-horizon MDP.
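
To make this step concrete, below is a minimal, self-contained actor-critic sketch for a discrete action space. It is not GenRL's ActorCritic class (whose exact method signatures may differ); TinyActorCritic and its deterministic argument are illustrative names, but it exposes the same two methods described above:

import torch
import torch.nn as nn
from torch.distributions import Categorical

class TinyActorCritic(nn.Module):
    # Illustrative actor-critic with the two methods described above.
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.actor = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, n_actions))
        self.critic = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                    nn.Linear(hidden, 1))

    def get_action(self, obs, deterministic=False):
        # Return a sampled (or greedy) action and its log probability
        dist = Categorical(logits=self.actor(obs))
        action = dist.probs.argmax(-1) if deterministic else dist.sample()
        return action, dist.log_prob(action)

    def get_value(self, obs):
        # Value approximation from the Critic
        return self.critic(obs).squeeze(-1)

ac = TinyActorCritic(obs_dim=4, n_actions=2)   # e.g. CartPole dimensions
obs = torch.randn(1, 4)
action, log_prob = ac.get_action(obs)          # stochastic action (deterministic=False)
value = ac.get_value(obs)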

Collect Experience

To make our agent learn, we first need to collect experience in an online fashion. For this, we use the collect_rollouts method, which is defined in the OnPolicyAgent base class.

For the update, we need to compute advantages from this experience, so we store it in a rollout buffer.
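
Continuing the sketch above, a stand-in for collect_rollouts and the rollout buffer might look roughly like this (the rollout length of 128 and the plain-dictionary buffer are arbitrary choices for illustration, and the classic Gym reset/step API is assumed):

import gym

env = gym.make("CartPole-v0")
buffer = {"obs": [], "actions": [], "log_probs": [],
          "rewards": [], "dones": [], "values": []}

obs = torch.as_tensor(env.reset(), dtype=torch.float32).unsqueeze(0)
for _ in range(128):
    with torch.no_grad():                                # no gradients while collecting
        action, log_prob = ac.get_action(obs)
        value = ac.get_value(obs)
    next_obs, reward, done, _ = env.step(action.item())

    # store everything the PPO update will need later
    buffer["obs"].append(obs)
    buffer["actions"].append(action)
    buffer["log_probs"].append(log_prob)
    buffer["rewards"].append(reward)
    buffer["dones"].append(done)
    buffer["values"].append(value.item())

    obs = torch.as_tensor(env.reset() if done else next_obs,
                          dtype=torch.float32).unsqueeze(0)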

Compute Discounted Returns and Advantages

Next, we compute the advantages and the actual discounted returns for each state by calling compute_returns_and_advantage. Note that this implementation of the rollout buffer is borrowed from Stable Baselines.
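
A hedged sketch of what such a computation does, using Generalized Advantage Estimation (GAE) in the style of Stable Baselines' compute_returns_and_advantage (the function name, and the gamma and gae_lambda defaults, are illustrative; dones[t] is taken to mean that the episode ended at step t):

import numpy as np

def compute_returns_and_advantages(rewards, values, dones, last_value,
                                   gamma=0.99, gae_lambda=0.95):
    advantages = np.zeros(len(rewards), dtype=np.float32)
    last_gae = 0.0
    for t in reversed(range(len(rewards))):
        non_terminal = 1.0 - float(dones[t])
        next_value = last_value if t == len(rewards) - 1 else values[t + 1]
        # TD error: r_t + gamma * V(s_{t+1}) - V(s_t), with no bootstrap at episode ends
        delta = rewards[t] + gamma * next_value * non_terminal - values[t]
        last_gae = delta + gamma * gae_lambda * non_terminal * last_gae
        advantages[t] = last_gae
    returns = advantages + np.asarray(values, dtype=np.float32)
    return returns, advantages

returns, advantages = compute_returns_and_advantages(
    buffer["rewards"], buffer["values"], buffer["dones"],
    last_value=ac.get_value(obs).item())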

Update Equations

Let \(\pi_{\theta}\) denote a policy with parameters \(\theta\), and \(J(\pi_{\theta})\) denote the expected finite-horizon undiscounted return of the policy.

At each update timestep, we evaluate the values and the log probabilities of the stored state-action pairs under the current policy:
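
Continuing the sketch, this re-evaluation might look as follows; the entropy term is an optional exploration bonus commonly added to the loss:

obs_batch = torch.cat(buffer["obs"])                    # shape (T, obs_dim)
action_batch = torch.cat(buffer["actions"])             # shape (T,)
old_log_probs = torch.cat(buffer["log_probs"])          # from collection time

dist = Categorical(logits=ac.actor(obs_batch))
new_log_probs = dist.log_prob(action_batch)             # under the current policy
values = ac.get_value(obs_batch)
entropy = dist.entropy().mean()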

In the case of PPO, our loss function is:

\[L(s,a,\theta_k,\theta) = \min\left( \frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)} A^{\pi_{\theta_k}}(s,a), \;\; \text{clip}\left(\frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)}, 1 - \epsilon, 1+\epsilon \right) A^{\pi_{\theta_k}}(s,a) \right),\]

where \(\epsilon\) is a small hyperparameter which controls how far the new policy \(\pi_{\theta}\) is allowed to move away from the old policy \(\pi_{\theta_k}\), and \(A^{\pi_{\theta_k}}(s,a)\) is the advantage estimate computed above.
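
Continuing the sketch, the clipped surrogate can be written directly from the equation above (clip_eps plays the role of \(\epsilon\); 0.2 is a common default). The policy loss is negated because optimizers minimize, and a simple squared-error value loss for the Critic is shown alongside:

clip_eps = 0.2
adv = torch.as_tensor(advantages)

ratio = torch.exp(new_log_probs - old_log_probs)        # pi_theta(a|s) / pi_theta_k(a|s)
unclipped = ratio * adv
clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
policy_loss = -torch.min(unclipped, clipped).mean()

value_loss = ((values - torch.as_tensor(returns)) ** 2).mean()   # critic regression target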

We then update the policy parameters via stochastic gradient ascent on the expected surrogate loss (with the expectation taken over states and actions collected under \(\pi_{\theta_k}\)), typically over several epochs of minibatch updates:

\[\theta_{k+1} = \theta_k + \alpha \nabla_{\theta} E\left[ L(s,a,\theta_k,\theta) \right]\]
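
In the sketch, a single such step could look like the following (the loss coefficients and learning rate are typical defaults rather than GenRL's exact values; in practice PPO repeats this over several minibatch epochs on the same rollout):

optimizer = torch.optim.Adam(ac.parameters(), lr=3e-4)

# gradient ascent on the surrogate == gradient descent on its negative
loss = policy_loss + 0.5 * value_loss - 0.01 * entropy
optimizer.zero_grad()
loss.backward()
optimizer.step()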

Training through the API

import gym

from genrl.agents import PPO1
from genrl.trainers import OnPolicyTrainer
from genrl.environments import VectorEnv

env = VectorEnv("CartPole-v0")
agent = PPO1('mlp', env)
trainer = OnPolicyTrainer(agent, env, log_mode=['stdout'])
trainer.train()