For background on Deep RL, including its core definitions and problem formulations, refer to Deep RL Background.

## Objective

The objective is to maximize the discounted cumulative reward function:

$E\left[{\sum_{t=0}^{\infty}{\gamma^{t} r_{t}}}\right]$

In the Advantage Actor Critic algorithm, this comprises two parts:

1. To learn a policy that increases the probability of actions whose expected return is higher than the value of the state, and decreases the probability of actions whose expected return is lower than the value of the state. The Advantage is computed as:
$A(s,a) = Q(s,a) - V(s)$
2. To learn a State Action Value Function (the Critic) that estimates the future cumulative reward given the current state and action. This function helps the policy evaluate potential state-action pairs.

where the action is sampled from the policy, $$a_{t} \sim \pi_{\theta}(\cdot \mid s_{t})$$.
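As a small illustration of the advantage, the sketch below uses made-up action values for a single state with two actions. The names `q_values` and `policy_probs` are assumptions for this example only and are not part of GenRL.

```python
# Hypothetical action values Q(s, a) for one state with two actions.
q_values = {"left": 1.0, "right": 3.0}
# Hypothetical policy probabilities pi(a | s) for the same state.
policy_probs = {"left": 0.5, "right": 0.5}

# V(s) is the expectation of Q(s, a) under the policy.
state_value = sum(policy_probs[a] * q_values[a] for a in q_values)

# A(s, a) = Q(s, a) - V(s)
advantages = {a: q_values[a] - state_value for a in q_values}

print(state_value)  # 2.0
print(advantages)   # {'left': -1.0, 'right': 1.0}
```

The action with a positive advantage ("right") is better than the policy's average behaviour in this state, so its probability should be increased; the opposite holds for "left".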

## Algorithm Details

### Action Selection and Values

Here, ac is an object of the ActorCritic class, which defines two methods: get_value and get_action. They return the value approximation from the Critic and the action from the Actor, respectively.

Note: We sample a stochastic action from the distribution over the action space by passing False as the argument to select_action.

For practical purposes, we assume that we are working with a finite-horizon MDP.
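The difference between a stochastic and a greedy action can be sketched in plain Python for a discrete action space. The helper `choose_action` and its `deterministic` flag are hypothetical names chosen for this illustration; they are an assumption about what the boolean argument mentioned above controls, not GenRL's actual signature.

```python
import math
import random


def softmax(logits):
    """Convert unnormalized scores into a probability distribution."""
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def choose_action(logits, deterministic, rng=random):
    """Hypothetical helper: pick an action index from policy logits."""
    probs = softmax(logits)
    if deterministic:
        # Greedy choice: the most probable action.
        return max(range(len(probs)), key=lambda i: probs[i])
    # Stochastic choice: sample an index according to the probabilities.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [0.1, 2.0, -1.0]
greedy_action = choose_action(logits, deterministic=True)
sampled_action = choose_action(logits, deterministic=False, rng=random.Random(0))
```

Sampling keeps the agent exploring during training, while the greedy choice is typically reserved for evaluation.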

### Collect Experience

To make our agent learn, we first need to collect some experience in an online fashion. For this, we use the collect_rollouts method, which is defined in the OnPolicyAgent base class.

To perform an update, we need to compute advantages from this experience, so we store the experience in a Rollout Buffer.
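To make the idea of a rollout buffer concrete, here is a minimal sketch of what such a container might hold. `SimpleRolloutBuffer` and its fields are hypothetical; this is neither the GenRL nor the Stable Baselines class, only an illustration of the kind of data that has to be kept for the update.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class SimpleRolloutBuffer:
    """Hypothetical minimal buffer; not the GenRL or Stable Baselines class."""

    states: List[Any] = field(default_factory=list)
    actions: List[Any] = field(default_factory=list)
    rewards: List[float] = field(default_factory=list)
    dones: List[bool] = field(default_factory=list)
    values: List[float] = field(default_factory=list)

    def add(self, state, action, reward, done, value):
        """Store one transition together with the critic's value estimate."""
        self.states.append(state)
        self.actions.append(action)
        self.rewards.append(reward)
        self.dones.append(done)
        self.values.append(value)

    def reset(self):
        """Discard stored experience, e.g. after an on-policy update."""
        for storage in (self.states, self.actions, self.rewards,
                        self.dones, self.values):
            storage.clear()

    def __len__(self):
        return len(self.rewards)


buffer = SimpleRolloutBuffer()
buffer.add(state=[0.0, 0.1], action=1, reward=1.0, done=False, value=0.4)
buffer.add(state=[0.1, 0.2], action=0, reward=1.0, done=True, value=0.2)
print(len(buffer))  # 2
```

Storing the value estimates alongside the rewards is what later allows advantages to be computed without another pass through the critic.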

### Compute Discounted Returns and Advantages

Next, we compute the advantages and the actual discounted returns for each state by calling compute_returns_and_advantage. Note that this implementation of the rollout buffer is borrowed from Stable Baselines.
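The computation can be sketched with plain Monte Carlo returns, working backwards through the rollout. This is an assumption-laden illustration: the function names, the `dones` convention (True means the episode ended at that step), and the use of simple returns minus values are choices made here, and the borrowed buffer may use a different estimator.

```python
def discounted_returns(rewards, dones, last_value, gamma=0.99):
    """Compute the discounted return for every step, working backwards."""
    returns = [0.0] * len(rewards)
    # Bootstrap from the value of the state that follows the rollout.
    running = last_value
    for t in reversed(range(len(rewards))):
        if dones[t]:
            running = 0.0  # do not carry reward across an episode boundary
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def advantages_from(returns, values):
    """Advantage estimate: observed return minus the critic's prediction."""
    return [g - v for g, v in zip(returns, values)]


rewards = [1.0, 1.0, 1.0]
dones = [False, False, True]
values = [2.0, 1.5, 0.5]  # hypothetical critic estimates

returns = discounted_returns(rewards, dones, last_value=0.0, gamma=0.5)
advantages = advantages_from(returns, values)

print(returns)     # [1.75, 1.5, 1.0]
print(advantages)  # [-0.25, 0.0, 0.5]
```

A negative advantage at the first step means the critic expected more than was actually received, so the action taken there would be made less likely.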

### Update Equations

Let $$\pi_{\theta}$$ denote a policy with parameters $$\theta$$, and $$J(\pi_{\theta})$$ denote the expected finite-horizon discounted return of the policy.

At each update step, we evaluate the stored states and actions to get the value estimates and the log probabilities of the actions taken.
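For a discrete action space, the log probability of a stored action can be sketched as a log-softmax over the actor's outputs. The logits and actions below are made up for illustration; in practice they would come from the actor network and the rollout buffer, and the value estimates would come from the critic.

```python
import math


def log_softmax(logits):
    """Numerically stable log of the softmax probabilities."""
    max_logit = max(logits)
    log_total = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return [x - log_total for x in logits]


# Hypothetical actor outputs (logits) for two stored states.
batch_logits = [[1.0, 1.0], [0.0, math.log(3.0)]]
# Actions that were taken in those states during the rollout.
batch_actions = [0, 1]

# log pi(a_t | s_t) for each stored (state, action) pair.
log_probs = [
    log_softmax(logits)[action]
    for logits, action in zip(batch_logits, batch_actions)
]
```

Here the first entry is approximately log(0.5) and the second approximately log(0.75), since the corresponding action probabilities are 1/2 and 3/4.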

Now that we have the log probabilities, we calculate the gradient of $$J(\pi_{\theta})$$ as:

$\nabla_{\theta} J(\pi_{\theta}) = E_{\tau \sim \pi_{\theta}}\left[{ \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t|s_t) A^{\pi_{\theta}}(s_t,a_t) }\right],$

where $$\tau$$ is the trajectory.

We then update the policy parameters via stochastic gradient ascent:

$\theta_{k+1} = \theta_k + \alpha \nabla_{\theta} J(\pi_{\theta_k})$

The key idea underlying the Advantage Actor Critic algorithm is to push up the probabilities of actions that lead to a higher return than the expected return of that state, and push down the probabilities of actions that lead to a lower return, until the policy converges towards the optimal policy.
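This push-up and push-down behaviour can be checked on a toy softmax policy over two actions in a single state. The function `policy_gradient_step` is a hypothetical helper written for this sketch; it applies one gradient-ascent step using the analytic gradient of the log probability for a softmax policy, rather than automatic differentiation as a real implementation would.

```python
import math


def softmax(logits):
    """Convert logits into action probabilities."""
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(theta, action, advantage, lr):
    """One ascent step on log pi(action) * advantage for a softmax policy."""
    probs = softmax(theta)
    # For a softmax policy, d log pi(action) / d theta_i = 1[i == action] - pi_i.
    grad_log_prob = [
        (1.0 if i == action else 0.0) - p for i, p in enumerate(probs)
    ]
    return [t + lr * advantage * g for t, g in zip(theta, grad_log_prob)]


theta = [0.0, 0.0]  # both actions start equally likely
before = softmax(theta)[1]

theta_up = policy_gradient_step(theta, action=1, advantage=1.0, lr=0.1)
theta_down = policy_gradient_step(theta, action=1, advantage=-1.0, lr=0.1)

after_up = softmax(theta_up)[1]      # probability of action 1 increases
after_down = softmax(theta_down)[1]  # probability of action 1 decreases
```

A positive advantage raises the probability of the sampled action and a negative advantage lowers it, which is exactly the effect of the update rule on a single sample.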

## Training through the API

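    import gym

    from genrl.agents import A2C
    from genrl.trainers import OnPolicyTrainer
    from genrl.environments import VectorEnv

    # Create the vectorized CartPole environment
    env = VectorEnv("CartPole-v0")
    # Build an A2C agent with the "mlp" network type
    agent = A2C("mlp", env)
    # Train the agent, logging progress to stdout
    trainer = OnPolicyTrainer(agent, env, log_mode=["stdout"])
    trainer.train()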