Deep Reinforcement Learning Background


The goal of a Reinforcement Learning algorithm is to maximize cumulative reward. It does so by learning a policy \(\pi_{\theta}\) whose behavior is as close to optimal as possible. We denote the optimal policy by \(\pi_{\theta}^{*}\). For convenience, we formulate the Reinforcement Learning problem as a Markov Decision Process.

Markov Decision Process

A Markov Decision Process (MDP) is defined by the tuple \((S, A, r, P_{a})\), where:

  • \(S\) is a set of States.
  • \(A\) is a set of Actions.
  • \(r : S \rightarrow \mathbb{R}\) is a reward function.
  • \(P_{a}(s, s')\) is the transition probability that action \(a\) in state \(s\) leads to state \(s'\).
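To make the tuple concrete, here is a minimal sketch of a tabular MDP in Python. The class name `TabularMDP`, the two-state "cold"/"warm" example, and the choice to pay the reward of the state we land in are all assumptions made for illustration; they are not part of the definition above.

```python
import random
from dataclasses import dataclass


@dataclass
class TabularMDP:
    states: list        # S
    actions: list       # A
    reward: dict        # reward[s] -> float, i.e. r : S -> R
    transitions: dict   # transitions[a][s] -> {s_next: P_a(s, s_next)}

    def step(self, state, action, rng):
        """Sample s' ~ P_a(s, .) and return (s', r(s')).

        Rewarding the state we arrive in is an assumed convention.
        """
        dist = self.transitions[action][state]
        next_states = list(dist)
        weights = [dist[s] for s in next_states]
        next_state = rng.choices(next_states, weights=weights, k=1)[0]
        return next_state, self.reward[next_state]


# Hypothetical two-state example
mdp = TabularMDP(
    states=["cold", "warm"],
    actions=["stay", "move"],
    reward={"cold": 0.0, "warm": 1.0},
    transitions={
        "stay": {"cold": {"cold": 1.0}, "warm": {"warm": 1.0}},
        "move": {
            "cold": {"warm": 0.9, "cold": 0.1},
            "warm": {"cold": 0.9, "warm": 0.1},
        },
    },
)

rng = random.Random(0)
next_state, reward = mdp.step("cold", "move", rng)
```

Each row `transitions[a][s]` must sum to 1, since it is a probability distribution over next states.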

Often we define two functions: a policy function \(\pi_{\theta}(s,a)\) and a value function \(V_{\pi_{\theta}}(s)\).

Policy Function

The policy is the agent’s strategy, and our goal is to make it optimal. The optimal policy is usually denoted by \(\pi_{\theta}^{*}\). There are usually two types of policies:

Stochastic Policy

A stochastic Policy Function defines a probability distribution over actions given a state, i.e. the likelihood of every action when the agent is in a particular state. Formally,

\[\pi : S \times A \rightarrow [0,1]\]
\[a \sim \pi(a|s)\]
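As a rough sketch of what \(a \sim \pi(a|s)\) means in practice, we can store the policy as a table of probabilities and draw actions from it. The table `pi`, the state and action names, and the helper `sample_action` are hypothetical and only serve the illustration.

```python
import random

# Hypothetical tabular policy: pi[s][a] is the probability of action a in state s
pi = {
    "cold": {"stay": 0.2, "move": 0.8},
    "warm": {"stay": 0.9, "move": 0.1},
}


def sample_action(pi, state, rng):
    """Draw a ~ pi(. | state)."""
    actions = list(pi[state])
    probs = [pi[state][a] for a in actions]
    return rng.choices(actions, weights=probs, k=1)[0]


# Empirical action frequencies should approach the table entries
rng = random.Random(0)
counts = {"stay": 0, "move": 0}
for _ in range(10_000):
    counts[sample_action(pi, "cold", rng)] += 1
```

With enough samples, `counts["move"] / 10_000` should be close to 0.8, the probability the table assigns to `"move"` in state `"cold"`.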

Deterministic Policy

The Policy Function maps from States directly to Actions.

\[\pi : S \rightarrow A\]
\[a = \pi(s)\]
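A deterministic policy can be sketched as a plain lookup. It can also be viewed as a special case of a stochastic policy that puts probability 1 on the chosen action. The table `ACTION_FOR_STATE` and the helper `as_distribution` are hypothetical names used only for this example.

```python
# Hypothetical lookup table: exactly one action per state
ACTION_FOR_STATE = {"cold": "move", "warm": "stay"}


def pi(state):
    """Deterministic policy: a = pi(s)."""
    return ACTION_FOR_STATE[state]


def as_distribution(policy, state, actions):
    """View a deterministic policy as a degenerate distribution over actions."""
    chosen = policy(state)
    return {a: (1.0 if a == chosen else 0.0) for a in actions}


dist = as_distribution(pi, "cold", ["stay", "move"])
```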

Value Function

The Value Function is defined as the expected return obtained when we follow a policy \(\pi\) starting from state \(s\). There are usually two types of value functions: the State Value Function and the State Action Value Function.

State Value Function

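The State Value Function is defined as the expected return when starting from state \(s\) and following \(\pi\) thereafter, where \(R_{t}\) denotes the return from time step \(t\) onward.

\[V^{\pi}(s) = E_{\pi}\left[ R_{t} \mid s_{t} = s \right]\]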

State Action Value Function

The State Action Value Function is defined as the expected return when starting from state \(s\), taking action \(a\), and following \(\pi\) thereafter.

\[Q^{\pi}(s,a) = E_{\pi}\left[ R_{t} \mid s_{t} = s, a_{t} = a \right]\]

The State Action Value Function is also known as the Quality Function, since it indicates how good a particular action is in state \(s\).
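The two value functions are linked: averaging the Quality Function over the policy's action probabilities gives the State Value Function, \(V^{\pi}(s) = \sum_{a} \pi(a|s) Q^{\pi}(s,a)\). The sketch below illustrates this for a single state. The numbers in `pi_s` and `q_s` are made up, and the helper names are hypothetical.

```python
def state_value(pi_s, q_s):
    """V(s) = sum over a of pi(a|s) * Q(s, a), for one fixed state s."""
    return sum(pi_s[a] * q_s[a] for a in pi_s)


def greedy_action(q_s):
    """Action with the highest Q-value in this state."""
    return max(q_s, key=q_s.get)


# Made-up numbers for a single state s
pi_s = {"stay": 0.2, "move": 0.8}   # pi(a | s)
q_s = {"stay": 1.0, "move": 2.0}    # Q(s, a)

v = state_value(pi_s, q_s)   # 0.2 * 1.0 + 0.8 * 2.0
best = greedy_action(q_s)
```

Here `best` is the action a purely greedy agent would pick, which is how Q-values tell us "how good" each action is.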


Neural Networks are often used as approximators for the Policy and Value Functions. In that case, we say they are parameterised by \(\theta\), for example \(\pi_{\theta}\).
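To show what "parameterised by \(\theta\)" means without pulling in a deep learning library, the sketch below uses a linear softmax policy as a stand-in for a neural network. This is a deliberate simplification: `theta` holds one hypothetical weight vector per action, and the state is assumed to be given as a small feature vector.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_probs(theta, state_features):
    """pi_theta(. | s): one logit per action, computed as a dot product."""
    logits = [
        sum(w * x for w, x in zip(weights, state_features))
        for weights in theta
    ]
    return softmax(logits)


# Hypothetical parameters: 2 actions, 2 state features
theta = [[0.5, -1.0], [0.0, 2.0]]
probs = policy_probs(theta, [1.0, 0.5])
```

Changing `theta` changes the action probabilities, which is exactly the handle a learning algorithm uses to improve the policy. With all-zero parameters, every action is equally likely.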


The objective is to learn a policy that maximizes a cumulative function of the rewards received at each step, typically the expected discounted reward over a potentially infinite horizon. We formulate this cumulative function as

\[E\left[{\sum_{t=0}^{\infty}{\gamma^{t} r_{t}}}\right]\]

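where \(\gamma \in [0, 1)\) is the discount factor, \(r_{t}\) is the reward received at step \(t\), and we choose each action according to our policy, \(a_{t} = \pi_{\theta}(s_{t})\).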
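For a single trajectory, the discounted sum inside the expectation can be computed as sketched below. The function name `discounted_return` and the reward sequences are illustrative. An infinite horizon is approximated here by a long finite episode, which is an assumption that works when \(\gamma < 1\) because later terms shrink geometrically.

```python
def discounted_return(rewards, gamma):
    """Compute sum over t of gamma**t * rewards[t] for one trajectory."""
    g = 0.0
    # Work backwards so each step is a single multiply-add: g = r + gamma * g
    for r in reversed(rewards):
        g = r + gamma * g
    return g


short = discounted_return([1.0, 1.0, 1.0], gamma=0.5)   # 1 + 0.5 + 0.25

# A constant reward of 1 with gamma = 0.9 approaches 1 / (1 - 0.9) = 10
long_run = discounted_return([1.0] * 200, gamma=0.9)
```

Averaging this quantity over many trajectories sampled with the policy gives a Monte Carlo estimate of the expectation above.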