# Deep Reinforcement Learning Background

## Background

The goal of a reinforcement learning algorithm is to maximize reward. This is usually achieved by learning a policy $$\pi_{\theta}$$ that performs optimal behaviour; we denote this optimal policy by $$\pi_{\theta}^{*}$$. For ease of analysis, we formalise the reinforcement learning problem as a Markov Decision Process.

## Markov Decision Process

A Markov Decision Process (MDP) is defined by the tuple $$(S, A, r, P_{a})$$ where (a toy example in code follows the list):

- $$S$$ is a set of states.
- $$A$$ is a set of actions.
- $$r : S \rightarrow \mathbb{R}$$ is a reward function.
- $$P_{a}(s, s')$$ is the probability that taking action $$a$$ in state $$s$$ leads to state $$s'$$.
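To make the tuple concrete, here is a minimal sketch of a two-state MDP in plain Python; the state names, actions, rewards, and transition probabilities are all invented for illustration.

```python
# A toy MDP (S, A, r, P_a); every value below is made up for illustration.
S = ["s0", "s1"]
A = ["left", "right"]

# r(s): the reward received in state s.
r = {"s0": 0.0, "s1": 1.0}

# P[a][(s, s')]: the probability that taking action a in state s leads to s'.
P = {
    "left":  {("s0", "s0"): 0.9, ("s0", "s1"): 0.1,
              ("s1", "s0"): 0.8, ("s1", "s1"): 0.2},
    "right": {("s0", "s0"): 0.2, ("s0", "s1"): 0.8,
              ("s1", "s0"): 0.1, ("s1", "s1"): 0.9},
}
```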

Often we define two functions: a policy function $$\pi_{\theta}(s,a)$$ and a value function $$V^{\pi_{\theta}}(s)$$.

## Policy Function

The policy is the agent's strategy, and our goal is to make it optimal. The optimal policy is usually denoted by $$\pi_{\theta}^{*}$$. There are usually two types of policies:

### Stochastic Policy

The policy function defines a probability distribution over actions given a state, i.e. the likelihood of each action when the agent is in a particular state. Formally,

$\pi : S \times A \rightarrow [0,1]$
$a \sim \pi(a|s)$
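As a small sketch, a tabular stochastic policy can be stored as one probability vector per state and sampled from directly; the states and probabilities below are invented.

```python
import numpy as np

actions = ["left", "right"]

# pi[s] holds P(a|s) for each action in state s (invented probabilities).
pi = {"s0": np.array([0.7, 0.3]),
      "s1": np.array([0.4, 0.6])}

rng = np.random.default_rng(0)
a = actions[rng.choice(len(actions), p=pi["s0"])]  # a ~ pi(a|s0)
```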

### Deterministic Policy

The policy function maps states directly to actions.

$\pi : S \rightarrow A$
$a = \pi(s)$
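In tabular form a deterministic policy is just a lookup; the mapping below is invented.

```python
# A tabular deterministic policy: a direct state -> action lookup (invented).
pi_det = {"s0": "right", "s1": "left"}
a = pi_det["s0"]  # a = pi(s0)
```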

## Value Function

The value function is defined as the expected return obtained when we follow a policy $$\pi$$ from a given starting point. Usually two types of value functions are defined: the state value function and the state-action value function.

### State Value Function

The state value function is defined as the expected return $$R_{t}$$ when starting from state $$s$$ and following $$\pi$$ thereafter.

$V^{\pi}(s) = E\left[ R_{t} \mid s_{t} = s \right]$
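One way to make this definition concrete is a Monte Carlo estimate: roll out the policy from $$s$$ many times and average the discounted returns. A minimal sketch, where `policy`, `step`, and `reward` are hypothetical callables/tables supplied by the reader (e.g. built from the toy MDP above):

```python
import numpy as np

def estimate_v(s0, policy, step, reward, gamma=0.99, episodes=1000, horizon=100):
    """Monte Carlo estimate of V^pi(s0): the mean discounted return over rollouts."""
    rng = np.random.default_rng(0)
    returns = []
    for _ in range(episodes):
        s, g, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            g += discount * reward[s]  # r(s), the state-reward convention above
            discount *= gamma
            a = policy(s, rng)         # a ~ pi(a|s)
            s = step(s, a, rng)        # s' ~ P_a(s, .)
        returns.append(g)
    return float(np.mean(returns))
```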

### State Action Value Function

The action value function is defined as the expected return $$R_{t}$$ when starting from state $$s$$, taking action $$a$$, and following $$\pi$$ thereafter.

$Q^{\pi}(s,a) = E\left[ R_{t} \mid s_{t} = s, a_{t} = a \right]$

The action value function is also known as the quality function, as it denotes how good a particular action $$a$$ is in state $$s$$.
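The two value functions are related by a standard identity (assuming the discounted-return definition given in the objective below): averaging the action values under the policy recovers the state value.

$V^{\pi}(s) = \sum_{a \in A} \pi(a|s) \, Q^{\pi}(s,a)$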

## Approximators

Neural networks are often used as approximators for the policy and value functions. In such cases, we say these functions are parameterised by $$\theta$$, e.g. $$\pi_{\theta}$$.
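As a sketch of what "parameterised by $$\theta$$" means in practice, here is a small stochastic policy network in PyTorch; the layer sizes, dimensions, and dummy state are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2  # arbitrary sizes, for illustration only

# pi_theta: a small network mapping a state to logits over actions;
# theta is the set of weights of the two Linear layers.
policy_net = nn.Sequential(
    nn.Linear(obs_dim, 64),
    nn.Tanh(),
    nn.Linear(64, n_actions),
)

s = torch.randn(obs_dim)  # a dummy state vector
dist = torch.distributions.Categorical(logits=policy_net(s))
a = dist.sample()         # a ~ pi_theta(a|s)
```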

## Objective

The objective is to choose/learn a policy that maximizes a cumulative function of the rewards received at each step, typically the discounted return over a potentially infinite horizon. We formulate this cumulative function as

$E\left[{\sum_{t=0}^{\infty}{\gamma^{t} r_{t}}}\right]$

where $$\gamma \in [0, 1)$$ is the discount factor and we choose actions according to our policy, $$a_{t} = \pi_{\theta}(s_{t})$$ (or $$a_{t} \sim \pi_{\theta}(\cdot \mid s_{t})$$ for a stochastic policy).
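As a tiny numeric sketch, the discounted sum can be computed for a finite reward sequence; the rewards and $$\gamma$$ below are made up.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for a finite reward sequence."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # Horner-style accumulation of the discounted sum
    return g

print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.9**2 * 1.0 = 0.81
```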