# Deep Reinforcement Learning Background

## Background

The goal of a reinforcement learning algorithm is to maximize reward. This is usually achieved by learning a policy $$\pi_{\theta}$$ that performs optimal behaviour; we denote this optimal policy by $$\pi_{\theta}^{*}$$. For ease of analysis, we formalise the reinforcement learning problem as a Markov Decision Process.

## Markov Decision Process

A Markov Decision Process (MDP) is defined by the tuple $$(S, A, r, P_{a})$$ where (a toy example in code follows the list):

- $$S$$ is a set of states.
- $$A$$ is a set of actions.
- $$r : S \rightarrow \mathbb{R}$$ is a reward function.
- $$P_{a}(s, s')$$ is the probability that taking action $$a$$ in state $$s$$ leads to state $$s'$$.
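To make the tuple concrete, here is a minimal sketch of a two-state MDP in plain Python; the state names, actions, rewards, and transition probabilities are all invented for illustration.

```python
# A toy MDP (S, A, r, P_a); every value below is made up for illustration.
S = ["s0", "s1"]
A = ["left", "right"]

# r(s): the reward received in state s.
r = {"s0": 0.0, "s1": 1.0}

# P[a][(s, s')]: the probability that taking action a in state s leads to s'.
P = {
    "left":  {("s0", "s0"): 0.9, ("s0", "s1"): 0.1,
              ("s1", "s0"): 0.8, ("s1", "s1"): 0.2},
    "right": {("s0", "s0"): 0.2, ("s0", "s1"): 0.8,
              ("s1", "s0"): 0.1, ("s1", "s1"): 0.9},
}
```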

Often we define two functions: a policy function $$\pi_{\theta}(s,a)$$ and a value function $$V^{\pi_{\theta}}(s)$$.

## Policy Function

The policy is the agent's strategy, and our goal is to make it optimal. The optimal policy is usually denoted by $$\pi_{\theta}^{*}$$. There are usually two types of policies:

### Stochastic Policy

The policy function defines a probability distribution over actions given a state, i.e. the likelihood of each action when the agent is in a particular state. Formally,

$\pi : S \times A \rightarrow [0,1]$
$a \sim \pi(a|s)$
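As a small sketch, a tabular stochastic policy can be stored as one probability vector per state and sampled from directly; the states and probabilities below are invented.

```python
import numpy as np

actions = ["left", "right"]

# pi[s] holds P(a|s) for each action in state s (invented probabilities).
pi = {"s0": np.array([0.7, 0.3]),
      "s1": np.array([0.4, 0.6])}

rng = np.random.default_rng(0)
a = actions[rng.choice(len(actions), p=pi["s0"])]  # a ~ pi(a|s0)
```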

### Deterministic Policy

The policy function maps states directly to actions.

$\pi : S \rightarrow A$
$a = \pi(s)$
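In tabular form a deterministic policy is just a lookup; the mapping below is invented.

```python
# A tabular deterministic policy: a direct state -> action lookup (invented).
pi_det = {"s0": "right", "s1": "left"}
a = pi_det["s0"]  # a = pi(s0)
```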

## Value Function

The value function is defined as the expected return obtained when we follow a policy $$\pi$$ from a given starting point. Usually two types of value functions are defined: the state value function and the state-action value function.

### State Value Function

The state value function is defined as the expected return $$R_{t}$$ when starting from state $$s$$ and following $$\pi$$ thereafter.

$V^{\pi}(s) = E\left[ R_{t} \mid s_{t} = s \right]$
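One way to make this definition concrete is a Monte Carlo estimate: roll out the policy from $$s$$ many times and average the discounted returns. A minimal sketch, where `policy`, `step`, and `reward` are hypothetical callables/tables supplied by the reader (e.g. built from the toy MDP above):

```python
import numpy as np

def estimate_v(s0, policy, step, reward, gamma=0.99, episodes=1000, horizon=100):
    """Monte Carlo estimate of V^pi(s0): the mean discounted return over rollouts."""
    rng = np.random.default_rng(0)
    returns = []
    for _ in range(episodes):
        s, g, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            g += discount * reward[s]  # r(s), the state-reward convention above
            discount *= gamma
            a = policy(s, rng)         # a ~ pi(a|s)
            s = step(s, a, rng)        # s' ~ P_a(s, .)
        returns.append(g)
    return float(np.mean(returns))
```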

### State Action Value Function

The action value function is defined as the expected return $$R_{t}$$ when starting from state $$s$$, taking action $$a$$, and following $$\pi$$ thereafter.

$Q^{\pi}(s,a) = E\left[ R_{t} \mid s_{t} = s, a_{t} = a \right]$

The action value function is also known as the quality function, as it denotes how good a particular action $$a$$ is in state $$s$$.
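The two value functions are related by a standard identity (assuming the discounted-return definition given in the objective below): averaging the action values under the policy recovers the state value.

$V^{\pi}(s) = \sum_{a \in A} \pi(a|s) \, Q^{\pi}(s,a)$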

## Approximators

Neural networks are often used as approximators for the policy and value functions. In such cases, we say these functions are parameterised by $$\theta$$, e.g. $$\pi_{\theta}$$.
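As a sketch of what "parameterised by $$\theta$$" means in practice, here is a small stochastic policy network in PyTorch; the layer sizes, dimensions, and dummy state are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2  # arbitrary sizes, for illustration only

# pi_theta: a small network mapping a state to logits over actions;
# theta is the set of weights of the two Linear layers.
policy_net = nn.Sequential(
    nn.Linear(obs_dim, 64),
    nn.Tanh(),
    nn.Linear(64, n_actions),
)

s = torch.randn(obs_dim)  # a dummy state vector
dist = torch.distributions.Categorical(logits=policy_net(s))
a = dist.sample()         # a ~ pi_theta(a|s)
```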

## Objective

The objective is to choose/learn a policy that maximizes a cumulative function of the rewards received at each step, typically the discounted return over a potentially infinite horizon. We formulate this cumulative function as

$E\left[{\sum_{t=0}^{\infty}{\gamma^{t} r_{t}}}\right]$

where $$\gamma \in [0, 1)$$ is the discount factor and we choose actions according to our policy, $$a_{t} = \pi_{\theta}(s_{t})$$ (or $$a_{t} \sim \pi_{\theta}(\cdot \mid s_{t})$$ for a stochastic policy).
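As a tiny numeric sketch, the discounted sum can be computed for a finite reward sequence; the rewards and $$\gamma$$ below are made up.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for a finite reward sequence."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # Horner-style accumulation of the discounted sum
    return g

print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.9**2 * 1.0 = 0.81
```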