## Objective

Deep Deterministic Policy Gradients (DDPG) is a model-free actor-critic algorithm for continuous action spaces. A simple way to handle continuous actions is to discretize the action space. However, this creates several problems, the most significant being that the number of discrete actions grows exponentially with the number of degrees of freedom. DDPG builds on Deterministic Policy Gradients to learn deterministic policies in high-dimensional continuous action spaces.
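To make the exponential growth concrete, the following sketch counts the joint actions produced when every degree of freedom is split into the same number of bins. The function name and the bin counts are illustrative assumptions and are not part of GenRL.

```python
def num_discrete_actions(bins_per_dim: int, dof: int) -> int:
    """Number of joint actions when each of `dof` dimensions gets `bins_per_dim` bins."""
    return bins_per_dim ** dof


# Even a coarse grid of 3 values per dimension grows quickly with the dimension count.
for dof in (1, 3, 7):
    print(dof, num_discrete_actions(3, dof))
```

With only three values per dimension, a 7-dimensional action space already yields 2187 discrete actions, and a finer grid makes the count far larger.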

## Algorithm Details

With continuous action spaces, a Q-learning-style approach (greedy policy improvement) is not feasible for learning deterministic policies. It requires selecting the action that maximizes the action-value function at every step, and the action value cannot be evaluated for every possible action when the action space is continuous.

$\mu^{k+1}(s) = \arg\max_a Q^{\mu^{k}}(s, a)$

This problem can be avoided by noting that a policy can be improved by moving its parameters in the direction that increases the action-value function:

$\nabla_{\theta^{\mu}}J = \mathbb{E}_{s_t \sim \rho^{\beta}}[\nabla_{\theta^{\mu}}Q(s, a \vert \theta^{Q}) \vert_{s=s_t, a=\mu(s_t \vert \theta^{\mu})}]$
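As a rough illustration of this idea, the sketch below performs gradient ascent on a toy problem with an analytic critic. Everything in it is an assumption made for illustration: the quadratic `q_value`, the linear policy $\mu(s) = \theta s$, the learning rate, and the sample states. It is not how GenRL computes the gradient.

```python
from typing import Sequence


def q_value(state: float, action: float) -> float:
    """Toy critic: for a state s, the best action is a = 2 * s."""
    return -(action - 2.0 * state) ** 2


def dq_da(state: float, action: float) -> float:
    """Analytic gradient of the toy critic with respect to the action."""
    return -2.0 * (action - 2.0 * state)


def improve_policy(
    theta: float, states: Sequence[float], lr: float = 0.1, steps: int = 200
) -> float:
    """Gradient ascent on the mean Q-value for the linear policy mu(s) = theta * s."""
    for _ in range(steps):
        # dQ/dtheta = dQ/da * dmu/dtheta, and dmu/dtheta = s for this policy.
        grad = sum(dq_da(s, theta * s) * s for s in states) / len(states)
        theta += lr * grad
    return theta


theta = improve_policy(theta=0.0, states=[-1.0, 0.5, 1.0])
print(theta)  # approaches 2.0, the parameter that maximizes Q for every state
```

No maximization over actions is ever performed. The policy parameter simply follows the slope of the critic, which is what makes the approach usable in continuous action spaces.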

### Action Selection

To ensure sufficient exploration, noise is added to the action selected by the current policy. The noise is sampled from a noise process $\mathcal{N}$:

$\mu'(s_t) = \mu(s_t \vert \theta_t^{\mu}) + \mathcal{N}$

$\mathcal{N}$ can be chosen to suit the environment (for example, an Ornstein-Uhlenbeck process or Gaussian noise).

```python
def select_action(
    self, state: torch.Tensor, deterministic: bool = True
) -> torch.Tensor:
    """Select action given state

    Deterministic Action Selection with Noise

    Args:
        state (:obj:`torch.Tensor`): Current state of the environment
        deterministic (bool): Should the policy be deterministic or stochastic

    Returns:
        action (:obj:`torch.Tensor`): Action taken by the agent
    """
    action, _ = self.ac.get_action(state, deterministic)
    action = action.detach()

    # add noise to output from policy network
    if self.noise is not None:
        action += self.noise()

    return torch.clamp(
        action, self.env.action_space.low[0], self.env.action_space.high[0]
    )
```
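The `self.noise()` call above can be backed by any noise process. As a minimal sketch of one option, the class below implements a discretized Ornstein-Uhlenbeck process in plain Python. The class name, the default values of `theta`, `sigma`, and `dt`, and the `clip` helper are assumptions chosen for illustration and do not reproduce GenRL's own noise classes.

```python
import math
import random
from typing import List


class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise: x += theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)."""

    def __init__(self, size: int, mu: float = 0.0, theta: float = 0.15,
                 sigma: float = 0.2, dt: float = 1e-2, seed: int = 0) -> None:
        self.size, self.mu, self.theta, self.sigma, self.dt = size, mu, theta, sigma, dt
        self.rng = random.Random(seed)
        self.reset()

    def reset(self) -> None:
        """Restart the process at its mean, for example at the start of an episode."""
        self.x = [self.mu] * self.size

    def __call__(self) -> List[float]:
        """Advance the process by one step and return the new noise vector."""
        self.x = [
            x + self.theta * (self.mu - x) * self.dt
            + self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
            for x in self.x
        ]
        return list(self.x)


def clip(value: float, low: float, high: float) -> float:
    """Keep a noisy action inside the valid action range."""
    return max(low, min(high, value))


noise = OrnsteinUhlenbeckNoise(size=1)
policy_action = [0.95]  # stand-in for the actor output mu(s_t)
noisy_action = [clip(a + n, -1.0, 1.0) for a, n in zip(policy_action, noise())]
```

Because each sample depends on the previous one, consecutive perturbations are correlated, which can help exploration in tasks with momentum. Calling `reset()` when an episode ends keeps noise from one episode from leaking into the next.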

### Experience Replay

Like DQN, DDPG is an off-policy algorithm and makes use of a replay buffer. Whenever a transition $(s_t, a_t, r_t, s_{t+1})$ is encountered, it is stored in the replay buffer, and batches of these transitions are sampled when updating the network parameters. This breaks the strong correlation between consecutive updates that would arise if each transition were used for training and discarded immediately after it was encountered. It also helps avoid rapidly forgetting rare transitions that may be useful later on.
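The store-then-sample pattern can be sketched with a fixed-capacity buffer built on the standard library. The `ReplayBuffer` class below is a simplified illustration written for this page. Its `push` method mirrors the call used in the trainer, but the class is not GenRL's buffer implementation, and the capacity and transition layout are assumptions.

```python
import random
from collections import deque
from typing import Deque, List, Tuple

# (s_t, a_t, r_t, s_{t+1}, done)
Transition = Tuple[list, list, float, list, bool]


class ReplayBuffer:
    """Fixed-capacity FIFO store of transitions with uniform random sampling."""

    def __init__(self, capacity: int) -> None:
        # deque(maxlen=...) drops the oldest transition once the buffer is full.
        self.memory: Deque[Transition] = deque(maxlen=capacity)

    def push(self, transition: Transition) -> None:
        self.memory.append(transition)

    def sample(self, batch_size: int) -> List[Transition]:
        # Uniform sampling without replacement decorrelates consecutive steps.
        return random.sample(list(self.memory), batch_size)

    def __len__(self) -> int:
        return len(self.memory)


buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.push(([float(t)], [0.0], 1.0, [float(t + 1)], False))

batch = buffer.sample(2)  # two of the three most recent transitions
```

A real buffer would typically return stacked tensors rather than a list of tuples, but the storage and sampling logic is the same.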

```python
def log(self, timestep: int) -> None:
    """Helper function to log

    Sends useful parameters to the logger.

    Args:
        timestep (int): Current timestep of training
    """
    self.logger.write(
        {
            "timestep": timestep,
            "Episode": self.episodes,
            **self.agent.get_logging_params(),
            "Episode Reward": safe_mean(self.training_rewards),
            # ... (the excerpt is truncated here in the source)
```

### Update the Value and Policy Networks

DDPG uses target networks for both the actor (policy) and the critic (value) networks to stabilise training. The Q-network is updated using TD-learning updates. The loss function and its target are defined as:

$L(\theta^{Q}) = \mathbb{E}_{(s_t \sim \rho^{\beta}, a_t \sim \beta, r_t \sim R)}[(Q(s_t, a_t \vert \theta^{Q}) - y_t)^{2}]$

$y_t = r(s_t, a_t) + \gamma Q_{targ}(s_{t+1}, \mu_{targ}(s_{t+1} \vert \theta^{\mu}_{targ}) \vert \theta^{Q}_{targ})$
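The target and loss can be sketched in PyTorch as follows. The function names and the sample tensors are hypothetical. The `(1 - done)` mask is an added assumption that removes the bootstrap term for terminal transitions, which the formula above leaves implicit.

```python
import torch
import torch.nn.functional as F


def td_target(reward: torch.Tensor, done: torch.Tensor,
              next_q_target: torch.Tensor, gamma: float = 0.99) -> torch.Tensor:
    """y_t = r_t + gamma * (1 - done) * Q_targ(s_{t+1}, mu_targ(s_{t+1}))."""
    # Assumption: terminal transitions (done == 1) do not bootstrap.
    return reward + gamma * (1.0 - done) * next_q_target


def critic_loss(q_pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Mean squared TD error; the target is detached so no gradient flows into it."""
    return F.mse_loss(q_pred, target.detach())


# Hand-picked values standing in for a sampled batch of two transitions.
reward = torch.tensor([1.0, 0.0])
done = torch.tensor([0.0, 1.0])
next_q_target = torch.tensor([2.0, 5.0])  # output of the target critic
q_pred = torch.tensor([3.0, 1.0], requires_grad=True)  # output of the online critic

y = td_target(reward, done, next_q_target)  # roughly [2.98, 0.0]
loss = critic_loss(q_pred, y)
loss.backward()  # gradients reach only the online critic's prediction
```

In a full agent, `next_q_target` would come from evaluating the target critic on the target actor's action for $s_{t+1}$, and `q_pred` from the online critic on the stored $(s_t, a_t)$.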

Building on Deterministic Policy Gradients, the gradient of the policy objective can be computed from the action-value function by applying the chain rule:

$\nabla_{\theta^{\mu}} J = \mathbb{E}_{s_t \sim \rho^{\beta}}[\nabla_{\theta^{\mu}}Q(s, a \vert \theta^{Q})\vert_{s=s_t, a=\mu(s_t \vert \theta^{\mu})}]$

$\nabla_{\theta^{\mu}} J = \mathbb{E}_{s_t \sim \rho^{\beta}}[\nabla_a Q(s, a \vert \theta^{Q}) \vert_{s=s_t, a=\mu(s_t)}\nabla_{\theta^{\mu}}\mu(s \vert \theta^{\mu}) \vert_{s=s_t}]$
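In practice this gradient is rarely coded by hand. A common pattern, sketched below under stated assumptions, is to minimize $-Q(s, \mu(s))$ and let automatic differentiation apply the chain rule. The network sizes, the optimizer settings, and the random batch are placeholders chosen for illustration and are not GenRL's defaults.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
state_dim, action_dim = 3, 1  # hypothetical dimensions

actor = nn.Sequential(
    nn.Linear(state_dim, 16), nn.ReLU(), nn.Linear(16, action_dim), nn.Tanh()
)
critic = nn.Sequential(
    nn.Linear(state_dim + action_dim, 16), nn.ReLU(), nn.Linear(16, 1)
)
actor_optim = torch.optim.Adam(actor.parameters(), lr=1e-3)

states = torch.randn(8, state_dim)  # stand-in for a batch from the replay buffer

# Maximizing Q(s, mu(s)) is implemented as minimizing its negative.
actions = actor(states)
actor_loss = -critic(torch.cat([states, actions], dim=1)).mean()

actor_optim.zero_grad()
# Autograd computes grad_a Q * grad_theta mu through the critic into the actor.
actor_loss.backward()
# Only the actor's parameters are stepped; the critic also accumulates gradients
# here, so they should be zeroed before the critic's own update.
actor_optim.step()
```

The critic acts purely as a differentiable scoring function in this step, which is why only the actor's optimizer is used.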

The target networks are updated at regular intervals during training:

```python
for timestep in range(0, self.max_timesteps, self.env.n_envs):
    self.agent.update_params_before_select_action(timestep)

    action = self.get_action(state, timestep)
    next_state, reward, done, info = self.env.step(action)

    if self.render:
        self.env.render()

    # true_dones contains the "true" value of the dones (game over statuses). It is set
    # to False when the environment is not actually done but instead reaches the max
    # episode length.
    true_dones = [info[i]["done"] for i in range(self.env.n_envs)]
    self.buffer.push((state, action, reward, next_state, true_dones))

    state = next_state.detach().clone()

    if self.check_game_over_status(done):
        self.noise_reset()

        if self.episodes % self.log_interval == 0:
            self.log(timestep)

        if self.episodes == self.epochs:
            break

    if timestep >= self.start_update and timestep % self.update_interval == 0:
        self.agent.update_params(self.update_interval)

    if (
        timestep >= self.start_update
        and self.save_interval != 0
        and timestep % self.save_interval == 0
    ):
        self.save(timestep)

self.env.close()
self.logger.close()
```
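The loop above shows when `update_params` is called, but not how the target networks are moved. One widely used rule for DDPG is a soft (Polyak) update, sketched below. Whether GenRL applies exactly this rule is not shown in this section, so the helper name `soft_update` and the value of `tau` should be read as assumptions.

```python
import copy

import torch
import torch.nn as nn


@torch.no_grad()
def soft_update(target: nn.Module, source: nn.Module, tau: float = 0.005) -> None:
    """In-place update: theta_targ <- (1 - tau) * theta_targ + tau * theta."""
    for targ_param, param in zip(target.parameters(), source.parameters()):
        targ_param.mul_(1.0 - tau).add_(tau * param)


critic = nn.Linear(4, 1)
critic_target = copy.deepcopy(critic)  # the target starts as an exact copy

# Pretend an optimizer step shifted every online parameter by 1.0.
with torch.no_grad():
    for p in critic.parameters():
        p.add_(1.0)

# The target moves 10% of the way towards the online network.
soft_update(critic_target, critic, tau=0.1)
```

A small `tau` makes the targets change slowly, which keeps the TD target $y_t$ from shifting too quickly while the critic is being trained.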

## Training through the API

```python
from genrl.agents import DDPG
from genrl.environments import VectorEnv
from genrl.trainers import OffPolicyTrainer

env = VectorEnv("MountainCarContinuous-v0")
agent = DDPG("mlp", env)
trainer = OffPolicyTrainer(agent, env, max_timesteps=20000)
trainer.train()
trainer.evaluate()
```