Deep Deterministic Policy Gradients

Objective

Deep Deterministic Policy Gradients (DDPG) is a model-free actor-critic algorithm for continuous action spaces. One simple way of dealing with a continuous action space is to discretize it. However, this gives rise to several problems, the most significant being that the number of actions grows exponentially with the number of degrees of freedom: a 7-degree-of-freedom system discretized into just three values per dimension already yields 3^7 = 2187 actions. DDPG builds on Deterministic Policy Gradients to learn deterministic policies in high-dimensional continuous action spaces.

Algorithm Details

Deterministic Policy Gradient

With continuous action spaces, a Q-learning-style approach (greedy policy improvement) to learning deterministic policies is not feasible: it requires selecting the action that maximises the action-value function at every step, and the action-value cannot be evaluated for every possible action when the action space is continuous.

\[\mu^{k+1}(s) = \operatorname{argmax}_a Q^{\mu^{k}}(s, a)\]

This problem can be avoided by observing that the policy can instead be improved by moving its parameters in the direction that increases the action-value function:

\[\nabla_{\theta^{\mu}}J = \mathbb{E}_{s_t \sim \rho^{\beta}}[\nabla_{\theta^{\mu}}Q(s, a \vert \theta^{Q}) \vert_{s=s_t, a=\mu(s_t \vert \theta^{\mu})}]\]
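
Concretely, this gradient ascent is usually implemented by minimising the negative of the critic's estimate at the actions proposed by the policy. The following is a minimal sketch of that step, assuming generic PyTorch actor/critic modules and an optimizer rather than GenRL's internal classes:

import torch

def actor_gradient_step(actor, critic, actor_optimizer, states: torch.Tensor):
    """Move the policy parameters in the direction of increasing Q(s, mu(s))."""
    # Ascending Q(s, mu(s)) is equivalent to descending its negative
    actor_loss = -critic(states, actor(states)).mean()

    actor_optimizer.zero_grad()
    actor_loss.backward()  # chain rule: grad_a Q(s, a) * grad_theta mu(s)
    actor_optimizer.step()
    return actor_loss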

Action Selection

To ensure sufficient exploration, noise sampled from a noise process \(\mathcal{N}\) is added to the action selected by the current policy:

\[\mu'(s_t) = \mu(s_t \vert \theta_t^{\mu}) + \mathcal{N}\]

\(\mathcal{N}\) can be chosen to suit the environment (e.g. an Ornstein-Uhlenbeck process, Gaussian noise, etc.).
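
For illustration, a minimal Ornstein-Uhlenbeck process, which produces temporally correlated noise and is a common choice for DDPG, might look like the sketch below; the class and its parameter defaults are illustrative, not GenRL's own noise implementation.

import numpy as np

class OrnsteinUhlenbeckNoise:
    """Temporally correlated exploration noise for continuous actions."""

    def __init__(self, action_dim: int, mu: float = 0.0, theta: float = 0.15,
                 sigma: float = 0.2, dt: float = 1e-2):
        self.mu = mu * np.ones(action_dim)
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.reset()

    def reset(self):
        # Restart the process from its mean at the beginning of each episode
        self.state = np.copy(self.mu)

    def __call__(self):
        # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, I)
        dx = self.theta * (self.mu - self.state) * self.dt
        dx += self.sigma * np.sqrt(self.dt) * np.random.normal(size=self.mu.shape)
        self.state = self.state + dx
        return self.state

GenRL's DDPG agent applies such noise inside its action selection and clips the result to the environment's action bounds: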

    def select_action(
        self, state: torch.Tensor, deterministic: bool = True
    ) -> torch.Tensor:
        """Select action given state

        Deterministic Action Selection with Noise

        Args:
            state (:obj:`torch.Tensor`): Current state of the environment
            deterministic (bool): Should the policy be deterministic or stochastic

        Returns:
            action (:obj:`torch.Tensor`): Action taken by the agent
        """
        action, _ = self.ac.get_action(state, deterministic)
        action = action.detach()

        # add noise to output from policy network
        if self.noise is not None:
            action += self.noise()

        return torch.clamp(
            action, self.env.action_space.low[0], self.env.action_space.high[0]
        )

Experience Replay

Being an off-policy algorithm, DDPG, like DQN, makes use of a replay buffer. Whenever a transition \((s_t, a_t, r_t, s_{t+1})\) is encountered, it is stored in the replay buffer, and batches of these transitions are sampled when updating the network parameters. This breaks the strong correlations between updates that would arise if transitions were trained on and discarded as soon as they are encountered, and it also prevents the rapid forgetting of rare transitions that could be useful later.
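
As a mental model, a minimal uniform replay buffer can be sketched as below; GenRL provides its own buffer classes, so this is only an illustrative stand-in.

import random
from collections import deque

class ReplayBuffer:
    """Fixed-size FIFO store of transitions (s, a, r, s', done)."""

    def __init__(self, capacity: int):
        self.memory = deque(maxlen=capacity)

    def push(self, transition):
        # Once capacity is reached, the oldest transition is dropped automatically
        self.memory.append(transition)

    def sample(self, batch_size: int):
        # Uniform sampling decorrelates the updates
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)

During training, the trainer also periodically reports statistics through a logging helper: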

    def log(self, timestep: int) -> None:
        """Helper function to log

        Sends useful parameters to the logger.

        Args:
            timestep (int): Current timestep of training
        """
        self.logger.write(
            {
                "timestep": timestep,
                "Episode": self.episodes,
                **self.agent.get_logging_params(),
                "Episode Reward": safe_mean(self.training_rewards),

Update the Value and Policy Networks

DDPG makes use of target networks for both the actor (policy) and the critic (value) networks to stabilise training. The Q-network is updated using TD-learning. The target and the loss function are defined as:

\[L(\theta^{Q}) = \mathbb{E}_{s_t \sim \rho^{\beta}, a_t \sim \beta, r_t \sim R}[(Q(s_t, a_t \vert \theta^{Q}) - y_t)^{2}]\]
\[y_t = r(s_t, a_t) + \gamma Q_{targ}(s_{t+1}, \mu_{targ}(s_{t+1} \vert \theta^{\mu_{targ}}) \vert \theta^{Q_{targ}})\]

Building on Deterministic Policy Gradients, the gradient of the policy objective can be computed from the action-value function as

\[\nabla_{\theta^{\mu}} J = \mathbb{E}_{s_t \sim \rho^{\beta}}[\nabla_{\theta^{\mu}}Q(s, a \vert \theta^{Q})\vert_{s=s_t, a=\mu(s_t \vert \theta^{\mu})}]\]
\[\nabla_{\theta^{\mu}} J = \mathbb{E}_{s_t \sim \rho^{\beta}}[\nabla_a Q(s, a \vert \theta^{Q}) \vert_{s=s_t, a=\mu(s_t)}\nabla_{\theta^{\mu}}\mu(s \vert \theta^{\mu}) \vert_{s=s_t}]\]

The target networks are updated at regular intervals, typically through a soft (Polyak averaging) update towards the current networks.
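
The snippet below sketches a single DDPG update on a sampled batch: the critic is regressed onto the TD target built from the target networks, the actor takes the gradient-ascent step described earlier, and the target networks receive a soft (Polyak) update. The network and optimizer objects are generic PyTorch placeholders, not GenRL's internal classes.

import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, actor_targ, critic_targ,
                actor_opt, critic_opt, batch, gamma=0.99, polyak=0.995):
    """One DDPG update; `dones` is a float tensor of 0/1 terminal flags."""
    states, actions, rewards, next_states, dones = batch

    # Critic: regress Q(s, a) onto r + gamma * Q_targ(s', mu_targ(s'))
    with torch.no_grad():
        target_q = rewards + gamma * (1 - dones) * critic_targ(
            next_states, actor_targ(next_states)
        )
    critic_loss = F.mse_loss(critic(states, actions), target_q)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: gradient ascent on Q(s, mu(s))
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft (Polyak averaging) update of the target networks
    with torch.no_grad():
        for net, targ in ((actor, actor_targ), (critic, critic_targ)):
            for p, p_targ in zip(net.parameters(), targ.parameters()):
                p_targ.mul_(polyak).add_((1 - polyak) * p)

In GenRL, these updates are driven by the off-policy training loop, an excerpt of which is shown below: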


        for timestep in range(0, self.max_timesteps, self.env.n_envs):
            self.agent.update_params_before_select_action(timestep)

            action = self.get_action(state, timestep)
            next_state, reward, done, info = self.env.step(action)

            if self.render:
                self.env.render()

            # true_dones contains the "true" value of the dones (game over statuses). It is set
            # to False when the environment is not actually done but instead reaches the max
            # episode length.
            true_dones = [info[i]["done"] for i in range(self.env.n_envs)]
            self.buffer.push((state, action, reward, next_state, true_dones))

            state = next_state.detach().clone()

            if self.check_game_over_status(done):
                self.noise_reset()

                if self.episodes % self.log_interval == 0:
                    self.log(timestep)

                if self.episodes == self.epochs:
                    break

            if timestep >= self.start_update and timestep % self.update_interval == 0:
                self.agent.update_params(self.update_interval)

            if (
                timestep >= self.start_update
                and self.save_interval != 0
                and timestep % self.save_interval == 0
            ):
                self.save(timestep)

        self.env.close()
        self.logger.close()

Training through the API

from genrl.agents import DDPG
from genrl.environments import VectorEnv
from genrl.trainers import OffPolicyTrainer

env = VectorEnv("MountainCarContinuous-v0")
agent = DDPG("mlp", env)
trainer = OffPolicyTrainer(agent, env, max_timesteps=20000)
trainer.train()
trainer.evaluate()