Deep Q Networks with Noisy Nets

Objective

NoisyNet DQN is a variant of DQN that uses fully connected layers with noisy parameters to drive exploration. The parametrised action-value function can therefore be seen as a random variable. The new loss function to be minimised is defined as:

\[\mathbb{E}\left[\mathbb{E}_{(x, a, r, y) \sim D}\left[r + \gamma \max_{b \in A} Q(y, b, \epsilon'; \zeta^{-}) - Q(x, a, \epsilon; \zeta)\right]^{2}\right]\]

where \(\zeta\) denotes the learnable parameters of the noisy network and \(\zeta^{-}\) those of the target network, while \(\epsilon\) and \(\epsilon'\) are the corresponding noise samples.
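As an illustration, the sketch below computes this loss for a minibatch in PyTorch. It is a minimal sketch only: online_net, target_net and the batch tensors are assumed inputs for this example and are not part of the GenRL API.

import torch
import torch.nn.functional as F

def noisy_dqn_loss(online_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    # online_net and target_net are noisy Q-networks whose layers hold a current
    # noise sample (epsilon and epsilon' in the objective above).
    q_values = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # max_b Q(y, b, epsilon'; zeta^-), evaluated with the target network
        next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * next_q * (1 - dones.float())
    # Squared TD error, averaged over the minibatch
    return F.mse_loss(q_values, target)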

Algorithm Details

Action Selection

The action selection is no longer epsilon-greedy, since exploration is driven by the noise in the neural network layers; actions are instead selected greedily with respect to the noisy action-value estimates.
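As a minimal sketch (assuming a hypothetical noisy_q_net whose layers carry the current noise sample), greedy selection simply takes the argmax of the noisy Q-estimates:

import torch

def select_action(noisy_q_net, state):
    # No epsilon schedule: the stochasticity of the noisy layers drives exploration,
    # so we always act greedily with respect to the current noisy Q-estimates.
    with torch.no_grad():
        q_values = noisy_q_net(state.unsqueeze(0))
    return int(q_values.argmax(dim=1).item())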

Noisy Parameters

A noisy parameter \(\theta\) is defined as:

\[\theta := \mu + \Sigma \odot \epsilon\]

where \(\Sigma\) and \(\mu\) are vectors of trainable parameters and \(\epsilon\) is a vector of zero-mean noise. The loss function is therefore defined with respect to \(\Sigma\) and \(\mu\), and the optimization takes place over these parameters. \(\epsilon\) is sampled from factorised Gaussian noise.
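The sketch below shows what such a noisy fully connected layer could look like with factorised Gaussian noise. The class name NoisyLinear and the initialisation constants follow the NoisyNet paper; GenRL's actual implementation may differ in its details.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Linear layer whose weights and biases are mu + sigma * epsilon,
    with epsilon drawn from factorised Gaussian noise."""

    def __init__(self, in_features, out_features, sigma_init=0.5):
        super().__init__()
        self.in_features, self.out_features = in_features, out_features
        # Trainable means and standard deviations (mu and Sigma in the text)
        self.weight_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.weight_sigma = nn.Parameter(torch.empty(out_features, in_features))
        self.bias_mu = nn.Parameter(torch.empty(out_features))
        self.bias_sigma = nn.Parameter(torch.empty(out_features))
        # Noise buffers (epsilon), not trained
        self.register_buffer("weight_epsilon", torch.empty(out_features, in_features))
        self.register_buffer("bias_epsilon", torch.empty(out_features))
        self.reset_parameters(sigma_init)
        self.reset_noise()

    def reset_parameters(self, sigma_init):
        bound = 1 / math.sqrt(self.in_features)
        self.weight_mu.data.uniform_(-bound, bound)
        self.bias_mu.data.uniform_(-bound, bound)
        self.weight_sigma.data.fill_(sigma_init / math.sqrt(self.in_features))
        self.bias_sigma.data.fill_(sigma_init / math.sqrt(self.in_features))

    @staticmethod
    def _scaled_noise(size):
        x = torch.randn(size)
        return x.sign() * x.abs().sqrt()

    def reset_noise(self):
        # Factorised Gaussian noise: the weight noise is an outer product
        # of an input-sized and an output-sized noise vector
        eps_in = self._scaled_noise(self.in_features)
        eps_out = self._scaled_noise(self.out_features)
        self.weight_epsilon.copy_(torch.outer(eps_out, eps_in))
        self.bias_epsilon.copy_(eps_out)

    def forward(self, x):
        weight = self.weight_mu + self.weight_sigma * self.weight_epsilon
        bias = self.bias_mu + self.bias_sigma * self.bias_epsilon
        return F.linear(x, weight, bias)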

Experience Replay

Every transition occurring during training is stored in a separate replay buffer.

    def log(self, timestep: int) -> None:
        """Helper function to log

        Sends useful parameters to the logger.

        Args:
            timestep (int): Current timestep of training
        """
        self.logger.write(
            {
                "timestep": timestep,
                "Episode": self.episodes,
                **self.agent.get_logging_params(),
                "Episode Reward": safe_mean(self.training_rewards),

The transitions are later sampled in batches from the replay buffer to update the network.
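A minimal sketch of such a buffer with uniform random sampling is given below (GenRL provides its own replay buffer classes; this one is only illustrative):

import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of transitions with uniform random sampling."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def push(self, transition):
        # transition = (state, action, reward, next_state, done)
        self.memory.append(transition)

    def sample(self, batch_size):
        batch = random.sample(self.memory, batch_size)
        # Regroup into tuples of states, actions, rewards, next_states, dones
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.memory)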

Update the Q-Network

Once enough transitions are stored in the replay buffer, we start updating the Q-values according to the objective given above. The loss function is defined in the same fashion as for a standard DQN, which allows other improvements to DQN training, such as Double DQN or the dueling architecture, to be readily combined with NoisyNet DQN.


        for timestep in range(0, self.max_timesteps, self.env.n_envs):
            self.agent.update_params_before_select_action(timestep)

            action = self.get_action(state, timestep)
            next_state, reward, done, info = self.env.step(action)

            if self.render:
                self.env.render()

            # true_dones contains the "true" value of the dones (game over statuses). It is set
            # to False when the environment is not actually done but instead reaches the max
            # episode length.
            true_dones = [info[i]["done"] for i in range(self.env.n_envs)]
            self.buffer.push((state, action, reward, next_state, true_dones))

            state = next_state.detach().clone()

            if self.check_game_over_status(done):
                self.noise_reset()

                if self.episodes % self.log_interval == 0:
                    self.log(timestep)

                if self.episodes == self.epochs:
                    break

            if timestep >= self.start_update and timestep % self.update_interval == 0:
                self.agent.update_params(self.update_interval)

            if (
                timestep >= self.start_update
                and self.save_interval != 0
                and timestep % self.save_interval == 0
            ):
                self.save(timestep)

        self.env.close()
        self.logger.close()
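
For completeness, the sketch below illustrates what a single such update could look like, reusing the noisy_dqn_loss sketch from the Objective section. The attribute and helper names (online_net, target_net, reset_noise, target_sync_interval) are assumptions for illustration and not necessarily GenRL's internal names.

def update_q_network(agent, buffer, optimizer, step, batch_size=64, target_sync_interval=1000):
    # Sample a minibatch of stored transitions (assumed to already be tensors)
    states, actions, rewards, next_states, dones = buffer.sample(batch_size)

    # Resample epsilon in both networks so the update sees fresh noise
    for module in list(agent.online_net.modules()) + list(agent.target_net.modules()):
        if hasattr(module, "reset_noise"):
            module.reset_noise()

    # Squared TD error as in the objective (see the noisy_dqn_loss sketch above)
    loss = noisy_dqn_loss(
        agent.online_net, agent.target_net, states, actions, rewards, next_states, dones
    )

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Periodically copy the online weights into the target network
    if step % target_sync_interval == 0:
        agent.target_net.load_state_dict(agent.online_net.state_dict())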

Training through the API

from genrl.agents import NoisyDQN
from genrl.environments import VectorEnv
from genrl.trainers import OffPolicyTrainer

env = VectorEnv("CartPole-v0")
agent = NoisyDQN("mlp", env)
trainer = OffPolicyTrainer(agent, env, max_timesteps=20000)
trainer.train()
trainer.evaluate()