PPO1
genrl.agents.deep.ppo1.ppo1 module
class genrl.agents.deep.ppo1.ppo1.PPO1(*args, clip_param: float = 0.2, value_coeff: float = 0.5, entropy_coeff: float = 0.01, **kwargs)

Bases: genrl.agents.deep.base.onpolicy.OnPolicyAgent

Proximal Policy Optimization algorithm (clipped policy objective).
Paper: https://arxiv.org/abs/1707.06347
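A minimal quick-start sketch is shown below. It follows the usual GenRL pattern; the import paths (genrl.agents, genrl.environments.VectorEnv, genrl.trainers.OnPolicyTrainer) and keyword names are assumptions that should be checked against the installed version:

    # Minimal PPO1 training sketch; VectorEnv / OnPolicyTrainer usage follows the
    # common GenRL quick-start pattern and is an assumption, not a verbatim recipe.
    from genrl.agents import PPO1
    from genrl.environments import VectorEnv
    from genrl.trainers import OnPolicyTrainer

    env = VectorEnv("CartPole-v0")            # vectorised Gym environment
    agent = PPO1("mlp", env, clip_param=0.2)  # MLP actor-critic with PPO clipping
    trainer = OnPolicyTrainer(agent, env, epochs=100, log_mode=["stdout"])
    trainer.train()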
Attributes:

- network (str) – The network type of the Q-value function. Supported types: ["cnn", "mlp"]
- env (Environment) – The environment that the agent is supposed to act on
- create_model (bool) – Whether the model of the algorithm should be created when initialised
- batch_size (int) – Mini-batch size for loading experiences
- gamma (float) – The discount factor for rewards
- layers (tuple of int) – Layers in the neural network of the Q-value function; also the sizes of the shared layers of the Actor-Critic, if shared layers are used
- lr_policy (float) – Learning rate for the policy/actor
- lr_value (float) – Learning rate for the Q-value function
- rollout_size (int) – Capacity of the Rollout Buffer
- buffer_type (str) – Choose the type of Buffer: ["rollout"]
- clip_param (float) – Epsilon for clipping the policy loss (see the loss sketch after this list)
- value_coeff (float) – Ratio of the magnitude of value updates to policy updates
- entropy_coeff (float) – Ratio of the magnitude of entropy updates to policy updates
- seed (int) – Seed for randomness
- render (bool) – Should the env be rendered during training?
- device (str) – Hardware being used for training. Options: ["cuda" -> GPU, "cpu" -> CPU]
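clip_param, value_coeff and entropy_coeff together define the standard PPO objective: a clipped surrogate policy loss, a value-regression term scaled by value_coeff, and an entropy bonus scaled by entropy_coeff. The sketch below illustrates that combination; it is not the library's exact implementation:

    import torch
    import torch.nn.functional as F

    def ppo_clipped_loss(ratios, advantages, values, returns, entropy,
                         clip_param=0.2, value_coeff=0.5, entropy_coeff=0.01):
        # ratios = exp(new_log_probs - old_log_probs) for the sampled actions
        unclipped = ratios * advantages
        clipped = torch.clamp(ratios, 1.0 - clip_param, 1.0 + clip_param) * advantages
        policy_loss = -torch.min(unclipped, clipped).mean()
        # Critic regression towards the rollout returns
        value_loss = F.mse_loss(values, returns)
        # Entropy bonus (subtracted because the total loss is minimised)
        return policy_loss + value_coeff * value_loss - entropy_coeff * entropy.mean()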
evaluate_actions(states: torch.Tensor, actions: torch.Tensor)

Evaluates actions taken by the actor. The actions and their respective states are analysed to obtain log probabilities and values from the critic.

Parameters:
- states (torch.Tensor) – States encountered in the rollout
- actions (torch.Tensor) – Actions taken in response to the respective states

Returns:
- values (torch.Tensor) – Values of the states encountered during the rollout
- log_probs (torch.Tensor) – Log probabilities of the actions given their states
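An illustrative use of evaluate_actions during a policy update; rollout_states, rollout_actions and old_log_probs are hypothetical tensors taken from the rollout buffer, not names defined by the library:

    import torch

    # Re-evaluate stored actions under the current policy (hypothetical buffer tensors).
    values, log_probs = agent.evaluate_actions(rollout_states, rollout_actions)
    ratios = torch.exp(log_probs - old_log_probs)  # probability ratios for the clipped loss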
get_hyperparams() → Dict[str, Any]

Get relevant hyperparameters to save.

Returns:
- hyperparams (dict) – Hyperparameters to be saved, including weights (torch.Tensor), the neural network weights
get_logging_params() → Dict[str, Any]

Gets relevant parameters for logging.

Returns:
- logs (dict) – Logging parameters for monitoring training
get_traj_loss(values, dones)

Get the loss from the trajectory traversed by the agent during rollouts. Computes the returns and advantages needed for calculating the loss.

Parameters:
- values (torch.Tensor) – Values of the states encountered during the rollout
- dones (list of bool) – Game-over statuses of each environment
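get_traj_loss is called once per rollout, after the final value estimate is available. A generic sketch of the bootstrapped discounted-return computation it relies on (not the library's exact code):

    import torch

    def discounted_returns(rewards, dones, last_value, gamma=0.99):
        # rewards, dones: [T, n_envs] tensors collected during the rollout;
        # last_value: critic estimate for the state after the final step.
        returns = torch.zeros_like(rewards)
        running = last_value
        for t in reversed(range(rewards.shape[0])):
            running = rewards[t] + gamma * running * (1.0 - dones[t])
            returns[t] = running
        return returns  # advantages are then returns - values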
select_action(state: torch.Tensor, deterministic: bool = False) → torch.Tensor

Select an action given a state. Action selection for on-policy agents with an Actor-Critic.

Parameters:
- state (np.ndarray) – Current state of the environment
- deterministic (bool) – Should the policy be deterministic or stochastic?

Returns:
- action (np.ndarray) – Action taken by the agent
- value (torch.Tensor) – Value of the given state
- log_prob (torch.Tensor) – Log probability of the selected action
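An illustrative rollout-collection step built around select_action, continuing the quick-start sketch above; the unpacking of the return value into (action, value, log_prob) is an assumption based on the documented return types:

    import torch

    # Collect one environment step (env and agent come from the quick-start sketch).
    state = env.reset()
    action, value, log_prob = agent.select_action(torch.as_tensor(state, dtype=torch.float32))
    next_state, reward, done, info = env.step(action)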