genrl.agents.deep.base package

Submodules

genrl.agents.deep.base.base module

class genrl.agents.deep.base.base.BaseAgent(network: Any, env: Any, create_model: bool = True, batch_size: int = 64, gamma: float = 0.99, shared_layers=None, policy_layers: Tuple = (64, 64), value_layers: Tuple = (64, 64), lr_policy: float = 0.0001, lr_value: float = 0.001, **kwargs)[source]

Bases: abc.ABC

Base Agent Class

network

The network type of the Q-value function. Supported types: [“cnn”, “mlp”]

Type: str

env

The environment that the agent is supposed to act on

Type: Environment

create_model

Whether the model of the algorithm should be created when initialised

Type: bool

batch_size

Mini-batch size for loading experiences

Type: int

gamma

The discount factor for rewards

Type: float

layers

Layers in the neural network of the Q-value function

Type: tuple of int

lr_policy

Learning rate for the policy/actor

Type: float

lr_value

Learning rate for the Q-value function

Type: float

seed

Seed for randomness

Type: int

render

Whether the environment should be rendered during training

Type: bool

device

Hardware used for training. Options: [“cuda” -> GPU, “cpu” -> CPU]

Type: str

empty_logs()[source]

Empties logs

get_hyperparams() → Dict[str, Any][source]

Get relevant hyperparameters to save

Returns: Hyperparameters to be saved
Return type: hyperparams (dict)
get_logging_params() → Dict[str, Any][source]

Gets relevant parameters for logging

Returns: Logging parameters for monitoring training
Return type: logs (dict)
select_action(state: numpy.ndarray, deterministic: bool = False) → numpy.ndarray[source]

Select action given state

Action selection method

Parameters:
  • state (np.ndarray) – Current state of the environment
  • deterministic (bool) – Should the policy be deterministic or stochastic
Returns:

Action taken by the agent

Return type:

action (np.ndarray)
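
The abstract interface above leaves select_action to subclasses, while the bookkeeping helpers follow a common pattern. The sketch below is illustrative only: BaseAgentSketch is a stand-in mirroring the documented method names, not the library's actual base class, and RandomAgent with its placeholder bodies is hypothetical.

    from abc import ABC, abstractmethod
    from typing import Any, Dict

    import numpy as np


    class BaseAgentSketch(ABC):
        """Stand-in mirroring the BaseAgent interface documented above."""

        def __init__(self, env: Any, batch_size: int = 64, gamma: float = 0.99):
            self.env = env
            self.batch_size = batch_size
            self.gamma = gamma
            self.logs: Dict[str, list] = {"policy_loss": [], "value_loss": []}

        @abstractmethod
        def select_action(self, state: np.ndarray, deterministic: bool = False) -> np.ndarray:
            """Return the action for `state`; stochastic unless deterministic=True."""

        def get_hyperparams(self) -> Dict[str, Any]:
            # Hyperparameters worth saving alongside model checkpoints
            return {"batch_size": self.batch_size, "gamma": self.gamma}

        def get_logging_params(self) -> Dict[str, Any]:
            # Aggregate recent losses for the logger
            return {k: float(np.mean(v)) if v else 0.0 for k, v in self.logs.items()}

        def empty_logs(self) -> None:
            # Clear accumulated logs once they have been written out
            self.logs = {key: [] for key in self.logs}


    class RandomAgent(BaseAgentSketch):
        """Hypothetical concrete agent: samples uniformly from the action space."""

        def select_action(self, state: np.ndarray, deterministic: bool = False) -> np.ndarray:
            return np.asarray(self.env.action_space.sample())
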

genrl.agents.deep.base.offpolicy module

class genrl.agents.deep.base.offpolicy.OffPolicyAgent(*args, replay_size: int = 5000, buffer_type: str = 'push', **kwargs)[source]

Bases: genrl.agents.deep.base.base.BaseAgent

Off Policy Agent Base Class

network

The network type of the Q-value function. Supported types: [“cnn”, “mlp”]

Type: str

env

The environment that the agent is supposed to act on

Type: Environment

create_model

Whether the model of the algorithm should be created when initialised

Type: bool

batch_size

Mini-batch size for loading experiences

Type: int

gamma

The discount factor for rewards

Type: float

layers

Layers in the neural network of the Q-value function

Type: tuple of int

lr_policy

Learning rate for the policy/actor

Type: float

lr_value

Learning rate for the Q-value function

Type: float

replay_size

Capacity of the Replay Buffer

Type: int

buffer_type

Choose the type of Buffer: [“push”, “prioritized”]

Type: str

seed

Seed for randomness

Type: int

render

Whether the environment should be rendered during training

Type: bool

device

Hardware used for training. Options: [“cuda” -> GPU, “cpu” -> CPU]

Type: str

get_q_loss(batch: collections.namedtuple) → torch.Tensor[source]

Base function to calculate the loss of the Q-function or critic

Parameters: batch (collections.namedtuple of torch.Tensor) – Batch of experiences
Returns: Calculated loss of the Q-function
Return type: loss (torch.Tensor)
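
To make the batch-based loss concrete, here is a minimal DQN-style sketch. It assumes the batch namedtuple carries states, actions, rewards, next_states and dones tensors with a trailing unit dimension; the model/target_model arguments are illustrative, not the library's exact signature.

    from collections import namedtuple

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    Batch = namedtuple("Batch", ["states", "actions", "rewards", "next_states", "dones"])


    def q_loss_sketch(model: nn.Module, target_model: nn.Module,
                      batch: Batch, gamma: float = 0.99) -> torch.Tensor:
        # Q-values of the actions that were actually taken: shape (B, 1)
        q_values = model(batch.states).gather(1, batch.actions.long())
        with torch.no_grad():
            # Bootstrap from the frozen target network; terminal states contribute no future value
            next_q = target_model(batch.next_states).max(dim=1, keepdim=True).values
            target = batch.rewards + gamma * (1.0 - batch.dones) * next_q
        return F.mse_loss(q_values, target)
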
sample_from_buffer(beta: float = None)[source]

Samples experiences from the buffer and converts them into usable formats

Parameters: beta (float) – Importance-Sampling beta for prioritized replay
Returns: Replay experiences sampled from the buffer
Return type: batch (list)
update_params(update_interval: int) → None[source]

Update parameters of the model

update_params_before_select_action(timestep: int) → None[source]

Update any parameters before selecting an action, e.g. epsilon for decaying epsilon-greedy exploration

Parameters: timestep (int) – Timestep in the training process
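
For example, a decaying epsilon-greedy schedule could be refreshed here before every action. The exponential schedule below is a sketch with illustrative constants, not the library's exact decay rule.

    import math

    def decayed_epsilon(timestep: int, max_epsilon: float = 1.0,
                        min_epsilon: float = 0.01, decay_rate: float = 1e-4) -> float:
        # Exponential decay from max_epsilon toward min_epsilon as training progresses
        return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * timestep)

    # epsilon starts at 1.0 and approaches 0.01 late in training
    print(decayed_epsilon(0))       # 1.0
    print(decayed_epsilon(50_000))  # ~0.017
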
update_target_model() → None[source]

Function to update the target Q model

Updates the target model with the training model’s weights when called
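
For a plain off-policy agent this is typically a hard copy of the training weights. The snippet below is a sketch assuming the agent holds model and target_model torch modules of identical architecture.

    import copy

    import torch.nn as nn

    model = nn.Linear(4, 2)              # stand-in training network
    target_model = copy.deepcopy(model)  # frozen copy used for bootstrapping

    # ... gradient steps update `model` ...

    # Hard update: overwrite the target network with the training weights
    target_model.load_state_dict(model.state_dict())
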

class genrl.agents.deep.base.offpolicy.OffPolicyAgentAC(*args, polyak=0.995, **kwargs)[source]

Bases: genrl.agents.deep.base.offpolicy.OffPolicyAgent

Off Policy Actor-Critic Agent Base Class

network

The network type of the Q-value function. Supported types: [“cnn”, “mlp”]

Type: str

env

The environment that the agent is supposed to act on

Type: Environment

create_model

Whether the model of the algorithm should be created when initialised

Type: bool

batch_size

Mini-batch size for loading experiences

Type: int

gamma

The discount factor for rewards

Type: float

layers

Layers in the neural network of the Q-value function

Type: tuple of int

lr_policy

Learning rate for the policy/actor

Type: float

lr_value

Learning rate for the Q-value function

Type: float

replay_size

Capacity of the Replay Buffer

Type: int

buffer_type

Choose the type of Buffer: [“push”, “prioritized”]

Type: str

seed

Seed for randomness

Type: int

render

Whether the environment should be rendered during training

Type: bool

device

Hardware used for training. Options: [“cuda” -> GPU, “cpu” -> CPU]

Type: str

get_p_loss(states: torch.Tensor) → torch.Tensor[source]

Function to get the Policy loss

Parameters: states (torch.Tensor) – States for which Q-values need to be found
Returns: Calculated policy loss
Return type: loss (torch.Tensor)
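
A common deterministic actor-critic formulation minimises the negated Q-value of the actor's own actions. The sketch below assumes hypothetical actor and critic modules in which the critic takes a concatenated (state, action) input; it is not the library's exact implementation.

    import torch
    import torch.nn as nn


    def p_loss_sketch(actor: nn.Module, critic: nn.Module, states: torch.Tensor) -> torch.Tensor:
        actions = actor(states)
        # Maximising Q(s, pi(s)) is implemented as minimising its negation
        q_values = critic(torch.cat([states, actions], dim=-1))
        return -q_values.mean()
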
get_q_loss(batch: collections.namedtuple) → torch.Tensor[source]

Actor Critic Function to calculate the loss of the Q-function or critic

Parameters: batch (collections.namedtuple of torch.Tensor) – Batch of experiences
Returns: Calculated loss of the Q-function
Return type: loss (torch.Tensor)
get_q_values(states: torch.Tensor, actions: torch.Tensor) → torch.Tensor[source]

Get Q values corresponding to specific states and actions

Parameters:
  • states (torch.Tensor) – States for which Q-values need to be found
  • actions (torch.Tensor) – Actions taken at respective states
Returns:

Q values for the given states and actions

Return type:

q_values (torch.Tensor)

get_target_q_values(next_states: torch.Tensor, rewards: List[float], dones: List[bool]) → torch.Tensor[source]

Get target Q values for TD3

Parameters:
  • next_states (torch.Tensor) – Next states for which target Q-values need to be found
  • rewards (list) – Rewards at each timestep for each environment
  • dones (list) – Game over status for each environment
Returns:

Target Q values for TD3

Return type:

target_q_values (torch.Tensor)
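
The target is the usual one-step Bellman backup computed with the target networks. This sketch assumes hypothetical target_actor/target_critic modules and reward/done tensors broadcastable against the critic output; it is illustrative, not the library's code.

    import torch
    import torch.nn as nn


    @torch.no_grad()
    def target_q_sketch(target_actor: nn.Module, target_critic: nn.Module,
                        next_states: torch.Tensor, rewards: torch.Tensor,
                        dones: torch.Tensor, gamma: float = 0.99) -> torch.Tensor:
        next_actions = target_actor(next_states)
        next_q = target_critic(torch.cat([next_states, next_actions], dim=-1))
        # Terminal transitions (dones == 1) contribute only the immediate reward
        return rewards + gamma * (1.0 - dones) * next_q
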

select_action(state: torch.Tensor, deterministic: bool = True) → torch.Tensor[source]

Select action given state

Deterministic Action Selection with Noise

Parameters:
  • state (torch.Tensor) – Current state of the environment
  • deterministic (bool) – Should the policy be deterministic or stochastic
Returns:

Action taken by the agent

Return type:

action (torch.Tensor)
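
In DDPG/TD3-style agents the actor output is perturbed with exploration noise unless deterministic selection is requested. The noise scale and action bounds below are illustrative assumptions, not the library's defaults.

    import torch
    import torch.nn as nn


    @torch.no_grad()
    def select_action_sketch(actor: nn.Module, state: torch.Tensor,
                             deterministic: bool = True, noise_std: float = 0.1,
                             low: float = -1.0, high: float = 1.0) -> torch.Tensor:
        action = actor(state)
        if not deterministic:
            # Gaussian exploration noise, clipped back to the valid action range
            action = action + noise_std * torch.randn_like(action)
        return action.clamp(low, high)
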

update_target_model() → None[source]

Function to update the target Q model

Updates the target model with the training model’s weights when called
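
Given the polyak argument in the constructor, the target update for actor-critic agents is typically a polyak (soft) average rather than a hard copy. A minimal sketch, not the library's exact code:

    import torch
    import torch.nn as nn


    @torch.no_grad()
    def soft_update_sketch(model: nn.Module, target_model: nn.Module, polyak: float = 0.995) -> None:
        # Each target weight moves a small step toward the corresponding training weight
        for param, target_param in zip(model.parameters(), target_model.parameters()):
            target_param.mul_(polyak).add_((1.0 - polyak) * param)
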

genrl.agents.deep.base.onpolicy module

class genrl.agents.deep.base.onpolicy.OnPolicyAgent(*args, rollout_size: int = 1024, buffer_type: str = 'rollout', **kwargs)[source]

Bases: genrl.agents.deep.base.base.BaseAgent

Base On Policy Agent Class

network

The network type of the Q-value function. Supported types: [“cnn”, “mlp”]

Type: str

env

The environment that the agent is supposed to act on

Type: Environment

create_model

Whether the model of the algorithm should be created when initialised

Type: bool

batch_size

Mini-batch size for loading experiences

Type: int

gamma

The discount factor for rewards

Type: float

layers

Layers in the neural network of the Q-value function

Type: tuple of int

lr_policy

Learning rate for the policy/actor

Type: float

lr_value

Learning rate for the Q-value function

Type: float

rollout_size

Capacity of the Rollout Buffer

Type: int

buffer_type

Choose the type of Buffer: [“rollout”]

Type: str

seed

Seed for randomness

Type: int

render

Whether the environment should be rendered during training

Type: bool

device

Hardware used for training. Options: [“cuda” -> GPU, “cpu” -> CPU]

Type: str

collect_rewards(dones: torch.Tensor, timestep: int)[source]

Helper function to collect rewards

Runs through all the envs and collects rewards accumulated during rollouts

Parameters:
  • dones (torch.Tensor) – Game over statuses of each environment
  • timestep (int) – Timestep during rollout
collect_rollouts(state: torch.Tensor)[source]

Function to collect rollouts

Collects rollouts by stepping the agent through the environment and storing the transitions in the rollout buffer.

Parameters: state (torch.Tensor) – The starting state of the environment
Returns:
  • values (torch.Tensor) – Values of states encountered during the rollout
  • dones (torch.Tensor) – Game over statuses of each environment
Return type: values, dones (torch.Tensor)
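
A rollout-collection loop of this shape might look like the following. The buffer.add signature and the policy_step/value_estimate helpers are hypothetical, included only to show how states, rewards and value estimates flow into the buffer.

    import torch


    def collect_rollouts_sketch(agent, env, buffer, rollout_size: int = 1024):
        state = torch.as_tensor(env.reset(), dtype=torch.float32)
        for _ in range(rollout_size):
            # policy_step is a hypothetical helper returning action, value and log-prob
            action, value, log_prob = agent.policy_step(state)
            next_state, reward, done, _ = env.step(action.numpy())
            buffer.add(state, action, reward, done, value, log_prob)  # hypothetical buffer API
            state = torch.as_tensor(next_state, dtype=torch.float32)
        # Value of the final state is used to bootstrap the return estimates
        values = agent.value_estimate(state)  # hypothetical helper
        return values, torch.as_tensor(done)
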
update_params() → None[source]

Update parameters of the model

Module contents