genrl.agents.deep.base package

Submodules

genrl.agents.deep.base.base module

class genrl.agents.deep.base.base.BaseAgent(network: Any, env: Any, create_model: bool = True, batch_size: int = 64, gamma: float = 0.99, shared_layers=None, policy_layers: Tuple = (64, 64), value_layers: Tuple = (64, 64), lr_policy: float = 0.0001, lr_value: float = 0.001, **kwargs)[source]

Bases: abc.ABC

Base Agent Class

network

The network type of the Q-value function. Supported types: [“cnn”, “mlp”]

Type: str

env

The environment that the agent is supposed to act on

Type: Environment

create_model

Whether the model of the algorithm should be created when initialised

Type: bool

batch_size

Mini-batch size for loading experiences

Type: int

gamma

The discount factor for rewards

Type: float

layers

Layers in the neural network of the Q-value function

Type: tuple of int

lr_policy

Learning rate for the policy/actor

Type: float

lr_value

Learning rate for the Q-value function

Type: float

seed

Seed for randomness

Type: int

render

Whether the environment should be rendered during training

Type: bool

device

Hardware used for training. Options: [“cuda” -> GPU, “cpu” -> CPU]

Type: str

empty_logs()[source]

Empties logs

get_hyperparams() → Dict[str, Any][source]

Get relevant hyperparameters to save

Returns: Hyperparameters to be saved
Return type: hyperparams (dict)
get_logging_params() → Dict[str, Any][source]

Gets relevant parameters for logging

Returns: Logging parameters for monitoring training
Return type: logs (dict)
select_action(state: numpy.ndarray, deterministic: bool = False) → numpy.ndarray[source]

Select action given state

Action selection method

Parameters:
  • state (np.ndarray) – Current state of the environment
  • deterministic (bool) – Should the policy be deterministic or stochastic
Returns:

Action taken by the agent

Return type:

action (np.ndarray)
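
The abstract interface above leaves select_action to subclasses, while the bookkeeping helpers follow a common pattern. The sketch below is illustrative only: BaseAgentSketch is a stand-in mirroring the documented method names, not the library's actual base class, and RandomAgent with its placeholder bodies is hypothetical.

    from abc import ABC, abstractmethod
    from typing import Any, Dict

    import numpy as np


    class BaseAgentSketch(ABC):
        """Stand-in mirroring the BaseAgent interface documented above."""

        def __init__(self, env: Any, batch_size: int = 64, gamma: float = 0.99):
            self.env = env
            self.batch_size = batch_size
            self.gamma = gamma
            self.logs: Dict[str, list] = {"policy_loss": [], "value_loss": []}

        @abstractmethod
        def select_action(self, state: np.ndarray, deterministic: bool = False) -> np.ndarray:
            """Return the action for `state`; stochastic unless deterministic=True."""

        def get_hyperparams(self) -> Dict[str, Any]:
            # Hyperparameters worth saving alongside model checkpoints
            return {"batch_size": self.batch_size, "gamma": self.gamma}

        def get_logging_params(self) -> Dict[str, Any]:
            # Aggregate recent losses for the logger
            return {k: float(np.mean(v)) if v else 0.0 for k, v in self.logs.items()}

        def empty_logs(self) -> None:
            # Clear accumulated logs once they have been written out
            self.logs = {key: [] for key in self.logs}


    class RandomAgent(BaseAgentSketch):
        """Hypothetical concrete agent: samples uniformly from the action space."""

        def select_action(self, state: np.ndarray, deterministic: bool = False) -> np.ndarray:
            return np.asarray(self.env.action_space.sample())
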

genrl.agents.deep.base.offpolicy module

class genrl.agents.deep.base.offpolicy.OffPolicyAgent(*args, replay_size: int = 5000, buffer_type: str = 'push', **kwargs)[source]

Bases: genrl.agents.deep.base.base.BaseAgent

Off Policy Agent Base Class

network

The network type of the Q-value function. Supported types: [“cnn”, “mlp”]

Type: str

env

The environment that the agent is supposed to act on

Type: Environment

create_model

Whether the model of the algorithm should be created when initialised

Type: bool

batch_size

Mini-batch size for loading experiences

Type: int

gamma

The discount factor for rewards

Type: float

layers

Layers in the neural network of the Q-value function

Type: tuple of int

lr_policy

Learning rate for the policy/actor

Type: float

lr_value

Learning rate for the Q-value function

Type: float

replay_size

Capacity of the Replay Buffer

Type: int

buffer_type

Choose the type of Buffer: [“push”, “prioritized”]

Type: str

seed

Seed for randomness

Type: int

render

Whether the environment should be rendered during training

Type: bool

device

Hardware used for training. Options: [“cuda” -> GPU, “cpu” -> CPU]

Type: str

get_q_loss(batch: collections.namedtuple) → torch.Tensor[source]

Base function to calculate the loss of the Q-function or critic

Parameters: batch (collections.namedtuple of torch.Tensor) – Batch of experiences
Returns: Calculated loss of the Q-function
Return type: loss (torch.Tensor)
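
To make the batch-based loss concrete, here is a minimal DQN-style sketch. It assumes the batch namedtuple carries states, actions, rewards, next_states and dones tensors with a trailing unit dimension; the model/target_model arguments are illustrative, not the library's exact signature.

    from collections import namedtuple

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    Batch = namedtuple("Batch", ["states", "actions", "rewards", "next_states", "dones"])


    def q_loss_sketch(model: nn.Module, target_model: nn.Module,
                      batch: Batch, gamma: float = 0.99) -> torch.Tensor:
        # Q-values of the actions that were actually taken: shape (B, 1)
        q_values = model(batch.states).gather(1, batch.actions.long())
        with torch.no_grad():
            # Bootstrap from the frozen target network; terminal states contribute no future value
            next_q = target_model(batch.next_states).max(dim=1, keepdim=True).values
            target = batch.rewards + gamma * (1.0 - batch.dones) * next_q
        return F.mse_loss(q_values, target)
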
sample_from_buffer(beta: float = None)[source]

Samples experiences from the buffer and converts them into usable formats

Parameters: beta (float) – Importance-Sampling beta for prioritized replay
Returns: Replay experiences sampled from the buffer
Return type: batch (list)
update_params(update_interval: int) → None[source]

Update parameters of the model

update_params_before_select_action(timestep: int) → None[source]

Update any parameters before selecting an action, e.g. epsilon for decaying epsilon-greedy exploration

Parameters: timestep (int) – Timestep in the training process
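
For example, a decaying epsilon-greedy schedule could be refreshed here before every action. The exponential schedule below is a sketch with illustrative constants, not the library's exact decay rule.

    import math

    def decayed_epsilon(timestep: int, max_epsilon: float = 1.0,
                        min_epsilon: float = 0.01, decay_rate: float = 1e-4) -> float:
        # Exponential decay from max_epsilon toward min_epsilon as training progresses
        return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * timestep)

    # epsilon starts at 1.0 and approaches 0.01 late in training
    print(decayed_epsilon(0))       # 1.0
    print(decayed_epsilon(50_000))  # ~0.017
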
update_target_model() → None[source]

Function to update the target Q model

Updates the target model with the training model’s weights when called
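
For a plain off-policy agent this is typically a hard copy of the training weights. The snippet below is a sketch assuming the agent holds model and target_model torch modules of identical architecture.

    import copy

    import torch.nn as nn

    model = nn.Linear(4, 2)              # stand-in training network
    target_model = copy.deepcopy(model)  # frozen copy used for bootstrapping

    # ... gradient steps update `model` ...

    # Hard update: overwrite the target network with the training weights
    target_model.load_state_dict(model.state_dict())
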

class genrl.agents.deep.base.offpolicy.OffPolicyAgentAC(*args, polyak=0.995, **kwargs)[source]

Bases: genrl.agents.deep.base.offpolicy.OffPolicyAgent

Off Policy Actor-Critic Agent Base Class

network

The network type of the Q-value function. Supported types: [“cnn”, “mlp”]

Type: str

env

The environment that the agent is supposed to act on

Type: Environment

create_model

Whether the model of the algorithm should be created when initialised

Type: bool

batch_size

Mini-batch size for loading experiences

Type: int

gamma

The discount factor for rewards

Type: float

layers

Layers in the neural network of the Q-value function

Type: tuple of int

lr_policy

Learning rate for the policy/actor

Type: float

lr_value

Learning rate for the Q-value function

Type: float

replay_size

Capacity of the Replay Buffer

Type: int

buffer_type

Choose the type of Buffer: [“push”, “prioritized”]

Type: str

seed

Seed for randomness

Type: int

render

Whether the environment should be rendered during training

Type: bool

device

Hardware used for training. Options: [“cuda” -> GPU, “cpu” -> CPU]

Type: str

get_p_loss(states: torch.Tensor) → torch.Tensor[source]

Function to get the Policy loss

Parameters: states (torch.Tensor) – States for which Q-values need to be found
Returns: Calculated policy loss
Return type: loss (torch.Tensor)
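
A common deterministic actor-critic formulation minimises the negated Q-value of the actor's own actions. The sketch below assumes hypothetical actor and critic modules in which the critic takes a concatenated (state, action) input; it is not the library's exact implementation.

    import torch
    import torch.nn as nn


    def p_loss_sketch(actor: nn.Module, critic: nn.Module, states: torch.Tensor) -> torch.Tensor:
        actions = actor(states)
        # Maximising Q(s, pi(s)) is implemented as minimising its negation
        q_values = critic(torch.cat([states, actions], dim=-1))
        return -q_values.mean()
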
get_q_loss(batch: collections.namedtuple) → torch.Tensor[source]

Actor Critic Function to calculate the loss of the Q-function or critic

Parameters: batch (collections.namedtuple of torch.Tensor) – Batch of experiences
Returns: Calculated loss of the Q-function
Return type: loss (torch.Tensor)
get_q_values(states: torch.Tensor, actions: torch.Tensor) → torch.Tensor[source]

Get Q values corresponding to specific states and actions

Parameters:
  • states (torch.Tensor) – States for which Q-values need to be found
  • actions (torch.Tensor) – Actions taken at respective states
Returns:

Q values for the given states and actions

Return type:

q_values (torch.Tensor)

get_target_q_values(next_states: torch.Tensor, rewards: List[float], dones: List[bool]) → torch.Tensor[source]

Get target Q values for TD3

Parameters:
  • next_states (torch.Tensor) – Next states for which target Q-values need to be found
  • rewards (list) – Rewards at each timestep for each environment
  • dones (list) – Game over status for each environment
Returns:

Target Q values for TD3

Return type:

target_q_values (torch.Tensor)
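
The target is the usual one-step Bellman backup computed with the target networks. This sketch assumes hypothetical target_actor/target_critic modules and reward/done tensors broadcastable against the critic output; it is illustrative, not the library's code.

    import torch
    import torch.nn as nn


    @torch.no_grad()
    def target_q_sketch(target_actor: nn.Module, target_critic: nn.Module,
                        next_states: torch.Tensor, rewards: torch.Tensor,
                        dones: torch.Tensor, gamma: float = 0.99) -> torch.Tensor:
        next_actions = target_actor(next_states)
        next_q = target_critic(torch.cat([next_states, next_actions], dim=-1))
        # Terminal transitions (dones == 1) contribute only the immediate reward
        return rewards + gamma * (1.0 - dones) * next_q
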

select_action(state: torch.Tensor, deterministic: bool = True) → torch.Tensor[source]

Select action given state

Deterministic Action Selection with Noise

Parameters:
  • state (torch.Tensor) – Current state of the environment
  • deterministic (bool) – Should the policy be deterministic or stochastic
Returns:

Action taken by the agent

Return type:

action (torch.Tensor)
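
In DDPG/TD3-style agents the actor output is perturbed with exploration noise unless deterministic selection is requested. The noise scale and action bounds below are illustrative assumptions, not the library's defaults.

    import torch
    import torch.nn as nn


    @torch.no_grad()
    def select_action_sketch(actor: nn.Module, state: torch.Tensor,
                             deterministic: bool = True, noise_std: float = 0.1,
                             low: float = -1.0, high: float = 1.0) -> torch.Tensor:
        action = actor(state)
        if not deterministic:
            # Gaussian exploration noise, clipped back to the valid action range
            action = action + noise_std * torch.randn_like(action)
        return action.clamp(low, high)
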

update_target_model() → None[source]

Function to update the target Q model

Updates the target model with the training model’s weights when called
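
Given the polyak argument in the constructor, the target update for actor-critic agents is typically a polyak (soft) average rather than a hard copy. A minimal sketch, not the library's exact code:

    import torch
    import torch.nn as nn


    @torch.no_grad()
    def soft_update_sketch(model: nn.Module, target_model: nn.Module, polyak: float = 0.995) -> None:
        # Each target weight moves a small step toward the corresponding training weight
        for param, target_param in zip(model.parameters(), target_model.parameters()):
            target_param.mul_(polyak).add_((1.0 - polyak) * param)
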

genrl.agents.deep.base.onpolicy module

class genrl.agents.deep.base.onpolicy.OnPolicyAgent(*args, rollout_size: int = 1024, buffer_type: str = 'rollout', **kwargs)[source]

Bases: genrl.agents.deep.base.base.BaseAgent

Base On Policy Agent Class

network

The network type of the Q-value function. Supported types: [“cnn”, “mlp”]

Type: str

env

The environment that the agent is supposed to act on

Type: Environment

create_model

Whether the model of the algorithm should be created when initialised

Type: bool

batch_size

Mini-batch size for loading experiences

Type: int

gamma

The discount factor for rewards

Type: float

layers

Layers in the neural network of the Q-value function

Type: tuple of int

lr_policy

Learning rate for the policy/actor

Type: float

lr_value

Learning rate for the Q-value function

Type: float

rollout_size

Capacity of the Rollout Buffer

Type: int

buffer_type

Choose the type of Buffer: [“rollout”]

Type: str

seed

Seed for randomness

Type: int

render

Whether the environment should be rendered during training

Type: bool

device

Hardware used for training. Options: [“cuda” -> GPU, “cpu” -> CPU]

Type: str

collect_rewards(dones: torch.Tensor, timestep: int)[source]

Helper function to collect rewards

Runs through all the envs and collects rewards accumulated during rollouts

Parameters:
  • dones (torch.Tensor) – Game over statuses of each environment
  • timestep (int) – Timestep during rollout
collect_rollouts(state: torch.Tensor)[source]

Function to collect rollouts

Collects rollouts by stepping the agent through the environment and storing the transitions in the rollout buffer.

Parameters: state (torch.Tensor) – The starting state of the environment
Returns:
  • values (torch.Tensor) – Values of states encountered during the rollout
  • dones (torch.Tensor) – Game over statuses of each environment
Return type: values, dones (torch.Tensor)
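
A rollout-collection loop of this shape might look like the following. The buffer.add signature and the policy_step/value_estimate helpers are hypothetical, included only to show how states, rewards and value estimates flow into the buffer.

    import torch


    def collect_rollouts_sketch(agent, env, buffer, rollout_size: int = 1024):
        state = torch.as_tensor(env.reset(), dtype=torch.float32)
        for _ in range(rollout_size):
            # policy_step is a hypothetical helper returning action, value and log-prob
            action, value, log_prob = agent.policy_step(state)
            next_state, reward, done, _ = env.step(action.numpy())
            buffer.add(state, action, reward, done, value, log_prob)  # hypothetical buffer API
            state = torch.as_tensor(next_state, dtype=torch.float32)
        # Value of the final state is used to bootstrap the return estimates
        values = agent.value_estimate(state)  # hypothetical helper
        return values, torch.as_tensor(done)
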
update_params() → None[source]

Update parameters of the model

Module contents