


Defines the reinforcement learning OnPolicyRLEngine.


class OnPolicyRLEngine(object)


The reinforcement learning primary controller.

This OnPolicyRLEngine class handles all training, validation, and testing, as well as logging and checkpointing. You are not expected to instantiate this class yourself. Instead, define an experiment, which will then be used to instantiate an OnPolicyRLEngine and perform any desired tasks.


 | __init__(experiment_name: str, config: ExperimentConfig, results_queue: mp.Queue, checkpoints_queue: Optional[mp.Queue], checkpoints_dir: str, mode: str = "train", seed: Optional[int] = None, deterministic_cudnn: bool = False, mp_ctx: Optional[BaseContext] = None, worker_id: int = 0, num_workers: int = 1, device: Union[str, torch.device, int] = "cpu", distributed_ip: str = "", distributed_port: int = 0, deterministic_agents: bool = False, max_sampler_processes_per_worker: Optional[int] = None, initial_model_state_dict: Optional[Union[Dict[str, Any], int]] = None, **kwargs)




  • config : The ExperimentConfig defining the experiment to run.
  • output_dir : Root directory at which checkpoints and logs should be saved.
  • seed : Seed used to encourage deterministic behavior (it is difficult to ensure completely deterministic behavior due to CUDA issues and nondeterminism in environments).
  • mode : "train", "valid", or "test".
  • deterministic_cudnn : Whether or not to use deterministic cudnn. If True, this may lower training performance, but it is necessary (though not sufficient) if you desire deterministic behavior.
  • extra_tag : An additional label to add to the experiment when saving tensorboard logs.


 | @staticmethod
 | worker_seeds(nprocesses: int, initial_seed: Optional[int]) -> List[int]


Create a collection of seeds for workers without modifying the RNG state.
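As an illustration of how seeds can be derived without touching the global RNG state, here is a minimal sketch. It is not the library's implementation: the function name worker_seeds_sketch is hypothetical, and it uses a private random.Random instance, which is one of several ways to leave the global generator untouched.

```python
import random
from typing import List, Optional


def worker_seeds_sketch(nprocesses: int, initial_seed: Optional[int]) -> List[int]:
    """Hypothetical stand-in for worker_seeds: one seed per worker process.

    A private random.Random instance is used, so the module-level RNG
    (random.random, random.randint, ...) is never advanced or reseeded.
    """
    # With initial_seed=None the private generator is seeded from OS entropy.
    rng = random.Random(initial_seed)
    return [rng.randint(0, 2**31 - 1) for _ in range(nprocesses)]


if __name__ == "__main__":
    state_before = random.getstate()
    seeds = worker_seeds_sketch(nprocesses=4, initial_seed=12345)
    # The global RNG state is identical before and after the call.
    print(len(seeds), random.getstate() == state_before)
```

The same initial_seed always yields the same list, so each worker can be given a distinct but reproducible seed.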


 | probe(dones: List[bool], npaused, period=100000)


Debugging utility. When called from self.collect_rollout_step(...), calls render for the 0-th task sampler of the 0-th distributed worker, starting with the first episode that begins at least period steps after the beginning of the previously rendered one.

In the valid and train modes, it currently renders all episodes for the 0-th task sampler of the 0-th distributed worker. If this is not wanted, the behavior must, for now, be changed by hard-coding it in the method body.


  • dones: dones list from self.collect_rollout_step(...)
  • npaused: number of newly paused tasks returned by self.removed_paused(...)
  • period: minimal spacing in sampled steps between the beginning of episodes to be shown.
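The spacing rule controlled by period can be sketched as follows. This is a simplified, hypothetical model and not the engine's code: the class PeriodicEpisodeProbe and its attributes are invented for illustration, it tracks a single task sampler, and it ignores npaused entirely.

```python
from typing import Optional


class PeriodicEpisodeProbe:
    """Hypothetical helper: decide, step by step, whether to render.

    An episode is shown in full if it begins at least `period` sampled
    steps after the beginning of the previously shown episode.
    """

    def __init__(self, period: int = 100000) -> None:
        self.period = period
        self.total_steps = 0
        self.last_shown_start: Optional[int] = None
        self.at_episode_start = True
        self.showing = False

    def step(self, done: bool) -> bool:
        """Register one sampled step; return True if it should be rendered."""
        if self.at_episode_start:
            # Decide once per episode, at its first step.
            far_enough = (
                self.last_shown_start is None
                or self.total_steps - self.last_shown_start >= self.period
            )
            self.showing = far_enough
            if far_enough:
                self.last_shown_start = self.total_steps
        render_now = self.showing
        self.total_steps += 1
        # The step after a finished episode starts a new one.
        self.at_episode_start = done
        return render_now


if __name__ == "__main__":
    probe = PeriodicEpisodeProbe(period=5)
    dones = [False, False, True] * 4  # four episodes of three steps each
    print([probe.step(d) for d in dones])
```

With period=5 and three-step episodes, the first and third episodes are rendered and the second and fourth are skipped, since they begin only three steps after a shown episode.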


class OnPolicyTrainer(OnPolicyRLEngine)



 | distributed_weighted_sum(to_share: Union[torch.Tensor, float, int], weight: Union[torch.Tensor, float, int])


Weighted sum of a scalar across distributed workers.
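To make the quantity concrete, the sketch below computes the same kind of weighted sum in a single process. It is only an assumption about the arithmetic: the function weighted_sum_across_workers is hypothetical, and the list of pairs stands in for the values that separate workers would contribute in a real distributed run, where the summation would be performed by a collective operation rather than a local loop.

```python
from typing import List, Tuple


def weighted_sum_across_workers(contributions: List[Tuple[float, float]]) -> float:
    """Hypothetical single-process stand-in for a distributed weighted sum.

    Each tuple is (to_share, weight) as one worker would supply it; the
    result is the sum over workers of to_share * weight.
    """
    total = 0.0
    for to_share, weight in contributions:
        total += to_share * weight
    return total


if __name__ == "__main__":
    # Two simulated workers: a loss of 1.0 weighted by 2.0 and a loss of 3.0
    # weighted by 0.5.
    print(weighted_sum_across_workers([(1.0, 2.0), (3.0, 0.5)]))
```

Choosing weights that sum to one across workers (for example, each worker's share of the total batch) turns this sum into a weighted average.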