Defines the reinforcement learning OnPolicyRLEngine.
The reinforcement learning primary controller.
This OnPolicyRLEngine class handles all training, validation, and
testing, as well as logging and checkpointing. You are not expected
to instantiate this class yourself; instead, you should define an
experiment, which will then be used to instantiate an
OnPolicyRLEngine and perform any desired tasks.
| __init__(experiment_name: str, config: ExperimentConfig, results_queue: mp.Queue, checkpoints_queue: Optional[mp.Queue], checkpoints_dir: str, mode: str = "train", seed: Optional[int] = None, deterministic_cudnn: bool = False, mp_ctx: Optional[BaseContext] = None, worker_id: int = 0, num_workers: int = 1, device: Union[str, torch.device, int] = "cpu", distributed_port: int = 0, deterministic_agents: bool = False, max_sampler_processes_per_worker: Optional[int] = None, initial_model_state_dict: Optional[Dict[str, Any]] = None, **kwargs)
- config : The ExperimentConfig defining the experiment to run.
- output_dir : Root directory at which checkpoints and logs should be saved.
- seed : Seed used to encourage deterministic behavior (it is difficult to ensure completely deterministic behavior due to CUDA issues and nondeterminism in environments).
- mode : "train", "valid", or "test".
- deterministic_cudnn : Whether or not to use deterministic cudnn. If True, this may lower training performance, but it is necessary (though not sufficient) if you desire deterministic behavior.
- extra_tag : An additional label to add to the experiment when saving tensorboard logs.
| @staticmethod
| worker_seeds(nprocesses: int, initial_seed: Optional[int]) -> List[int]
Create a collection of seeds for workers without modifying the RNG state.
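
One way to produce seeds without touching the global RNG state is to draw them from a private generator. The following is a minimal sketch of that idea, not the engine's actual implementation: the function name worker_seeds_sketch and the seed range are assumptions made for illustration.

```python
import random
from typing import List, Optional


def worker_seeds_sketch(nprocesses: int, initial_seed: Optional[int]) -> List[int]:
    """Return one seed per worker without modifying the global `random` state.

    Hypothetical helper: the name and the seed range are illustrative only.
    """
    # A private generator has its own state, so the module-level RNG used
    # elsewhere in the program is left exactly as it was.
    rng = random.Random(initial_seed)
    return [rng.randint(0, 2**31 - 1) for _ in range(nprocesses)]


state_before = random.getstate()
seeds = worker_seeds_sketch(4, initial_seed=12345)
# The global RNG state is unchanged after generating the seeds.
assert random.getstate() == state_before
```

With a fixed initial_seed the same list is returned on every call, which is what makes per-worker seeding reproducible; with None, the private generator is seeded from system entropy.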
| probe(dones: List[bool], npaused, period=100000)
Debugging utility. When called from self.collect_rollout_step(...), calls render for the 0-th task sampler of the 0-th distributed worker, for the first episode that begins at least period steps after the beginning of the previously rendered one.
For valid, train, it currently renders all episodes for the 0-th task sampler of the 0-th distributed worker. If this is not wanted, it must for now be hard-coded in the method body.
- dones : The dones list from self.collect_rollout_step(...).
- npaused : Number of newly paused tasks returned by self.removed_paused(...).
- period : Minimal spacing, in sampled steps, between the beginnings of episodes to be shown.
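
The period-based spacing described above can be sketched as a small state machine. This is a simplified illustration under stated assumptions, not the engine's code: the class EpisodeProbeSketch, its update method, and the explicit episode_starts flag are all hypothetical, and the sketch tracks a single task sampler only.

```python
from typing import Optional


class EpisodeProbeSketch:
    """Decide which steps to render so that shown episodes begin >= period steps apart.

    Hypothetical class for illustration; it does not call any renderer itself.
    """

    def __init__(self, period: int = 100000) -> None:
        self.period = period
        self.last_start: Optional[int] = None  # step at which the last shown episode began
        self.showing = False  # True while the current episode is being rendered

    def update(self, total_steps: int, episode_starts: bool, episode_done: bool) -> bool:
        """Return True if the step at `total_steps` should be rendered."""
        if episode_starts and not self.showing:
            far_enough = (
                self.last_start is None
                or total_steps - self.last_start >= self.period
            )
            if far_enough:
                self.showing = True
                self.last_start = total_steps
        render = self.showing
        if episode_done:
            # Stop rendering once the shown episode has finished.
            self.showing = False
        return render


# Episodes of length 4, with a period of 10 sampled steps.
probe = EpisodeProbeSketch(period=10)
rendered = [t for t in range(28) if probe.update(t, t % 4 == 0, t % 4 == 3)]
```

In this toy run the episodes beginning at steps 0, 12, and 24 are rendered in full, since each begins at least 10 steps after the previous rendered episode began; episodes beginning in between are skipped.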