Defining the PPO loss for actor-critic-type models.
Implementation of the Advisor loss' stage 1 when main and auxiliary actors are equally weighted.
| loss(step_count: int, batch: Dict[str, Union[torch.Tensor, Dict[str, torch.Tensor]]], actor_critic_output: ActorCriticOutput[CategoricalDistr], *args, **kwargs)
Compute imitation loss on main and auxiliary policies.
- batch : A batch of data corresponding to the information collected when rolling out (possibly many) agents
over a fixed number of steps. In particular, this batch should contain expert observations; see
`ExpertPolicySensor` for an example of a sensor producing such observations.
- actor_critic_output : The output of calling an ActorCriticModel on the observations in `batch`.
- args : Extra args. Ignored.
- kwargs : Extra kwargs. Ignored.
A (0-dimensional) torch.FloatTensor corresponding to the computed loss. `.backward()` will be called on this
tensor in order to compute a gradient update to the ActorCriticModel's parameters.
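As an illustration of what the stage-1 computation described above might look like, here is a minimal sketch in which the main and auxiliary policies are imitation-trained with equal weight. The function name, shapes, and the plain cross-entropy formulation are assumptions for illustration, not the library's actual implementation:

```python
import torch
import torch.nn.functional as F

def imitation_loss(
    main_logits: torch.Tensor,
    aux_logits: torch.Tensor,
    expert_actions: torch.Tensor,
) -> torch.Tensor:
    # Hypothetical stage-1 sketch: cross-entropy imitation loss applied to
    # both the main and the auxiliary policy, equally weighted.
    # Assumed shapes: logits are [steps, num_actions], expert_actions is [steps].
    main_ce = F.cross_entropy(main_logits, expert_actions)
    aux_ce = F.cross_entropy(aux_logits, expert_actions)
    # Equal weighting of the two policies, as in stage 1 described above.
    return 0.5 * (main_ce + aux_ce)
```

The returned tensor is 0-dimensional, matching the documented return type, so `.backward()` can be called on it directly.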
Implementation of the Advisor loss' second stage (simplest variant).
rl_loss: The RL loss to use; should be a loss object of type `PPO` or `A2C` (or a `Builder` that, when called, returns such a loss object).
alpha: Exponent to use when reweighting the expert cross-entropy loss. Larger alpha means an (exponentially) smaller weight assigned to the cross-entropy loss. E.g. if the weight with alpha=1 is 0.6, then with alpha=2 it is 0.6^2 = 0.36.
bound: If the distance from the auxiliary policy to the expert policy is greater than this bound, then the distance is set to 0.
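The exponential reweighting that `alpha` produces can be sketched as follows. The function name, and the use of the auxiliary policy's probability of the expert action as the per-step weight, are illustrative assumptions and not the library's actual distance measure:

```python
import torch

def expert_weight(
    aux_logits: torch.Tensor,
    expert_actions: torch.Tensor,
    alpha: float = 1.0,
) -> torch.Tensor:
    # Illustrative sketch: use the auxiliary policy's probability of the
    # expert's action as a per-step weight in [0, 1].
    probs = torch.softmax(aux_logits, dim=-1)
    p_expert = probs.gather(-1, expert_actions.unsqueeze(-1)).squeeze(-1)
    # Raising to `alpha` shrinks the weight exponentially: a weight of
    # 0.6 at alpha=1 becomes 0.6 ** 2 = 0.36 at alpha=2.
    return p_expert ** alpha
```

This reproduces the 0.6 → 0.36 arithmetic from the `alpha` description above.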
alpha_scheduler: An object of type `AlphaScheduler` which is called before computing the loss in order to get a new value for `alpha`.
smooth_expert_weight_decay: If not None, will redistribute (smooth) the weight assigned to the cross-entropy loss at a particular step over the following `smooth_expert_steps` steps. Values of `smooth_expert_weight_decay` near 1 will increase how evenly weight is assigned to future steps; values near 0 will decrease how evenly this weight is distributed, with larger weight being given to steps less far into the future. `smooth_expert_steps` is automatically defined from `smooth_expert_weight_decay` as detailed below.
smooth_expert_steps: The number of "future" steps over which to distribute the current step's weight. This value is computed as `math.ceil(-math.log(1 + ((1 - r) / r) / 0.05) / math.log(r)) - 1` where `r = smooth_expert_weight_decay`. This ensures that the weight is always distributed over at least one additional step and never more than 20 steps into the future.
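The `smooth_expert_steps` expression above can be transcribed directly into a small helper; the function name is an assumption, but the formula is exactly the one stated:

```python
import math

def smooth_expert_steps(r: float) -> int:
    # Number of future steps over which the current step's expert weight is
    # distributed, per the documented formula. `r` is
    # smooth_expert_weight_decay and must lie in (0, 1).
    return math.ceil(-math.log(1 + ((1 - r) / r) / 0.05) / math.log(r)) - 1
```

For example, `r = 0.5` distributes the weight over 4 future steps, while values of `r` close to 1 approach the 20-step cap.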
| __init__(rl_loss: Optional[Union[Union[PPO, A2C], Builder[Union[PPO, A2C]]]], fixed_alpha: Optional[float] = 1, fixed_bound: Optional[float] = 0.1, alpha_scheduler: AlphaScheduler = None, smooth_expert_weight_decay: Optional[float] = None, *args, **kwargs)
See the class documentation for parameter definitions not included below.
- fixed_alpha: The fixed value of `alpha` to use. This value is IGNORED if `alpha_scheduler` is not None.
- fixed_bound: The fixed value of the