Defining the Advisor loss for actor-critic type models.


class AdvisorImitationStage(AbstractActorCriticLoss)


Implementation of the Advisor loss' first stage, when the main and auxiliary actors are equally weighted.


 | loss(step_count: int, batch: Dict[str, Union[torch.Tensor, Dict[str, torch.Tensor]]], actor_critic_output: ActorCriticOutput[CategoricalDistr], *args, **kwargs)


Compute imitation loss on main and auxiliary policies.


  • batch : A batch of data corresponding to the information collected when rolling out (possibly many) agents over a fixed number of steps. In particular this batch should have the same format as that returned by RolloutStorage.recurrent_generator. Here batch["observations"] must contain "expert_action" observations or "expert_policy" observations. See ExpertActionSensor (or ExpertPolicySensor) for an example of a sensor producing such observations.
  • actor_critic_output : The output of calling an ActorCriticModel on the observations in batch.
  • args : Extra args. Ignored.
  • kwargs : Extra kwargs. Ignored.


A (0-dimensional) torch.FloatTensor corresponding to the computed loss. .backward() will be called on this tensor in order to compute a gradient update to the ActorCriticModel's parameters.
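
As a rough illustration of what an equally weighted imitation loss over two policies can look like, the sketch below averages the sum of the main and auxiliary cross entropies against the expert's actions. It uses plain Python lists in place of tensors and assumes "expert_action" style observations (one expert action index per step). The names `cross_entropy` and `imitation_stage_loss` are hypothetical and are not part of the library; the actual implementation may reduce or mask steps differently.

```python
import math
from typing import List, Sequence


def cross_entropy(probs: Sequence[float], expert_action: int) -> float:
    """Negative log-probability that a policy assigns to the expert's action."""
    return -math.log(probs[expert_action])


def imitation_stage_loss(
    main_probs: List[Sequence[float]],
    aux_probs: List[Sequence[float]],
    expert_actions: List[int],
) -> float:
    """Mean over steps of (main CE + auxiliary CE), with both terms weighted equally.

    Hypothetical helper: each entry of `main_probs` / `aux_probs` is one step's
    action distribution, and `expert_actions` holds the expert's action index.
    """
    per_step = [
        cross_entropy(m, a) + cross_entropy(x, a)
        for m, x, a in zip(main_probs, aux_probs, expert_actions)
    ]
    return sum(per_step) / len(per_step)
```

When both policies already put all of their probability on the expert's action, every term is zero and so is the loss.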


class AdvisorWeightedStage(AbstractActorCriticLoss)


Implementation of the Advisor loss' second stage (simplest variant).


  • rl_loss: The RL loss to use. This should be a loss object of type PPO or A2C (or a Builder that, when called, returns such a loss object).
  • alpha: Exponent to use when reweighting the expert cross entropy loss. Larger alpha means an (exponentially) smaller weight assigned to the cross entropy loss. E.g. if the weight with alpha=1 is 0.6 then with alpha=2 it is 0.6^2=0.36.
  • bound: If the distance from the auxiliary policy to the expert policy is greater than this bound then the distance is set to 0.
  • alpha_scheduler: An object of type AlphaScheduler which is called before computing the loss in order to get a new value for alpha.
  • smooth_expert_weight_decay: If not None, will redistribute (smooth) the weight assigned to the cross entropy loss at a particular step over the following smooth_expert_steps steps. Values of smooth_expert_weight_decay near 1 will increase how evenly weight is assigned to future steps. Values near 0 will decrease how evenly this weight is distributed, with larger weight being given to steps less far into the future. Here smooth_expert_steps is automatically defined from smooth_expert_weight_decay as detailed below.
  • smooth_expert_steps: The number of "future" steps over which to distribute the current step's weight. This value is computed as math.ceil(-math.log(1 + ((1 - r) / r) / 0.05) / math.log(r)) - 1 where r=smooth_expert_weight_decay. This ensures that the weight is always distributed over at least one additional step and that it is never distributed more than 20 steps into the future.
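
The two numeric rules in the parameter list above can be checked with a short standalone sketch. `smooth_expert_steps` evaluates the documented formula directly; the range check on `r` is an assumption added here, since the formula is undefined at r=0 and r=1. `reweight` is a hypothetical helper that only mirrors the alpha=1 versus alpha=2 example, assuming the weight for a given alpha is the alpha=1 weight raised to the power alpha.

```python
import math


def smooth_expert_steps(r: float) -> int:
    """Evaluate the documented formula for r = smooth_expert_weight_decay."""
    # Assumption: r must lie strictly between 0 and 1, because math.log(r)
    # is 0 at r = 1 and undefined at r = 0.
    if not 0.0 < r < 1.0:
        raise ValueError("smooth_expert_weight_decay must be in the open interval (0, 1)")
    return math.ceil(-math.log(1 + ((1 - r) / r) / 0.05) / math.log(r)) - 1


def reweight(weight_at_alpha_one: float, alpha: float) -> float:
    """Hypothetical helper: the weight at alpha=1 raised to the power alpha."""
    return weight_at_alpha_one ** alpha


if __name__ == "__main__":
    for r in (0.1, 0.5, 0.9):
        print(r, smooth_expert_steps(r))
    print(reweight(0.6, 2))
```

For example, r=0.1 gives 2 steps, r=0.5 gives 4 steps and r=0.9 gives 11 steps, so decay values closer to 1 spread the weight further into the future.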


 | __init__(rl_loss: Optional[Union[Union[PPO, A2C], Builder[Union[PPO, A2C]]]], fixed_alpha: Optional[float] = 1, fixed_bound: Optional[float] = 0.1, alpha_scheduler: AlphaScheduler = None, smooth_expert_weight_decay: Optional[float] = None, *args, **kwargs)



See the class documentation for parameter definitions not included below.


  • fixed_alpha: The fixed value of alpha to use. This value is IGNORED if alpha_scheduler is not None.
  • fixed_bound: The fixed value of the bound to use.
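
The precedence between fixed_alpha and alpha_scheduler can be sketched as follows. This is an illustration only: the interface of AlphaScheduler is not described here, so the sketch assumes a scheduler is any callable that maps a step count to an alpha value, and `resolve_alpha` is a hypothetical name that does not appear in the library.

```python
from typing import Callable, Optional

# Assumption: a scheduler is modelled as a plain callable from step count to
# alpha. The real AlphaScheduler type may expose a different interface.
AlphaSchedule = Callable[[int], float]


def resolve_alpha(
    step_count: int,
    fixed_alpha: Optional[float] = 1,
    alpha_scheduler: Optional[AlphaSchedule] = None,
) -> float:
    """Return the alpha to use at this step; a scheduler overrides fixed_alpha."""
    if alpha_scheduler is not None:
        # fixed_alpha is ignored whenever a scheduler is supplied.
        return alpha_scheduler(step_count)
    if fixed_alpha is None:
        raise ValueError("Provide either fixed_alpha or alpha_scheduler")
    return fixed_alpha
```

With no scheduler, `resolve_alpha(0, fixed_alpha=1.0)` returns the fixed value; with a scheduler, the fixed value has no effect on the result.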