Active Neural SLAM module.
This is an implementation of the Active Neural SLAM module from:
Chaplot, D.S., Gandhi, D., Gupta, S., Gupta, A. and Salakhutdinov, R., 2020. Learning To Explore Using Active Neural SLAM. In International Conference on Learning Representations (ICLR).
Note that this is purely the mapping component and does not include the planning components from the above paper.
This implementation is adapted from
We have extended this implementation to allow for an arbitrary number of output map
channels (enabling semantic mapping).
At a high level, this model takes as input RGB egocentric images and outputs metric map tensors of shape (# channels) x height x width where height/width correspond to the ground plane of the environment.
| __init__(frame_height: int, frame_width: int, n_map_channels: int, resolution_in_cm: int = 5, map_size_in_cm: int = 2400, vision_range_in_cm: int = 300, use_pose_estimation: bool = False, pretrained_resnet: bool = True, freeze_resnet_batchnorm: bool = True, use_resnet_layernorm: bool = False)
Initialize an Active Neural SLAM module.
- frame_height : The height of the RGB images given to this module on calls to forward.
- frame_width : The width of the RGB images given to this module on calls to forward.
- n_map_channels : The number of output channels in the output maps.
- resolution_in_cm : The resolution of the output map, see
- map_size_in_cm : The height & width of the map in centimeters. The size of the map
tensor returned on calls to forward will be
map_size_in_cm/resolution_in_cm. Note that
map_size_in_cm must be divisible by resolution_in_cm.
- vision_range_in_cm : Given an RGB image input, this module will transform this image into
an "egocentric map" with height and width equaling
vision_range_in_cm/resolution_in_cm. This egocentric map corresponds to the area of the world directly in front of the agent. This "egocentric map" will be rotated/translated into the allocentric reference frame and used to update the larger, allocentric, map whose height and width equal
map_size_in_cm/resolution_in_cm. Thus this parameter controls how much of the map will be updated on every step.
- use_pose_estimation : Whether or not we should estimate the agent's change in position/rotation.
If False, you'll need to provide the ground truth changes in position/rotation.
- pretrained_resnet : Whether or not to use ImageNet pre-trained model weights for the ResNet18 backbone.
- freeze_resnet_batchnorm : Whether or not the batch normalization layers in the ResNet18 backbone
should be frozen and batchnorm updates disabled. You almost certainly want this to be
True, as using batch normalization during RL training results in all sorts of issues unless you're very careful.
- use_resnet_layernorm : If you've enabled
freeze_resnet_batchnorm (recommended), you'll likely want to normalize the output from the ResNet18 model, as we've found that these values can otherwise grow quite large, harming learning.
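The relationship between the size parameters above and the resulting map shapes can be sketched as follows. This is an illustrative helper only: the function name `expected_map_shapes` is hypothetical and not part of this module, and flooring the egocentric size is an assumption made here for the case where vision_range_in_cm is not a multiple of resolution_in_cm.

```python
def expected_map_shapes(
    n_map_channels,
    resolution_in_cm=5,
    map_size_in_cm=2400,
    vision_range_in_cm=300,
):
    """Hypothetical helper: derive map shapes from the constructor arguments."""
    # The documentation requires map_size_in_cm to be divisible by resolution_in_cm.
    if map_size_in_cm % resolution_in_cm != 0:
        raise ValueError("map_size_in_cm must be divisible by resolution_in_cm")

    # Side length (in cells) of the full allocentric map.
    allocentric_side = map_size_in_cm // resolution_in_cm
    # Side length (in cells) of the egocentric map in front of the agent.
    # Assumption: floor division is used if the range is not an exact multiple.
    egocentric_side = vision_range_in_cm // resolution_in_cm

    return {
        "allocentric": (n_map_channels, allocentric_side, allocentric_side),
        "egocentric": (n_map_channels, egocentric_side, egocentric_side),
    }


shapes = expected_map_shapes(n_map_channels=3)
print(shapes["allocentric"])  # (3, 480, 480) with the default sizes
print(shapes["egocentric"])   # (3, 60, 60) with the default sizes
```

With the defaults, each step therefore updates at most a 60 x 60 window of the 480 x 480 allocentric map.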
| forward(images: Optional[torch.Tensor], last_map_probs_allocentric: Optional[torch.Tensor], last_xzrs_allocentric: Optional[torch.Tensor], dx_dz_drs_egocentric: Optional[torch.Tensor], last_map_logits_egocentric: Optional[torch.Tensor], return_allocentric_maps=True, resnet_image_features: Optional[torch.Tensor] = None) -> Dict[str, Any]
Create allocentric/egocentric map predictions given RGB image inputs.
Here it is assumed that
last_xzrs_allocentric has been re-centered so that (x, z) == (0,0)
corresponds to the top left of the returned map (with increasing x/z moving to the bottom right of the map).
Note that all maps are oriented so that:
* Increasing x values correspond to increasing columns in the map(s).
* Increasing z values correspond to increasing rows in the map(s).
Note that this may seem a bit weird as:
* "north" is pointing downwards in the map,
* if you picture yourself as the agent facing north (i.e. down) in the map, then moving to the right from the agent's perspective will correspond to increasing which column the agent is at:

    agent facing downwards - - > (dir. to the right of the agent, i.e. moving right corresponds to +cols)
                           |
                           |
                           v (dir. agent faces, i.e. moving ahead corresponds to +rows)

This may be the opposite of what you expect.
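As a rough illustration of this convention, the sketch below converts a re-centered (x, z) position into a (row, column) map index. The helper name `xz_to_row_col` is hypothetical, and it assumes that positions are given in meters and that each map cell covers resolution_in_cm centimeters; neither detail is confirmed by the text above.

```python
import math


def xz_to_row_col(x_in_m, z_in_m, resolution_in_cm=5):
    """Hypothetical helper: map a re-centered (x, z) position to (row, col).

    Assumes (x, z) == (0, 0) is the top left of the map, that x grows with
    columns and z grows with rows, and that positions are in meters.
    """
    col = math.floor(x_in_m * 100 / resolution_in_cm)  # +x -> more columns
    row = math.floor(z_in_m * 100 / resolution_in_cm)  # +z -> more rows
    return row, col


# Moving 1 m along +x and 2.5 m along +z from the top-left corner:
print(xz_to_row_col(1.0, 2.5))  # (50, 20)
```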
- images : A (# batches) x 3 x height x width tensor of RGB images. These should be
normalized for use with a resnet model. See here
for information (see also the
use_resnet_normalization parameter of the
- last_map_probs_allocentric : A (# batches) x (map channels) x (map height) x (map width) tensor representing the collection of allocentric maps to be updated.
- last_xzrs_allocentric : A (# batches) x 3 tensor where
last_xzrs_allocentric[:, 0] are the agent's (allocentric) x-coordinates from the previous step,
last_xzrs_allocentric[:, 1] are the agent's (allocentric) z-coordinates from the previous step, and
last_xzrs_allocentric[:, 2] are the agent's rotations (allocentric, in degrees) from the previous step.
- dx_dz_drs_egocentric : A (# batches) x 3 tensor representing the agent's change in x (in meters), z (in meters),
and rotation (in degrees) from the previous step. Note that these changes are "egocentric" so that if the
agent moved 1 meter ahead from its perspective this should correspond to a dz of +1.0 regardless of
the agent's orientation (similarly moving right would result in a dx of +1.0). This
is ignored (and thus can be
None) if you are using pose estimation (i.e.
self.use_pose_estimation == True) or if
- last_map_logits_egocentric : The "egocentric_update" output when calling this function
on the last agent's step. I.e. this should be the egocentric map view of the agent
from the last step. This is used to compute the change in the agent's position/rotation.
This is ignored (and thus can be
None) if you do not wish to estimate the agent's pose (i.e. self.use_pose_estimation == False).
- return_allocentric_maps : Whether or not to generate new allocentric maps given
last_map_probs_allocentric and the new map estimates. Creating these new allocentric maps is expensive, so it is best avoided when not needed.
- resnet_image_features : Sometimes you may wish to compute the ResNet image features yourself for use
in another part of your model. Rather than having to recompute them multiple times, you can
instead compute them once and pass them into this forward call (in this case the input
images parameter is ignored). Note that if you're using the
self.resnet_l5 module to compute these features, be sure to also normalize them with
self.resnet_normalizer if you have opted to
use_resnet_layernorm when initializing this module.
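To make the egocentric/allocentric distinction for dx_dz_drs_egocentric concrete, here is a minimal sketch of how an egocentric change could be folded into an allocentric pose. The function `apply_egocentric_delta` is hypothetical and not part of this module. It assumes that a rotation of 0 degrees faces +z, that a rotation of 90 degrees faces +x, and that the translation is expressed in the agent's frame before the rotation change is applied; the module's actual conventions may differ.

```python
import math


def apply_egocentric_delta(last_xzr, dx_dz_dr):
    """Hypothetical helper: update an allocentric (x, z, rotation) pose.

    Assumptions (not confirmed by the documentation): rotation 0 faces +z,
    rotation 90 faces +x, and dx/dz are measured in the previous step's frame.
    """
    x, z, r = last_xzr
    dx, dz, dr = dx_dz_dr
    theta = math.radians(r)

    # Forward (dz) points along (sin, cos) in (x, z); right (dx) along (cos, -sin).
    new_x = x + dx * math.cos(theta) + dz * math.sin(theta)
    new_z = z - dx * math.sin(theta) + dz * math.cos(theta)
    new_r = (r + dr) % 360.0
    return new_x, new_z, new_r


# Facing +z (rotation 0), moving 1 m ahead increases z by 1.
print(apply_egocentric_delta((0.0, 0.0, 0.0), (0.0, 1.0, 0.0)))
# Facing +x (rotation 90), the same egocentric step increases x instead.
print(apply_egocentric_delta((0.0, 0.0, 90.0), (0.0, 1.0, 0.0)))
```

In both calls the egocentric input is identical (dz = +1.0); only the allocentric result depends on the agent's orientation.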
A dictionary with keys/values:
* "egocentric_update" - The egocentric map view for the given RGB image. This is what should
be used for computing losses in general.
* "map_logits_probs_update_no_grad" - The egocentric map view after it has been
rotated, translated, and moved into a full-sized allocentric map. This map has been
detached from the computation graph and so should not be used for gradient computations.
This will be
* "map_logits_probs_no_grad" - The newly updated allocentric map. This corresponds to
performing a pointwise maximum between
last_map_probs_allocentric and the
This will be
* "dx_dz_dr_egocentric_preds" - The predicted change in x, z, and rotation of the agent (from the
egocentric perspective of the agent).
* "xzr_allocentric_preds" - The (predicted if
self.use_pose_estimation == True) allocentric
(x, z) position and rotation of the agent. This will equal
self.use_pose_estimation == False
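The pointwise-maximum update described for "map_logits_probs_no_grad" can be illustrated with plain nested lists. This is only a sketch: the helper `pointwise_max_update` is hypothetical, it operates on a single (channels) x height x width map rather than a batched tensor, and it assumes the second operand is a same-shaped allocentric update map.

```python
def pointwise_max_update(last_map_probs, update_probs):
    """Hypothetical helper: elementwise maximum of two same-shaped maps.

    Both arguments are nested lists indexed as [channel][row][col].
    """
    return [
        [
            [max(old, new) for old, new in zip(old_row, new_row)]
            for old_row, new_row in zip(old_channel, new_channel)
        ]
        for old_channel, new_channel in zip(last_map_probs, update_probs)
    ]


last_map = [[[0.1, 0.8], [0.0, 0.3]]]  # 1 channel, 2 x 2 map
update = [[[0.5, 0.2], [0.0, 0.9]]]    # same shape, new evidence
print(pointwise_max_update(last_map, update))  # [[[0.5, 0.8], [0.0, 0.9]]]
```

Under this update rule a cell's probability never decreases, so evidence accumulated on earlier steps is retained.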