In the field of autonomous driving and mobile robotics, there has been a significant shift in the methods used to create Bird's Eye View (BEV) representations. This shift is characterised by the use of transformers to learn to fuse measurements from disparate vision sensors, mainly lidar and cameras, into a 2D planar ground-based representation. However, these learning-based methods often rely heavily on extensive annotated data, which poses notable challenges, particularly in diverse or non-urban environments where large-scale datasets are scarce. In this work, we present BEVPose, a framework that integrates BEV representations from camera and lidar data, using sensor pose as a guiding supervisory signal. This markedly reduces the dependence on costly annotated data. By leveraging pose information, we align and fuse multi-modal sensory inputs, facilitating the learning of latent BEV embeddings that capture both geometric and semantic aspects of the environment. Our pretraining approach demonstrates promising performance on BEV map segmentation, outperforming fully-supervised state-of-the-art methods while requiring only a minimal amount of annotated data. This work not only addresses the challenge of data efficiency in BEV representation learning but also broadens the applicability of such techniques to a variety of domains, including off-road and indoor environments.
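To make the pose-supervised pretraining idea more concrete, the following is a minimal sketch of one plausible formulation: camera-derived and lidar-derived BEV feature maps are brought into a common frame using the known relative ego-pose and then aligned with a contrastive objective. The encoder interface, the helper names warp_bev and pose_alignment_loss, the affine-grid warp, and the InfoNCE-style loss are illustrative assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def warp_bev(features, relative_pose, cell_size):
    """Warp a BEV feature map (B, C, H, W) by a planar relative pose.

    relative_pose is a (B, 3) tensor of (dx, dy, yaw) in metres and radians;
    the warp is expressed as a 2D affine grid over the BEV plane.
    """
    B, C, H, W = features.shape
    cos, sin = torch.cos(relative_pose[:, 2]), torch.sin(relative_pose[:, 2])
    # Convert the translation from metres to normalised grid coordinates in [-1, 1].
    tx = relative_pose[:, 0] / (cell_size * W / 2)
    ty = relative_pose[:, 1] / (cell_size * H / 2)
    theta = torch.stack([torch.stack([cos, -sin, tx], dim=1),
                         torch.stack([sin,  cos, ty], dim=1)], dim=1)  # (B, 2, 3)
    grid = F.affine_grid(theta, features.shape, align_corners=False)
    return F.grid_sample(features, grid, align_corners=False)

def pose_alignment_loss(bev_cam, bev_lidar, relative_pose, cell_size=0.5, tau=0.07):
    """InfoNCE-style loss: each pose-warped camera BEV cell should match the
    lidar BEV cell at the same location; all other cells act as negatives."""
    warped = warp_bev(bev_cam, relative_pose, cell_size)
    q = F.normalize(warped.flatten(2), dim=1)       # (B, C, H*W)
    k = F.normalize(bev_lidar.flatten(2), dim=1)    # (B, C, H*W)
    logits = torch.einsum('bcn,bcm->bnm', q, k) / tau   # cell-to-cell similarities
    B, N, _ = logits.shape
    targets = torch.arange(N, device=logits.device).repeat(B)
    # In practice one would subsample cells, as the N x N similarity matrix grows quickly.
    return F.cross_entropy(logits.reshape(B * N, N), targets)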
Qualitative Results. This figure presents the predicted BEV map segmentations across scenes of varying complexity. The leftmost column illustrates the lidar point-clouds. The middle columns display the images from the six surrounding cameras, arranged from left to right as follows: front-left, front, front-right, back-left, back, and back-right. The rightmost column features the predicted BEV segmentations, with class labels including car park, walkway, lane divider, drivable area, stop line, and pedestrian crossing. The ego-vehicle is positioned at the centre of each map, facing upwards.
Robustness Evaluation for Sensor Modality Dropout. This figure assesses the robustness of our framework against sensor dropout, illustrating BEV segmentation predictions when individual modalities are withheld. The first column displays the lidar point-cloud, and the second column shows the camera image (only the front camera is shown; the other five camera views are omitted for brevity). Of the three prediction columns, the leftmost map is produced from the camera modality alone (lidar withheld), the middle map from lidar alone, and the rightmost map from fusing both modalities. The prediction maps confirm the framework's robustness to sensor dropout: the camera-only mode achieves broader coverage with less precise geometry, while the lidar-only mode provides enhanced geometric fidelity albeit over a more limited range. This robustness is crucial for the practical deployment of multi-modal fusion models in real-world autonomous driving scenarios.
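The dropout evaluation shown here amounts to a simple inference-time switch that withholds one modality before fusion. The sketch below assumes a hypothetical model(cameras, lidar) interface that tolerates None for a missing modality; it is not the framework's actual API.

import torch

@torch.no_grad()
def predict_with_dropout(model, cameras, lidar, mode):
    """Run BEV segmentation with one modality optionally withheld.

    mode: 'camera' (lidar withheld), 'lidar' (cameras withheld), or 'fused'.
    """
    cam_in = cameras if mode in ('camera', 'fused') else None
    lidar_in = lidar if mode in ('lidar', 'fused') else None
    return model(cam_in, lidar_in)

# Compare the three prediction maps for the same scene:
# maps = {m: predict_with_dropout(model, cameras, lidar, m)
#         for m in ('camera', 'lidar', 'fused')}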
BEV Embeddings. An illustration of the pose-supervised 256-dimensional latent BEV features projected into the three-dimensional RGB colour space using PCA. The visualisation clearly demonstrates that these latent BEV representations capture high-level geometric and semantic information, such as roads, intersections, walkways, and other significant landmarks observable in the camera frustums. The first row displays the lidar point-cloud, all six surrounding camera images, and the corresponding latent BEV embeddings. For brevity, the second row shows only the BEV features for additional scenes.
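The PCA-to-RGB projection used in this visualisation is straightforward to reproduce. The sketch below assumes the latent BEV feature map is available as a NumPy array of shape (256, H, W); the function name bev_features_to_rgb and the per-channel min-max rescaling are illustrative choices rather than the paper's exact procedure.

import numpy as np
from sklearn.decomposition import PCA

def bev_features_to_rgb(bev):
    """Project D-dimensional BEV features (D, H, W) to 3 channels for display."""
    d, h, w = bev.shape
    flat = bev.reshape(d, -1).T                      # (H*W, D): one sample per BEV cell
    rgb = PCA(n_components=3).fit_transform(flat)    # (H*W, 3): top-3 principal components
    lo, hi = rgb.min(axis=0), rgb.max(axis=0)
    rgb = (rgb - lo) / (hi - lo + 1e-8)              # rescale each channel to [0, 1]
    return rgb.reshape(h, w, 3)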
@inproceedings{hosseinzadeh2024bevpose,
title = {BEVPose: Unveiling Scene Semantics through Pose-Guided Multi-Modal BEV Alignment},
author = {Hosseinzadeh, Mehdi and Reid, Ian},
booktitle = {IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
year = {2024}
}