BEV-VAE: A Unified BEV Representation for Generalizable Driving Scene Synthesis

Abstract

TL;DR: We introduce BEV-VAE, a variational autoencoder that unifies multi-view images into a BEV representation for generalizable autonomous driving scene synthesis.

Generative modeling has shown remarkable success in vision and language, inspiring research on synthesizing driving scenes. Existing multi-view synthesis approaches typically operate in image latent spaces with cross-attention to enforce spatial consistency, but they are tightly bound to camera configurations, which limits model generalization. We propose BEV-VAE, a variational autoencoder that learns a unified Bird’s-Eye-View (BEV) representation from multi-view images, enabling encoding from arbitrary camera layouts and decoding to any desired viewpoint. Through multi-view image reconstruction and novel view synthesis, we show that BEV-VAE effectively fuses multi-view information and accurately models spatial structure. This capability allows it to generalize across camera configurations and facilitates scalable training on diverse datasets. Within the latent space of BEV-VAE, a Diffusion Transformer (DiT) generates BEV representations conditioned on 3D object layouts, enabling multi-view image synthesis with enhanced spatial consistency on nuScenes and achieving the first complete seven-view synthesis on AV2. Compared with training generative models in image latent spaces, BEV-VAE achieves superior computational efficiency. Finally, synthesized imagery significantly improves the perception performance of BEVFormer, highlighting the utility of generalizable scene synthesis for autonomous driving.

Method

Overall architecture of BEV-VAE with DiT for multi-view image generation.

In Stage 1, BEV-VAE learns to encode multi-view images into a spatially compact latent space in BEV and reconstruct them, ensuring spatial consistency. In Stage 2, DiT is trained with Classifier-Free Guidance (CFG) in this latent space to generate BEV representations from random noise, which are then decoded into multi-view images.
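
The guidance step used at sampling time in Stage 2 can be illustrated with the standard classifier-free guidance combination. This is a minimal sketch rather than the repository's code: the function name `cfg_combine`, the guidance scale, and the toy predictions are hypothetical, and a real BEV latent would be a tensor rather than a flat list.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Blend unconditional and conditional noise predictions (standard CFG).

    guided = uncond + scale * (cond - uncond)
    scale = 0 gives the unconditional prediction, scale = 1 the conditional
    one, and scale > 1 extrapolates further toward the condition.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy flattened predictions (hypothetical values, for illustration only).
eps_uncond = [0.0, 1.0, -2.0]
eps_cond = [1.0, 1.0, 0.0]
guided = cfg_combine(eps_uncond, eps_cond, guidance_scale=2.0)
print(guided)
```

During training, CFG typically drops the condition (here, the 3D object layout) for a random fraction of samples so that one network can produce both predictions; that fraction is not stated in this README.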

Getting Started

Environment Setup

First, create and activate a conda environment:

conda create -n bevvae python=3.10
conda activate bevvae

Clone the repository:

git clone https://github.com/Czm369/bev-vae.git

Dependencies

The code is tested with Python 3.10.x and CUDA 12.x.

Install PyTorch according to your CUDA version:

CUDA 12.1

pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121

CUDA 12.8

pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu128

Install BEV-VAE

cd ${ROOT}
pip install -r requirements.txt
pip install -e .

Data Preparation

nuScenes

Download the nuScenes dataset from the website and place it under ./data/.
Download the BEV-VAE preprocessed data from the website. This repository provides additional data required by BEV-VAE, including BEV latent representations (nusc_bev-lidar_train.tar.gz and nusc_bev-lidar_val.tar.gz) encoded from nuScenes. These BEV latents can be decoded into multi-view images using the nuScenes camera configuration or another compatible one, such as AV2. The provided BEV latents can be used directly to train DiT, which significantly reduces training cost by skipping the BEV-VAE encoding stage.
After preparation, you should have the following files:

bev-vae/data/nusc
├── maps
├── samples
├── sweeps
├── v1.0-trainval
└── nusc
    ├── scene2frame.json
    ├── scene2sensor2extrinsic.json
    ├── scene2sensor2intrinsic.json
    ├── scene2sensor2stamp2token.json
    ├── scene2stamp2annotation.json
    ├── sensor_cache.feather
    ├── synchronization_cache.feather
    ├── token2ego.json
    └── token2file.json
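
A quick way to confirm the layout before running anything is to check each expected entry. The sketch below is not part of the repository: `find_missing` and `NUSC_EXPECTED` are hypothetical names, the entries are copied from the tree above, and the root path assumes the script runs from the repository root.

```python
from pathlib import Path

# Entries copied from the directory tree above, relative to data/nusc.
NUSC_EXPECTED = [
    "maps",
    "samples",
    "sweeps",
    "v1.0-trainval",
    "nusc/scene2frame.json",
    "nusc/scene2sensor2extrinsic.json",
    "nusc/scene2sensor2intrinsic.json",
    "nusc/scene2sensor2stamp2token.json",
    "nusc/scene2stamp2annotation.json",
    "nusc/sensor_cache.feather",
    "nusc/synchronization_cache.feather",
    "nusc/token2ego.json",
    "nusc/token2file.json",
]


def find_missing(root, expected):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    missing = find_missing("data/nusc", NUSC_EXPECTED)
    if missing:
        print("Missing entries:")
        for rel in missing:
            print("  " + rel)
    else:
        print("nuScenes layout looks complete.")
```

The same function works for any other layout if it is given a different root and list of entries.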

AV2

Download the AV2 dataset from the website and place it under ./data/.
Download the BEV-VAE preprocessed data from the website. This repository provides additional data required by BEV-VAE, including BEV latent representations (av2_bev-lidar_train.tar.gz and av2_bev-lidar_val.tar.gz) encoded from AV2. These BEV latents can be decoded into multi-view images using the AV2 camera configuration or another compatible one, such as nuScenes. The provided BEV latents can be used directly to train DiT, which significantly reduces training cost by skipping the BEV-VAE encoding stage.
After preparation, you should have the following files:

bev-vae/data/av2/sensor/
├── train
├── val
└── av2
    ├── log2sensor2extrinsic.json
    ├── log2sensor2intrinsic.json
    ├── log2stamp2annotation.json
    ├── log2stamp2ego.json
    ├── sensor_cache.feather
    └── synchronization_cache.feather

Model Preparation

Download the pre-trained BEV-VAE bev-vae_329089c03f0d.ckpt from the website and place it under ./ckpt/stage1.
Download the pre-trained Inception pt_inception-2015-12-05-6726825d.pth for FID evaluation from the website and place it under ./ckpt/fid.
Download the pre-trained LoFTR loftr_outdoor.ckpt for MVSC evaluation from the website and place it under ./ckpt/.

Model Inference

Before running inference, update the dataset and checkpoint root paths in eval_single.sh to match your local environment:

export NUSCENES_DATA_DIR="/root/bev-vae/data/nusc/"
export ARGOVERSE_DATA_DIR="/root/bev-vae/data/av2/"
export CKPT_DIR="/root/bev-vae/ckpt/"
export OUTPUT_DIR="/root/bev-vae/logs/"

On a single RTX 5090 GPU, BEV-VAE can perform multi-view image reconstruction with a batch size of at least 4.

nuScenes

bash eval_single.sh test-bev-vae_nusc-val_1x1x4x1_4e-5_1504

AV2

bash eval_single.sh test-bev-vae_av2-val_1x1x4x1_8e-5_5880

Experiments

Datasets

This study uses four multi-camera autonomous driving datasets that differ substantially in scale, camera configuration, annotated categories, and recording locations. Despite these differences, all datasets provide full 360° coverage of the surrounding scene.

Dataset	#Frames	#Cameras	#Classes	Recording Locations
WS101	17k	5	0	London, San Francisco Bay Area
nuScenes	155k	6	23	Boston, Pittsburgh, Las Vegas, Singapore
AV2	224k	7	30	Austin, Detroit, Miami, Pittsburgh, Palo Alto, Washington DC
nuPlan	3.11M	8	7	Boston, Pittsburgh, Las Vegas, Singapore

We introduce a new hybrid autonomous driving dataset configuration, PAS, which combines nuPlan, AV2, and nuScenes.

Multi-view Image Reconstruction

BEV-VAE learns unified BEV representations by reconstructing multi-view images, integrating semantics from all camera views while modeling 3D spatial structure. Reconstruction metrics provide an indirect evaluation of the quality of the learned BEV representations. For reference, we compare with SD-VAE, a foundational model trained on LAION-5B, which encodes a single $256\times256$ image into a $32\times32\times4$ latent. In contrast, BEV-VAE encodes multiple $256\times256$ views into a $32\times32\times16$ BEV latent, facing the more challenging task of modeling underlying 3D structure.
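
To make the difference in compression concrete, the shapes quoted above can be turned into element-count ratios. This is back-of-the-envelope arithmetic under stated assumptions: inputs are taken to be 3-channel RGB images, the view counts (6 for nuScenes, 7 for AV2) come from the Datasets table, and `compression_ratio` is a hypothetical helper, not a function from the repository.

```python
def compression_ratio(num_views, latent_shape, image_hw=(256, 256), channels=3):
    """Ratio of input values to latent values (element counts, not bits)."""
    input_values = num_views * image_hw[0] * image_hw[1] * channels
    latent_values = latent_shape[0] * latent_shape[1] * latent_shape[2]
    return input_values / latent_values


# SD-VAE: one 256x256 RGB image -> 32x32x4 latent.
sd_vae = compression_ratio(1, (32, 32, 4))
# BEV-VAE: all views of one frame -> a single 32x32x16 BEV latent.
bev_vae_nusc = compression_ratio(6, (32, 32, 16))  # 6 cameras (nuScenes)
bev_vae_av2 = compression_ratio(7, (32, 32, 16))   # 7 cameras (AV2)

print(sd_vae, bev_vae_nusc, bev_vae_av2)  # 48.0 72.0 84.0
```

Under these assumptions BEV-VAE compresses each frame more aggressively than SD-VAE compresses a single image, which is worth keeping in mind when reading the per-view fidelity numbers in the tables below.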

Reconstruction metrics on nuScenes compared with SD-VAE.

Model	Training	Validation	PSNR $\uparrow$	SSIM $\uparrow$	MVSC $\uparrow$	rFID $\downarrow$
SD-VAE	LAION-5B	nuScenes	29.63	0.8283	0.9292	2.18
BEV-VAE	nuScenes	nuScenes	26.13	0.7231	0.9250	6.66
BEV-VAE	PAS	nuScenes	28.88	0.8028	0.9756	4.74

Reconstruction metrics on AV2 compared with SD-VAE.

Model	Training	Validation	PSNR $\uparrow$	SSIM $\uparrow$	MVSC $\uparrow$	rFID $\downarrow$
SD-VAE	LAION-5B	AV2	27.81	0.8229	0.8962	1.87
BEV-VAE	AV2	AV2	26.02	0.7651	0.9197	4.15
BEV-VAE	PAS	AV2	27.29	0.8028	0.9461	2.82

SD-VAE focuses on per-view image fidelity, whereas PAS-trained BEV-VAE achieves superior multi-view spatial consistency (MVSC).
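
For readers less familiar with the PSNR column, the textbook definition is sketched below. It assumes pixel values in [0, 1] and flattened images of equal length; the repository's evaluation code may differ in pixel range and in how it averages over views and frames, and `psnr` here is a hypothetical helper.

```python
import math


def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    if not reference or len(reference) != len(reconstruction):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# A uniform error of 0.1 gives MSE = 0.01, i.e. about 20 dB.
value = psnr([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1])
print(round(value, 2))
```

Because the scale is logarithmic, a gap of about 3 dB corresponds to roughly a factor of two in mean squared error.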

Multi-view image reconstruction on nuScenes

Click the image below to watch the ego view rotate 360° horizontally.

Multi-view image reconstruction on AV2

Click the image below to watch the ego view rotate 360° horizontally.

Multi-view image reconstruction on nuPlan

Click the image below to watch the ego view rotate 360° horizontally.

Novel View Synthesis

Novel view synthesis via camera pose modifications on nuScenes. Row 1 shows real images from the nuScenes validation set, and Rows 2-3 show reconstructions with all cameras rotated 30° left and right, where the cement truck and tower crane truck remain consistent across views without deformation.
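
The camera pose modification described above amounts to applying a yaw rotation to each camera's extrinsics before decoding. The sketch below shows one way to do this and is not taken from the repository. It assumes 4x4 camera-to-ego matrices and an ego frame with x forward, y left, and z up, so that a positive angle turns the cameras to the left. It also rotates each camera about the ego origin, which moves the camera position as well as its orientation; whether the authors rotate about the ego origin or in place is not stated here. All function names are hypothetical.

```python
import math


def yaw_rotation(degrees):
    """4x4 homogeneous rotation about the +z axis (counter-clockwise from above)."""
    theta = math.radians(degrees)
    c, s = math.cos(theta), math.sin(theta)
    return [
        [c, -s, 0.0, 0.0],
        [s, c, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ]


def matmul4(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def rotate_camera(cam_to_ego, degrees):
    """Rotate a camera pose about the ego z-axis by left-multiplying the yaw matrix."""
    return matmul4(yaw_rotation(degrees), cam_to_ego)


# Hypothetical camera 1 m ahead of the ego origin, with identity orientation.
front_cam = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
left_30 = rotate_camera(front_cam, 30.0)
right_30 = rotate_camera(front_cam, -30.0)
```

Applying the same rotation to every camera in the rig keeps the relative layout between views unchanged, which matches the "all cameras rotated" setting in the caption.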

Novel view synthesis across camera configurations. Row 1 presents real images from the nuPlan validation set. Row 2 and Row 3 show reconstructions using camera parameters from AV2 and nuScenes, respectively. The model captures dataset-specific vehicle priors: AV2 reconstructions include both the front and rear of the ego vehicle, while nuScenes reconstructions mainly show the rear (with the rightmost image corresponding to the rear-view camera for alignment).

Zero-shot BEV Representation Construction

Zero-shot BEV representation construction on WS101. Row 1 shows real images from the WS101 validation set. Rows 2 and 3 show zero-shot and fine-tuned reconstructions, respectively, with object shapes preserved in the zero-shot results and further sharpened after fine-tuning.

Model	Training	Validation	PSNR $\uparrow$	SSIM $\uparrow$	MVSC $\uparrow$	rFID $\downarrow$
SD-VAE	LAION-5B	WS101	23.38	0.7050	0.8580	4.59
BEV-VAE	PAS	WS101	16.6	0.3998	0.8309	56.7
BEV-VAE	PAS + WS101	WS101	23.46	0.6844	0.9505	13.78

Zero-shot and fine-tuned reconstruction metrics on WS101 compared with SD-VAE.

Autonomous Driving Scene Synthesis

Autonomous driving scene synthesis from AV2 to nuScenes.

BEV-VAE with DiT generates a BEV representation from 3D bounding boxes of AV2, which can then be decoded into multi-view images according to the camera configurations of nuScenes.

Multi-view image generation on AV2 with 3D object layout editing.

Click the image below to watch the ego view rotate 360° horizontally.

Multi-view image generation on nuScenes with 3D object layout editing.

Click the image below to watch the ego view rotate 360° horizontally.

Data Augmentation for Perception

BEV-VAE w/ DiT with the Historical Frame Replacement strategy (randomly replacing real frames with generated ones) improves BEVFormer’s perception performance by encouraging the model to learn that object locations are invariant to appearance.
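
The replacement idea can be sketched in a few lines. This is an illustration only, with several assumptions that this README does not confirm: frames are given as a time-ordered list, the last element is the current frame and is always kept real, each earlier frame is replaced independently with probability `replace_prob`, and the generated frame at each index shares the 3D layout of the real one. The function name and parameters are hypothetical.

```python
import random


def replace_history_frames(real_frames, generated_frames, replace_prob, rng=None):
    """Randomly swap historical frames for generated ones.

    The last frame is treated as the current frame and is never replaced
    (an assumption of this sketch, not a documented detail of the method).
    """
    if len(real_frames) != len(generated_frames):
        raise ValueError("real and generated sequences must be aligned")
    if rng is None:
        rng = random.Random()
    last = len(real_frames) - 1
    mixed = []
    for i, (real, generated) in enumerate(zip(real_frames, generated_frames)):
        use_generated = i < last and rng.random() < replace_prob
        mixed.append(generated if use_generated else real)
    return mixed


# Toy example with frame identifiers standing in for multi-view images.
real = ["real_t-2", "real_t-1", "real_t"]
generated = ["gen_t-2", "gen_t-1", "gen_t"]
mixed = replace_history_frames(real, generated, replace_prob=0.5, rng=random.Random(0))
```

Since the generated frames keep the object layout but change the appearance, the annotations of the real frames can still be used for supervision under this scheme.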

Perception Model	Generative Model	Augmentation Strategy	mAP$\uparrow$	NDS$\uparrow$
BEVFormer Tiny	-	-	25.2	35.4
BEVFormer Tiny	BEVGen	Training Set + 6k Synthetic Data	27.3	37.2
BEVFormer Tiny	BEV-VAE w/ DiT	Historical Frame Replacement	27.1	37.4
