Welcome to the official Google DeepMind repository for 4D Representations.
- Scaling 4D Representations evaluates self-supervised learning on non-semantic vision tasks that are more spatial (3D) and temporal (+1D = 4D), such as camera pose estimation, point and object tracking, and depth estimation. We show that, when trained on very large video datasets, masked auto-encoding (MAE) with transformer video models scales: performance on these 4D tasks improves consistently as model size grows from 20M parameters to 22B, by far the largest self-supervised video model reported to date (see the masking sketch after this list).
- Moving Off-the-Grid (MooG) introduces a self-supervised video representation that allows latent tokens to move freely across space and time, staying aligned with dynamic scene elements rather than fixed pixel grids. By combining cross-attention with positional embeddings, MooG disentangles representation structure from image structure, enabling tokens to bind to meaningful objects and regions. Trained with a simple next-frame prediction objective, MooG naturally learns object-centric tracking representations and achieves strong performance across downstream tasks with lightweight readouts.
- Recurrent Video Masked Autoencoders (RVM) proposes a recurrent, transformer-based approach to video representation learning that models temporal structure using an asymmetric masking objective and a simple pixel reconstruction loss. RVM learns an efficient general-purpose encoder that matches or exceeds state-of-the-art video models on action recognition, tracking, and dense geometric tasks, while remaining competitive with strong image models. It is particularly effective in the small-model regime, achieving up to 30× greater parameter efficiency without distillation.
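All three models are trained with reconstruction-style objectives. As a rough illustration of the masked pixel-reconstruction idea behind 4DS and RVM, here is a minimal NumPy sketch of random patch masking on a video clip with the loss computed only on masked patches. The actual models use tube-masking schedules, transformer encoders/decoders, and per-patch normalization; every name and shape below is illustrative, not the repository's API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy video clip: (frames, height, width, channels). Real models train on
# larger clips (e.g. 16x224x224x3); small sizes keep the sketch readable.
video = rng.standard_normal((4, 32, 32, 3)).astype(np.float32)

def patchify(clip, patch=8):
    """Split each frame into non-overlapping (patch x patch) pixel patches."""
    t, h, w, c = clip.shape
    clip = clip.reshape(t, h // patch, patch, w // patch, patch, c)
    return clip.transpose(0, 1, 3, 2, 4, 5).reshape(
        t * (h // patch) * (w // patch), -1)

patches = patchify(video)                     # (num_patches, patch_dim)
num_masked = int(0.9 * len(patches))          # MAE-style high mask ratio
masked_idx = rng.choice(len(patches), size=num_masked, replace=False)

# Stand-in for the encoder + decoder, which would predict the masked patches.
predictions = np.zeros_like(patches[masked_idx])

# Pixel reconstruction loss, computed only on the masked patches.
loss = np.mean((predictions - patches[masked_idx]) ** 2)
print(f"masked patches: {num_masked}, MSE on masked patches: {loss:.3f}")
```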
```bash
git clone https://github.com/google-deepmind/representations4d.git
cd representations4d
python3 -m venv representations4d_env
source representations4d_env/bin/activate
pip install .
```
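To confirm the install succeeded, the package should import cleanly. This assumes the installed module is named `representations4d`, matching the repository name; adjust if the package exposes a different top-level name.

```python
# Sanity check: assumes the top-level module is `representations4d`.
import representations4d
print(representations4d.__file__)
```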
- Segmentation tracking and keypoint tracking with the RVM backbone
- Segmentation tracking and keypoint tracking evaluation comparing RVM against popular video models
We release the following checkpoints:
| Name | Model | # Params | File Size | Checkpoint |
|---|---|---|---|---|
| 4DS-B-dist-e | Backbone (ViT-B) | 88M | 334MB | link |
| 4DS-e | Backbone (ViT-e) | 3.8B | 14GB | link |
| 4DS-B-dist-e ScanNet depth | Backbone (ViT-B) + Readout | 105M | 420MB | link |
| MooG | Backbone (ConvNet + Transformer) | 35M | 140MB | link |
| MooG | Box Track Readout (Cross Attention) | 35M | 140MB | link |
| MooG | Point Track Readout (Cross Attention) | 35M | 140MB | link |
| RVM | Backbone (ViT-L) | 375M | 1.6GB | link |
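The readout checkpoints above are lightweight heads that query a frozen backbone's tokens via cross-attention. The NumPy sketch below shows the basic mechanism, a set of learned queries attending over backbone output tokens; the names, shapes, and single-head form are illustrative assumptions, not the repository's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, num_tokens, num_queries = 64, 256, 8

tokens = rng.standard_normal((num_tokens, dim))       # frozen backbone outputs
queries = rng.standard_normal((num_queries, dim))     # learned readout queries
w_k = rng.standard_normal((dim, dim)) / np.sqrt(dim)  # key projection
w_v = rng.standard_normal((dim, dim)) / np.sqrt(dim)  # value projection

def cross_attention_readout(q, kv):
    """Single-head cross-attention: queries attend over backbone tokens."""
    keys, values = kv @ w_k, kv @ w_v
    logits = q @ keys.T / np.sqrt(q.shape[-1])
    attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)  # softmax over tokens
    return attn @ values                      # (num_queries, dim)

readout = cross_attention_readout(queries, tokens)
print(readout.shape)  # (8, 64): one feature vector per query,
                      # e.g. one per tracked box or point.
```

Because the backbone stays frozen, only the small query/projection parameters are trained per task, which is what keeps these readouts lightweight.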
```bibtex
@article{carreira2024scaling,
  title={Scaling 4D Representations},
  author={João Carreira and Dilara Gokay and Michael King and Chuhan Zhang and Ignacio Rocco and Aravindh Mahendran and Thomas Albert Keck and Joseph Heyward and Skanda Koppula and Etienne Pot and Goker Erdogan and Yana Hasson and Yi Yang and Klaus Greff and Guillaume Le Moing and Sjoerd van Steenkiste and Daniel Zoran and Drew A. Hudson and Pedro Vélez and Luisa Polanía and Luke Friedman and Chris Duvarney and Ross Goroshin and Kelsey Allen and Jacob Walker and Rishabh Kabra and Eric Aboussouan and Jennifer Sun and Thomas Kipf and Carl Doersch and Viorica Pătrăucean and Dima Damen and Pauline Luc and Mehdi S. M. Sajjadi and Andrew Zisserman},
  journal={arXiv preprint arXiv:2412.15212},
  year={2024}
}

@article{van2024moving,
  title={Moving Off-the-Grid: Scene-Grounded Video Representations},
  author={Sjoerd van Steenkiste and Daniel Zoran and Yi Yang and Yulia Rubanova and Rishabh Kabra and Carl Doersch and Dilara Gokay and Joseph Heyward and Etienne Pot and Klaus Greff and Drew Hudson and Thomas Albert Keck and João Carreira and Alexey Dosovitskiy and Mehdi S. M. Sajjadi and Thomas Kipf},
  journal={Advances in Neural Information Processing Systems},
  volume={37},
  pages={124319--124346},
  year={2024}
}

@article{zoran2025recurrent,
  title={Recurrent Video Masked Autoencoders},
  author={Daniel Zoran and Nikhil Parthasarathy and Yi Yang and Drew A Hudson and João Carreira and Andrew Zisserman},
  journal={arXiv preprint arXiv:},
  year={2025}
}
```
Copyright 2025 Google LLC
All software is licensed under the Apache License, Version 2.0 (Apache 2.0); you may not use this file except in compliance with the Apache 2.0 license. You may obtain a copy of the Apache 2.0 license at: https://www.apache.org/licenses/LICENSE-2.0
All other materials are licensed under the Creative Commons Attribution 4.0 International License (CC-BY). You may obtain a copy of the CC-BY license at: https://creativecommons.org/licenses/by/4.0/legalcode
Unless required by applicable law or agreed to in writing, all software and materials distributed here under the Apache 2.0 or CC-BY licenses are distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the licenses for the specific language governing permissions and limitations under those licenses.
This is not an official Google product.