# MAESTRO: Masked Autoencoders for Multimodal, Multitemporal, and Multispectral Earth Observation Data
Official implementation for Masked Autoencoders for Multimodal, Multitemporal, and Multispectral Earth Observation Data
Abstract | Results | Datasets | Getting Started
We introduce MAESTRO, a tailored adaptation of the Masked Autoencoder (MAE) framework that effectively orchestrates the use of multimodal, multitemporal, and multispectral Earth Observation (EO) data. Evaluated on four EO datasets, MAESTRO sets a new state-of-the-art on tasks that strongly rely on multitemporal dynamics, while remaining highly competitive on tasks dominated by a single monotemporal modality.
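As background, the MAE recipe hides a large fraction of input patches and feeds only the visible ones to the encoder; the decoder then reconstructs the hidden patches. A minimal sketch of the random patch-masking step (illustrative only: the function name, signature, and 75% default ratio are ours, not MAESTRO's actual code):

```python
import numpy as np

def random_mask(num_patches, mask_ratio=0.75, rng=None):
    """Split patch indices into a visible set and a masked set,
    as in the standard MAE recipe (illustrative sketch only)."""
    rng = rng or np.random.default_rng()
    num_keep = int(num_patches * (1 - mask_ratio))
    perm = rng.permutation(num_patches)
    # First `num_keep` shuffled indices stay visible; the rest are masked.
    keep_idx, mask_idx = perm[:num_keep], perm[num_keep:]
    return np.sort(keep_idx), np.sort(mask_idx)
```

The encoder only ever sees the `keep_idx` patches, which is what makes MAE pre-training cheap relative to processing the full token grid.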
Our contributions are as follows:
- Extensive benchmarking of multimodal and multitemporal SSL: Impact evaluation of various fusion strategies for multimodal and multitemporal SSL.
- Patch-group-wise normalization: Novel normalization scheme that normalizes reconstruction targets patch-wise within groups of highly correlated spectral bands.
- MAESTRO: Novel adaptation of the MAE that combines optimized fusion strategies with our tailored patch-group-wise normalization.
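To make the patch-group-wise normalization idea concrete, here is a minimal NumPy sketch: each patch's reconstruction target is standardized using statistics pooled over a group of correlated spectral bands. The function name, array layout, and example grouping are our own illustrative assumptions, not the repository's API:

```python
import numpy as np

def patch_group_normalize(patches, band_groups, eps=1e-6):
    """Normalize reconstruction targets patch-wise within band groups.

    patches: array of shape (num_patches, num_bands, patch_h, patch_w)
    band_groups: lists of correlated band indices, e.g. [[0, 1, 2], [3]]
    (hypothetical helper, not MAESTRO's actual implementation)
    """
    out = np.empty_like(patches, dtype=np.float64)
    for group in band_groups:
        g = patches[:, group, :, :].astype(np.float64)
        # Pool mean/std per patch over all pixels of all bands in the group.
        mean = g.mean(axis=(1, 2, 3), keepdims=True)
        std = g.std(axis=(1, 2, 3), keepdims=True)
        out[:, group, :, :] = (g - mean) / (std + eps)
    return out
```

Pooling statistics across correlated bands (e.g. visible RGB) rather than per band keeps relative inter-band contrast in the target, while still removing per-patch offsets in scale and brightness.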
Table 1 — Intra-dataset comparison of MAESTRO, supervised ViTs, and previous SOTA.
We report wF1 (%) on TreeSatAI-TS and mIoU (%) on PASTIS-HD, FLAIR#2, and FLAIR-HUB.
MAESTRO† models are pre-trained for twice the number of epochs.
| Model | TreeSatAI-TS | PASTIS-HD | FLAIR#2 | FLAIR-HUB |
|---|---|---|---|---|
| MAESTRO (ours) | 78.8 | 68.6 | 62.6 | 65.9 🟢↑1.6 |
| MAESTRO† (ours) | 79.4 🟢↑3.8 | 69.0 🟢↑2.5 | 63.3 🔴↓0.8 | 65.8 |
| ViT | 75.6 | 64.5 | 58.2 | 62.1 |
| Previous SOTA | 75.1 | 66.5 | 64.1 | 64.3 |
Table 2 — Cross-dataset comparison of MAESTRO and adapted baseline FMs. The baseline FMs are adapted to handle an arbitrary number of dates and/or to retain all modality bands.
We report wF1 (%) on TreeSatAI-TS and mIoU (%) on PASTIS-HD, FLAIR#2, and FLAIR-HUB.
| Model | Pre-training dataset | TreeSatAI-TS | PASTIS-HD | FLAIR#2 | FLAIR-HUB |
|---|---|---|---|---|---|
| MAESTRO (ours) | FLAIR-HUB | 79.6 | 68.0 | - | - |
| MAESTRO (ours) | S2-NAIP urban | 78.8 | 67.4 | 62.6 | 64.6 |
| DINO-v2 | LVD-142M | 76.7 | 64.4 | 64.2 | 66.0 |
| DINO-v2 sat. | Maxar Vivid2 | 76.3 | 64.0 | 63.5 | 66.0 |
| DOFA | DOFA MM | 76.0 | 62.9 | 62.3 | 65.1 |
| CROMA | SSL4EO | 70.5 | 65.0 | 39.0 | 44.3 |
| Prithvi-EO-2.0 | HLS | 75.6 | 66.2 | 41.8 | 44.9 |
| SatMAE | fMoW RGB+S | 76.9 | 66.6 | 42.5 | 45.0 |
Our implementation currently supports five datasets.
### TreeSatAI-TS
Tree species identification in Germany, with 15 multi-label classes.
- Extent: 50,381 tiles of 60 × 60 m, covering 181 km².
- Modalities: aerial imagery RGB + NIR (0.2 m), Sentinel-1 time series, Sentinel-2 time series.
### PASTIS-HD
Agricultural crop segmentation in France, with 19 semantic classes.
- Extent: 433 tiles of 1280 × 1280 m, covering 709 km².
- Modalities: VHR satellite imagery SPOT 6-7 (1 m), Sentinel-1 time series, Sentinel-2 time series.
### FLAIR#2
Land cover segmentation in France, with 12 semantic classes. Note that the FLAIR#2 version used here is not the original release, but a regenerated version obtained by refiltering FLAIR-HUB.
- Extent: 77,762 tiles of 102.4 × 102.4 m, covering 815 km².
- Modalities: Aerial and elevation imagery RGB + NIR + DEM + DSM (0.2 m), Sentinel-2 time series.
### FLAIR-HUB
Land cover segmentation in France, with 15 semantic classes.
- Extent: 241,100 tiles of 102.4 × 102.4 m, covering 2,528 km².
- Modalities: Aerial and elevation imagery RGB + NIR + DEM + DSM (0.2 m), Sentinel-1 time series, Sentinel-2 time series.
### S2-NAIP urban
Super-resolution in urban areas of the United States. To construct this urban subset, the original S2-NAIP footprints are intersected with the urban set defined in Zooming-in zooming-out, which is limited to locations within a 5 km radius of U.S. cities with populations of at least 50,000.
- Extent: 167,397 tiles of size 640 m × 640 m, covering 68,565 km².
- Modalities: Aerial NAIP imagery RGB + NIR (1.25 m), Sentinel-1 time series, Sentinel-2 time series.
First, set up the module with Poetry.
```shell
# 1. Change directory
cd MAESTRO

# 2. Install dependencies with Poetry
poetry install
```

Then, you can start from the following minimal examples.
Intra-dataset MAESTRO on TreeSatAI-TS:
```shell
# pre-train, probe and finetune on TreeSatAI-TS
poetry run python main.py \
    model.model=mae model.model_size=medium \
    opt_pretrain.epochs=100 opt_probe.epochs=10 opt_finetune.epochs=50 \
    datasets.name_dataset=treesatai_ts \
    datasets.root_dir=/path/to/dataset/dir datasets.treesatai_ts.rel_dir=TreeSatAI-TS \
    run.exp_dir=/path/to/experiments/dir run.exp_name=mae-m_treesat
```

Intra-dataset MAESTRO on PASTIS-HD:
```shell
# pre-train, probe and finetune on PASTIS-HD
poetry run python main.py \
    model.model=mae model.model_size=medium \
    opt_pretrain.epochs=100 opt_probe.epochs=10 opt_finetune.epochs=50 \
    datasets.name_dataset=pastis_hd \
    datasets.root_dir=/path/to/dataset/dir datasets.pastis_hd.rel_dir=PASTIS-HD \
    run.exp_dir=/path/to/experiments/dir run.exp_name=mae-m_pastis
```

Intra-dataset MAESTRO on FLAIR-HUB:
```shell
# pre-train, probe and finetune on FLAIR-HUB
poetry run python main.py \
    model.model=mae model.model_size=medium \
    opt_pretrain.epochs=100 opt_probe.epochs=15 opt_finetune.epochs=100 \
    datasets.name_dataset=flair \
    datasets.root_dir=/path/to/dataset/dir datasets.flair.rel_dir=FLAIR-HUB \
    run.exp_dir=/path/to/experiments/dir run.exp_name=mae-m_flair
```

Cross-dataset MAESTRO from S2-NAIP urban to TreeSatAI-TS:
```shell
# pre-train on S2-NAIP urban
poetry run python main.py \
    model.model=mae model.model_size=medium \
    opt_pretrain.epochs=15 opt_probe.epochs=0 opt_finetune.epochs=0 \
    datasets.name_dataset=s2_naip \
    datasets.root_dir=/path/to/dataset/dir datasets.s2_naip.rel_dir=s2-naip-urban \
    run.exp_dir=/path/to/experiments/dir run.exp_name=mae-m_s2-naip && \
# probe and finetune on TreeSatAI-TS
poetry run python main.py \
    model.model=mae model.model_size=medium \
    opt_pretrain.epochs=0 opt_probe.epochs=10 opt_finetune.epochs=50 \
    datasets.name_dataset=treesatai_ts \
    datasets.treesatai_ts.aerial.image_size=240 datasets.treesatai_ts.aerial.patch_size.mae=16 \
    datasets.treesatai_ts.s1_asc.name_embed=s1 datasets.treesatai_ts.s1_des.name_embed=s1 \
    datasets.root_dir=/path/to/dataset/dir datasets.treesatai_ts.rel_dir=TreeSatAI-TS \
    run.exp_dir=/path/to/experiments/dir run.load_name=mae-m_s2-naip run.exp_name=mae-m_s2-naip-x-treesat
```
Most hyperparameters can be adapted through the hydra-zen CLI.
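For instance, any of the overrides shown in the commands above can be changed directly on the command line. A hypothetical quick run on PASTIS-HD with shortened schedules (the epoch counts and experiment name here are illustrative values, not recommended settings):

```shell
# Quick sanity-check run: same flags as the PASTIS-HD example, shorter schedules.
poetry run python main.py \
    model.model=mae model.model_size=medium \
    opt_pretrain.epochs=10 opt_probe.epochs=5 opt_finetune.epochs=10 \
    datasets.name_dataset=pastis_hd \
    datasets.root_dir=/path/to/dataset/dir datasets.pastis_hd.rel_dir=PASTIS-HD \
    run.exp_dir=/path/to/experiments/dir run.exp_name=mae-m_pastis-quick
```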
On 2025-08-19 the repository history was rewritten to remove large files. If you cloned the repository before that date, please re-clone it to avoid conflicts.
For questions or contributions, please open an issue or contact the authors.
If you use this code, please cite:
```bibtex
@article{labatie2025maestro,
  title={MAESTRO: Masked AutoEncoders for Multimodal, Multitemporal, and Multispectral Earth Observation Data},
  author={Labatie, Antoine and Vaccaro, Michael and Lardiere, Nina and Garioud, Anatol and Gonthier, Nicolas},
  journal={arXiv preprint arXiv:2508.10894},
  year={2025}
}
```

The experiments in the paper were conducted using HPC/AI resources from GENCI-IDRIS (allocations A0181013803, A0161013803, and AD010114597R1).
