Allen Institute for AI's OlmoEarth Pretrain project
Earth system foundation model: data, training, and evaluation
launching training runs on beaker
Requirements: Python 3.11 or higher (Python 3.12 recommended)
- Install uv:
curl -LsSf https://astral.sh/uv/install.sh | sh(other ways to do it are documented here) - Navigate to root directory of this repo and run
uv sync --locked --all-groups --python 3.12 - Install the pre-commit tool
uv tool install pre-commit --with pre-commit-uv --force-reinstall - uv installs everything into a venv, so to keep using
pythoncommands you can activate uv's venv:source .venv/bin/activate. Otherwise, swap touv run python.
The dataset for training is stored in h5 datasets. A training dataset can be created from tiles via python3 -m olmoearth_pretrain.internal.run_h5_conversion script.
We have 2 versions of each dataset 1 with 256 x 256 tiles and 1 with 4x as many 128 by 128 tiles. The 128 by 128 tiles may be faster for data loading due to GB/s bottlenecks on weka.
OUT OF DATE!
-
Presto Dataset: ~120k samples with Landsat, OpenStreetMap raster, Sentinel-1, Sentinel-2 L2A, SRTM, and WorldCover modalities sampled via locations used in Galileo paper
- 256 Path:
/weka/dfive-default/helios/dataset/presto/rerun_1_h5py_data_w_missing_timesteps_zstd_3/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/117473/ - 128 Path:
/weka/dfive-default/helios/dataset/presto/h5py_data_w_missing_timesteps_128_x_4_zstd_3/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/469892
- 256 Path:
-
OSM Sampling Dataset: ~285k samples with Landsat, OpenStreetMap raster, Sentinel-1, Sentinel-2 L2A, SRTM, and WorldCover modalities sampled across OpenStreetmap classes
- 256 Path:
/weka/dfive-default/helios/dataset/osm_sampling/h5py_data_w_missing_timesteps_zstd_3/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/285288/ - 128 Path:
/weka/dfive-default/helios/dataset/osm_sampling/h5py_data_w_missing_timesteps_128_x_4_zstd_3/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/1141152
- 256 Path:
-
OSM Big Dataset: ~324k samples with Landsat, OpenStreetMap raster, Sentinel-1, Sentinel-2 L2A, SRTM, and WorldCover modalities sampled across a wider set of opens treetmap classes
- 256 Path:
/weka/dfive-default/helios/dataset/osmbig/h5py_data_w_missing_timesteps_zstd_3/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/324482/ - 128 Path:
/weka/dfive-default/helios/dataset/osmbig/h5py_data_w_missing_timesteps_zstd_3_128_x_4/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/1297928
- 256 Path:
-
Presto Neighbor Dataset: ~877k samples with Landsat, OpenStreetMap raster, Sentinel-1, Sentinel-2 L2A, SRTM, and WorldCover modalities presto + the neighboring tiles
- 256 Path:
/weka/dfive-default/helios/dataset/presto_neighbor/h5py_data_w_missing_timesteps_zstd_3/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/876937/ - 128 Path:
/weka/dfive-default/helios/dataset/presto_neighbor/h5py_data_w_missing_timesteps_zstd_3_128_x_4/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/3507748
- 256 Path:
-
WorldCover Sampling Dataset: ~1.6M samples with Landsat, OpenStreetMap raster, Sentinel-1, Sentinel-2 L2A, SRTM, and WorldCover modalities. WorldCover class based sampling and some additional random sampling over the rest of the world.
- 256 Path:
/weka/dfive-default/helios/dataset/worldcover_sampling/h5py_data_w_missing_timesteps_zstd_3/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/1592645/ - 128 Path:
/weka/dfive-default/helios/dataset/worldcover_sampling/h5py_data_w_missing_timesteps_zstd_3_128_x_4/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/6370580/
- 256 Path:
olmoearth_pretrain/internal/full_eval_sweep.py runs comprehensive evaluation sweeps across multiple downstream tasks for any OlmoEarth Pretrain checkpoint. It automatically sweeps over learning rates, pooling types, and normalization strategies.
Basic command to run evaluation sweep for a checkpoint:
python3 olmoearth_pretrain/internal/full_eval_sweep.py \
--cluster=ai2/saturn-cirrascale \
--checkpoint_path=/path/to/your/checkpoint/step450000 \
--module_path=scripts/your_training_script.py \
For just default hyperparameters (faster, single run):
python3 olmoearth_pretrain/internal/full_eval_sweep.py \
--cluster=ai2/saturn-cirrascale \
--checkpoint_path=/path/to/your/checkpoint/step450000 \
--module_path=scripts/your_training_script.py \
--defaults_onlyPass additional training arguments after the main arguments:
python3 olmoearth_pretrain/internal/full_eval_sweep.py \
--cluster=ai2/saturn-cirrascale \
--checkpoint_path=/path/to/checkpoint \
--module_path=scripts/your_script.py \
--model.decoder_config.depth=1 \
--trainer.callbacks.downstream_evaluator.tasks_to_run=\[mados,pastis_sentinel2,breizhcrops,sen1floods11,pastis_sentinel1_sentinel2\] \Use the --panopticon flag for Panopticon model evaluation:
python3 olmoearth_pretrain/internal/full_eval_sweep.py \
--cluster=ai2/saturn-cirrascale \
--panopticon \
--model_name=panopticonFor DINO v3 evaluation:
python3 olmoearth_pretrain/internal/full_eval_sweep.py \
--cluster=ai2/saturn-cirrascale \
--dino_v3 \
--model_name=dino_v3_large_sat \
--model.model_name=DinoV3Models.LARGE_SATELLITE \Use the --galileo flag for Galileo model evaluation:
python3 olmoearth_pretrain/internal/full_eval_sweep.py \
--cluster=ai2/saturn-cirrascale \
--galileo \
--model_name=galileo_vit_base
--model.patch_size=4Key Notes:
- The script automatically determines appropriate normalization strategies for each model type (see
olmoearth_pretrain/evals/datasets/normalize.py)- OlmoEarth Pretrain: Use pretrained normalizer or NORM_METHOD.NORM_NO_CLIP with dataset stats
- Galileo: Use galileo pretrained normalizer or NORM_METHOD.NORM_NO_CLIP with dataset stats
- Panopticon: Uses NORM_METHOD.STANDARDIZE with the dataset statistics
- DinoV3: Uses NORM_METHOD.NORM_YES_CLIP_MIN_MAX_INT to get to 0-1 and then applies either the web or sat normalization values
- Supports both full hyperparameter sweeps and default-only runs
- Use
--dry_runto preview commands without execution - For local testing, use
--cluster=local
See olmoearth_pretrain/internal/full_eval_sweep.py for complete argument list and implementation details.