Merged
144 changes: 43 additions & 101 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,122 +1,64 @@
# OlmoEarth Pretrain
<div align="center">
<img src="assets/OlmoEarth-logo.png" alt="OlmoEarth Logo" style="width: 600px; margin-left: auto; margin-right: auto; display: block;"/>
<br>
<br>
</div>
<p align="center">
<a href="https://huggingface.co/collections/allenai/olmoearth">
<img alt="Model Checkpoints" src="https://img.shields.io/badge/%F0%9F%A4%97%20HF-Models-yellow">
</a>
</p>

Allen Institute for AI's OlmoEarth Pretrain project
The OlmoEarth models are a flexible, multi-modal, spatio-temporal family of foundation models for Earth observation.

Earth system foundation model: data, training, and evaluation
The OlmoEarth models are part of the [OlmoEarth Platform](https://allenai.org/olmoearth), an end-to-end solution for scalable planetary intelligence that provides everything needed to go from raw data through R&D to fine-tuning and production deployment.

launching training runs on beaker
## General Setup
## Installation

**Requirements:** Python 3.11 or higher (Python 3.12 recommended)

1. Install uv: `curl -LsSf https://astral.sh/uv/install.sh | sh` (other ways to do it are documented [here](https://docs.astral.sh/uv/getting-started/installation/))
2. Navigate to the root directory of this repo and run `uv sync --locked --all-groups --python 3.12`
3. Install the pre-commit tool: `uv tool install pre-commit --with pre-commit-uv --force-reinstall`
4. uv installs everything into a venv, so to keep using `python` commands you can activate uv's venv: `source .venv/bin/activate`. Otherwise, swap to `uv run python`.
We recommend Python 3.12, and recommend using [uv](https://docs.astral.sh/uv/getting-started/installation/).
To install dependencies with uv, run:

```bash
git clone git@github.com:allenai/olmoearth_pretrain.git
cd olmoearth_pretrain
uv sync --locked --all-groups --python 3.12
# only necessary for development
uv tool install pre-commit --with pre-commit-uv --force-reinstall
```

## OlmoEarth Pretrain Dataset
uv installs everything into a venv, so to keep using python commands you can activate uv's venv: `source .venv/bin/activate`. Otherwise, swap to `uv run python`.

The training dataset is stored as h5 (HDF5) files. A training dataset can be created from tiles by running `python3 -m olmoearth_pretrain.internal.run_h5_conversion`.
OlmoEarth is built using [OLMo-core](https://github.com/allenai/OLMo-core.git). OLMo-core's published [Docker images](https://github.com/orgs/allenai/packages?repo_name=OLMo-core) contain all core and optional dependencies.

## Model Summary

We have two versions of each dataset: one with 256 x 256 tiles and one with 4x as many 128 x 128 tiles. The 128 x 128 tiles may be faster for data loading due to throughput (GB/s) bottlenecks on Weka.
The OlmoEarth models are trained on three satellite modalities (Sentinel-2, Sentinel-1, and Landsat) and six derived maps (including OpenStreetMap and WorldCover).
| Model Size | Weights | Encoder Params | Decoder Params |
| --- | --- | --- | --- |
| Nano | [link](https://huggingface.co/allenai/OlmoEarth-v1-Nano) | 1.4M | 800K |
| Tiny | [link](https://huggingface.co/allenai/OlmoEarth-v1-Tiny) | 6.2M | 1.9M |
| Base | [link](https://huggingface.co/allenai/OlmoEarth-v1-Base) | 89M | 30M |

OUT OF DATE!
- **Presto Dataset**: ~120k samples with Landsat, OpenStreetMap raster, Sentinel-1, Sentinel-2 L2A, SRTM, and WorldCover modalities, sampled at the locations used in the Galileo paper
  - 256 Path: `/weka/dfive-default/helios/dataset/presto/rerun_1_h5py_data_w_missing_timesteps_zstd_3/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/117473/`
  - 128 Path: `/weka/dfive-default/helios/dataset/presto/h5py_data_w_missing_timesteps_128_x_4_zstd_3/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/469892`
## Data Summary

- **OSM Sampling Dataset**: ~285k samples with Landsat, OpenStreetMap raster, Sentinel-1, Sentinel-2 L2A, SRTM, and WorldCover modalities, sampled across OpenStreetMap classes
  - 256 Path: `/weka/dfive-default/helios/dataset/osm_sampling/h5py_data_w_missing_timesteps_zstd_3/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/285288/`
  - 128 Path: `/weka/dfive-default/helios/dataset/osm_sampling/h5py_data_w_missing_timesteps_128_x_4_zstd_3/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/1141152`
- **OSM Big Dataset**: ~324k samples with Landsat, OpenStreetMap raster, Sentinel-1, Sentinel-2 L2A, SRTM, and WorldCover modalities, sampled across a wider set of OpenStreetMap classes
  - 256 Path: `/weka/dfive-default/helios/dataset/osmbig/h5py_data_w_missing_timesteps_zstd_3/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/324482/`
  - 128 Path: `/weka/dfive-default/helios/dataset/osmbig/h5py_data_w_missing_timesteps_zstd_3_128_x_4/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/1297928`
- **Presto Neighbor Dataset**: ~877k samples with Landsat, OpenStreetMap raster, Sentinel-1, Sentinel-2 L2A, SRTM, and WorldCover modalities, covering the Presto tiles plus their neighboring tiles
  - 256 Path: `/weka/dfive-default/helios/dataset/presto_neighbor/h5py_data_w_missing_timesteps_zstd_3/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/876937/`
  - 128 Path: `/weka/dfive-default/helios/dataset/presto_neighbor/h5py_data_w_missing_timesteps_zstd_3_128_x_4/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/3507748`
- **WorldCover Sampling Dataset**: ~1.6M samples with Landsat, OpenStreetMap raster, Sentinel-1, Sentinel-2 L2A, SRTM, and WorldCover modalities, built from WorldCover class-based sampling plus some additional random sampling over the rest of the world
  - 256 Path: `/weka/dfive-default/helios/dataset/worldcover_sampling/h5py_data_w_missing_timesteps_zstd_3/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/1592645/`
  - 128 Path: `/weka/dfive-default/helios/dataset/worldcover_sampling/h5py_data_w_missing_timesteps_zstd_3_128_x_4/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/6370580/`
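
As a quick sanity check on the paths above, the trailing directory number in each 128 path is exactly four times the number in the matching 256 path, which is what splitting every 256 x 256 tile into four 128 x 128 tiles would produce. The sketch below assumes those trailing numbers are sample counts (the README does not say so explicitly); the `COUNTS` dictionary and function names are illustrative only.

```python
# Trailing directory numbers copied from the 256 and 128 paths listed above.
# Assumption: these numbers are sample counts.
COUNTS = {
    "presto": (117473, 469892),
    "osm_sampling": (285288, 1141152),
    "osmbig": (324482, 1297928),
    "presto_neighbor": (876937, 3507748),
    "worldcover_sampling": (1592645, 6370580),
}

TILE_256 = 256
TILE_128 = 128


def subtiles_per_tile(big: int, small: int) -> int:
    """Number of small x small tiles that exactly cover one big x big tile."""
    if big % small != 0:
        raise ValueError("small tile size must divide the big tile size")
    return (big // small) ** 2


def check_counts(counts: dict) -> dict:
    """Map each dataset name to True if its 128 count is factor * its 256 count."""
    factor = subtiles_per_tile(TILE_256, TILE_128)
    return {name: n128 == factor * n256 for name, (n256, n128) in counts.items()}


if __name__ == "__main__":
    for name, ok in check_counts(COUNTS).items():
        print(f"{name}: {'consistent' if ok else 'mismatch'}")
```

All five datasets pass this check, so the 128 versions appear to hold the same pixels as the 256 versions, only cut into smaller tiles.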
Our pretraining dataset contains around 300,000 samples from around the world, each covering a 2.56 km × 2.56 km region, although many samples contain only a subset of the timesteps and modalities.

The distribution of the samples is available below:

## Running Eval Suite
<img src="assets/datamap.png" alt="Training sample distribution" style="width: 500px; margin-left: auto; margin-right: auto; display: block;"/>

[`olmoearth_pretrain/internal/full_eval_sweep.py`](olmoearth_pretrain/internal/full_eval_sweep.py) runs comprehensive evaluation sweeps across multiple downstream tasks for any OlmoEarth Pretrain checkpoint. It automatically sweeps over learning rates, pooling types, and normalization strategies.
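
To make the idea of a sweep concrete, here is a minimal sketch of how a grid over learning rates, pooling types, and normalization strategies can be enumerated. It is not the code in `full_eval_sweep.py`: every value and name below is a hypothetical placeholder, and the real script may pick its combinations differently.

```python
from itertools import product

# Hypothetical values chosen only for illustration; the actual sweep
# values live in full_eval_sweep.py and may differ.
LEARNING_RATES = [1e-4, 1e-3, 1e-2]
POOLING_TYPES = ["mean", "max"]
NORM_STRATEGIES = ["dataset_stats", "pretrained"]


def build_sweep(lrs, poolings, norms):
    """Return one config dict per combination of the three swept settings."""
    return [
        {"lr": lr, "pooling": pooling, "norm": norm}
        for lr, pooling, norm in product(lrs, poolings, norms)
    ]


if __name__ == "__main__":
    sweep = build_sweep(LEARNING_RATES, POOLING_TYPES, NORM_STRATEGIES)
    print(f"{len(sweep)} runs")
    for config in sweep:
        print(config)
```

A full Cartesian grid grows multiplicatively (here 3 x 2 x 2 = 12 runs), which is why a single run with default hyperparameters is much faster than a full sweep.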
The dataset can be downloaded [here](https://huggingface.co/datasets/allenai/olmoearth_pretrain_dataset).

### 1. How to run eval for a given checkpoint
## Training scripts

Basic command to run evaluation sweep for a checkpoint:
Detailed instructions on how to pretrain your own OlmoEarth model are available in [Pretraining.md](docs/Pretraining.md).

```bash
python3 olmoearth_pretrain/internal/full_eval_sweep.py \
--cluster=ai2/saturn-cirrascale \
--checkpoint_path=/path/to/your/checkpoint/step450000 \
--module_path=scripts/your_training_script.py
```
## Evaluations

For just default hyperparameters (faster, single run):
```bash
python3 olmoearth_pretrain/internal/full_eval_sweep.py \
--cluster=ai2/saturn-cirrascale \
--checkpoint_path=/path/to/your/checkpoint/step450000 \
--module_path=scripts/your_training_script.py \
--defaults_only
```

### 2. Example of how to add additional overrides

Pass additional training arguments after the main arguments:
```bash
python3 olmoearth_pretrain/internal/full_eval_sweep.py \
--cluster=ai2/saturn-cirrascale \
--checkpoint_path=/path/to/checkpoint \
--module_path=scripts/your_script.py \
--model.decoder_config.depth=1 \
--trainer.callbacks.downstream_evaluator.tasks_to_run=\[mados,pastis_sentinel2,breizhcrops,sen1floods11,pastis_sentinel1_sentinel2\]
```
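
The dotted keys in these overrides (for example `--model.decoder_config.depth=1`) address fields in a nested configuration. The sketch below shows one generic way such arguments can be folded into a nested dictionary. It is an illustration under stated assumptions, not the parser used by this repository or by OLMo-core: `parse_overrides` is a hypothetical helper, and values are kept as plain strings rather than being type-converted.

```python
def parse_overrides(args):
    """Fold ['--a.b.c=value', ...] into a nested dict.

    Hypothetical helper for illustration only. Values stay as strings.
    """
    config = {}
    for arg in args:
        if not arg.startswith("--") or "=" not in arg:
            raise ValueError(f"expected --dotted.key=value, got {arg!r}")
        # Split only on the first '=' so values may contain '=' themselves.
        key, value = arg[2:].split("=", 1)
        *parents, leaf = key.split(".")
        node = config
        for part in parents:
            # Create intermediate levels on demand.
            node = node.setdefault(part, {})
        node[leaf] = value
    return config


if __name__ == "__main__":
    # The shell removes the backslashes before '[' and ']', so the program
    # receives plain brackets.
    overrides = [
        "--model.decoder_config.depth=1",
        "--trainer.callbacks.downstream_evaluator.tasks_to_run=[mados,pastis_sentinel2]",
    ]
    print(parse_overrides(overrides))
```

The brackets are escaped on the command line because some shells (zsh in particular) would otherwise try to treat `[...]` as a glob pattern.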
Detailed instructions on how to replicate our evaluations are available in #TODO.

### 3. How to run panopticon
## Deploying OlmoEarth

Use the `--panopticon` flag for Panopticon model evaluation:
```bash
python3 olmoearth_pretrain/internal/full_eval_sweep.py \
--cluster=ai2/saturn-cirrascale \
--panopticon \
--model_name=panopticon
```

### 4. How to run different dino models

For DINO v3 evaluation:
```bash
python3 olmoearth_pretrain/internal/full_eval_sweep.py \
--cluster=ai2/saturn-cirrascale \
--dino_v3 \
--model_name=dino_v3_large_sat \
--model.model_name=DinoV3Models.LARGE_SATELLITE
```

### 5. How to run galileo

Use the `--galileo` flag for Galileo model evaluation:
```bash
python3 olmoearth_pretrain/internal/full_eval_sweep.py \
--cluster=ai2/saturn-cirrascale \
--galileo \
--model_name=galileo_vit_base \
--model.patch_size=4
```
The OlmoEarth models are part of the [OlmoEarth Platform](https://allenai.org/olmoearth), an end-to-end solution for scalable planetary intelligence that provides everything needed to go from raw data through R&D to fine-tuning and production deployment.

**Key Notes:**
- The script automatically determines appropriate normalization strategies for each model type (see [`olmoearth_pretrain/evals/datasets/normalize.py`](olmoearth_pretrain/evals/datasets/normalize.py))
  - OlmoEarth Pretrain: Uses the pretrained normalizer or NORM_METHOD.NORM_NO_CLIP with dataset stats
  - Galileo: Uses the Galileo pretrained normalizer or NORM_METHOD.NORM_NO_CLIP with dataset stats
  - Panopticon: Uses NORM_METHOD.STANDARDIZE with the dataset statistics
  - DinoV3: Uses NORM_METHOD.NORM_YES_CLIP_MIN_MAX_INT to scale to 0-1 and then applies either the web or sat normalization values
- Supports both full hyperparameter sweeps and default-only runs
- Use `--dry_run` to preview commands without execution
- For local testing, use `--cluster=local`
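
For readers unfamiliar with the normalization terms in the notes above, the sketch below shows the generic textbook forms of standardization and min-max scaling with optional clipping. It is not the code in `normalize.py`: the function names are hypothetical, and reading the `NO_CLIP` / `YES_CLIP` parts of the method names as "clamp to the target range or not" is an assumption.

```python
def standardize(values, mean, std):
    """Zero-mean, unit-variance scaling using precomputed dataset statistics."""
    if std <= 0:
        raise ValueError("std must be positive")
    return [(v - mean) / std for v in values]


def min_max_scale(values, lo, hi, clip=True):
    """Map the range [lo, hi] onto [0, 1].

    With clip=True, inputs outside [lo, hi] are clamped to 0 or 1.
    With clip=False, they fall outside [0, 1].
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    scaled = [(v - lo) / (hi - lo) for v in values]
    if clip:
        scaled = [min(1.0, max(0.0, s)) for s in scaled]
    return scaled


if __name__ == "__main__":
    pixels = [-5, 0, 50, 100, 120]
    print(standardize([2.0, 4.0, 6.0], mean=4.0, std=2.0))
    print(min_max_scale(pixels, lo=0, hi=100, clip=True))
    print(min_max_scale(pixels, lo=0, hi=100, clip=False))
```

Which form a model needs depends on how its pretraining inputs were scaled, which is why the script selects a strategy per model type.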

See `olmoearth_pretrain/internal/full_eval_sweep.py` for complete argument list and implementation details.
Examples of active OlmoEarth deployments are available at [`olmoearth_projects`](https://github.com/allenai/olmoearth_projects).
Binary file added assets/OlmoEarth-logo.png
Binary file added assets/datamap.png