OlmoEarth Artifact License

By exercising the rights granted to you under this OlmoEarth Artifact License
("Agreement"), you accept and agree to its terms and conditions and enter into this
Agreement with The Allen Institute for Artificial Intelligence ("Ai2"). All references
to "you" herein mean both an individual and any legal entity on whose behalf that
individual is acting.

Subject to your compliance with this Agreement, Ai2 grants you permission, free of
charge, to use the machine learning artifacts, materials, and documentation provided by
Ai2 under this Agreement as follows (collectively, "Artifacts"):

- Model weights, including architecture and parameters ("Model");
- Associated dataset or collection of data in connection with the Model ("Dataset"); and/or
- Associated software to process and run the Dataset and Model, including code in
source or binary form for training, inference, and evaluation ("Code").

1. Use Rights

Subject to the terms in Sections 2 and 3 below, you may:

(a) Use, reproduce, modify, display and distribute the Artifacts, in whole or in part;
(b) Create any other machine learning models, datasets, and derivative works that are
derived from or based on the Artifacts, including by (i) transfer of patterns of
the weights, parameters, operations, or outputs of the Model, (ii) generating
outputs of the Model to produce synthetic data, and (iii) using Code to prepare any
work of authorship (collectively, "Derivatives"); and
(c) Publish and share Derivatives.

2. Use Restrictions

You will not (and will not encourage, permit, or facilitate any others to) use any
portion of the Artifacts or Derivatives for the following purposes:

(a) Any military and defense-related applications and use cases, including without
limitation, for weapons development, military operations, intelligence gathering,
or human surveillance and policing activities.
(b) Any extractive activities, operations and use cases involving the removal of raw
materials from the earth, including without limitation, to plan or facilitate the
extraction of oil, natural gas and minerals through activities such as drilling,
mining, and deforestation.

3. Distribution

In any distribution of the Artifacts or Derivatives:

(a) You will cite Ai2 as the source of the Artifacts in any distribution of Artifacts
or Derivatives.
(b) If you distribute any portion of the Artifacts, you will either link to or provide
a copy of this Agreement to all third party recipients.
(c) If you distribute any Derivatives, you may add your own intellectual property
notices and apply other licenses and terms of use, provided that you include and
require the use restrictions in Section 2 in all downstream distribution unless Ai2
provides written approval otherwise.

4. Termination

This Agreement will automatically terminate with immediate effect and without notice to
you in the following circumstances:

(a) Your breach of the use restrictions in Section 2 and any other terms and conditions
herein; or
(b) If you file, maintain, or voluntarily participate in a lawsuit against any person
or entity asserting that the Artifacts or any portion thereof directly or
indirectly infringe any patent, except where a lawsuit is filed in response to a
corresponding lawsuit first brought against you.

For the avoidance of doubt, Ai2 may also offer the Artifacts under separate terms and
conditions or stop distributing the Artifacts at any time; however, doing so will not
terminate this Agreement. You may continue to use the Artifacts under this Agreement
unless it is terminated in accordance with the circumstances expressly stated herein.

5. Rights Not Covered

This Agreement does not cover any patents or trademarks associated with the Artifacts,
including with respect to any individual items of information and materials that are
included or incorporated within a Dataset ("Contents"). Such Contents may be factual
data or independent works such as text, images, audio, and audiovisual material.
Contents may be subject to other rights, including copyright, patent, data protection,
privacy, or personality rights, and this Agreement does not cover such rights. The use
rights in Section 1 expressly exclude any and all other rights that may apply to the
Contents of a Dataset.

6. Disclaimer and Limitation of Liability

(a) THE ARTIFACTS ARE PROVIDED "AS IS", "AS AVAILABLE", AND "WITH ALL FAULTS", WITHOUT
ANY EXPRESS OR IMPLIED WARRANTY OF ANY KIND, INCLUDING WITHOUT LIMITATION, IMPLIED
WARRANTIES OF MERCHANTABILITY, TITLE, NON-INFRINGEMENT, FITNESS FOR A PARTICULAR
PURPOSE, ABSENCE OF LATENT OR OTHER DEFECTS, ACCURACY, OR THE PRESENCE OR ABSENCE
OF ERRORS, WHETHER OR NOT KNOWN OR DISCOVERABLE. AI2 MAKES NO REPRESENTATIONS OR
WARRANTIES AS TO THE RELIABILITY, COMPLETENESS, QUALITY, PERFORMANCE,
FUNCTIONALITY, OR UTILITY OF ANY ARTIFACTS. ANY USE OF THE ARTIFACTS IS AT YOUR
SOLE RISK AND DISCRETION. YOU ARE SOLELY RESPONSIBLE FOR (1) CLEARING ANY THIRD
PARTY RIGHTS THAT MAY APPLY TO OR BE EMBODIED IN ANY ARTIFACTS, INCLUDING ANY
CONTENTS IN A DATASET; (2) OBTAINING ANY NECESSARY RIGHTS, LICENSES, CONSENTS, OR
PERMISSIONS REQUIRED FOR YOUR USE OF THE ARTIFACTS; AND (3) PERFORMING ANY DUE
DILIGENCE ON THE ARTIFACTS TO VERIFY SUITABILITY FOR YOUR INTENDED USE.

(b) TO THE MAXIMUM EXTENT PERMITTED UNDER APPLICABLE LAWS, IN NO EVENT WILL AI2 BE
LIABLE FOR ANY CLAIM, DAMAGES, LOSSES, OR OTHER LIABILITY OF ANY KIND WHATSOEVER,
WHETHER IN AN ACTION OF CONTRACT, TORT, OR OTHERWISE, ARISING FROM OR IN CONNECTION
WITH THE ARTIFACTS OR YOUR USE THEREOF.

(c) The disclaimers and limitations of liability set forth above will be interpreted in
a manner that, to the greatest extent possible, constitutes an absolute disclaimer
and waiver of all liability.

7. Other Agreements

If you have entered into a separate written agreement with Ai2 regarding your use of
the specific Artifacts that are subject to this Agreement ("Other Agreement"), the
terms of such Other Agreement will supplement the terms herein. To the extent of any
conflict between the terms of the Other Agreement and this Agreement, the Other
Agreement will take precedence.

8. Miscellaneous

If any term or provision of this Agreement is deemed invalid or unenforceable, it will
automatically be reformed to the minimum extent necessary to make it enforceable. To
the extent it cannot be reformed, it will be severed from this Agreement, and the
remaining terms and conditions will remain in full force and effect. Any delay or
failure by Ai2 to take any action or enforce any breach of this Agreement will not be
deemed as a waiver or consent by Ai2. No term or provision of this Agreement will be
waived by Ai2 unless expressly agreed in writing. Nothing in this Agreement constitutes
a limitation of any privileges, immunities, and rights that apply to you or Ai2
under applicable laws, including from the legal processes of any jurisdiction or
authority.
# OlmoEarth Pretrain
<div align="center">
<img src="assets/OlmoEarth-logo.png" alt="OlmoEarth Logo" style="width: 600px; margin: auto; display: block;"/>
<br>
<br>
</div>
<p align="center">
<a href="https://github.com/allenai/olmoearth_pretrain/blob/main/LICENSE">
<img alt="GitHub License" src="https://img.shields.io/badge/license-OlmoEarth-green">
</a>
<a href="https://huggingface.co/collections/allenai/olmoearth">
<img alt="Model Checkpoints" src="https://img.shields.io/badge/%F0%9F%A4%97%20HF-Models-yellow">
</a>
<a href="https://allenai.org/papers/olmoearth">
<img alt="Paper PDF" src="https://img.shields.io/badge/OlmoEarth-pdf-blue">
</a>
</p>

The OlmoEarth models are a flexible, multi-modal, spatio-temporal family of foundation models for Earth observation.

The OlmoEarth models are part of the [OlmoEarth platform](https://olmoearth.allenai.org/), an end-to-end solution for scalable planetary intelligence that provides everything needed to go from raw data through R&D to fine-tuning and production deployment.

## Installation

We recommend Python 3.12 and [uv](https://docs.astral.sh/uv/getting-started/installation/).
To install dependencies with uv, run:

```bash
git clone git@github.com:allenai/olmoearth_pretrain.git
cd olmoearth_pretrain
uv sync --locked --all-groups --python 3.12
# only necessary for development
uv tool install pre-commit --with pre-commit-uv --force-reinstall
```

uv installs everything into a virtual environment, so to keep using plain `python` commands you can activate it with `source .venv/bin/activate`; otherwise, use `uv run python`.

OlmoEarth is built using [OLMo-core](https://github.com/allenai/OLMo-core.git). OLMo-core's published [Docker images](https://github.com/orgs/allenai/packages?repo_name=OLMo-core) contain all core and optional dependencies.

## Model Summary

<img src="assets/model.jpg" alt="Model Architecture Diagram" style="width: 700px; margin: auto; display: block;"/>

The OlmoEarth models are trained on three satellite modalities (Sentinel-2, Sentinel-1, and Landsat) and six derived maps (OpenStreetMap, WorldCover, USDA Cropland Data Layer, SRTM DEM, WRI Canopy Height Map, and WorldCereal).

| Model Size | Weights | Encoder Params | Decoder Params |
| --- | --- | --- | --- |
| Nano | [link](https://huggingface.co/allenai/OlmoEarth-v1-Nano) | 1.4M | 800K |
| Tiny | [link](https://huggingface.co/allenai/OlmoEarth-v1-Tiny) | 6.2M | 1.9M |
| Base | [link](https://huggingface.co/allenai/OlmoEarth-v1-Base) | 89M | 30M |
| Large | [link](https://huggingface.co/allenai/OlmoEarth-v1-Large) | 308M | 53M |
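The table above can be read programmatically. As a minimal sketch (not part of the repo; the helper name and API are hypothetical), here is how one might pick the largest checkpoint whose encoder fits a parameter budget, using the encoder counts listed above:

```python
# Encoder parameter counts from the model table above.
MODELS = {
    "Nano": 1_400_000,
    "Tiny": 6_200_000,
    "Base": 89_000_000,
    "Large": 308_000_000,
}

def largest_model_within(budget_params: int) -> str:
    """Return the largest model whose encoder fits within budget_params."""
    fitting = {name: p for name, p in MODELS.items() if p <= budget_params}
    if not fitting:
        raise ValueError("no model fits the given budget")
    return max(fitting, key=fitting.get)

print(largest_model_within(10_000_000))  # -> Tiny (6.2M fits, 89M does not)
```

The decoder is only used during pretraining, so encoder size is usually the relevant budget for fine-tuning and inference.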

## Using OlmoEarth

The [Inference Quickstart](docs/Inference-Quickstart.md) shows how to initialize the OlmoEarth model and apply it to a satellite image.
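As a rough intuition for what "applying the model to a satellite image" involves, the encoder tokenizes each input tile into spatio-temporal patches, ViT-style. The sketch below is purely illustrative: the patch size and timestep count are assumptions for the example, not OlmoEarth's actual configuration (see the quickstart for the real API):

```python
def num_tokens(height: int, width: int, timesteps: int, patch: int) -> int:
    """Tokens produced by non-overlapping patch x patch tiling, per timestep."""
    assert height % patch == 0 and width % patch == 0, "tile must divide evenly"
    return (height // patch) * (width // patch) * timesteps

# A hypothetical 256x256 tile with 4 timesteps and 8-pixel patches:
print(num_tokens(256, 256, 4, 8))  # -> 4096 tokens
```

Token count grows quadratically with tile size and linearly with timesteps, which is why smaller tiles or fewer timesteps are the first levers for reducing inference cost.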

We also have several more in-depth tutorials for computing OlmoEarth embeddings and fine-tuning OlmoEarth on downstream tasks:

- [Fine-tuning OlmoEarth for Segmentation](https://github.com/allenai/olmoearth_projects/blob/main/docs/tutorials/FinetuneOlmoEarthSegmentation.md)
- [Computing Embeddings using OlmoEarth](https://github.com/allenai/rslearn/blob/master/docs/examples/OlmoEarthEmbeddings.md)
- [Fine-tuning OlmoEarth in rslearn](https://github.com/allenai/rslearn/blob/master/docs/examples/FinetuneOlmoEarth.md)

Additionally, [`olmoearth_projects`](https://github.com/allenai/olmoearth_projects) has several examples of active OlmoEarth deployments.

## Data Summary

Our pretraining dataset contains 285,288 samples of 2.56 km × 2.56 km regions from around the world, although many samples contain only a subset of the timesteps and modalities.
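For a back-of-the-envelope sense of scale, the tile size above maps neatly onto Sentinel-2's 10 m/pixel ground sampling distance (an assumption about how tiles are rasterized; the derived numbers below follow from the figures in this section):

```python
TILE_KM = 2.56          # tile edge length from the dataset description, km
PIXEL_M = 10            # Sentinel-2 ground sampling distance, m
NUM_SAMPLES = 285_288   # samples in the pretraining dataset

pixels_per_edge = int(TILE_KM * 1000 / PIXEL_M)    # 256 px per tile edge
tile_area_km2 = TILE_KM ** 2                       # ~6.55 km^2 per tile
max_coverage_km2 = NUM_SAMPLES * tile_area_km2     # upper bound; tiles may overlap

print(pixels_per_edge, round(max_coverage_km2))
```

So each tile is a 256×256 raster at Sentinel-2 resolution, and the full dataset covers at most about 1.87 million km².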

The distribution of the samples is shown below:

<img src="assets/datamap.png" alt="Training sample distribution" style="width: 500px; margin: auto; display: block;"/>

The dataset can be downloaded [here](https://huggingface.co/datasets/allenai/olmoearth_pretrain_dataset).

Detailed instructions on how to make your own pretraining dataset are available in [the dataset README](docs/Dataset-Creation.md).

## Training Scripts

Detailed instructions on how to pretrain your own OlmoEarth model are available in [Pretraining.md](docs/Pretraining.md).

## Evaluations

Detailed instructions on how to replicate our evaluations are available here:

- [Evaluations on Research Benchmarks](docs/Evaluation.md)
- [Evaluations on Partner Tasks](https://github.com/allenai/rslearn_projects/blob/master/rslp/olmoearth_evals/README.md)

## License

This code is licensed under the [OlmoEarth Artifact License](LICENSE).