OlmoEarth Artifact License

By exercising the rights granted to you under this OlmoEarth Artifact License
("Agreement"), you accept and agree to its terms and conditions and enter into this
Agreement with The Allen Institute for Artificial Intelligence ("Ai2"). All references
to "you" herein mean both an individual and any legal entity on whose behalf that
individual is acting.

Subject to your compliance with this Agreement, Ai2 grants you permission, free of
charge, to use the machine learning artifacts, materials, and documentation provided by
Ai2 under this Agreement as follows (collectively, "Artifacts"):

- Model weights, including architecture and parameters ("Model");
- Associated dataset or collection of data in connection with the Model ("Dataset"); and/or
- Associated software to process and run the Dataset and Model, including code in
source or binary form for training, inference, and evaluation ("Code").

1. Use Rights

Subject to the terms in Sections 2 and 3 below, you may:

(a) Use, reproduce, modify, display and distribute the Artifacts, in whole or in part;
(b) Create any other machine learning models, datasets, and derivative works that are
derived from or based on the Artifacts, including by (i) transfer of patterns of
the weights, parameters, operations, or outputs of the Model, (ii) generating
outputs of the Model to produce synthetic data, and (iii) using Code to prepare any
work of authorship (collectively, "Derivatives"); and
(c) Publish and share Derivatives.

2. Use Restrictions

You will not (and will not encourage, permit, or facilitate any others to) use any
portion of the Artifacts or Derivatives for the following purposes:

(a) Any military and defense-related applications and use cases, including without
limitation, for weapons development, military operations, intelligence gathering,
or human surveillance and policing activities.
(b) Any extractive activities, operations and use cases involving the removal of raw
materials from the earth, including without limitation, to plan or facilitate the
extraction of oil, natural gas and minerals through activities such as drilling,
mining, and deforestation.

3. Distribution

In any distribution of the Artifacts or Derivatives:

(a) You will cite Ai2 as the source of the Artifacts in any distribution of Artifacts
or Derivatives.
(b) If you distribute any portion of the Artifacts, you will either link to or provide
a copy of this Agreement to all third party recipients.
(c) If you distribute any Derivatives, you may add your own intellectual property
notices and apply other licenses and terms of use, provided that you include and
require the use restrictions in Section 2 in all downstream distribution unless Ai2
provides written approval otherwise.

4. Termination

This Agreement will automatically terminate with immediate effect and without notice to
you in the following circumstances:

(a) Your breach of the use restrictions in Section 2 and any other terms and conditions
herein; or
(b) If you file, maintain, or voluntarily participate in a lawsuit against any person
or entity asserting that the Artifacts or any portion thereof directly or
indirectly infringe any patent, except where a lawsuit is filed in response to a
corresponding lawsuit first brought against you.

For the avoidance of doubt, Ai2 may also offer the Artifacts under separate terms and
conditions or stop distributing the Artifacts at any time; however, doing so will not
terminate this Agreement. You may continue to use the Artifacts under this Agreement
unless it is terminated in accordance with the circumstances expressly stated herein.

5. Rights Not Covered

This Agreement does not cover any patents or trademarks associated with the Artifacts,
including with respect to any individual items of information and materials that are
included or incorporated within a Dataset ("Contents"). Such Contents may be factual
data or independent works such as text, images, audio, and audiovisual material.
Contents may be subject to other rights, including copyright, patent, data protection,
privacy, or personality rights, and this Agreement does not cover such rights. The use
rights in Section 1 expressly exclude any and all other rights that may apply to the
Contents of a Dataset.

6. Disclaimer and Limitation of Liability

(a) THE ARTIFACTS ARE PROVIDED "AS IS", "AS AVAILABLE", AND "WITH ALL FAULTS", WITHOUT
ANY EXPRESS OR IMPLIED WARRANTY OF ANY KIND, INCLUDING WITHOUT LIMITATION, IMPLIED
WARRANTIES OF MERCHANTABILITY, TITLE, NON-INFRINGEMENT, FITNESS FOR A PARTICULAR
PURPOSE, ABSENCE OF LATENT OR OTHER DEFECTS, ACCURACY, OR THE PRESENCE OR ABSENCE
OF ERRORS, WHETHER OR NOT KNOWN OR DISCOVERABLE. AI2 MAKES NO REPRESENTATIONS OR
WARRANTIES AS TO THE RELIABILITY, COMPLETENESS, QUALITY, PERFORMANCE,
FUNCTIONALITY, OR UTILITY OF ANY ARTIFACTS. ANY USE OF THE ARTIFACTS IS AT YOUR
SOLE RISK AND DISCRETION. YOU ARE SOLELY RESPONSIBLE FOR (1) CLEARING ANY THIRD
PARTY RIGHTS THAT MAY APPLY TO OR BE EMBODIED IN ANY ARTIFACTS, INCLUDING ANY
CONTENTS IN A DATASET; (2) OBTAINING ANY NECESSARY RIGHTS, LICENSES, CONSENTS, OR
PERMISSIONS REQUIRED FOR YOUR USE OF THE ARTIFACTS; AND (3) PERFORMING ANY DUE
DILIGENCE ON THE ARTIFACTS TO VERIFY SUITABILITY FOR YOUR INTENDED USE.

(b) TO THE MAXIMUM EXTENT PERMITTED UNDER APPLICABLE LAWS, IN NO EVENT WILL AI2 BE
LIABLE FOR ANY CLAIM, DAMAGES, LOSSES, OR OTHER LIABILITY OF ANY KIND WHATSOEVER,
WHETHER IN AN ACTION OF CONTRACT, TORT, OR OTHERWISE, ARISING FROM OR IN CONNECTION
WITH THE ARTIFACTS OR YOUR USE THEREOF.

(c) The disclaimers and limitations of liability set forth above will be interpreted in
a manner that, to the greatest extent possible, constitutes an absolute disclaimer
and waiver of all liability.

7. Other Agreements

If you have entered into a separate written agreement with Ai2 regarding your use of
the specific Artifacts that are subject to this Agreement ("Other Agreement"), the
terms of such Other Agreement will supplement the terms herein. To the extent of any
conflict between the terms of the Other Agreement and this Agreement, the Other
Agreement will take precedence.

8. Miscellaneous

If any term or provision of this Agreement is deemed invalid or unenforceable, it will
automatically be reformed to the minimum extent necessary to make it enforceable. To
the extent it cannot be reformed, it will be severed from this Agreement, and the
remaining terms and conditions will remain in full force and effect. Any delay or
failure by Ai2 to take any action or enforce any breach of this Agreement will not be
deemed as a waiver or consent by Ai2. No term or provision of this Agreement will be
waived by Ai2 unless expressly agreed in writing. Nothing in this Agreement constitutes
a limitation of any privileges, immunities, and rights that apply to you or Ai2
under applicable laws, including from the legal processes of any jurisdiction or
authority.
# OlmoEarth Pretrain
<div align="center">
<img src="assets/OlmoEarth-logo.png" alt="OlmoEarth Logo" style="width: 600px; margin: auto; display: block;"/>
<br>
<br>
</div>
<p align="center">
<a href="https://github.com/allenai/olmoearth_pretrain/blob/main/LICENSE">
<img alt="GitHub License" src="https://img.shields.io/badge/license-OlmoEarth-green">
</a>
<a href="https://huggingface.co/collections/allenai/olmoearth">
<img alt="Model Checkpoints" src="https://img.shields.io/badge/%F0%9F%A4%97%20HF-Models-yellow">
</a>
<a href="https://allenai.org/papers/olmoearth">
<img alt="Paper PDF" src="https://img.shields.io/badge/OlmoEarth-pdf-blue">
</a>
</p>

The OlmoEarth models are a flexible, multi-modal, spatio-temporal family of foundation models for Earth observation.

The OlmoEarth models are part of the [OlmoEarth platform](https://olmoearth.allenai.org/), an end-to-end solution for scalable planetary intelligence that provides everything needed to go from raw data through R&D to fine-tuning and production deployment.

## Installation

We recommend Python 3.12 and [uv](https://docs.astral.sh/uv/getting-started/installation/).
To install dependencies with uv, run:

```bash
git clone git@github.com:allenai/olmoearth_pretrain.git
cd olmoearth_pretrain
uv sync --locked --all-groups --python 3.12
# only necessary for development
uv tool install pre-commit --with pre-commit-uv --force-reinstall
```

uv installs everything into a virtual environment, so to keep using plain `python` commands you can activate it with `source .venv/bin/activate`; otherwise, use `uv run python`.

OlmoEarth is built using [OLMo-core](https://github.com/allenai/OLMo-core.git). OLMo-core's published [Docker images](https://github.com/orgs/allenai/packages?repo_name=OLMo-core) contain all core and optional dependencies.

## Model Summary

<img src="assets/model.jpg" alt="Model Architecture Diagram" style="width: 700px; margin: auto; display: block;"/>

The OlmoEarth models are trained on three satellite modalities (Sentinel-2, Sentinel-1, and Landsat) and six derived maps (OpenStreetMap, WorldCover, USDA Cropland Data Layer, SRTM DEM, WRI Canopy Height Map, and WorldCereal).

| Model Size | Weights | Encoder Params | Decoder Params |
| --- | --- | --- | --- |
| Nano | [link](https://huggingface.co/allenai/OlmoEarth-v1-Nano) | 1.4M | 800K |
| Tiny | [link](https://huggingface.co/allenai/OlmoEarth-v1-Tiny) | 6.2M | 1.9M |
| Base | [link](https://huggingface.co/allenai/OlmoEarth-v1-Base) | 89M | 30M |
| Large | [link](https://huggingface.co/allenai/OlmoEarth-v1-Large) | 308M | 53M |
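The table above can be read programmatically. As a minimal sketch (not part of the repo; the helper name and API are hypothetical), here is how one might pick the largest checkpoint whose encoder fits a parameter budget, using the encoder counts listed above:

```python
# Encoder parameter counts from the model table above.
MODELS = {
    "Nano": 1_400_000,
    "Tiny": 6_200_000,
    "Base": 89_000_000,
    "Large": 308_000_000,
}

def largest_model_within(budget_params: int) -> str:
    """Return the largest model whose encoder fits within budget_params."""
    fitting = {name: p for name, p in MODELS.items() if p <= budget_params}
    if not fitting:
        raise ValueError("no model fits the given budget")
    return max(fitting, key=fitting.get)

print(largest_model_within(10_000_000))  # -> Tiny (6.2M fits, 89M does not)
```

The decoder is only used during pretraining, so encoder size is usually the relevant budget for fine-tuning and inference.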

## Using OlmoEarth

The [Inference Quickstart](docs/Inference-Quickstart.md) shows how to initialize the OlmoEarth model and apply it to a satellite image.
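As a rough intuition for what "applying the model to a satellite image" involves, the encoder tokenizes each input tile into spatio-temporal patches, ViT-style. The sketch below is purely illustrative: the patch size and timestep count are assumptions for the example, not OlmoEarth's actual configuration (see the quickstart for the real API):

```python
def num_tokens(height: int, width: int, timesteps: int, patch: int) -> int:
    """Tokens produced by non-overlapping patch x patch tiling, per timestep."""
    assert height % patch == 0 and width % patch == 0, "tile must divide evenly"
    return (height // patch) * (width // patch) * timesteps

# A hypothetical 256x256 tile with 4 timesteps and 8-pixel patches:
print(num_tokens(256, 256, 4, 8))  # -> 4096 tokens
```

Token count grows quadratically with tile size and linearly with timesteps, which is why smaller tiles or fewer timesteps are the first levers for reducing inference cost.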

We also have several more in-depth tutorials for computing OlmoEarth embeddings and fine-tuning OlmoEarth on downstream tasks:

- [Fine-tuning OlmoEarth for Segmentation](https://github.com/allenai/olmoearth_projects/blob/main/docs/tutorials/FinetuneOlmoEarthSegmentation.md)
- [Computing Embeddings using OlmoEarth](https://github.com/allenai/rslearn/blob/master/docs/examples/OlmoEarthEmbeddings.md)
- [Fine-tuning OlmoEarth in rslearn](https://github.com/allenai/rslearn/blob/master/docs/examples/FinetuneOlmoEarth.md)

Additionally, [`olmoearth_projects`](https://github.com/allenai/olmoearth_projects) has several examples of active OlmoEarth deployments.

## Data Summary

Our pretraining dataset contains 285,288 samples of 2.56 km × 2.56 km regions from around the world, although many samples contain only a subset of the timesteps and modalities.
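For a back-of-the-envelope sense of scale, the tile size above maps neatly onto Sentinel-2's 10 m/pixel ground sampling distance (an assumption about how tiles are rasterized; the derived numbers below follow from the figures in this section):

```python
TILE_KM = 2.56          # tile edge length from the dataset description, km
PIXEL_M = 10            # Sentinel-2 ground sampling distance, m
NUM_SAMPLES = 285_288   # samples in the pretraining dataset

pixels_per_edge = int(TILE_KM * 1000 / PIXEL_M)    # 256 px per tile edge
tile_area_km2 = TILE_KM ** 2                       # ~6.55 km^2 per tile
max_coverage_km2 = NUM_SAMPLES * tile_area_km2     # upper bound; tiles may overlap

print(pixels_per_edge, round(max_coverage_km2))
```

So each tile is a 256×256 raster at Sentinel-2 resolution, and the full dataset covers at most about 1.87 million km².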

The distribution of the samples is shown below:

<img src="assets/datamap.png" alt="Training sample distribution" style="width: 500px; margin: auto; display: block;"/>

The dataset can be downloaded [here](https://huggingface.co/datasets/allenai/olmoearth_pretrain_dataset).

Detailed instructions on how to make your own pretraining dataset are available in [the dataset README](docs/Dataset-Creation.md).

## Training Scripts

Detailed instructions on how to pretrain your own OlmoEarth model are available in [Pretraining.md](docs/Pretraining.md).

## Evaluations

Detailed instructions on how to replicate our evaluations are available here:

- [Evaluations on Research Benchmarks](docs/Evaluation.md)
- [Evaluations on Partner Tasks](https://github.com/allenai/rslearn_projects/blob/master/rslp/olmoearth_evals/README.md)

## License

This code is licensed under the [OlmoEarth Artifact License](LICENSE).