Readme #417
Merged
Changes from 5 commits
21 commits:
- 120cc17 Update Readme (gabrieltseng)
- eb5ca7e Update Readme (gabrieltseng)
- 7e42fe7 Cleanup (gabrieltseng)
- 7434ab6 updates (gabrieltseng)
- 9d0a256 thanks cursor (gabrieltseng)
- 05f1a47 Address comments (gabrieltseng)
- da7bd19 increase file size a bit (gabrieltseng)
- ad371ea Merge branch 'main' into gabi/readme (gabrieltseng)
- f0b5235 add liscense (Hgherzog)
- a33cb61 add badges (gabrieltseng)
- 720ebd8 thanks cursor (gabrieltseng)
- 712317a fill in the link to evaluation doc (favyen2)
- 8469ee3 update using olmoearth section (favyen2)
- 008f699 add license (favyen2)
- 3497ed4 add all six maps (favyen2)
- 5c6f615 misc fixes and more links (favyen2)
- 32d1ce5 fix olmoearth platform link (favyen2)
- 8fd4406 fix olmoearth_projects link (favyen2)
- e3ee870 fix fine-tuning guide link (favyen2)
- f4def3c update data map with henry's update (favyen2)
- b7f4351 fix (favyen2)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,122 +1,64 @@ | ||
| # OlmoEarth Pretrain | ||
| <div align="center"> | ||
| <img src="assets/OlmoEarth-logo.png" alt="OlmoEarth Logo" style="width: 600px; margin-left: auto; margin-right: auto; display: block;"/> | ||
| <br> | ||
| <br> | ||
| </div> | ||
| <p align="center"> | ||
| <a href="https://huggingface.co/collections/allenai/olmoearth"> | ||
| <img alt="Model Checkpoints" src="https://img.shields.io/badge/%F0%9F%A4%97%20HF-Models-yellow"> | ||
| </a> | ||
| </p> | ||
|
|
||
| Allen Institute for AI's OlmoEarth Pretrain project | ||
| The OlmoEarth models are a flexible, multi-modal, spatio-temporal family of foundation models for Earth observation. | ||
|
|
||
| Earth system foundation model: data, training, and evaluation | ||
| The OlmoEarth models are part of the [OlmoEarth Platform](https://allenai.org/olmoearth). The OlmoEarth Platform is an end-to-end solution for scalable planetary intelligence, providing everything needed to go from raw data through R&D to fine-tuning and production deployment. | ||
|
|
||
| launching training runs on beaker | ||
| ## General Setup | ||
| ## Installation | ||
|
|
||
| **Requirements:** Python 3.11 or higher (Python 3.12 recommended) | ||
|
|
||
| 1. Install uv: `curl -LsSf https://astral.sh/uv/install.sh | sh` (other ways to do it are documented [here](https://docs.astral.sh/uv/getting-started/installation/)) | ||
| 2. Navigate to root directory of this repo and run `uv sync --locked --all-groups --python 3.12` | ||
| 3. Install the pre-commit tool `uv tool install pre-commit --with pre-commit-uv --force-reinstall` | ||
| 4. uv installs everything into a venv, so to keep using `python` commands you can activate uv's venv: `source .venv/bin/activate`. Otherwise, swap to `uv run python`. | ||
| We recommend Python 3.12, and recommend using [uv](https://docs.astral.sh/uv/getting-started/installation/). | ||
| To install dependencies with uv, run: | ||
|
|
||
| ```bash | ||
| git clone git@github.com:allenai/olmoearth_pretrain.git | ||
| cd olmoearth_pretrain | ||
| uv sync --locked --all-groups --python 3.12 | ||
| # only necessary for development | ||
| uv tool install pre-commit --with pre-commit-uv --force-reinstall | ||
| ``` | ||
|
|
||
| ## OlmoEarth Pretrain Dataset | ||
| uv installs everything into a venv, so to keep using python commands you can activate uv's venv: `source .venv/bin/activate`. Otherwise, swap to `uv run python`. | ||
|
|
||
| The dataset for training is stored in h5 datasets. A training dataset can be created from tiles by running `python3 -m olmoearth_pretrain.internal.run_h5_conversion`. | ||
| OlmoEarth is built using [OLMo-core](https://github.com/allenai/OLMo-core.git). OLMo-core's published [Docker images](https://github.com/orgs/allenai/packages?repo_name=OLMo-core) contain all core and optional dependencies. | ||
|
|
||
| ## Model Summary | ||
|
|
||
|
|
||
| We have 2 versions of each dataset: one with 256 x 256 tiles and one with 4x as many 128 x 128 tiles. The 128 x 128 tiles may load faster because of throughput (GB/s) bottlenecks on Weka. | ||
| The OlmoEarth models are trained on three satellite modalities (Sentinel-2, Sentinel-1 and Landsat) and six derived maps (including OpenStreetMap and WorldCover). | ||
| | Model Size | Weights | Encoder Params | Decoder Params | | ||
| | --- | --- | --- | --- | | ||
| | Nano | [link](https://huggingface.co/allenai/OlmoEarth-v1-Nano) | 1.4M | 800K | | ||
| | Tiny | [link](https://huggingface.co/allenai/OlmoEarth-v1-Tiny) | 6.2M | 1.9M | | ||
| | Base | [link](https://huggingface.co/allenai/OlmoEarth-v1-Base) | 89M | 30M | | ||
|
|
||
|
|
||
| OUT OF DATE! | ||
| - **Presto Dataset**: ~120k samples with Landsat, OpenStreetMap raster, Sentinel-1, Sentinel-2 L2A, SRTM, and WorldCover modalities, sampled at the locations used in the Galileo paper | ||
| - 256 Path: `/weka/dfive-default/helios/dataset/presto/rerun_1_h5py_data_w_missing_timesteps_zstd_3/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/117473/` | ||
| - 128 Path: `/weka/dfive-default/helios/dataset/presto/h5py_data_w_missing_timesteps_128_x_4_zstd_3/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/469892` | ||
| ## Data Summary | ||
|
|
||
| - **OSM Sampling Dataset**: ~285k samples with Landsat, OpenStreetMap raster, Sentinel-1, Sentinel-2 L2A, SRTM, and WorldCover modalities, sampled across OpenStreetMap classes | ||
| - 256 Path: `/weka/dfive-default/helios/dataset/osm_sampling/h5py_data_w_missing_timesteps_zstd_3/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/285288/` | ||
| - 128 Path: `/weka/dfive-default/helios/dataset/osm_sampling/h5py_data_w_missing_timesteps_128_x_4_zstd_3/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/1141152` | ||
| - **OSM Big Dataset**: ~324k samples with Landsat, OpenStreetMap raster, Sentinel-1, Sentinel-2 L2A, SRTM, and WorldCover modalities, sampled across a wider set of OpenStreetMap classes | ||
| - 256 Path: `/weka/dfive-default/helios/dataset/osmbig/h5py_data_w_missing_timesteps_zstd_3/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/324482/` | ||
| - 128 Path: `/weka/dfive-default/helios/dataset/osmbig/h5py_data_w_missing_timesteps_zstd_3_128_x_4/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/1297928` | ||
| - **Presto Neighbor Dataset**: ~877k samples with Landsat, OpenStreetMap raster, Sentinel-1, Sentinel-2 L2A, SRTM, and WorldCover modalities, covering the Presto tiles plus their neighboring tiles | ||
| - 256 Path: `/weka/dfive-default/helios/dataset/presto_neighbor/h5py_data_w_missing_timesteps_zstd_3/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/876937/` | ||
| - 128 Path: `/weka/dfive-default/helios/dataset/presto_neighbor/h5py_data_w_missing_timesteps_zstd_3_128_x_4/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/3507748` | ||
| - **WorldCover Sampling Dataset**: ~1.6M samples with Landsat, OpenStreetMap raster, Sentinel-1, Sentinel-2 L2A, SRTM, and WorldCover modalities, sampled by WorldCover class with some additional random sampling over the rest of the world | ||
| - 256 Path: `/weka/dfive-default/helios/dataset/worldcover_sampling/h5py_data_w_missing_timesteps_zstd_3/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/1592645/` | ||
| - 128 Path: `/weka/dfive-default/helios/dataset/worldcover_sampling/h5py_data_w_missing_timesteps_zstd_3_128_x_4/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/6370580/` | ||
| Our pretraining dataset contains around 300,000 samples from around the world, each covering a 2.56 km × 2.56 km region, although many samples contain only a subset of the timesteps and modalities. | ||
|
|
||
| The distribution of the samples is available below: | ||
|
|
||
| ## Running Eval Suite | ||
| <img src="assets/datamap.png" alt="Training sample distribution" style="width: 500px; margin-left: auto; margin-right: auto; display: block;"/> | ||
|
|
||
| [`olmoearth_pretrain/internal/full_eval_sweep.py`](olmoearth_pretrain/internal/full_eval_sweep.py) runs comprehensive evaluation sweeps across multiple downstream tasks for any OlmoEarth Pretrain checkpoint. It automatically sweeps over learning rates, pooling types, and normalization strategies. | ||
| The dataset can be downloaded [here](https://huggingface.co/datasets/allenai/olmoearth_pretrain_dataset). | ||
|
|
||
| ### 1. How to run eval for a given checkpoint | ||
| ## Training scripts | ||
|
|
||
| Basic command to run evaluation sweep for a checkpoint: | ||
| Detailed instructions on how to pretrain your own OlmoEarth model are available in [Pretraining.md](docs/Pretraining.md). | ||
|
|
||
| ```bash | ||
| python3 olmoearth_pretrain/internal/full_eval_sweep.py \ | ||
| --cluster=ai2/saturn-cirrascale \ | ||
| --checkpoint_path=/path/to/your/checkpoint/step450000 \ | ||
| --module_path=scripts/your_training_script.py | ||
| ``` | ||
| ## Evaluations | ||
|
|
||
| For just default hyperparameters (faster, single run): | ||
| ```bash | ||
| python3 olmoearth_pretrain/internal/full_eval_sweep.py \ | ||
| --cluster=ai2/saturn-cirrascale \ | ||
| --checkpoint_path=/path/to/your/checkpoint/step450000 \ | ||
| --module_path=scripts/your_training_script.py \ | ||
| --defaults_only | ||
| ``` | ||
|
|
||
| ### 2. Example of how to add additional overrides | ||
|
|
||
| Pass additional training arguments after the main arguments: | ||
| ```bash | ||
| python3 olmoearth_pretrain/internal/full_eval_sweep.py \ | ||
| --cluster=ai2/saturn-cirrascale \ | ||
| --checkpoint_path=/path/to/checkpoint \ | ||
| --module_path=scripts/your_script.py \ | ||
| --model.decoder_config.depth=1 \ | ||
| --trainer.callbacks.downstream_evaluator.tasks_to_run=\[mados,pastis_sentinel2,breizhcrops,sen1floods11,pastis_sentinel1_sentinel2\] | ||
| ``` | ||
| Detailed instructions on how to replicate our evaluations are available in #TODO. | ||
|
|
||
|
|
||
| ### 3. How to run panopticon | ||
| ## Deploying OlmoEarth | ||
|
|
||
| Use the `--panopticon` flag for Panopticon model evaluation: | ||
| ```bash | ||
| python3 olmoearth_pretrain/internal/full_eval_sweep.py \ | ||
| --cluster=ai2/saturn-cirrascale \ | ||
| --panopticon \ | ||
| --model_name=panopticon | ||
| ``` | ||
|
|
||
| ### 4. How to run different dino models | ||
|
|
||
| For DINO v3 evaluation: | ||
| ```bash | ||
| python3 olmoearth_pretrain/internal/full_eval_sweep.py \ | ||
| --cluster=ai2/saturn-cirrascale \ | ||
| --dino_v3 \ | ||
| --model_name=dino_v3_large_sat \ | ||
| --model.model_name=DinoV3Models.LARGE_SATELLITE | ||
| ``` | ||
|
|
||
| ### 5. How to run galileo | ||
|
|
||
| Use the `--galileo` flag for Galileo model evaluation: | ||
| ```bash | ||
| python3 olmoearth_pretrain/internal/full_eval_sweep.py \ | ||
| --cluster=ai2/saturn-cirrascale \ | ||
| --galileo \ | ||
| --model_name=galileo_vit_base \ | ||
| --model.patch_size=4 | ||
| ``` | ||
| The OlmoEarth models are part of the [OlmoEarth Platform](https://allenai.org/olmoearth). The OlmoEarth Platform is an end-to-end solution for scalable planetary intelligence, providing everything needed to go from raw data through R&D to fine-tuning and production deployment. | ||
|
|
||
|
|
||
| **Key Notes:** | ||
| - The script automatically determines appropriate normalization strategies for each model type (see [`olmoearth_pretrain/evals/datasets/normalize.py`](olmoearth_pretrain/evals/datasets/normalize.py)) | ||
| - OlmoEarth Pretrain: Use pretrained normalizer or NORM_METHOD.NORM_NO_CLIP with dataset stats | ||
| - Galileo: Use galileo pretrained normalizer or NORM_METHOD.NORM_NO_CLIP with dataset stats | ||
| - Panopticon: Uses NORM_METHOD.STANDARDIZE with the dataset statistics | ||
| - DinoV3: Uses NORM_METHOD.NORM_YES_CLIP_MIN_MAX_INT to get to 0-1 and then applies either the web or sat normalization values | ||
| - Supports both full hyperparameter sweeps and default-only runs | ||
| - Use `--dry_run` to preview commands without execution | ||
| - For local testing, use `--cluster=local` | ||
|
|
||
| See `olmoearth_pretrain/internal/full_eval_sweep.py` for complete argument list and implementation details. | ||
| Examples of active OlmoEarth deployments are available at [`olmoearth_projects`](https://github.com/allenai/olmoearth_projects). | ||
|
|
||
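
The evaluation commands in the removed README section are multi-line shell invocations, and a stray or missing line-continuation backslash is easy to introduce when editing them by hand. As a minimal sketch, the Python below assembles the same kind of command as a list and renders it with one argument per line. It only uses flags quoted in the diff (`--cluster`, `--checkpoint_path`, `--module_path`, `--defaults_only`, `--dry_run`) and does not run or validate anything. The helper names `build_sweep_command` and `format_multiline` are hypothetical and are not part of `olmoearth_pretrain`.

```python
import shlex

# Path of the sweep script as quoted in the README diff.
SWEEP_SCRIPT = "olmoearth_pretrain/internal/full_eval_sweep.py"


def build_sweep_command(cluster, checkpoint_path, module_path,
                        defaults_only=False, dry_run=False, overrides=()):
    """Return the argument list for one evaluation sweep (hypothetical helper)."""
    args = [
        "python3",
        SWEEP_SCRIPT,
        f"--cluster={cluster}",
        f"--checkpoint_path={checkpoint_path}",
        f"--module_path={module_path}",
    ]
    if defaults_only:
        args.append("--defaults_only")
    if dry_run:
        args.append("--dry_run")
    # The README says extra training overrides go after the main arguments.
    args.extend(overrides)
    return args


def format_multiline(args):
    """Render the command with a backslash on every line except the last."""
    quoted = [shlex.quote(a) for a in args]
    lines = [" ".join(quoted[:2])] + ["    " + q for q in quoted[2:]]
    return " \\\n".join(lines)


if __name__ == "__main__":
    command = build_sweep_command(
        cluster="local",
        checkpoint_path="/path/to/your/checkpoint/step450000",
        module_path="scripts/your_training_script.py",
        defaults_only=True,
        dry_run=True,
        overrides=["--model.decoder_config.depth=1"],
    )
    print(format_multiline(command))
```

Building the command as a list and quoting each argument with `shlex.quote` also covers overrides that contain shell-special characters such as the square brackets in `tasks_to_run`, so they do not need manual escaping.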