diff --git a/LICENSE b/LICENSE new file mode 100644 index 000000000..f764333f5 --- /dev/null +++ b/LICENSE @@ -0,0 +1,127 @@ +OlmoEarth Artifact License + +By exercising the rights granted to you under this OlmoEarth Artifact License +("Agreement"), you accept and agree to its terms and conditions and enter into this +Agreement with The Allen Institute for Artificial Intelligence ("Ai2"). All references +to "you" herein mean both an individual and any legal entity that an individual is acting +on behalf of. + +Subject to your compliance with this Agreement, Ai2 grants you permission, free of +charge, to use the machine learning artifacts, materials, and documentation provided by +Ai2 under this Agreement as follows (collectively, "Artifacts"): + +- Model weights, including architecture and parameters ("Model"); +- Associated dataset or collection of data in connection with the Model ("Dataset"); and/or +- Associated software to process and run the Dataset and Model, including code in + source or binary form for training, inference, and evaluation ("Code"). + +1. Use Rights + +Subject to the terms in Sections 2 and 3 below, you may: + +(a) Use, reproduce, modify, display and distribute the Artifacts, in whole or in part; +(b) Create any other machine learning models, datasets, and derivative works that are + derived from or based on the Artifacts, including by (i) transfer of patterns of + the weights, parameters, operations, or outputs of the Model, (ii) generating + outputs of the Model to produce synthetic data, and (iii) using Code to prepare any + work of authorship (collectively, "Derivatives"); and +(c) Publish and share Derivatives. + +2. 
Use Restrictions + +You will not (and will not encourage, permit, or facilitate any others to) use any +portion of the Artifacts or Derivatives for the following purposes: + +(a) Any military and defense-related applications and use cases, including without + limitation, for weapons development, military operations, intelligence gathering, + or human surveillance and policing activities. +(b) Any extractive activities, operations and use cases involving the removal of raw + materials from the earth, including without limitation, to plan or facilitate the + extraction of oil, natural gas and minerals through activities such as drilling, + mining, and deforestation. + +3. Distribution + +In any distribution of the Artifacts or Derivatives: + +(a) You will cite Ai2 as the source of the Artifacts in any distribution of Artifacts + or Derivatives. +(b) If you distribute any portion of the Artifacts, you will either link to or provide + a copy of this Agreement to all third party recipients. +(c) If you distribute any Derivatives, you may add your own intellectual property + notices and apply other licenses and terms of use, provided that you include and + require the use restrictions in Section 2 in all downstream distribution unless Ai2 + provides written approval otherwise. + +4. Termination + +This Agreement will automatically terminate with immediate effect and without notice to +you in the following circumstances: + +(a) Your breach of the use restrictions in Section 2 and any other terms and conditions + herein; or +(b) If you file, maintain, or voluntarily participate in a lawsuit against any person + or entity asserting that the Artifacts or any portion thereof directly or + indirectly infringe any patent, except where a lawsuit is filed in response to a + corresponding lawsuit first brought against you. 
+ +For the avoidance of doubt, Ai2 may also offer the Artifacts under separate terms and +conditions or stop distributing the Artifacts at any time; however, doing so will not +terminate this Agreement. You may continue to use the Artifacts under this Agreement +unless it is terminated in accordance with the circumstances expressly stated herein. + +5. Rights Not Covered + +This Agreement does not cover any patents or trademarks associated with the Artifacts, +including with respect to any individual items of information and materials that are +included or incorporated within a Dataset ("Contents"). Such Contents may be factual +data or independent works such as text, images, audio, and audio visual material. +Contents may be subject to other rights, including copyright, patent, data protection, +privacy, or personality rights, and this Agreement does not cover such rights. The use +rights in Section 1 expressly exclude any and all other rights that may apply to the +Contents of a Dataset. + +6. Disclaimer and Limitation of Liability + +(a) THE ARTIFACTS ARE PROVIDED "AS IS", "AS AVAILABLE", AND "WITH ALL FAULTS", WITHOUT + ANY EXPRESS OR IMPLIED WARRANTY OF ANY KIND, INCLUDING WITHOUT LIMITATION, IMPLIED + WARRANTIES OF MERCHANTABILITY, TITLE, NON-INFRINGEMENT, FITNESS FOR A PARTICULAR + PURPOSE, ABSENCE OF LATENT OR OTHER DEFECTS, ACCURACY, OR THE PRESENCE OR ABSENCE + OF ERRORS, WHETHER OR NOT KNOWN OR DISCOVERABLE. AI2 MAKES NO REPRESENTATIONS OR + WARRANTIES AS TO THE RELIABILITY, COMPLETENESS, QUALITY, PERFORMANCE, + FUNCTIONALITY, OR UTILITY OF ANY ARTIFACTS. ANY USE OF THE ARTIFACTS IS AT YOUR + SOLE RISK AND DISCRETION. 
YOU ARE SOLELY RESPONSIBLE FOR (1) CLEARING ANY THIRD + PARTY RIGHTS THAT MAY APPLY TO OR BE EMBODIED IN ANY ARTIFACTS, INCLUDING ANY + CONTENTS IN A DATASET; (2) OBTAINING ANY NECESSARY RIGHTS, LICENSES, CONSENTS, OR + PERMISSIONS REQUIRED FOR YOUR USE OF THE ARTIFACTS; AND (3) PERFORMING ANY DUE + DILIGENCE ON THE ARTIFACTS TO VERIFY SUITABILITY FOR YOUR INTENDED USE. + +(b) TO THE MAXIMUM EXTENT PERMITTED UNDER APPLICABLE LAWS, IN NO EVENT WILL AI2 BE + LIABLE FOR ANY CLAIM, DAMAGES, LOSSES, OR OTHER LIABILITY OF ANY KIND WHATSOEVER, + WHETHER IN AN ACTION OF CONTRACT, TORT, OR OTHERWISE, ARISING FROM OR IN CONNECTION + WITH THE ARTIFACTS OR YOUR USE THEREOF. + +(c) The disclaimers and limitations of liability set forth above will be interpreted in + a manner that, to the greatest extent possible, constitutes an absolute disclaimer + and waiver of all liability. + +7. Other Agreements + +If you have entered into a separate written agreement with Ai2 regarding your use of +the specific Artifacts that are subject to this Agreement ("Other Agreement"), the +terms of such Other Agreement will supplement the terms herein. To the extent of any +conflict between the terms of the Other Agreement and this Agreement, the Other +Agreement will take precedence. + +8. Miscellaneous + +If any term or provision of this Agreement is deemed invalid or unenforceable, it will +automatically be reformed to the minimum extent necessary to make it enforceable. To +the extent it cannot be reformed, it will be severed from this Agreement, and the +remaining terms and conditions will remain in full force and effect. Any delay or +failure by Ai2 to take any action or enforce any breach of this Agreement will not be +deemed a waiver or consent by Ai2. No term or provision of this Agreement will be +waived by Ai2 unless expressly agreed in writing. 
Nothing in this Agreement constitutes +a limitation of any privileges, immunities, and rights that apply to you or Ai2 +under applicable laws, including from the legal processes of any jurisdiction or +authority. diff --git a/README.md b/README.md index beaa7d9c8..ea5bd18fb 100644 --- a/README.md +++ b/README.md @@ -1,122 +1,89 @@ -# OlmoEarth Pretrain +
+ OlmoEarth Logo +
+
+
+

+ + GitHub License + + + Model Checkpoints + + + Paper PDF + +

+ +The OlmoEarth models are a flexible, multi-modal, spatio-temporal family of foundation models for Earth observation. + +The OlmoEarth models are part of the [OlmoEarth platform](https://olmoearth.allenai.org/), an end-to-end solution for scalable planetary intelligence that provides everything needed to go from raw data through R&D to fine-tuning and production deployment. + +## Installation + +We recommend Python 3.12 and the [uv](https://docs.astral.sh/uv/getting-started/installation/) package manager. +To install dependencies with uv, run: -Allen Institute for AI's OlmoEarth Pretrain project - -Earth system foundation model: data, training, and evaluation - -launching training runs on beaker -## General Setup - -**Requirements:** Python 3.11 or higher (Python 3.12 recommended) - -1. Install uv: `curl -LsSf https://astral.sh/uv/install.sh | sh` (other ways to do it are documented [here](https://docs.astral.sh/uv/getting-started/installation/)) -2. Navigate to root directory of this repo and run `uv sync --locked --all-groups --python 3.12` -3. Install the pre-commit tool `uv tool install pre-commit --with pre-commit-uv --force-reinstall` -4. uv installs everything into a venv, so to keep using `python` commands you can activate uv's venv: `source .venv/bin/activate`. Otherwise, swap to `uv run python`. +```bash +git clone git@github.com:allenai/olmoearth_pretrain.git +cd olmoearth_pretrain +uv sync --locked --all-groups --python 3.12 +# only necessary for development +uv tool install pre-commit --with pre-commit-uv --force-reinstall +``` +uv installs everything into a virtual environment, so to keep using plain `python` commands you can activate uv's venv with `source .venv/bin/activate`. Otherwise, swap to `uv run python`. -## OlmoEarth Pretrain Dataset +OlmoEarth is built using [OLMo-core](https://github.com/allenai/OLMo-core.git). 
OLMo-core's published [Docker images](https://github.com/orgs/allenai/packages?repo_name=OLMo-core) contain all core and optional dependencies. -The dataset for training is stored in h5 datasets. A training dataset can be created from tiles via `python3 -m olmoearth_pretrain.internal.run_h5_conversion` script. +## Model Summary +Model Architecture Diagram -We have 2 versions of each dataset 1 with 256 x 256 tiles and 1 with 4x as many 128 by 128 tiles. The 128 by 128 tiles may be faster for data loading due to GB/s bottlenecks on weka. +The OlmoEarth models are trained on three satellite modalities (Sentinel-2, Sentinel-1, and Landsat) and six derived maps (OpenStreetMap, WorldCover, USDA Cropland Data Layer, SRTM DEM, WRI Canopy Height Map, and WorldCereal). +| Model Size | Weights | Encoder Params | Decoder Params | +| --- | --- | --- | --- | +| Nano | [link](https://huggingface.co/allenai/OlmoEarth-v1-Nano) | 1.4M | 800K | +| Tiny | [link](https://huggingface.co/allenai/OlmoEarth-v1-Tiny) | 6.2M | 1.9M | +| Base | [link](https://huggingface.co/allenai/OlmoEarth-v1-Base) | 89M | 30M | +| Large | [link](https://huggingface.co/allenai/OlmoEarth-v1-Large) | 308M | 53M | -OUT OF DATE! 
-- **Presto Dataset**: ~120k samples with Landsat, OpenStreetMap raster, Sentinel-1, Sentinel-2 L2A, SRTM, and WorldCover modalities sampled via locations used in Galileo paper - - 256 Path: `/weka/dfive-default/helios/dataset/presto/rerun_1_h5py_data_w_missing_timesteps_zstd_3/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/117473/` - - 128 Path: `/weka/dfive-default/helios/dataset/presto/h5py_data_w_missing_timesteps_128_x_4_zstd_3/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/469892` +## Using OlmoEarth -- **OSM Sampling Dataset**: ~285k samples with Landsat, OpenStreetMap raster, Sentinel-1, Sentinel-2 L2A, SRTM, and WorldCover modalities sampled across OpenStreetmap classes - - 256 Path: `/weka/dfive-default/helios/dataset/osm_sampling/h5py_data_w_missing_timesteps_zstd_3/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/285288/` - - 128 Path: `/weka/dfive-default/helios/dataset/osm_sampling/h5py_data_w_missing_timesteps_128_x_4_zstd_3/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/1141152` -- **OSM Big Dataset**: ~324k samples with Landsat, OpenStreetMap raster, Sentinel-1, Sentinel-2 L2A, SRTM, and WorldCover modalities sampled across a wider set of opens treetmap classes - - 256 Path: `/weka/dfive-default/helios/dataset/osmbig/h5py_data_w_missing_timesteps_zstd_3/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/324482/` - - 128 Path: `/weka/dfive-default/helios/dataset/osmbig/h5py_data_w_missing_timesteps_zstd_3_128_x_4/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/1297928` -- **Presto Neighbor Dataset**: ~877k samples with Landsat, OpenStreetMap raster, Sentinel-1, Sentinel-2 L2A, SRTM, and WorldCover modalities presto + the neighboring tiles - - 256 Path: `/weka/dfive-default/helios/dataset/presto_neighbor/h5py_data_w_missing_timesteps_zstd_3/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/876937/` - - 128 
Path: `/weka/dfive-default/helios/dataset/presto_neighbor/h5py_data_w_missing_timesteps_zstd_3_128_x_4/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/3507748` -- **WorldCover Sampling Dataset**: ~1.6M samples with Landsat, OpenStreetMap raster, Sentinel-1, Sentinel-2 L2A, SRTM, and WorldCover modalities. WorldCover class based sampling and some additional random sampling over the rest of the world. - - 256 Path: `/weka/dfive-default/helios/dataset/worldcover_sampling/h5py_data_w_missing_timesteps_zstd_3/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/1592645/` - - 128 Path: `/weka/dfive-default/helios/dataset/worldcover_sampling/h5py_data_w_missing_timesteps_zstd_3_128_x_4/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/6370580/` +[InferenceQuickstart](docs/Inference-Quickstart.md) shows how to initialize the +OlmoEarth model and apply it on a satellite image. +We also have several more in-depth tutorials for computing OlmoEarth embeddings and fine-tuning OlmoEarth on downstream tasks: -## Running Eval Suite +- [Fine-tuning OlmoEarth for Segmentation](https://github.com/allenai/olmoearth_projects/blob/main/docs/tutorials/FinetuneOlmoEarthSegmentation.md) +- [Computing Embeddings using OlmoEarth](https://github.com/allenai/rslearn/blob/master/docs/examples/OlmoEarthEmbeddings.md) +- [Fine-tuning OlmoEarth in rslearn](https://github.com/allenai/rslearn/blob/master/docs/examples/FinetuneOlmoEarth.md) -[`olmoearth_pretrain/internal/full_eval_sweep.py`](olmoearth_pretrain/internal/full_eval_sweep.py) runs comprehensive evaluation sweeps across multiple downstream tasks for any OlmoEarth Pretrain checkpoint. It automatically sweeps over learning rates, pooling types, and normalization strategies. +Additionally, [`olmoearth_projects`](https://github.com/allenai/olmoearth_projects) has several examples of active OlmoEarth deployments. -### 1. 
How to run eval for a given checkpoint +## Data Summary -Basic command to run evaluation sweep for a checkpoint: +Our pretraining dataset contains 285,288 samples from around the world of 2.56km×2.56km regions, although many samples contain only a subset of the timesteps and modalities. -``` -python3 olmoearth_pretrain/internal/full_eval_sweep.py \ - --cluster=ai2/saturn-cirrascale \ - --checkpoint_path=/path/to/your/checkpoint/step450000 \ - --module_path=scripts/your_training_script.py \ -``` +The distribution of the samples is available below: -For just default hyperparameters (faster, single run): -```bash -python3 olmoearth_pretrain/internal/full_eval_sweep.py \ - --cluster=ai2/saturn-cirrascale \ - --checkpoint_path=/path/to/your/checkpoint/step450000 \ - --module_path=scripts/your_training_script.py \ - --defaults_only -``` +Training sample distribution -### 2. Example of how to add additional overrides +The dataset can be downloaded [here](https://huggingface.co/datasets/allenai/olmoearth_pretrain_dataset). -Pass additional training arguments after the main arguments: -```bash -python3 olmoearth_pretrain/internal/full_eval_sweep.py \ - --cluster=ai2/saturn-cirrascale \ - --checkpoint_path=/path/to/checkpoint \ - --module_path=scripts/your_script.py \ - --model.decoder_config.depth=1 \ - --trainer.callbacks.downstream_evaluator.tasks_to_run=\[mados,pastis_sentinel2,breizhcrops,sen1floods11,pastis_sentinel1_sentinel2\] \ -``` +Detailed instructions on how to make your own pretraining dataset are available in [the dataset README](docs/Dataset-Creation.md). -### 3. How to run panopticon +## Training scripts -Use the `--panopticon` flag for Panopticon model evaluation: -```bash -python3 olmoearth_pretrain/internal/full_eval_sweep.py \ - --cluster=ai2/saturn-cirrascale \ - --panopticon \ - --model_name=panopticon -``` +Detailed instructions on how to pretrain your own OlmoEarth model are available in [Pretraining.md](docs/Pretraining.md). -### 4. 
How to run different dino models +## Evaluations -For DINO v3 evaluation: -```bash -python3 olmoearth_pretrain/internal/full_eval_sweep.py \ - --cluster=ai2/saturn-cirrascale \ - --dino_v3 \ - --model_name=dino_v3_large_sat \ - --model.model_name=DinoV3Models.LARGE_SATELLITE \ -``` +Detailed instructions on how to replicate our evaluations are available here: -### 5. How to run galileo +- [Evaluations on Research Benchmarks](docs/Evaluation.md) +- [Evaluations on Partner Tasks](https://github.com/allenai/rslearn_projects/blob/master/rslp/olmoearth_evals/README.md) -Use the `--galileo` flag for Galileo model evaluation: -```bash -python3 olmoearth_pretrain/internal/full_eval_sweep.py \ - --cluster=ai2/saturn-cirrascale \ - --galileo \ - --model_name=galileo_vit_base - --model.patch_size=4 -``` +## License -**Key Notes:** -- The script automatically determines appropriate normalization strategies for each model type (see [`olmoearth_pretrain/evals/datasets/normalize.py`](olmoearth_pretrain/evals/datasets/normalize.py)) - - OlmoEarth Pretrain: Use pretrained normalizer or NORM_METHOD.NORM_NO_CLIP with dataset stats - - Galileo: Use galileo pretrained normalizer or NORM_METHOD.NORM_NO_CLIP with dataset stats - - Panopticon: Uses NORM_METHOD.STANDARDIZE with the dataset statistics - - DinoV3: Uses NORM_METHOD.NORM_YES_CLIP_MIN_MAX_INT to get to 0-1 and then applies either the web or sat normalization values -- Supports both full hyperparameter sweeps and default-only runs -- Use `--dry_run` to preview commands without execution -- For local testing, use `--cluster=local` - -See `olmoearth_pretrain/internal/full_eval_sweep.py` for complete argument list and implementation details. +This code is licensed under the [OlmoEarth Artifact License](LICENSE). 
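The Model Summary table in the new README lists one Hugging Face repository per model size. As an illustrative, hedged sketch of fetching a checkpoint programmatically — the `olmoearth_repo_id` and `download_checkpoint` helpers are hypothetical, and `huggingface_hub` is an assumed extra dependency; see the Inference Quickstart for the documented way to load and run the model:

```python
# Hypothetical helpers for downloading an OlmoEarth checkpoint snapshot from
# the Hugging Face Hub. The repo ids follow the Model Summary table; assumes
# `pip install huggingface_hub`.

def olmoearth_repo_id(size: str) -> str:
    """Map a size name ("nano", "tiny", "base", "large") to its Hub repo id."""
    sizes = {"nano": "Nano", "tiny": "Tiny", "base": "Base", "large": "Large"}
    return f"allenai/OlmoEarth-v1-{sizes[size.lower()]}"

def download_checkpoint(size: str = "base") -> str:
    """Download the full repo snapshot and return the local cache directory."""
    from huggingface_hub import snapshot_download  # deferred: only needed on download

    return snapshot_download(repo_id=olmoearth_repo_id(size))
```

For example, `download_checkpoint("nano")` would cache the Nano weights locally; loading them into a model is covered in [Inference-Quickstart](docs/Inference-Quickstart.md).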
diff --git a/assets/OlmoEarth-logo.png b/assets/OlmoEarth-logo.png new file mode 100644 index 000000000..6a141a35e Binary files /dev/null and b/assets/OlmoEarth-logo.png differ diff --git a/assets/datamap.png b/assets/datamap.png new file mode 100644 index 000000000..9dff4b16c Binary files /dev/null and b/assets/datamap.png differ diff --git a/assets/model.jpg b/assets/model.jpg new file mode 100644 index 000000000..36f13e76f Binary files /dev/null and b/assets/model.jpg differ diff --git a/docs/inference_quickstart.md b/docs/Inference-Quickstart.md similarity index 100% rename from docs/inference_quickstart.md rename to docs/Inference-Quickstart.md