diff --git a/LICENSE b/LICENSE new file mode 100644 index 000000000..f764333f5 --- /dev/null +++ b/LICENSE @@ -0,0 +1,127 @@ +OlmoEarth Artifact License + +By exercising the rights granted to you under this OlmoEarth Artifact License +("Agreement"), you accept and agree to its terms and conditions and enter into this +Agreement with The Allen Institute for Artificial Intelligence ("Ai2"). All references +to "you" herein means both an individual and legal entity that an individual is acting +on behalf of. + +Subject to your compliance with this Agreement, Ai2 grants you permission, free of +charge, to use the machine learning artifacts, materials, and documentation provided by +Ai2 under this Agreement as follows (collectively, "Artifacts"): + +- Model weights, including architecture and parameters ("Model"); +- Associated dataset or collection of data in connection with the Model ("Dataset"); and/or +- Associated software to process and run the Dataset and Model, including code in + source or binary form for training, inference, and evaluation ("Code"). + +1. Use Rights + +Subject to the terms in Section 2 and 3 below, you may: + +(a) Use, reproduce, modify, display and distribute the Artifacts, in whole or in part; +(b) Create any other machine learning models, datasets, and derivative works that are + derived from or based on the Artifacts, including by (i) transfer of patterns of + the weights, parameters, operations, or outputs of the Model, (ii) generating + outputs of the Model to produce synthetic data, and (iii) using Code to prepare any + work of authorship (collectively, "Derivatives"); and +(c) Publish and share Derivatives. + +2. 
Use Restrictions + +You will not (and will not encourage, permit, or facilitate any others to) use any +portion of the Artifacts or Derivatives for the following purposes: + +(a) Any military and defense-related applications and use cases, including without + limitation, for weapons development, military operations, intelligence gathering, + or human surveillance and policing activities. +(b) Any extractive activities, operations and use cases involving the removal of raw + materials from the earth, including without limitation, to plan or facilitate the + extraction of oil, natural gas and minerals through activities such as drilling, + mining, and deforestation. + +3. Distribution + +In any distribution of the Artifacts or Derivatives: + +(a) You will cite Ai2 as the source of the Artifacts in any distribution of Artifacts + or Derivatives. +(b) If you distribute any portion of the Artifacts, you will either link to or provide + a copy of this Agreement to all third party recipients. +(c) If you distribute any Derivatives, you may add your own intellectual property + notices and apply other licenses and terms of use, provided that you include and + require the use restrictions in Section 2 in all downstream distribution unless Ai2 + provides written approval otherwise. + +4. Termination + +This Agreement will automatically terminate with immediate effect and without notice to +you in the following circumstances: + +(a) Your breach of the use restrictions in Section 2 and any other terms and conditions + herein; or +(b) If you file, maintain, or voluntarily participate in a lawsuit against any person + or entity asserting that the Artifacts or any portion thereof directly or + indirectly infringe any patent, except where a lawsuit is filed in response to a + corresponding lawsuit first brought against you. 
+ +For the avoidance of doubt, Ai2 may also offer the Artifacts under separate terms and +conditions or stop distributing the Artifacts at any time; however, doing so will not +terminate this Agreement. You may continue to use the Artifacts under this Agreement +unless it is terminated in accordance with the circumstances expressly stated herein. + +5. Rights Not Covered + +This Agreement does not cover any patents or trademarks associated with the Artifacts, +including with respect to any individual items of information and materials that are +included or incorporated within a Dataset ("Contents"). Such Contents may be factual +data or independent works such as text, images, audio, and audio visual material. +Contents may be subject to other rights, including copyright, patent, data protection, +privacy, or personality rights, and this Agreement does not cover such rights. The use +rights in Section 1 expressly exclude any and all other rights that may apply to the +Contents of a Dataset. + +6. Disclaimer and Limitation of Liability + +(a) THE ARTIFACTS ARE PROVIDED "AS IS", "AS AVAILABLE", AND "WITH ALL FAULTS", WITHOUT + ANY EXPRESS OR IMPLIED WARRANTY OF ANY KIND, INCLUDING WITHOUT LIMITATION, IMPLIED + WARRANTIES OF MERCHANTABILITY, TITLE, NON-INFRINGEMENT, FITNESS FOR A PARTICULAR + PURPOSE, ABSENCE OF LATENT OR OTHER DEFECTS, ACCURACY, OR THE PRESENCE OR ABSENCE + OF ERRORS, WHETHER OR NOT KNOWN OR DISCOVERABLE. AI2 MAKES NO REPRESENTATIONS OR + WARRANTIES AS TO THE RELIABILITY, COMPLETENESS, QUALITY, PERFORMANCE, + FUNCTIONALITY, OR UTILITY OF ANY ARTIFACTS. ANY USE OF THE ARTIFACTS IS AT YOUR + SOLE RISK AND DISCRETION. 
YOU ARE SOLELY RESPONSIBLE FOR (1) CLEARING ANY THIRD + PARTY RIGHTS THAT MAY APPLY TO OR BE EMBODIED IN ANY ARTIFACTS, INCLUDING ANY + CONTENTS IN A DATASET; (2) OBTAINING ANY NECESSARY RIGHTS, LICENSES, CONSENTS, OR + PERMISSIONS REQUIRED FOR YOUR USE OF THE ARTIFACTS; AND (3) PERFORMING ANY DUE + DILIGENCE ON THE ARTIFACTS TO VERIFY SUITABILITY FOR YOUR INTENDED USE. + +(b) TO THE MAXIMUM EXTENT PERMITTED UNDER APPLICABLE LAWS, IN NO EVENT WILL AI2 BE + LIABLE FOR ANY CLAIM, DAMAGES, LOSSES, OR OTHER LIABILITY OF ANY KIND WHATSOEVER, + WHETHER IN AN ACTION OF CONTRACT, TORT, OR OTHERWISE, ARISING FROM OR IN CONNECTION + WITH THE ARTIFACTS OR YOUR USE THEREOF. + +(c) The disclaimers and limitations of liability set forth above will be interpreted in + a manner that, to the greatest extent possible, constitute an absolute disclaimer + and waiver of all liability. + +7. Other Agreements + +If you have entered into a separate written agreement with Ai2 regarding your use of +the specific Artifacts that are subject to this Agreement ("Other Agreement"), the +terms of such Other Agreement will supplement the terms herein. To the extent of any +conflict between the terms of the Other Agreement and this Agreement, the Other +Agreement will take precedence. + +8. Miscellaneous + +If any term or provision of this Agreement is deemed invalid or unenforceable, it will +automatically be reformed to the minimum extent necessary to make it enforceable. To +the extent it cannot be reformed, it will be severed from this Agreement, and the +remaining terms and conditions will remain in full force and effect. Any delay or +failure by Ai2 to take any action or enforce any breach of this Agreement will not be +deemed as a waiver or consent by Ai2. No term or provision of this Agreement will be +waived by Ai2 unless expressly agreed in writing. 
Nothing in this Agreement constitutes +as a limitation of any privileges, immunities, and rights that apply to you or Ai2 +under applicable laws, including from the legal processes of any jurisdiction or +authority. diff --git a/README.md b/README.md index beaa7d9c8..ea5bd18fb 100644 --- a/README.md +++ b/README.md @@ -1,122 +1,89 @@ -# OlmoEarth Pretrain +
+
-We have 2 versions of each dataset 1 with 256 x 256 tiles and 1 with 4x as many 128 by 128 tiles. The 128 by 128 tiles may be faster for data loading due to GB/s bottlenecks on weka.
+The OlmoEarth models are trained on three satellite modalities (Sentinel-2, Sentinel-1, and Landsat) and six derived maps (OpenStreetMap, WorldCover, USDA Cropland Data Layer, SRTM DEM, WRI Canopy Height Map, and WorldCereal).
+| Model Size | Weights | Encoder Params | Decoder Params |
+| --- | --- | --- | --- |
+| Nano | [link](https://huggingface.co/allenai/OlmoEarth-v1-Nano) | 1.4M | 800K |
+| Tiny | [link](https://huggingface.co/allenai/OlmoEarth-v1-Tiny) | 6.2M | 1.9M |
+| Base | [link](https://huggingface.co/allenai/OlmoEarth-v1-Base) | 89M | 30M |
+| Large | [link](https://huggingface.co/allenai/OlmoEarth-v1-Large) | 308M | 53M |
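As a rough aid when comparing model sizes, the encoder and decoder counts in the table above can be summed (a minimal sketch; the numbers are taken from the table, not measured from the checkpoints):

```python
# Approximate (encoder, decoder) parameter counts in millions, from the table above.
MODEL_SIZES = {
    "Nano": (1.4, 0.8),
    "Tiny": (6.2, 1.9),
    "Base": (89.0, 30.0),
    "Large": (308.0, 53.0),
}

def total_params_m(name: str) -> float:
    """Approximate total parameter count for a model size, in millions."""
    encoder_m, decoder_m = MODEL_SIZES[name]
    return encoder_m + decoder_m

for name in MODEL_SIZES:
    print(f"{name}: ~{total_params_m(name):.1f}M total parameters")
```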
-OUT OF DATE!
-- **Presto Dataset**: ~120k samples with Landsat, OpenStreetMap raster, Sentinel-1, Sentinel-2 L2A, SRTM, and WorldCover modalities sampled via locations used in Galileo paper
- - 256 Path: `/weka/dfive-default/helios/dataset/presto/rerun_1_h5py_data_w_missing_timesteps_zstd_3/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/117473/`
- - 128 Path: `/weka/dfive-default/helios/dataset/presto/h5py_data_w_missing_timesteps_128_x_4_zstd_3/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/469892`
+## Using OlmoEarth
-- **OSM Sampling Dataset**: ~285k samples with Landsat, OpenStreetMap raster, Sentinel-1, Sentinel-2 L2A, SRTM, and WorldCover modalities sampled across OpenStreetmap classes
- - 256 Path: `/weka/dfive-default/helios/dataset/osm_sampling/h5py_data_w_missing_timesteps_zstd_3/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/285288/`
- - 128 Path: `/weka/dfive-default/helios/dataset/osm_sampling/h5py_data_w_missing_timesteps_128_x_4_zstd_3/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/1141152`
-- **OSM Big Dataset**: ~324k samples with Landsat, OpenStreetMap raster, Sentinel-1, Sentinel-2 L2A, SRTM, and WorldCover modalities sampled across a wider set of opens treetmap classes
- - 256 Path: `/weka/dfive-default/helios/dataset/osmbig/h5py_data_w_missing_timesteps_zstd_3/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/324482/`
- - 128 Path: `/weka/dfive-default/helios/dataset/osmbig/h5py_data_w_missing_timesteps_zstd_3_128_x_4/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/1297928`
-- **Presto Neighbor Dataset**: ~877k samples with Landsat, OpenStreetMap raster, Sentinel-1, Sentinel-2 L2A, SRTM, and WorldCover modalities presto + the neighboring tiles
- - 256 Path: `/weka/dfive-default/helios/dataset/presto_neighbor/h5py_data_w_missing_timesteps_zstd_3/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/876937/`
- - 128 Path: `/weka/dfive-default/helios/dataset/presto_neighbor/h5py_data_w_missing_timesteps_zstd_3_128_x_4/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/3507748`
-- **WorldCover Sampling Dataset**: ~1.6M samples with Landsat, OpenStreetMap raster, Sentinel-1, Sentinel-2 L2A, SRTM, and WorldCover modalities. WorldCover class based sampling and some additional random sampling over the rest of the world.
- - 256 Path: `/weka/dfive-default/helios/dataset/worldcover_sampling/h5py_data_w_missing_timesteps_zstd_3/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/1592645/`
- - 128 Path: `/weka/dfive-default/helios/dataset/worldcover_sampling/h5py_data_w_missing_timesteps_zstd_3_128_x_4/landsat_openstreetmap_raster_sentinel1_sentinel2_l2a_srtm_worldcover/6370580/`
+[Inference Quickstart](docs/Inference-Quickstart.md) shows how to initialize the
+OlmoEarth model and apply it to a satellite image.
+We also have several more in-depth tutorials for computing OlmoEarth embeddings and fine-tuning OlmoEarth on downstream tasks:
-## Running Eval Suite
+- [Fine-tuning OlmoEarth for Segmentation](https://github.com/allenai/olmoearth_projects/blob/main/docs/tutorials/FinetuneOlmoEarthSegmentation.md)
+- [Computing Embeddings using OlmoEarth](https://github.com/allenai/rslearn/blob/master/docs/examples/OlmoEarthEmbeddings.md)
+- [Fine-tuning OlmoEarth in rslearn](https://github.com/allenai/rslearn/blob/master/docs/examples/FinetuneOlmoEarth.md)
-[`olmoearth_pretrain/internal/full_eval_sweep.py`](olmoearth_pretrain/internal/full_eval_sweep.py) runs comprehensive evaluation sweeps across multiple downstream tasks for any OlmoEarth Pretrain checkpoint. It automatically sweeps over learning rates, pooling types, and normalization strategies.
+Additionally, [`olmoearth_projects`](https://github.com/allenai/olmoearth_projects) has several examples of active OlmoEarth deployments.
-### 1. How to run eval for a given checkpoint
+## Data Summary
-Basic command to run evaluation sweep for a checkpoint:
+Our pretraining dataset contains 285,288 samples of 2.56 km × 2.56 km regions from around the world, although many samples contain only a subset of the timesteps and modalities.
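For intuition on sample dimensions, assuming the 10 m/pixel resolution of Sentinel-2's highest-resolution bands (other modalities are stored at their own native resolutions), a 2.56 km × 2.56 km region corresponds to a 256 × 256 pixel tile:

```python
# Side length of each pretraining sample, in meters.
REGION_SIZE_M = 2560
# Sentinel-2's highest-resolution bands are 10 m/pixel.
PIXEL_SIZE_M = 10

pixels_per_side = REGION_SIZE_M // PIXEL_SIZE_M
print(f"Each sample covers {pixels_per_side} x {pixels_per_side} pixels")  # 256 x 256

# Upper bound on total coverage if all samples were disjoint (they may overlap).
n_samples = 285_288
total_area_km2 = n_samples * (REGION_SIZE_M / 1000) ** 2
print(f"Up to ~{total_area_km2:,.0f} km^2 of coverage")
```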
-```
-python3 olmoearth_pretrain/internal/full_eval_sweep.py \
- --cluster=ai2/saturn-cirrascale \
- --checkpoint_path=/path/to/your/checkpoint/step450000 \
- --module_path=scripts/your_training_script.py \
-```
+The distribution of the samples is available below:
-For just default hyperparameters (faster, single run):
-```bash
-python3 olmoearth_pretrain/internal/full_eval_sweep.py \
- --cluster=ai2/saturn-cirrascale \
- --checkpoint_path=/path/to/your/checkpoint/step450000 \
- --module_path=scripts/your_training_script.py \
- --defaults_only
-```
+
-### 2. Example of how to add additional overrides
+The dataset can be downloaded [here](https://huggingface.co/datasets/allenai/olmoearth_pretrain_dataset).
-Pass additional training arguments after the main arguments:
-```bash
-python3 olmoearth_pretrain/internal/full_eval_sweep.py \
- --cluster=ai2/saturn-cirrascale \
- --checkpoint_path=/path/to/checkpoint \
- --module_path=scripts/your_script.py \
- --model.decoder_config.depth=1 \
- --trainer.callbacks.downstream_evaluator.tasks_to_run=\[mados,pastis_sentinel2,breizhcrops,sen1floods11,pastis_sentinel1_sentinel2\] \
-```
+Detailed instructions on how to make your own pretraining dataset are available in [the dataset README](docs/Dataset-Creation.md).
-### 3. How to run panopticon
+## Training scripts
-Use the `--panopticon` flag for Panopticon model evaluation:
-```bash
-python3 olmoearth_pretrain/internal/full_eval_sweep.py \
- --cluster=ai2/saturn-cirrascale \
- --panopticon \
- --model_name=panopticon
-```
+Detailed instructions on how to pretrain your own OlmoEarth model are available in [Pretraining.md](docs/Pretraining.md).
-### 4. How to run different dino models
+## Evaluations
-For DINO v3 evaluation:
-```bash
-python3 olmoearth_pretrain/internal/full_eval_sweep.py \
- --cluster=ai2/saturn-cirrascale \
- --dino_v3 \
- --model_name=dino_v3_large_sat \
- --model.model_name=DinoV3Models.LARGE_SATELLITE \
-```
+Detailed instructions on how to replicate our evaluations are available here:
-### 5. How to run galileo
+- [Evaluations on Research Benchmarks](docs/Evaluation.md)
+- [Evaluations on Partner Tasks](https://github.com/allenai/rslearn_projects/blob/master/rslp/olmoearth_evals/README.md)
-Use the `--galileo` flag for Galileo model evaluation:
-```bash
-python3 olmoearth_pretrain/internal/full_eval_sweep.py \
- --cluster=ai2/saturn-cirrascale \
- --galileo \
- --model_name=galileo_vit_base
- --model.patch_size=4
-```
+## License
-**Key Notes:**
-- The script automatically determines appropriate normalization strategies for each model type (see [`olmoearth_pretrain/evals/datasets/normalize.py`](olmoearth_pretrain/evals/datasets/normalize.py))
- - OlmoEarth Pretrain: Use pretrained normalizer or NORM_METHOD.NORM_NO_CLIP with dataset stats
- - Galileo: Use galileo pretrained normalizer or NORM_METHOD.NORM_NO_CLIP with dataset stats
- - Panopticon: Uses NORM_METHOD.STANDARDIZE with the dataset statistics
- - DinoV3: Uses NORM_METHOD.NORM_YES_CLIP_MIN_MAX_INT to get to 0-1 and then applies either the web or sat normalization values
-- Supports both full hyperparameter sweeps and default-only runs
-- Use `--dry_run` to preview commands without execution
-- For local testing, use `--cluster=local`
-
-See `olmoearth_pretrain/internal/full_eval_sweep.py` for complete argument list and implementation details.
+This code is licensed under the [OlmoEarth Artifact License](LICENSE).
diff --git a/assets/OlmoEarth-logo.png b/assets/OlmoEarth-logo.png
new file mode 100644
index 000000000..6a141a35e
Binary files /dev/null and b/assets/OlmoEarth-logo.png differ
diff --git a/assets/datamap.png b/assets/datamap.png
new file mode 100644
index 000000000..9dff4b16c
Binary files /dev/null and b/assets/datamap.png differ
diff --git a/assets/model.jpg b/assets/model.jpg
new file mode 100644
index 000000000..36f13e76f
Binary files /dev/null and b/assets/model.jpg differ
diff --git a/docs/inference_quickstart.md b/docs/Inference-Quickstart.md
similarity index 100%
rename from docs/inference_quickstart.md
rename to docs/Inference-Quickstart.md