31 commits
ac17273
bump olmo core version and add g cloud compute to reqs
pjreddie Oct 20, 2025
3e6cd65
:q
Hgherzog Oct 20, 2025
096a75f
Able to hit the dataset not here error
Hgherzog Oct 21, 2025
73595ac
training works
Hgherzog Oct 21, 2025
f5687db
add in the other files
Hgherzog Oct 21, 2025
d92bda1
path to have pretraining work outside beaker but still requires a bea…
Hgherzog Oct 21, 2025
c1d5f88
move paths out to a seperate file that loads as env vars
Hgherzog Oct 21, 2025
0ffbd91
more clean ups
Hgherzog Oct 21, 2025
b37822a
split out sickle processor
Hgherzog Oct 21, 2025
2cb7702
cull imports
Hgherzog Oct 21, 2025
a5d5975
training runs decoupled from evaluation
Hgherzog Oct 21, 2025
8c0fbc0
official scripts ready
Hgherzog Oct 21, 2025
2985c8c
add docs example
Hgherzog Oct 22, 2025
526157e
updated docs still need some more work
Hgherzog Oct 22, 2025
afcc2b0
updated pretraining.md
Hgherzog Oct 23, 2025
eb70de6
pre-training docs
Hgherzog Oct 23, 2025
cacbc8d
works on a beaker session
Oct 23, 2025
ba598b5
update official scripts
Oct 23, 2025
f7b77be
update tutorial order
Hgherzog Oct 23, 2025
deded88
add priority note
Hgherzog Oct 23, 2025
5965db0
spelling
Hgherzog Oct 23, 2025
86605fa
actually enable torchrun
Hgherzog Oct 23, 2025
e23d572
simplify as we are required to have it for all
Hgherzog Oct 23, 2025
cd150e2
formatting changes
Hgherzog Oct 23, 2025
1d9a8a6
linting fixes
Hgherzog Oct 23, 2025
41fa6fc
fix mor elints
Hgherzog Oct 23, 2025
2efc7d9
Merge remote-tracking branch 'origin/henryh/pre-train-tutorial' into …
yawenzzzz Oct 24, 2025
a67a34a
eval doc
yawenzzzz Oct 24, 2025
753ee2e
update experiment paths
yawenzzzz Oct 24, 2025
7118f7b
update the models that are supported
yawenzzzz Oct 24, 2025
c6a9206
eval changes
yawenzzzz Oct 25, 2025
59 changes: 0 additions & 59 deletions README.md
@@ -13,65 +13,6 @@ launching training runs on beaker
4. Run `pip install pre-commit`
5. Run `pre-commit install`

## Training Setup
1. Create a GitHub token that can clone this repo on Beaker. You can generate a token [here](https://github.com/settings/tokens). The following permissions are sufficient:
- repo
- read:packages
- read:org
- write:org
- read:project

Authorize this token for the allenai org by clicking the Configure SSO dropdown [here](https://github.com/settings/tokens) for the token you created.
2. Set your default Beaker workspace and budget:
`beaker config set default_workspace ai2/earth-systems`
`beaker workspace set-budget ai2/earth-systems ai2/d5`
3. Set the following Beaker Secrets:
- `beaker secret write <your_beaker_username>_WANDB_API_KEY <your_key>`
- `beaker secret write <your_beaker_username>_BEAKER_TOKEN <your_token>`
- `beaker secret write <your_beaker_username>_GITHUB_TOKEN <your_key>`

4. Create a script based on scripts/latent_mim.py and configure your experiment (you can override specific settings).


## Launch

### Pre-emptible Jobs

To launch pre-emptible jobs, we use the main entrypoint in [olmoearth_pretrain/internal/experiment.py](olmoearth_pretrain/internal/experiment.py) and write Python configuration files that use it, like [scripts/latent_mim.py](scripts/latent_mim.py). Depending on your experiment, it might make sense to write a new script with different builders or to override settings on an existing one as needed.
Before launching your script, **MAKE SURE YOUR CODE IS COMMITTED AND PUSHED**, as we clone the code on top of a Docker image when we launch the job.

We can launch a script as follows:

`python3 scripts/base_debug_scripts/latent_mim.py launch test_run ai2/saturn-cirrascale`

This will launch a Beaker job and stream the logs to your console until you cancel.
Add additional overrides as needed.

### Sessions

[VSCODE/Cursor workflow setup](https://docs.google.com/document/d/1ydiCqIn45xlbrIcfPi8bILn_y00adTAHhIY1MPh9szE/edit?tab=t.0#heading=h.wua78h35aq1n) \
Be sure your session creation has included the following args
- ` --secret-env WANDB_API_KEY=<your_beaker_username>_WANDB_API_KEY
--secret-env BEAKER_TOKEN=<your_beaker_username>_BEAKER_TOKEN `

Note: In order to use flash attention in a session, use `"beaker://petew/olmo-core-tch270cu128"` as your base beaker image.
Then, set up a conda environment so you can use the flash attention code saved in the base image.
1. `conda init`
2. `exec bash`
3. `conda shell.bash activate base`
4. `pip install -e '.[all]'`

When launching runs in Sessions for debugging, use the following command:

`torchrun scripts/base_debug_scripts/latent_mim.py train test_run local`

Add additional overrides as needed.

## Beaker Information

budget: `ai2/es-platform` \
workspace: `ai2/earth-systems` \
weka: `weka://dfive-default`

## OlmoEarth Pretrain Dataset

37 changes: 0 additions & 37 deletions beaker_config_example.yaml

This file was deleted.

185 changes: 185 additions & 0 deletions docs/Evaluation.md
@@ -0,0 +1,185 @@
# OlmoEarth Evaluation Guide

This guide explains how we launch evaluation sweeps for OlmoEarth checkpoints and baseline models, including KNN, linear probing, and finetuning jobs.

---

## Choose Your Evaluation Path

> **🏢 AI2 Researchers (Internal):**
> You have access to Beaker/Weka clusters and shared checkpoints. Skim [Setup-Internal.md](Setup-Internal.md) for environment details, then follow the launch instructions below.

> **🌍 External Users:**
> You can run these workflows on local/cloud GPUs. You will need the datasets referenced in [Dataset Setup](Pretraining.md#dataset-setup).

---

## Table of Contents

1. [Evaluation Overview](#evaluation-overview)
2. [Quick Start](#quick-start)
3. [KNN / Linear Probing](#knn--linear-probing)
4. [Finetune](#finetune-sweep)
5. [Monitoring & Outputs](#monitoring--outputs)
6. [Helpful Files](#helpful-files)

---

## Evaluation Overview

We run evaluations through the same `olmoearth_pretrain/internal/experiment.py` entrypoint used for pretraining. The helper scripts below build the underlying launch commands and fan out the learning rate, normalization, and pooling sweeps we used in the paper.

- `olmoearth_pretrain/internal/full_eval_sweep.py` launches KNN (for classification) and linear probing (for segmentation) against an OlmoEarth checkpoint or a supported baseline model.
- `olmoearth_pretrain/internal/full_eval_sweep_finetune.py` launches finetuning evaluations, including optional sweeps over pretrained and dataset normalizers.

Both scripts rely on:
- [`olmoearth_pretrain/internal/all_evals.py`](../olmoearth_pretrain/internal/all_evals.py) for the task registry.
- [`olmoearth_pretrain/evals`](../olmoearth_pretrain/evals) for dataset/model wrappers.

### Prerequisites

- Python environment configured as described in [Pretraining.md](Pretraining.md#environment-setup).
- Access to evaluation datasets (see [`evals/datasets/paths.py`](../olmoearth_pretrain/evals/datasets/paths.py) for expected locations).
- W&B API key (`WANDB_API_KEY`) if you want metrics to stream automatically.
- For AI2 infra: valid Beaker cluster name (`ai2/saturn`, `ai2/titan`, etc.).

### Supported Models

- **OlmoEarth checkpoints:** Any checkpoint compatible with the evaluation `experiment.py` entrypoint.
- **Baseline presets:** `dino_v3`, `panopticon`, `galileo`, `satlas`, `croma`, `copernicusfm`, `presto`, `anysat`, `tessera`, `prithvi_v2`, `terramind`, `clay`. Multi-size variants (e.g. `croma_large`, `galileo_large`, `terramind_large`) are handled automatically by the sweep scripts when requested.

---

## Quick Start

### 1. Activate your environment

```bash
source .venv-olmoearth_pretrain/bin/activate
```

### 2. Run a dry run to inspect the planned commands

```bash
python -m olmoearth_pretrain.internal.full_eval_sweep \
--cluster=local \
--checkpoint_path=/path/to/checkpoint/step123000 \
--defaults_only \
--dry_run
```

This prints the exact `torchrun`/`python3` command that will be executed for each task and hyperparameter combination.

### 3. Launch for real

Remove `--dry_run` once the command looks correct. On local GPUs the helper scripts will call `torchrun`; on Beaker they call `python3` with the launch module defined by `EVAL_LAUNCH_PATH`.
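A minimal sketch of that dispatch, purely for illustration (the function name and the fallback path are made up, not the repo's actual code):

```python
import os

def build_launch_prefix(cluster):
    # Illustrative sketch: local runs go through torchrun, everything else
    # through python3 plus the module named by EVAL_LAUNCH_PATH.
    # The fallback path below is a made-up placeholder, not a real default.
    if cluster == "local":
        return ["torchrun"]
    launch_module = os.environ.get("EVAL_LAUNCH_PATH", "path/to/launch_module.py")
    return ["python3", launch_module]

local_cmd = build_launch_prefix("local")
beaker_cmd = build_launch_prefix("ai2/saturn-cirrascale")
```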

---

## KNN / Linear Probing

Use this script for KNN and linear probing evaluations. Invoke it either through `python -m olmoearth_pretrain.internal.full_eval_sweep` or by running the file directly.

### Required flags

- `--cluster`: Cluster identifier (`local` for on-box runs).
- Exactly one of:
- `--checkpoint_path=/path/to/checkpoint/stepXXXX`: Evaluate an OlmoEarth checkpoint.
- `--model=<baseline_name>` or `--model=all`: Evaluate published baseline models defined in [`evals/models`](../olmoearth_pretrain/evals/models).

### Common optional flags

- `--module_path`: Override the launch module (defaults to the model-specific launcher).
- `--project_name`: W&B project (defaults to `EVAL_WANDB_PROJECT`).
- `--defaults_only`: Run a single command using the default lr / normalization / pooling.
- `--lr_only`: Sweep learning rates but keep normalization + pooling at defaults.
- `--all_sizes` or `--size=<variant>`: Evaluate every published size for multi-size baselines.
- `--model-skip-names=a,b`: Skip a subset when using `--model=all`.
- `--select_best_val`: Uses validation MIoU to pick the best epoch before reporting test metrics.
- `--dry_run`: Print commands without launching.
- Extra CLI arguments (e.g. `--trainer.max_duration.unit=epochs`) are forwarded to the underlying train module.
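The pass-through of extra CLI arguments can be sketched with `argparse.parse_known_args` (a hedged sketch of the behavior, not the sweep script's real flag set):

```python
import argparse

def parse_sweep_args(argv):
    # Hedged sketch: flags the sweep script knows about are consumed, and
    # anything else (dotted trainer overrides, etc.) is returned untouched
    # so it can be appended to every launch command.
    parser = argparse.ArgumentParser()
    parser.add_argument("--cluster", required=True)
    parser.add_argument("--dry_run", action="store_true")
    known, extra = parser.parse_known_args(argv)
    return known, extra

known, extra = parse_sweep_args(
    ["--cluster=local", "--dry_run", "--trainer.max_duration.unit=epochs"]
)
```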

### Example: Launch OlmoEarth evaluation against a checkpoint (local debug)

```bash
python -m olmoearth_pretrain.internal.full_eval_sweep \
--cluster=local \
--checkpoint_path=/data/checkpoints/phase2_base/step667200 \
--module_path=scripts/2025_10_02_phase2/base.py \
--defaults_only
```

### Example: Launch baseline sweep on Beaker

```bash
python -m olmoearth_pretrain.internal.full_eval_sweep \
--cluster=ai2/saturn-cirrascale \
--model=dino_v3 \
--project_name=2025_10_eval_comparison \
--lr_only
```

When `--model=all`, the script automatically switches to the correct launcher for each model and constructs run names like `<checkpoint>_lr1e-3_norm_dataset_pool_mean`.
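The run-name pattern quoted above can be reconstructed roughly like this (a hypothetical helper; the real scripts may assemble the name differently):

```python
def make_run_name(base, lr, norm, pool):
    # Hypothetical reconstruction of the <checkpoint>_lr..._norm_..._pool_...
    # naming pattern described in the text.
    return f"{base}_lr{lr}_norm_{norm}_pool_{pool}"

name = make_run_name("step667200", "1e-3", "dataset", "mean")
```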

---

## Finetune Sweep

Use `olmoearth_pretrain/internal/full_eval_sweep_finetune.py` for downstream fine-tuning tasks. It shares many flags with the KNN / linear probing sweep but adds fine-tuning-specific options.

### Required flags

- `--cluster`: Cluster identifier.
- One of:
- `--checkpoint_path=/path/to/olmoearth/stepXXXX`: Fine-tune an OlmoEarth checkpoint.
- `--model=<preset_key>`: Use a baseline preset (choices listed in the script’s help).

### Fine-tune specific flags

- `--defaults_only`: Run only the first learning rate in `FT_LRS`.
- `--sweep_normalizer`: For models with pretrained normalizers, run both dataset stats and pretrained normalizer variants.
- `--module_path`: Override the launch script (defaults to the preset’s launcher).
- Extra CLI arguments append to every command (e.g. `--trainer.max_duration.value=50000`).
- `--dry_run`: Preview commands.

The script sets `FINETUNE=1` in the environment before launching so downstream code enables fine-tuning heads automatically.
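Setting an environment flag for a child process can be done like this (a generic sketch of the pattern, not the sweep script's actual launch code):

```python
import os
import subprocess
import sys

def launch_with_finetune_env(cmd):
    # Sketch: thread FINETUNE=1 into the child process environment without
    # mutating the parent's environment.
    env = dict(os.environ)
    env["FINETUNE"] = "1"
    return subprocess.run(cmd, env=env, capture_output=True, text=True)

# Demo: the child process reads the flag back out.
result = launch_with_finetune_env(
    [sys.executable, "-c", "import os; print(os.environ['FINETUNE'])"]
)
```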

### Example: OlmoEarth checkpoint fine-tune sweep (Beaker)

```bash
python -m olmoearth_pretrain.internal.full_eval_sweep_finetune \
--cluster=ai2/titan \
--checkpoint_path=/weka/.../phase2.0_base_lr0.0001_wd0.02/step667200 \
--project_name=2025_10_08_phase2_finetune \
--defaults_only
```

### Example: Baseline fine-tune with normalizer sweep

```bash
python -m olmoearth_pretrain.internal.full_eval_sweep_finetune \
--cluster=ai2/saturn-cirrascale \
--model=galileo \
--sweep_normalizer
```

---

## Monitoring & Outputs

- **W&B logging:** Both scripts default to `EVAL_WANDB_PROJECT`. Override with `--project_name` or disable W&B via `--trainer.callbacks.wandb.enabled=False`.
- **Checkpoints:** Evaluation launches set `--trainer.no_checkpoints=True` for baseline models so runs do not write new checkpoints. OlmoEarth checkpoints keep checkpointing enabled by default.
- **Run names:** Generated from the checkpoint directory (`<run>/<step>`) or baseline name plus the swept hyperparameters to simplify aggregation.
- **Inspecting results:** Use [`scripts/get_max_eval_metrics_from_wandb.py`](../scripts/get_max_eval_metrics_from_wandb.py) to pull the best MIoU/accuracy per task across runs.
- **Dry run safety:** Always start with `--dry_run` when editing sweeps or passing overrides; command strings can be long, and the dry run verifies the generated arguments.
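The "best metric per task" aggregation that the W&B helper script performs boils down to something like the following toy stand-in (illustrative data and function name; the real script pulls rows from the W&B API):

```python
def best_metric_per_task(rows):
    # Toy stand-in for the aggregation in
    # scripts/get_max_eval_metrics_from_wandb.py: keep the highest metric
    # per task along with the run that produced it.
    best = {}
    for task, run, metric in rows:
        if task not in best or metric > best[task][1]:
            best[task] = (run, metric)
    return best

rows = [
    ("cropseg", "lr1e-3_pool_mean", 0.61),
    ("cropseg", "lr1e-4_pool_mean", 0.64),
    ("landcover", "lr1e-3_pool_cls", 0.72),
]
best = best_metric_per_task(rows)
```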

---

## Helpful Files

- [`internal/all_evals.py`](../olmoearth_pretrain/internal/all_evals.py): Lists frozen and fine-tune tasks, feature extractor settings, and metric names.
- [`evals/models`](../olmoearth_pretrain/evals/models): Launcher modules and wrappers for baseline models.
- [`evals/datasets/configs.py`](../olmoearth_pretrain/evals/datasets/configs.py): Dataset configs used when constructing evaluation commands.
- [`docs/Pretraining.md`](Pretraining.md): Shared environment setup; refer back if you need to rebuild Docker images or install dependencies.

Happy evaluating! Let the team know in `#olmoearth` if new baselines or tasks need presets added to the sweep scripts.