
Henryh/pre train tutorial #408

Merged
Hgherzog merged 45 commits into main from henryh/pre-train-tutorial on Oct 28, 2025

Conversation

@Hgherzog (Collaborator) commented Oct 23, 2025

Depends on #393
Major Documentation Overhaul
New comprehensive documentation structure:

  • docs/Pretraining.md - Complete pretraining guide covering:
    • Environment setup for both external users and AI2 researchers
    • Script launching with torchrun and command structure
    • Dataset setup (H5 format requirements and structure)
    • Official training scripts table (nano/tiny/base/large)
    • Configuration overrides and experiment customization
    • Hardware adaptation notes
  • docs/Setup-Internal.md - AI2-specific guide:
    • Beaker setup (GitHub tokens, workspace/budget config, secrets)
    • Launch methods (pre-emptible jobs vs interactive sessions)
    • Internal dataset locations on Weka
    • Beaker gotchas and best practices

README.md cleanup:

  • Removed detailed "Training Setup", "Launch", and "Beaker Information" sections (now in dedicated docs)
  • Deleted beaker_config_example.yaml

Enabling Pre-training to Run Outside of Beaker

  • Dataset path centralization:
    • New olmoearth_pretrain/evals/datasets/paths.py centralizes all eval dataset paths
    • Supports environment variable overrides (e.g., GEOBENCH_DIR, PASTIS_DIR)
    • Removed hardcoded *_DIR constants from individual dataset modules (breizhcrops, cropharvest, floods, mados, pastis, geobench)
  • Fall back to non-Beaker defaults when the Beaker-based environment variables are absent, so pretraining can run off Beaker
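The centralization pattern described above can be sketched as follows. The actual contents of `olmoearth_pretrain/evals/datasets/paths.py` are not shown in this PR, so the default Weka paths and the `BEAKER_JOB_ID` detection check below are illustrative assumptions:

```python
import os

# Illustrative defaults; the real paths.py ships its own
# (these Weka paths are assumptions for the sketch).
_DEFAULTS = {
    "GEOBENCH_DIR": "/weka/datasets/geobench",
    "PASTIS_DIR": "/weka/datasets/pastis",
}


def running_on_beaker() -> bool:
    # Assumed detection mechanism: Beaker jobs expose BEAKER_*
    # environment variables. Not necessarily the PR's exact check.
    return "BEAKER_JOB_ID" in os.environ


def dataset_dir(name: str) -> str:
    """Resolve an eval dataset directory, preferring an env var override."""
    override = os.environ.get(name)
    if override:
        return override
    if running_on_beaker():
        return _DEFAULTS[name]
    raise RuntimeError(
        f"{name} is not set; export it explicitly when running outside Beaker."
    )
```

With this in place, an external user would run something like `GEOBENCH_DIR=/data/geobench python -m olmoearth_pretrain ...` instead of editing hardcoded `*_DIR` constants in each dataset module.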

New Official Training Scripts

  • Added scripts/official/ with model size variants:
    • nano.py + nano_launch.sh - 4 GPU experiments with 9 lr/wd combinations
    • tiny.py + tiny_launch.sh - 4 GPU experiments with 9 lr/wd combinations
    • base.py + base_launch.sh - 8 GPU experiments with 9 lr/wd combinations
    • large.py + large_launch.sh - Large model experiments with 9 lr/wd combinations
    • ablations/base_mae.py - MAE ablation configuration

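Each launch script sweeps 9 learning-rate/weight-decay combinations, i.e., a 3×3 grid. A minimal sketch of how such a sweep can be generated; the specific lr and wd values below are assumptions, not taken from the official scripts:

```python
from itertools import product

# Hypothetical sweep values; the official scripts define their own.
LEARNING_RATES = [1e-4, 3e-4, 1e-3]
WEIGHT_DECAYS = [0.01, 0.05, 0.1]


def sweep_configs():
    """Yield one config dict per lr/wd combination (3 x 3 = 9 total)."""
    for lr, wd in product(LEARNING_RATES, WEIGHT_DECAYS):
        yield {"lr": lr, "weight_decay": wd}


configs = list(sweep_configs())
```

A launcher script would then index into `configs` (e.g., by array-job rank) to pick the experiment each job runs.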
Impact
This PR makes the codebase significantly more accessible to external users while maintaining internal AI2 workflows. The documentation now cleanly separates internal and external use cases, and the code is better organized with centralized configuration and optional dependencies.

@yawenzzzz mentioned this pull request Oct 24, 2025

@favyen2 left a comment


leaving a couple comments for now, didn't have time to go over the whole thing

Comment threads on docs/Pretraining.md (8 threads, 6 marked outdated)
@Hgherzog merged commit df2766e into main Oct 28, 2025
4 checks passed