
Henryh/pre train tutorial #408

Merged
Hgherzog merged 45 commits into main from henryh/pre-train-tutorial on Oct 28, 2025

Conversation

@Hgherzog (Collaborator) commented Oct 23, 2025

Depends on #393
Major Documentation Overhaul
New comprehensive documentation structure:

  • docs/Pretraining.md - Complete pretraining guide covering:
    • Environment setup for both external users and AI2 researchers
    • Script launching with torchrun and command structure
    • Dataset setup (H5 format requirements and structure)
    • Official training scripts table (nano/tiny/base/large)
    • Configuration overrides and experiment customization
    • Hardware adaptation notes
  • docs/Setup-Internal.md - AI2-specific guide:
    • Beaker setup (GitHub tokens, workspace/budget config, secrets)
    • Launch methods (pre-emptible jobs vs interactive sessions)
    • Internal dataset locations on Weka
    • Beaker gotchas and best practices

README.md cleanup:

  • Removed detailed "Training Setup", "Launch", and "Beaker Information" sections (now in dedicated docs)
  • Deleted beaker_config_example.yaml

Enabling Pre-training to Run Outside of Beaker

  • Dataset path centralization:
    • New olmoearth_pretrain/evals/datasets/paths.py centralizes all eval dataset paths
    • Supports environment variable overrides (e.g., GEOBENCH_DIR, PASTIS_DIR)
    • Removed hardcoded *_DIR constants from individual dataset modules (breizhcrops, cropharvest, floods, mados, pastis, geobench)
  • Fall back to non-Beaker defaults when the Beaker-based environment variables are absent, so pretraining can run off Beaker
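The centralization pattern described above can be sketched as follows. The actual contents of `olmoearth_pretrain/evals/datasets/paths.py` are not shown in this PR, so the default Weka paths and the `BEAKER_JOB_ID` detection check below are illustrative assumptions:

```python
import os

# Illustrative defaults; the real paths.py ships its own
# (these Weka paths are assumptions for the sketch).
_DEFAULTS = {
    "GEOBENCH_DIR": "/weka/datasets/geobench",
    "PASTIS_DIR": "/weka/datasets/pastis",
}


def running_on_beaker() -> bool:
    # Assumed detection mechanism: Beaker jobs expose BEAKER_*
    # environment variables. Not necessarily the PR's exact check.
    return "BEAKER_JOB_ID" in os.environ


def dataset_dir(name: str) -> str:
    """Resolve an eval dataset directory, preferring an env var override."""
    override = os.environ.get(name)
    if override:
        return override
    if running_on_beaker():
        return _DEFAULTS[name]
    raise RuntimeError(
        f"{name} is not set; export it explicitly when running outside Beaker."
    )
```

With this in place, an external user would run something like `GEOBENCH_DIR=/data/geobench python -m olmoearth_pretrain ...` instead of editing hardcoded `*_DIR` constants in each dataset module.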

New Official Training Scripts

  • Added scripts/official/ with model size variants:
    • nano.py + nano_launch.sh - 4 GPU experiments with 9 lr/wd combinations
    • tiny.py + tiny_launch.sh - 4 GPU experiments with 9 lr/wd combinations
    • base.py + base_launch.sh - 8 GPU experiments with 9 lr/wd combinations
    • large.py + large_launch.sh - Large model experiments with 9 lr/wd combinations
    • ablations/base_mae.py - MAE ablation configuration

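Each launch script sweeps 9 learning-rate/weight-decay combinations, i.e., a 3×3 grid. A minimal sketch of how such a sweep can be generated; the specific lr and wd values below are assumptions, not taken from the official scripts:

```python
from itertools import product

# Hypothetical sweep values; the official scripts define their own.
LEARNING_RATES = [1e-4, 3e-4, 1e-3]
WEIGHT_DECAYS = [0.01, 0.05, 0.1]


def sweep_configs():
    """Yield one config dict per lr/wd combination (3 x 3 = 9 total)."""
    for lr, wd in product(LEARNING_RATES, WEIGHT_DECAYS):
        yield {"lr": lr, "weight_decay": wd}


configs = list(sweep_configs())
```

A launcher script would then index into `configs` (e.g., by array-job rank) to pick the experiment each job runs.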
Impact
This PR makes the codebase significantly more accessible to external users while maintaining internal AI2 workflows. The documentation now cleanly separates internal and external use cases, and the code is better organized with centralized configuration and optional dependencies.

@yawenzzzz mentioned this pull request Oct 24, 2025

@favyen2 left a comment


leaving a couple comments for now, didn't have time to go over the whole thing

Comment threads on docs/Pretraining.md (8 threads, 6 marked outdated)
@Hgherzog merged commit df2766e into main Oct 28, 2025
4 checks passed