Domain-Phase Data Mixture Swarm Experiments by Calvin-Xu · Pull Request #2393 · marin-community/marin

Calvin-Xu · 2026-01-20T04:54:31Z

Description

Initial code supporting RegMix-like data mixture experiments that have discrete phases & epoching.
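As a rough illustration of what a "swarm" of phase-wise mixtures can look like, the sketch below draws one weight vector over domains for each phase and reports how many epochs over each domain a phase would imply. It assumes Dirichlet sampling in the spirit of RegMix; the function names, domain names, and token counts are hypothetical and are not taken from this PR's code.

```python
import random


def sample_phase_mixtures(domains, num_phases, alpha=1.0, seed=0):
    """Draw one weight vector over `domains` per phase from a symmetric Dirichlet.

    Assumption: Dirichlet sampling is used only as an illustration here.
    """
    rng = random.Random(seed)
    mixtures = []
    for _ in range(num_phases):
        # A Dirichlet draw is a set of independent Gamma draws normalized to sum to 1.
        draws = [rng.gammavariate(alpha, 1.0) for _ in domains]
        total = sum(draws)
        mixtures.append({d: g / total for d, g in zip(domains, draws)})
    return mixtures


def epochs_per_domain(weights, phase_tokens, domain_tokens):
    """Passes over each domain implied by one phase (a value above 1 means repeated data)."""
    return {d: weights[d] * phase_tokens / domain_tokens[d] for d in weights}


if __name__ == "__main__":
    # Hypothetical domains and sizes, for illustration only.
    domains = ["web", "code", "math"]
    sizes = {"web": 1_000_000, "code": 200_000, "math": 50_000}
    for phase, weights in enumerate(sample_phase_mixtures(domains, num_phases=2)):
        print(phase, weights, epochs_per_domain(weights, 500_000, sizes))
```

Small domains reach multiple epochs quickly when a phase gives them a large weight, which is why epoching has to be tracked alongside the mixture itself.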

Wire the qsplit240 replay onto the Iris east5a path and record the failure analysis that shaped the launcher settings. This also pins the executor to a shared Fray client so large fan-out submissions stop crashing in thread-local client setup.
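The shared-client idea can be sketched as follows: build one client up front and let every worker thread submit through it, rather than having each thread construct its own. `FakeClient`, `get_shared_client`, and `submit_all` are hypothetical stand-ins and do not reflect Fray's actual API.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeClient:
    """Hypothetical stand-in for a cluster client; counts constructions."""

    created = 0
    _count_lock = threading.Lock()

    def __init__(self):
        with FakeClient._count_lock:
            FakeClient.created += 1

    def submit(self, job):
        return f"submitted:{job}"


_shared_client = None
_shared_lock = threading.Lock()


def get_shared_client():
    """Return the single process-wide client, creating it on first use."""
    global _shared_client
    with _shared_lock:
        if _shared_client is None:
            _shared_client = FakeClient()
        return _shared_client


def submit_all(jobs, max_workers=8):
    """Fan out submissions across threads that all reuse one client."""
    client = get_shared_client()  # resolved once, before any worker starts
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(client.submit, jobs))
```

This only works if the client is safe to share across threads; the sketch assumes it is.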

Check in the GRP ablations, observed-only deployment variants, and subset-validation launchers so they can be reproduced from the repo. The determinism coverage now exercises these launchers and the evaluation table builder reports the new no-penalty ablation alongside the existing rows.

Record the convergence, deployment-variant, and signal-to-noise analyses alongside the self-contained packet that was sent out for procedure review. These artifacts capture the current GRP calibration and regularization work so the plots and packet can be regenerated from committed code.

Sharding the 300M qsplit replay avoids the repeated parent crashes while preserving the original training output roots for checkpoint reuse. The overlap rerun now caches eval datasets and dispatches each Levanter lm-eval step as its own remote TPU job so JAX distributed init does not collide inside the parent process.

Checkpoint the new GRP ablations and local benchmark work while TPU jobs are blocked on capacity. This adds power-law and per-domain baselines plus intrinsic-domain and proposal-bank studies for comparing less extrapolative deployment rules.

Shard the qsplit overlap rerun, add a run_00097 overlap launcher, and backfill missing eval metrics from checkpoint artifacts when historical W&B collection is incomplete. Keep the corrected run_00097 noise baseline and ranked SNR table in-tree so the overlap discussion remains reproducible.

Allow the top-level cache prep and scaling launcher to target regions other than east5 while still reusing prebuilt merged caches where they already exist. This unblocks building the merged runtime layer in us-central1 before moving the scaling runs there.

Record the finished power-law trustblend validation in the GRP table and update the trustblend convergence artifacts with the realized subset and full-data outcomes.

Coscheduled retries could briefly leave old coordinator endpoints visible, letting tasks bootstrap against different JAX coordinators. Prefer the newest endpoint and clear a task's stale endpoints before assigning a retry so distributed init converges on a single coordinator.
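A minimal sketch of that rule, assuming each endpoint carries a registration timestamp: resolution picks the newest entry, and a retry first drops whatever the task registered earlier. The `Endpoint` and `EndpointRegistry` names are hypothetical and do not mirror the scheduler's real data model.

```python
from dataclasses import dataclass, field


@dataclass
class Endpoint:
    task_id: str
    address: str
    registered_at: float  # larger means registered more recently


@dataclass
class EndpointRegistry:
    endpoints: list = field(default_factory=list)

    def register(self, task_id, address, registered_at):
        self.endpoints.append(Endpoint(task_id, address, registered_at))

    def clear_task(self, task_id):
        """Drop every endpoint a task registered in earlier attempts."""
        self.endpoints = [e for e in self.endpoints if e.task_id != task_id]

    def resolve_coordinator(self):
        """Prefer the newest endpoint so all tasks agree on one coordinator."""
        if not self.endpoints:
            return None
        return max(self.endpoints, key=lambda e: e.registered_at).address

    def assign_retry(self, task_id, address, registered_at):
        """Clear stale endpoints before registering the retry's endpoint."""
        self.clear_task(task_id)
        self.register(task_id, address, registered_at)
```

Either rule alone narrows the race; applying both means a stale endpoint that survives briefly still loses to the newer one.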

Retrying lm-eval task loading now deep-copies dict task specs before passing them to lm-eval. This prevents transient load failures from mutating the caller-owned task config and breaking the retry path.
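The pattern can be sketched in isolation: each attempt receives a fresh deep copy of the spec, so a loader that edits its argument and then fails cannot poison the next attempt. `load_with_retry` and its `loader` argument are hypothetical; the actual lm-eval loading call is not shown here.

```python
import copy
import time


def load_with_retry(loader, task_spec, attempts=3, delay_s=0.0):
    """Call `loader` on a private copy of `task_spec`, retrying on failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        # Dict specs are copied per attempt so in-place edits never reach the caller.
        spec = copy.deepcopy(task_spec) if isinstance(task_spec, dict) else task_spec
        try:
            return loader(spec)
        except Exception as exc:  # broad on purpose: any load failure is retried
            last_error = exc
            if attempt + 1 < attempts:
                time.sleep(delay_s)
    raise last_error
```

A shallow copy would not be enough if the loader mutates nested dicts or lists, which is why the sketch uses `copy.deepcopy`.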

This batches the new power-family-penalty subset validation launchers with the raw-optimum convergence, comparison, and reporting scripts they feed. It also refreshes the two-phase-many summaries and plots so the validated raw-optimum results and scale-comparison outputs are checked in together.

These tracked artifacts and launch plans were regenerated while updating the exploratory domain-mix workflows. Keeping them in one artifact-only commit makes the branch clean again without mixing them into the code commits above.

Let the 520M and 1.2B swarm launchers schedule across east5 and central1 without a zone pin, and resolve region-sensitive caches through mirror-backed paths. Resume the latest checkpoint roots so restarted baselines and stratified runs keep their existing progress instead of starting from scratch.

Capture the exploratory per-domain exponent probe on top of the power-family penalty surrogate and make the benchmark driver write isolated output stems for follow-up variants. This keeps the local ablation code and probe artifacts together so the negative result is reproducible.

This checkpoints the current data-mixing analysis batch, including run and metric registries, parity rerun launchers, and new GRP, Olmix, and RegMix follow-up artifacts. It also carries the mirror and evaluation fixes needed to make those runs and backfills reproducible from the current workspace.

Add the strong-tier scaling-study launch and tracking plumbing, plus registry and metric-provenance updates for recent runs. Also document the Olmix investigation, add the exact two-phase proposer path, and refresh the key convergence and comparison plots.

Force HF Hub and datasets into offline mode after syncing mirrored eval datasets so LM Eval task loading stays cache-only. Add focused tests covering the offline-mode toggle and sync path.
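A minimal sketch of the ordering, assuming the toggle is done through the `HF_HUB_OFFLINE` and `HF_DATASETS_OFFLINE` environment variables that the Hugging Face libraries document. The `sync_then_go_offline` helper and its `sync_fn` argument are hypothetical, and the PR may set offline mode differently.

```python
import os

# Environment variables documented by huggingface_hub and datasets.
OFFLINE_FLAGS = {"HF_HUB_OFFLINE": "1", "HF_DATASETS_OFFLINE": "1"}


def sync_then_go_offline(sync_fn, env=None):
    """Run the mirror sync while online, then switch to cache-only mode."""
    env = os.environ if env is None else env
    sync_fn()  # must finish first: it may still need network access
    env.update(OFFLINE_FLAGS)
    return env
```

These variables are typically read when the libraries are imported, so in a fresh process they should be set before `huggingface_hub` or `datasets` is imported; a process that has already imported them may need a different toggle.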

Calvin-Xu added 14 commits January 13, 2026 16:05

Olmo3 (3B), RegMix 1M, 60M (1B) Swarm Test

fadf8b4

lr sweep

2475d75

Merge branch 'main' of https://github.com/marin-community/marin into …

a0403c1

…calvin/swarm-olmo3-regmix-test

initial ver

40665a1

Merge branch 'main' of https://github.com/marin-community/marin into …

58197da

…calvin/swarm-olmo3-regmix-test

Merge branch 'main' of https://github.com/marin-community/marin into …

4610692

…calvin/swarm-olmo3-regmix-test

refactor experiment setup

0219fc3

revamp analysis as step

9fb3ed1

rename dir

03ff0ed

tweak

d9b5b69

use dolmino for midtrain

c2dd95d

pass chat template

fa8be5e

better names

a1107ba

fix natural proportions

ffb69bd

Calvin-Xu force-pushed the calvin/swarm-olmo3-regmix-test branch from 25e9c35 to ffb69bd on January 20, 2026 06:40

Calvin-Xu mentioned this pull request Jan 20, 2026

Data Mixture: Initial Few Domains, Few Phases Experiment #2398

Closed

Calvin-Xu force-pushed the calvin/swarm-olmo3-regmix-test branch from 2dbb057 to afb0ee5 on January 20, 2026 07:02

add analyze

46c07bd

Calvin-Xu force-pushed the calvin/swarm-olmo3-regmix-test branch from afb0ee5 to 46c07bd on January 20, 2026 07:07

Calvin-Xu added 11 commits January 20, 2026 00:21

actually count tokens

a5d74d1

tweaks

f441493

Add Dolma 3 Pool, Domino Pool

dd3770e

fix optimizer, misc

9ec84da

Merge branch 'main' of https://github.com/marin-community/marin into …

b487f49

…calvin/swarm-olmo3-regmix-test

Merge branch 'main' of https://github.com/marin-community/marin into …

37f9c87

…calvin/swarm-olmo3-regmix-test

tweak domain ratios

9f3ca53

weight sampling tweaks

9733268

fix

4e07498

fix mixture logging stages

7628afe

executor parallelism cap

445d834

Calvin-Xu added 30 commits April 1, 2026 01:37

[domain_phase_mix] Migrate launchers to Iris and validate no-groups

463c88f

Merge remote-tracking branch 'origin/main' into calvin/swarm-olmo3-re…

b3b1291

…gmix-test # Conflicts: # lib/levanter/src/levanter/tracker/tracker_fns.py

[domain_phase_mix] Refresh validated GRP artifacts

e2517f7

[levanter] Defer Iris TPU init on TPU jobs

d82faef

[domain_phase_mix] Add qsplit pilot launchers and stratified baselines

9a84ed5

[domain_phase_mix] Add GRP family-curvature and penalty variants

0ec2631

lint

9f6bcc3

also lint

c1c7c13

[docs] Add qsplit pilot debug logs

33d2300

[levanter] Retry lm-eval task loads safely

62aeb59

[fray] Format pip package parsing

0b13d7e

[domain_phase_mix] Refresh exploratory cached artifacts

191d302

[domain_phase_mix] Add power-family-penalty analysis outputs

4d3ebec

[domain_phase_mix] Harden region-agnostic replay launches

d3be0d1

[levanter] Fallback multihost sync without JAX client

2a04529

Merge origin/main into calvin/swarm-olmo3-regmix-test

92d87d7

[levanter] Keep cached eval datasets offline

0df3226

Calvin-Xu wants to merge 209 commits into main from calvin/swarm-olmo3-regmix-test
