TL;DR
Token-matched data comparison: train the grug-MoE compute-optimal ladder
(d512/d768/d1024/d1280) on raw WET text from Common Crawl's focus crawl
CC-SUPPLEMENTAL-2026-22 ("top-10k science domains") vs the main crawl
CC-MAIN-2024-18 (the crawl used to build the focus seed list). Same model
configs, same per-rung token budgets, only the data differs. Primary metric:
eval/paloma/macro_loss per rung, focus vs main.
Description
Common Crawl ran a quality-steered "focus" crawl (CC-SUPPLEMENTAL-2026-22)
that biases fetching toward high-value science domains instead of the usual
search-engine-style ranking. We want to know whether pretraining on that focus
crawl's text yields lower loss than pretraining on a token-matched random sample
of the contemporaneous general main crawl, at every rung of the grug-MoE
compute-optimal ladder.
Both corpora are ingested from the crawls' WET (extracted plain-text) files over
the data.commoncrawl.org HTTPS gateway, sampled to token pools large enough
that the largest rung (d1280 = 1.24e10 tokens) never repeats data, tokenized
with the llama3 tokenizer, and fed unchanged through the documented baseline
ladder. Because each rung's (num_steps, batch_size) comes from the shared
heuristic, a rung consumes the same number of tokens from either corpus, so
the comparison is token-matched per rung by construction.
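For readers unfamiliar with WET files, the ingestion step can be pictured roughly as below. This is a minimal stdlib-only sketch, not the code in common_crawl_wet.py: the function names `iter_wet_records` and `wet_to_dolma` are hypothetical, the record fields are a reduced guess at a dolma-style document (id, text, source, metadata), and the sample bytes are a hand-made two-record stream rather than real crawl data.

```python
import io
from typing import BinaryIO, Iterator


def iter_wet_records(stream: BinaryIO) -> Iterator[tuple[dict[str, str], bytes]]:
    """Yield (headers, payload) for each WARC record in a decompressed stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers: dict[str, str] = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break  # a blank line ends the header block
            key, _, value = line.decode("utf-8", "replace").partition(":")
            headers[key.strip().lower()] = value.strip()
        # The payload length is given in bytes by the Content-Length header.
        payload = stream.read(int(headers["content-length"]))
        yield headers, payload


def wet_to_dolma(stream: BinaryIO, source: str) -> Iterator[dict]:
    """Turn WET 'conversion' records into dolma-style dicts (assumed schema)."""
    for headers, payload in iter_wet_records(stream):
        if headers.get("warc-type") != "conversion":
            continue  # skip the leading warcinfo record
        yield {
            "id": headers.get("warc-record-id", ""),
            "text": payload.decode("utf-8", "replace"),
            "source": source,
            "metadata": {"url": headers.get("warc-target-uri", "")},
        }


# Hand-made example stream: one warcinfo record, one conversion record.
SAMPLE_WET = (
    b"WARC/1.0\r\nWARC-Type: warcinfo\r\nContent-Length: 4\r\n\r\nabcd\r\n\r\n"
    b"WARC/1.0\r\nWARC-Type: conversion\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"WARC-Record-ID: <urn:uuid:1>\r\nContent-Length: 11\r\n\r\nhello world\r\n\r\n"
)
docs = list(wet_to_dolma(io.BytesIO(SAMPLE_WET), source="example"))
```

For a real file, the same functions would be fed a stream opened with gzip.open(path, "rb"), since WET files are distributed gzip-compressed.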
Initiating prompt (verbatim)
Cool - can you set up a side by side experiment comparing the focus crawl from
CC-SUPPLEMENTAL-2026-22 to CC-MAIN-2024-18? The goal should be getting the
same number of randomly sampled tokens for each and then training a ladder of
models from @experiments/grug/moe/agent.md on them
Hypothesis or Goal
H: At matched compute/tokens, the focus (science-steered) crawl yields lower
eval/paloma/macro_loss than a random sample of the main crawl, with the gap
widening (or holding) up the ladder.
Scope
- Primary metric: eval/paloma/macro_loss (final), per rung.
- Secondary: throughput/tokens_per_second, throughput/total_tokens.
- Axis under test: data source only. Model/optimizer/batch/steps/seq are the
documented compute-optimal baselines; nothing else varies.
Method
- Ingestion: lib/marin/src/marin/datakit/download/common_crawl_wet.py
(WET WARC parser → dolma JSONL; focus manifest from the crawl's index parquet,
main manifest from wet.paths.gz; seeded random file sampling).
- Launch: experiments/grug/moe/launch_crawl_compare.py (STAGE=data ingest+tokenize;
STAGE=train CRAWL=<focus|main> SCALE=<d512|d768|d1024|d1280>).
- Sample sizes (seed 0): focus 1200 WET files (~16.5B tok), main 230 files
(~18.4B tok). Measured yield: ~13.8M llama3 tok per focus WET file.
- Hardware: one v5p-8 per training run (8 runs total), via Iris.
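The sample sizes above can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names are hypothetical, the sampling function is a generic seeded sample rather than the logic in common_crawl_wet.py, and the only numbers used are the ones quoted in this write-up (13.8M tokens per focus file, 1200 focus files, 1.24e10 tokens for d1280).

```python
import math
import random

LARGEST_RUNG_TOKENS = 1.24e10   # d1280 budget, from the Description
FOCUS_TOKENS_PER_FILE = 13.8e6  # measured llama3 yield per focus WET file


def files_needed(target_tokens: float, tokens_per_file: float) -> int:
    """Smallest file count whose expected yield reaches target_tokens."""
    return math.ceil(target_tokens / tokens_per_file)


def pool_tokens(num_files: int, tokens_per_file: float) -> float:
    """Expected token pool for a given number of sampled files."""
    return num_files * tokens_per_file


def sample_files(paths: list[str], n: int, seed: int = 0) -> list[str]:
    """Seeded sample without replacement; sorting makes it order-independent."""
    return random.Random(seed).sample(sorted(paths), n)


# Minimum focus files so that the d1280 run never repeats data.
min_focus_files = files_needed(LARGEST_RUNG_TOKENS, FOCUS_TOKENS_PER_FILE)
# Expected pool for the 1200 files actually sampled.
focus_pool = pool_tokens(1200, FOCUS_TOKENS_PER_FILE)
```

Under these figures the d1280 rung needs at least 899 focus files, and 1200 files give roughly 1.656e10 tokens, which matches the quoted ~16.5B and leaves headroom above the 1.24e10 budget. The main-crawl figures (230 files, ~18.4B tokens) imply about 80M tokens per file by the same arithmetic.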
Repro
# data prep (once)
STAGE=data python -m experiments.grug.moe.launch_crawl_compare
# each rung
STAGE=train CRAWL=focus SCALE=d512 python -m experiments.grug.moe.launch_crawl_compare
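The training line above shows a single rung of one arm. As a convenience, the full grid of eight runs (two crawls by four scales) can be enumerated as in the sketch below. It only builds the command strings from the environment variables documented in Method and does not launch anything; the helper name `train_commands` is hypothetical.

```python
import itertools

CRAWLS = ("focus", "main")
SCALES = ("d512", "d768", "d1024", "d1280")
MODULE = "experiments.grug.moe.launch_crawl_compare"


def train_commands() -> list[str]:
    """One shell command per (crawl, scale) pair, in a fixed order."""
    return [
        f"STAGE=train CRAWL={crawl} SCALE={scale} python -m {MODULE}"
        for crawl, scale in itertools.product(CRAWLS, SCALES)
    ]


commands = train_commands()
```

Printing `commands` one per line gives a list that can be pasted into a shell or a job scheduler after the STAGE=data step has finished.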
Current baseline (Nemotron mix, from experiments/grug/moe/README.md)
| Budget | Dim | Layers | Tokens | Paloma macro |
|---|---|---|---|---|
| 2.19e17 | d512 | 6 | 8.37e8 | 3.8104 |
| 1.70e18 | d768 | 8 | 2.71e9 | 3.4339 |
| 9.00e18 | d1024 | 11 | 6.63e9 | 3.1605 |
| 2.83e19 | d1280 | 13 | 1.24e10 | 3.0065 |
(The Nemotron numbers are a curated-mix reference, not a direct arm of this
experiment; the head-to-head is focus vs main below.)
Results
| Rung | focus macro | main macro | Δ (main−focus) |
|---|---|---|---|
| d512 | pending | pending | |
| d768 | pending | pending | |
| d1024 | pending | pending | |
| d1280 | pending | pending | |
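Once the runs finish, the Δ column and the hypothesis check could be filled in with something like the sketch below. It contains no results: both helpers are hypothetical, they take the final eval/paloma/macro_loss values as inputs, and the `tol` parameter (how much shrinkage still counts as the gap "holding") is an assumption that this write-up does not specify.

```python
RUNGS = ("d512", "d768", "d1024", "d1280")


def deltas(focus: dict[str, float], main: dict[str, float]) -> dict[str, float]:
    """Delta = main loss minus focus loss; positive means focus is better."""
    return {r: main[r] - focus[r] for r in RUNGS if r in focus and r in main}


def gap_holds_or_widens(delta_by_rung: dict[str, float], tol: float = 0.0) -> bool:
    """True if focus wins at every rung and the gap never shrinks by more than tol."""
    values = [delta_by_rung[r] for r in RUNGS if r in delta_by_rung]
    focus_wins = all(v > 0 for v in values)
    non_shrinking = all(b >= a - tol for a, b in zip(values, values[1:]))
    return focus_wins and non_shrinking
```

Rungs missing from either arm are skipped, so the check can be run on partial results while the larger rungs are still training.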
Links
- Branch: agent/grug-moe-crawl-compare
- Logbook: .agents/logbooks/grug-moe-crawl-compare.md
- W&B: group crawl-compare-focus-vs-main, project marin_moe (link once runs start)