Agent MoE Experiment: focus crawl (CC-SUPPLEMENTAL-2026-22) vs main crawl (CC-MAIN-2024-18) #6570

Description

@Helw150

TL;DR

Token-matched data comparison: train the grug-MoE compute-optimal ladder
(d512/d768/d1024/d1280) on raw WET text from Common Crawl's focus crawl
CC-SUPPLEMENTAL-2026-22 ("top-10k science domains") vs the main crawl
CC-MAIN-2024-18 (the crawl used to build the focus seed list). Same model
configs, same per-rung token budgets, only the data differs. Primary metric:
eval/paloma/macro_loss per rung, focus vs main.

Description

Common Crawl ran a quality-steered "focus" crawl (CC-SUPPLEMENTAL-2026-22)
that biases fetching toward high-value science domains instead of the usual
search-engine-style ranking. We want to know whether pretraining on that focus
crawl's text yields lower loss than pretraining on a token-matched random sample
of the general main crawl (CC-MAIN-2024-18), at every rung of the grug-MoE
compute-optimal ladder.

Both corpora are ingested from the crawls' WET (extracted plain-text) files over
the data.commoncrawl.org HTTPS gateway, sampled to token pools large enough
that the largest rung (d1280 = 1.24e10 tokens) never repeats data, tokenized
with the llama3 tokenizer, and fed unchanged through the documented baseline
ladder. Because each rung's (num_steps, batch_size) come from the shared
heuristic, every rung consumes the same token count from either corpus, so
the comparison is token-matched per rung by construction.
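
The token-matching argument can be sketched in a few lines. This is an illustration only: it assumes the tokens consumed by a run equal num_steps × batch_size × seq_len, and the `RungSchedule` type and the numbers below are hypothetical placeholders, not values produced by the shared heuristic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RungSchedule:
    """Hypothetical container for the schedule a rung gets from the shared heuristic."""

    num_steps: int
    batch_size: int  # sequences per optimizer step
    seq_len: int  # tokens per sequence

    def tokens_consumed(self) -> int:
        # Depends only on the schedule, never on which corpus feeds the run.
        return self.num_steps * self.batch_size * self.seq_len


def is_token_matched(schedules_by_crawl: dict[str, RungSchedule]) -> bool:
    """True when every crawl arm consumes the same number of tokens."""
    counts = {s.tokens_consumed() for s in schedules_by_crawl.values()}
    return len(counts) == 1


# Placeholder numbers for illustration only.
schedule = RungSchedule(num_steps=1_000, batch_size=64, seq_len=4_096)
arms = {"focus": schedule, "main": schedule}
print(schedule.tokens_consumed(), is_token_matched(arms))
```

Since both arms of a rung reuse one schedule, the check holds trivially; it would only fail if a data-dependent setting leaked into the schedule.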

Initiating prompt (verbatim)

Cool - can you set up a side by side experiment comparing the focus crawl from
CC-SUPPLEMENTAL-2026-22 to CC-MAIN-2024-18? The goal should be getting the
same number of randomly sampled tokens for each and then training a ladder of
models from @experiments/grug/moe/agent.md on them

Hypothesis or Goal

H: At matched compute/tokens, the focus (science-steered) crawl yields lower
eval/paloma/macro_loss than a random sample of the main crawl, with the gap
widening (or holding) up the ladder.

Scope

  • Primary metric: eval/paloma/macro_loss (final), per rung.
  • Secondary: throughput/tokens_per_second, throughput/total_tokens.
  • Axis under test: data source only. Model/optimizer/batch/steps/seq are the
    documented compute-optimal baselines; nothing else varies.

Method

  • Ingestion: lib/marin/src/marin/datakit/download/common_crawl_wet.py
    (WET WARC parser → dolma JSONL; focus manifest from the crawl's index parquet,
    main manifest from wet.paths.gz; seeded random file sampling).
  • Launch: experiments/grug/moe/launch_crawl_compare.py (STAGE=data runs
    ingest + tokenize; STAGE=train with CRAWL=<focus|main> and
    SCALE=<d512|d768|d1024|d1280> trains one rung).
  • Sample sizes (seed 0): focus 1200 WET files (~16.5B tok), main 230 files
    (~18.4B tok). Measured yield: ~13.8M llama3 tok per focus WET file.
  • Hardware: one v5p-8 per training run (8 runs total), via Iris.
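
The seeded file sampling and the sample-size arithmetic above can be sketched as follows. This is not the code in common_crawl_wet.py; the function names and the in-memory manifest with made-up paths are assumptions for illustration, and only the per-file yield (~13.8M tokens) and the d1280 budget (1.24e10 tokens) come from this issue.

```python
import gzip
import io
import math
import random


def files_needed(target_tokens: float, tokens_per_file: float) -> int:
    """Smallest number of files whose expected yield covers target_tokens."""
    return math.ceil(target_tokens / tokens_per_file)


def read_manifest(gz_bytes: bytes) -> list[str]:
    """Parse a gzip-compressed, newline-delimited path list (wet.paths.gz style)."""
    with gzip.open(io.BytesIO(gz_bytes), "rt", encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]


def sample_files(paths: list[str], n: int, seed: int = 0) -> list[str]:
    """Seeded sample without replacement; sorting first makes it order-independent."""
    rng = random.Random(seed)
    return rng.sample(sorted(paths), n)


# Hypothetical manifest built in memory; real paths come from the crawl itself.
fake_paths = [f"example/wet/file-{i:05d}.warc.wet.gz" for i in range(2_000)]
manifest = gzip.compress("\n".join(fake_paths).encode("utf-8"))

paths = read_manifest(manifest)
min_focus_files = files_needed(12_400_000_000, 13_800_000)  # d1280 budget / focus yield
chosen = sample_files(paths, 1_200, seed=0)
print(min_focus_files, len(chosen))
```

Under these numbers the 1200 sampled focus files sit comfortably above the minimum needed to cover the largest rung without repeating data.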

Repro

# data prep (once)
STAGE=data python -m experiments.grug.moe.launch_crawl_compare
# each rung
STAGE=train CRAWL=focus SCALE=d512 python -m experiments.grug.moe.launch_crawl_compare
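
Only the focus/d512 command is shown above. As a convenience sketch, the full set of eight training commands (2 crawls × 4 rungs) can be generated from the STAGE/CRAWL/SCALE pattern described under Method; the helper name `train_commands` is hypothetical and the script only prints commands, it does not launch anything.

```python
import itertools

CRAWLS = ("focus", "main")
SCALES = ("d512", "d768", "d1024", "d1280")
MODULE = "experiments.grug.moe.launch_crawl_compare"


def train_commands() -> list[str]:
    """One shell command per (crawl, rung) pair, in crawl-major order."""
    commands = []
    for crawl, scale in itertools.product(CRAWLS, SCALES):
        env = f"STAGE=train CRAWL={crawl} SCALE={scale}"
        commands.append(f"{env} python -m {MODULE}")
    return commands


for command in train_commands():
    print(command)
```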

Current baseline (Nemotron mix, from experiments/grug/moe/README.md)

Budget    Dim    Layers  Tokens   Paloma macro
2.19e17   d512   6       8.37e8   3.8104
1.70e18   d768   8       2.71e9   3.4339
9.00e18   d1024  11      6.63e9   3.1605
2.83e19   d1280  13      1.24e10  3.0065

(The Nemotron numbers are a curated-mix reference, not a direct arm of this
experiment; the head-to-head is focus vs main below.)
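
As a quick sanity check, the per-rung token budgets in the table can be compared against the sampled pool sizes listed under Method. The sketch below only restates numbers from this issue; the `headroom` helper is a hypothetical name, and a ratio above 1 means a single run at the largest rung does not need to repeat data.

```python
# Per-rung token budgets from the baseline table.
RUNG_TOKENS = {"d512": 8.37e8, "d768": 2.71e9, "d1024": 6.63e9, "d1280": 1.24e10}
# Approximate pool sizes from the Method section (seed 0 samples).
POOL_TOKENS = {"focus": 16.5e9, "main": 18.4e9}


def headroom(pool_tokens: float, rung_tokens: dict[str, float]) -> float:
    """Ratio of the pool size to the largest single-run token budget."""
    return pool_tokens / max(rung_tokens.values())


for crawl, pool in POOL_TOKENS.items():
    ratio = headroom(pool, RUNG_TOKENS)
    status = "ok" if ratio > 1.0 else "would repeat data"
    print(f"{crawl}: {ratio:.2f}x the d1280 budget ({status})")
```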

Results

Rung    focus macro  main macro  Δ (main−focus)
d512    pending      pending     pending
d768    pending      pending     pending
d1024   pending      pending     pending
d1280   pending      pending     pending
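
Once final eval/paloma/macro_loss values exist, the Δ column can be filled with a helper like the one below. It is a sketch with hypothetical names; every result is left as None because no runs have finished, and a positive Δ (main loss higher than focus loss) would be in the direction of hypothesis H.

```python
from typing import Optional

RUNGS = ("d512", "d768", "d1024", "d1280")


def delta_main_minus_focus(
    focus_loss: Optional[float], main_loss: Optional[float]
) -> Optional[float]:
    """Return main minus focus, or None while either run is still pending."""
    if focus_loss is None or main_loss is None:
        return None
    return main_loss - focus_loss


# All entries are pending; fill in final losses as runs complete.
results = {rung: {"focus": None, "main": None} for rung in RUNGS}

for rung in RUNGS:
    delta = delta_main_minus_focus(results[rung]["focus"], results[rung]["main"])
    shown = "pending" if delta is None else f"{delta:+.4f}"
    print(f"{rung}: {shown}")
```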

Links

  • Branch: agent/grug-moe-crawl-compare
  • Logbook: .agents/logbooks/grug-moe-crawl-compare.md
  • W&B: group crawl-compare-focus-vs-main, project marin_moe (link to be added
    once runs start)
