Agent MoE Experiment: focus crawl (CC-SUPPLEMENTAL-2026-22) vs main crawl (CC-MAIN-2024-18) #6570

## TL;DR

Token-matched data comparison: train the grug-MoE compute-optimal ladder
(d512/d768/d1024/d1280) on raw WET text from Common Crawl's **focus crawl**
`CC-SUPPLEMENTAL-2026-22` ("top-10k science domains") vs the **main crawl**
`CC-MAIN-2024-18` (the crawl used to build the focus seed list). The model
configs and per-rung token budgets are identical; only the data differs.
Primary metric: `eval/paloma/macro_loss` per rung, focus vs main.

## Description

Common Crawl ran a quality-steered "focus" crawl (`CC-SUPPLEMENTAL-2026-22`)
that biases fetching toward high-value science domains instead of the usual
search-engine-style ranking. We want to know whether pretraining on that focus
crawl's text yields lower loss than pretraining on a token-matched random sample
of the general main crawl (`CC-MAIN-2024-18`), at every rung of the grug-MoE
compute-optimal ladder.

Both corpora are ingested from the crawls' WET (extracted plain-text) files over
the `data.commoncrawl.org` HTTPS gateway, sampled to token pools large enough
that the largest rung (d1280, 1.24e10 tokens) never repeats data, tokenized
with the llama3 tokenizer, and fed unchanged through the documented baseline
ladder. Because each rung's `(num_steps, batch_size)` comes from the shared
heuristic, every rung consumes the same token count regardless of corpus, so
the comparison is token-matched per rung by construction.
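
The no-repeat condition can be checked with simple arithmetic. The sketch below uses the approximate pool sizes listed under Method and the per-rung token budgets from the baseline table; the function name and dictionary layout are illustrative and not taken from the repo.

```python
# Approximate token pools (from the sample sizes listed under Method).
POOL_TOKENS = {"focus": 16.5e9, "main": 18.4e9}

# Per-rung token budgets (from the baseline ladder table).
RUNG_TOKENS = {"d512": 8.37e8, "d768": 2.71e9, "d1024": 6.63e9, "d1280": 1.24e10}


def pool_fraction(crawl: str, rung: str) -> float:
    """Fraction of a crawl's token pool that a single rung consumes."""
    return RUNG_TOKENS[rung] / POOL_TOKENS[crawl]


for crawl in POOL_TOKENS:
    for rung in RUNG_TOKENS:
        frac = pool_fraction(crawl, rung)
        status = "ok" if frac < 1.0 else "REPEATS DATA"
        print(f"{crawl:5s} {rung:5s} {frac:6.1%} of pool  {status}")
```

Assuming the loader reads each pool without replacement, the largest rung uses roughly three quarters of the focus pool and about two thirds of the main pool, so no run sees a token twice.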

### Initiating prompt (verbatim)

> Cool - can you set up a side by side experiment comparing the focus crawl from
> CC-SUPPLEMENTAL-2026-22 to CC-MAIN-2024-18? The goal should be getting the
> same number of randomly sampled tokens for each and then training a ladder of
> models from @experiments/grug/moe/agent.md on them

## Hypothesis or Goal

H: At matched compute/tokens, the focus (science-steered) crawl yields lower
`eval/paloma/macro_loss` than a random sample of the main crawl, with the gap
widening (or holding) up the ladder.

## Scope

- Primary metric: `eval/paloma/macro_loss` (final), per rung.
- Secondary: `throughput/tokens_per_second`, `throughput/total_tokens`.
- Axis under test: **data source only**. Model/optimizer/batch/steps/seq are the
  documented compute-optimal baselines; nothing else varies.

## Method

- Ingestion: `lib/marin/src/marin/datakit/download/common_crawl_wet.py`
  (WET WARC parser → dolma JSONL; focus manifest from the crawl's index parquet,
  main manifest from `wet.paths.gz`; seeded random file sampling).
- Launch: `experiments/grug/moe/launch_crawl_compare.py` (STAGE=data ingest+tokenize;
  STAGE=train CRAWL=<focus|main> SCALE=<d512|d768|d1024|d1280>).
- Sample sizes (seed 0): focus 1200 WET files (~16.5B tok), main 230 files
  (~18.4B tok). Measured yield: ~13.8M llama3 tok per focus WET file.
- Hardware: one v5p-8 per training run (8 runs total: 2 crawls × 4 rungs), via Iris.
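
As a rough illustration of the sizing and sampling steps above, the sketch below computes how many WET files are needed to cover a token target at a given per-file yield, and draws a seeded random sample from a manifest. It is not the code in `common_crawl_wet.py`: `files_needed`, `sample_files`, and the toy manifest paths are hypothetical names used only for this example.

```python
import math
import random


def files_needed(target_tokens: float, tokens_per_file: float) -> int:
    """Smallest number of files whose combined yield reaches target_tokens."""
    return math.ceil(target_tokens / tokens_per_file)


def sample_files(paths: list[str], k: int, seed: int = 0) -> list[str]:
    """Seeded sample of k paths; sorting first makes it independent of input order."""
    rng = random.Random(seed)
    return rng.sample(sorted(paths), k)


# Minimum focus files to cover the largest rung (1.24e10 tokens)
# at the measured yield of ~13.8M llama3 tokens per focus WET file.
min_focus_files = files_needed(1.24e10, 13.8e6)

# Toy manifest with made-up file names (assumption, not real crawl paths).
manifest = [f"wet/part-{i:05d}.warc.wet.gz" for i in range(100)]
picked = sample_files(manifest, k=5, seed=0)

# Same seed and same set of paths give the same sample, whatever the input order.
assert picked == sample_files(list(reversed(manifest)), k=5, seed=0)
```

At the measured yield, 899 focus files would cover the d1280 budget exactly once; the 1200 files actually sampled leave headroom above that minimum.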

### Repro

    # data prep (once)
    STAGE=data python -m experiments.grug.moe.launch_crawl_compare
    # each rung: repeat for CRAWL in {focus, main} and SCALE in {d512, d768, d1024, d1280}
    STAGE=train CRAWL=focus SCALE=d512 python -m experiments.grug.moe.launch_crawl_compare
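
The full set of training invocations is the cross product of the two crawls and the four rungs. A small dry-run helper like the one below can print all eight commands; it only formats strings using the `STAGE`, `CRAWL`, and `SCALE` variables shown above, and the helper names are hypothetical rather than part of the launch script.

```python
import itertools

CRAWLS = ("focus", "main")
SCALES = ("d512", "d768", "d1024", "d1280")
MODULE = "experiments.grug.moe.launch_crawl_compare"


def training_runs() -> list[dict[str, str]]:
    """One environment dict per (crawl, rung) pair, in ladder order per crawl."""
    return [
        {"STAGE": "train", "CRAWL": crawl, "SCALE": scale}
        for crawl, scale in itertools.product(CRAWLS, SCALES)
    ]


def format_command(env: dict[str, str]) -> str:
    """Render the shell command line for one run (dry run: nothing is executed)."""
    prefix = " ".join(f"{key}={value}" for key, value in env.items())
    return f"{prefix} python -m {MODULE}"


if __name__ == "__main__":
    for env in training_runs():
        print(format_command(env))
```

Printing rather than executing keeps the sketch safe to run anywhere; the commands can be pasted into a shell or a job scheduler once the data stage has finished.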

## Current baseline (Nemotron mix, from experiments/grug/moe/README.md)

| Budget  | Dim   | Layers | Tokens  | Paloma macro |
|---------|-------|--------|---------|--------------|
| 2.19e17 | d512  | 6      | 8.37e8  | 3.8104 |
| 1.70e18 | d768  | 8      | 2.71e9  | 3.4339 |
| 9.00e18 | d1024 | 11     | 6.63e9  | 3.1605 |
| 2.83e19 | d1280 | 13     | 1.24e10 | 3.0065 |

(The Nemotron numbers are a curated-mix reference, not a direct arm of this
experiment; the head-to-head is focus vs main below.)

## Results

| Rung  | focus macro | main macro | Δ (main−focus) |
|-------|-------------|------------|----------------|
| d512  | _pending_   | _pending_  | |
| d768  | _pending_   | _pending_  | |
| d1024 | _pending_   | _pending_  | |
| d1280 | _pending_   | _pending_  | |
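
Once the runs finish, the Δ column and the hypothesis check can be computed mechanically. The sketch below is one possible reading of "gap widening (or holding) up the ladder": every Δ is positive and no Δ shrinks by more than a tolerance relative to the previous rung. The function names and the `tol` parameter are assumptions for illustration, and no loss values are included because results are still pending.

```python
LADDER = ("d512", "d768", "d1024", "d1280")


def loss_gaps(focus: dict[str, float], main: dict[str, float]) -> dict[str, float]:
    """Delta = main - focus for each rung present in both arms, in ladder order."""
    return {r: main[r] - focus[r] for r in LADDER if r in focus and r in main}


def supports_hypothesis(gaps: dict[str, float], tol: float = 0.0) -> bool:
    """True if focus wins at every rung and the gap never shrinks by more than tol."""
    values = [gaps[r] for r in LADDER if r in gaps]
    if not values:
        return False
    focus_wins = all(v > 0 for v in values)
    non_shrinking = all(b >= a - tol for a, b in zip(values, values[1:]))
    return focus_wins and non_shrinking
```

A nonzero `tol` would allow for run-to-run noise in `eval/paloma/macro_loss`; choosing its value needs a noise estimate that this experiment does not yet provide.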

### Links

- Branch: `agent/grug-moe-crawl-compare`
- Logbook: `.agents/logbooks/grug-moe-crawl-compare.md`
- W&B: group `crawl-compare-focus-vs-main`, project `marin_moe` (link once runs start)

