Bring Nano3 pretrain blend documentation to parity with Super3 (_missing_categories, real weights, phase split) #136

@ridgerchu

Description

Summary

Thanks for the impressively transparent open-source release — particularly the _missing_categories field in Super3's data_blend_raw_phase1.json / data_blend_raw_phase2.json. That annotation is one of the most considerate "what we couldn't open-source and why" disclosures I've seen for an LLM release.

This issue is complementary to #120 (which focuses on a release plan for the missing datasets themselves). My ask is narrower and orthogonal: bring Nano3's pretrain blend metadata to the same documentation standard as Super3's, so users can at least reason about the gap even before any new datasets are released.

What Super3 has (and Nano3 doesn't)

`src/nemotron/recipes/super3/stage0_pretrain/config/data_prep/data_blend_raw_phase1.json`:

  1. `_comment` field estimating coverage: "the open-sourced data covers an estimated 8-10T tokens total (~40-50% of internal)"
  2. `_missing_categories` field with precise weights and descriptions:

     ```json
     "_missing_categories": {
       "code":             {"weight": 14.0, "description": "Code data (e.g., The Stack, StarCoder data). Largest missing component."},
       "nemotron-cc-code": {"weight": 2.1,  "description": "Nemotron-CC code-classified web documents."},
       "crawl++":          {"weight": 1.8,  "description": "OpenWebText + BigScience + Reddit."},
       "academic":         {"weight": 1.7,  "description": "Academic papers and text."}
     }
     ```

  3. Phase 1 / Phase 2 / long-context as three separate blend files reflecting the actual curriculum
  4. Approximate internal weights (e.g. `weight: 14.9` for `nemotron-cc-v2.1-high-quality-synthetic`) instead of uniform `1.0`
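Those four numbers are enough for a user to quantify the gap themselves. A minimal sketch (the dict inlines the `_missing_categories` block quoted above; in practice you would `json.load()` the blend file itself):

```python
# Inlined copy of Super3's published _missing_categories annotation.
missing = {
    "code":             {"weight": 14.0, "description": "Code data. Largest missing component."},
    "nemotron-cc-code": {"weight": 2.1,  "description": "Nemotron-CC code-classified web documents."},
    "crawl++":          {"weight": 1.8,  "description": "OpenWebText + BigScience + Reddit."},
    "academic":         {"weight": 1.7,  "description": "Academic papers and text."},
}

# Total blend weight that is internal-only and cannot be reproduced as released.
missing_weight = sum(v["weight"] for v in missing.values())
print(f"total missing weight: {missing_weight:.1f}")  # 19.6
```

This is exactly the "~19.6% weight" anchor that Nano3 users currently have no way to compute.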

What Nano3 currently has

`src/nemotron/recipes/nano3/stage0_pretrain/config/data_prep/data_blend_raw.json` and `data_blend_raw_large.json`:

  • No `_comment` describing coverage vs internal
  • No `_missing_categories` annotation
  • All `weight: 1.0` (uniform mixing) — does not reflect internal ratios
  • No Phase 1 / Phase 2 split, despite `docs/nemotron/nano3/pretrain.md` describing a 23.5T → 1.5T curriculum
  • No `HuggingFaceFW/finepdfs` (Super3 uses `weight: 6.1` in Phase 1, `14.3` in Phase 2)
  • No Nemotron-Pretraining-Specialized-v1.1 subsets (Multiple-Choice, Economics, Formal-Logic, Code-Concepts, Unconditional-Algorithmic) — all of which Super3 includes

The only doc-level acknowledgement is a brief sentence in `pretrain.md`:

> "Results will differ from the benchmarks in the tech report. Use this recipe as a reference implementation to apply the methodology with your own data."

This doesn't tell users what is missing or how much.

Why this matters

For users doing CPT / domain adaptation / partial reproduction with Nano3:

  1. They cannot reason about expected performance gaps. With Super3 they can compute "I'm missing ~19.6% weight, here are the categories." With Nano3 they have no anchor.
  2. They cannot easily build a substitute blend. Super3 effectively documents how to backfill (e.g. The Stack v2, OpenWebText, peS2o). Nano3 doesn't even tell you what to backfill.
  3. The current blend is uniform `weight: 1.0`, so users who run it as-is unknowingly train on a data distribution very different from the one the released model was trained on.
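The effect of point 3 is easy to see by normalizing blend weights into per-dataset sampling probabilities. A hedged sketch; the dataset names and weights below are illustrative placeholders loosely modeled on the Super3 numbers quoted above, not actual Nano3 internals:

```python
# Hypothetical three-dataset blend; names and weights are illustrative only.
weighted_blend = {"nemotron-cc-hq-synthetic": 14.9, "finepdfs": 6.1, "academic": 1.7}
uniform_blend = {name: 1.0 for name in weighted_blend}

def sampling_probs(blend):
    """Normalize blend weights into per-dataset sampling probabilities."""
    total = sum(blend.values())
    return {name: round(w / total, 3) for name, w in blend.items()}

print(sampling_probs(weighted_blend))  # {'nemotron-cc-hq-synthetic': 0.656, 'finepdfs': 0.269, 'academic': 0.075}
print(sampling_probs(uniform_blend))   # {'nemotron-cc-hq-synthetic': 0.333, 'finepdfs': 0.333, 'academic': 0.333}
```

Even rough internal ratios would let users reproduce something like the first distribution instead of silently training on the second.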

Concrete request

Please add to Nano3's pretrain config:

  1. A `_comment` field in `data_blend_raw.json` and `data_blend_raw_large.json` mirroring Super3's coverage estimate.
  2. A `_missing_categories` field listing internal-only data with weights, matching Super3's schema.
  3. Approximate internal ratio weights (rather than uniform `1.0`) — even rough estimates would be more useful than uniform mixing.
  4. A Phase 1 / Phase 2 split (`data_blend_nano3_phase1.json`, `data_blend_nano3_phase2.json`) reflecting the 23.5T / 1.5T curriculum described in `pretrain.md`.
  5. Optional but high-value: add `HuggingFaceFW/finepdfs` and Nemotron-Pretraining-Specialized-v1.1 to the Nano3 blend, as Super3 does.
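For items 1 and 2, a sketch of what the annotation might look like; everything below is a placeholder for maintainers to fill in and reflects nothing about actual Nano3 internals (the `_comment` / `_missing_categories` key names follow Super3's released files, but the rest of the file layout is an assumption):

```json
{
  "_comment": "PLACEHOLDER: open-sourced data covers an estimated N T tokens (~X% of internal).",
  "_missing_categories": {
    "example-category": {"weight": 0.0, "description": "PLACEHOLDER: internal-only data to be documented by maintainers."}
  }
}
```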

If the Nano3 blend is intentionally a simplified template and not meant to mirror internal training, it would be helpful for the README to state that explicitly so users don't expect parity with Super3.

Related

  • Datasets release plan required #120 — request for a release plan for the datasets themselves (complementary; this issue is about metadata format parity, that one is about new dataset releases)

Thanks!
