Bring Nano3 pretrain blend documentation to parity with Super3 (_missing_categories, real weights, phase split) #136

@ridgerchu

Description

Summary

Thanks for the impressively transparent open-source release — particularly the _missing_categories field in Super3's data_blend_raw_phase1.json / data_blend_raw_phase2.json. That annotation is one of the most considerate "what we couldn't open-source and why" disclosures I've seen for an LLM release.

This issue is complementary to #120 (which focuses on a release plan for the missing datasets themselves). My ask is narrower and orthogonal: bring Nano3's pretrain blend metadata to the same documentation standard as Super3's, so users can at least reason about the gap even before any new datasets are released.

What Super3 has (and Nano3 doesn't)

`src/nemotron/recipes/super3/stage0_pretrain/config/data_prep/data_blend_raw_phase1.json`:

  1. `_comment` field estimating coverage: "the open-sourced data covers an estimated 8-10T tokens total (~40-50% of internal)"
  2. `_missing_categories` field with precise weights and descriptions:

     ```json
     "_missing_categories": {
       "code":             {"weight": 14.0, "description": "Code data (e.g., The Stack, StarCoder data). Largest missing component."},
       "nemotron-cc-code": {"weight": 2.1,  "description": "Nemotron-CC code-classified web documents."},
       "crawl++":          {"weight": 1.8,  "description": "OpenWebText + BigScience + Reddit."},
       "academic":         {"weight": 1.7,  "description": "Academic papers and text."}
     }
     ```

  3. Phase 1 / Phase 2 / long-context as three separate blend files reflecting the actual curriculum
  4. Approximate internal weights (e.g. `weight: 14.9` for `nemotron-cc-v2.1-high-quality-synthetic`) instead of uniform `1.0`
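Those four numbers are enough for a user to quantify the gap themselves. A minimal sketch (the dict inlines the `_missing_categories` block quoted above; in practice you would `json.load()` the blend file itself):

```python
# Inlined copy of Super3's published _missing_categories annotation.
missing = {
    "code":             {"weight": 14.0, "description": "Code data. Largest missing component."},
    "nemotron-cc-code": {"weight": 2.1,  "description": "Nemotron-CC code-classified web documents."},
    "crawl++":          {"weight": 1.8,  "description": "OpenWebText + BigScience + Reddit."},
    "academic":         {"weight": 1.7,  "description": "Academic papers and text."},
}

# Total blend weight that is internal-only and cannot be reproduced as released.
missing_weight = sum(v["weight"] for v in missing.values())
print(f"total missing weight: {missing_weight:.1f}")  # 19.6
```

This is exactly the "~19.6% weight" anchor that Nano3 users currently have no way to compute.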

What Nano3 currently has

`src/nemotron/recipes/nano3/stage0_pretrain/config/data_prep/data_blend_raw.json` and `data_blend_raw_large.json`:

  • No `_comment` describing coverage vs internal
  • No `_missing_categories` annotation
  • All `weight: 1.0` (uniform mixing) — does not reflect internal ratios
  • No Phase 1 / Phase 2 split, despite `docs/nemotron/nano3/pretrain.md` describing a 23.5T → 1.5T curriculum
  • No `HuggingFaceFW/finepdfs` (Super3 uses `weight: 6.1` in Phase 1, `14.3` in Phase 2)
  • No Nemotron-Pretraining-Specialized-v1.1 subsets (Multiple-Choice, Economics, Formal-Logic, Code-Concepts, Unconditional-Algorithmic) — all of which Super3 includes

The only doc-level acknowledgement is a brief sentence in `pretrain.md`:

> "Results will differ from the benchmarks in the tech report. Use this recipe as a reference implementation to apply the methodology with your own data."

This doesn't tell users what is missing or how much.

Why this matters

For users doing CPT / domain adaptation / partial reproduction with Nano3:

  1. They cannot reason about expected performance gaps. With Super3 they can compute "I'm missing ~19.6% weight, here are the categories." With Nano3 they have no anchor.
  2. They cannot easily build a substitute blend. Super3 effectively documents how to backfill (e.g. The Stack v2, OpenWebText, peS2o). Nano3 doesn't even tell you what to backfill.
  3. The current blend is uniform `weight: 1.0`, so users who run it as-is unknowingly train on a data distribution very different from the one the released model was trained on.
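The effect of point 3 is easy to see by normalizing blend weights into per-dataset sampling probabilities. A hedged sketch; the dataset names and weights below are illustrative placeholders loosely modeled on the Super3 numbers quoted above, not actual Nano3 internals:

```python
# Hypothetical three-dataset blend; names and weights are illustrative only.
weighted_blend = {"nemotron-cc-hq-synthetic": 14.9, "finepdfs": 6.1, "academic": 1.7}
uniform_blend = {name: 1.0 for name in weighted_blend}

def sampling_probs(blend):
    """Normalize blend weights into per-dataset sampling probabilities."""
    total = sum(blend.values())
    return {name: round(w / total, 3) for name, w in blend.items()}

print(sampling_probs(weighted_blend))  # {'nemotron-cc-hq-synthetic': 0.656, 'finepdfs': 0.269, 'academic': 0.075}
print(sampling_probs(uniform_blend))   # {'nemotron-cc-hq-synthetic': 0.333, 'finepdfs': 0.333, 'academic': 0.333}
```

Even rough internal ratios would let users reproduce something like the first distribution instead of silently training on the second.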

Concrete request

Please add to Nano3's pretrain config:

  1. A `_comment` field in `data_blend_raw.json` and `data_blend_raw_large.json` mirroring Super3's coverage estimate.
  2. A `_missing_categories` field listing internal-only data with weights, matching Super3's schema.
  3. Approximate internal ratio weights (rather than uniform `1.0`) — even rough estimates would be more useful than uniform mixing.
  4. A Phase 1 / Phase 2 split (`data_blend_nano3_phase1.json`, `data_blend_nano3_phase2.json`) reflecting the 23.5T / 1.5T curriculum described in `pretrain.md`.
  5. Optional but high-value: add `HuggingFaceFW/finepdfs` and Nemotron-Pretraining-Specialized-v1.1 to the Nano3 blend, as Super3 does.
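For items 1 and 2, a sketch of what the annotation might look like; everything below is a placeholder for maintainers to fill in and reflects nothing about actual Nano3 internals (the `_comment` / `_missing_categories` key names follow Super3's released files, but the rest of the file layout is an assumption):

```json
{
  "_comment": "PLACEHOLDER: open-sourced data covers an estimated N T tokens (~X% of internal).",
  "_missing_categories": {
    "example-category": {"weight": 0.0, "description": "PLACEHOLDER: internal-only data to be documented by maintainers."}
  }
}
```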

If the Nano3 blend is intentionally a simplified template and not meant to mirror internal training, it would be helpful for the README to state that explicitly so users don't expect parity with Super3.

Related

  • Datasets release plan required #120 — request for a release plan for the datasets themselves (complementary; this issue is about metadata format parity, that one is about new dataset releases)

Thanks!
