Thanks for the impressively transparent open-source release — particularly the _missing_categories field in Super3's data_blend_raw_phase1.json / data_blend_raw_phase2.json. That annotation is one of the most considerate "what we couldn't open-source and why" disclosures I've seen for an LLM release.
This issue is complementary to #120 (which focuses on a release plan for the missing datasets themselves). My ask is narrower: bring Nano3's pretrain blend metadata up to the same documentation standard as Super3's, so users can at least reason about the gap even before any new datasets are released.
What Super3 has (and Nano3 doesn't)
In src/nemotron/recipes/super3/stage0_pretrain/config/data_prep/data_blend_raw_phase1.json:
A _comment field estimating coverage: "the open-sourced data covers an estimated 8-10T tokens total (~40-50% of internal)"
A _missing_categories field with precise weights and descriptions
Approximate internal ratio weights (e.g. weight: 14.9 for nemotron-cc-v2.1-high-quality-synthetic) instead of uniform 1.0
What Nano3 currently has
In src/nemotron/recipes/nano3/stage0_pretrain/config/data_prep/data_blend_raw.json and data_blend_raw_large.json:
No _comment describing coverage vs. internal
No _missing_categories annotation
Every entry has weight: 1.0 (uniform mixing), which does not reflect internal ratios
No Phase 1 / Phase 2 split, despite docs/nemotron/nano3/pretrain.md describing a 23.5T → 1.5T curriculum
No HuggingFaceFW/finepdfs (Super3 uses weight: 6.1 in Phase 1, 14.3 in Phase 2)
No Nemotron-Pretraining-Specialized-v1.1 subsets (Multiple-Choice, Economics, Formal-Logic, Code-Concepts, Unconditional-Algorithmic), all of which Super3 includes
The only doc-level acknowledgement is a brief sentence in pretrain.md:
"Results will differ from the benchmarks in the tech report. Use this recipe as a reference implementation to apply the methodology with your own data."
This doesn't tell users what is missing or how much.
Why this matters
For users doing CPT / domain adaptation / partial reproduction with Nano3:
They cannot reason about expected performance gaps. With Super3 they can compute "I'm missing ~19.6% weight, here are the categories." With Nano3 they have no anchor.
They cannot easily build a substitute blend. Super3 effectively documents how to backfill (e.g. The Stack v2, OpenWebText, peS2o). Nano3 doesn't even tell you what to backfill.
The current blend is uniform 1.0, so users who run it as-is unknowingly train on a distribution very different from the released model.
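To make the "anchor" point concrete: with Super3-style metadata, a user can compute the open-sourced weight fraction mechanically. Below is a minimal sketch that assumes a hypothetical blend layout in which top-level keys map dataset names to objects with a weight field, and _missing_categories is a list of {name, weight} entries; the actual schema in the repo may differ.

```python
import json


def coverage_fraction(blend_path: str) -> float:
    """Estimate what fraction of total blend weight is open-sourced.

    Hypothetical schema assumption: dataset entries carry a "weight"
    field, metadata keys start with "_", and "_missing_categories"
    lists internal-only categories with their weights.
    """
    with open(blend_path) as f:
        blend = json.load(f)

    # Weight held back internally, as documented by the annotation.
    missing = sum(c["weight"] for c in blend.get("_missing_categories", []))

    # Weight of the released datasets (skip metadata fields like _comment).
    released = sum(
        entry["weight"]
        for key, entry in blend.items()
        if not key.startswith("_")
    )

    total = released + missing
    return released / total if total else 0.0
```

With Nano3's current uniform-1.0 files there is nothing for a script like this to read, which is exactly the gap this issue is about.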
Concrete request
Please add to Nano3's pretrain config:
_comment field in data_blend_raw.json and data_blend_raw_large.json mirroring Super3's coverage estimate.
_missing_categories field listing internal-only data with weights, matching Super3's schema.
Approximate internal ratio weights (rather than uniform 1.0); even rough estimates would be more useful than a flat blend.
Phase 1 / Phase 2 split (data_blend_nano3_phase1.json, data_blend_nano3_phase2.json) reflecting the 23.5T / 1.5T curriculum described in pretrain.md.
Optional but high-value: add HuggingFaceFW/finepdfs and Nemotron-Pretraining-Specialized-v1.1 to the Nano3 blend, as Super3 does.
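For illustration, the requested annotations might look like the sketch below, mirroring the fields Super3 already uses; every dataset name, weight, and percentage here is a placeholder for illustration only, not an actual internal ratio:

```json
{
  "_comment": "Placeholder: open-sourced data covers an estimated X% of the internal Nano3 blend.",
  "_missing_categories": [
    {
      "name": "example-internal-category",
      "weight": 5.0,
      "description": "Placeholder for internal-only data that could not be released."
    }
  ],
  "example-released-dataset": {
    "weight": 3.2
  }
}
```

Even with rough numbers, this would let users compute coverage and build substitute blends for Nano3 the same way they already can for Super3.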
If the Nano3 blend is intentionally a simplified template and not meant to mirror internal training, it would be helpful for the README to state that explicitly so users don't expect parity with Super3.
Related
Datasets release plan required #120 — request for a release plan for the datasets themselves (complementary; this issue is about metadata format parity, that one is about new dataset releases)
Thanks!