Skip to content

Customization Recipies#143

Draft
rkalaniNV wants to merge 6 commits intoNVIDIA-NeMo:mainfrom
rkalaniNV:rkalani/customization-recipes-merger
Draft

Customization Recipies#143
rkalaniNV wants to merge 6 commits intoNVIDIA-NeMo:mainfrom
rkalaniNV:rkalani/customization-recipes-merger

Conversation

@rkalaniNV
Copy link
Copy Markdown

WIP code placement review. DO NOT MERGE.

rkalaniNV and others added 6 commits April 8, 2026 12:22
Integrate the Speaker (Sovereign AI Playbook) project as
customization_recipes/ for language/domain model customization.

What's added:
- 6-stage customization pipeline (CPT, SFT, RL, BYOB, Eval, Quantization)
- data_prep/ shared library (acquire, translate, SDG, quality, tokenize,
  BYOB, quantize) with lazy imports for optional deps
- 12 SKILL.md agentic docs as primary agent interface
- AGENTS.md repo capability map
- 8 CLI commands under `nemotron customize` with shared _execute.py
- Multi-container Docker deployment (orchestrator, curator, trainer,
  evaluator, NIM) with command dispatcher for single-entry-point UX
- Airgap support (pre-download + deploy scripts + compose overlay)
- Sovereign benchmark bridge (BYOB → NeMo Evaluator BYOB framework)
- Lepton and Run:AI executor support in nemo_runspec/execution.py
- Nemotron model family populated; Llama/Qwen stubs
- [customize] optional dependency group (14 packages)
- Tests for data_prep, configs, and SKILL.md integrity

What's NOT changed:
- Existing nano3/super3/embed recipes (zero modifications)
- Core dependencies (customize deps are optional)
- nemotron.kit, nemo_runspec (only execution.py extended)

Verified: 8-point safety audit confirms no breakage to existing
functionality. 3-round review cycle completed (all issues fixed).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove ~280 lines of reimplemented logic (tokenizer loading, file I/O,
ShareGPT conversion, tokenization, packing, bin/idx building) and
delegate to the existing production-grade nemotron.data_prep pipeline.

tokenize_pack.py is now a thin adapter (272 lines, down from 549):
- CPTConfig/SFTConfig dataclasses kept as OmegaConf interface
- Added to_data_blend()/to_tokenizer_config() converters
- prepare_cpt_data() delegates to run_pretrain_pipeline()
- prepare_sft_data() delegates to run_sft_pipeline()
- Thinking token support confirmed already in nemotron.data_prep
  (chat_template + nano3.jinja) — removed local reimplementation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove "(WIP)" from customize CLI entry in AGENTS.md — it is fully
  implemented with 8 commands
- Add data quality evaluation (--mode data) documentation to the E2E
  SKILL.md stage 4 walkthrough

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Every SKILL.md now tells agents to gather requirements from the user
BEFORE executing any commands. Adds structured input tables with
specific questions to ask, defaults, and conditional triggers.

- E2E SKILL.md: Step 0 with 7 required + 12 optional inputs
- Per-stage SKILL.md: "Inputs Required" tables (7-12 inputs each)
- data_prep SKILL.md: Per-utility input sections (acquire, translate,
  SDG, quality, tokenize/pack)

Agents will no longer execute the Hindi medical example verbatim —
they will ask the user what language, domain, data, and compute
environment to use first.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Major changes:
- Training CLI commands (cpt, sft, rl) now reuse existing nano3/super3
  training scripts instead of maintaining copies. The --model-family
  flag (-m) selects which script set to use (default: nano3).
- Added MODEL_FAMILY_SCRIPTS mapping and resolve_training_script() to
  _execute.py for dynamic script resolution at execution time.
- Deleted duplicate run_cpt.py, run_sft.py, run_rl.py (replaced with
  tombstones pointing to the reused scripts).
- YAML configs rewritten to Megatron-Bridge format (matching actual
  nano3/super3 configs): recipe._target_, train.*, model.*, etc.
- All SKILL.md config references aligned with Megatron-Bridge keys.

Additional fixes:
- SDGConfig: added optional domain/language fields for targeted generation
- acquire.py: auto-downloads FastText lid.176.bin when lid_model_path=None
- deploy_airgap.sh: fixed service names to match docker-compose.yaml
- CLI override examples: use correct Megatron-Bridge key paths
- RL examples: use -c default with training_type= override
- Env vars: standardized on OPENAI_API_KEY across docs and compose

Usage:
  nemotron customize cpt -c default                    # nano3 (default)
  nemotron customize sft -c default -m super3          # super3
  nemotron customize rl -c default training_type=dpo   # DPO on nano3

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant