Customization Recipies by rkalaniNV · Pull Request #143 · NVIDIA-NeMo/Nemotron

rkalaniNV · 2026-04-13T06:01:17Z

WIP code placement review. DO NOT MERGE.

Integrate the Speaker (Sovereign AI Playbook) project as customization_recipes/ for language/domain model customization. What's added: - 6-stage customization pipeline (CPT, SFT, RL, BYOB, Eval, Quantization) - data_prep/ shared library (acquire, translate, SDG, quality, tokenize, BYOB, quantize) with lazy imports for optional deps - 12 SKILL.md agentic docs as primary agent interface - AGENTS.md repo capability map - 8 CLI commands under `nemotron customize` with shared _execute.py - Multi-container Docker deployment (orchestrator, curator, trainer, evaluator, NIM) with command dispatcher for single-entry-point UX - Airgap support (pre-download + deploy scripts + compose overlay) - Sovereign benchmark bridge (BYOB → NeMo Evaluator BYOB framework) - Lepton and Run:AI executor support in nemo_runspec/execution.py - Nemotron model family populated; Llama/Qwen stubs - [customize] optional dependency group (14 packages) - Tests for data_prep, configs, and SKILL.md integrity What's NOT changed: - Existing nano3/super3/embed recipes (zero modifications) - Core dependencies (customize deps are optional) - nemotron.kit, nemo_runspec (only execution.py extended) Verified: 8-point safety audit confirms no breakage to existing functionality. 3-round review cycle completed (all issues fixed). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove ~280 lines of reimplemented logic (tokenizer loading, file I/O, ShareGPT conversion, tokenization, packing, bin/idx building) and delegate to the existing production-grade nemotron.data_prep pipeline. tokenize_pack.py is now a thin adapter (272 lines, down from 549): - CPTConfig/SFTConfig dataclasses kept as OmegaConf interface - Added to_data_blend()/to_tokenizer_config() converters - prepare_cpt_data() delegates to run_pretrain_pipeline() - prepare_sft_data() delegates to run_sft_pipeline() - Thinking token support confirmed already in nemotron.data_prep (chat_template + nano3.jinja) — removed local reimplementation Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Remove "(WIP)" from customize CLI entry in AGENTS.md — it is fully implemented with 8 commands - Add data quality evaluation (--mode data) documentation to the E2E SKILL.md stage 4 walkthrough Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Every SKILL.md now tells agents to gather requirements from the user BEFORE executing any commands. Adds structured input tables with specific questions to ask, defaults, and conditional triggers. - E2E SKILL.md: Step 0 with 7 required + 12 optional inputs - Per-stage SKILL.md: "Inputs Required" tables (7-12 inputs each) - data_prep SKILL.md: Per-utility input sections (acquire, translate, SDG, quality, tokenize/pack) Agents will no longer execute the Hindi medical example verbatim — they will ask the user what language, domain, data, and compute environment to use first. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Major changes: - Training CLI commands (cpt, sft, rl) now reuse existing nano3/super3 training scripts instead of maintaining copies. The --model-family flag (-m) selects which script set to use (default: nano3). - Added MODEL_FAMILY_SCRIPTS mapping and resolve_training_script() to _execute.py for dynamic script resolution at execution time. - Deleted duplicate run_cpt.py, run_sft.py, run_rl.py (replaced with tombstones pointing to the reused scripts). - YAML configs rewritten to Megatron-Bridge format (matching actual nano3/super3 configs): recipe._target_, train.*, model.*, etc. - All SKILL.md config references aligned with Megatron-Bridge keys. Additional fixes: - SDGConfig: added optional domain/language fields for targeted generation - acquire.py: auto-downloads FastText lid.176.bin when lid_model_path=None - deploy_airgap.sh: fixed service names to match docker-compose.yaml - CLI override examples: use correct Megatron-Bridge key paths - RL examples: use -c default with training_type= override - Env vars: standardized on OPENAI_API_KEY across docs and compose Usage: nemotron customize cpt -c default # nano3 (default) nemotron customize sft -c default -m super3 # super3 nemotron customize rl -c default training_type=dpo # DPO on nano3 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

rkalaniNV and others added 6 commits April 8, 2026 12:22

Add Translation

5e3c3e2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Customization Recipies#143

Customization Recipies#143
rkalaniNV wants to merge 6 commits intoNVIDIA-NeMo:mainfrom
rkalaniNV:rkalani/customization-recipes-merger

rkalaniNV commented Apr 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

rkalaniNV commented Apr 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant